Word segmentation

Word segmentation, or tokenization, is the division of a string of written language into its component words. Some writing systems, however, do not delimit words, which makes word segmentation difficult for the languages that use them:

  • in Chinese and Japanese, sentences but not words are delimited;
  • in Thai and Lao, phrases and sentences but not words are delimited;
  • in Vietnamese, syllables but not words are delimited.
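The Vietnamese case can be seen with naive whitespace tokenization, which recovers words in English but only syllables in Vietnamese. Below is a minimal sketch in Python; the Vietnamese sentence and its English gloss are illustrative assumptions, not drawn from this article:

```python
# Toy illustration: splitting on whitespace yields words for English
# but only syllables for Vietnamese.

english = "students study biology"
vietnamese = "học sinh học sinh học"  # commonly glossed as "students study biology"

print(english.split())
# ['students', 'study', 'biology']  -- three tokens, three words

print(vietnamese.split())
# ['học', 'sinh', 'học', 'sinh', 'học']  -- five syllables, but the intended
# words are 'học sinh' (student), 'học' (study), 'sinh học' (biology)
```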

There are two types of ambiguity in Vietnamese word segmentation. The first, called overlap ambiguity, arises when a syllable can combine with either its left or its right neighbor to form a word, so that adjacent syllables admit different segmentations whose validity cannot be determined without considering the whole sentence. The second, called combination ambiguity, arises when two adjacent syllables are valid words both on their own and when combined into a single word. Both cases are illustrated by the sketch below.
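One way to make both ambiguities visible is to exhaustively enumerate every dictionary-valid segmentation of a syllable sequence. The following is a minimal sketch, assuming a toy four-entry lexicon rather than a real Vietnamese dictionary:

```python
# Exhaustive dictionary-based segmentation over a toy lexicon.
# The lexicon and example sentence are illustrative assumptions.

def segmentations(syllables, lexicon, max_len=3):
    """Yield every way to split `syllables` into words found in `lexicon`."""
    if not syllables:
        yield []
        return
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                yield [word] + rest

# Toy lexicon: each syllable is also a valid one-syllable word.
lexicon = {"học", "sinh", "học sinh", "sinh học"}

for seg in segmentations("học sinh học".split(), lexicon):
    print(" | ".join(seg))
# học | sinh | học      -- combination: 'học sinh' split into separate words
# học | sinh học        -- overlap: 'sinh' attaches to the right
# học sinh | học        -- overlap: 'sinh' attaches to the left
```

The middle syllable "sinh" can attach to either neighbor (overlap ambiguity), and the pair "học sinh" is valid both as one word and as two (combination ambiguity), so only sentence-level context can pick the correct segmentation.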