Vietnamese: n-grams

Most NLP algorithms have treated the frequency of individual words as the fundamental unit of analysis. I think this is not accurate, because languages evolve and meaning often spans more than a single word.

While words are the building blocks of language, counting them is still the primary means of analysis. The problem with this approach is that my chat-bot misidentifies meaningful phrases, or multi-word expressions, in natural language. Partitioning a text into words is straightforward, but partitioning it into meaningful phrases would require human involvement. I tried to use n-grams (contiguous sequences of n words), which are now a common and fast approach for parsing a text. Vietnamese, however, creates different issues altogether.
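To make the idea concrete, here is a minimal sketch of word-level n-gram counting in Python. It assumes whitespace tokenization, which is a simplification (Vietnamese words often span several space-separated syllables), and the function name `word_ngrams` and the sample sentence are made up for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams in `text` as a list of tuples."""
    # Assumption: tokens are separated by whitespace.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence: "I go to school, I go to work".
sample = "tôi đi học tôi đi làm"
bigram_counts = Counter(word_ngrams(sample, 2))

# The pair ("tôi", "đi") occurs twice, every other bigram once.
print(bigram_counts.most_common(1))
```

Counting pairs instead of single words is what lets a frequent phrase such as "tôi đi" stand out, even though the counting itself knows nothing about meaning.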

The official Vietnamese language is complex: it is written in the Latin alphabet with many accents (including acute, grave, hook, tilde, and dot-below). The letters and their accents are two components of Vietnamese that cannot be separated. However, many Vietnamese speakers choose to type accentless Vietnamese because it is easier and quicker, which creates a problem for the chat-bot, since one accentless string can stand for several different words.
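The ambiguity can be shown with a short Python sketch that removes accents using the standard `unicodedata` module. The helper name `strip_accents` is my own. The letter đ is handled by hand because it has no Unicode decomposition into "d" plus a combining mark.

```python
import unicodedata


def strip_accents(text):
    """Return `text` with Vietnamese accents removed."""
    # đ/Đ are standalone letters, so NFD normalization leaves them untouched.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Tiếng Việt"))  # Tieng Viet

# Six different words all collapse to the same accentless string "ma".
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
print({strip_accents(w) for w in words})  # {'ma'}
```

Going in this direction is easy and lossless to implement. The reverse direction, which the chat-bot needs, is the hard part, because the information that was dropped has to be guessed from context.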

Luan-Nghia Pham et al. propose a combination of an n-gram method and a phrase dictionary. Their method treats accent prediction as a statistical machine translation (SMT) problem, with accentless text as the source language and accented text as the target language.
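The sketch below is not the SMT system from that paper. It is a much simpler stand-in that illustrates only the n-gram half of the idea: each accentless token is mapped to the accented candidates seen in a training corpus, and a bigram model with add-one smoothing picks the most likely sequence using a Viterbi search. The three-sentence corpus and the names `train` and `restore` are assumptions made for illustration.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def strip_accents(text):
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(corpus):
    """Collect accented candidates plus unigram and bigram counts."""
    candidates = defaultdict(set)
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        for tok in tokens:
            candidates[strip_accents(tok)].add(tok)
    return candidates, unigrams, bigrams


def restore(text, candidates, unigrams, bigrams):
    """Pick the accented sequence with the highest bigram score."""
    tokens = strip_accents(text.lower()).split()
    if not tokens:
        return ""
    vocab = len(unigrams) + 1

    def options(tok):
        # Unknown tokens are passed through unchanged.
        return sorted(candidates.get(tok, {tok}))

    def log_prob(prev, word):
        # Add-one smoothed bigram probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

    # best[word] = (score, path) for the best path ending in `word`.
    # The first token is scored by its unnormalized unigram count.
    best = {w: (math.log(unigrams[w] + 1), [w]) for w in options(tokens[0])}
    for tok in tokens[1:]:
        step = {}
        for w in options(tok):
            score, path = max(
                ((s + log_prob(p, w), prev_path) for p, (s, prev_path) in best.items()),
                key=lambda pair: pair[0],
            )
            step[w] = (score, path + [w])
        best = step
    return " ".join(max(best.values(), key=lambda pair: pair[0])[1])


# Toy corpus (assumption): "ban" can be bàn (table), bạn (friend) or bán (sell).
corpus = [
    "tôi muốn mua một cái bàn",
    "tôi có một người bạn",
    "bạn tôi bán nhà",
]
model = train(corpus)
print(restore("toi co mot nguoi ban", *model))      # tôi có một người bạn
print(restore("toi muon mua mot cai ban", *model))  # tôi muốn mua một cái bàn
```

The same accentless token "ban" is resolved differently in the two sentences because the preceding word differs. A real system would need a far larger corpus, and the phrase dictionary mentioned above would cover multi-word expressions that a bigram window cannot capture.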