Vietnamese: n-grams

Most NLP algorithms have treated the frequency of individual words as the fundamental unit of analysis, which I think is not accurate because languages evolve and meaning often lives in combinations of words.

While words are the building blocks of language, counting them is often the only means of analysis we rely on. The problem with this approach is that my chat-bot misidentifies meaningful phrases, or multi-word expressions, in natural language. Partitioning a text into words is straightforward, but partitioning it into meaningful phrases would require human involvement. I tried n-grams, which are now a common and fast approach for parsing text. Vietnamese, however, creates different issues altogether.
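To make the idea concrete, here is a minimal sketch of n-gram extraction and counting. The sample sentence and the helper name `ngrams` are my own assumptions for illustration, not part of any particular library. Note that splitting Vietnamese on whitespace yields syllables rather than words, which is one reason n-grams are attractive here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence; whitespace splitting yields syllables.
text = "tôi học tiếng việt và tôi dạy tiếng việt"
tokens = text.split()

# Count how often each pair of adjacent syllables occurs.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Pairs that recur, such as ("tiếng", "việt") in this sample, are candidates for meaningful multi-syllable expressions. A real system would need a much larger corpus before the counts mean anything.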

Written Vietnamese is complex: it combines a Latin-based alphabet with many accents (including acute, grave, hook, tilde, and dot-below). These two components cannot be separated, because removing the accents can change or obscure the meaning of a word. However, many Vietnamese speakers choose to type accentless Vietnamese because it is easier and quicker, which creates a problem for the chat-bot.
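The ambiguity can be illustrated with a small sketch that strips accents the way accentless typing does. The function name `strip_accents` is an assumption of mine; the approach relies on Python's standard `unicodedata` module, with a manual mapping for đ/Đ because that letter has no canonical Unicode decomposition.

```python
import unicodedata

def strip_accents(text):
    """Remove Vietnamese diacritics, mimicking accentless typing."""
    # đ/Đ do not decompose under NFD, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn) and keep the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Six distinct syllables that differ only in their accents.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
print({strip_accents(w) for w in words})
```

All six syllables collapse to the same accentless string, so a chat-bot that receives "ma" has to guess which one the user meant from context.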

Luan-Nghia Pham et al. propose combining an n-gram method with a phrase dictionary. Their method treats accent prediction as a statistical machine translation (SMT) problem, with accentless text as the source language and accented text as the target language.
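The sketch below is not the SMT system described in that work. It is a much simpler toy of my own, shown only to convey how n-gram statistics and a dictionary can work together to restore accents. The corpus, the `candidates` dictionary, and the function `restore_accents` are all hypothetical, and scoring by summed bigram counts is a simplification of a proper probabilistic model.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data; a real system would learn these from a large corpus.
corpus = ["tôi học tiếng việt", "tôi là sinh viên", "tôi dạy tiếng việt"]
candidates = {  # accentless syllable -> possible accented forms
    "toi": ["tôi", "tối", "tội"],
    "hoc": ["học", "hóc"],
    "tieng": ["tiếng"],
    "viet": ["việt", "viết"],
}

# Count adjacent syllable pairs in the accented corpus.
bigrams = Counter()
for sentence in corpus:
    syllables = sentence.split()
    bigrams.update(zip(syllables, syllables[1:]))

def restore_accents(text):
    """Pick the candidate sequence with the highest total bigram count."""
    # Unknown syllables are passed through unchanged.
    options = [candidates.get(s, [s]) for s in text.split()]
    best = max(
        product(*options),
        key=lambda seq: sum(bigrams[pair] for pair in zip(seq, seq[1:])),
    )
    return " ".join(best)

print(restore_accents("toi hoc tieng viet"))
```

Enumerating every combination is only workable for short inputs with few candidates; longer sentences would call for dynamic programming such as Viterbi decoding.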

