What is so different about Chinese NLP?


First of all, note that this short article discusses Natural Language Processing for Simplified Chinese, the official and most widely used writing system in Mainland China, which is also used in Malaysia and Singapore. Many of the techniques mentioned here also apply to Traditional Chinese.

Roughly 1.5 billion Chinese speakers write or type in Simplified Chinese every day. According to a survey conducted in 2019, 37% of the active users worldwide on Steam (the most popular gaming platform) had their language set to Simplified Chinese.

Some of you might be inclined to use machine translation tools such as Google Translate or DeepL as a quick solution. These may serve as rough translators between two European languages, but they perform very poorly on Chinese. This is dangerous because wrong interpretations often lead to wrong decisions.

 


Let’s take a look at some of the linguistic and technical aspects of Natural Language Processing for Simplified Chinese.

 

  1. Punctuation

Linguistic explanation:

The Chinese language has its own set of punctuation marks. Unlike in European languages, punctuation marks in Chinese are usually full-width, so each one occupies the same space as a character (for example ，。？！).

Technical NLP:

If we need sentence tokenization, keep in mind that the delimiters must be configured to these punctuation marks (for example 。, ！ and ？) rather than their Western counterparts.
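As a rough illustration of that point, here is a minimal sketch of a sentence splitter that uses Chinese full-width marks as delimiters. The function name `split_sentences` and the set of delimiters in `SENTENCE_END` are assumptions made for this example, and the set is not exhaustive. A real project may need to handle quotation marks, ellipses and mixed Western punctuation as well.

```python
import re

# Full-width sentence-ending marks (an illustrative, non-exhaustive set).
SENTENCE_END = "。！？"


def split_sentences(text):
    """Split Chinese text into sentences, keeping the ending marks attached."""
    # One sentence = a run of non-delimiter characters followed by
    # zero or more sentence-ending marks.
    pattern = f"[^{SENTENCE_END}]+[{SENTENCE_END}]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今天天气很好。你吃饭了吗？我们走吧！"):
        print(sentence)
```

A splitter configured only for ".", "!" and "?" would return the whole string above as a single sentence, which is why the delimiters matter.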

 

  2. Space

Linguistic explanation:

There are no spaces between words in a Chinese sentence. A word can be a single character or, much more often, several characters. In addition, grammatical relations are indicated by word order.

Technical NLP:

Simple whitespace-based word tokenization is not applicable here. We need a word segmenter, typically backed by a glossary (dictionary) of known words, that also takes grammatical relations into consideration.
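To make the glossary idea concrete, below is a minimal sketch of forward maximum matching, one of the simplest dictionary-based segmentation strategies. It is not necessarily what a production system does. The tiny vocabulary `VOCAB`, the function name `forward_max_match` and the `max_len` parameter are all hypothetical and chosen for illustration only. The sketch also ignores grammatical context entirely. In practice, an established segmentation library such as jieba is usually a better starting point.

```python
# A tiny, hypothetical glossary; a real one would hold many thousands of words.
VOCAB = {"我", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}


def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation that prefers the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the glossary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    print(forward_max_match("我喜欢自然语言处理", VOCAB, max_len=4))
    print(forward_max_match("我喜欢自然语言处理", VOCAB, max_len=6))
```

With `max_len=4` the phrase is cut into 自然语言 and 处理, while `max_len=6` keeps 自然语言处理 as a single word. The glossary and the matching rule both shape the result, which is why purely greedy matching is rarely enough on its own.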

 

  3. Letter case

Linguistic explanation:

Lowercase, uppercase and capitalization do not exist in Chinese.

Technical NLP:

No case normalization (such as lowercasing) is needed during data preparation.

 

  4. Tense & gender

Linguistic explanation:

Conjugation does not exist in Chinese. Once a timeframe is indicated in a sentence, the form of the verb does not change.

Chinese is also a gender-neutral language. The only circumstance where we need to specify gender is when using a third-person singular pronoun, like him or her in English.

Technical NLP:

We do not need to stem or lemmatize words in Chinese.

 

However, other common NLP techniques are still relevant to Chinese, such as stop-word removal, vectorization, part-of-speech tagging, and named entity recognition.
