Tokenization in NLP


As we all know, NLP, or Natural Language Processing, has become a very active field of research. It is the field in which we teach computers how to process and understand natural language.

To make dumb machines like computers understand something as complex as a language, we need to give them information in small portions and in a well-structured form, just like how a human learns.

We need to break the text down in such a way that our machine can understand it. The process that ensures this is called “Tokenization” in NLP.

It is similar to preprocessing data before feeding it to our machine learning models. There are various ways to preprocess text data (a short pipeline sketch follows this list):

  1. Tokenization
  2. Stop Word Removal
  3. Stemming
  4. Lemmatization
  5. And more …
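As a rough illustration, here is a minimal sketch of such a pipeline using NLTK. It assumes NLTK and the listed data packages are installed; exact outputs may vary slightly between NLTK versions.

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download('punkt')      # tokenizer models
    nltk.download('stopwords')  # stop word lists
    nltk.download('wordnet')    # lemmatizer dictionary

    text = "The cats were running faster than the dogs."

    # 1. Tokenization: split the raw string into word tokens
    tokens = word_tokenize(text)

    # 2. Stop word removal: drop very common words that carry little meaning
    stop_words = set(stopwords.words('english'))
    filtered = [t for t in tokens if t.lower() not in stop_words]

    # 3. Stemming: crudely chop each word down to a root form
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in filtered]

    # 4. Lemmatization: reduce each word to its dictionary form
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in filtered]

    print(tokens)    # ['The', 'cats', 'were', 'running', 'faster', 'than', 'the', 'dogs', '.']
    print(filtered)  # ['cats', 'running', 'faster', 'dogs', '.']
    print(stems)     # ['cat', 'run', 'faster', 'dog', '.']
    print(lemmas)    # ['cat', 'running', 'faster', 'dog', '.']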



What is Tokenization?


The first thing we need to do in any NLP project is text preprocessing. It means putting the data into a predictable and analyzable form.

One of the most important preprocessing steps is tokenization.

In this process, we break the stream of text into smaller chunks of textual data, such as words, symbols, sentences, or terms. These meaningful elements are called tokens. Tokens can be thought of as words in a sentence, or sentences in a paragraph.

This helps us transform unstructured textual information into an organized data structure that is suitable for machine learning.

Example:

Sentence: Keep progressing.

Word tokenization: [‘Keep’, ‘progressing’]

Word: bigger

Character tokens: [‘b’, ‘i’, ‘g’, ‘g’, ‘e’, ‘r’]
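In plain Python, without any NLP library, the two examples above could be reproduced with something like this (a naive split on whitespace and punctuation stands in for a real word tokenizer):

    sentence = "Keep progressing."

    # Naive word tokenization: split on whitespace, then strip trailing punctuation
    word_tokens = [w.strip('.,!?') for w in sentence.split()]
    print(word_tokens)   # ['Keep', 'progressing']

    # Character tokenization: every character of the word becomes a token
    char_tokens = list("bigger")
    print(char_tokens)   # ['b', 'i', 'g', 'g', 'e', 'r']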

Similarly, there are several common ways of tokenizing text (each is sketched in the code just after this list):

  1. Splitting text into sentences (sentence tokenization)
  2. Splitting sentences into words (word tokenization)
  3. Tokenizing with regular expressions
  4. Splitting on spaces (whitespace tokenization), etc.
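Here is a brief sketch of these four approaches, again using NLTK (same setup assumptions as the pipeline sketch above):

    from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer, WhitespaceTokenizer

    text = "Keep progressing. Tokenization isn't hard."

    # 1. Text into sentences
    print(sent_tokenize(text))
    # ['Keep progressing.', "Tokenization isn't hard."]

    # 2. Sentences into words
    print(word_tokenize("Keep progressing."))
    # ['Keep', 'progressing', '.']

    # 3. Regular-expression tokenization: keep only alphabetic sequences
    regex_tok = RegexpTokenizer(r'[A-Za-z]+')
    print(regex_tok.tokenize(text))
    # ['Keep', 'progressing', 'Tokenization', 'isn', 't', 'hard']

    # 4. Space (whitespace) tokenization: split on whitespace only
    print(WhitespaceTokenizer().tokenize(text))
    # ['Keep', 'progressing.', 'Tokenization', "isn't", 'hard.']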



Challenges in Tokenization


The biggest challenge in this process is finding the boundaries of words. Think about how different languages behave: in English, we use full stops, commas, and spaces to mark the boundaries of words and sentences, but in languages such as Chinese and Japanese, words are not separated by spaces at all, and in Korean the spacing does not always line up with word boundaries. This makes it much harder to work out where one word ends and the next begins.

Even in English, symbols such as £, $, and € followed by numerals are used to represent money, and there are many scientific symbols such as µ and α, all of which create challenges in tokenization. Short forms such as I’m (I am) and couldn’t (could not) also need to be resolved, or else they cause trouble later in the natural language processing pipeline.
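To see how a standard tokenizer deals with these cases, here is a small example using NLTK’s word_tokenize; the outputs shown are typical of its Penn Treebank-style behaviour, though details can differ between tokenizers:

    from nltk.tokenize import word_tokenize

    # Contractions are usually split into separate tokens, which later steps
    # can expand or normalize (e.g. "n't" -> "not").
    print(word_tokenize("I'm sure she couldn't come."))
    # ['I', "'m", 'sure', 'she', 'could', "n't", 'come', '.']

    # Currency symbols are typically separated from the number as well.
    print(word_tokenize("It costs $5."))
    # ['It', 'costs', '$', '5', '.']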

Don’t forget that NLP is a field in which a lot of research is still going on.


