The First Step of NLP: Preprocessing

When you dive into the world of data science, you quickly hear the three magical letters NLP, which stand for Natural Language Processing, not Neuro-Linguistic Programming. When I was learning NLP, I never got the chance to do a full-on project with it, but I’m going to soon. This post is a quick look at some NLP preprocessing steps so I can refresh my knowledge and be prepared.


If computers can do quick calculations and understand numbers, why can’t they understand words? Language is hard: there are so many words, forms of words, and meanings that a computer can’t simply read a sentence and make full sense of it. Natural Language Processing is quite literally a computer processing the natural language of people, or at least attempting to. Even if you haven’t heard of NLP, you’ve definitely come across applications that use it, like chatbots, virtual assistants such as Siri and Alexa, and auto-correct. For the data scientists out there, NLP can also be used for sentiment analysis, speech recognition, market intelligence, and more. The list goes on and on.

Step 0: get your data! Whatever text data and documents you need should be stored somewhere you can access.

Next, just like in other modeling processes, we need to process our raw text and store it in a form the machine is ready to work with. Here are some methods of processing your data.

  • Tokenization: breaking a document down into individual words and making a list with each word as an element. You can also have elements that are combinations of words appearing next to each other. These are referred to as n-grams. For example, [‘he’, ‘walked’, ‘down’, ‘the’, ‘stairs’] can also yield elements like ‘he walked’ and ‘walked down’. Two-word combinations like these are bigrams, and three-word combinations are trigrams.
  • Stop-word removal: filtering out certain words before analysis so that words with more meaning are focused on. For example, ‘the’ may be the most frequent word in a sentence, but it doesn’t give us any insight into the sentence’s meaning. The words you remove can really be any words, and there is no single standard list. However, if you’re working in Python, libraries such as NLTK come with prebuilt stop-word lists.
  • Lexicon Normalization: in English, a word like “run” has many forms, such as “running” and “runs”. Normalizing maps those forms to a single base word, “run”. The two methods of normalization I’m familiar with are stemming and lemmatization. The main difference between the two is that lemmatization returns a real word, while stemming strips endings by rule and can produce only part of one. For example, depending on the stemmer, “moving” may be cut down to “mov”, while lemmatization would return “move”. Moreover, lemmatization looks up the root or dictionary form, so, given the right part of speech, “was” would be turned into “be” and “better” into “good”.
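
To make these three steps concrete for myself, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how a real library does it: the stop-word set, the LEMMAS lookup table, and the helper names (tokenize, ngrams, remove_stop_words, toy_stem, toy_lemmatize) are all made up for illustration. In particular, toy_stem just chops a few suffixes and is not a real stemming algorithm.

```python
import re

# Tiny illustrative stop-word list (an assumption; real projects usually
# start from a library-provided list and adjust it).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was"}

# Hand-written lemma lookup, for illustration only; a real lemmatizer
# relies on a full dictionary and part-of-speech information.
LEMMAS = {"was": "be", "better": "good", "moving": "move", "walked": "walk"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def toy_stem(token):
    """Crudely strip a common suffix; the result may not be a real word."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token):
    """Look the token up in the lemma table, falling back to the token."""
    return LEMMAS.get(token, token)


tokens = tokenize("He walked down the stairs")
print(tokens)                     # ['he', 'walked', 'down', 'the', 'stairs']
print(ngrams(tokens, 2))          # ['he walked', 'walked down', 'down the', 'the stairs']
print(remove_stop_words(tokens))  # ['he', 'walked', 'down', 'stairs']
print(toy_stem("moving"), toy_lemmatize("moving"))  # mov move
```

The last line shows the stemming versus lemmatization difference from the list above: the suffix chopper leaves “mov”, while the lookup returns the dictionary word “move”. For a real project I would expect to swap these toy helpers for a library’s tokenizer, stop-word list, stemmer, and lemmatizer.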

Overall, this was a good start to learning NLP. The three preprocessing techniques are common and not too difficult to implement. The next steps of an NLP project are feature engineering, modeling, and evaluation, so getting this first bit done is already a big step! This was a great refresher for me, and I will definitely be writing another one on the next steps!
