An introduction to the phases of an NLP pipeline
Natural Language Processing (NLP) is one of the fastest-growing fields in the world. It is a subfield of artificial intelligence that deals with interactions between humans and computers through natural language. The main challenges in NLP involve speech recognition, natural language understanding, and natural language generation. NLP is making its way into a number of products and services that we use every day. This article gives an overview of a common end-to-end NLP pipeline.
The common NLP pipeline consists of three stages:
- Text Processing
- Feature Extraction
- Modeling
Each stage transforms text in some way and produces an intermediate result that the next stage needs:
- Text Processing: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
- Feature Extraction: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
- Modeling: Design a model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.
This is a very simplified view of the NLP pipeline; depending on the application you are building, it might need additional steps.
A small example of some stages of the NLP pipeline can be found on my GitHub. (More examples coming soon.)
Text Processing
Text processing is the first stage of the NLP pipeline. It covers how text data extracted from different sources is prepared for the next stage, feature extraction.
- Cleaning: The first step in text processing is to clean the data, i.e., to remove irrelevant items such as HTML tags. This can be done in many ways, for example with regular expressions, the Beautiful Soup library, or CSS selectors.
- Normalization: The cleaned data is then normalized by converting all words to lowercase and removing punctuation and extra spaces.
- Tokenization: The normalized data is split into words, also known as tokens.
- Stop Word Removal: After the data is split into words, the most common words (a, an, the, etc.), also known as stop words, are removed.
- Part-of-Speech Tagging: The parts of speech are identified for the remaining words.
- Named Entity Recognition: The next step is to recognize the named entities in the data.
- Stemming and Lemmatization: Words are converted into their canonical / dictionary forms using stemming and lemmatization.
  * Stemming is a process in which a word is reduced to its stem/root form, e.g., the words running, runs, etc. can all be reduced to “run”.
  * Lemmatization is another technique used to reduce words to a normalized form. In this case, the transformation actually uses a dictionary to map different variants of a word to its root. With this approach, non-trivial inflections such as is, are, was, and were are mapped back to the root ‘be’.
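To make the difference between the two techniques concrete, here is a minimal sketch in plain Python. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this article: the stemmer is a crude suffix stripper (not the Porter algorithm), and the lemma dictionary contains only the four forms mentioned above, whereas a real lemmatizer relies on a full lexicon.

```python
# Toy stemmer: strips one suffix by rule. A crude illustration, not the Porter algorithm.
def toy_stem(word):
    for suffix in ("ing", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run".
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# Toy lemma dictionary (assumption: only the forms of "be" listed in the text).
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ["running", "runs", "run"]]
lemmas = [toy_lemmatize(w) for w in ["is", "are", "was", "were"]]
print(stems)            # ['run', 'run', 'run']
print(lemmas)           # ['be', 'be', 'be', 'be']
print(toy_stem("was"))  # 'was': rule-based stripping cannot reach "be"
```

The last line shows why a dictionary is needed for irregular forms: no suffix rule turns "was" into "be".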
After performing these steps, the text will look very different from the original data, but it captures the essence of what was being conveyed in a form that is easier to work with.
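As a rough illustration, the first four steps (cleaning, normalization, tokenization, and stop word removal) can be sketched with nothing but the Python standard library. The function names, the sample sentence, and the tiny stop word list are assumptions made for this example; part-of-speech tagging and named entity recognition are left out because they usually need a dedicated NLP library.

```python
import re
import string

# A tiny, hand-picked stop word list; real projects use a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def clean(raw_html):
    """Remove HTML tags with a simple regular expression."""
    return re.sub(r"<[^>]+>", " ", raw_html)

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse extra whitespace."""
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split normalized text into word tokens on whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

raw = "<p>The first stage of an NLP pipeline is <b>Text Processing</b>!</p>"
tokens = remove_stop_words(tokenize(normalize(clean(raw))))
print(tokens)
# ['first', 'stage', 'nlp', 'pipeline', 'text', 'processing']
```

A regular expression is enough for this small snippet, but for real web pages an HTML parser is usually the more robust choice.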
Feature Extraction
Now that the text is normalized, can it be fed into a statistical or machine learning model? Not exactly. Here’s why:
Text data is represented on modern computers using an encoding such as ASCII or Unicode that maps every character to a number. Computers store and transmit these values as binary, zeros and ones. These numbers have an implicit ordering, but individual characters carry very little meaning on their own, so using the numbers directly can mislead NLP algorithms.
It is words that should be considered, but computers don’t have a standard representation for words. Internally, words are just sequences of ASCII or Unicode values, which don’t capture the relationships between words.
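A quick way to see the problem is to print the numbers behind a few words. This is only a small sketch; the helper name `code_points` and the example words are chosen for illustration.

```python
def code_points(word):
    """Return the Unicode code point of each character in the word."""
    return [ord(ch) for ch in word]

for word in ["cat", "car", "kitten"]:
    print(word, code_points(word))

# "cat" -> [99, 97, 116] and "car" -> [99, 97, 114] are numerically almost
# identical, yet "cat" is much closer in meaning to "kitten" than to "car".
```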
Compare text with the computer representation of image data. Images are made of pixels, where each pixel holds the relative intensity of light at that spot in the image. Two pixels with similar values are perceptually similar. So how can a similar approach be developed for text data, so that text can be used as features for modeling?
This depends on the model being used and on the task. For graph-based models, words can be represented as symbolic nodes with relationships between them, e.g. WordNet. For statistical models, a numerical representation is required. Again, the choice depends on the task.
For document-level tasks such as sentiment analysis, per-document representations such as bag-of-words (BOW) or doc2vec can be used. For tasks involving individual words or phrases, such as text generation or machine translation, a word-level representation such as word2vec or GloVe can be used.
There are many ways to represent textual information, and only through practice can one learn which technique suits which task.
Bag of words (BOW) model
A bag of words model treats each document as an unordered collection, or bag, of words. Here, the word document refers to a unit of text that is being analyzed. For example, when performing sentiment analysis on tweets, each tweet is considered a document.
After text processing is applied to a document, the resulting tokens are treated as an unordered collection or set. Each document produces its own set of words. But keeping these as separate sets is very inefficient: they are of different sizes, contain different words, and are hard to compare.
A more useful approach is to turn each document into a vector of numbers representing how many times each word occurs in the document. A set of documents is known as a corpus, and the corpus gives the context in which the vectors are calculated:
- First, collect all the unique words in the corpus to form a vocabulary
- Arrange these words in some order to form the vector element positions, or columns of a table, where each row corresponds to a document
- Count the occurrences of each word in each document and enter the value in the respective column. The result is called a document-term matrix: documents in rows, terms in columns, and each element interpreted as a term frequency
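The three steps above can be sketched in a few lines of plain Python. The three-document corpus is made up for this example, and the documents are assumed to have been through text processing already.

```python
from collections import Counter

# Hypothetical corpus: each string is one already-processed document.
corpus = [
    "little house on the prairie",
    "mary had a little lamb",
    "the silence of the lambs",
]
tokenized = [doc.split() for doc in corpus]

# Steps 1 and 2: collect the unique words and sort them, so that every
# word gets a fixed column position.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Step 3: count each vocabulary word per document; one row per document.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

doc_term_matrix = [to_vector(doc) for doc in tokenized]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so all documents end up with vectors of the same length regardless of how long the original text was.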
This representation can be used in several tasks. One possible task is to compare documents based on term frequencies. This can be done by calculating the dot product or the cosine similarity between two row vectors.
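Here is a minimal sketch of both measures. The three row vectors and their three-word vocabulary are invented for illustration; returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical term-frequency rows over the vocabulary ["cat", "dog", "fish"].
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same proportions as doc_a, but twice as long
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(dot(doc_a, doc_b))                # 10
print(cosine_similarity(doc_a, doc_b))  # close to 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

The dot product grows with document length, while cosine similarity only looks at the direction of the vectors, which is why `doc_a` and `doc_b` score as nearly identical.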
Term Frequency — Inverse Document Frequency (TF-IDF)
One limitation of the bag of words approach is that it treats every word as being equally important, whereas some words occur very frequently across a corpus. Consider a corpus of financial documents, for example: “cost” or “price” is a very common term.
This limitation can be compensated for by counting the number of documents in which each word occurs, known as its document frequency, and then dividing the term frequency by the document frequency of that term.
This gives a metric that is proportional to the frequency of a term in a document, but inversely proportional to the number of documents it appears in. It highlights the words that are more unique to a document and are therefore better for characterizing it.
This approach is called Term Frequency — Inverse Document Frequency (TF-IDF).
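The sketch below follows the description above literally (term frequency divided by document frequency) and also shows one commonly used variant that multiplies the term frequency by log(N / df), where N is the number of documents. The corpus and the function names `tf_over_df` and `tf_idf_log` are assumptions made for this example; libraries differ in the exact formula they use.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized corpus of three short documents.
documents = [
    ["price", "of", "gold", "rises"],
    ["price", "of", "oil", "falls"],
    ["price", "of", "bread", "rises"],
]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in documents for term in set(doc))

def tf_over_df(doc):
    """Term frequency divided by document frequency, as described above."""
    tf = Counter(doc)
    return {term: tf[term] / doc_freq[term] for term in tf}

def tf_idf_log(doc):
    """A common variant: tf * log(N / df), with N the number of documents."""
    tf = Counter(doc)
    n_docs = len(documents)
    return {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}

print(tf_over_df(documents[0]))
print(tf_idf_log(documents[0]))
```

In both versions "gold", which appears in only one document, scores higher than "price", which appears in all of them.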
More about the calculation of TF-IDF can be learned here.
Another way to represent words is one-hot encoding. It is just like bag of words, except that each bag holds a single word and a vector is built for it.
One-hot encoding doesn’t work in every situation. It breaks down when there is a large vocabulary to deal with, because the size of the word representation grows with the number of words. What is needed is a word representation that is limited to a fixed-size vector.
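A minimal sketch of one-hot encoding, using a made-up four-word vocabulary:

```python
vocabulary = ["cat", "dog", "fish", "bird"]  # hypothetical vocabulary
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(one_hot("dog"))  # [0, 1, 0, 0]

# Every vector is as long as the vocabulary, so a 100,000-word vocabulary
# needs 100,000-dimensional vectors. Any two different words also have a
# dot product of 0, so the encoding says nothing about how similar they are.
overlap = sum(a * b for a, b in zip(one_hot("cat"), one_hot("dog")))
print(overlap)  # 0
```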
In other words, the goal is to find an embedding for each word in a vector space that exhibits some desired properties: if two words are similar in meaning, they should be closer to each other than words that are not. And if two pairs of words have a similar difference in meaning, they should be approximately equally separated in the embedding space.
This representation can be used for various purposes, such as finding analogies, synonyms, and antonyms, or classifying words as positive, negative, or neutral.
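The two properties can be illustrated with a toy example. The 2-D vectors below are hand-made for this article and are not learned embeddings; real embeddings such as word2vec or GloVe are learned from large corpora and have many more dimensions. The helper names `distance` and `analogy` are likewise just for illustration.

```python
import math

# Hand-made 2-D vectors for illustration only; NOT learned embeddings.
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
embeddings = {
    "king": [3.0, 3.0],
    "queen": [3.0, 1.0],
    "man": [1.0, 3.0],
    "woman": [1.0, 1.0],
    "apple": [-2.0, 2.0],
}

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def analogy(a, b, c):
    """Solve "a is to b as c is to ?" with vector arithmetic: b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return min(candidates, key=lambda w: distance(embeddings[w], target))

# Pairs with a similar difference in meaning are equally separated.
print(distance(embeddings["king"], embeddings["queen"]))  # 2.0
print(distance(embeddings["man"], embeddings["woman"]))   # 2.0

# man is to king as woman is to ...
print(analogy("man", "king", "woman"))  # queen
```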
Modeling
The final stage of the NLP pipeline is modeling, which includes designing a statistical or machine learning model, fitting its parameters to training data using an optimization procedure, and then using it to make predictions about unseen data.
The nice thing about working with numerical features is that you can choose from a wide range of machine learning models, or even a combination of them.
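As one possible illustration, the sketch below fits a perceptron to bag-of-words features and then predicts the label of unseen text. The perceptron is just one simple choice of model made for this example, and the four training sentences and their labels are invented; any other classifier could take its place.

```python
from collections import Counter

# Hypothetical labeled training data: 1 = positive, 0 = negative sentiment.
train = [
    ("good great film", 1),
    ("great acting good story", 1),
    ("bad boring film", 0),
    ("boring story bad acting", 0),
]

vocabulary = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    """Bag-of-words counts over the training vocabulary; unknown words are ignored."""
    counts = Counter(text.split())
    return [counts[w] for w in vocabulary]

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def fit(examples, epochs=10):
    """Perceptron learning rule: adjust the weights after every mistake."""
    weights, bias = [0.0] * len(vocabulary), 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = featurize(text)
            error = label - predict(weights, bias, features)
            if error != 0:
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
    return weights, bias

weights, bias = fit(train)
print(predict(weights, bias, featurize("good story")))   # 1
print(predict(weights, bias, featurize("boring film")))  # 0
```

With so little data the model only learns which individual words lean positive or negative, but the fit-then-predict shape is the same for larger models.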
Once you have a working model, you can deploy it as a web app, mobile app, or integrate it with other products and services. The possibilities are endless!
In this article, we have seen a brief introduction to the NLP pipeline and an overview of each stage. A small example of some stages of the NLP pipeline can be found on my GitHub; have a look at it to understand some phases of the pipeline. (More examples coming soon.)
I hope you gained some knowledge from reading this article. Please remember that it is just an overview and reflects my understanding of the NLP pipeline, based on what I have read in various online sources.