Getting Started with Natural Language Processing (Part 1)

Arun Mohan
8 min read · Dec 22, 2018

Natural Language Processing (NLP) is a subfield of artificial intelligence that helps in processing and analyzing natural language such as text and speech. In this article I will try to explain various techniques used in NLP.

We can divide the whole NLP pipeline into three parts:

  1. Text preprocessing
  2. Feature engineering
  3. Model building and evaluation

In this article I will explain the first two parts. Later on you can build your model based on them.

Text Preprocessing

Text is mostly in unstructured form, and a lot of noise is usually present in it. In text preprocessing we remove this noise. It is not possible to analyse the data properly without preprocessing it.

Steps in text cleaning:

  1. Noise entity removal
  2. Text Normalization
  3. Word standardization

1. Noise entity removal:

In this step we remove HTML tags, stop words, punctuation, whitespace and so on. A sentence contains many extra words, for example: to, is, and. These words don't add any meaning to the sentence; they are called stop words, so we can remove them. In the following code we will remove stop words, and we will also remove punctuation, numbers and unnecessary whitespace. We use the nltk library to remove stop words and the re library to remove punctuation, numbers and whitespace.

Before noise entity removal I will explain tokenization, which is an important aspect of NLP.

  • Tokenization: Tokenization is the process of converting text into tokens. Tokens can be sentences or words, so we have sentence tokenization and word tokenization. I will demonstrate both using Python.

import nltk
nltk.download('punkt')

dataset = """I have to  thank everyone from the very onset of my career … To my parents; none of this would be possible without you. And to my friends, I love you dearly; you know who you are. Making The Revenant was aboutman's relationship to the natural world. A world that we collectively felt in 2015 as the hottest year in recorded history#."""

sentences = nltk.sent_tokenize(dataset)
words = nltk.word_tokenize(dataset)
print('sentences: {}'.format(sentences))
print('************************')
print('words: {}'.format(words))

Output:

sentences: ['I have to  thank everyone from the very onset of my career … To my parents; none of this would be possible without you.', 'And to my friends, I love you dearly; you know who you are.', "Making The Revenant was aboutman's relationship to the natural world.", 'A world that we collectively felt in 2015 as the hottest year in recorded history#.']
************************
words: ['I', 'have', 'to', 'thank', 'everyone', 'from', 'the', 'very', 'onset', 'of', 'my', 'career', '…', 'To', 'my', 'parents', ';', 'none', 'of', 'this', 'would', 'be', 'possible', 'without', 'you', '.', 'And', 'to', 'my', 'friends', ',', 'I', 'love', 'you', 'dearly', ';', 'you', 'know', 'who', 'you', 'are', '.', 'Making', 'The', 'Revenant', 'was', 'aboutman', "'s", 'relationship', 'to', 'the', 'natural', 'world', '.', 'A', 'world', 'that', 'we', 'collectively', 'felt', 'in', '2015', 'as', 'the', 'hottest', 'year', 'in', 'recorded', 'history', '#', '.']

In the following code we convert the sentences to lowercase and remove punctuation, stop words and extra spaces, thus removing the noise present.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

dataset = nltk.sent_tokenize(dataset)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W', ' ', dataset[i])   # remove non-word characters like #, *, % etc.
    dataset[i] = re.sub(r'\d', ' ', dataset[i])   # remove digits
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # remove extra spaces
    words = nltk.word_tokenize(dataset[i])
    new = []
    for word in words:
        if word not in stopwords.words('english'):
            new.append(word)
    dataset[i] = ' '.join(new)
print(dataset)

Output:

['thank everyone onset career parents none would possible without',
'friends love dearly know',
'making revenant aboutman relationship natural world',
'world collectively felt hottest year recorded history']

2. Text Normalization:

Another type of noise is the repetition of variants of a single word. For example, run, running and runs are different variations of the term run. Normalization helps reduce such words to a single base form, and thus helps in reducing the dimensionality of the data. There are mainly two normalization techniques.

  • Stemming:

Stemming is a rule-based approach which strips suffixes (ing, ly, s, etc.). For example, 'running' is stemmed to 'run' and 'flies' to 'fli'.

There are many algorithms that perform stemming, but the most common one for English is the Porter stemmer. I will illustrate it with an example in Python.

# Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# keep the stemmed sentences in a separate list so that `dataset`
# stays unchanged for the lemmatization example below
stemmed = []
for sentence in dataset:
    words = nltk.word_tokenize(sentence)
    words = [stemmer.stem(word) for word in words]
    stemmed.append(' '.join(words))
print(stemmed)

output:

['thank everyon onset career parent none would possibl without', 'friend love dearli know', 'make reven aboutman relationship natur world', 'world collect felt hottest year record histori']

We can see that words like everyone changed to everyon, possible to possibl, history to histori and so on. This is how the Porter stemmer works: the output is a stem, which need not be a dictionary word.

  • Lemmatization:

Like stemming, lemmatization also converts a word to its root form. It is a step-by-step procedure of obtaining the root form of a word that makes use of a vocabulary (dictionary importance of words) and morphological analysis (word structure and grammatical relations). The key to this method is linguistics. In lemmatization the root word is called the lemma. The output of a stemmer can be meaningless, but the output of a lemmatizer is always a meaningful word: for example, 'running' is lemmatized to 'run' and 'feet' to 'foot'.

We will apply lemmatization to the same dataset we used before using python.

from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
lem = WordNetLemmatizer()
for i in range(len(dataset)):
    words = nltk.word_tokenize(dataset[i])
    words = [lem.lemmatize(word, pos='v') for word in words]
    dataset[i] = ' '.join(words)
print(dataset)

output:

['thank everyone onset career parent none would possible without', 'friends love dearly know', 'make revenant aboutman relationship natural world', 'world collectively felt hottest year record history']

3. Word standardization:

Text may contain words that are not in the dictionary. For example, tweets or comments can contain words like 're' representing are, 's' for is, 'awsm' for awesome and so on. Such words will not be recognized by our model, so we have to fix them. I will demonstrate this using a tweet; we will create a lookup table for it.

tweet = 'its an awsm day and my friends re superb'
lookup = {'s': 'is', 're': 'are', 'awsm': 'awesome', 'superb': 'super'}
data = []
for word in tweet.split():
    if word in lookup.keys():
        word = lookup[word]
    data.append(word)
tweet = ' '.join(data)
print(tweet)

output:

its an awesome day and my friends are super

Feature Engineering

To analyse the preprocessed data, it needs to be converted into features. In order to convert it into features we use different techniques such as the bag of words model, the n-gram model, TF-IDF, word2vec and so on. We will go through some of these methods.

Bag of words model:

We are working with a large set of data, and we can't feed it to our model directly: computers only understand numbers. So we have to convert the text into vectors, on which we can perform mathematical operations. One such method is the bag of words model.

Consider the following sentences as reviews about a movie.

'Movie is good and movie is worth watch'
'Movie is average but story is really good'
'I like the movie and the fight'

After data-preprocessing we can represent it as:

'movie good movie worth watch'
'movie average story really good'
'i like movie fight'

The three reviews can be represented as a collection of words.

['movie', 'good', 'movie', 'worth', 'watch']
['movie', 'average', 'story', 'really', 'good']
['I', 'like', 'movie', 'fight']

We treat each sentence as a separate document, and we make a list of all words from all three documents, excluding punctuation and repetitions.

'movie', 'good', 'worth', 'watch','average', 'story', 'really', 'I', 'like', 'fight'

The next step is to create vectors, which convert the text into a form that can be used by a machine learning algorithm. We will take the first document, 'movie good movie worth watch', and check the frequency of each of the 10 unique words listed above.

'movie' : 2
'good' : 1
'worth': 1
'watch' : 1
'average' : 0
'story': 0
'really': 0
'I': 0
'like': 0
'fight' : 0

Thus for review 1 our vector representation is [2,1,1,1,0,0,0,0,0,0]

Similarly, the other reviews are represented as:

'movie good movie worth watch' = [2,1,1,1,0,0,0,0,0,0]
'movie average story really good' = [1,1,0,0,1,1,1,0,0,0]
'I like movie fight' = [1,0,0,0,0,0,0,1,1,1]
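
The same counts can be reproduced by hand with Python's collections.Counter. The following is just a minimal sketch to verify the vectors above; the names vocab and reviews are illustrative and not part of the original code.

from collections import Counter

# vocabulary in the order listed above
vocab = ['movie', 'good', 'worth', 'watch', 'average',
         'story', 'really', 'I', 'like', 'fight']

reviews = ['movie good movie worth watch',
           'movie average story really good',
           'I like movie fight']

for review in reviews:
    counts = Counter(review.split())         # word -> frequency in this review
    print([counts[word] for word in vocab])  # 0 for words that do not occur

# [2, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
# [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]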

In this approach, each word or token is called a "gram". A collection of these documents forms a corpus. A large corpus with a huge vocabulary may result in vectors with lots of zero values; such a vector is called a sparse vector. Sparse vectors require more memory and computational resources when modeling, and together they form a sparse matrix. Creating a vocabulary of two-word pairs is called a bigram model. For example, the bigrams of the first review are:

'movie good'
'good movie'
'movie worth'
'worth watch'

Based on these bigrams we create a sparse vector for each document. Similarly there are trigrams and, in general, n-grams (sequences of n consecutive words).

CountVectorizer works on term frequency, i.e. counting the occurrences of tokens and building a sparse matrix of documents. We will use CountVectorizer to convert our reviews to vectors. CountVectorizer is available in the scikit-learn library (sklearn).

dataset = ['movie good movie worth watch',
           'movie average story really good',
           'I like movie fight']

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(ngram_range=(1, 1))  # unigrams only
final_counts = count_vect.fit_transform(dataset)
print(count_vect.get_feature_names())  # use get_feature_names_out() in scikit-learn >= 1.0
print(final_counts.toarray())

Output:

['average', 'fight', 'good', 'like', 'movie', 'really', 'story', 'watch', 'worth']
[[0 0 1 0 2 0 0 1 1]
[1 0 1 0 1 1 1 0 0]
[0 1 0 1 1 0 0 0 0]]
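
For comparison with the unigram counts above, here is a small sketch (an addition, not part of the original code) that builds the bigram representation discussed earlier by setting ngram_range=(2, 2); the variable names bigram_vect and bigram_counts are illustrative.

from sklearn.feature_extraction.text import CountVectorizer

dataset = ['movie good movie worth watch',
           'movie average story really good',
           'I like movie fight']

# ngram_range=(2, 2) keeps only two-word sequences (bigrams)
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_counts = bigram_vect.fit_transform(dataset)
print(bigram_vect.get_feature_names())  # use get_feature_names_out() in scikit-learn >= 1.0
print(bigram_counts.toarray())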

Limitations of Bag Of Words model:

  • Bag of words models don't respect the semantics of words. For example, the words 'car' and 'automobile' are often used in the same context, yet the vectors corresponding to these words are orthogonal in the bag of words model. The problem becomes worse when modeling sentences.
  • When modeling phrases with bag of words, the order of words in the phrase is not respected. For example, "This is bad" and "Is this bad" have exactly the same vector representation (see the short sketch below).
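
The second limitation is easy to verify; the short sketch below (an illustration, not from the original article) shows that both phrases receive identical count vectors.

from sklearn.feature_extraction.text import CountVectorizer

phrases = ['This is bad', 'Is this bad']

vect = CountVectorizer()
vectors = vect.fit_transform(phrases).toarray()
print(vect.get_feature_names())  # ['bad', 'is', 'this']
print(vectors)                   # both rows are [1 1 1]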

TF-IDF Vectorizer

TF-IDF stands for term frequency-inverse document frequency. The TF-IDF value is obtained by multiplying the TF score and the IDF score.

Term Frequency(TF):

The term frequency of a word is the frequency of the word in the document. The term frequency is often divided by the document length to normalize it.

Inverse Document Frequency (IDF): is a scoring of how rare the word is across documents. It reflects how important a word is to a document in a collection or corpus.
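
To make the idea concrete, here is a minimal from-scratch sketch of the textbook formulas (an illustration, not the scikit-learn implementation): TF(w, d) is the count of w in d divided by the length of d, and IDF(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, L2-normalizes each row and drops single-character tokens, so the numbers in the next snippet will differ; in particular, 'movie' occurs in every review and gets an IDF of 0 here, while the smoothed version keeps it nonzero.

import math

dataset = ['movie good movie worth watch',
           'movie average story really good',
           'I like movie fight']

docs = [review.lower().split() for review in dataset]
vocab = sorted(set(word for doc in docs for word in doc))
N = len(docs)

def tf(word, doc):
    # term frequency: count of the word divided by the document length
    return doc.count(word) / len(doc)

def idf(word):
    # inverse document frequency: log of (number of documents / documents containing the word)
    df = sum(1 for doc in docs if word in doc)
    return math.log(N / df)

print(vocab)
for doc in docs:
    print([round(tf(word, doc) * idf(word), 3) for word in vocab])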

We will use TfidfVectorizer to convert our reviews to TF-IDF vectors. TfidfVectorizer is available in the scikit-learn library (sklearn).

from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer()
count = vector.fit_transform(dataset)
print(vector.get_feature_names())
print(count.toarray())

Output:

['average', 'fight', 'good', 'like', 'movie', 'really', 'story', 'watch', 'worth']
[[0. 0. 0.38151877 0. 0.59256672 0.
0. 0.50165133 0.50165133]
[0.50461134 0. 0.38376993 0. 0.29803159 0.50461134
0.50461134 0. 0. ]
[0. 0.65249088 0. 0.65249088 0.38537163 0.
0. 0. 0. ]]

Limitations of TF- IDF model:

  • TF-IDF is based on the bag-of-words (BoW) model; therefore it does not capture the position of words in the text or the semantics of words.

In order to solve this problem we can use the word2vec model, which I will explain in another article. After feature engineering we can build a model on top of these features, using algorithms such as naive Bayes.
