Understanding CountVectorizer, TfidfTransformer & TfidfVectorizer with Calculations

Dharmendra Sahani
4 min read · Oct 11, 2020

We often come across these terms in NLP projects. In this article, let us try to understand where the numbers at the end of the calculation come from and why this is done. CountVectorizer, TfidfTransformer & TfidfVectorizer are frequency-based word embedding techniques used to convert text into numeric form that a machine learning model can consume, since models cannot be trained on raw text of variable length. We will go through them one by one and try to understand each with Python code.

1. CountVectorizer

In simple terms, CountVectorizer converts text into a matrix of word counts called a Document Term Matrix (DTM), where terms are represented as columns and documents as rows. This is also referred to as Bag of Words (BoW), which in layman's terms means counting the words and putting them into a bag irrespective of their order or structure. Let's see the Python code below:
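The article's original code and corpus are in screenshots that aren't reproduced here, so the snippet below is a minimal sketch. The data_set is an assumed stand-in chosen to match the counts discussed later: four documents, a 10-word vocabulary, "pizza" in three documents and "cuisine" in one. The exact weights in the article's screenshots came from the author's own corpus and will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed stand-in corpus: four documents, 10 distinct words overall.
data_set = [
    "the pizza is very good",
    "italian cuisine pizza is the best",
    "the pizza is delicious",
    "italian is tasty",
]

vectorizer = CountVectorizer()
sparse_matrix = vectorizer.fit_transform(data_set)  # learn vocabulary, build DTM

print(vectorizer.get_feature_names_out())  # the 10 terms (column order)
print(sparse_matrix.shape)                 # (4, 10): 4 documents x 10 terms
```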

So here what has been done is we imported the CountVectorizer class from the sklearn.feature_extraction.text module to convert the document collection data_set into a Document Term Matrix (DTM). Altogether we have four documents. First we instantiated CountVectorizer, then called fit_transform, which learned the vocabulary and transformed the text into a 4×10 sparse matrix. If we just inspect the result of vectorizer.fit_transform(data_set), we find that it has eliminated all zeros from the matrix to improve memory utilization, as the output below shows.
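Since the screenshot isn't reproduced here, the sparse print-out from the sketch above looks roughly like this (the exact layout varies slightly across scipy versions):

```python
print(sparse_matrix)
# Only non-zero entries are stored, shown as (row, column)  count, e.g.:
#   (0, 3)   1   <- "good" appears once in document 0
#   (0, 6)   1   <- "pizza" (column 6) appears once in document 0
#   ...
# The zeros are simply never stored, which is what saves memory.
```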

When we convert the sparse matrix into a dense DataFrame, we clearly see that a feature's entry is zero in every document where that word is absent. For example, "cuisine" is present only in document 2 and nowhere else, so the rest of its column is zeros, which wastes memory; that is why the end product of fit_transform is a sparse matrix, which shows the position on the left-hand side, (0, 6), and the count on the right-hand side, (1), for the word "pizza".
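Continuing the sketch, densifying the DTM into a pandas DataFrame makes all the stored zeros visible:

```python
import pandas as pd

# Densify the sparse DTM so every zero is explicit.
dtm = pd.DataFrame(
    sparse_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
)
print(dtm)
# The "cuisine" column has a 1 only in the second document's row;
# every other entry is 0, which a dense layout must store explicitly.
```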

2. TF-IDF

TF-IDF stands for term frequency–inverse document frequency. TF is simply the frequency of the term in the document term matrix, and IDF (in scikit-learn's smoothed default form, with log being the natural logarithm) is:

idf(t) = log[ (1 + D) / (1 + df(d, t)) ] + 1

where D is the number of documents and df(d, t) is the number of documents in which term t appears in the DTM. The TF-IDF algorithm is used extensively in information retrieval systems and document search. Let's calculate IDF using the same data as above. Calculating IDF for the word "pizza" manually with this formula gives 1.223.
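The manual calculation takes a couple of lines; with 4 documents and "pizza" appearing in 3 of them (as in the assumed corpus above), the smoothed formula reproduces the article's 1.223:

```python
import numpy as np

n_docs = 4    # D: total number of documents
df_pizza = 3  # df(d, t): documents containing "pizza"

idf_pizza = np.log((1 + n_docs) / (1 + df_pizza)) + 1
print(round(idf_pizza, 4))  # 1.2231
```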

We can also compute this using scikit-learn's TfidfTransformer, as shown below.
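A sketch of that, continuing from the count matrix built earlier:

```python
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()  # smooth_idf=True, norm='l2' by default
transformer.fit(sparse_matrix)    # learn IDF weights from the count DTM

# One IDF value per vocabulary term; for the assumed corpus,
# column 6 ("pizza") comes out as 1.2231, matching the manual result.
print(transformer.idf_)
```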

So 1.223 is the same value that we got by manual calculation. Now let's calculate TF * IDF.
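Sketched with the objects defined above:

```python
tf = sparse_matrix.toarray()[0]  # term counts for the first document
idf = transformer.idf_           # IDF weight for each of the 10 terms

tf_idf = tf * idf                # raw (un-normalized) tf-idf values
print(tf_idf)                    # 10 floats, one per vocabulary term
```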

Here tf is an array holding the first row of our DTM, and idf is the vector calculated by TfidfTransformer() for all words in the DTM, i.e. for all 10 words; that is why the list shows 10 floating-point numbers.

This is not the end; there is one more transformation applied to get the final result, called normalization. It happens when we call .transform after .fit (or both at once with .fit_transform). TfidfTransformer does the above calculation and L2 normalization (by default) on the sparse matrix, so we need not do it manually.

L2 normalization is the usually preferred way to normalize, and it is the default in scikit-learn. We can see the result below.
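Continuing the sketch:

```python
# .transform applies the tf*idf weighting and the L2 normalization in one go.
tfidf_matrix = transformer.transform(sparse_matrix)
print(tfidf_matrix.toarray()[0])  # normalized tf-idf weights for document 1
```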

Let's verify the same calculation by hand for the first row of the DTM.
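L2 normalization divides each row by its Euclidean norm, so by hand, using the un-normalized tf_idf row from before:

```python
# Divide the raw tf-idf row by its Euclidean (L2) norm.
l2_norm = np.sqrt(np.sum(tf_idf ** 2))
print(tf_idf / l2_norm)  # matches the first row of transformer.transform(...)
```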

So we have got the same result, which looks good. One important thing I would like to add here is that TfidfVectorizer() and TfidfTransformer() perform the same calculation; the only difference is that TfidfTransformer acts on a count (sparse) matrix, while TfidfVectorizer acts on raw text data. In our case the raw text data is data_set.
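A quick check of that equivalence on the assumed corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# CountVectorizer + TfidfTransformer rolled into one step over raw text.
tfidf_vectorizer = TfidfVectorizer()
result = tfidf_vectorizer.fit_transform(data_set)

print(result.toarray()[0])  # identical to the transformer's output above
```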

Interpretation of TF-IDF:

A larger value suggests that a word is more important in the document. For example, the word "cuisine" got more weight (0.538) than the word "pizza" (0.4480). If we include stopwords like "is", "an", and "the", which are very frequent in any document, they will get very little weight.

I hope this article is helpful in understanding CountVectorizer, TfidfTransformer & TfidfVectorizer. Any feedback or suggestions are highly welcome.

The code can be found on GitHub.


Dharmendra Sahani

For mentorship on learning AI/ML with Azure/AWS/GCP cloud tech, please feel free to connect with me here: https://in.linkedin.com/in/dharmendra-sahani-bb92b11b6