While learning NLP, one question comes to everyone's mind: why do we choose Word2Vec over Bag of Words and TF-IDF?
First of all, let's see what Bag of Words and TF-IDF are.
Bag of Words: The bag-of-words model is a simplifying representation used in natural language processing and information retrieval to extract features from text documents. It represents a text by the occurrence (count) of each word within the document, ignoring grammar and word order.
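As a minimal sketch of the idea, here is a bag-of-words count built with Python's standard `collections.Counter` (a plain illustration, not a specific library's implementation):

```python
from collections import Counter

def bag_of_words(document):
    """Count how often each word occurs in a document (order is ignored)."""
    return Counter(document.lower().split())

doc = "the cat sat on the mat"
bow = bag_of_words(doc)
print(bow["the"])  # "the" occurs twice
print(bow["cat"])  # "cat" occurs once
```

Notice that the representation keeps only counts: "the cat sat on the mat" and "the mat sat on the cat" produce exactly the same bag.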
TF-IDF: TF-IDF (term frequency–inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
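The measure can be sketched in a few lines of plain Python (a toy version assuming the common log-based IDF; real libraries add smoothing and normalization):

```python
import math

def tf_idf(term, document, corpus):
    # Term frequency: share of this document's words that are the term.
    tf = document.count(term) / len(document)
    # Inverse document frequency: the rarer a term is across the
    # corpus, the larger its weight.
    n_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_docs_with_term)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]

# "the" appears in every document, so its IDF (and score) is zero.
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "cat" is rarer across the corpus, so it scores higher.
print(tf_idf("cat", corpus[0], corpus))
```

This also shows the weighting behaviour discussed later: a word that happens to be rare gets a large IDF whether or not it actually matters.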
Till now everything is fine. Now let's come to the real world…
In some cases, if you are using Bag of Words or TF-IDF, you might miss some important keywords. So,
Why not Bag of Words and TF-IDF?
In both of these cases, the semantic information is not stored, and semantic information matters a great deal in Natural Language Processing.
What does semantic information actually mean?
The concept of semantic information refers to information which is in some sense meaningful for a system, rather than merely correlational.
On the other side, TF-IDF also gives importance to some uncommon words, which is sometimes unhelpful.
There is also a chance of overfitting the data.
The reason: with Bag of Words or TF-IDF, the model memorizes the training data rather than generalizing from it.
Multiple Problems, One Solution!!!
And here comes word2vec.
What is word2vec?
In Word2Vec, instead of assigning a single value to each word, every word is represented as a vector of 32 or more dimensions, so the semantic information and the relations between different words are preserved.
Let's make it simple!
Let's take two dimensions:
Here, if we give the machine the expression Queen - Woman + Man, the output will be King, because King relates to Queen in the same way that Man relates to Woman.
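The arithmetic above can be sketched with hand-crafted 2-D vectors (these values are made up purely to illustrate the idea, not taken from a trained model):

```python
import math

# Hand-crafted 2-D "embeddings": one axis for royalty, one for maleness.
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.8),
    "woman": (0.1, 0.1),
    "apple": (0.05, 0.45),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Queen - Woman + Man, component by component.
target = tuple(q - w + m for q, w, m in
               zip(vectors["queen"], vectors["woman"], vectors["man"]))

# The nearest word to the result (excluding the query words) is "king".
candidates = [w for w in vectors if w not in {"queen", "woman", "man"}]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)  # king
```

The analogy works because the gender difference (Queen vs. Woman) and the royalty difference (Man vs. King) each point in a consistent direction in the vector space.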
You can see that every word is vectorized separately and given equal importance, so there is nothing to worry about: neither data loss nor overfitting.
Hope you all understood!!
Happy Pythoning!!