A Journey from One to 1 with Word Embeddings
Humans naturally communicate by reading, writing, listening, and speaking languages. For computers and machines to understand human language effectively, however, it must be converted into a form they can represent and process. This is where word embeddings come into play.
But why do we need word embeddings, and what is their purpose? Word embeddings are a critical component of Natural Language Processing (NLP), designed to help computers comprehend human language more intuitively. Language is hard to translate into machine terms because of its many word senses, interpretations, styles, implicit meanings, and slang. NLP addresses these language barriers by employing techniques such as data preprocessing, dictionary and thesaurus construction, corpus building, and semantic analysis. These processes enable computers to perform a range of machine learning tasks, including classification, clustering, collaborative filtering, and other information retrieval and text mining approaches.
In simple terms, word embedding is a technique used to represent text as numerical vectors. By converting words into numbers, computers can efficiently process and analyze text data. Essentially, word embedding transforms text into a quantified vector space, reducing its dimensionality while preserving meaningful relationships between words. I often refer to this process as ‘vectorization’.
One-hot Encoding
The most basic form of word embedding is known as one-hot encoding, sometimes referred to as 1-of-N encoding. In this technique, each word in a vocabulary is represented by a vector of zeros, except for one element which is set to 1, corresponding to the index of the word in the vocabulary. The length of the vector is equal to the size of the vocabulary, making it a high-dimensional representation.
+-----------+---------------+
| word | vector |
+-----------+---------------+
| Stark | [1,0,0,0] |
| Lannister | [0,1,0,0] |
| Targaryen | [0,0,1,0] |
| Greyjoy | [0,0,0,1] |
+-----------+---------------+
Each word is represented as a binary vector that is all zeros except at the index assigned to that word, which is marked with a 1.
For the evaluation, we can say that one-hot encoding is the simplest method to implement, that the result is binary rather than ordinal, and that every vector sits in an orthogonal vector space. But there are drawbacks we should consider. The number of dimensions increases linearly with the size of the vocabulary (high cardinality). There is also no semantic relationship between words: every pair of word vectors lies at the same n-dimensional distance. To overcome the curse of dimensionality, we can apply feature selection and feature extraction for dimension reduction.
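As a concrete illustration, here is a minimal one-hot encoder in plain Python; the four-word vocabulary simply mirrors the table above:

# A minimal one-hot encoder for a small, fixed vocabulary.
vocab = ["Stark", "Lannister", "Targaryen", "Greyjoy"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single 1 at the word's index.
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("Targaryen"))  # [0, 0, 1, 0]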
Bag of Words
Bag of Words (BoW) is a simple model used in text mining and information retrieval. Imagine a sentence or a document as a bag containing words. The model keeps only the words and their frequency of occurrence, without considering word order or the semantic relationships between words.
So, the document “A is better than B” is identical to the document “B is better than A”. Nevertheless, it seems intuitive that two documents with similar bag-of-words representations are similar in content. To map each word to a numerical score, Term Frequency (TF) is the next step.
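As a rough sketch, a bag-of-words representation is just a word-count mapping, which is why the two documents above collapse to the same bag; plain Python's Counter is enough to show this:

from collections import Counter

def bag_of_words(document):
    # Count word occurrences, ignoring order and grammar.
    return Counter(document.lower().split())

print(bag_of_words("A is better than B"))
print(bag_of_words("B is better than A"))
# Both bags contain exactly the same counts, so the two documents look identical to the model.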
N-Gram
Since the Bag of Words model is an orderless representation, the N-Gram model can be used as an alternative to preserve spatial (word-order) information in a document, with N denoting the size of the parsing unit:
Corpus: “You know nothing John Snow”
Unigram: {“you”, “know”, “nothing”, “John”, “Snow”}
Bigram: {“you know”, “know nothing”, “nothing John”, “John Snow”}
Trigram: {“you know nothing”, “know nothing John”, “nothing John Snow”}
4-grams: {“you know nothing John”, “know nothing John Snow”}
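A small helper reproduces the lists above for any N; the whitespace tokenization here is only for illustration:

def ngrams(text, n):
    # Return the list of n-grams (joined as strings) from a whitespace-tokenized text.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "you know nothing John Snow"
print(ngrams(corpus, 2))  # ['you know', 'know nothing', 'nothing John', 'John Snow']
print(ngrams(corpus, 3))  # ['you know nothing', 'know nothing John', 'nothing John Snow']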
TF-IDF
Another popular vectorization example in text processing is TF-IDF (Term Frequency - Inverse Document Frequency). The main goal of TF-IDF is to assign a numerical weight/score to each term/word while taking the importance of each word into account.
The background of TF-IDF comes from TF (Term Frequency). The idea is that when a document mentions a word more often, that word should receive a higher score.
But the problem is that the most frequent words are often not the important ones (in most cases, at least; it depends on the task). Words like the, and, or, to, from, a, an, in, at, on (stop words) will affect the relevance result. Some approaches to tackle this limitation:
Noise removal: a data-cleaning approach that removes stop words (unimportant words) from a predefined stop list before the weighting process.
Document Frequency (DF): scale down the term weights of high-frequency terms by a factor that grows with their collection frequency, where DF is the number of documents in the collection that contain the term.
Inverse Document Frequency (IDF): define a logarithmically scaled (non-linear) weight based on the DF. The IDF of a rare term is high, while the IDF of a frequent term is likely to be low.
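Putting TF, DF, and IDF together, a toy weighting scheme might look like the sketch below. It uses the common log(N / DF) form of IDF; the three short documents are made up for illustration, and real libraries such as scikit-learn add smoothing on top of this.

import math
from collections import Counter

documents = [
    "the north remembers the north".split(),
    "the lion and the rose".split(),
    "the winter is coming".split(),
]

N = len(documents)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in documents for term in set(doc))

def tf_idf(doc):
    # Weight each term by its in-document count times log(N / DF).
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

print(tf_idf(documents[0]))
# "the" appears in every document, so its weight is log(3/3) = 0,
# while a rarer term such as "north" keeps a positive weight.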
GloVe
After Tomas Mikolov et al. released word2vec, there was a boom of articles about word vector representations. One of the best of these is Stanford’s GloVe: Global Vectors for Word Representation, which explained why such algorithms work and reformulated the word2vec optimizations as a special kind of factorization of word co-occurrence matrices.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
GloVe creates a continuous N-dimensional representation of a word that is learned from its surrounding context words in a training corpus.
The GloVe algorithm consists of the following steps. First, collect word co-occurrence statistics in the form of a word co-occurrence matrix X, where each element Xij represents how often word i appears in the context of word j. Usually we scan the corpus as follows: for each term we look for context terms within an area defined by a window_size before the term and a window_size after the term, giving less weight to more distant words, usually with the decay formula decay = 1/offset, where offset is the distance between the term and the context term.
Next, define soft constraints for each word pair: wi · wj + bi + bj = log(Xij), where wi is the vector for the main word, wj is the vector for the context word, and bi and bj are scalar biases for the main and context words.
Finally, define a cost function:
J = Σij f(Xij) · (wi · wj + bi + bj - log(Xij))²
Here f is a weighting function which helps prevent learning only from extremely common word pairs. The GloVe authors choose f(Xij) = (Xij / xmax)^α when Xij < xmax and f(Xij) = 1 otherwise, with xmax = 100 and α = 3/4 in the paper.
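As a rough sketch of the first step, the weighted co-occurrence counts Xij can be accumulated with a 1/offset decay inside the window; fitting the vectors and biases against the cost function above is left to a full GloVe implementation.

from collections import defaultdict

def cooccurrence_counts(tokens, window_size=4):
    # Accumulate X[(word, context)] with weight 1/offset for context words inside the window.
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                X[(word, tokens[j])] += 1.0 / abs(j - i)
    return X

tokens = "you know nothing John Snow".split()
for pair, weight in sorted(cooccurrence_counts(tokens, window_size=2).items()):
    print(pair, round(weight, 2))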
Nearest Neighbors
The main way to explore GloVe vectors is through nearest neighbors, using a distance measure to surface the semantic relationships between word vectors.
The semantic questions are typically analogies about people or places, like “Athens is to Greece as Berlin is to ?”. The syntactic questions are typically analogies about verb tenses or forms of adjectives, for example “dance is to dancing as fly is to ?”. To correctly answer the question, the model should uniquely identify the missing term, with only an exact correspondence counted as a correct match. We answer the question “a is to b as c is to ?” by finding the word d whose representation wd is closest to wb − wa + wc according to the cosine similarity.
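Given any matrix of pretrained word vectors, the analogy rule above is a few lines of vector arithmetic. The tiny hand-made vectors below are only placeholders standing in for real pretrained GloVe embeddings:

import numpy as np

# Placeholder vectors; in practice these would be loaded from a pretrained GloVe file.
vectors = {
    "athens": np.array([0.9, 0.1, 0.0]),
    "greece": np.array([0.8, 0.2, 0.1]),
    "berlin": np.array([0.1, 0.9, 0.0]),
    "germany": np.array([0.0, 0.8, 0.1]),
}

def analogy(a, b, c):
    # Answer "a is to b as c is to ?" by maximizing cosine similarity to wb - wa + wc.
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: np.dot(candidates[w], target)
               / (np.linalg.norm(candidates[w]) * np.linalg.norm(target)))

print(analogy("athens", "greece", "berlin"))  # germany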
There are two popular approaches to measuring vector distance: Euclidean distance and cosine similarity.
Euclidean Distance
The notion of Euclidean distance, which works well in the two- and three-dimensional worlds studied by Euclid, has some properties in higher dimensions that are contrary to geometric intuition, since that intuition is itself an extrapolation from two and three dimensions.
And due to the squared terms, it is sensitive to noise.
But when transforming texts to vectors we end up with very many dimensions, and Euclidean distance does not discriminate well in such high-dimensional spaces.
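For reference, the Euclidean distance between two dense word vectors is just the square root of the summed squared differences (NumPy arrays of equal length assumed):

import numpy as np

def euclidean_distance(u, v):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((u - v) ** 2))

u = np.array([1.0, 0.0, 2.0])
v = np.array([2.0, 2.0, 0.0])
print(euclidean_distance(u, v))  # 3.0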
Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle in the interval (0, π].
Cosine similarity is better at capturing the semantics of each text because it keeps the direction information: the direction of a vector can be taken as its meaning, so texts with similar meanings end up with similar vectors regardless of their length.
Cosine similarity is also popular because it is efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered.
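A sketch of cosine similarity over sparse vectors stored as dictionaries makes that efficiency visible: only dimensions that are non-zero in both vectors contribute to the dot product.

import math

def cosine_similarity(u, v):
    # Dot product over the product of norms, for dict-based sparse vectors.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

doc_a = {"winter": 2, "is": 1, "coming": 1}
doc_b = {"winter": 1, "came": 1}
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.577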
For a probability metric space, an entropy-based distance (such as KL divergence) can be the choice.
So far we have explored basic, count-based word embeddings. Next, neural networks take over for prediction-based models.
Word2Vec
Word2vec was created by a team of researchers led by Tomas Mikolov.
Word2vec is not a single algorithm but a combination of two distinct models: CBoW (Continuous Bag of Words) and Skip-gram. Both are shallow neural networks (which is why word2vec is not considered deep learning) that map a word to a target variable which is also a word. Both techniques learn weights which act as the word vector representations.
Continuous Bag of Words (CBoW)
Assume that the current word in a sentence is wi. The input to the model could be [wi−2, wi−1, wi+1, wi+2], the preceding and following words of the current word. The output of the neural network will be wi.
Skip-gram
The input to the model is wi, and the output will be [wi−1, wi−2, wi+1, wi+2]. The idea behind skip-gram is to take in a word and predict the context words. So, we can say that CBoW predicts target from context and Skip-gram predicts context words from target.
As a simple instance, the sentence “hi oswin how was your day?” becomes:
CBoW: 3-grams {“hi oswin how”, “oswin how was”, “how was your”, …}
Skip-gram: 1-skip 3-grams {“hi oswin how”, “hi oswin was”, “oswin how was”, “oswin how your”, …}
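One way to see the difference is to list the (input, target) training pairs each model would generate for a window of one word on each side; this is only a sketch of the data each network sees, not of the networks themselves.

def training_pairs(tokens, window=1, mode="cbow"):
    # CBoW pairs are (context words, target word); Skip-gram pairs are (word, one context word).
    pairs = []
    for i, word in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, word))
        else:
            pairs.extend((word, c) for c in context)
    return pairs

tokens = "hi oswin how was your day".split()
print(training_pairs(tokens, mode="cbow")[:2])      # [(['oswin'], 'hi'), (['hi', 'how'], 'oswin')]
print(training_pairs(tokens, mode="skipgram")[:3])  # [('hi', 'oswin'), ('oswin', 'hi'), ('oswin', 'how')]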
According to Mikolov:
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
This can get a bit more complicated if you consider that there are two different ways to train the models: the normalized hierarchical softmax and the un-normalized negative sampling. Both work quite differently.
In his paper, Mikolov recommends using the skip-gram model with negative sampling (SGNS), as it outperformed the other variants on analogy tasks.
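In practice you rarely implement SGNS by hand; a library such as gensim exposes these choices as parameters. A minimal sketch, assuming gensim 4.x, where sg=1 selects skip-gram, negative=5 enables negative sampling, and hs=0 turns off hierarchical softmax; the toy corpus is far too small to yield meaningful vectors.

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. Real training needs far more data.
sentences = [
    "you know nothing john snow".split(),
    "winter is coming to the north".split(),
    "the north remembers".split(),
]

# Skip-gram with negative sampling (SGNS), the variant Mikolov recommends.
model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the word vectors
    window=2,        # context window on each side
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = skip-gram, 0 = CBoW
    negative=5,      # number of negative samples
    hs=0,            # do not use hierarchical softmax
)

print(model.wv.most_similar("north", topn=3))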