
Word embedding is an NLP technique that maps words or phrases from a vocabulary to vectors of real numbers. Embeddings represent words in a D-dimensional vector space, where D is a hyperparameter you choose. This vector representation lets you perform mathematical operations on words, find word analogies, perform sentiment analysis, and so on.

The most basic embedding in wide use is one-hot encoding, which represents each word as a vector with one column per vocabulary word: a 1 in that word's column and 0 everywhere else. Encoding N observations against a vocabulary of size V yields a matrix of size N x V.
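As a rough illustration of the N x V layout, here is a minimal pure-Python sketch. The four-word list of observations is made up for the example, and the helper name one_hot is hypothetical rather than part of any library.

```python
# Toy observations (illustrative only); a real vocabulary comes from your corpus.
observations = ["cat", "dog", "cat", "bird"]

# A sorted vocabulary gives every word a stable column index.
vocab = sorted(set(observations))
word_to_col = {word: col for col, word in enumerate(vocab)}

def one_hot(word):
    """Return a V-length list with a single 1 in the word's column."""
    row = [0] * len(vocab)
    row[word_to_col[word]] = 1
    return row

# N rows (one per observation) by V columns (one per vocabulary word).
matrix = [one_hot(word) for word in observations]
for word, row in zip(observations, matrix):
    print(word, row)
```

Every row sums to 1, and the matrix grows one column wider for each new vocabulary word, which is why one-hot vectors become impractical for large vocabularies.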

Word embeddings have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.

There are many techniques to create Word Embeddings. Some of the popular ones are:

  • Binary Encoding
  • TF Encoding
  • TF-IDF Encoding
  • Word2Vec Encoding
    • Skip-Gram
    • CBOW (Continuous Bag of Words)
  • FastText

Different Ways of Using Word Embeddings:

  1. Learning the Embedding: Embeddings can be learnt from your own corpus, but a large amount of text is required for them to be useful. They can be trained with a standalone algorithm such as Word2Vec or GloVe, which is the better option when you want to reuse the embeddings across multiple models. Alternatively, they can be trained as part of a task-specific model such as a classifier; the main drawback is that the learnt embeddings are specific to the task at hand and generally can’t be reused.

  2. Reusing Pretrained Embeddings: Most word embeddings trained by researchers with the algorithms above are available for download and can be used in your projects, subject to their license. You can either keep them frozen (non-trainable) in your model, which suits the general tasks they were trained for, or allow them to be updated during training, which gives better results for the task at hand.

Let's step through a live demo of creating word embeddings.

Loading the Data

We will be working with the Amazon Reviews dataset downloaded from Kaggle. Each review is labelled with its sentiment, i.e. positive or negative. We will go through the reviews and create embeddings for them. We will use only 1K observations for this exercise; you can use as many as your machine can handle without crashing.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import spacy
label review
0 pos Stuning even for the non-gamer: This sound tra...
1 pos The best soundtrack ever to anything.: I'm rea...
2 pos Amazing!: This soundtrack is my favorite music...
3 pos Excellent Soundtrack: I truly like this soundt...
4 pos Remember, Pull Your Jaw Off The Floor After He...
from sklearn.preprocessing import LabelEncoder

Basic Preprocessing

Creating Stopwords Corpus

## Combining Spacy, NLTK, and WordCloud Stopword List
from nltk.corpus import stopwords
from wordcloud import STOPWORDS

Preprocessing Text

  • Removing Punctuation
  • Lemmatizing
  • Converting to Lower Case
  • Removing Pronouns
  • Removing URLs
# Collect one list of cleaned lemmas per review
reviews = []
for t in X:
    reviews.append([i.lemma_ for i in nlp(t.lower()) if not i.is_punct and 
                    i.pos_!='PRON' and 
                    not i.like_url and 
                    i.text not in stopword_corpus])

Tokenizing the Reviews

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


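# Assumption: the fitting step is missing from this listing; the tokenizer
# is fitted here on the lists of lemmas built above
tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews)

word2id = tokenizer.word_index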

id2word={v:k for k,v in word2id.items()}

max_len=max([len(s) for s in reviews])

# Map every token to its integer id
tokenized_reviews = []
for r in reviews:
    tokenized_reviews.append([tokenizer.word_index[w.lower()] for w in r])

TF (Count Vectorizer)

Word vectors by counting contexts

So how do we turn this insight from the Distributional Hypothesis into a system for creating general-purpose vectors that capture the meaning of words? Maybe you can see where I’m going with this. What if we made a really big spreadsheet that had one column for every context for every word in a given source text? Let’s use a small source text to begin with, such as this excerpt from Dickens:

It was the best of times, it was the worst of times.

Such a spreadsheet might look something like this:

[Figure: dickens contexts]

The spreadsheet has one column for every possible context, and one row for every word. The value in each cell is the number of times the word occurs in the given context. The numbers in each row constitute that word’s vector, i.e., the vector for the word of is

[0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

Because there are ten possible contexts, this is a ten dimensional space! It might be strange to think of it, but you can do vector arithmetic on vectors with ten dimensions just as easily as you can on vectors with two or three dimensions, and you could use the same distance formula that we defined earlier to get useful information about which vectors in this space are similar to each other. In particular, the vectors for best and worst are actually the same (a distance of zero), since they occur only in the same context (the ___ of):

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Of course, the conventional way of thinking about “best” and “worst” is that they’re antonyms, not synonyms. But they’re also clearly two words of the same kind, with related meanings (through opposition), a fact that is captured by this distributional model.
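The counts described above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a context is the (previous word, next word) pair, with hypothetical <START> and <END> markers padding the sentence. Under that assumption it yields ten contexts, listed in order of first appearance, and the same vectors quoted above.

```python
import math
import re
from collections import defaultdict

text = "It was the best of times, it was the worst of times."
tokens = re.findall(r"[a-z]+", text.lower())
padded = ["<START>"] + tokens + ["<END>"]  # boundary markers are an assumption

contexts = []                                   # distinct contexts, first-seen order
counts = defaultdict(lambda: defaultdict(int))  # word -> context -> count
for i in range(1, len(padded) - 1):
    context = (padded[i - 1], padded[i + 1])
    if context not in contexts:
        contexts.append(context)
    counts[padded[i]][context] += 1

# One row per word, one column per context.
vectors = {word: [counts[word].get(c, 0) for c in contexts] for word in counts}

print(len(contexts), "contexts")
print("of   ", vectors["of"])
print("best ", vectors["best"])
print("worst", vectors["worst"])
print("distance(best, worst) =", math.dist(vectors["best"], vectors["worst"]))
```

The Euclidean distance between best and worst comes out as zero because both words occur only in the context (the, of).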

from sklearn.feature_extraction.text import CountVectorizer
# Initialize a CountVectorizer object: count_vec
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# Learn the vocabulary, then transform the data into a bag of words
docs = [" ".join(r) for r in reviews]
count_train = count_vec.fit(docs)
bag_of_words = count_vec.transform(docs)

# Optionally inspect the feature names
# print("Every feature:\n{}".format(count_vec.get_feature_names_out()))
# print("\nEvery 3rd feature:\n{}".format(count_vec.get_feature_names_out()[::3]))
print("Vocabulary size: {}".format(len(count_train.vocabulary_)))
print("Vocabulary content:\n {}".format(count_train.vocabulary_))
Vocabulary size: 6687
Vocabulary content:
The goal of using tf-idf is to scale down the impact of tokens that occur very frequently in a given corpus and are hence empirically less informative than tokens that occur in only a small fraction of the training corpus.

tf-idf(t, d) = tf(t, d) * idf(t)
  • tf(t, d) = the term frequency, i.e. the number of times term ‘t’ appears in document ‘d’
  • idf(t) = the inverse document frequency, which shrinks as the number of documents containing term ‘t’ grows. With smooth_idf=False, scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the total number of documents and df(t) is the number of documents that contain ‘t’
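To make the weighting concrete, here is a small pure-Python sketch of tf-idf on a made-up three-document corpus. It assumes raw term counts for tf and the unsmoothed form idf = ln(n / df) + 1, with no normalization. The function names are hypothetical and this is not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy tokenized corpus (illustrative data, not the Amazon reviews).
docs = [["good", "book"], ["good", "read"], ["bad", "book", "book"]]
n_docs = len(docs)

# df(t): number of documents that contain term t at least once.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Unsmoothed inverse document frequency: ln(n / df) + 1."""
    return math.log(n_docs / df[term]) + 1.0

def tfidf(doc):
    """Map each term in a document to (raw count) * idf."""
    term_counts = Counter(doc)
    return {term: count * idf(term) for term, count in term_counts.items()}

weights = [tfidf(doc) for doc in docs]
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in a single document ("read", "bad") receive a higher idf than terms shared by two documents ("good", "book"), which is the scaling effect described above.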
from sklearn.feature_extraction.text import TfidfVectorizer
# Note: the name tf shadows the tensorflow alias imported earlier
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
docs = [" ".join(r) for r in reviews]
txt_fitted = tf.fit(docs)
txt_transformed = txt_fitted.transform(docs)
idf = tf.idf_
rr = dict(zip(txt_fitted.get_feature_names_out(), idf))
token_weight = pd.DataFrame.from_dict(rr, orient='index').reset_index()
# name the columns so they can be sorted and plotted
token_weight.columns = ('token', 'weight')
token_weight = token_weight.sort_values(by='weight', ascending=False)[:10]
import seaborn as sns
sns.barplot(x='token', y='weight', data=token_weight)            
plt.title("Inverse Document Frequency(idf) per token")


# get feature names
feature_names = np.array(tf.get_feature_names_out())
sorted_by_idf = np.argsort(tf.idf_)
print("Features with lowest idf:\n{}".format(
    feature_names[sorted_by_idf[:3]]))
print("\nFeatures with highest idf:\n{}".format(
    feature_names[sorted_by_idf[-3:]]))
Features with lowest idf:
['book' 'read' 'good']

Features with highest idf:
['nc' 'nathanial' 'zzzzzzzzzzzz']
TF-IDF - Maximum token value throughout the whole dataset
# find maximum value for each of the features over all of dataset:
max_val = txt_transformed.max(axis=0).toarray().ravel()

# sort weights from smallest to biggest and extract their indices 
sort_by_tfidf = max_val.argsort()

print("Features with lowest tfidf:\n{}".format(
    feature_names[sort_by_tfidf[:3]]))

print("\nFeatures with highest tfidf: \n{}".format(
    feature_names[sort_by_tfidf[-3:]]))
Features with lowest tfidf:
['second' 'finish' 'course']

Features with highest tfidf: 
['sword' 'profanity' 'cookie']

Custom Trained Embeddings


Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors close together in the space have similar meanings based on context, and vectors distant from each other have differing meanings. For example, “strong” and “powerful” would be close together, while “strong” and “Paris” would be relatively far apart. There are two versions of this model, based on skip-grams (SG) and continuous-bag-of-words (CBOW), both implemented by the gensim Word2Vec class.

Word2Vec uses a trick you may have seen elsewhere in machine learning. We’re going to train a simple neural network with a single hidden layer to perform a certain task, but then we’re not actually going to use that neural network for the task we trained it on! Instead, the goal is actually just to learn the weights of the hidden layer–we’ll see that these weights are actually the “word vectors” that we’re trying to learn.
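One way to see why the hidden-layer weights double as word vectors: multiplying a one-hot input by the weight matrix simply selects one row of that matrix. The sketch below uses a made-up four-word vocabulary and arbitrary weights, and the helper names are hypothetical.

```python
# Hypothetical vocabulary and a 4 x 3 hidden-layer weight matrix (arbitrary values).
vocab = ["soviet", "union", "russia", "sasquatch"]
W = [
    [0.1, 0.2, 0.3],
    [0.4, -0.5, 0.6],
    [0.7, 0.8, -0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]

def vector_times_matrix(v, M):
    """Plain row-vector by matrix product."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

index = vocab.index("union")
hidden = vector_times_matrix(one_hot(index, len(vocab)), W)

print(hidden)    # activation of the hidden layer for "union"
print(W[index])  # the corresponding row of the weight matrix
```

Because the product only ever picks out a row, practical implementations skip the multiplication and use an embedding lookup table instead.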

The network is going to learn the statistics from the number of times each pairing shows up. So, for example, the network is probably going to get many more training samples of (“Soviet”, “Union”) than it is of (“Soviet”, “Sasquatch”). When the training is finished, if you give it the word “Soviet” as input, then it will output a much higher probability for “Union” or “Russia” than it will for “Sasquatch”.

  • Word2Vec - Skip-gram Model: The skip-gram word2vec model takes in pairs (word1, word2) generated by moving a window across the text, and trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words near it. A virtual one-hot encoding of words goes through a ‘projection layer’ to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.

  • Word2Vec - Continuous-bag-of-words Model: Continuous-bag-of-words word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
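The difference between the two training tasks is easiest to see in the training examples each one produces. The following pure-Python sketch builds both kinds of pairs from a toy sentence. The function names and the sentence are illustrative, and real implementations add steps such as negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: one training example per neighbouring word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window):
    """(context_words, center) pairs: all neighbours together predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)

print("skip-gram:", sg)
print("CBOW:     ", cb)
```

Skip-gram emits one example per (center, neighbour) combination, while CBOW emits exactly one example per position with the whole context grouped together.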


Defining Context Word Pairs

from tensorflow.keras.utils import to_categorical
def generate_context_word_pairs(corpus, window_size, vocab_size):
    context_length = window_size*2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            context_words = []
            label_word   = []            
            start = index - window_size
            end = index + window_size + 1
            # collect the ids inside the window, skipping the target word itself
            context_words.append([words[i] 
                                 for i in range(start, end) 
                                 if 0 <= i < sentence_length 
                                 and i != index])
            label_word.append(word)

            # left-pad short contexts with 0 and one-hot encode the target
            x = pad_sequences(context_words, maxlen=context_length)
            y = to_categorical(label_word, vocab_size)
            yield (x, y)

Sample Inputs and Outputs

i = 0
for x, y in generate_context_word_pairs(corpus=tokenized_reviews, window_size=context_size, vocab_size=V):
    # skip contexts that contain padding (id 0)
    if 0 not in x[0]:
        print('Context (X):', [id2word[w] for w in x[0]], '-> Target (Y):', id2word[np.argwhere(y[0])[0][0]])
        if i == 10:
            break
        i += 1
Context (X): ['stun', 'non', 'sound', 'track'] -> Target (Y): gamer
Context (X): ['non', 'gamer', 'track', 'beautiful'] -> Target (Y): sound
Context (X): ['gamer', 'sound', 'beautiful', 'paint'] -> Target (Y): track
Context (X): ['sound', 'track', 'paint', 'senery'] -> Target (Y): beautiful
Context (X): ['track', 'beautiful', 'senery', 'mind'] -> Target (Y): paint
Context (X): ['beautiful', 'paint', 'mind', 'recomend'] -> Target (Y): senery
Context (X): ['paint', 'senery', 'recomend', 'people'] -> Target (Y): mind
Context (X): ['senery', 'mind', 'people', 'hate'] -> Target (Y): recomend
Context (X): ['mind', 'recomend', 'hate', 'vid'] -> Target (Y): people
Context (X): ['recomend', 'people', 'vid', 'game'] -> Target (Y): hate
Context (X): ['people', 'hate', 'game', 'music'] -> Target (Y): vid

Defining CBOW Architecture

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow import keras

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=V, output_dim=10, input_length=2*2))
cbow.add(Lambda(lambda x: keras.backend.mean(x, axis=1), output_shape=(10,)))
cbow.add(Dense(V, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='adam')

# view model summary
cbow.summary()
Model: "sequential_1"
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 10)             72020     
lambda_1 (Lambda)            (None, 10)                0         
dense (Dense)                (None, 7202)              79222     
Total params: 151,242
Trainable params: 151,242
Non-trainable params: 0
#!pip install pydot graphviz 

Training the CBOW Model

for epoch in range(1, 6):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=tokenized_reviews, window_size=context_size, vocab_size=V):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 10000 == 0:
            print('Processed {} (context, word) pairs'.format(i))
    print('Epoch:', epoch, '\tLoss:', loss/i)
Skipgram Model

Defining Skipgram Pairs

from tensorflow.keras.preprocessing.sequence import skipgrams
skip_grams=[skipgrams(r,V,window_size=4) for r in tokenized_reviews_idx]

Sample Inputs and Outputs

# view sample skip-grams
pairs, labels = skip_grams[0][0], skip_grams[0][1]
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
          id2word[pairs[i][0]], pairs[i][0], 
          id2word[pairs[i][1]], pairs[i][1], 
          labels[i]))

(recomend (1127), slinky (5656)) -> 0
(keyboarding (3168), december (5055)) -> 0
(fresh (1344), take (50)) -> 1
(step (776), crude (705)) -> 1
(grate (3169), orchestra (3171)) -> 1
(mind (236), load (1457)) -> 0
(beautiful (349), sound (152)) -> 1
(music (49), fun*charater (7028)) -> 0
(^_^ (2158), style (161)) -> 0
(game (32), dave (7003)) -> 0

Defining Skipgram Architecture

from tensorflow.keras.models import Model,Sequential
from tensorflow.keras.layers import Input, Dense, Embedding, Lambda,Reshape, Dot

# build skip-gram architecture
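word_model = Sequential()
# Assumption: the trailing Embedding arguments are missing from this listing;
# input_length=1 (one word id per example) is a guess
word_model.add(Embedding(V, embed_size, input_length=1))
word_model.add(Reshape((embed_size, )))

context_model = Sequential()
context_model.add(Embedding(V, embed_size, input_length=1))
context_model.add(Reshape((embed_size, )))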
input_sequence_1 = Input((None,))
input_sequence_2 = Input((None,))
dot=Dot(1)([word_model(input_sequence_1), context_model(input_sequence_2)])
out=Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid")(dot)
skip_gram=Model(inputs=[input_sequence_1, input_sequence_2], outputs=out)
skip_gram.compile(loss="mean_squared_error", optimizer="adam")

Model: "model"
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
input_2 (InputLayer)            [(None, None)]       0                                            
sequential_2 (Sequential)       (None, 10)           72020       input_1[0][0]                    
sequential_3 (Sequential)       (None, 10)           72020       input_2[0][0]                    
dot (Dot)                       (None, 1)            0           sequential_2[1][0]               
dense_1 (Dense)                 (None, 1)            2           dot[0][0]                        
Total params: 144,042
Trainable params: 144,042
Non-trainable params: 0

Training Skipgram Model

for epoch in range(1, 21):
    loss = 0
    for i, elem in enumerate(skip_grams):
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        if i % 1000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        loss += skip_gram.train_on_batch(X,Y)  
    print('Epoch:', epoch, 'Loss:', loss)
Comparing Skipgram Embeddings with CBOW Embeddings

pd.DataFrame(word_emb_cbow, index=id2word.values()).head()
0 1 2 3 4 5 6 7 8 9
book 0.708490 -0.374733 -1.156976 0.061611 1.323896 -0.167964 -0.163473 -0.119780 -0.237391 -0.220100
read 0.735406 -0.177951 -1.701628 -0.180904 0.908241 0.057287 -0.259185 0.507574 -0.836217 0.124057
good 0.382017 -0.094007 -1.040212 -0.299303 0.120877 0.473118 0.280422 -0.570508 -0.499639 0.430844
great 0.134586 0.407227 -1.682064 0.282897 1.348614 0.793174 -0.404906 -1.019498 0.586437 -0.836739
love 0.195157 0.107005 -1.384699 0.248910 1.094821 0.434969 -0.049532 -0.878325 0.349899 -0.685128
pd.DataFrame(word_emb_skipgram, index=id2word.values()).head()
0 1 2 3 4 5 6 7 8 9
book -0.006523 -0.023810 0.018569 -0.005135 -0.009594 -0.027225 0.002897 -0.026608 -0.013788 -0.000316
read 0.696219 -0.371762 -0.280955 -0.019985 -0.103077 -1.282474 -0.550851 -0.432109 -0.228415 0.262486
good -0.351847 0.394407 -0.998008 0.393601 0.488250 -0.620138 -0.151128 -0.424484 -0.238454 0.016535
great 0.260107 -0.087505 0.411976 -0.294271 0.393042 -0.302891 -0.348807 1.168035 -0.540051 0.437487
love -0.080335 -0.385654 -0.596333 0.183089 0.144461 -0.085787 -0.689355 0.973601 0.321709 -0.142176

Visualizing learnt Embeddings

from sklearn.manifold import TSNE
tsne = TSNE()
Z = tsne.fit_transform(word_emb_cbow[:1000])
%matplotlib notebook
plt.scatter(Z[:,0], Z[:,1])
for i in range(len(words)):
    # pass the label positionally: the s= keyword was renamed in newer Matplotlib
    plt.annotate(words[i], xy=(Z[i,0], Z[i,1]))

from sklearn.manifold import TSNE
tsne = TSNE()
Z = tsne.fit_transform(word_emb_skipgram[:1000])
%matplotlib notebook
plt.scatter(Z[:,0], Z[:,1])
for i in range(len(words)):
    # pass the label positionally: the s= keyword was renamed in newer Matplotlib
    plt.annotate(words[i], xy=(Z[i,0], Z[i,1]))


The GloVe algorithm takes a context-counting approach: it builds a word co-occurrence matrix and trains word vectors so that their differences predict co-occurrence ratios. Before Word2Vec, matrix factorization techniques such as Latent Semantic Analysis (LSA) were used to generate word embeddings. In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. Word vectors were generated by decomposing the term-document matrix using Singular Value Decomposition. Unlike Word2Vec, the resulting embeddings could not express word analogies as simple arithmetic operations. GloVe, on the other hand, computes the co-occurrence matrix from local context using a fixed window size (words are deemed to co-occur when they appear together within the window). GloVe then aims to predict the co-occurrence ratios using the word vectors. GloVe may produce better embeddings faster than word2vec because it uses both global co-occurrence statistics and local context.

Co-occurrence Matrix Creation using Probability Ratios

# co-occurrence matrix
X = np.zeros((V, V))
N = len(tokenized_reviews_idx)
for s in tokenized_reviews_idx:
    for i in range(len(s)):
        wi=s[i] # select current word
        start= max(0,i-context_size) # define start index
        #end = min(3,i+context_size) # define end index of the context
        if i - context_size < 0:
            points = 1.0/(i+1) # calculate context distances 

        for j in range(start,i):
                wj = s[j]
                points = 1.0 / (i - j) # this is +ve
                X[wi,wj] += points
                X[wj,wi] += points
# initialize weight matrix

Taking the log of a probability ratio converts the ratio into a difference between log-probabilities.
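The training loop later in this section uses two quantities, fX and logX, whose definitions do not appear in the listing. A plausible reading, offered here only as an assumption, is that logX holds log(X + 1) and fX holds the weighting function from the GloVe paper, f(x) = (x / x_max)^alpha capped at 1, with the paper's defaults x_max = 100 and alpha = 0.75. The per-cell sketch below is pure Python and the function names are hypothetical.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows as (x / x_max) ** alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def weighted_squared_error(count, prediction):
    """f(X_ij) * (log(X_ij + 1) - prediction) ** 2 for a single matrix cell."""
    target = math.log(count + 1.0)  # the +1 keeps the log finite (assumption)
    delta = target - prediction
    return glove_weight(count) * delta * delta

for count in (0, 1, 10, 100, 500):
    print(count, round(glove_weight(count), 4))
```

Cells with a zero count get zero weight, so they contribute nothing to the loss, while very frequent pairs are capped so that they cannot dominate training.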

# target
import tensorflow as tf
# Define the loss
def get_loss(model, inputs, targets):
    predictions = model(inputs)
    delta = targets - predictions
    return tf.reduce_sum(inputs * delta * delta)

# Gradient function
def get_grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        # calculate the loss
        loss_value = get_loss(model, inputs, targets)
        # return gradient
        return tape.gradient(loss_value, model.params)
class Glove(tf.keras.Model):
    def __init__(self, num_dims, vocab_size, mu):
        super(Glove, self).__init__()
        # initialize weights
        W = np.random.randn(vocab_size, num_dims) / np.sqrt(vocab_size + num_dims)
        b = np.zeros(vocab_size)
        U = np.random.randn(vocab_size, num_dims) / np.sqrt(vocab_size + num_dims)
        c = np.zeros(vocab_size)
        # global mean of the log co-occurrence targets, used as a constant offset
        self.mu = float(mu)
        # wrap the parameters as trainable variables
        self.W = tf.Variable(W.astype(np.float32))
        self.b = tf.Variable(b.reshape(vocab_size, 1).astype(np.float32))
        self.U = tf.Variable(U.astype(np.float32))
        self.c = tf.Variable(c.reshape(1, vocab_size).astype(np.float32))
        self.params = [self.W, self.b,self.U,self.c]

    def call(self,inputs):
        # predicted log co-occurrence for every (word, context) pair
        return tf.matmul(self.W, tf.transpose(self.U)) + self.b + self.c + self.mu

Training GLoVE

mu = logX.mean()

# Store the losses here
losses = []

# Create an optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0001)

# Run the training loop
for i in range(200):
    # Get gradients
    grads = get_grad(glove_model, fX, logX)

    # Do one step of gradient descent: param <- param - learning_rate * grad
    optimizer.apply_gradients(zip(grads, glove_model.params))

    # Store the loss
    loss = get_loss(glove_model, fX, logX)
    losses.append(loss.numpy())
    print(i," ",loss)
Visualizing GLoVe

tsne = TSNE()
Z = tsne.fit_transform(We_avg[:1000])
%matplotlib notebook
plt.scatter(Z[:,0], Z[:,1])
for i in range(len(words)):
    # pass the label positionally: the s= keyword was renamed in newer Matplotlib
    plt.annotate(words[i], xy=(Z[i,0], Z[i,1]))

FastText with Gensim

FastText splits words into character n-grams. Whereas other popular models learn word representations by assigning a distinct vector to each word, FastText builds on the skip-gram model and represents each word as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these vectors. This approach improves on word2vec and GloVe in two ways:

  • The ability to infer vectors for out-of-vocabulary words. For example, ‘England’ is related to ‘Netherlands’ because both contain ‘land’, which contributes the shared n-grams ‘lan’ and ‘and’.
  • The robustness to spelling mistakes and typos.
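The shared n-grams in the England and Netherlands example can be checked with a short sketch. It assumes trigrams only and the '<' and '>' word-boundary markers described in the FastText paper. The real library uses a range of n-gram lengths and hashes them into buckets, which is not modelled here, and the helper name char_ngrams is hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word.lower() + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

england = set(char_ngrams("England"))
netherlands = set(char_ngrams("Netherlands"))
shared = england & netherlands

print(char_ngrams("England"))
print(sorted(shared))
```

Because a word vector is the sum of its n-gram vectors, the overlap in 'lan' and 'and' pulls the two words closer together, and the same mechanism produces a vector for a word never seen during training.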
from gensim.models import FastText
# gensim >= 4.0 parameter names; gensim 3.x used size= and iter= instead
model_ft = FastText(sentences=reviews, vector_size=20, window=5, min_count=1, epochs=10, sorted_vocab=1)
Pretrained Embeddings

Loading Pretrained GloVe

import numpy as np
def loadGloveModel(File):
    print("Loading Glove Model")
    gloveModel = {}
    # the context manager closes the file even if parsing fails
    with open(File,'r',encoding='utf8') as f:
        for line in f:
            splitLines = line.split()
            if len(splitLines)>1:
                # first field is the word, the rest are its vector components
                word = splitLines[0]
                wordEmbedding = np.array([float(value) for value in splitLines[1:]])
                gloveModel[word] = wordEmbedding
    print(len(gloveModel)," words loaded!")
    return gloveModel
filename = 'glove.6B/glove.6B.50d.txt'
glove_pretrained_embeddings = loadGloveModel(filename)
Loading Glove Model
400000  words loaded!

Spacy Word Vectors

Evaluating Embeddings

  • Finding Similar Words using Word Vectors
    • Cosine Similarity
  • Satisfying Word Analogies
    • King - Man + Woman = Queen
import numpy as np
from numpy import dot
from numpy.linalg import norm

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        # a zero vector has no direction, so report no similarity
        return 0.0

Word Analogy Example:

Similarity between Dog and Puppy should be higher than between Trousers and Octopus.

If we compare King - Man + Woman, we should get high similarity to Queen

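# Assumption: v1 and v2 are not defined in this listing; they are taken here
# to be King - Man + Woman and Queen, as described above
v1 = glove_pretrained_embeddings['king'] - glove_pretrained_embeddings['man'] + glove_pretrained_embeddings['woman']
v2 = glove_pretrained_embeddings['queen']
cosine(v1, v2)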
cosine(glove_pretrained_embeddings['dog'], glove_pretrained_embeddings['puppy']) > cosine(
    glove_pretrained_embeddings['trousers'], glove_pretrained_embeddings['octopus'])
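The analogy test can also be phrased as a nearest-neighbour search: compute King - Man + Woman and look for the closest remaining word. The sketch below runs without downloading any embeddings because it uses hand-made two-dimensional vectors (roughly "royalty" and "gender"). Both the vectors and the analogy helper are hypothetical and exist only to show the mechanics.

```python
import math

def cosine(v1, v2):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)

# Hand-made toy vectors: (royalty, gender). Not real embeddings.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy))
```

With real pretrained vectors the same search runs over the whole vocabulary, and excluding the three query words matters because the result vector usually stays close to the original word.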

Applications of Word Embeddings

Word embeddings have found use across the complete spectrum of NLP tasks. Word Embeddings can help improve:

  • Text Classification tasks
  • Quality of language translations, by aligning single-language word embeddings using a transformation matrix.
  • Document search and information retrieval applications, where search strings no longer require exact keyword searches and can be insensitive to spelling.

More to explore

  • Doc2Vec
  • Combining Word Embeddings with TFIDF
  • Transformer Models
    • BERT
    • RoBERTa
    • DistilBERT
    • OpenAI GPT (1 & 2)