Table of Contents


Introduction

Ever wondered how state-of-the-art Deep Learning models for Natural Language Understanding (NLU) became possible? A large part of the answer is the concept of Word Embeddings. Natural language for humans consists of characters and words, and we use words as the atomic symbols of the language. To use them for deep learning, we need to convert them into numbers, because numbers are the natural language of Neural Networks. In this blog we will see how the techniques for doing so have evolved and made NLU possible.

The most basic way of representing words is to simply assign a different number to each word. But this imposes an artificial ordering and magnitude on the words. For example, suppose 'apple' is assigned the number 15 and 'orange' the number 200; the model would then treat 'orange' as somehow larger or more important than 'apple', even though the numbers are arbitrary labels.

To overcome this, we could convert the words into One-Hot Vectors. But the problem with one-hot vectors is that they are very sparse and every vector is a completely independent entity, so no notion of similarity between words is captured. They are also unable to leverage global co-occurrence information the way SVD-based methods do.
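To make the sparsity and independence problem concrete, here is a minimal sketch in Python (the tiny vocabulary and variable names are made up purely for illustration):

```python
import numpy as np

# A tiny vocabulary; in practice this would be tens of thousands of words.
vocab = ["apple", "orange", "king", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

apple = one_hot("apple")
orange = one_hot("orange")

print(apple)                  # [1. 0. 0. 0.]
print(orange)                 # [0. 1. 0. 0.]
# Every pair of distinct words is orthogonal, so the dot product gives
# no notion of similarity between related words like apple and orange.
print(np.dot(apple, orange))  # 0.0
```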

[ Back To Top ]


Word2Vec

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al., January 2013, Google.

The above-mentioned methods were not able to capture relationships between words, and they were also sparse. To overcome these problems, different embedding strategies were introduced; one such strategy is known as Word2Vec.

Word2Vec uses a shallow neural network to learn word embeddings, i.e. dense vector representations of words. The network is trained with a window-based strategy, and two model architectures implement it: Continuous-Bag-Of-Words (CBOW) and Skip-Gram.

The CBOW model predicts the target word W given a window of surrounding context words, whereas the Skip-Gram model is the opposite: it uses a single target word to predict the C context words around it. The sketch below shows how a window over a sentence yields the training pairs for each model.
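A minimal sketch (the toy sentence and helper name are illustrative) of how the window-based strategy turns text into training pairs:

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2  # number of context words taken on each side of the target

def training_pairs(tokens, window):
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: predict the target from its context words.
        # Skip-Gram: predict each context word from the target.
        yield context, target

for context, target in training_pairs(sentence, window):
    print(f"CBOW:      {context} -> {target}")
    print(f"Skip-Gram: {target} -> {context}")
```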

Illustration of CBOW and Skip-Gram model. Image taken from here.

So, in this process the hidden layer learns the N-dimensional vector representations of the words, which are exactly the embeddings. To make training computationally efficient, two training methodologies were used, viz. Negative Sampling and Hierarchical Softmax.
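As a rough sketch of how these pieces fit together in practice, here is how a Word2Vec model could be trained with the gensim library (assuming gensim 4.x parameter names; the toy corpus is made up for illustration):

```python
from gensim.models import Word2Vec

# A toy corpus; real training uses millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# sg=1 selects Skip-Gram (sg=0 selects CBOW).
# negative=5 enables Negative Sampling with 5 noise words per example,
# while hs=1 (with negative=0) would switch to Hierarchical Softmax.
model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality N of the hidden layer / embeddings
    window=5,         # context window size
    min_count=1,
    sg=1,
    negative=5,
)

# The learned N-dimensional embedding for a word:
print(model.wv["king"].shape)  # (100,)
```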

The embeddings obtained from this model capture linear relationships between word pairs, which can be used to find similar words and to perform linear operations on the vectors such as $king - man + woman = queen$. The model performed well on the syntactic and semantic word-analogy pairs chosen for testing.
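The analogy arithmetic can be tried directly on pre-trained vectors. A short sketch assuming the gensim downloader and an internet connection (the "word2vec-google-news-300" vectors are a large download, roughly 1.6 GB):

```python
import gensim.downloader as api

# Load pre-trained Word2Vec vectors trained on Google News.
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Nearest neighbours in the embedding space double as a similarity lookup.
print(wv.most_similar("apple", topn=3))
```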

[ Back To Top ]


Global Vectors [GloVe]