
In the World of Document Similarity

How does a human infer whether two documents are similar? This question has puzzled cognitive scientists, and it remains an active area of research. As of now, no product matches or surpasses the human ability to judge the similarity of documents. But things are improving in this domain, and companies such as IBM and Microsoft are investing heavily in it.

We at Cere Labs, an Artificial Intelligence startup based in Mumbai, are also working in this area. We have applied the LDA and Word2Vec techniques, and both have given us promising results:

Latent Dirichlet Allocation (LDA): LDA is a technique used mainly for topic modeling. It describes each document as a mixture of topics, and you can use those topic mixtures to measure the similarity between documents. The assumption is that the more topics two documents share, the more likely they are to be semantically similar.
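To make the idea of topic overlap concrete, here is a minimal sketch in Python. It assumes a trained LDA model has already given you a topic distribution for each document. The three distributions below are made-up illustrative values, not results from our experiments, and the function names are our own. Hellinger distance is only one of several reasonable ways to compare topic distributions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    that share no probability mass.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    squared_diff = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diff) / math.sqrt(2)


def topic_similarity(p, q):
    """Turn the distance into a similarity score between 0 and 1."""
    return 1.0 - hellinger(p, q)


# Hypothetical topic distributions over three topics (illustrative only).
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]  # dominated by the same topic as doc_a
doc_c = [0.05, 0.15, 0.80]  # dominated by a different topic

sim_ab = topic_similarity(doc_a, doc_b)
sim_ac = topic_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Documents a and b put most of their weight on the same topic, so they score as more similar than a and c. In practice the distributions would come from an LDA library such as Gensim rather than being typed in by hand.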

You can study LDA in the following paper:

You can implement LDA using Gensim:

Word2Vec:

Word2Vec brings words into a vector space where words with similar meanings are embedded near each other, so similar words tend to cluster together in this high-dimensional space. The best part of Word2Vec is that this closeness reflects semantic similarity rather than mere similarity of spelling.
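One common way to go from word vectors to document similarity is to average the vectors of the words in each document and compare the averages with cosine similarity. The sketch below illustrates that idea in Python. It is an assumption for illustration, not necessarily the method we use, and the tiny three-dimensional embeddings are invented values standing in for real Word2Vec vectors, which typically have hundreds of dimensions.

```python
import math


def average_vector(tokens, embeddings):
    """Average the embeddings of the tokens that have a known vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("none of the tokens has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(tokens_a, tokens_b, embeddings):
    """Compare two tokenized documents through their averaged word vectors."""
    return cosine_similarity(
        average_vector(tokens_a, embeddings),
        average_vector(tokens_b, embeddings),
    )


# Hypothetical toy embeddings (illustrative values, not trained vectors).
toy_embeddings = {
    "cat": [0.90, 0.10, 0.00],
    "dog": [0.80, 0.20, 0.10],
    "pet": [0.85, 0.15, 0.05],
    "stock": [0.00, 0.90, 0.40],
    "market": [0.10, 0.80, 0.50],
}

sim_pets = document_similarity(["cat", "pet"], ["dog", "pet"], toy_embeddings)
sim_mixed = document_similarity(["cat", "pet"], ["stock", "market"], toy_embeddings)
print(f"pets vs pets: {sim_pets:.3f}, pets vs finance: {sim_mixed:.3f}")
```

Averaging ignores word order and treats every word as equally important, so it is a rough baseline rather than a complete solution.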

You can read the original Word2Vec paper here:

You can also check the implementation in TensorFlow at:

Both LDA and Word2Vec techniques can be combined to achieve interesting results. Keep following this space as we will report our findings in future blog posts.

When we look at the results achieved by such techniques, it feels as if the AI is thinking.

For a detailed understanding of word embeddings, please refer to the following article: An Introduction to Word Embeddings

