Information Extraction - The key to Question Answering Systems

The day AI can read a document, answer every question asked about it, and reason over it will be the day we call it truly intelligent. Welcome to the world of Information Extraction, where algorithms turn unstructured documents into structured information that an AI can then use to answer questions. A task that comes easily to humans turns out to be hard for AI.

The difficulty lies in recognizing named entities, identifying context, extracting relationships, understanding tables and diagrams, and much more. Research in Information Extraction has progressed rapidly since the problem was identified, and today we have many open source tools at our disposal.

Any toolkit for Information Extraction is expected to contain the following modules:
  • Tokenizer - Converts a sequence of characters into a sequence of tokens
  • Gazetteers - Entity dictionaries used as lookup tables
  • Sentence splitter - Determines where a sentence begins and ends
  • Part of Speech (POS) tagger - Identifies and tags the part of speech of each word in a text
  • Named Entity Recognition - Identifies named entities in a text
  • Coreference Resolution - Identifies multiple expressions that refer to the same entity
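To make the first few modules concrete, here is a minimal sketch in plain Python of a sentence splitter, a tokenizer, and a gazetteer lookup. It is not taken from any of the toolkits discussed in this post. The regular expressions are deliberately naive, and the `GAZETTEER` entries and the function names are made up for illustration. Real toolkits handle abbreviations, multi-word entities, and ambiguity far more carefully.

```python
import re

# Hypothetical toy gazetteer: entity surface form -> entity type.
GAZETTEER = {
    "IBM": "ORGANIZATION",
    "Watson": "SYSTEM",
    "Mumbai": "LOCATION",
}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lookup_entities(tokens):
    """Gazetteer lookup: return (token, type) for every token found in the dictionary."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

text = "IBM built Watson. It won the Jeopardy challenge!"
for sentence in split_sentences(text):
    tokens = tokenize(sentence)
    print(tokens, lookup_entities(tokens))
```

A pure dictionary lookup like this only finds entities it already knows, which is why toolkits usually combine gazetteers with a statistical or rule-based Named Entity Recognition module.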

The following list shows some of the popular open source Information Extraction toolkits that contain most of the above modules:

  • General Architecture for Text Engineering (GATE):
    GATE is a suite of Java tools for Natural Language Processing tasks. It includes ANNIE (A Nearly-New Information Extraction System), which contains all the basic modules for information extraction.
    Check out GATE at https://gate.ac.uk/

  • Unstructured Information Management Architecture (UIMA):
    Unstructured Information Management applications are software systems that analyze large volumes of unstructured information to discover relevant knowledge. IBM’s Watson, the Artificial Intelligence system that won the Jeopardy challenge, uses UIMA.
    Learn more about UIMA at https://uima.apache.org/

  • OpenNLP:
    The Apache OpenNLP library is a toolkit for processing natural language text, and it supports all the modules required for Information Extraction. OpenNLP also offers maximum entropy and perceptron-based machine learning.

  • Natural Language Toolkit (NLTK):
    NLTK offers Python libraries for NLP that can be used to perform Information Extraction. It integrates well with WordNet (a lexical database for the English language).
    Explore NLTK at http://www.nltk.org/

With such a vast array of open source Information Extraction toolkits, you can build your own custom Information Extraction software in a few days. By common estimates, almost 80% of business information is unstructured. Solutions that structure this information by building on the toolkits mentioned above have huge value. The future lies in question-answer style querying of unstructured documents rather than keyword-based search.
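As a rough illustration of what turning unstructured text into structured information means for question answering, the sketch below pulls one kind of fact out of free text and then answers a question from the resulting structure. Everything in it is an assumption made for the example: the `BASED_IN` pattern, the function names, and the sample companies are hypothetical, and a single regular expression stands in for the relationship extraction that a real toolkit would perform.

```python
import re

# Hypothetical pattern for one relation: "<Subject> is based in <Location>".
BASED_IN = re.compile(r"([A-Z][\w ]*?) is based in ([A-Z]\w*)")

def extract_facts(text):
    """Turn matching phrases into (subject, relation, object) triples."""
    return [(m.group(1), "based_in", m.group(2)) for m in BASED_IN.finditer(text)]

def answer_where(facts, subject):
    """Answer 'Where is <subject> based?' from the structured triples."""
    for subj, relation, obj in facts:
        if relation == "based_in" and subj == subject:
            return obj
    return None  # the structured store has no answer for this subject

# Made-up sample document for illustration only.
document = "Acme Labs is based in Mumbai. Globex is based in Pune."
facts = extract_facts(document)
print(facts)
print(answer_where(facts, "Globex"))
```

A keyword search for "Globex" would only return the sentence that mentions it. Once the fact is stored as a triple, the question can be answered directly.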
