
The Value of Data

There may be no simple answer to which came first, the chicken or the egg, but in Machine Learning the answer is easy: data comes before any function. Machine Learning is all about learning from data. The learning algorithm tries to learn a function that either classifies the data into categories or approximates the relationship that generated the data.

There are two popular ways in which a Machine Learning algorithm can be taught to learn a function, and both need data.

  • Supervised Learning: We give the algorithm a lot of data with both inputs and outputs, and it learns a function that maps one to the other. In regression problems, the learned function approximates the curve that best fits the data. In classification problems, the learned function assigns each data point to a category.

  • Unsupervised Learning: We give the algorithm a lot of input data with no outputs, and it tries to find patterns in the data. A common example is clustering, where the algorithm groups data points based on how similar they are to each other.
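To make the difference between the two settings concrete, here is a minimal sketch in plain Python. The helper names `fit_line` and `two_means` and the toy numbers are made up for illustration; real projects would normally use a library, and this is only one simple instance of each idea (least-squares regression and a two-cluster, one-dimensional k-means).

```python
# Supervised: every input x comes with an output y, so we can fit y ≈ slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Unsupervised: only inputs are given; group them around two centers (1-D k-means, k = 2).
def two_means(points, iterations=10):
    centers = [min(points), max(points)]  # simple starting guess
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Toy data generated from y = 2x + 1, so the fit recovers slope 2 and intercept 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Toy inputs with two obvious groups, one near 1 and one near 10.
centers, groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
```

In the first call the algorithm is told the right answers and learns the mapping; in the second it is told nothing and only discovers that the points fall into two groups.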


In both cases, the algorithm needs enough data to learn the function. Most Machine Learning algorithms are given three kinds of data:

  • Training Data: The algorithm uses this data to learn the function, from which it then tries to generalize.

  • Validation Data: There is a high chance that the algorithm will overfit the training data and fail on any other data. Validation data guards against this. It is held out from training, so comparing performance on the training data with performance on the validation data shows how well the algorithm does on data it has seen versus data it has not.

  • Test Data: Once the function is learnt, it is evaluated on the test data. This checks whether the function also generalizes to data that played no part in training or tuning, which gives a better indication of how it will perform on the future data it has to run inference on.
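One common way to obtain these three sets is to shuffle a single dataset and cut it into three parts. The sketch below assumes a 70/15/15 split and a hypothetical helper called `split_dataset`; the fractions are a frequent convention, not a rule, and the right proportions depend on how much data you have.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and cut them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# 100 toy samples, here just the integers 0..99.
train, val, test = split_dataset(range(100))
```

The important property is that the three sets do not overlap: the function never sees the validation or test samples while it is being learnt.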


Two interesting questions emerge, both having elegant answers.

Question. What determines how much data the algorithm needs to learn a function?
Answer. Judging this comes with experience, but as a rule, the more the merrier. Today’s Machine Learning algorithms, such as Neural Networks, are so powerful that they overfit easily when given too little data. Also, once your model is fine-tuned and no further optimization is possible, more data is usually the remaining way to improve it.
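A small experiment can illustrate the "more the merrier" point. The sketch below is a toy setup of my own, not a general proof: it assumes the true relationship is y = 2x + 1 with Gaussian noise, fits a straight line to training sets of different sizes, and measures the average error against the true line. The names `held_out_error` and `learning_curve` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def held_out_error(n_train, rng, noise=1.0):
    """Train on n_train noisy samples, then score against the assumed truth y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n_train)]
    ys = [2 * x + 1 + rng.gauss(0, noise) for x in xs]
    slope, intercept = fit_line(xs, ys)
    grid = [i / 2 for i in range(21)]  # evaluation points 0.0, 0.5, ..., 10.0
    return sum((slope * x + intercept - (2 * x + 1)) ** 2 for x in grid) / len(grid)


def learning_curve(sizes, trials=50, seed=0):
    """Average the held-out error over several random training sets per size."""
    rng = random.Random(seed)
    return {n: sum(held_out_error(n, rng) for _ in range(trials)) / trials for n in sizes}


curve = learning_curve([5, 50, 500])
for size, error in curve.items():
    print(f"{size:4d} training samples -> mean squared error {error:.4f}")
```

In this toy setting the averaged error shrinks as the training set grows. A flexible model such as a Neural Network would typically need far more data before the curve flattens out.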

Question. What if there are outliers in the data?
Answer. To get better performance, such outliers should usually be filtered from the data. Otherwise the algorithm might try to fit the outliers too and learn a distorted function. How much this matters depends on the algorithm, since some are more sensitive to outliers than others.
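As a rough illustration of outlier filtering, the sketch below drops values whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds 3.5. This is one common rule of thumb, not the only approach. The function name `filter_outliers` and the sample readings are invented for the example.

```python
import statistics


def filter_outliers(values, threshold=3.5):
    """Keep values whose modified z-score (median/MAD based) is within the threshold."""
    values = list(values)
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return values  # no spread to measure against, so keep everything
    return [v for v in values if 0.6745 * abs(v - median) / mad <= threshold]


# Toy readings clustered around 10, plus one obvious outlier.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 55.0]
cleaned = filter_outliers(readings)  # 55.0 is dropped, the other five are kept
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier cannot hide itself by inflating the spread.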

In an age where companies with more data tend to be richer than companies with less, it seems data will decide the future.

P.S. You see hardly any spam in your inbox, thanks to the billions of emails your spam filter has been trained on.
