
REINFORCEMENT LEARNING- PART 2

For an introduction to Reinforcement Learning, its basic terminology, concepts, and types, read Reinforcement Learning - Part 1 by following this link: http://blog.cerelabs.com/2017/04/reinforcement-learning-part-1.html
 

Q-Learning

Q-learning is an algorithm in reinforcement learning. It grows out of the value-function formulation used in model-based reinforcement learning, but, as we will see, the algorithm itself learns from observed transitions rather than from the model. It uses a different kind of value function: the values are called Q-values and are denoted by Q(s,a). Q(s,a) signifies the value of being in a state 's' and taking an action 'a'.

Mathematically,

                      Q(s,a) = R(s) + γ Σ_{s'} P(s,a,s') max_{a'} Q(s',a')


Q(s,a) can be defined as the value of arriving in state 's', taking action 'a', and proceeding optimally thereafter.

Also,
                           V(s) = max_a Q(s,a)
                           π(s) = argmax_a Q(s,a)

V(s) is a value, i.e. it returns a number (a scalar in particular), whereas π(s) returns an action. Hence, it is written with argmax.

V(s) = the reward for being in state 's' and behaving optimally thereafter.
Q(s,a) = the reward for being in state 's', then taking the specific action 'a', and behaving optimally thereafter.

We can convert Q into V if we always pick the best action.

Therefore,
                           V(s) = max_a Q(s,a)

π(s) is the policy we want to follow: the action that maximizes the value from state 's'. The difference is that it returns an action, not a value, so it is written with argmax.

                           π(s) = argmax_a Q(s,a)
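
As a concrete illustration, here is a minimal sketch (the Q-table values and the NumPy layout are hypothetical, chosen only for demonstration) of how V(s) and π(s) are read off a table of Q-values:

```python
import numpy as np

# Hypothetical Q-table: one row per state, one column per action.
Q = np.array([
    [1.0, 0.5, 0.2],   # Q(s0, a0), Q(s0, a1), Q(s0, a2)
    [0.3, 2.0, 0.1],   # Q(s1, ·)
    [0.0, 0.4, 1.5],   # Q(s2, ·)
])

V = Q.max(axis=1)       # V(s)  = max_a    Q(s,a)  -> one scalar per state
pi = Q.argmax(axis=1)   # π(s) = argmax_a Q(s,a)  -> one action index per state

print("V:", V)    # the maximum Q-value in each row
print("pi:", pi)  # the index of that maximum, i.e. the greedy action
```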

To find Q in Q-learning

To solve the equation for Q(s,a) we would need the P(s,a,s') values. But in the learning scenario we do not have the model; we only have the transitions we observe.

Estimating Q from transitions,
                 Q(s,a) = R(s) + γ Σ_{s'} P(s,a,s') max_{a'} Q(s',a')

A transition means: we observe that we are in a particular state s of the MDP, an action 'a' is chosen somehow, the transition happens, we receive a reward r, and we find out which state s' we land in. This gives us a tuple <s, a, r, s'>.

Now, imagine we have an estimate of the Q function, Q^(s,a), and we are going to update it using a learning rate alpha:
          Q^(s,a) <--α-- R(s) + γ max_{a'} Q^(s',a')
The term { R(s) + γ max_{a'} Q^(s',a') } is the utility of the state.

Where,
α = learning rate
R(s) = Immediate reward
γ max_{a'} Q^(s',a') = discounted estimated value of the next state (utility of the next state)

With each iteration, Q^(s,a) moves towards the sum of the immediate reward and the discounted estimated value of the next state. This sum is also called the utility of the state.

e.g.
V <--α-- X means moving V an alpha fraction of the way towards X with each iteration.

                 V ← (1 - α) V + α X
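
A minimal sketch of this update rule (the values of α and X below are purely illustrative) shows V moving towards X with each iteration:

```python
alpha = 0.1   # learning rate (illustrative value)
X = 10.0      # fixed target (illustrative value)
V = 0.0       # initial estimate

for step in range(100):
    V = (1 - alpha) * V + alpha * X   # move V an alpha fraction towards X

print(V)  # very close to 10.0 after enough iterations
```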

As we know s, a, s', and r from the observed transition,
                 Q^(s,a) <--α_t-- R(s) + γ max_{a'} Q^(s',a')

Where,
α_t = learning rate, which decays over time.

In expectation, the update target is
                 Q^(s,a) = E { R(s) + γ max_{a'} Q^(s',a') }
                         = R(s) + γ E_{s'} [ max_{a'} Q^(s',a') ]
                         = R(s) + γ Σ_{s'} P(s,a,s') max_{a'} Q^(s',a')


The value of Q^(s,a) approaches the value of the 'utility of state' term. But that term itself changes as Q^ changes, so it is a moving target. There is a theorem which guarantees that this update rule solves the Markov Decision Process; it is called the Q-learning convergence theorem.
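
Putting the update rule into code, here is a minimal sketch, assuming a tabular Q^ stored as a NumPy array indexed by integer states and actions (the function name q_update and its arguments are illustrative, not from the original post):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update from a single transition <s, a, r, s_next>."""
    target = r + gamma * Q[s_next].max()               # R(s) + γ max_a' Q^(s',a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # move Q^(s,a) towards the target
    return Q

# Illustrative usage with a 3-state, 2-action Q-table initialised to zero.
Q = np.zeros((3, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
print(Q)
```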

Q-Learning Convergence
The theorem says that Q^(s,a) can start at an arbitrary value; then, with each iteration, the value of Q^(s,a) converges to Q(s,a). The speed of convergence depends on the learning rate alpha.

                    Q^(s,a) <--α_t-- R(s) + γ max_{a'} Q^(s',a')
                    Q^(s,a) ---converges to---> Q(s,a)

This is the actual solution to Bellman's equation. But it holds only if we visit every (s,a) pair infinitely often and the learning rate α_t decays appropriately (the sum of the α_t diverges while the sum of their squares stays finite).
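
For illustration, a schedule such as α_t = 1/t satisfies both conditions: its partial sums grow without bound, while the partial sums of its squares stay bounded. A minimal sketch (the schedule and the number of terms are assumptions chosen for demonstration):

```python
# Illustrative decay schedule: alpha_t = 1 / t for t = 1, 2, 3, ...
def alpha_schedule(t):
    return 1.0 / t

sum_alpha = sum(alpha_schedule(t) for t in range(1, 100_001))
sum_alpha_sq = sum(alpha_schedule(t) ** 2 for t in range(1, 100_001))
print(sum_alpha)     # keeps growing as more terms are added (about 12.1 here)
print(sum_alpha_sq)  # stays bounded, approaching roughly 1.645
```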

Choosing Actions

Q-learning is a family of algorithms. In order to solve the above problem we need to decide three things:
  • How to initialize Q^?
  • How to decay α ?
  • How to choose actions?

There are two options for choosing actions:
  1. Always choose the same action a0.
  2. Choose actions randomly; by trying all actions we eventually learn which 'a' gives the best Q^(s,a).

With the second option (choosing randomly) we might learn which action gives the best Q^(s,a), but we never use that knowledge. So, the next step is to actually use what we have learned.

Now we initialize,
                Q^(s, a0) = awesome;     for all states
                                  and
                Q^(s, a)  = terrible;    for all actions except a0

So, under greedy selection the agent always picks a0, because its estimated value looks awesome while every other action looks terrible. Hence, it continues to take action a0 forever. This is known as the greedy action selection strategy, and this is how it can get stuck.

So, we have to be careful where we start, and one remedy is to take random restarts. But that process can be very slow. Instead of random restarts, we can simply take a random action once in a while: in each state, find the estimated value of each action and take the best one with probability (1 - ε).

Where,
ε = randomness factor
                 π^(s) = argmax_a Q^(s,a)      with probability (1 - ε)
                                 and
                 a random action               with probability ε

Assuming ε is small.
Hence, we visit every (s,a) pair an infinite number of times as long as ε > 0.
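
A minimal sketch of ε-greedy action selection (the function name epsilon_greedy, the zero-initialised Q-table, and the value of ε are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """Pick argmax_a Q^(s,a) with probability 1 - epsilon, a random action otherwise."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action
    return int(Q[s].argmax())                 # exploit: greedy action

Q = np.zeros((3, 2))
print(epsilon_greedy(Q, s=0, epsilon=0.1))
```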


ε-Greedy Exploration:
If the action selection is GLIE (Greedy in the Limit with Infinite Exploration), we start off choosing actions mostly at random (ε large) and then become less and less random and more and more greedy (ε decays towards 0). We get two facts from this:

                  Q^ -----> Q        <==== Learn
                  π^ -----> π        <==== Use
This is an example of the Exploration-Exploitation dilemma.
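
Tying the pieces together, here is a minimal sketch of a complete Q-learning loop with a GLIE-style decaying ε on a tiny, hypothetical chain MDP (the environment, the ε and α schedules, and the episode count are all assumptions chosen only to make the example runnable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain MDP: states 0..3, actions 0 = left, 1 = right.
# Reaching the terminal state 3 earns a reward of 1; every other step earns 0.
N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

def step(s, a):
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.9

for episode in range(500):
    epsilon = 1.0 / (1.0 + episode)   # GLIE-style decay: greedy in the limit, never zero
    s = 0
    while s != TERMINAL:
        # ε-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q-learning update towards R + γ max_a' Q^(s',a')
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
        s = s_next

print(np.round(Q, 2))       # learned Q-values
print(Q.argmax(axis=1))     # greedy policy: prefers "right" in the non-terminal states
```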

Summary
This part, therefore, gives a brief idea of the Q-learning algorithm, Q-values, and their mathematical representation. Until now, the reinforcement learning robot has observed its present state and estimated the Q-values for its possible actions and future states. Using the Q-learning convergence theorem we know these estimates approach the true Q-values, and using ε-greedy exploration the robot takes proper actions. This part therefore covers how the robot works conceptually. The next and final part will show how the robot interacts with the real world.


By Sanjuksha Nirgude,  
Research Project Fellow,
Cere Labs Pvt. Ltd


