0:00
Hey everyone, my name is Aastha Chauhan. Welcome to Tutorials Point. In the previous video
0:05
we learned about reinforcement learning, and in this particular video we are going
0:10
to learn about Q-learning. So let's see what's in it for you in this video. We are going to
0:15
look at what Q-learning is. After that, we will see some important terms related to Q-learning,
0:22
then the Bellman equation, then the steps in Q-learning, and at the end
0:28
we are going to understand the whole process using an example. So let's see what Q-learning is.
0:34
Q-learning is a model-free off-policy reinforcement learning algorithm that finds the next best
0:42
action given the current state. While learning, it may choose actions at random to explore, and it aims to maximize the
0:49
reward. Here you can see this standard reinforcement learning graph. If you have watched our previous
0:55
video, the one on reinforcement learning, you must be familiar with this. Here you
1:01
have an agent; the agent takes some action, the action affects the environment, and then the
1:08
environment sends back feedback, or the reward, and the state is the new state where the
1:15
agent lies, maybe a square on a chessboard or some position in a game. Here the term off-policy
1:22
refers to methods where the learning agent can learn from actions that are outside
1:28
the current policy. This means that the agent can learn about the optimal policy regardless
1:35
of the actions taken by the behavior policy. So now let's consider an ad recommendation system
1:42
Usually when you look up a product online, you get ads which will suggest the same product over
1:48
and over again. Using Q-learning, we can make an ad recommendation system which will suggest
1:55
products related to our previous purchases. The reward will be given if the user clicks on the suggested
2:02
product. And again, you can see here that you might have lots of products on your web advertisement
2:09
or on your page, but the set of possible ads is still not continuous. It is still a discrete, fixed set, and that is
2:17
something to be aware of when you are using Q-learning. And you can see here that if you
2:24
have hundreds of people clicking on ads, and one person clicked one of the ads, the system can go in and
2:31
say, okay, this person clicked on this particular ad, so what is the best set of ads to show, based on the
2:39
click on this ad and afterwards based on where they are browsing. So let's go ahead and look at
2:46
some important terms when we are talking about Q-learning. We have state. The state represents
2:54
the current position of an agent in an environment. Action. The action is the step taken by the agent
3:01
when it is in a particular state. Rewards. For every action, the agent will get a positive or
3:08
negative reward. Episodes. An episode ends when an agent reaches a terminating state and can't take a new action.
3:16
For example, in a game a character might die, and that completes one whole episode. Okay, next is Q-values. Q-values
3:26
are used to determine how good an action A taken at a particular state S is, and this is represented as
3:34
Q(S, A). Next is temporal difference: a formula used to find the Q-value by using the values of the
3:43
current state and action and of the previous state and action.
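As a quick aside on this notation, a Q-table can be stored as a simple two-dimensional array indexed by state and action. This is only an illustration; the sizes below are made up and are not from the video.

```python
import numpy as np

# Hypothetical sizes, purely for illustration: 5 states and 3 actions.
n_states, n_actions = 5, 3

# One Q-value per (state, action) pair, all zero before any learning.
Q = np.zeros((n_states, n_actions))

# Q(S, A) is then just a table lookup, e.g. the value of action 2 in state 1:
print(Q[1, 2])   # prints 0.0
```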
3:48
Now let's look at the Bellman equation. The Bellman equation is used to determine the value of a particular state and deduce how good
3:56
it is to be in that state. The optimal state will give us the highest optimal value. There are three
4:04
parts in the Bellman equation. The first one is the maximum function, or max function, which selects the action
4:11
that maximizes the reward, max over actions a. Okay, next is the discount factor, which is represented by gamma, and
4:19
its value lies between 0 and 1, representing the importance of the future reward. Next is the
4:27
function that computes the reward based on the selected action and the current state, and that
4:33
is represented as R(S, A). So in the equation, the maximum function is here, R(S, A) is our reward, the
4:41
instantaneous reward we are getting, gamma is our discount factor, and Q(S', A')
4:49
represents the Q-value for the next state. So the whole Bellman equation looks like this:
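Putting the three parts together, with s the current state, a the chosen action, s' the next state, and a' ranging over the actions available in s', the equation reads:

```latex
Q(s, a) = R(s, a) + \gamma \cdot \max_{a'} Q(s', a')
```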
4:58
Now let's see what steps we take to prepare a Q-value table. Here we are taking the
5:05
example of a dog that can fetch, run, or sit. Okay, so the first step is to create an initial Q-table
5:13
in which we are going to initialize all the values to zero. Here we have some actions like fetching,
5:20
sitting, and running, and some states like start, idle, wrong action, correct action, and end. Okay, so the next step
5:30
is to choose an action, perform it, and update the values in the table. Here we have to note that
5:37
every action has some attached reward. So using that reward we are going to calculate the Q-value
5:45
So the next step, obviously, is to get the value of the reward and calculate the Q-value using the
5:53
Bellman equation, update that value in the Q-value table, and continue doing the same until the table is filled or an episode ends.
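Here is a rough sketch of these steps in Python, just an illustration and not the video's own code. It assumes a reward matrix R whose entry R[s, a] is the instant reward for taking action a in state s, with -1 marking actions that are not available, and it assumes that taking action a moves the agent into the state with the same index, as in the rooms example later in this video. The update is the simplified rule Q(s, a) = R(s, a) + gamma * max Q(s', a') from the Bellman equation above.

```python
import numpy as np

def train_q_table(R, goal_state, gamma=0.8, episodes=1000, seed=0):
    """Fill a Q-table by playing random episodes and applying the simplified update."""
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    Q = np.zeros_like(R, dtype=float)             # step 1: initial Q-table, all zeros
    for _ in range(episodes):
        state = rng.integers(n_states)            # each episode starts in a random state
        while state != goal_state:                # the episode ends at the terminal state
            valid = np.flatnonzero(R[state] >= 0) # actions for which a link exists
            action = rng.choice(valid)            # step 2: choose an action and perform it
            next_state = action                   # taking action a moves us into state a
            # step 3: reward plus the discounted best Q-value of the next state
            Q[state, action] = R[state, action] + gamma * Q[next_state].max()
            state = next_state
    return Q
```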
6:01
Now here we have the final Q-value table, and we will
6:08
use this Q-value table to trace the optimal path. Now let's understand the whole
6:16
process better with an example, but before that I have a quiz for you all, and
6:22
the question is: which of the following is a key feature of Q-learning? The options are: it
6:29
requires a model of the environment; next, it is a type of supervised learning; the third option is
6:36
it is an off-policy algorithm; and the final option is it cannot handle continuous state spaces.
6:43
If you know the answer, write it down in the comment section. So now let's move to the example through
6:49
which we are going to understand the whole process better. Suppose we have five rooms in a building
6:55
connected by doors as shown in this figure. We will number each room from 0 to 4 and the outside
7:03
of the building can be thought of as one big room, and we will number that as 5. Notice that the
7:12
doors from rooms 1 and 4 lead outside the building, that is, to room 5. So here what we are going to do is we are going
7:20
to consider these doors as actions and these rooms as states. We can represent the rooms on a graph:
7:30
each room as a node, or state, and each door as a link, or action. So this particular map can be
7:39
represented as this graph, and notice that here we have room 5 as the goal state, which means we have
7:48
to get into room 5. So the goal room is 5, and the doors that lead immediately to the goal have an
7:55
instant reward of 100. Other doors not directly connected to the target have zero reward. Each
8:02
arrow carries an instant reward value, as shown in this graph. We can put the state diagram and
8:10
the instant reward values into the following matrix, R. The minus ones in the table represent
8:19
null values, that is, where there isn't a link between the nodes, which means that we can't take any
8:27
action between those rooms. For example, here 0 and 1 have a minus 1 value, which means we can't
8:36
move directly from room 0 to room 1. Okay so all the minus 1 values in this matrix represent
8:45
null values. Zeros mark doors that exist but are not directly connected to the goal state, that is 5,
8:52
and 100 marks a door that is directly connected to the goal state 5. Now before starting,
8:59
we need some initial values. So here we have the discount factor equal to 0.8, and the initial
9:07
state, let us say, is room 1, and we are going to initialize matrix Q as a zero matrix. Okay, so
9:17
initially we have all the Q-values initialized to 0, as represented here in this matrix Q.
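To make this setup concrete, here is one way it could look in Python. The entries of R follow the room connections described above; the row for room 0 (a door only to room 4) is read off the figure, so treat the exact matrix as an assumption to check against the graph in the video.

```python
import numpy as np

# Reward matrix R: rows are states (rooms 0-5), columns are actions (the room moved to).
# -1 = no door, 0 = a door that does not lead directly to the goal, 100 = a door into room 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # room 0 connects to room 4
    [-1, -1, -1,  0, -1, 100],   # room 1 connects to rooms 3 and 5
    [-1, -1, -1,  0, -1,  -1],   # room 2 connects to room 3
    [-1,  0,  0, -1,  0,  -1],   # room 3 connects to rooms 1, 2 and 4
    [ 0, -1, -1,  0, -1, 100],   # room 4 connects to rooms 0, 3 and 5
    [-1,  0, -1, -1,  0, 100],   # room 5 (outside) connects to rooms 1, 4 and itself
])

gamma = 0.8                      # discount factor
Q = np.zeros((6, 6))             # matrix Q, all values initialized to zero
```

Passing this R into the training sketch shown earlier (train_q_table(R, goal_state=5)) should reproduce the kind of converged values discussed below.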
9:25
So now look at the second row, that is row 1, of the matrix R. There are two possible actions here
9:34
that we can take: the first is 1 to 3, and the second is 1 to 5. So let us consider that we are
9:42
moving from state 1 to 5, which means we are taking action 5. Now let's imagine what would happen
9:50
if our agent is in state 5, that is, the next state. If our agent is in state 5, it can move
9:59
to 1, 4, or 5, which means we have three possible actions here. So using the Q-learning update rule
10:08
derived from the Bellman equation, we have Q(state, action) equals R(state,
10:14
action) plus gamma, our discount factor, times the maximum of Q(next state, all possible
10:23
actions). So we get Q(1, 5), that is, the Q-value for the current state and action we are
10:31
calculating: R(1, 5), the reward for the current state and action, plus 0.8, our discount factor,
10:40
multiplied by the maximum of the Q-values for the next state over all possible actions. So here
10:49
we have the reward R(1, 5) equal to 100. So we will put 100 here, plus 0.8 multiplied by the maximum of
10:59
all the Q-values; here all of them are 0. So the maximum is 0, and 0 multiplied by 0.8 is 0.
11:07
So we get Q(1, 5) = 100, and we will put this value in the Q-value table.
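Written out in one line, using the fact that from state 5 the possible actions are 1, 4, and 5 and that all Q-values are still zero:

```latex
Q(1, 5) = R(1, 5) + 0.8 \cdot \max\{Q(5, 1),\, Q(5, 4),\, Q(5, 5)\} = 100 + 0.8 \cdot 0 = 100
```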
11:17
Now the next state, 5, becomes the current state, and because 5 is the goal state, we have
11:24
finished one episode. Our agent's brain now contains an updated matrix Q, like this. For the next episode,
11:34
we randomly choose the initial state, let's say 3. From 3 we can go to 1, 2, or 4. Now we are
11:45
assuming that we move from state 3 to state 1. So now we imagine that we are in state 1.
11:54
So from state 1 we have possible actions 3 and 5. That means we can move from state 1 to state 3
12:03
or to state 5. Again we compute the Q-value using the same equation. We have Q(3, 1) equals
12:12
R(3, 1) plus 0.8, our discount factor, times the maximum of all the Q-values for the next
12:19
state over all the actions. For R(3, 1) we have 0, so we will put 0 here, plus 0.8
12:29
multiplied by the maximum of the Q-values. We have 0 and 100, and the maximum of 0 and 100 is 100. So 0 plus 0.8
12:40
multiplied by 100 comes out to 80. So we will update this Q-value, Q(3, 1), in the Q-value table.
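Again in one line, using the fact that from state 1 the possible actions are 3 and 5, with Q-values 0 and 100:

```latex
Q(3, 1) = R(3, 1) + 0.8 \cdot \max\{Q(1, 3),\, Q(1, 5)\} = 0 + 0.8 \cdot 100 = 80
```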
12:50
The table then turns out like this. If our agent learns more through further episodes, it will finally
12:59
reach the convergence values in matrix Q, like this. Finally, using this Q-table, we can trace the
13:07
optimal path, and tracing the best sequence of states is as simple as following
13:14
the links with the highest Q-values at each state. So let us consider that initially we are at state 2.
13:21
Let's see what the best Q-value for state 2 is. For state 2, the best value is 64, for
13:30
action 3. So from 2 we are going to move to 3. Now let's see what the best
13:39
Q-value for state 3 is; for 3 we have two best Q-values, that is, 80 for action 1 and also
13:48
80 for action 4. So from 3 we can move to 1 or to 4. For state 1, the best value is 100. That means we
13:57
can move from 1 to 5, which means that we have reached our goal state, and one episode ends here;
14:07
and for state 4, the best value is for action 5, so we can move from 4 to 5. Again we have reached our goal state.
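The tracing procedure just described can also be sketched as a small Python function. It assumes Q is the converged Q-matrix as a NumPy array (for example, the output of the earlier training sketch) and simply follows the highest Q-value from each state, resolving ties by the lowest action index.

```python
import numpy as np

def trace_path(Q, start, goal, max_steps=10):
    """Follow the action with the highest Q-value at each state until the goal is reached."""
    path = [start]
    state = start
    for _ in range(max_steps):              # guard against cycling in an unconverged table
        if state == goal:
            break
        action = int(np.argmax(Q[state]))   # best action; ties resolved by the lowest index
        path.append(action)                 # in the rooms example, the action is the next room
        state = action
    return path

# For example, starting in room 2 with a converged table like the one in the video:
# trace_path(Q, start=2, goal=5)  ->  [2, 3, 1, 5]   (2, 3, 4, 5 is equally good, since state 3 ties)
```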
14:15
I hope this video helped you to understand the Q-learning algorithm, and this is
14:22
not all. After this, we are going to cover Deep Q-Networks and policy gradients, and one more thing:
14:29
before this video, we have already covered supervised machine learning algorithms
14:34
and unsupervised machine learning algorithms. So stay tuned for further informative videos
14:41
Thanks for watching and have a nice day