What Is Q-Learning? The Best Guide to Understand Q-Learning

What do you do when a dog or child misbehaves? You scold them so that they do not repeat the bad behavior. On the other hand, you reward them when they do something good, to reinforce good behavior. Believe it or not, this system of positive and negative reinforcement is also used to train machines. It is called reinforcement learning, and it can lead to solutions a programmer might not have anticipated. Q-learning is a type of reinforcement learning that is model-free.

In this article, we will talk about what Q-learning is and how to go about implementing it.


What Is Reinforcement Learning?

In machine learning, a common drawback is the vast amount of data that models need to train. The more complex a model, the more data it may require. Even after all this, the data we get may not be reliable. It may have false or missing values or may be collected from untrustworthy sources.

Reinforcement learning sidesteps much of the data-acquisition problem: instead of learning from a pre-collected dataset, the agent generates its own experience by interacting with an environment.

Reinforcement learning is a branch of machine learning that trains a model to reach an optimal solution to a problem by making decisions on its own.

It consists of:

  • An environment, which an agent interacts with in order to learn to reach a goal or perform an action.
  • A reward, given when the agent's action brings it closer to the goal or reaches it. This steers learning in the right direction.
  • A negative reward, given when the agent performs an action that does not lead to the goal, to discourage it from learning in the wrong direction.

Reinforcement learning requires a machine learning model to learn from the problem and come up with the best solution by itself. This means that it can also arrive at unique solutions which the programmer might not even have thought of.

Consider the image below. You can see a dog in a room that has to perform an action, which is fetching. The dog is the agent; the room is the environment it has to work in, and the action to be performed is fetching.


Figure 1: Agent, Action, and Environment

If the correct action is performed, we will reward the agent. If it performs the wrong action, we will not give it any reward or give it a negative reward, like a scolding.


Figure 2: Agent performing an action
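
The interaction described above can be sketched in a few lines of Python. This is only an illustration: the action names come from the dog example, while the `give_reward` helper and the reward values of +1 and -1 are assumptions made for this sketch. The agent here acts at random because it has not learned anything yet.

```python
import random

ACTIONS = ["run", "fetch", "sit"]  # hypothetical action set for the dog


def give_reward(command, action):
    """Return +1 for the commanded action and -1 (a 'scolding') otherwise."""
    return 1 if action == command else -1


random.seed(0)  # fixed seed so the run is repeatable
total_reward = 0
history = []
for step in range(5):
    action = random.choice(ACTIONS)        # the agent acts (randomly, it has not learned yet)
    reward = give_reward("fetch", action)  # the environment responds with a reward
    total_reward += reward
    history.append((action, reward))

print(history, total_reward)
```

A learning algorithm such as Q-learning replaces the random choice with a choice informed by the rewards seen so far.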

What Is Q-Learning?

Q-Learning is a reinforcement learning algorithm that finds the best next action, given the current state. While learning, it may pick some actions at random in order to explore, but its aim is to maximize the total reward.


Figure 3: Components of Q-Learning


Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action, given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to be taken.

The objective of the model is to find the best course of action given its current state. Q-learning is called off-policy because the policy it learns about (always taking the action with the highest Q-value) can differ from the policy the agent actually follows while gathering experience, which may include random, exploratory actions.

Model-free means that the agent does not build or use a model of the environment, that is, predictions of how the environment will respond to its actions. Instead, it learns directly from the rewards it receives through trial and error.

An example of Q-learning is an Advertisement recommendation system. In a normal ad recommendation system, the ads you get are based on your previous purchases or websites you may have visited. If you’ve bought a TV, you will get recommended TVs of different brands. 


Figure 4: Ad Recommendation System

Using Q-learning, we can optimize the ad recommendation system to recommend products that are frequently bought together. The agent receives a reward when the user clicks on a suggested product.


Figure 5: Ad Recommendation System with Q-Learning

Important Terms in Q-Learning

  1. States: The state, S, represents the current position of an agent in an environment.
  2. Action: The action, A, is the step taken by the agent when it is in a particular state.
  3. Rewards: For every action, the agent receives a positive or negative reward.
  4. Episodes: An episode ends when the agent reaches a terminating state and cannot take a new action.
  5. Q-Values: Used to determine how good an action, A, taken at a particular state, S, is. Written Q(S, A).
  6. Temporal Difference: A formula used to update the Q-value, based on the Q-value of the current state and action, the reward received, and the best Q-value of the next state.
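
To make the temporal difference term concrete, here is a small worked calculation. The numbers and the helper name `td_error` are purely illustrative assumptions; the formula is the usual one-step form, in which the reward plus the discounted best Q-value of the next state is compared with the current estimate.

```python
def td_error(reward, gamma, q_next_max, q_current):
    """Temporal difference: (reward + discounted best next value) - current estimate."""
    return reward + gamma * q_next_max - q_current


# Illustrative numbers, not taken from the article's figures.
reward = 1.0      # reward received for taking action A in state S
gamma = 0.5       # discount factor
q_next_max = 2.0  # largest Q-value available in the next state
q_current = 0.5   # current estimate Q(S, A)

delta = td_error(reward, gamma, q_next_max, q_current)
print(delta)  # 1.0 + 0.5 * 2.0 - 0.5 = 1.5
```

A positive value means the action turned out better than the current estimate suggested, so Q(S, A) should be raised.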

What Is The Bellman Equation?

The Bellman Equation is used to determine the value of a particular state, that is, how good it is to be in that state. The best state is the one with the highest value.

The equation is given below. It combines the reward for the current state and action with the maximum expected future reward, scaled by a discount rate that determines how much future rewards matter now, to compute the updated Q-value. The learning rate determines how quickly the model learns.

Figure 6: Bellman Equation
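
For reference, the Q-learning update is commonly written in the form below. The symbols are the standard ones and are an assumption about notation: α is the learning rate, γ is the discount rate, R(s, a) is the reward for taking action a in state s, and s' and a' are the next state and the actions available in it.

```latex
Q_{\text{new}}(s, a) = Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The bracketed term is the temporal difference: the gap between the new estimate of the action's worth and the current Q-value.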


How to Make a Q-Table?

While running our algorithm, we will come across various solutions and the agent will take multiple paths. How do we find out the best among them? This is done by tabulating our findings in a table called a Q-Table.

A Q-Table helps us to find the best action for each state in the environment. We use the Bellman Equation at each state to compute the Q-value of each action and save it in the table, so that it can be compared with the other actions.

Let us create a Q-table for an agent that has to learn to run, fetch, and sit on command. The steps taken to construct a Q-table are:

Step 1: Create an initial Q-Table with all values initialized to 0

When we start, the Q-values for all states and actions are 0. Consider the Q-Table shown below, which shows a dog simulator learning to perform actions:


Figure 7: Initial Q-Table      
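
A minimal sketch of such a table in plain Python follows. The actions are the three named in the text; the state names are hypothetical placeholders, since the exact rows of the table depend on how the simulator is set up.

```python
STATES = ["start", "idle", "end"]  # hypothetical states
ACTIONS = ["run", "fetch", "sit"]  # actions named in the text

# One row per state, one column per action, every entry starts at 0.0.
q_table = {state: {action: 0.0 for action in ACTIONS} for state in STATES}

for state, row in q_table.items():
    print(state, row)
```

A nested dictionary keeps the sketch readable; a two-dimensional array indexed by state and action numbers would work equally well.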

Step 2: Choose an action and perform it. Update values in the table

This is the starting point. We have performed no other action as of yet. Let us say that we want the agent to sit initially, which it does. The table will change to:


Figure 8: Q-Table after performing an action
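
In the walkthrough above the first action is chosen by hand. In practice the choice is usually automated, and one common option is an epsilon-greedy rule: explore with a random action some of the time, otherwise take the action with the highest Q-value. The sketch below is an assumption about how this step could be implemented, not the method used for the figure; `choose_action` and the sample Q-values are hypothetical.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))  # explore: any action
    best = max(q_row.values())
    best_actions = [a for a, q in q_row.items() if q == best]
    return rng.choice(best_actions)     # exploit: break ties randomly


q_row = {"run": 0.0, "fetch": 0.0, "sit": 0.4}
print(choose_action(q_row, epsilon=0.0))  # always "sit" when epsilon is 0
```

Breaking ties at random matters at the start, when every Q-value is 0 and no action is better than another.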

Step 3: Get the value of the reward and calculate the Q-Value using the Bellman Equation

For the action performed, we need to calculate the actual reward and the Q(S, A) value.


Figure 9: Updating Q-Table with Bellman Equation
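
A single update of this kind could look as follows. The function name `q_update`, the state names, and the values chosen for the reward, learning rate (`alpha`), and discount rate (`gamma`) are assumptions for illustration only.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update to q_table[state][action] in place."""
    best_next = max(q_table[next_state].values())
    td = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * td
    return q_table[state][action]


q_table = {
    "start": {"run": 0.0, "fetch": 0.0, "sit": 0.0},
    "idle": {"run": 0.0, "fetch": 0.0, "sit": 0.0},
}
new_q = q_update(q_table, "start", "sit", reward=1.0,
                 next_state="idle", alpha=0.5, gamma=0.9)
print(new_q)  # 0.0 + 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Only the entry for the state and action just taken changes; every other cell of the table is left as it was.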

Step 4: Continue the same until the table is filled or an episode ends

The agent continues taking actions; for each action, the reward and Q-value are calculated and the table is updated.


 Figure 10: Final Q-Table at end of an episode
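
Putting the four steps together, a complete training loop might look like the sketch below. To keep it short it does not use the dog simulator; instead it assumes a hypothetical corridor of four states in which the agent starts at state 0 and an episode ends when it reaches state 3. The environment, the `step` function, and all parameter values are assumptions.

```python
import random

N_STATES = 4        # hypothetical corridor: states 0..3, goal at 3
ACTIONS = [-1, +1]  # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Move along the corridor; reaching the last state ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


random.seed(1)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]

for episode in range(200):
    state, done = 0, False
    while not done:
        if random.random() < EPSILON:
            a = random.randrange(2)                       # explore
        else:
            a = max(range(2), key=lambda i: q[state][i])  # exploit
        next_state, reward, done = step(state, ACTIONS[a])
        # No future value is added once the terminating state is reached.
        target = reward + (0.0 if done else GAMMA * max(q[next_state]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

greedy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES - 1)]
print(greedy)  # index 1 means "step right"
```

After training, the greedy action in every non-terminal state is to step right, toward the goal.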


Q-Learning With Python

Let's use Q-Learning to find the shortest path between two points. We have a group of nodes and we want the model to automatically find the shortest way to travel from one node to another. We start by importing the necessary modules:


 Figure 11: Import necessary modules

Then we define all possible actions or the points/nodes that exist.


Figure 12: Define the actions

We define the rewards array for every action.


Figure 13: Define the rewards
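
As a rough sketch of these two steps, the snippet below builds the actions and the rewards table for a small, hypothetical layout of four nodes joined in a line. The node labels and edges are illustrative only and are not the layout used in the figures; a reward of 1 marks a pair of nodes that are directly connected.

```python
# Hypothetical 4-node layout; the labels and edges are illustrative only.
locations = ["L1", "L2", "L3", "L4"]
edges = [("L1", "L2"), ("L2", "L3"), ("L3", "L4")]

actions = list(range(len(locations)))  # action i means "move to node i"
index = {name: i for i, name in enumerate(locations)}

# rewards[i][j] is 1 when nodes i and j are directly connected, else 0.
rewards = [[0] * len(locations) for _ in locations]
for a, b in edges:
    rewards[index[a]][index[b]] = 1
    rewards[index[b]][index[a]] = 1  # paths can be walked in both directions

for row in rewards:
    print(row)
```

Building the table from an edge list avoids typing the matrix by hand, which is easy to get wrong for larger graphs.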

We define our environment by mapping the state to a location and set the discount factor and learning rate:


Figure 14: Create Environment and set variables
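
One possible way to write this step is shown below. The node labels are hypothetical, and the values of `gamma` and `alpha` are illustrative choices rather than recommended settings.

```python
locations = ["L1", "L2", "L3", "L4"]  # hypothetical node labels

# Map each location name to a state number, and back again.
location_to_state = {name: i for i, name in enumerate(locations)}
state_to_location = {i: name for name, i in location_to_state.items()}

gamma = 0.75  # discount factor: weight given to future rewards (illustrative value)
alpha = 0.9   # learning rate: how far each update moves the Q-value (illustrative value)

print(location_to_state["L3"], state_to_location[2])
```

The reverse mapping is needed at the end, to turn the state numbers of the chosen route back into location names.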

We then define our agent class and set its attributes. 


Figure 15: Define Agent

We then define its methods. The first method is training, which trains the agent in the environment.



Figure 16: Define a method for how the agent interacts with the environment                  
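
The sketch below shows one way an agent class with a training method could be written. It is an assumption, not a reproduction of the figure: the class name `QAgent`, the three-node chain, the large `goal_reward`, and the choice to sample states and moves uniformly at random are all made up for illustration.

```python
import random


class QAgent:
    """Minimal Q-learning agent for a graph given as a 0/1 rewards matrix."""

    def __init__(self, rewards, alpha=0.9, gamma=0.75, seed=0):
        self.rewards = rewards
        self.alpha = alpha
        self.gamma = gamma
        self.n = len(rewards)
        self.rng = random.Random(seed)
        self.q = [[0.0] * self.n for _ in range(self.n)]

    def train(self, goal, iterations=1000, goal_reward=100):
        """Fill self.q so that larger values point toward `goal`."""
        r = [row[:] for row in self.rewards]  # copy so the original is untouched
        r[goal][goal] = goal_reward           # staying at the goal pays the most
        for _ in range(iterations):
            state = self.rng.randrange(self.n)  # random start state
            moves = [j for j in range(self.n) if r[state][j] > 0]
            nxt = self.rng.choice(moves)        # random legal move
            td = r[state][nxt] + self.gamma * max(self.q[nxt]) - self.q[state][nxt]
            self.q[state][nxt] += self.alpha * td
        return self.q


rewards = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # hypothetical chain 0 - 1 - 2
agent = QAgent(rewards)
q = agent.train(goal=2)
print([round(v, 1) for v in q[1]])  # moving 1 -> 2 should score higher than 1 -> 0
```

Because the moves are sampled at random while the update always uses the best next Q-value, this is also a small demonstration of off-policy learning.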

We then define a method to select the optimal route, by repeatedly choosing the best next state.


Figure 17: Define a method to get optimal route
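
Once a Q-table has been learned, reading the route from it is a matter of following the highest Q-value from node to node. The sketch below assumes a hand-written Q-table for a hypothetical chain of three nodes; the function name `get_route` and the `max_steps` guard are additions made for this example.

```python
def get_route(q, start, goal, max_steps=None):
    """Follow the highest Q-value from `start` until `goal` is reached."""
    max_steps = max_steps or len(q)  # guard against loops in an untrained table
    route = [start]
    state = start
    while state != goal and len(route) <= max_steps:
        state = max(range(len(q)), key=lambda j: q[state][j])  # greedy next node
        route.append(state)
    return route


# Hand-written Q-table for a hypothetical chain 0 - 1 - 2 with goal 2.
q = [
    [0.0, 5.0, 0.0],
    [3.0, 0.0, 8.0],
    [0.0, 5.0, 10.0],
]
print(get_route(q, start=0, goal=2))  # [0, 1, 2]
```

The step limit is a safety net: with a poorly trained table the greedy walk could bounce between two nodes forever.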

Now, let's call our agent and check the shortest route between points L9 and L1:


Figure 18: Find the shortest route between two points

As we can see, the model has found the shortest path between points 1 and 9 by traversing through points 5 and 8. 


In this article titled ‘What is Q-Learning? The best guide to Q-Learning’, we first looked at a sub-branch of machine learning called Reinforcement Learning. We then answered the question, ‘What is Q-Learning?’ which is a type of model-free reinforcement learning. The different terms associated with Q-Learning were introduced and we looked at the Bellman Equation, which is used to update the Q-values of our agent. We looked at the steps required to make a Q-Table and finally, we saw how to implement Q-Learning in Python with a demo.


We hope this article answered the question which was burning in the back of your mind: ‘What is Q-Learning?’. 


About the Author

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in machine learning and artificial intelligence with Python.
