Q-Learning Library

This library lets you create agents that use a demo Q-learning algorithm to control their behavior. Q-learning uses an update function to track the best actions for an agent to take in its current state.


The Q-learning algorithm is one of the most popular, and most fundamental, model-free reinforcement learning algorithms. At every time step an agent selects an action and calculates the reward value of the new state. It then uses that reward, together with the maximum q-value attainable from the new state, to update the q-value of the state and action it just took. Over multiple runs, it builds up a table of states and actions and how 'good' they are. In this way an agent can learn paths to navigate the environment.
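
As a rough illustration of the update described above, the standard tabular Q-learning rule can be sketched as follows. This is not the library's own code: the function name `q_update`, the dict-based table keyed by (state, action), and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, next_actions,
             learning_rate=0.1, discount_factor=0.9):
    """Apply one tabular Q-learning update and return the new q-value."""
    # Best q-value reachable from the new state (0.0 if no actions remain).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    old = q_table[(state, action)]
    # Move the old estimate toward reward + discounted best future value.
    q_table[(state, action)] = old + learning_rate * (
        reward + discount_factor * best_next - old
    )
    return q_table[(state, action)]


# Hypothetical states "s0" and "s1"; unseen pairs start at 0.0.
q_table = defaultdict(float)
q_table[("s1", "right")] = 2.0
new_q = q_update(q_table, "s0", "right", reward=1.0,
                 next_state="s1", next_actions=["left", "right"])
```

Here the q-value for ("s0", "right") moves one tenth of the way from 0 toward 1.0 + 0.9 * 2.0.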


The three behaviors in this library provide the core logic required for Q-learning. They should be paired with custom behaviors that allow the agent to interact with the specific environment in which you are using them. An agent's behavior chain should look something like the following, where the entries in angle brackets stand for your custom behaviors:

  <determine possible actions>, 
  <take action>, 
  <determine reward>, 

The custom <determine possible actions> behavior must provide the agent with an array of all valid actions it can take in its current state, stored in state["actions"]. An array of all possible actions should be stored in globals.json.
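
A minimal sketch of such a custom behavior is shown here for a hypothetical 3x3 grid world. The grid, the `position` field, the move table, and the function name are assumptions for illustration; only state["actions"] comes from the description above, and the agent state is modeled as a plain dict rather than the simulation engine's own state object.

```python
# Hypothetical full action list; in a real simulation this would come from globals.json.
ALL_ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GRID_SIZE = 3  # assumed 3x3 grid with coordinates 0..2


def determine_possible_actions(state):
    """Store every action that keeps the agent on the grid in state["actions"]."""
    x, y = state["position"]
    state["actions"] = [
        a for a in ALL_ACTIONS
        if 0 <= x + MOVES[a][0] < GRID_SIZE and 0 <= y + MOVES[a][1] < GRID_SIZE
    ]
    return state


# An agent in the bottom-left corner can only move up or right.
agent = determine_possible_actions({"position": (0, 0)})
```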


This library behavior chooses the next action the agent will take. It either chooses a random action ("explore") or the best known action ("exploit"). The ratio between the two is controlled by "epsilon".
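
The explore/exploit choice is commonly implemented as an epsilon-greedy rule. The sketch below uses the conventional form, exploring with probability epsilon; that convention, the function name `choose_action`, and the `q_values` mapping are assumptions, and the library's own behavior may differ in detail.

```python
import random


def choose_action(q_values, valid_actions, epsilon, rng=random):
    """Pick an action epsilon-greedily from valid_actions.

    q_values maps each action to its current q-value for the agent's state.
    """
    if rng.random() < epsilon:
        # Explore: pick any valid action uniformly at random.
        return rng.choice(valid_actions)
    # Exploit: pick the valid action with the highest known q-value.
    return max(valid_actions, key=lambda a: q_values.get(a, 0.0))


q_values = {"up": 0.2, "right": 0.7}
# With epsilon = 0.0 the agent never explores, so it takes the best-known action.
greedy = choose_action(q_values, ["up", "right"], epsilon=0.0)
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which can help when comparing parameter settings.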

The custom <take action> behavior should translate the action chosen by the previous behavior into an interaction with the environment, or a change in state for the agent.
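
For a simple grid world, this translation might look like the sketch below. The `action` and `position` field names, the move table, and the function name are hypothetical, and the agent state is again modeled as a plain dict.

```python
# Hypothetical mapping from action names to (dx, dy) steps on a grid.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def take_action(state):
    """Apply the chosen action (assumed to be in state["action"]) to the position."""
    dx, dy = MOVES[state["action"]]
    x, y = state["position"]
    state["position"] = (x + dx, y + dy)
    return state


moved = take_action({"position": (0, 0), "action": "right"})
```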


"epsilon" - Represents the tradeoff between exploring and exploiting when selecting an action. In action.py the agent chooses to explore (select an action at random) if it generates a random number above epsilon. "epsilon_decay" - Over time the agent prefers to exploit its knowledge of good vs bad actions; epsilon decay is the rate at which epsilon is modified at the end of each episode. Over time epsilon approaches zero; how fast it does is determined by epsilon decay. "learning_rate" - The size of the update to make to the q value. "learning_rate_decay" - Similar to epsilon_decay, allows an agent to decrease the rate at which it modifies its q-table over time. "discount_factor" - The degree to which the agent should discount future rewards in favor of immediate ones. A higher discount factor indicates a preference for longer term rewards, lower means it will update in favor of immediate rewards. "episode_length" - the maximum number of time steps for an episode. Episodes end when the episode length is reached or when the agent encounters the goal or obstacle state. "actions" - All possible actions an agent can take over the course of the simulation.