Q Learning Grid Explorer

The simulation creates a gridworld in which an agent explores paths to reach a goal state. It uses the Q-Learning algorithm to assign a quality value to each (state, action) pair. Over time the agent learns to reach the goal state while avoiding obstacles.

forked from @hash/q-learning-map-explorer


RL Example

The simulation creates a gridworld in which an agent explores paths to reach a goal state. It uses the Q-Learning algorithm to assign a quality value to each (state, action) pair. Over time the agent learns the optimal path to the goal state while avoiding obstacle (penalty) states.

Q-Learning

The Q-Learning algorithm is one of the most popular and fundamental model-free reinforcement learning algorithms. At every time step the agent selects an action and observes the reward for the resulting state. It then combines that reward with the maximum q-value attainable from the new state to update the q-value of the (state, action) pair it just took. Over multiple episodes, it builds up a table of states and actions and how 'good' each one is. In this way the agent learns paths through the environment.
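The update described above can be sketched in a few lines of Python. This is a minimal illustration of the standard tabular Q-Learning rule, not the simulation's actual code: the names make_q_table and q_update, the dictionary layout of the table, and the terminal flag are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical action set and table layout, chosen for this sketch only.
ACTIONS = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]


def make_q_table():
    """Map each state to a dict of action -> q-value, defaulting to 0.0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_update(q_table, state, action, reward, next_state,
             learning_rate, discount_factor, terminal=False):
    """Apply one tabular Q-Learning update and return the new q-value."""
    # Best value reachable from the new state; zero if the episode has ended.
    best_next = 0.0 if terminal else max(q_table[next_state].values())
    # Target combines the immediate reward with the discounted future value.
    target = reward + discount_factor * best_next
    old = q_table[state][action]
    # Move the old estimate toward the target by a fraction learning_rate.
    q_table[state][action] = old + learning_rate * (target - old)
    return q_table[state][action]
```

With an empty table, a reward of -1.0, a learning rate of 0.5 and a discount factor of 0.9, the first call moves the q-value from 0.0 halfway toward the target of -1.0. Because the table lives outside the update function, it can persist across episodes while the agent's position is reset.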

Parameters

"epsilon" - The tradeoff between exploring and exploiting when selecting an action. In action.py the agent chooses to explore (select an action at random) if it generates a random number below epsilon; otherwise it exploits what it has already learned.
"epsilon_decay" - The rate at which epsilon is modified at the end of each episode. Over time the agent prefers to exploit its knowledge of good vs bad actions, so epsilon approaches zero; epsilon decay determines how fast it does.
"learning_rate" - The size of the update to make to the q-value.
"discount_factor" - How much weight the agent gives to future rewards relative to immediate ones. A higher discount factor indicates a preference for longer-term rewards; a lower one means it will update in favor of immediate rewards.
"episode_length" - The maximum number of time steps in an episode. An episode ends when the episode length is reached or when the agent reaches the goal or an obstacle.
"gridworld.obstacles" - The locations of the obstacles in the gridworld.
"gridworld.goal" - The location of the goal in the gridworld.
"gridworld.move_penalty" - The cost of taking an action.
"gridworld.goal_reward" - The positive reward for reaching the goal state.
"gridworld.obstacle_penalty" - The negative penalty for encountering an obstacle.
"agent_reset" - The properties that should be reset on the agent whenever it completes an episode. This returns the agent to its initial state without resetting the q table, which must persist across episodes.
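To make the roles of epsilon and epsilon_decay concrete, here is a hedged sketch of epsilon-greedy action selection. The function names choose_action and decay_epsilon are hypothetical. The sketch also makes two assumptions: that the agent explores with probability epsilon, which is the reading consistent with epsilon shrinking toward zero as the agent shifts to exploiting, and that decay is multiplicative. The actual schedule in the simulation may differ.

```python
import random

ACTIONS = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]


def choose_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take the best-known action.

    q_values maps actions to q-values for the current state; actions that
    are missing from it are treated as 0.0.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: uniform random action
    # Exploit: the action with the highest q-value (first one wins ties).
    return max(ACTIONS, key=lambda a: q_values.get(a, 0.0))


def decay_epsilon(epsilon, epsilon_decay):
    """Assumed multiplicative schedule, applied once at the end of an episode."""
    return epsilon * epsilon_decay
```

Under this schedule, an epsilon_decay just below 1 (for example 0.99) keeps the agent exploring for many episodes, while a smaller value commits to the learned policy sooner.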

Possible actions:

[1, 0, 0]
[0, 1, 0]
[-1, 0, 0]
[0, -1, 0]
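
The four vectors above are presumably [x, y, z] offsets added to the agent's position. The following sketch shows how one move might be applied and scored under that reading. The grid size, coordinates, reward values, the sign convention for move_penalty, and the clamping of moves at the grid edge are all assumptions for illustration, not the simulation's actual settings.

```python
ACTIONS = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]

# Hypothetical settings for illustration; not the simulation's actual values.
GRIDWORLD = {
    "width": 5,
    "height": 5,
    "goal": (4, 4, 0),
    "obstacles": [(2, 2, 0), (3, 1, 0)],
    "move_penalty": -1.0,        # assumed to be stored as a negative reward
    "goal_reward": 10.0,
    "obstacle_penalty": -10.0,
}


def step(position, action, world=GRIDWORLD):
    """Return (new_position, reward, done) for a single move."""
    x, y, z = (p + a for p, a in zip(position, action))
    # Assumption: a move off the grid is clamped to the nearest edge cell.
    x = min(max(x, 0), world["width"] - 1)
    y = min(max(y, 0), world["height"] - 1)
    new_position = (x, y, z)
    if new_position == world["goal"]:
        return new_position, world["goal_reward"], True
    if new_position in world["obstacles"]:
        return new_position, world["obstacle_penalty"], True
    # Ordinary move: pay the movement cost and continue the episode.
    return new_position, world["move_penalty"], False
```

The done flag mirrors the episode rule described under "episode_length": reaching the goal or an obstacle ends the episode early, and otherwise the caller stops once the step limit is reached.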