Discord

Improving the Prisoners Dilemma with Q-Learning

We recently published HASH’s Q-Learning Library, to make it easier to start building simulations and agents that use reinforcement learning. Let’s take a look at how we can use the library to update an older simulation. We’ll take the classic Prisoner’s Dilemma simulation and add an agent which uses q-learning to determine its strategy.

The Prisoner’s Dilemma is a classic game theory problem, where two players individually choose to cooperate or defect with the other player. Depending on the strategy and the outcome, they earn different rewards.

Prisoner B

Prisoner A
Prisoner B stays silent
(cooperates)
Prisoner B betrays
(defects)
Prisoner A stays silent
(cooperates)
Each serves 1 yearPrisoner A: 3 years
Prisoner B: goes free
Prisoner A betrays
(defects)
Prisoner A: goes free
Prisoner B: 3 years
Each serves 2 years

The simulation uses hard-coded strategies like “Always cooperate” or “Tit for Tat”, but with the Q-Learning Library, we can create a strategy that adapts and learns by looking at the history of moves made.

In addition to the library behaviors, we always need to create two custom behaviors to allow the Q-Learning algorithm to properly interact with the environment of the specific simulation.

One behavior will provide a ‘reward’ to the agent, and one will translate the action determined by the q learning algorithm into a valid “play”. You can see the end result in this simulation.

Representing the State Space

To use the Q-Learning library, we’ll need to decide how to encode the possible states of an agent into a Q-Table. Since our agent is trying to learn from the previous rounds, and there are four possible outcomes for each round, there are 4^h possible states for an agent, where h is the number of previous rounds an agent takes into consideration. Here is the conversion from current history to location in the Q-table (“q_state”):

// from globals.json
"lookback": 3,
"outcomes": {"outcomes": ["cc", "cd", "dc", "dd"]


// Determine the current q state
h = history.slice(-lookback).map(m => outcomes.indexOf(m));
state["q_state"] = h.reduce((acc, val, i) => acc + val * (4**i), 0);;

This conversion will be used multiple times in the simulation.

Writing reward.js

The reward.js behavior will determine the reward the agent should use to update its q-table, and set the next_q_state field on the agent based on the moves the agent and its opponent played. The reward is based on the actions of both the agent and its opponent, and the opponent’s actions can only be seen at the beginning of the next time step. As a result, this behavior must run at the beginning of the behavior chain, when the agent can see it’s opponent’s most recent move (through context.neighbors()) but before it updates its q-table.

The state["reward"] field is determined by looking at the most recent entry in state["curr_histories“], and using the “scoring” object in globals.json(). The state["next_q_state"] field is calculated using the conversion described in the preceding section.

Writing strategy_q_learning.js

This behavior needs to translate the decisions made by the @hash/qrl/action.py behavior into a valid move that the agent can play in the game. It does this by using some of the logic from the other strategy_... behaviors, then makes use of the state["action"] field:

// Play the move chosen by @hash/qrl/action.py
context.neighbors().map(n => {
  state.curr_moves[n.agent_id] = state.action;
});

Since the agent needs a certain number of rounds in the history in order to determine its next move, its first few moves will actually be randomly determined. Once enough rounds have been played (3 in this example), it will add back the q-learning behaviors and begin adjusting its expectation values as it learns a strategy.

Initializing the Simulation

Initializing the agent with a q-learning strategy will require us to set a number of global and state variables. The full list can be found in the README for the library.

When we run the simulation, the q-learning agent will attempt to fill out its q-table by exploring different actions, and exploiting its current knowledge of the opponent. You can run the “Test Every Strategy” experiment to observe how well the q-learning agent learns to play against other strategies.