Deep Reinforcement Learning

DRL is a subset of Machine Learning in which agents are allowed to solve tasks on their own, and thus discover new solutions independent of human intuition.

What is DRL?

Deep Reinforcement Learning (DRL) is a subset of Machine Learning and an amalgamation of methods from both Deep Learning and Reinforcement Learning, through which agents autonomously complete tasks.

DRL allows agents to solve tasks on their own, and thus discover new solutions independent of human intuition, historical understanding, and path-dependent thinking. By sifting through multi-dimensional, non-linear data and acting on it almost instantaneously, DRL can outperform its human counterpart in both the depth and breadth of perception.

The use of sequential decision-making is a key differentiator from other types of Machine Learning. DRL employs deep neural networks, built from many stacked layers (thus the term 'deep'), which can represent more complex policies. By combining concepts from both Deep Learning and Reinforcement Learning (for example, replaying past experience during training), DRL can mitigate the 'catastrophic forgetting' that is often a byproduct of training deep neural networks.

It is particularly useful as an iterative and adaptive process through which actions can be tested and refined, with the agents changing their behavior based on the result.

Key Concepts

  • Agent: an autonomous entity whose activity is directed towards achieving goals.
  • State: a comprehensive description of the world.
  • Observation: a partial description of the state (world).
  • Action Spaces: the valid actions that the agent is allowed to take within a state; may be discrete or continuous.
  • Policy: a rule used by an agent to decide which action to take; may be deterministic or stochastic (e.g. sampling from a categorical distribution for discrete actions, or from a Gaussian/normal distribution for continuous ones).
  • Policy network: a neural network that maps the agent's inputs (observations) to its outputs (actions).
  • Policy gradient: a method for optimizing a parametrized policy to maximize the expected return for an agent.
  • Trajectories: a sequence of states and actions produced as agents act within a world; the transitions between states may also be deterministic or stochastic.
  • Reward: the numerical feedback an agent receives after taking an action; positive for successful actions and negative for unsuccessful ones.
  • Return: the cumulative sum of the rewards received by an agent, often discounted over time. An agent will want to maximize its expected return.
  • Value Functions: the expected return if an agent starts in a given state and then follows a particular policy thereafter.
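
To make these concepts concrete, below is a minimal sketch of the agent-environment loop in Python. It assumes the open-source gymnasium package and its CartPole-v1 environment (our choice for illustration, not something the definitions above depend on), and uses a purely random policy.

```python
# Minimal agent-environment loop, assuming the `gymnasium` package and its
# CartPole-v1 environment; any environment exposing reset()/step() works the same way.
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)  # observation: a partial description of the state

trajectory = []        # sequence of (observation, action, reward) tuples
episode_return = 0.0   # cumulative reward collected over the episode (the return)

done = False
while not done:
    action = env.action_space.sample()  # a random policy over the discrete action space
    next_observation, reward, terminated, truncated, info = env.step(action)
    trajectory.append((observation, action, reward))
    episode_return += reward
    observation = next_observation
    done = terminated or truncated

print(f"Return: {episode_return}, trajectory length: {len(trajectory)}")
```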

How does DRL work?

The goal of DRL is to develop a robust 'policy network': a neural network that converts the problems presented to the agent (its observations) into actions. This functions as a loop of learned behavior for the agent. Each time it reacts to a problem and receives feedback, it produces new and better-informed actions.
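
As an illustration, a policy network can be a small feed-forward neural network that maps observations to a probability distribution over actions. The sketch below uses PyTorch (an assumption on our part; any deep learning framework would do) with CartPole-like dimensions: a 4-dimensional observation and 2 discrete actions.

```python
# A minimal stochastic policy network sketch in PyTorch (framework choice is an assumption).
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)
        # Stochastic policy: actions are sampled from a categorical distribution
        return torch.distributions.Categorical(logits=logits)

policy = PolicyNetwork()
dummy_obs = torch.zeros(4)            # a placeholder observation
action = policy(dummy_obs).sample()   # an action drawn from the current policy
```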

  1. Using a sample distribution over the available actions, we can vary the information we feed to the agent. This allows it to explore the possibilities available to it (the 'action space') through randomization of possible actions. With enough sampled actions, the agent will, in expectation, eventually discover the best available action to take.
  2. We provide feedback to the agent whenever it completes a task: a positive reward if it succeeds and a negative reward if it fails. Using these rewards, a 'policy gradient' update makes the successful actions more likely to be selected in the future and the unsuccessful actions less likely (see the training-loop sketch after this list).
  3. During this process, the policy network 'records' what it has learned through updates to its weights, driven by the signal it receives from the loss (cost) function.
  4. The cumulative effect of making successful actions more likely and unsuccessful actions less likely is to optimize the agent's future behavior. This reinforcement of positive behavior incentivizes the agent to figure out the best method for tackling future problems, learning from its successes and failures to improve its 'expected return'.
  5. This trial-and-error approach continually improves and compounds the agent's behavior over time (captured in its learned 'value functions'), so that no human model needs to intervene.
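
Putting these steps together, here is a sketch of a simple policy-gradient training loop in the style of REINFORCE. It reuses the PolicyNetwork class and CartPole environment from the sketches above; the learning rate, discount factor, and episode count are illustrative assumptions rather than tuned values.

```python
# A REINFORCE-style policy-gradient loop, reusing PolicyNetwork and CartPole
# from the earlier sketches. Hyperparameters below are illustrative assumptions.
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = PolicyNetwork()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99  # discount factor

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()  # explore by sampling from the stochastic policy
        obs, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return from each timestep to the end of the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Policy gradient step: make actions that led to high returns more likely
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```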

How can we use DRL?

Although DRL is currently widely used in robotics, computer science and video game development, other industries are set to benefit from its proliferation. The opportunities for scaling this technology span multiple industries, from public health and education to transport and financial services.

By reducing human miscalculation and improving a machine's retention of past experience, DRL uses agents to find the best available action for a given problem.

Here are some key use cases:
  • Healthcare: automating medical diagnosis from both unstructured and structured clinical data (paper), and controlling the dosage of medication through mapping from registry databases (paper).
  • Finance: integrating macroeconomic analysis with predictive technology to forecast future systems-level shifts while adhering to pre-determined risk parameters (paper); analyzing uncorrelated trading signals to achieve superior short-term returns across financial markets (paper).
  • Cybersecurity: using DRL-based simulations of intrusion attempts to detect potential cyber attacks (paper).

Deep Reinforcement Learning is a promising approach for implementing powerful, autonomous agents. It has the potential to dramatically expand the use of AI in new and existing domains.

Applying DRL in practice

At HASH we’re building a platform for creating environments and agents that are useful in training DRL agents and that will, in the near future, allow DRL-trained agents to be incorporated into simulations to navigate and discover solutions in complex systems. Learn more about HASH’s platform >
