Reinforcement Learning Overview

Data Science Milan
Jul 18, 2020

“Towards Artificial Intelligence”

On 30th June 2020, Data Science Milan organized a webMeetup hosting Marco Del Pra, an Artificial Intelligence Specialist, to talk about Reinforcement Learning.

“Reinforcement Learning Overview”, by Marco Del Pra, Freelancer

Reinforcement Learning is the area of Machine Learning that deals with sequential decision-making; the problem it solves can be described as a Markov decision process.

There are three basic concepts in Reinforcement Learning: state, action, and reward. The state describes the current situation. The action is what an agent can do in each state. The reward is the feedback from the environment, which can be positive or negative. The agent is the learner and decision-maker; it interacts continually with the environment. In these interactions the agent selects actions and the environment responds to those actions by presenting new situations and giving rewards. More specifically, at each time step t, the agent receives some representation of the environment state St and, on that basis, selects an action At. One time step later, the agent receives a numerical reward Rt+1 and moves to the new state St+1.
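As an illustration, this interaction loop can be sketched in a few lines of Python. The env object and the select_action function below are hypothetical placeholders (in the style of common RL toolkits), used only to make the loop concrete.

```python
# Minimal sketch of the agent-environment interaction loop.
# `env` and `select_action` are hypothetical placeholders, not specific to the talk.

def run_episode(env, select_action, max_steps=1000):
    state = env.reset()                              # initial state S0
    total_reward = 0.0
    for t in range(max_steps):
        action = select_action(state)                # the agent picks At based on St
        next_state, reward, done = env.step(action)  # the environment returns Rt+1 and St+1
        total_reward += reward
        state = next_state                           # move on to the new state
        if done:                                     # stop at a terminal state
            break
    return total_reward
```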

A reinforcement learning agent typically has an explicit representation of one or more of the following three things: a policy, a value function and, optionally, a model. A policy π is a mapping from the agent's state to an action. The agent's goal is to find an optimal policy, which achieves the maximum expected return from all states. The policy is a behaviour function that allows the agent to select the best action.
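In standard notation, the return the agent tries to maximize and the optimal-policy objective can be written as follows (γ is the discount factor):

```latex
% Discounted return from time step t, and the optimal policy as the one
% maximizing its expected value from every state
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] \quad \text{for every state } s
```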

Given a policy π and a discount factor γ, the V-value function is the expected sum of discounted rewards from a given state S, while the Q-value function is the expected sum of discounted rewards from a given state S when a given action A is taken. The advantage function, the difference between the Q-value function and the V-value function, describes how good the action A is compared to the expected return obtained by directly following policy π. A reinforcement learning agent may have a model, in which case it is called a model-based agent; if it does not incorporate a model, it is called a model-free agent.
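In standard notation, the three quantities above are:

```latex
% State-value, action-value and advantage functions for a policy pi
V^{\pi}(s)   = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]

A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
```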

Reinforcement Learning applications can be found in games, robotics, finance, healthcare, image processing and more.

Reinforcement Learning problems can be solved with three approaches: value-based, policy-based and model-based. The value-based method tries to maximize the state value function or the state-action value function. The policy-based method tries to find a policy such that the action performed in every state helps to gain the maximum reward in the future. The model-based method builds a virtual model of the environment, and the agent learns to perform within that specific environment.

The value-based class of algorithms aims to build a value function, which subsequently defines a policy. A popular approach is value function approximation, which generalizes the value function through parameterized functional forms. A Deep Q-Network is a value function approximation that uses a Deep Neural Network, in particular a Convolutional Neural Network, as the function approximator. This makes it possible to scale up to decision-making in very large domains and enables automatic feature extraction.
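As a rough sketch (not the exact architecture from the talk), a Deep Q-Network and its one-step temporal-difference loss might look like the following in PyTorch; the 84×84 input size, the layer sizes and the td_loss helper are assumptions for illustration:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-value approximator: image state -> one Q-value per action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # an 84x84 input gives a 7x7x64 feature map after the conv stack
        self.head = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                                  nn.Linear(512, n_actions))

    def forward(self, x):
        return self.head(self.features(x))

def td_loss(q_net, target_net, batch, gamma=0.99):
    """One-step TD loss on a sampled batch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions taken
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q_target(s', a')
        target = r + gamma * (1 - done) * q_next           # bootstrap only on non-terminal states
    return nn.functional.smooth_l1_loss(q_sa, target)
```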

With the policy gradient method, the goal is to directly find the policy with the highest value function, rather than first finding the value function of the optimal policy and then extracting the policy from it. The policy improvement step takes a gradient step to optimize the policy with respect to the value function estimate.

The policy gradient method is simpler than the value-based method, it allows actions that are continuous with respect to the state, and it can learn stochastic policies, but it typically converges to a local rather than a global optimum and has high variance.
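A minimal policy gradient sketch (REINFORCE-style, assuming discrete actions, a small PyTorch policy network and an already-collected episode of tensors):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """Increase the log-probability of each action in proportion to the return that followed it."""
    states, actions, rewards = episode          # states: [T, state_dim], actions: [T], rewards: list of T floats
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted returns G_t, computed backwards
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalization reduces the high variance
    log_probs = policy(states).log_prob(actions)
    loss = -(log_probs * returns).mean()        # gradient ascent on the expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```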

Another common approach is the actor-critic architecture, which estimates and updates both the policy and the value function; it consists of two parts: an actor and a critic. The actor represents the policy and the critic the estimated value function, both represented by non-linear neural network function approximators. The actor uses gradients derived from the policy gradient to adjust the policy parameters, while the critic estimates the approximate value function for the current policy.
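A minimal actor-critic sketch along these lines (the network sizes and the one-step advantage estimate are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body; the actor head outputs an action distribution, the critic head estimates V(s)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.actor = nn.Linear(128, n_actions)   # policy logits
        self.critic = nn.Linear(128, 1)          # state-value estimate

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

def actor_critic_update(model, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One-step update: the critic bootstraps a TD target, the actor follows the advantage."""
    dist, value = model(s)
    with torch.no_grad():
        _, next_value = model(s_next)
        td_target = r + gamma * (1 - done) * next_value
    advantage = td_target - value
    actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()   # policy gradient step
    critic_loss = advantage.pow(2).mean()                          # value regression toward the TD target
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```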

The previous methods attempted to learn either a value function or a policy from experience. In contrast, model-based approaches first learn a model of the world from experience, and then use it for planning and acting.

There are three steps:

- Acting: the policy is used to select the actions to perform in the real environment;

- Model learning: from the collected experience, a model is learned so as to minimize the error between the state predicted by the model and the real new state (a minimal sketch follows the list);

- Planning: the value function and the policy are updated, to be used in the real environment in the next iteration.
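A minimal sketch of the model-learning step (the network shape and the mean-squared-error objective are illustrative assumptions; planning would then roll this learned model out to update the value function and the policy before acting again):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Learned model of the environment: predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def model_learning_step(model, optimizer, batch):
    """Minimize the error between the model's predicted next state and the real next state."""
    s, a, s_next = batch                         # tensors of states, (continuous) actions, real next states
    loss = nn.functional.mse_loss(model(s, a), s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                           # planning then uses the fitted model for simulated rollouts
```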

Recording & Slides:

video

slides

References:

http://www.incompleteideas.net/book/RLbook2018.pdf

https://arxiv.org/abs/1811.12560

Interesting related links:

http://web.stanford.edu/class/cs234/schedule.html

Written by Claudio G. Giancaterino

