Policy gradient reinforcement learning with separated knowledge: Environmental dynamics and action-values in policies

Seiji Ishihara, Harukazu Igarashi

Research output: Contribution to journal › Article

Abstract

The knowledge underlying an agent's policy consists of two types: the environmental dynamics, which define the state transitions around the agent, and the behavior knowledge needed to solve a given task. In conventional reinforcement learning, these two types of information are combined into state-value or action-value functions and learned together. If they were separated and learned independently, the behavior knowledge could be transferred to other environments and reused or modified there. In our previous work, we derived learning rules by the policy gradient method for an objective function containing two types of parameters, one representing the environmental dynamics and the other the behavior knowledge, so that each type is learned separately. In that framework, state-values served as the reusable parameters corresponding to the behavior knowledge. This paper instead adopts action-values as the parameters in the objective function of a policy and derives policy gradient learning rules for each type of separated knowledge. Simulation results on a pursuit problem show that these parameters, too, can be transferred and reused more effectively than unseparated knowledge.
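
As a concrete illustration of the separation described above, the following is a minimal sketch, not the paper's actual algorithm: a Boltzmann policy whose preference f(s, a) = sum over s' of p_phi(s'|s, a) * q_theta(s', a) couples a learned transition model p_phi (environmental dynamics) with an action-value table q_theta (behavior knowledge), trained by REINFORCE so that each parameter set receives its own gradient. The toy MDP, the parameter shapes, and this particular form of the preference are assumptions made for illustration; the paper's objective function and its pursuit-problem setup differ.

# Minimal sketch of separated-knowledge policy gradients (illustrative
# assumptions throughout; not the algorithm from Ishihara & Igarashi 2016).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, GOAL = 5, 2, 4                       # hypothetical toy MDP sizes

# Hypothetical environment: random transition tensor, reward 1 at GOAL.
T = rng.dirichlet(np.ones(nS), size=(nS, nA))   # true dynamics, shape (nS, nA, nS)

phi = np.zeros((nS, nA, nS))   # dynamics parameters (softmax logits over s')
theta = np.zeros((nS, nA))     # action-values q_theta(s', a): the transferable part

def p_dyn(s, a):
    # Modelled transition distribution p_phi(s'|s, a).
    z = np.exp(phi[s, a] - phi[s, a].max())
    return z / z.sum()

def pref(s):
    # Assumed preference: f(s, a) = sum_{s'} p_phi(s'|s, a) * q_theta(s', a).
    return np.array([p_dyn(s, a) @ theta[:, a] for a in range(nA)])

def policy(s):
    # Boltzmann policy over the preferences.
    f = pref(s)
    z = np.exp(f - f.max())
    return z / z.sum()

def grads_logpi(s, a):
    # Gradient of log pi(a|s) w.r.t. theta and phi, kept separate so each
    # kind of knowledge gets its own update (the point of the sketch).
    pi = policy(s)
    g_theta = np.zeros_like(theta)
    g_phi = np.zeros_like(phi)
    for b in range(nA):
        w = (1.0 if b == a else 0.0) - pi[b]   # d log pi(a|s) / d f(s, b)
        p = p_dyn(s, b)
        f_b = p @ theta[:, b]
        g_theta[:, b] += w * p                 # df/dq_theta(s', b) = p_phi(s'|s, b)
        g_phi[s, b] += w * p * (theta[:, b] - f_b)   # softmax chain rule
    return g_theta, g_phi

alpha, gamma = 0.1, 0.95
for _ in range(2000):                          # REINFORCE episodes
    s, traj = 0, []
    for _ in range(20):                        # episode length cap
        a = rng.choice(nA, p=policy(s))
        s2 = rng.choice(nS, p=T[s, a])
        r = 1.0 if s2 == GOAL else 0.0
        traj.append((s, a, r))
        s = s2
        if r > 0:
            break
    G = 0.0
    for (s, a, r) in reversed(traj):
        G = r + gamma * G                      # discounted return
        g_t, g_p = grads_logpi(s, a)
        theta += alpha * G * g_t               # behavior-knowledge update
        phi += alpha * G * g_p                 # environmental-dynamics update

Because theta and phi receive independent updates, q_theta could in principle be carried to a new environment while only phi is relearned, which mirrors the transfer scenario described in the abstract.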

Original language: English
Pages (from-to): 282-289
Number of pages: 8
Journal: IEEJ Transactions on Electronics, Information and Systems
Volume: 136
Issue number: 3
DOIs: 10.1541/ieejeiss.136.282
Publication status: Published - 2016

Keywords

  • Action-value
  • Environmental dynamics
  • Policy gradient method
  • Pursuit problem
  • Reinforcement learning
  • Transfer learning

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

@article{d83c1f463ed74e99aefffbbd2b482064,
title = "Policy gradient reinforcement learning with separated knowledge: Environmental dynamics and action-values in policies",
abstract = "The knowledge underlying an agent's policy consists of two types: the environmental dynamics, which define the state transitions around the agent, and the behavior knowledge needed to solve a given task. In conventional reinforcement learning, these two types of information are combined into state-value or action-value functions and learned together. If they were separated and learned independently, the behavior knowledge could be transferred to other environments and reused or modified there. In our previous work, we derived learning rules by the policy gradient method for an objective function containing two types of parameters, one representing the environmental dynamics and the other the behavior knowledge, so that each type is learned separately. In that framework, state-values served as the reusable parameters corresponding to the behavior knowledge. This paper instead adopts action-values as the parameters in the objective function of a policy and derives policy gradient learning rules for each type of separated knowledge. Simulation results on a pursuit problem show that these parameters, too, can be transferred and reused more effectively than unseparated knowledge.",
keywords = "Action-value, Environmental dynamics, Policy gradient method, Pursuit problem, Reinforcement learning, Transfer learning",
author = "Seiji Ishihara and Harukazu Igarashi",
year = "2016",
doi = "10.1541/ieejeiss.136.282",
language = "English",
volume = "136",
pages = "282--289",
journal = "IEEJ Transactions on Electronics, Information and Systems",
issn = "0385-4221",
publisher = "The Institute of Electrical Engineers of Japan",
number = "3",
}

TY  - JOUR
T1  - Policy gradient reinforcement learning with separated knowledge
T2  - Environmental dynamics and action-values in policies
AU  - Ishihara, Seiji
AU  - Igarashi, Harukazu
PY  - 2016
Y1  - 2016
AB  - The knowledge underlying an agent's policy consists of two types: the environmental dynamics, which define the state transitions around the agent, and the behavior knowledge needed to solve a given task. In conventional reinforcement learning, these two types of information are combined into state-value or action-value functions and learned together. If they were separated and learned independently, the behavior knowledge could be transferred to other environments and reused or modified there. In our previous work, we derived learning rules by the policy gradient method for an objective function containing two types of parameters, one representing the environmental dynamics and the other the behavior knowledge, so that each type is learned separately. In that framework, state-values served as the reusable parameters corresponding to the behavior knowledge. This paper instead adopts action-values as the parameters in the objective function of a policy and derives policy gradient learning rules for each type of separated knowledge. Simulation results on a pursuit problem show that these parameters, too, can be transferred and reused more effectively than unseparated knowledge.
KW  - Action-value
KW  - Environmental dynamics
KW  - Policy gradient method
KW  - Pursuit problem
KW  - Reinforcement learning
KW  - Transfer learning
UR  - http://www.scopus.com/inward/record.url?scp=84960455941&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84960455941&partnerID=8YFLogxK
U2  - 10.1541/ieejeiss.136.282
DO  - 10.1541/ieejeiss.136.282
M3  - Article
AN  - SCOPUS:84960455941
VL  - 136
SP  - 282
EP  - 289
JO  - IEEJ Transactions on Electronics, Information and Systems
JF  - IEEJ Transactions on Electronics, Information and Systems
SN  - 0385-4221
IS  - 3
ER  -