Policy gradient reinforcement learning with environmental dynamics and action-values in policies

Seiji Ishihara, Harukazu Igarashi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The knowledge underlying an agent's policies consists of two types: environmental dynamics, which define the state transitions around the agent, and behavior knowledge for solving a given task. In conventional reinforcement learning, however, these two types of information are usually combined into state-value or action-value functions and learned together. If they were separated and learned independently, either could be reused in other tasks or environments. In our previous work, we presented learning rules based on policy gradients with an objective function consisting of two types of parameters, representing environmental dynamics and behavior knowledge, so that each type can be learned separately. In that framework, state-values served as the set of parameters corresponding to behavior knowledge. Simulation results on a pursuit problem showed that our method properly learned hunter-agent policies and could reuse either type of knowledge. In this paper, we adopt action-values instead of state-values as the parameter set in the objective function and present learning rules for it. Simulation results on the same pursuit problem as in our previous work show that these parameters and learning rules are also effective.
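
As a rough illustration of the idea described in the abstract (not the authors' actual formulation; the softmax form of the policy, the REINFORCE-style update, and all names below are assumptions for the sketch), a policy can combine two separately stored parameter sets, one for environmental dynamics and one for action-values, and update both by a policy gradient so that either set could later be reused on its own:

    # Illustrative sketch only; NOT the formulation from the paper.
    # A Boltzmann policy whose "objective" adds two separate parameter sets:
    # dyn (environmental dynamics) and qval (action-values / behavior knowledge).
    # Both are updated with a REINFORCE-style policy-gradient rule.
    import numpy as np

    n_states, n_actions = 16, 4              # assumed toy sizes (e.g. a small grid)
    dyn = np.zeros((n_states, n_actions))    # dynamics parameters (assumed name)
    qval = np.zeros((n_states, n_actions))   # action-value parameters (assumed name)

    def policy(s, beta=1.0):
        """Boltzmann policy over the combined objective dyn + qval (assumed form)."""
        logits = beta * (dyn[s] + qval[s])
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def update_episode(trajectory, ret, lr=0.1, beta=1.0):
        """REINFORCE-style update applied to both parameter sets independently.

        trajectory: list of (state, action) pairs; ret: scalar episode return.
        The gradient of log pi w.r.t. either set has the same
        (one-hot minus probabilities) form, so each set gets its own update.
        """
        for s, a in trajectory:
            p = policy(s, beta)
            grad = -beta * p
            grad[a] += beta
            dyn[s] += lr * ret * grad    # update dynamics parameters
            qval[s] += lr * ret * grad   # update action-value parameters

    # Toy usage: one fake episode of random state-action pairs with return +1.
    rng = np.random.default_rng(0)
    episode = [(int(rng.integers(n_states)), int(rng.integers(n_actions))) for _ in range(5)]
    update_episode(episode, ret=1.0)
    print(policy(episode[0][0]))

Because the log-policy gradient takes the same form with respect to either parameter set, each set can be kept and learned separately, which is what would allow reusing the dynamics part in a new task or the action-value part in a new environment.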

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 120-130
Number of pages: 11
Volume: 6881 LNAI
Edition: PART 1
ISBN (Print): 9783642238505
DOIs: https://doi.org/10.1007/978-3-642-23851-2_13
Publication status: Published - 2011
Event: 15th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES 2011 - Kaiserslautern
Duration: 2011 Sep 12 – 2011 Sep 14

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6881 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES 2011
City: Kaiserslautern
Period: 11/9/12 – 11/9/14

Fingerprint

Reinforcement learning

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Ishihara, S., & Igarashi, H. (2011). Policy gradient reinforcement learning with environmental dynamics and action-values in policies. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (PART 1 ed., Vol. 6881 LNAI, pp. 120-130). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6881 LNAI, No. PART 1). https://doi.org/10.1007/978-3-642-23851-2_13
