Applying the policy gradient method to behavior learning in multiagent systems: The pursuit problem

Seiji Ishihara, Harukazu Igarashi

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

In the field of multiagent systems, some methods apply the policy gradient method to behavior learning. These methods reduce the learning problem of the multiagent system to an independent learning problem for each agent by adopting an autonomous, distributed scheme for action determination. That is, each agent uses a parameterized probabilistic policy, and the parameters are updated along the gradient so as to maximize the expected reward. In this paper, we first regard the action determination problem at each time step as a minimization problem for an objective function, and adopt as the probabilistic policy the Boltzmann distribution whose energy function is that objective function. Next, we show that this objective function can be expressed in terms of the state value, state-action rules, and a potential. Finally, experiments on a pursuit problem show that the method yields good policies and is flexible enough to accommodate heuristics as well as modifications of the behavioral constraints and objectives in the policy.
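The scheme the abstract describes — a Boltzmann-distribution policy whose energy is a parameterized objective function, with the parameters updated by policy gradient to maximize expected reward — can be sketched as follows. This is a generic illustration, not the paper's implementation: the linear energy E(s, a) = -θ·φ(s, a), the feature vectors, the reward scheme, and all numbers are assumptions; the paper's actual energy combines state values, state-action rules, and potentials as described in the full text.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_policy(theta, features, temperature=1.0):
    """Action probabilities from a Boltzmann (Gibbs) distribution whose
    energy is a parameterized objective function, here assumed linear:
    E(s, a) = -theta . phi(s, a). Lower energy -> higher probability."""
    energies = -features @ theta          # one energy value per action
    logits = -energies / temperature
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, features, action, reward, probs, lr=0.1):
    """One policy-gradient (REINFORCE-style) step. For a softmax policy,
    grad log pi(a|s) = phi(s, a) - E_pi[phi(s, .)]."""
    grad_log_pi = features[action] - probs @ features
    return theta + lr * reward * grad_log_pi

# Toy setting (hypothetical): 3 actions with 2-dimensional features;
# only action 2 is rewarded, so its probability should grow.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.zeros(2)
for _ in range(200):
    p = boltzmann_policy(theta, phi)
    a = rng.choice(3, p=p)
    r = 1.0 if a == 2 else 0.0
    theta = reinforce_update(theta, phi, a, r, p)

p = boltzmann_policy(theta, phi)      # p[2] ends up the largest probability
```

In this sketch each agent would hold its own θ and run the same update independently, which is the "autonomous distributed" reduction the abstract refers to.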

Original language: English
Pages (from-to): 101-109
Number of pages: 9
Journal: Systems and Computers in Japan
Volume: 37
Issue number: 10
DOI: 10.1002/scj.20248
Publication status: Published - Sep 2006
Externally published: Yes


Keywords

  • Multiagent system
  • Policy gradient method
  • Pursuit problem
  • Reinforcement learning

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Theoretical Computer Science
  • Computational Theory and Mathematics

Cite this

Applying the policy gradient method to behavior learning in multiagent systems: The pursuit problem. / Ishihara, Seiji; Igarashi, Harukazu.

In: Systems and Computers in Japan, Vol. 37, No. 10, 09.2006, p. 101-109.

Research output: Contribution to journal › Article

@article{a82da389b2ed459083336a4198e0f7e4,
title = "Applying the policy gradient method to behavior learning in multiagent systems: The pursuit problem",
abstract = "In the field of multiagent systems, some methods use the policy gradient method for behavior learning. In these methods, the learning problem in the multiagent system is reduced to each agent's independent learning problem by adopting an autonomous distributed behavior determination method. That is, a probabilistic policy that contains parameters is used as the policy of each agent, and the parameters are updated while calculating the maximum gradient so as to maximize the expectation value of the reward. In this paper, first, recognizing the action determination problem at each time step to be a minimization problem for some objective function, the Boltzmann distribution, in which this objective function is the energy function, was adopted as the probabilistic policy. Next, we showed that this objective function can be expressed by such terms as the value of the state, the state action rule, and the potential. Further, as a result of an experiment applying this method to a pursuit problem, good policy was obtained and this method was found to be flexible so that it can be adapted to use of heuristics and to modification of behavioral constraint and objective in the policy.",
keywords = "Multiagent system, Policy gradient method, Pursuit problem, Reinforcement learning",
author = "Seiji Ishihara and Harukazu Igarashi",
year = "2006",
month = sep,
doi = "10.1002/scj.20248",
language = "English",
volume = "37",
pages = "101--109",
journal = "Systems and Computers in Japan",
issn = "0882-1666",
publisher = "John Wiley and Sons Inc.",
number = "10",

}

TY - JOUR

T1 - Applying the policy gradient method to behavior learning in multiagent systems

T2 - The pursuit problem

AU - Ishihara, Seiji

AU - Igarashi, Harukazu

PY - 2006/9

Y1 - 2006/9

N2 - In the field of multiagent systems, some methods use the policy gradient method for behavior learning. In these methods, the learning problem in the multiagent system is reduced to each agent's independent learning problem by adopting an autonomous distributed behavior determination method. That is, a probabilistic policy that contains parameters is used as the policy of each agent, and the parameters are updated while calculating the maximum gradient so as to maximize the expectation value of the reward. In this paper, first, recognizing the action determination problem at each time step to be a minimization problem for some objective function, the Boltzmann distribution, in which this objective function is the energy function, was adopted as the probabilistic policy. Next, we showed that this objective function can be expressed by such terms as the value of the state, the state action rule, and the potential. Further, as a result of an experiment applying this method to a pursuit problem, good policy was obtained and this method was found to be flexible so that it can be adapted to use of heuristics and to modification of behavioral constraint and objective in the policy.

AB - In the field of multiagent systems, some methods use the policy gradient method for behavior learning. In these methods, the learning problem in the multiagent system is reduced to each agent's independent learning problem by adopting an autonomous distributed behavior determination method. That is, a probabilistic policy that contains parameters is used as the policy of each agent, and the parameters are updated while calculating the maximum gradient so as to maximize the expectation value of the reward. In this paper, first, recognizing the action determination problem at each time step to be a minimization problem for some objective function, the Boltzmann distribution, in which this objective function is the energy function, was adopted as the probabilistic policy. Next, we showed that this objective function can be expressed by such terms as the value of the state, the state action rule, and the potential. Further, as a result of an experiment applying this method to a pursuit problem, good policy was obtained and this method was found to be flexible so that it can be adapted to use of heuristics and to modification of behavioral constraint and objective in the policy.

KW - Multiagent system

KW - Policy gradient method

KW - Pursuit problem

KW - Reinforcement learning

UR - http://www.scopus.com/inward/record.url?scp=33747465457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33747465457&partnerID=8YFLogxK

U2 - 10.1002/scj.20248

DO - 10.1002/scj.20248

M3 - Article

AN - SCOPUS:33747465457

VL - 37

SP - 101

EP - 109

JO - Systems and Computers in Japan

JF - Systems and Computers in Japan

SN - 0882-1666

IS - 10

ER -