
Oral Session A-2 : Language Processing

Learning strategic play in Mafia via PPO Finetuning of Large Language Models

Abstract

(English)

We investigate whether Large Language Models (LLMs) can learn strategic reasoning and social deception abilities through Reinforcement Learning (RL) finetuning in a multi-agent "Mafia Game" simulation environment. We finetune a baseline 7B model using Proximal Policy Optimization (PPO) with sparse binary rewards based on game outcomes. Training samples are collected through an opponent pool consisting of different versions of the finetuned model. Our experimental results show that the finetuned model outperforms the baseline model by a significant margin and suggest the emergence of strategic capabilities not seen in the baseline model.
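The two training ingredients the abstract names, an opponent pool of past model snapshots and sparse binary rewards tied to game outcomes, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and function names (`OpponentPool`, `sparse_binary_reward`) and the pool-size limit are hypothetical.

```python
import random

class OpponentPool:
    """Illustrative sketch: keeps recent policy snapshots to sample opponents from.
    The abstract only states that opponents are drawn from different versions of
    the finetuned model; the eviction rule here is an assumption."""

    def __init__(self, max_size=5):
        self.checkpoints = []
        self.max_size = max_size

    def add(self, checkpoint_id):
        self.checkpoints.append(checkpoint_id)
        if len(self.checkpoints) > self.max_size:
            self.checkpoints.pop(0)  # drop the oldest snapshot

    def sample(self):
        # Uniform sampling over stored snapshots (an assumption).
        return random.choice(self.checkpoints)

def sparse_binary_reward(winning_team, player_team):
    """Sparse binary reward from the game outcome: 1 for a win, 0 otherwise."""
    return 1.0 if winning_team == player_team else 0.0

# Usage: snapshot the policy after each training round, then sample an opponent.
pool = OpponentPool()
for step in range(3):
    pool.add(f"ckpt-{step}")

opponent = pool.sample()
reward = sparse_binary_reward("mafia", "citizen")  # losing side gets 0.0
```

Because the reward is only assigned at the end of a full game, PPO must propagate credit across every dialogue turn the model produced, which is consistent with the abstract's description of sparse outcome-based rewards.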

Table of Contents

Abstract
I. INTRODUCTION
II. METHODOLOGY
A. Game Environment Setup
B. Training Setup
C. Opponent Pool Design
III. EXPERIMENTS AND RESULTS
IV. CONCLUSION
ACKNOWLEDGMENT
REFERENCES

Author Information

  • Jiho Jun School of Electrical Engineering Korea University Seoul, Korea
  • Junhee Seok School of Electrical Engineering Korea University Seoul, Korea
