# psro

**Repository Path**: bobhjybaba/psro

## Basic Information

- **Project Name**: psro
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-31
- **Last Updated**: 2025-09-17

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# PSRO Research Implementation

This directory contains implementations for Policy Space Response Oracles (PSRO) research on Leduc Poker using OpenSpiel.

## Files

### `oracle_trainer.py`

Main PSRO implementation with oracle training, featuring:

- `OracleTrainer`: best-response computation using OpenSpiel's `BestResponsePolicy`
- `PerPlayerMixturePolicy`: mixed-strategy implementation for opponents
- Complete PSRO training loop with meta-strategy updates
- Support for both best-response and reactor-lite modes

### `psro_helper.py`

Core PSRO infrastructure, including:

- `SimpleAgent`: agent class managing strategy pools and meta-probabilities
- `MetaGame`: payoff-matrix management and strategy evaluation
- `RandomStrategy`: baseline random policy implementation
- Episode-based strategy evaluation system

### `meta_solver.py`

Meta-strategy solvers for population mixing:

- Regret-matching implementation
- Support for different meta-game solution approaches

## Leduc Poker Rules

Leduc poker is a simplified poker variant designed for AI research:

- **Deck**: 6 cards total (2 suits × 3 ranks: Jack, Queen, King)
- **Players**: 2 (configurable)
- **Rounds**: 2 betting rounds
- **Structure**:
  - Round 1: each player receives 1 private card
  - Round 2: 1 public card is revealed
- **Betting**:
  - Round 1: raise amount = 2 chips
  - Round 2: raise amount = 4 chips
  - Maximum of 2 raises per round
- **Starting**: each player starts with 1 chip in the pot

## Usage

### PSRO Training with Oracle

```bash
cd /root/psro
python3 oracle_trainer.py
```

This will:

1. Initialize agents with random strategies
2. Run PSRO iterations (configurable, default: 100)
3. Compute best responses using OpenSpiel's `BestResponsePolicy`
4. Update meta-strategies using regret matching
5. Evaluate strategy pairs and build payoff matrices
6. Show population growth and meta-strategy evolution

### Helper Components Testing

```bash
cd /root/psro
python3 psro_helper.py
```

Runs basic tests of the core PSRO components with CFR-based strategies.

## Requirements

- OpenSpiel installed and available in the Python environment
- Python 3.6+
- NumPy
- PySpiel (part of OpenSpiel)

## Key Concepts

### Strategy vs. Policy Design Convention

To improve code readability and keep a clear distinction between the different kinds of strategic objects, we follow this naming convention:

#### Strategy

- **Definition**: custom classes that implement deterministic action selection
- **Required method**: `get_action(state)` - returns a single action choice
- **Type**: deterministic behavior
- **Examples**: `RandomStrategy`, `BRStrategy`, `CFRStrategy`, `SimpleAgent`

#### Policy

- **Definition**: objects that follow OpenSpiel's policy interface
- **Required method**: `action_probabilities(state)` - returns a probability distribution over actions as `{action: probability}`
- **Type**: probabilistic behavior
- **Examples**: `os_policy.Policy`, `PerPlayerMixturePolicy`, `BestResponsePolicy`

#### Interface Compatibility

The framework supports both types through automatic adaptation:

```python
# In PerPlayerMixturePolicy.action_probabilities()
if hasattr(obj, "action_probabilities"):
    # Probabilistic policy - use directly
    ap = obj.action_probabilities(state)
elif hasattr(obj, "get_action"):
    # Deterministic strategy - convert to a probability distribution
    action = obj.get_action(state)
    ap = {action: 1.0}
```

This design allows:

- OpenSpiel policies to be used directly
- Custom deterministic strategies to integrate seamlessly
- Mixed strategy pools containing both types
- A clear distinction between probabilistic and deterministic approaches

### PSRO (Policy Space Response Oracles)

- Maintains a population of policies for each player
- Iteratively computes best responses against the current populations
- Updates meta-strategies for selecting from the populations
- Aims to find diverse, robust strategy sets

### Best Response Oracles

- Uses OpenSpiel's `BestResponsePolicy` for exact best-response computation
- More efficient than CFR for computing responses to mixed strategies
- Builds information-state sets for optimal decision making

### Meta-Strategy Optimization

- Uses regret matching to update population mixing weights
- Balances exploitation of current strategies with exploration of new ones
- Maintains diverse strategy populations

## Customization

You can modify parameters in `oracle_trainer.py`:

- `iterations`: number of PSRO iterations (default: 100, line 168)
- `episodes`: episodes per strategy evaluation (default: 1000)
- `injection_prob`: new-strategy injection probability (default: 1e-3)
- `mode`: oracle mode, either 'br' for best response or 'reactor-lite' (lines 129-130)

## Expected Output

The PSRO implementation shows:

- Growing strategy populations for both players
- Evolving meta-strategy distributions
- Payoff-matrix updates as new strategies are evaluated

Example output (translated from the Chinese log messages):

```
=== Iteration 1 ===
payoff matrix in agent0 view:
[[ 0.123 -0.456]]
payoff matrix in agent1 view:
[[-0.123  0.456]]
Strategy pool sizes: 2 vs 2
Meta-strategy: ['0.999', '0.001']
```

# TODO List

- ~~Introduce PRD + γ exploration (replacing/extending RD): `meta_solver.py`~~
  - ~~At the end of `solve()`, apply `sigma = (1-γ)*sigma + γ*Uniform`, then project onto the simplex with a lower bound~~
  - ~~Keep `min_prob` only as a numerical safeguard~~
- Unify the strategy protocol: `psro_helper.py` / `oracle_trainer.py`
  - Define `action_probabilities(state)` as the required interface; make `get_action()` optional (default: sample according to the probabilities)
- Complete the probability interface of `BRStrategy`: `oracle_trainer.py`
  - Implement `action_probabilities` in `BRStrategy` (prefer forwarding the OpenSpiel BR probabilities; otherwise build a δ-distribution from `get_action` and mask illegal actions)
- Unify the meta-strategy update frequency: `oracle_trainer.py` main loop
  - Each round: train P0 → fill in the new payoff row → train P1 → fill in the new payoff column → update the meta-solver only once
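The PSRO loop described in the Usage section (initialize, best-respond, evaluate, re-solve the meta-game) can be sketched as follows. This is a minimal illustration, not the code in `oracle_trainer.py`: `psro_loop` and its callable parameters (`best_response`, `evaluate`, `meta_solve`) are hypothetical names, and for brevity the payoff matrix is recomputed in full each round rather than extended by one row and one column as the TODO list suggests.

```python
import numpy as np

def psro_loop(initial_strategies, best_response, evaluate, meta_solve,
              iterations=100):
    """Minimal PSRO skeleton (illustrative, not the repo's implementation).

    best_response(player, opponent_pool, opponent_sigma) -> new strategy
    evaluate(s0, s1) -> estimated payoff to player 0
    meta_solve(payoffs) -> (sigma0, sigma1) meta-strategies
    """
    pools = [list(initial_strategies[0]), list(initial_strategies[1])]
    payoffs = np.array([[evaluate(s0, s1) for s1 in pools[1]]
                        for s0 in pools[0]])
    sigmas = meta_solve(payoffs)
    for _ in range(iterations):
        # Each player best-responds to the opponent's current mixture.
        br0 = best_response(0, pools[1], sigmas[1])
        br1 = best_response(1, pools[0], sigmas[0])
        pools[0].append(br0)
        pools[1].append(br1)
        # Re-evaluate the (grown) payoff matrix; a real implementation
        # would only fill in the new row and column.
        payoffs = np.array([[evaluate(s0, s1) for s1 in pools[1]]
                            for s0 in pools[0]])
        # Update the meta-solver once per round, as the TODO prescribes.
        sigmas = meta_solve(payoffs)
    return pools, payoffs, sigmas
```

With toy "strategies" (plain numbers), each iteration grows both pools by one and the payoff matrix by one row and one column, mirroring the population growth reported in the expected output.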
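The regret-matching meta-solver mentioned for `meta_solver.py` can be sketched as self-play on the empirical payoff matrix. The function below is an illustrative stand-alone version (not the repo's `solve()`): it accumulates per-action regrets for both players of a zero-sum meta-game and returns the time-averaged mixtures, which are what PSRO uses as meta-strategies.

```python
import numpy as np

def regret_matching(payoffs, iterations=5000):
    """Approximate equilibrium meta-strategies of a two-player zero-sum
    meta-game via regret-matching self-play (illustrative sketch).

    payoffs: (n, m) matrix of row-player payoffs; the column player
    receives the negation. Returns the averaged (sigma_row, sigma_col).
    """
    n, m = payoffs.shape
    regrets_row, regrets_col = np.zeros(n), np.zeros(m)
    avg_row, avg_col = np.zeros(n), np.zeros(m)

    def current(regrets):
        # Play proportionally to positive regret; uniform if none.
        pos = np.maximum(regrets, 0.0)
        total = pos.sum()
        return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

    for _ in range(iterations):
        sigma_row, sigma_col = current(regrets_row), current(regrets_col)
        # Value of each pure strategy against the opponent's mixture.
        u_row = payoffs @ sigma_col
        u_col = -(sigma_row @ payoffs)
        # Instantaneous regret: action value minus the mixture's value.
        regrets_row += u_row - sigma_row @ u_row
        regrets_col += u_col - sigma_col @ u_col
        avg_row += sigma_row
        avg_col += sigma_col

    return avg_row / iterations, avg_col / iterations

# 2x2 zero-sum game whose equilibrium mixes (0.4, 0.6) for both players;
# the averaged strategies should approach that mixture.
sigma0, sigma1 = regret_matching(np.array([[2.0, -1.0], [-1.0, 1.0]]))
```

In zero-sum games the time-averaged strategies of regret matching converge to a Nash equilibrium, which is why averaging (rather than taking the last iterate) is used.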
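The completed TODO item on PRD + γ exploration amounts to one extra step at the end of `solve()`. A minimal sketch of that step (function name and defaults are illustrative, not the repo's code):

```python
import numpy as np

def explore_and_project(sigma, gamma=0.05, min_prob=1e-3):
    """Mix a gamma fraction of the uniform distribution into sigma, then
    clip/renormalize back onto the simplex (sketch of the TODO item)."""
    sigma = np.asarray(sigma, dtype=float)
    n = len(sigma)
    # gamma-uniform exploration: every strategy keeps at least gamma/n mass.
    sigma = (1.0 - gamma) * sigma + gamma / n
    # min_prob acts only as a numerical safeguard: the uniform mixing
    # already enforces a lower bound whenever gamma/n >= min_prob.
    sigma = np.maximum(sigma, min_prob)
    return sigma / sigma.sum()
```

The design point is that exploration, not the clipping, provides the real lower bound; `min_prob` merely guards against floating-point drift, matching the "numerical safeguard only" note in the TODO list.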
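The "unify the strategy protocol" TODO can be realized with a small base class in which `action_probabilities(state)` is the required interface and `get_action()` defaults to sampling from it. The sketch below uses hypothetical names (`StrategyProtocol`, `AlwaysFirstLegal`, `FakeState`) purely for illustration; the real code would sit in `psro_helper.py` and operate on OpenSpiel states.

```python
import random

class StrategyProtocol:
    """Unified protocol sketch: action_probabilities() is mandatory,
    get_action() is derived by sampling (hypothetical base class)."""

    def action_probabilities(self, state):
        raise NotImplementedError

    def get_action(self, state):
        # Default: sample an action according to the probabilities.
        probs = self.action_probabilities(state)
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights, k=1)[0]

class AlwaysFirstLegal(StrategyProtocol):
    """Toy deterministic strategy: a delta distribution on one action,
    analogous to what BRStrategy's completed interface would return."""

    def action_probabilities(self, state):
        return {state.legal_actions()[0]: 1.0}

class FakeState:
    """Stand-in for an OpenSpiel state, exposing only legal_actions()."""

    def legal_actions(self):
        return [0, 1, 2]

action = AlwaysFirstLegal().get_action(FakeState())
```

Under this protocol, deterministic strategies become the special case of a delta distribution, so the `hasattr` adaptation in `PerPlayerMixturePolicy` is no longer needed.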