# psro

**Repository Path**: bobhjybaba/psro

## Basic Information

- **Project Name**: psro
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-31
- **Last Updated**: 2025-09-17

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# PSRO Research Implementation

This directory contains implementations for Policy Space Response Oracles (PSRO) research on Leduc Poker using OpenSpiel.

## Files

### `oracle_trainer.py`

Main PSRO implementation with oracle training, featuring:

- `OracleTrainer`: best-response computation using OpenSpiel's `BestResponsePolicy`
- `PerPlayerMixturePolicy`: mixed-strategy implementation for opponents
- Complete PSRO training loop with meta-strategy updates
- Support for both best-response and reactor-lite modes

### `psro_helper.py`

Core PSRO infrastructure, including:

- `SimpleAgent`: agent class managing strategy pools and meta-probabilities
- `MetaGame`: payoff-matrix management and strategy evaluation
- `RandomStrategy`: baseline random policy implementation
- Episode-based strategy evaluation system

### `meta_solver.py`

Meta-strategy solvers for population mixing:

- Regret-matching implementation
- Support for different meta-game solution approaches

## Leduc Poker Rules

Leduc poker is a simplified poker variant designed for AI research:

- **Deck**: 6 cards total (2 suits × 3 ranks: Jack, Queen, King)
- **Players**: 2 (configurable)
- **Rounds**: 2 betting rounds
- **Structure**:
  - Round 1: each player receives 1 private card
  - Round 2: 1 public card is revealed
- **Betting**:
  - Round 1: raise amount = 2 chips
  - Round 2: raise amount = 4 chips
  - Maximum of 2 raises per round
- **Starting**: each player starts with 1 chip in the pot

## Usage

### PSRO Training with Oracle

```bash
cd /root/psro
python3 oracle_trainer.py
```

This will:

1. Initialize agents with random strategies
2. Run PSRO iterations (configurable, default: 100)
3. Compute best responses using OpenSpiel's `BestResponsePolicy`
4. Update meta-strategies using regret matching
5. Evaluate strategy pairs and build payoff matrices
6. Show population growth and meta-strategy evolution

### Helper Components Testing

```bash
cd /root/psro
python3 psro_helper.py
```

Runs basic tests of the core PSRO components with CFR-based strategies.

## Requirements

- OpenSpiel installed and available in the Python environment
- Python 3.6+
- NumPy
- PySpiel (part of OpenSpiel)

## Key Concepts

### Strategy vs. Policy Design Convention

To improve code readability and keep a clear distinction between the different kinds of strategic objects, we follow this naming convention:

#### Strategy

- **Definition**: custom classes that implement deterministic action selection
- **Required method**: `get_action(state)` - returns a single action choice
- **Type**: deterministic behavior
- **Examples**: `RandomStrategy`, `BRStrategy`, `CFRStrategy`, `SimpleAgent`

#### Policy

- **Definition**: objects that follow OpenSpiel's policy interface
- **Required method**: `action_probabilities(state)` - returns a probability distribution over actions as `{action: probability}`
- **Type**: probabilistic behavior
- **Examples**: `os_policy.Policy`, `PerPlayerMixturePolicy`, `BestResponsePolicy`

#### Interface Compatibility

The framework supports both types through automatic adaptation:

```python
# In PerPlayerMixturePolicy.action_probabilities()
if hasattr(obj, "action_probabilities"):
    # Probabilistic policy - use directly
    ap = obj.action_probabilities(state)
elif hasattr(obj, "get_action"):
    # Deterministic strategy - convert to a probability distribution
    action = obj.get_action(state)
    ap = {action: 1.0}
```

This design allows:

- OpenSpiel policies to be used directly
- Custom deterministic strategies to integrate seamlessly
- Mixed strategy pools containing both types
- A clear distinction between probabilistic and deterministic approaches

### PSRO (Policy Space Response Oracles)

- Maintains a population of policies for each player
- Iteratively computes best responses against the current populations
- Updates meta-strategies for selecting from the populations
- Aims to find diverse, robust strategy sets

### Best Response Oracles

- Uses OpenSpiel's `BestResponsePolicy` for exact best-response computation
- More efficient than CFR for computing responses to mixed strategies
- Builds information-state sets for optimal decision making

### Meta-Strategy Optimization

- Uses regret matching to update population mixing weights
- Balances exploitation of current strategies with exploration of new ones
- Maintains diverse strategy populations

## Customization

You can modify parameters in `oracle_trainer.py`:

- `iterations`: number of PSRO iterations (default: 100, line 168)
- `episodes`: episodes per strategy evaluation (default: 1000)
- `injection_prob`: new-strategy injection probability (default: 1e-3)
- `mode`: oracle mode, either 'br' for best response or 'reactor-lite' (lines 129-130)

## Expected Output

The PSRO implementation shows:

- Growing strategy populations for both players
- Evolving meta-strategy distributions
- Payoff-matrix updates as new strategies are evaluated

Example output (translated from the Chinese log messages):

```
=== Iteration 1 ===
payoff matrix in agent0 view:
[[ 0.123 -0.456]]
payoff matrix in agent1 view:
[[-0.123  0.456]]
Strategy pool sizes: 2 vs 2
Meta-strategy: ['0.999', '0.001']
```

# TODO List

- ~~Introduce PRD + γ exploration (replacing/extending RD): `meta_solver.py`~~
  - ~~At the end of `solve()`, apply `sigma = (1-γ)*sigma + γ*Uniform`, then project onto the simplex with a lower bound~~
  - ~~Keep `min_prob` only as a numerical safeguard~~
- Unify the strategy protocol: `psro_helper.py` / `oracle_trainer.py`
  - Define `action_probabilities(state)` as the required interface; make `get_action()` optional (default: sample according to the probabilities)
- Complete the probability interface of `BRStrategy`: `oracle_trainer.py`
  - Implement `action_probabilities` in `BRStrategy` (prefer forwarding the OpenSpiel BR probabilities; otherwise build a δ-distribution from `get_action` and mask illegal actions)
- Unify the meta-strategy update frequency: `oracle_trainer.py` main loop
  - Each round: train P0 → fill in the new payoff row → train P1 → fill in the new payoff column → update the meta-solver only once
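The PSRO loop described in the Usage section (initialize, best-respond, evaluate, re-solve the meta-game) can be sketched as follows. This is a minimal illustration, not the code in `oracle_trainer.py`: `psro_loop` and its callable parameters (`best_response`, `evaluate`, `meta_solve`) are hypothetical names, and for brevity the payoff matrix is recomputed in full each round rather than extended by one row and one column as the TODO list suggests.

```python
import numpy as np

def psro_loop(initial_strategies, best_response, evaluate, meta_solve,
              iterations=100):
    """Minimal PSRO skeleton (illustrative, not the repo's implementation).

    best_response(player, opponent_pool, opponent_sigma) -> new strategy
    evaluate(s0, s1) -> estimated payoff to player 0
    meta_solve(payoffs) -> (sigma0, sigma1) meta-strategies
    """
    pools = [list(initial_strategies[0]), list(initial_strategies[1])]
    payoffs = np.array([[evaluate(s0, s1) for s1 in pools[1]]
                        for s0 in pools[0]])
    sigmas = meta_solve(payoffs)
    for _ in range(iterations):
        # Each player best-responds to the opponent's current mixture.
        br0 = best_response(0, pools[1], sigmas[1])
        br1 = best_response(1, pools[0], sigmas[0])
        pools[0].append(br0)
        pools[1].append(br1)
        # Re-evaluate the (grown) payoff matrix; a real implementation
        # would only fill in the new row and column.
        payoffs = np.array([[evaluate(s0, s1) for s1 in pools[1]]
                            for s0 in pools[0]])
        # Update the meta-solver once per round, as the TODO prescribes.
        sigmas = meta_solve(payoffs)
    return pools, payoffs, sigmas
```

With toy "strategies" (plain numbers), each iteration grows both pools by one and the payoff matrix by one row and one column, mirroring the population growth reported in the expected output.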
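The regret-matching meta-solver mentioned for `meta_solver.py` can be sketched as self-play on the empirical payoff matrix. The function below is an illustrative stand-alone version (not the repo's `solve()`): it accumulates per-action regrets for both players of a zero-sum meta-game and returns the time-averaged mixtures, which are what PSRO uses as meta-strategies.

```python
import numpy as np

def regret_matching(payoffs, iterations=5000):
    """Approximate equilibrium meta-strategies of a two-player zero-sum
    meta-game via regret-matching self-play (illustrative sketch).

    payoffs: (n, m) matrix of row-player payoffs; the column player
    receives the negation. Returns the averaged (sigma_row, sigma_col).
    """
    n, m = payoffs.shape
    regrets_row, regrets_col = np.zeros(n), np.zeros(m)
    avg_row, avg_col = np.zeros(n), np.zeros(m)

    def current(regrets):
        # Play proportionally to positive regret; uniform if none.
        pos = np.maximum(regrets, 0.0)
        total = pos.sum()
        return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

    for _ in range(iterations):
        sigma_row, sigma_col = current(regrets_row), current(regrets_col)
        # Value of each pure strategy against the opponent's mixture.
        u_row = payoffs @ sigma_col
        u_col = -(sigma_row @ payoffs)
        # Instantaneous regret: action value minus the mixture's value.
        regrets_row += u_row - sigma_row @ u_row
        regrets_col += u_col - sigma_col @ u_col
        avg_row += sigma_row
        avg_col += sigma_col

    return avg_row / iterations, avg_col / iterations

# 2x2 zero-sum game whose equilibrium mixes (0.4, 0.6) for both players;
# the averaged strategies should approach that mixture.
sigma0, sigma1 = regret_matching(np.array([[2.0, -1.0], [-1.0, 1.0]]))
```

In zero-sum games the time-averaged strategies of regret matching converge to a Nash equilibrium, which is why averaging (rather than taking the last iterate) is used.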
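The completed TODO item on PRD + γ exploration amounts to one extra step at the end of `solve()`. A minimal sketch of that step (function name and defaults are illustrative, not the repo's code):

```python
import numpy as np

def explore_and_project(sigma, gamma=0.05, min_prob=1e-3):
    """Mix a gamma fraction of the uniform distribution into sigma, then
    clip/renormalize back onto the simplex (sketch of the TODO item)."""
    sigma = np.asarray(sigma, dtype=float)
    n = len(sigma)
    # gamma-uniform exploration: every strategy keeps at least gamma/n mass.
    sigma = (1.0 - gamma) * sigma + gamma / n
    # min_prob acts only as a numerical safeguard: the uniform mixing
    # already enforces a lower bound whenever gamma/n >= min_prob.
    sigma = np.maximum(sigma, min_prob)
    return sigma / sigma.sum()
```

The design point is that exploration, not the clipping, provides the real lower bound; `min_prob` merely guards against floating-point drift, matching the "numerical safeguard only" note in the TODO list.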
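The "unify the strategy protocol" TODO can be realized with a small base class in which `action_probabilities(state)` is the required interface and `get_action()` defaults to sampling from it. The sketch below uses hypothetical names (`StrategyProtocol`, `AlwaysFirstLegal`, `FakeState`) purely for illustration; the real code would sit in `psro_helper.py` and operate on OpenSpiel states.

```python
import random

class StrategyProtocol:
    """Unified protocol sketch: action_probabilities() is mandatory,
    get_action() is derived by sampling (hypothetical base class)."""

    def action_probabilities(self, state):
        raise NotImplementedError

    def get_action(self, state):
        # Default: sample an action according to the probabilities.
        probs = self.action_probabilities(state)
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights, k=1)[0]

class AlwaysFirstLegal(StrategyProtocol):
    """Toy deterministic strategy: a delta distribution on one action,
    analogous to what BRStrategy's completed interface would return."""

    def action_probabilities(self, state):
        return {state.legal_actions()[0]: 1.0}

class FakeState:
    """Stand-in for an OpenSpiel state, exposing only legal_actions()."""

    def legal_actions(self):
        return [0, 1, 2]

action = AlwaysFirstLegal().get_action(FakeState())
```

Under this protocol, deterministic strategies become the special case of a delta distribution, so the `hasattr` adaptation in `PerPlayerMixturePolicy` is no longer needed.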