# LittleAgent

**Repository Path**: cyddcydd/LittleAgent

## Basic Information

- **Project Name**: LittleAgent
- **Description**: 学习RAG/SFT/RL/Agent
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-01
- **Last Updated**: 2026-02-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# 智扫通机器人智能客服 / LittleAgent

面向扫地/扫拖机器人的智能客服项目，融合了 RAG 检索增强、Tool 工具调用、ReAct Agent、SFT、GRPO 等多项技术。


## 前端界面
![alt text](img/前端.png)


### 快速开始（API版本）
前端目前接入的是API版本，还没有接入训练后的SFT或者GRPO后的模型

```bash
# 1. 安装依赖并配置 config/ 下 YAML（如 API Key）
# 2. 加载知识库（首次或更新文档后）
python -m rag.vectore_store
# 3. 启动客服前端
streamlit run app.py
```

---

## 项目进度概览

| 模块 | 状态 | 说明 |
|------|------|------|
| **RAG + 高级检索** | 已实现 | 多查询扩展(MQE)、HyDE、扩展检索框架，Chroma 向量库 |
| **Tool + ReAct Agent** | 已实现 | 多工具编排、中间件（监控/日志/动态 Prompt） |
| **SFT 监督微调** | 已实现 | 482条训练数据，BERTScore 0.6553 |
| **GRPO 强化学习** | 进行中 | 6个自定义奖励函数四个好了，就是工具的奖励函数不知道怎么写，还有这个参数也太玄学了 |


---

## 模块介绍

### 1. RAG 检索增强

- **作用**：用知识库（故障排除、维护保养、选购指南等）增强回答，减少幻觉。
- **实现**：
  - 基础检索：Chroma + 向量相似度，支持 txt/pdf/csv 入库（`rag/vectore_store.py`）。
  - **高级检索**（`rag/advanced_retrieval.py`）：
    - **MQE**：多查询扩展，生成多条语义等价问句提高召回。
    - **HyDE**：假设文档嵌入，先生成假设答案再检索相似文档。
    - **扩展检索框架**：MQE + HyDE 多路检索后合并去重、排序。
    - 多轮对话：可选查询改写，结合历史上下文。
  - 策略可选：`base` / `mqe` / `hyde` / `expanded` / `auto`，由 `rag_summarize` 等工具透出。

配置：`config/retrieval.yml`、`config/chroma.yml`。知识库加载：运行 `python -m rag.vectore_store`（或 `python rag/vectore_store.py`）从 `data/` 导入文档。

---

### 2. Tool 工具调用

- **作用**：Agent 根据用户问题决定是否调用、调用哪些工具，并汇总结果生成回复。
- **已有工具**（`agent/tools/agent_tools.py`）：
  - **rag_summarize**：RAG 检索并总结（支持上述检索策略）。
  - rag_search_mqe / rag_search_hyde / rag_search_expanded：指定策略的检索。
  - get_weather、get_user_location、get_user_id、get_current_month：示例/业务工具。
  - fetch_external_data：按用户与月份拉取使用记录。
  - fill_context_for_report：为报告场景注入上下文（配合中间件）。
- 工具均通过 `@tool` 注册，由 ReAct Agent 在推理时选择并执行。

---

### 3. ReAct Agent

- **作用**：实现「推理 + 行动」循环：根据当前对话决定下一步是调用工具还是直接回答，支持多轮工具调用。
- **实现**：基于 LangChain `create_agent`（LangGraph），模型 + 系统提示词 + 工具列表 + 中间件。
- **中间件**（`agent/tools/middleware.py`）：
  - **monitor_tool**：工具调用前后打日志，并可改写上下文（如 report 场景）。
  - **log_before_model**：每次调用模型前记录消息条数及简要内容。
  - **report_prompt_switch**：按上下文动态切换/注入报告相关 Prompt。
- 入口：`agent/react_agent.py` 的 `ReactAgent`，对外提供 `execute_stream(query)` 流式输出。

---

### 4. SFT 监督微调

- **作用**：用「问题 + 标准答案」数据对基座模型做 LoRA 微调，使回复更贴合扫地/扫拖机器人客服话术与领域知识。
- **当前进展**：
  - 训练数据：482条高质量问答对，覆盖17个业务类别
  - 测试数据：73条，采用分层采样确保类别平衡
  - 基座模型：Qwen2-0.5B-Instruct
  - 训练方法：LoRA参数高效微调
  - **当前最佳**：BERTScore 0.6553, BLEU 0.0275
  - **优化目标**：BERTScore ≥ 0.70
- **评估指标**：
  - **主要指标**：BERTScore（语义相似度，更适合问答系统）
  - 辅助指标：BLEU、ROUGE
  - 人工评估：准确性、有用性、自然性、简洁性
- **目录**：`sft/`  
  - `data/`：`train.jsonl`（482条）、`test.jsonl`（73条）
  - `config/train_config.yaml`：训练配置（10 epochs, LoRA r=16）
  - `scripts/`：训练、评估、数据处理脚本
  - `output/`：模型checkpoint和评估报告
- **依赖**：`sft/requirements_sft.txt`（torch、transformers、peft、datasets 等）
- **说明**：详见 `sft/README.md`

---

### 5. GRPO 强化学习

- **目标**：在 SFT 基础上，通过 GRPO（Group Relative Policy Optimization）优化回复质量、鼓励“像用了工具/RAG”的专业回答（领域术语、结构清晰），并控制简洁性。
- **当前进展**：
  - **技术栈**：TRL 0.28.0 + Qwen2-0.5B-Instruct + SFT LoRA 合并 + 新 LoRA
  - **数据**：基于 SFT 数据生成 chat 格式偏好数据（prompt + solution + category），472 条训练 + 83 条验证
  - **奖励函数**（6 个，无需单独奖励模型）：
    - `format_reward`：格式检查（中文、无乱码、无重复）
    - `relevance_reward`：与参考答案的 **BERTScore** 语义相似度
    - `completeness_reward`：关键信息点覆盖度
    - `conciseness_reward`：长度适中性（30–200 字最优，抑制冗长）
    - `tool_use_reward`：与 Agent 工具对齐，奖励回答中的领域术语与结构化表述
    - `rag_use_reward`：与 rag_summarize 对齐，奖励与参考答案相关且结构清晰的回答
  - **训练**：v1（3 epochs）、v2（2 epochs 调优简洁性）已完成；总奖励分提升，回答更全面，v2 加强简洁性约束
- **目录**：`grpo/`
  - `config/grpo_config.yaml`：学习率、beta、num_generations、奖励权重等
  - `data/`：train_preferences.jsonl、eval_preferences.jsonl
  - `scripts/`：train_grpo.py、generate_preferences.py、evaluate_grpo.py
  - `output/grpo_run/`：最终 LoRA 权重、TensorBoard 日志、评估结果
- **说明**：详见 `grpo/README.md`、`grpo/GRPO_PLAN.md`

---

## 使用说明

### 环境与配置

- Python 3.10+，安装项目依赖（含 LangChain、LangGraph、DashScope 等）。
- 配置 `config/` 下 YAML：如 `config/rag.yml`（模型名、Embedding）、`config/chroma.yml`（向量库路径、数据路径）、`config/retrieval.yml`（检索策略）、`config/agent.yml` 等；需填写可用的 API Key（如 DashScope）。

### 知识库加载（RAG）

```bash
# 在项目根目录执行，将 data/ 下 txt/pdf/csv 导入 Chroma
python -m rag.vectore_store
```

### 启动智能客服前端

```bash
streamlit run app.py
```

浏览器打开后即可与 Agent 对话；Agent 会按需调用 RAG、天气、用户信息、外部数据等工具并流式输出。

### 仅测试 Agent（无前端）

```bash
python agent/react_agent.py
```

### SFT 训练与评估

```bash
cd sft
pip install -r requirements_sft.txt

# 训练SFT模型
python scripts/train_sft.py

# 评估模型
python scripts/evaluate_sft.py \
  --base_model Qwen/Qwen2-0.5B-Instruct \
  --finetuned_model output/sft_run/final \
  --test_data data/test.jsonl \
  --output_dir output/evaluation

# 查看评估报告
cat output/evaluation/comparison_report.md
```

### GRPO 训练与评估

```bash
cd grpo
pip install -r requirements.txt

# 1. 生成训练数据（推荐：rule_based，直接用 SFT 数据转 chat 格式）
python scripts/generate_preferences.py --method rule_based --split --eval_ratio 0.15

# 2. 训练 GRPO（读取 config/grpo_config.yaml）
python scripts/train_grpo.py

# 3. 评估 GRPO vs SFT（对比 BERTScore、总奖励分等）
python scripts/evaluate_grpo.py \
  --grpo_model output/grpo_run/final \
  --sft_model ../sft/output/sft_run/final \
  --test_data ../sft/data/test.jsonl \
  --output output/evaluation_results.json
```

一键流程：`./train_grpo.sh`（可选 `--data-only` / `--train-only` / `--eval-only`）。

---

## 项目结构

```
├── agent/                      # ReAct Agent、工具与中间件
│   ├── react_agent.py          # Agent 入口，execute_stream 流式输出
│   └── tools/
│       ├── agent_tools.py      # 工具定义：rag_summarize、天气、用户/月份、外部数据等
│       └── middleware.py       # 中间件：工具监控、调用前日志、报告 Prompt 切换
├── rag/                        # RAG 与高级检索
│   ├── vectore_store.py       # Chroma 向量库、知识库加载 load_document
│   ├── rag_service.py         # RAG 检索服务、rag_summarize 链
│   └── advanced_retrieval.py  # 高级检索：MQE、HyDE、扩展检索、查询改写
├── model/                      # 模型工厂
│   └── factory.py             # 对话模型、Embedding（如 DashScope）
├── config/                     # 配置
│   ├── rag.yml
│   ├── chroma.yml
│   ├── retrieval.yml
│   ├── agent.yml
│   └── prompts.yml
├── prompts/                    # 提示词模板
│   ├── main_prompt.txt
│   ├── rag_summarize.txt
│   ├── mqe_prompt.txt
│   ├── hyde_prompt.txt
│   ├── query_rewrite_prompt.txt
│   └── report_prompt.txt
├── utils/                      # 工具函数
│   ├── config_handler.py
│   ├── file_handler.py
│   ├── logger_handler.py
│   ├── prompt_loader.py
│   ├── path_tools.py
│   └── text_splitter.py
├── data/                       # 知识库原始文档
│   ├── *.txt / *.pdf / *.csv
│   └── external/
│       └── records.csv
├── sft/                        # SFT 监督微调
│   ├── data/
│   │   ├── train.jsonl         # 训练数据
│   │   └── test.jsonl          # 测试数据
│   ├── config/
│   │   └── train_config.yaml   # 训练配置
│   ├── scripts/
│   │   ├── train_sft.py        # 训练脚本
│   │   ├── evaluate_sft.py     # 评估脚本
│   │   └── ...                 # 数据处理脚本
│   ├── output/
│   │   ├── sft_run/            # 训练输出
│   │   └── evaluation/         # 评估报告
│   ├── requirements_sft.txt
│   └── README.md
├── grpo/                       # GRPO 强化学习
│   ├── data/                   # 偏好数据
│   │   ├── train_preferences.jsonl
│   │   ├── eval_preferences.jsonl
│   │   └── test_preferences.jsonl
│   ├── config/
│   │   └── grpo_config.yaml    # GRPO训练配置
│   ├── scripts/
│   │   ├── train_grpo.py       # GRPO训练脚本
│   │   ├── generate_preferences.py  # 偏好数据生成
│   │   └── evaluate_grpo.py    # 评估脚本
│   ├── output/                 # 训练输出
│   ├── models/                 # 模型存储
│   ├── README.md
│   └── requirements.txt
├── chroma_db/                  # Chroma 持久化目录
├── logs/                       # 运行日志
├── app.py                      # Streamlit 前端入口
├── test.py                     # 检索与 Agent 测试
└── README.md
```

---

## 性能指标

### SFT模型（当前最佳）

| 指标 | 基座模型 | 微调模型 | 提升 |
|------|---------|---------|------|
| **BERTScore** | 0.6228 | **0.6553** | +5.2% |
| BLEU | 0.0171 | 0.0275 | +60.8% |
| ROUGE-L | 0.1160 | 0.1347 | +16.1% |

**评估说明**：
- **BERTScore为主要指标**（语义相似度，更适合问答系统）
- BLEU、ROUGE为辅助参考
- 目标：BERTScore ≥ 0.70

### GRPO模型

| 指标 | SFT 基线 | GRPO（v1/v2） | 说明 |
|------|----------|----------------|------|
| 总奖励分 | ~0.22 | ~0.24（v1）、调优中（v2） | 相关性/完整性奖励提升明显 |
| BERTScore F1 | ~0.66 | ~0.63 | 回答更长更详细时略降；v2 加强简洁性以平衡 |
| 平均回答长度 | ~135 字 | v1 偏长（~334 字），v2 约束中 | 通过 conciseness_reward 权重与 max_completion_length 控制 |
| 目标 | - | BERTScore ≥0.72，长度 30–200 字 | 持续调参与奖励权重优化 |


---

## 参考资料

- [Datawhale Hello-Agents](https://datawhalechina.github.io/hello-agents/#/)
- [zst_agent (Gitee)](https://gitee.com/javacaoyu/zst_agent)


**联系方式**：zengyicydd@tju.edu.cn