# word-document-tool

**Repository Path**: pandacouple/word-document-tool

## Basic Information

- **Project Name**: word-document-tool
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-15
- **Last Updated**: 2026-02-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Word Document Tool

A feature-rich Word document processing tool that extracts document metadata, text content, and image resources.

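## Features

- 📄 **Document parsing**: parses documents in DOCX format
- 📋 **Metadata extraction**: extracts the document title, author, creation date, and other metadata
- 📝 **Text extraction**: extracts paragraph text from the document, optionally organized by chapter
- 🖼️ **Image extraction**: extracts image resources from the document in their original format
- 📊 **Page splitting**: splits content into pages precisely, using a reference PDF
- 🔧 **Scene-based processing**: supports two processing scenes, analysis mode and duplication mode
- 📱 **Command-line tool**: provides a convenient command-line interface
- 📚 **TypeScript support**: ships complete type definitions for easy integration into TypeScript projects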

## Installation

### Install from NPM

```bash
npm install word-document-tool
```

### Build from source

```bash
git clone https://github.com/yourusername/word-document-tool.git
cd word-document-tool
npm install
npm run build
```

## Usage

### Use as a library

```typescript
import { extractDocument } from 'word-document-tool';

async function main() {
  try {
    // Analysis mode: no PDF is used; mainly produces chapters.json
    const analysisResult = await extractDocument('path/to/document.docx', undefined, {
      scene: 'analysis',
      outputDir: 'path/to/output'
    });

    // Duplication mode: uses both the DOCX and the PDF to produce the full output
    const duplicationResult = await extractDocument('path/to/document.docx', 'path/to/document.pdf', {
      scene: 'duplication',
      outputDir: 'path/to/output'
    });

    console.log('Extraction complete:', analysisResult);
  } catch (error) {
    console.error('Extraction failed:', error);
  }
}

main();
```

### Command-line tool

```bash
# Available directly after installation
word-tool extract --docx path/to/document.docx --pdf path/to/document.pdf --output path/to/output --scene duplication

# Show help
word-tool --help
```
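When the CLI is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch that only uses the subcommand and flags shown above (`extract`, `--docx`, `--pdf`, `--output`, `--scene`). The `buildExtractArgs` helper and the `CliExtractParams` type are hypothetical, and treating `--pdf` as optional for the analysis scene is an assumption rather than documented CLI behavior.

```typescript
// Hypothetical helper: not part of word-document-tool itself.
interface CliExtractParams {
  docx: string;
  output: string;
  scene: 'analysis' | 'duplication';
  pdf?: string; // assumed to be optional when scene is 'analysis'
}

// Builds the argument array for `word-tool extract`, using only the flags
// that appear in the README example.
function buildExtractArgs(params: CliExtractParams): string[] {
  const args = ['extract', '--docx', params.docx];
  if (params.pdf) {
    args.push('--pdf', params.pdf);
  }
  args.push('--output', params.output, '--scene', params.scene);
  return args;
}
```

The resulting array could then be passed to Node's `child_process.execFile('word-tool', args, callback)`, which avoids shell quoting problems with paths that contain spaces.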

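## Scenes

### Analysis scene (analysis)

- **When to use**: only the text content and chapter structure are needed; precise page splitting is not required
- **Input**: a DOCX file only
- **Output**:
  - `chapters.json`: text content organized by chapter

### Duplication scene (duplication)

- **When to use**: precise page splitting and the complete document structure are needed
- **Input**: a DOCX file and its corresponding PDF file
- **Output**:
  - `metadata.json`: document metadata
  - `pages.json`: text content organized by page
  - `images/`: directory of extracted image resources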
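The two scenes differ in the inputs they require and in the files they produce. As a rough sketch, that mapping can be written down as data and used to check inputs before starting an extraction. The `SceneSpec` type, the `SCENES` table, and the `checkSceneInputs` helper are hypothetical names introduced for illustration and are not part of the package's documented API.

```typescript
// Hypothetical helper: not part of word-document-tool's documented API.
type Scene = 'analysis' | 'duplication';

interface SceneSpec {
  requiresPdf: boolean;
  outputs: string[];
}

// Inputs and outputs exactly as listed in the scene descriptions above.
const SCENES: Record<Scene, SceneSpec> = {
  analysis: { requiresPdf: false, outputs: ['chapters.json'] },
  duplication: {
    requiresPdf: true,
    outputs: ['metadata.json', 'pages.json', 'images/'],
  },
};

// Fails early when the chosen scene needs a PDF that was not supplied.
function checkSceneInputs(scene: Scene, pdfPath?: string): SceneSpec {
  const spec = SCENES[scene];
  if (spec.requiresPdf && !pdfPath) {
    throw new Error(`Scene "${scene}" requires a PDF path`);
  }
  return spec;
}
```

Calling `checkSceneInputs('duplication')` without a PDF path throws, while `checkSceneInputs('analysis')` returns the spec listing `chapters.json`.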

## API Documentation

### `extractDocument`

```typescript
export declare function extractDocument(
  docxPath: string,
  pdfPath?: string,
  options?: Partial<ExtractionOptions>
): Promise<ExtractionResult>;
```

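#### Parameters

- `docxPath`: path to the DOCX file
- `pdfPath`: path to the PDF file (required only for the duplication scene)
- `options`: extraction options
  - `scene`: processing scene; allowed values: `analysis`, `duplication`
  - `outputDir`: path to the output directory
  - `extractPdfText`: whether to extract text from the PDF
  - `outputIntermediateProducts`: whether to write intermediate artifacts for debugging
  - `maxLevel`: maximum depth of the table of contents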
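Because `options` is typed as `Partial<ExtractionOptions>`, callers supply only the fields they care about. The sketch below shows one way such a partial object might be merged with defaults. The option names come from the list above, but the field types, the default values, and the names `ExtractionOptionsSketch`, `ILLUSTRATIVE_DEFAULTS`, and `resolveOptions` are all assumptions made for illustration; the package's real `ExtractionOptions` type and defaults may differ.

```typescript
// Assumed shape: names are taken from the parameter list above,
// but the types are guesses and not the package's real definition.
interface ExtractionOptionsSketch {
  scene: 'analysis' | 'duplication';
  outputDir: string;
  extractPdfText: boolean;
  outputIntermediateProducts: boolean;
  maxLevel: number;
}

// Illustrative values only; the tool's actual defaults are not documented here.
const ILLUSTRATIVE_DEFAULTS: ExtractionOptionsSketch = {
  scene: 'analysis',
  outputDir: './output',
  extractPdfText: false,
  outputIntermediateProducts: false,
  maxLevel: 3,
};

// Caller-supplied fields override the defaults; omitted fields keep them.
function resolveOptions(
  partial: Partial<ExtractionOptionsSketch> = {}
): ExtractionOptionsSketch {
  return { ...ILLUSTRATIVE_DEFAULTS, ...partial };
}
```

For example, `resolveOptions({ scene: 'duplication', maxLevel: 2 })` keeps the illustrative `outputDir` while overriding the other two fields.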

#### Return value

```typescript
interface ExtractionResult {
  metadata: DocumentMetadata;
  text: { title: string; sections: any[] };
  images: ExtractedImage[];
  pages: PageContent[];
}
```
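As a small sketch of consuming this result, the helper below builds a one-line summary from the four top-level fields. `ResultLike` is a structural stand-in declared here so that the example is self-contained; the real `DocumentMetadata`, `ExtractedImage`, and `PageContent` types are defined by the package and are not shown in this README, so their fields are deliberately left untouched. The name `summarizeResult` is hypothetical.

```typescript
// Structural stand-in for ExtractionResult; element types are left opaque
// because their fields are not documented in this README.
interface ResultLike {
  metadata: unknown;
  text: { title: string; sections: unknown[] };
  images: unknown[];
  pages: unknown[];
}

// Builds a short human-readable summary from the top-level fields only.
function summarizeResult(result: ResultLike): string {
  return (
    `${result.text.title}: ${result.text.sections.length} section(s), ` +
    `${result.pages.length} page(s), ${result.images.length} image(s)`
  );
}
```

Since the helper relies only on the documented top-level shape, a value returned by `extractDocument` should be usable as the argument, assuming the published types match the interface above.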

## Project structure

```
src/
├── index.ts                   # Main entry point
├── config.ts                  # Configuration management
├── types/                     # Type definitions
├── services/                  # Core services
│   ├── document-converter.ts  # Document format conversion
│   ├── html-converter.ts      # HTML conversion
│   ├── image-extractor.ts     # Image extraction
│   ├── metadata-extractor.ts  # Metadata extraction
│   ├── page-extractor.ts      # Page extraction
│   ├── paragraph-extractor.ts # Paragraph extraction
│   ├── pdf-converter.ts       # PDF conversion
│   └── toc-extractor.ts       # Table of contents extraction
├── utils/                     # Utility functions
└── cli.ts                     # Command-line tool
```

## Development guide

### Running tests

```bash
# Run all tests
npm test

# Run a specific test file
npm test test/scene.test.ts

# Run tests with a verbose report
npm test -- --reporter=verbose
```

### Linting and type checking

```bash
# Run ESLint
npm run lint

# Run the TypeScript type check
npm run typecheck
```

### Building the project

```bash
npm run build
```

## Contributing

Issues and pull requests are welcome!

### Contribution workflow

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a pull request

## License

This project is released under the MIT License. See the [LICENSE](LICENSE) file for details.

## Feedback

If you run into any problems while using the tool, please report them through one of the following channels:

- [GitHub Issues](https://github.com/yourusername/word-document-tool/issues)
- Email: your.email@example.com

## Acknowledgements

- Thanks to all contributors for their hard work
- PDF processing is built on [pdfjs-dist](https://mozilla.github.io/pdf.js/)
- HTML parsing is built on [jsdom](https://github.com/jsdom/jsdom)
- ZIP file handling is built on [adm-zip](https://github.com/cthackers/adm-zip)

---

**Word Document Tool** - making Word document processing simple and efficient!