# nanoserve

**Repository Path**: chen_lin_k/nanoserve

## Basic Information

- **Project Name**: nanoserve
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-02-14
- **Last Updated**: 2026-03-14

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Llama + FlashInfer Integration

This project implements a high-performance integration between HuggingFace Llama models and FlashInfer operators, featuring a custom BlockManager for efficient KV cache management.

## Components

### 1. BlockManager (`block_manager.py`)

- Manages physical blocks for KV cache allocation
- Pre-allocates a large tensor pool to avoid runtime allocation
- Thread-safe block allocation and deallocation
- Optimized for high-performance inference

### 2. ModelExecutor (`model_executor.py`)

- Integrates FlashInfer operators with HuggingFace models
- Handles KV cache mapping and attention computation
- Supports both prefill and decode phases
- Manages FlashInfer wrapper initialization

### 3. Key Features

- **Zero-copy KV cache access**: Direct tensor views without memory copying
- **Efficient block management**: O(1) allocation/deallocation using a deque
- **Thread-safe**: Concurrent access support with locking
- **FlashInfer integration**: Paged KV cache for optimal memory usage
- **Single-layer support**: Focus on core functionality

## Installation

### Prerequisites

```bash
pip install torch numpy
```

### FlashInfer Installation (Optional)

```bash
pip install flashinfer
```

Note: If FlashInfer is not installed, the code falls back to mock implementations for testing.

## Quick Start

```python
import torch

from block_manager import BlockManager
from model_executor import ModelExecutor

# 1. Create BlockManager
block_manager = BlockManager(
    num_blocks=100,
    num_layers=32,
    num_heads=32,
    head_dim=128,
    block_size=16,
    dtype=torch.float16,
    device="cuda",
)

# 2. Create ModelExecutor
model_executor = ModelExecutor(
    block_manager=block_manager,
    num_heads=32,
    head_dim=128,
    page_size=16,
    dtype=torch.float16,
    device="cuda",
)

# 3. Allocate blocks for your sequence
seq_length = 50
block_indices = block_manager.allocate_blocks(seq_length)

# 4. Execute the model with FlashInfer attention
output = model_executor.execute_model(
    input_ids=input_tensor,
    block_tables=[block_indices],
    seq_lengths=[seq_length],
    layer_idx=0,
    is_prefill=True,
)
```

## Usage Examples

### Basic Block Allocation

```python
# Allocate blocks for 100 tokens with a 16-token block size
blocks = block_manager.allocate_blocks(100)
print(f"Allocated {len(blocks)} blocks: {blocks}")

# Free blocks when done
block_manager.free_blocks(blocks)
```

### FlashInfer KV Cache Mapping

```python
# Convert block tables to FlashInfer format
flashinfer_inputs = model_executor.prepare_flashinfer_inputs(
    block_tables=[[0, 1, 2], [3, 4]],
    seq_lengths=[40, 25],
    is_prefill=True,
)
```

### Attention Computation

```python
# Compute attention using FlashInfer
attention_output = model_executor.compute_attention_with_flashinfer(
    query=query_tensor,
    key_cache=key_cache_tensor,
    value_cache=value_cache_tensor,
    block_tables=block_tables,
    seq_lengths=seq_lengths,
    is_prefill=True,
)
```

## Testing

Run the unit tests:

```bash
python test_block_manager.py
python test_model_executor.py
```

Run the integration example:

```bash
python integration_example.py
```

## Architecture

### BlockManager Design

- Pre-allocates the KV cache pool with shape `(num_blocks, num_layers, 2, num_heads, head_dim, block_size)`
- Uses a deque for O(1) block allocation/deallocation
- Thread-safe operations with locking
- Tracks allocated vs. free blocks

### ModelExecutor Design

- Initializes FlashInfer decode and prefill wrappers
- Maps logical block tables to the FlashInfer format:
  - `paged_kv_indices`: Flattened block indices
  - `paged_kv_indptr`: Cumulative block counts per sequence
  - `paged_kv_last_page_len`: Number of tokens occupying each sequence's last block
- Handles KV cache tensor extraction and reshaping
- Provides a unified interface for attention computation

## Performance Considerations

1. **Memory efficiency**: The paged KV cache reduces memory fragmentation
2. **Compute efficiency**: FlashInfer operators are optimized for GPU execution
3. **Scalability**: Block-based allocation supports long sequences
4. **Concurrency**: Thread-safe operations enable batch processing

## Limitations

- Currently supports single-layer execution only
- Falls back to a mock FlashInfer implementation when the library is not installed
- CPU-only testing (GPU execution requires a CUDA setup)
- Simplified attention computation, intended for demonstration

## Future Enhancements

- Multi-layer execution support
- Distributed inference capabilities
- Advanced memory management strategies
- Integration with actual HuggingFace Llama models
- Performance benchmarking and optimization

## Contributing

This is a demonstration project for integrating BlockManager with FlashInfer. Feel free to extend it for production use cases.
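## Appendix: Illustrative Sketches

The deque-based block allocation described in the BlockManager design section can be sketched as follows. This is a minimal, hypothetical simplification (the class name `SimpleBlockManager` and its internals are illustrative, not the project's actual `BlockManager`): a deque of free physical block indices gives O(1) allocate/free from the ends, and a lock makes both operations thread-safe.

```python
import threading
from collections import deque


class SimpleBlockManager:
    """Illustrative sketch of deque-based KV cache block allocation.

    Not the project's actual BlockManager: this version tracks only the
    free list, with no tensor pool behind it.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self._free = deque(range(num_blocks))  # free physical block indices
        self._lock = threading.Lock()

    def allocate_blocks(self, num_tokens: int) -> list[int]:
        # Ceiling division: blocks needed to hold num_tokens.
        needed = -(-num_tokens // self.block_size)
        with self._lock:
            if needed > len(self._free):
                raise RuntimeError("out of KV cache blocks")
            # popleft() is O(1), so allocation is O(blocks requested).
            return [self._free.popleft() for _ in range(needed)]

    def free_blocks(self, blocks: list[int]) -> None:
        with self._lock:
            # extend() returns blocks to the free pool in O(1) per block.
            self._free.extend(blocks)
```

For example, allocating 100 tokens with a 16-token block size takes 7 blocks (ceil(100 / 16)); freeing them returns all 7 to the pool.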
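The block-table mapping described in the ModelExecutor design section can be sketched the same way. The function below (`map_block_tables` is a hypothetical name; the project's real entry point is `prepare_flashinfer_inputs`, which additionally builds device tensors) shows how per-sequence block tables become the three flat arrays FlashInfer's paged KV cache expects.

```python
def map_block_tables(block_tables, seq_lengths, page_size=16):
    """Flatten per-sequence block tables into FlashInfer's paged-KV layout.

    Returns (paged_kv_indices, paged_kv_indptr, paged_kv_last_page_len)
    as plain Python lists. Illustrative sketch only, not the project's
    actual prepare_flashinfer_inputs.
    """
    indices, indptr, last_page_len = [], [0], []
    for table, seq_len in zip(block_tables, seq_lengths):
        indices.extend(table)                   # flattened block indices
        indptr.append(indptr[-1] + len(table))  # cumulative block counts
        rem = seq_len % page_size               # tokens in the last page
        last_page_len.append(rem if rem else page_size)
    return indices, indptr, last_page_len
```

With the README's example inputs, `block_tables=[[0, 1, 2], [3, 4]]` and `seq_lengths=[40, 25]`, this yields `paged_kv_indices=[0, 1, 2, 3, 4]`, `paged_kv_indptr=[0, 3, 5]`, and `paged_kv_last_page_len=[8, 9]` (40 = 2×16 + 8, 25 = 1×16 + 9).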