# XIELU

XIELU (https://arxiv.org/abs/2411.13010) is a high-performance CUDA implementation of a parameterized activation function designed for deep learning applications. The library provides optimized GPU kernels with PyTorch integration for both training and inference.

## Mathematical Definition

The XIELU activation function is defined as:

```
f(x) = { α_p * x² + β * x,                                if x > 0
       { α_n * (exp(min(x, ε)) - 1) - α_n * x + β * x,    if x ≤ 0
```

Where:

- `α_p = softplus(alpha_p)`: Learned positive slope parameter
- `α_n = β + softplus(alpha_n)`: Learned negative slope parameter
- `β`: Fixed scaling factor
- `ε`: Numerical stability parameter

## Overview

XIELU implements a custom activation function with learnable parameters `alpha_p` (positive slope), `alpha_n` (negative slope), `beta` (scaling factor), and `eps` (epsilon for numerical stability). The activation is differentiable and suited to gradient-based optimization.

### Features

- **CUDA Accelerated**: Optimized CUDA kernels for high performance on NVIDIA GPUs
- **PyTorch Integration**: Full integration with PyTorch's autograd system
- **Flexible Precision**: Support for multiple floating-point precisions, including bfloat16 and half-precision optimizations
- **Gradient Support**: Complete backward-pass implementation for training

## Installation

### Requirements

- Python >= 3.10
- PyTorch >= 2.0
- CUDA Toolkit (the `CUDA_HOME` environment variable must be set)
- CMake >= 3.30
- NVIDIA GPU with compute capability 6.0+

### Setup

1. Ensure the `CUDA_HOME` environment variable points to your CUDA toolkit directory:

   ```bash
   export CUDA_HOME=/usr/local/cuda
   ```

2. Install the package:

   ```bash
   pip install . --no-build-isolation --no-deps
   ```

For GH200 or other specialized hardware, install on top of your existing container/uenv/Python environment.
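To confirm that the extension built and loads correctly, you can run a quick smoke test. This is a minimal sketch, assuming a CUDA-capable GPU is available; the constructor arguments mirror the Usage examples below:

```python
import torch
from xielu.ops.wrappers import XIELU

# Instantiate the activation on the GPU (illustrative parameter values;
# see Usage below for what each argument means).
device = torch.device("cuda")
xielu = XIELU(alpha_p_init=0.8, alpha_n_init=0.8, beta=0.5, eps=1e-6,
              device=device, dtype=torch.float32)

# One forward pass; any build or kernel problem surfaces here.
x = torch.randn(4, 16, device=device)
assert xielu(x).shape == x.shape
print("XIELU OK")
```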
## Usage

XIELU provides three implementation variants for different use cases:

- **`XIELU`**: CUDA-accelerated implementation with `torch.compile` support (recommended for production)
- **`XIELUfn`**: Pure PyTorch with a custom autograd function
- **`XIELUPy`**: Pure PyTorch reference implementation (see the sketch at the end of this README)

### Basic Usage

```python
import torch
from xielu.ops.wrappers import XIELU

# Initialize the activation function
device = torch.device("cuda")
xielu = XIELU(
    alpha_p_init=0.8,  # Initial positive slope parameter
    alpha_n_init=0.8,  # Initial negative slope parameter
    beta=0.5,          # Scaling factor
    eps=1e-6,          # Epsilon for numerical stability
    device=device,
    dtype=torch.float32,
)

# Forward pass
input_tensor = torch.randn(32, 128, 512, device=device)
output = xielu(input_tensor)

# The parameters are learnable and will be updated during training
optimizer = torch.optim.Adam(xielu.parameters(), lr=0.001)
```

### Integration with Neural Networks

```python
import torch
import torch.nn as nn

from xielu.ops.wrappers import XIELU

class MyModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.xielu = XIELU(device=torch.device("cuda"))
        self.linear2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.linear1(x)
        x = self.xielu(x)  # Custom activation
        x = self.linear2(x)
        return x
```

### Performance Options

For maximum performance, you can enable vectorized memory loads:

```python
xielu = XIELU(
    alpha_p_init=0.8,
    alpha_n_init=0.8,
    beta=0.5,
    eps=1e-6,
    device=device,
    with_vector_loads=True,  # Enable optimized memory access
)
```

### torch.compile Compatibility

XIELU supports `torch.compile` for additional performance optimizations and integration with compilable models:

```python
import torch
import torch.nn as nn

from xielu.ops.wrappers import XIELU

# Create a model with the XIELU activation
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.xielu = XIELU(device=torch.device("cuda"))

    def forward(self, x):
        return self.xielu(x)

# Compile the model for optimized performance
model = MyModel()
compiled_model = torch.compile(model)

# Use as normal - now with compilation optimizations
input_tensor = torch.randn(32, 128, 512, device="cuda")
output = compiled_model(input_tensor)
```

## Development

### Running Tests

The test suite includes correctness tests, gradient checks, and performance benchmarks:

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test files
python tests/test_xielu.py
python tests/test_reduced_precision.py

# Run the benchmark
python tests/benchmark.py
```

### Test Coverage

The test suite validates:

- **Correctness**: Forward-pass agreement between the CUDA and PyTorch implementations
- **Gradients**: Gradient correctness using `torch.autograd.gradcheck` (a standalone sketch appears below)
- **Precision**: Reduced-precision (bfloat16) functionality
- **Performance**: Throughput benchmarks across different tensor sizes

### Building from Source

The project uses CMake to build the CUDA extensions:

```bash
# Clean build
rm -rf build/

# Build in development mode
pip install -e . --no-build-isolation --no-deps

# For debugging, build with verbose output
CMAKE_VERBOSE_MAKEFILE=1 pip install -e . --no-build-isolation --no-deps
```

### Optimization Features

- **Vectorized Memory Access**: Enable `with_vector_loads=True` for improved memory throughput
- **Reduced Precision**: Support for bfloat16 operations for faster inference (see the sketch below)
- **Gradient Optimization**: Efficient backward-pass implementation
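As a quick illustration of reduced-precision use, here is a minimal sketch. It assumes that constructing the module with `dtype=torch.bfloat16` selects the reduced-precision path mentioned in the feature list, which this README does not confirm explicitly:

```python
import torch
from xielu.ops.wrappers import XIELU

device = torch.device("cuda")

# Assumption: dtype=torch.bfloat16 enables the bfloat16 path
# referenced under "Reduced Precision" above.
xielu_bf16 = XIELU(alpha_p_init=0.8, alpha_n_init=0.8, beta=0.5, eps=1e-6,
                   device=device, dtype=torch.bfloat16)

x = torch.randn(32, 128, 512, device=device, dtype=torch.bfloat16)
with torch.no_grad():  # inference-only, matching the "faster inference" claim
    y = xielu_bf16(x)
print(y.dtype)  # torch.bfloat16
```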
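### Reproducing the Gradient Check

The gradient check from the test suite can also be run in isolation. A minimal sketch, assuming the wrapper accepts `dtype=torch.float64` (`gradcheck` requires double precision for reliable numerical Jacobians); if the CUDA kernel is float32/bfloat16-only, the same call can be made against the `XIELUPy` reference variant instead:

```python
import torch
from xielu.ops.wrappers import XIELU

device = torch.device("cuda")

# Assumption: the wrapper can be constructed in float64.
xielu = XIELU(alpha_p_init=0.8, alpha_n_init=0.8, beta=0.5, eps=1e-6,
              device=device, dtype=torch.float64)

x = torch.randn(4, 8, device=device, dtype=torch.float64, requires_grad=True)
assert torch.autograd.gradcheck(xielu, (x,), eps=1e-6, atol=1e-4)
```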
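### Pure-PyTorch Reference Sketch

For orientation, here is a minimal pure-PyTorch rendering of the mathematical definition above. It is an illustrative reconstruction of what the `XIELUPy` reference variant computes, not the library's actual source:

```python
import torch
import torch.nn.functional as F

def xielu_reference(x: torch.Tensor,
                    alpha_p: torch.Tensor,
                    alpha_n: torch.Tensor,
                    beta: float = 0.5,
                    eps: float = 1e-6) -> torch.Tensor:
    """Piecewise formula from the Mathematical Definition section."""
    ap = F.softplus(alpha_p)         # α_p = softplus(alpha_p)
    an = beta + F.softplus(alpha_n)  # α_n = β + softplus(alpha_n)
    pos = ap * x * x + beta * x
    # expm1(min(x, ε)) == exp(min(x, ε)) - 1, computed stably
    neg = an * torch.expm1(torch.clamp(x, max=eps)) - an * x + beta * x
    return torch.where(x > 0, pos, neg)

# Example, using the same initial parameter values as the Usage section
x = torch.linspace(-3, 3, 7)
print(xielu_reference(x, torch.tensor(0.8), torch.tensor(0.8)))
```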