# AscendC-Code-Release

**Repository Path**: sztu-bdi/ascendc-code-release

## Basic Information

- **Project Name**: AscendC-Code-Release
- **Description**: Ascend AI Innovation Competition, Operator Challenge track; entry by team 这次包能算对, which received an Excellence Award in season S4
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-05
- **Last Updated**: 2025-12-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

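# Project Overview

This repository builds on the AscendC/CANN custom operator framework and provides reference implementations and packaging examples for several AI Core custom operators: `FresnelCos`, `Pows`, `RmsNorm`, `SelectV2`, and `AddLayerNorm`. Each operator ships Host-side code (registration, shape inference, and tiling) and Kernel-side code (the AscendC kernel implementation), so it can be integrated and deployed in an Ascend environment.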

# Directory Structure

- `FresnelCos/`
  - `op_host/`: `fresnel_cos.cpp` (registration and tiling)
  - `op_kernel/`: `fresnel_cos.cpp`, `fresnel_cos.h` (kernel and helper implementations)
- `Pows/`
  - `op_host/`: `pows.cpp`, `pows_tiling.h`
  - `op_kernel/`: `pows.cpp`, `pow.h`, `powb.h`
- `RmsNorm/`
  - `op_host/`: `rms_norm.cpp`, `rms_norm_tiling.h`
  - `op_kernel/`: `rms_norm.cpp`
- `SelectV2/`
  - `op_host/`: `select_v2.cpp`, `select_v2_tiling.h`
  - `op_kernel/`: `select_v2.cpp`
- `AddLayerNorm/`
  - `op_host/`: `add_layer_norm_custom.cpp`, `add_layer_norm_custom_tiling.h`
  - `op_kernel/`: `add_layer_norm_custom.cpp`

# Operator Notes

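- `Pows`
  - Inputs/outputs: `x1`, `x2` → `y`
  - Data types: `float32`, `float16`, `bfloat16`
  - Shape support: full support for broadcast, non-broadcast, and non-aligned shapes; the Host side implements address mapping for multi-dimensional broadcasting together with block tiling
  - Device configuration: `ascend310b`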
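- `SelectV2`
  - Inputs/outputs: `condition` (`bool`), `x1`, `x2` → `y`
  - Data types: `float16`, `float32`, `int32`, `int8`
  - Shape support: full support for broadcast, non-broadcast, and non-aligned shapes; UB tiles are aligned to 128 B according to the data type
  - Device configuration: `ascend310b`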
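- `RmsNorm`
  - Inputs/outputs: `x`, `gamma` → `y`, `rstd`; configurable attribute `epsilon` (default `1e-6`)
  - Data types: `float32`, `float16`, `bfloat16`
  - Shape support: currently only non-broadcast 2-D tensors (normalized row by row); the number of rows processed per iteration is chosen dynamically from the UB capacity
  - Device configuration: `ascend310b`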
- `FresnelCos`
  - Inputs/outputs: `x` → `y`
  - Implementation status: the Kernel side is still being refined; FP32 accuracy does not yet meet the `1e-4` requirement
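- `AddLayerNorm`
  - Inputs/outputs: `x`, `y`, `gamma`, `beta` → `res_out`
  - Data types: `float32`, `float16`
  - Shape support: 2-D tensors, no broadcasting; multi-shape tiling is supported; `gamma`/`beta` are vectors whose length equals the row length
  - Attribute: `epsilon` (default `1e-5`)
  - Device configuration: `ascend910b`; `ascend310b` should be supportable in principle
  - Implementation paths:
    - Current path (two reductions in UB): `x+y`, the row mean, centering, the sum of squares, `std`/`invstd`, normalization, and the affine transform are all computed in UB with a double-buffered pipeline. Full-tensor GM traffic is 2 reads (`x`, `y`) plus 1 write-back (`res_out`); `gamma` and `beta` are each prefetched once per core.
    - Better path (streaming Welford + fused write-out): the first pass streams over the data to accumulate the mean and variance; the second pass fuses `gamma`/`beta` into the write-out stage to complete the normalization. Provided a tile can hold the current row, full-tensor GM reads are likewise 2. With a "write the intermediate, then read it back" implementation, full-tensor GM reads rise to 3 (`x` and `y` in the first pass, the intermediate `x+y` in the second), plus 1 additional intermediate write-back.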
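
The multi-dimensional broadcast address mapping mentioned for `Pows` (and the broadcast support in `SelectV2`) can be illustrated with a small host-side sketch. It is written in plain Python rather than AscendC and is not taken from the repository's `op_host` code; the function names `broadcast_shape`, `broadcast_strides`, and `source_offset` are hypothetical, and NumPy-style right-aligned broadcasting is assumed.

```python
def broadcast_shape(shape_a, shape_b):
    """Right-align two shapes and stretch size-1 dims (NumPy-style rule, assumed here)."""
    rank = max(len(shape_a), len(shape_b))
    a = (1,) * (rank - len(shape_a)) + tuple(shape_a)
    b = (1,) * (rank - len(shape_b)) + tuple(shape_b)
    out = []
    for da, db in zip(a, b):
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} do not broadcast")
        out.append(max(da, db))
    return tuple(out)


def broadcast_strides(in_shape, out_shape):
    """Element strides of a contiguous input viewed in out_shape.

    Dimensions of size 1 get stride 0, so every output coordinate along
    that dimension reads the same input element.
    """
    rank = len(out_shape)
    padded = (1,) * (rank - len(in_shape)) + tuple(in_shape)
    strides = [0] * rank
    running = 1
    for i in range(rank - 1, -1, -1):
        strides[i] = 0 if padded[i] == 1 else running
        running *= padded[i]
    return strides


def source_offset(linear_out, out_shape, strides):
    """Map a flat output index to the flat offset inside the (unexpanded) input."""
    offset = 0
    for dim, stride in zip(reversed(out_shape), reversed(strides)):
        linear_out, coord = divmod(linear_out, dim)
        offset += coord * stride
    return offset
```

With `x1` of shape `(2, 1)` and `x2` of shape `(3,)`, the output shape is `(2, 3)`; a kernel tile that covers output indices `[start, end)` would read `x1` and `x2` at `source_offset(i, ...)` for each `i`, so neither input has to be materialized at the broadcast size.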
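
For `RmsNorm`, the row-wise computation and the "rows per iteration from UB capacity" idea can be sketched as follows. This is a plain-Python reference under stated assumptions, not the repository's kernel: it assumes the common definition `rstd = 1 / sqrt(mean(x^2) + epsilon)` and `y = x * rstd * gamma`, and the helper `rows_per_tile` with its `live_buffers` parameter is a hypothetical simplification of whatever the real tiling code does.

```python
import math


def rms_norm_rows(x, gamma, epsilon=1e-6):
    """Row-wise RMS normalization of a 2-D list; returns (y, rstd)."""
    y, rstd = [], []
    for row in x:
        mean_sq = sum(v * v for v in row) / len(row)
        r = 1.0 / math.sqrt(mean_sq + epsilon)  # reciprocal RMS of this row
        rstd.append(r)
        y.append([v * r * g for v, g in zip(row, gamma)])
    return y, rstd


def rows_per_tile(ub_bytes, row_len, dtype_bytes, live_buffers):
    """How many whole rows fit in UB at once (hypothetical model).

    live_buffers is an assumed count of row-sized buffers alive at the same
    time (inputs, outputs, temporaries). A result of 0 means a single row
    does not fit and would have to be split.
    """
    return ub_bytes // (row_len * dtype_bytes * live_buffers)
```

A reference like this is mainly useful for checking kernel output on small inputs: for any row, the mean of `(y / gamma)^2` should be close to 1 when `epsilon` is negligible.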
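
The two `AddLayerNorm` paths described above differ only in how the row statistics are obtained, which a small reference sketch makes concrete. The code below is plain Python, not the AscendC kernel; the function names are hypothetical, and it assumes the population (biased) variance that layer normalization conventionally uses.

```python
import math


def add_layer_norm_two_pass(x_row, y_row, gamma, beta, epsilon=1e-5):
    """Current-path analogue: the whole row of x+y is resident, then two reductions."""
    s = [a + b for a, b in zip(x_row, y_row)]
    n = len(s)
    mean = sum(s) / n                          # first reduction: row mean
    centered = [v - mean for v in s]
    var = sum(c * c for c in centered) / n     # second reduction: sum of squares
    inv_std = 1.0 / math.sqrt(var + epsilon)
    return [c * inv_std * g + bt for c, g, bt in zip(centered, gamma, beta)]


def welford_stats(values):
    """Streaming mean and population variance (Welford's update), one value at a time."""
    count, mean, m2 = 0, 0.0, 0.0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return mean, m2 / count


def add_layer_norm_welford(x_row, y_row, gamma, beta, epsilon=1e-5):
    """Better-path analogue: stream statistics first, then fuse gamma/beta into the write-out."""
    mean, var = welford_stats(a + b for a, b in zip(x_row, y_row))
    inv_std = 1.0 / math.sqrt(var + epsilon)
    # The second pass recomputes x+y instead of reading back a stored intermediate.
    return [((a + b) - mean) * inv_std * g + bt
            for a, b, g, bt in zip(x_row, y_row, gamma, beta)]
```

Both functions should agree to within floating-point rounding, so either can serve as a golden reference when validating `res_out`. The streaming version never needs the full centered row in memory, which is what lets the statistics pass work on tiles smaller than a row.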

# Project Support

This research was financially supported by the Open Research Fund from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), under Grant No. GML-KF-24-04.

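# Contributors
1. Shenzhen Technology University: 熊凯文 (Xiong Kaiwen), 马军超 (Ma Junchao)
2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)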