# ndt_omp
**Repository Path**: jaredjp/ndt_omp
## Basic Information
- **Project Name**: ndt_omp
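- **Description**: An NDT acceleration library based on the x86 architecture and OpenMP.
Requires PCL 1.7; do not install PCL 1.8 on the machine.
Only RELEASE mode is supported, so select RELEASE when building with Qt.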
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 2
- **Created**: 2021-11-18
- **Last Updated**: 2021-12-24
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**Building on ARM requires modifying CMakeLists.txt**
```
#add_definitions(-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2)
#set(CMAKE_CXX_FLAGS "-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2")
set(CMAKE_CXX_FLAGS "-std=c++11 -fopenmp")
```
**On 18.04 systems, PCL is version 1.8 and, for reasons unknown, pcl_ros lacks some VTK libraries, so CMakeLists.txt must be modified**
```
# -mavx causes a lot of errors!!
add_definitions(-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2)
set(CMAKE_CXX_FLAGS "-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2")
# pcl 1.7 causes a segfault when it is built with debug mode
set(CMAKE_BUILD_TYPE "RELEASE")
find_package(catkin REQUIRED COMPONENTS
roscpp
pcl_ros
)
find_package(PCL 1.8 REQUIRED)
include_directories(${PCL_INCLUDE_DIRS})
link_directories(${PCL_LIBRARY_DIRS})
add_definitions(${PCL_DEFINITIONS})
message(STATUS "PCL_INCLUDE_DIRS:" ${PCL_INCLUDE_DIRS})
message(STATUS "PCL_LIBRARY_DIRS:" ${PCL_LIBRARY_DIRS})
message(STATUS "PCL_DEFINITIONS:" ${PCL_DEFINITIONS})
```
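2019-01-10: Tried to save the NDT object to disk (because passing in the point cloud and initializing it takes too long), but failed. Saving a C++ object like this is really a serialization problem, and the main existing serialization options are [Google's protobuf and Boost.Serialization](https://www.cnblogs.com/mfrbuaa/p/3940854.html). protobuf is efficient but too lightweight; Boost is comprehensive and supports the STL. However, PCL is full of inheritance, derived classes, and virtual base classes, which makes it too hard to serialize in the standard way. [Detailed tutorial](https://blog.csdn.net/chenaqiao/article/details/48371597)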
# ndt_omp
This package provides an OpenMP-boosted Normal Distributions Transform (and GICP) algorithm derived from pcl. The NDT algorithm is modified to be SSE-friendly and multi-threaded. It can run up to 10 times faster than its original version in pcl.
### Benchmark (on Core i7-6700K)
```
$ roscd ndt_omp/data
$ rosrun ndt_omp align 251370668.pcd 251371071.pcd
--- pcl::NDT ---
single : 282.222[msec]
10times: 2921.92[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 207.697[msec]
10times: 2059.19[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 139.433[msec]
10times: 1356.79[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 34.6418[msec]
10times: 317.03[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 8 threads) ---
single : 54.9903[msec]
10times: 500.51[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 8 threads) ---
single : 63.1442[msec]
10times: 343.336[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 8 threads) ---
single : 17.2353[msec]
10times: 100.025[msec]
fitness: 0.208511
```
Several methods for neighbor voxel search are implemented. If you select pclomp::KDTREE, the results will be exactly the same as those of the original pcl::NDT. We recommend pclomp::DIRECT7, which is faster and stable. If you need extremely fast registration, choose pclomp::DIRECT1, but it may be somewhat unstable.
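The sketch below illustrates why the direct methods are cheaper than a kd-tree lookup. It rests on an assumption that is not stated in this README: that DIRECT1 inspects only the voxel containing the query point, while DIRECT7 also inspects the six face-adjacent voxels. The enum and function names are hypothetical and are not the pclomp API.

```cpp
#include <array>
#include <vector>

// Hypothetical names for illustration; they mirror, but are not, the pclomp enum values.
enum class NeighborSearch { Direct1, Direct7 };

using VoxelIndex = std::array<int, 3>;

// Return the voxel indices a direct search would visit around `center`.
// Assumption: Direct1 visits only the center voxel; Direct7 adds the
// six voxels that share a face with it (+/-1 along each axis).
std::vector<VoxelIndex> neighborVoxels(const VoxelIndex& center, NeighborSearch method) {
    std::vector<VoxelIndex> result{center};
    if (method == NeighborSearch::Direct7) {
        for (int axis = 0; axis < 3; ++axis) {
            for (int step : {-1, 1}) {
                VoxelIndex neighbor = center;
                neighbor[axis] += step;
                result.push_back(neighbor);
            }
        }
    }
    return result;
}
```

Under this assumption the candidate set has a fixed size (1 or 7 voxels) and is found by index arithmetic alone, which fits the benchmark ordering above: DIRECT1 is the fastest and KDTREE the slowest.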

Red: target, Green: source, Blue: aligned
AERO 15
```
--- pcl::NDT ---
single : 540.319[msec]
10times: 5289.08[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 355.901[msec]
10times: 3518.03[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 287.231[msec]
10times: 2848.42[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 68.8867[msec]
10times: 649.162[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 8 threads) ---
single : 77.1235[msec]
10times: 717.681[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 8 threads) ---
single : 57.8502[msec]
10times: 555.979[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 8 threads) ---
single : 18.9918[msec]
10times: 149.262[msec]
fitness: 0.208511
```
After modifying CMakeLists.txt to disable SSE:
```
#add_definitions(-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2)
#set(CMAKE_CXX_FLAGS "-std=c++11 -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2")
set(CMAKE_CXX_FLAGS "-std=c++11")
```
Results:
```
--- pcl::NDT ---
single : 536.016[msec]
10times: 5277.47[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 369.782[msec]
10times: 3660.63[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 305.225[msec]
10times: 3013.47[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 71.9889[msec]
10times: 686.992[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 8 threads) ---
single : 96.6104[msec]
10times: 747.321[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 8 threads) ---
single : 61.2044[msec]
10times: 589.004[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 8 threads) ---
single : 19.3979[msec]
10times: 160.325[msec]
fitness: 0.208511
```
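To put the effect of removing the SSE flags into numbers, the sketch below computes the percent change in the `single` timings between the two preceding runs. It assumes both runs were made on the same machine (the AERO 15); the function names are made up for this example, and the timings are copied from the two outputs above.

```cpp
#include <cstdio>

// Relative change in percent from a baseline timing to a new timing.
// A positive value means the new timing is slower.
double percentChange(double baseline_msec, double new_msec) {
    return (new_msec - baseline_msec) / baseline_msec * 100.0;
}

// Print the change for the 1-thread pclomp runs ("single" rows from the outputs above).
void printSseImpact() {
    struct Row {
        const char* name;
        double with_sse_msec;
        double without_sse_msec;
    };
    const Row rows[] = {
        {"KDTREE,  1 thread", 355.901, 369.782},
        {"DIRECT7, 1 thread", 287.231, 305.225},
        {"DIRECT1, 1 thread", 68.8867, 71.9889},
    };
    for (const Row& row : rows) {
        std::printf("%-18s %+.1f%%\n", row.name,
                    percentChange(row.with_sse_msec, row.without_sse_msec));
    }
}
```

For these three rows the slowdown without SSE works out to roughly 4 to 6 percent, which is small next to the gain from multi-threading. Each figure comes from a single run, so the comparison is only indicative.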
Desktop PC: Intel® Core™ i7-8700 CPU @ 3.20GHz × 12
```
--- pcl::NDT ---
single : 223.293[msec]
10times: 2185.31[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 211.347[msec]
10times: 2057.26[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 82.9705[msec]
10times: 808.223[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 22.7415[msec]
10times: 205.486[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 12 threads) ---
single : 36.593[msec]
10times: 311.669[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 12 threads) ---
single : 16.9687[msec]
10times: 150.324[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 12 threads) ---
single : 7.12142[msec]
10times: 48.8959[msec]
fitness: 0.208511
```
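The desktop numbers above can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The helper names below are made up for this example; the timings are the `single` rows from the preceding output.

```cpp
#include <cstdio>

// Speedup of a run relative to a baseline timing (both in msec).
double speedup(double baseline_msec, double run_msec) {
    return baseline_msec / run_msec;
}

// Parallel efficiency: speedup over the 1-thread run divided by the thread count.
// A value of 1.0 would mean perfect linear scaling.
double parallelEfficiency(double one_thread_msec, double n_thread_msec, int threads) {
    return speedup(one_thread_msec, n_thread_msec) / threads;
}

// Summarize the i7-8700 DIRECT7 results ("single" rows from the output above).
void printDesktopSummary() {
    const double pcl_ndt_msec = 223.293;     // pcl::NDT
    const double direct7_1t_msec = 82.9705;  // pclomp::NDT (DIRECT7, 1 thread)
    const double direct7_12t_msec = 16.9687; // pclomp::NDT (DIRECT7, 12 threads)
    std::printf("DIRECT7 x12 vs pcl::NDT : %.1fx\n",
                speedup(pcl_ndt_msec, direct7_12t_msec));
    std::printf("DIRECT7 x12 vs x1       : %.1fx (efficiency %.0f%%)\n",
                speedup(direct7_1t_msec, direct7_12t_msec),
                parallelEfficiency(direct7_1t_msec, direct7_12t_msec, 12) * 100.0);
}
```

On these numbers, DIRECT7 with 12 threads is about 13 times faster than pcl::NDT, but only about 4.9 times faster than its own 1-thread run, which is a parallel efficiency of roughly 41 percent. Most of the remaining gain therefore comes from the cheaper neighbor search rather than from threading.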
On TX2, after modifying CMakeLists.txt:
```
--- pcl::NDT ---
single : 967.739[msec]
10times: 9643.9[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 697.156[msec]
10times: 7116.16[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 370.99[msec]
10times: 3648.4[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 106.453[msec]
10times: 955.57[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 4 threads) ---
single : 208.352[msec]
10times: 2055.04[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 4 threads) ---
single : 114.866[msec]
10times: 1158.24[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 4 threads) ---
single : 39.8362[msec]
10times: 291.261[msec]
fitness: 0.208511
```
Test results on APEX (Xavier):
```
--- pcl::NDT ---
single : 647.187[msec]
10times: 5923.38[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 573.673[msec]
10times: 5307.98[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 231.879[msec]
10times: 2015.26[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 62.265[msec]
10times: 535.759[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 8 threads) ---
single : 113.542[msec]
10times: 997.613[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 8 threads) ---
single : 67.8999[msec]
10times: 629.635[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 8 threads) ---
single : 38.7986[msec]
10times: 322.459[msec]
fitness: 0.208511
```
Test results on HiSilicon 3559A:
```
--- pcl::NDT ---
single : 1056.54[msec]
10times: 10484.1[msec]
fitness: 0.213937
--- pclomp::NDT (KDTREE, 1 threads) ---
single : 771.621[msec]
10times: 7625.17[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 1 threads) ---
single : 471.365[msec]
10times: 4625.03[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 1 threads) ---
single : 127.452[msec]
10times: 1190.68[msec]
fitness: 0.208511
--- pclomp::NDT (KDTREE, 4 threads) ---
single : 403.844[msec]
10times: 4198.75[msec]
fitness: 0.213937
--- pclomp::NDT (DIRECT7, 4 threads) ---
single : 265.707[msec]
10times: 2889.43[msec]
fitness: 0.214205
--- pclomp::NDT (DIRECT1, 4 threads) ---
single : 81.731[msec]
10times: 719.643[msec]
fitness: 0.208511
```