# SpleeterRT

**Repository Path**: james34602/SpleeterRT

## Basic Information

- **Project Name**: SpleeterRT
- **Description**: Real-time monaural source separation based on a fully convolutional neural network operating in the time-frequency domain, in pure C.
- **Primary Language**: C
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-17
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Real-time monaural source separation based on a fully convolutional neural network operating in the time-frequency domain

An AI source separator written in C that runs a U-Net model trained by Deezer. It separates your audio input into Drums, Bass, Accompaniment and Vocal/Speech stems using the Spleeter model.

## Build Instructions

### Prerequisites

- Visual Studio 2019
- Intel MKL library - 2019 Update 5
- JUCE 6.x (but any version should work)

Run `git clone https://github.com/james34602/SpleeterRT.git`, open the `.projucer` file, and build.

## Network overview

The network accepts a 2-channel magnitude spectrogram as input. The U-Net is built from 6 encoder/decoder pairs, and a final dilated convolution layer expands the second-to-last feature map into 2 channels for stereo inference. For 4-stem separation, 4 networks are needed; each network computes a probability mask as its final output. The encoder uses convolutional layers with stride = 2, which removes the need for max pooling, a great improvement for a real-time system. Batch normalization and activation follow the output of each convolution layer except at the U-Net bottleneck. The decoder uses transposed convolutions with stride = 2 for upsampling, with each decoder input concatenated with the output of the matching encoder Conv2D. Note that the skip connections do not take the encoder outputs after batch normalization and activation.
The decoder side concatenates just the raw convolution output of each encoder layer.

## Real-time system design

Deep learning inference is mostly GEMM, so we implement an im2col() function with stride, padding and dilation that can handle TensorFlow-style or even PyTorch-style convolutional layers, plus a col2im() function with stride and padding for the transposed convolutional layers. After constructing the model in C, test runs show promising performance: a 14-second song is processed within 600 ms of wall-clock time, and the numerical accuracy is about 1e-4 MSE against the TensorFlow model, indicating the architecture is correct. I don't plan to use libtensorflow; I'll explain why.

Deep learning functions in the existing code: im2col(), col2im(), gemm(), conv_out_dim(), transpconv_out_dim()

We have to initialize a block of memory and spawn some threads before processing begins. Developers can adjust the number of frequency bins and time frames the neural network infers on. The __official__ Spleeter sets FFTLength = 4096, Flim = 1024 and T = 512 for the default CNN input; the network then predicts a mask up to 11 kHz over an input window of about 10 seconds. This means the real-world latency of the default setting with the __official__ model is 11 seconds plus the overlap-add sample latency, no matter how fast your CPU gets; the sample latency is intrinsic. I decided to reduce the time-frequency frame collection to 1/4, i.e. T = 128, at the cost of a slightly less accurate result. However, this only reduces sample latency; it doesn't fix the fact that the system stalls, because each deep learning call costs 600 ms, effectively halting the audio pipeline for 600 ms while the CNN runs. So we have to go for a double-buffered design.
The 2D image buffering mechanism is **double-buffered**: we collect 128 frames, output frames 129-256 (the previously computed block), and compute frames 1-128 in the background threads. We trade sample latency for computation headroom, which results in 6 seconds of latency, still lower than the __official__ default setting. The program spawns 5 threads: 1 thread handles the FFT, T-F masking, IFFT and overlap-add, while the other 4 threads actively run the deep learning tasks in the background. Since there are 4 sources to demix, we run 4 CNNs in parallel; within each network, the per-layer gemm() calls run sequentially.

## Demo and screenshot
- Mixture [0 - 15000 Hz]
- Vocal [0 - 15000 Hz]
- Accompaniment [0 - 15000 Hz]
- Drum [0 - 15000 Hz]
- Bass (guitar) [0 - 1200 Hz]
- VST plugin in action