I’ve done a bit of coding in the past with SSE instructions and achieved quite a significant speedup. I’ve also been playing recently with OpenCL as a means of implementing a fast, cross-device version of the Dual-Tree Complex Wavelet Transform (DT-CWT).

Since I lack a GPU capable of supporting OpenCL, I’m rather limited to targeting the CPU. This is a useful task in itself, as I would like a DT-CWT library that is fast on any platform, making use of whatever hardware exists.

OpenCL implements a vector datatype, which the CPU compilers will try to map to SSE instructions. If no vector instructions are available, the compiler should just unroll the vector and perform scalar operations on each element (as would likely happen on most GPUs).

At the core of the DT-CWT is a convolution. The faster the convolution can be done, the faster the DT-CWT is done. This means a good starting point for a DT-CWT implementation is producing an efficient implementation of a convolution.

Given the difficulty of writing and debugging OpenCL, and the need to initially target a CPU, it made sense to me to create, in the first instance, a pure C version of an efficient convolution.

All the code is included in a GitHub repository, with the meat in convolve.c.

As a starting point, we can write a basic naive convolution:

    int convolve_naive(float* in, float* out, int length,
            float* kernel, int kernel_length)
    {
        for(int i=0; i<=length-kernel_length; i++){
            out[i] = 0.0;
            for(int k=0; k<kernel_length; k++){
                out[i] += in[i+k] * kernel[kernel_length - k - 1];
            }
        }
        return 0;
    }

This is compiled with the following options:

`gcc -std=c99 -Wall -O3 -msse3 -mtune=core2`

Running this with a 1024 sample input vector and a 16 sample kernel, it takes about 24.5μs per loop (on my rather old Core2 Duo).

The first modification is to reimplement this with SSE instructions, computing 4 output values in parallel. If the kernel is of length N, we create N 4-vectors and copy each kernel element into every element of the corresponding 4-vector. We can then compute 4 output values at once. For this, we require that the length of the input vector is a multiple of 4.

Note that we need to compute the last value in the output separately, as the output length is not a multiple of 4. We could also extend this principle to inputs whose lengths are not a multiple of 4 (but I don’t here!).

I use the SSE intrinsics (specifically, SSE3), included with

    #include <xmmintrin.h>
    #include <pmmintrin.h>

The `__m128` type is used to denote an SSE 4-vector that can be used with the intrinsics.

    int convolve_sse_simple(float* in, float* out, int length,
            float* kernel, int kernel_length)
    {
        float kernel_block[4] __attribute__ ((aligned (16)));
        __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
        __m128 data_block __attribute__ ((aligned (16)));
        __m128 prod __attribute__ ((aligned (16)));
        __m128 acc __attribute__ ((aligned (16)));

        // Reverse the kernel and repeat each value across a 4-vector
        for(int i=0; i<kernel_length; i++){
            kernel_block[0] = kernel[kernel_length - i - 1];
            kernel_block[1] = kernel[kernel_length - i - 1];
            kernel_block[2] = kernel[kernel_length - i - 1];
            kernel_block[3] = kernel[kernel_length - i - 1];
            kernel_reverse[i] = _mm_load_ps(kernel_block);
        }

        for(int i=0; i<length-kernel_length; i+=4){
            // Zero the accumulator
            acc = _mm_setzero_ps();

            /* After this loop, we have computed 4 output samples
             * for the price of one. */
            for(int k=0; k<kernel_length; k++){
                // Load a 4-float data block. This needs to be an unaligned
                // load (_mm_loadu_ps) as we step one sample at a time.
                data_block = _mm_loadu_ps(in + i + k);
                prod = _mm_mul_ps(kernel_reverse[k], data_block);

                // Accumulate the 4 parallel values
                acc = _mm_add_ps(acc, prod);
            }
            _mm_storeu_ps(out + i, acc);
        }

        // Need to do the last value as a special case
        int i = length - kernel_length;
        out[i] = 0.0;
        for(int k=0; k<kernel_length; k++){
            out[i] += in[i+k] * kernel[kernel_length - k - 1];
        }

        return 0;
    }

We can easily verify that the output from this routine is identical to that from the naive implementation.

We’ve immediately achieved a substantial speedup, going from 24.5μs to 10.8μs (2.3x).

If we also impose a constraint on the kernel length, making it a multiple of 4, we can specify the inner loop count with a literal, allowing the compiler to partially unroll the innermost loop.

    for(int i=0; i<length-kernel_length; i+=4){
        acc = _mm_setzero_ps();
        for(int k=0; k<kernel_length; k+=4){
            int data_offset = i + k;
            for (int l = 0; l < 4; l++){
                data_block = _mm_loadu_ps(in + data_offset + l);
                prod = _mm_mul_ps(kernel_reverse[k+l], data_block);
                acc = _mm_add_ps(acc, prod);
            }
        }
        _mm_storeu_ps(out + i, acc);
    }

This tweaks the time a bit lower, down to about 9.9μs per loop.

We’re still not done though! Each load from the dataset is an unaligned load (`_mm_loadu_ps`). Aligned loads using SSE are significantly faster than unaligned loads. To make a load aligned, the address of the data being loaded must be a multiple of 16 (with AVX, Intel’s latest vector instruction set incarnation, I believe this has been increased to 32).

Unaligned loads are a necessity in the code so far as we step over every sample, meaning we never know what the byte-alignment of any given block of 4 floats will be. By somehow persuading every load to be aligned, we might be able to get more speed.

We can do this by making 4 copies (well, almost whole copies) of the input array, each with a differing alignment, such that every possible set of 4 floats can be found on the crucial 16-byte alignment.

If we start with the original data, with the numbers showing the sample indices:

    original: [0, 1, 2, 3, 4, 5, ...]

We can create 4 copies as follows:

    copy 1: [0, 1, 2, 3, 4, 5, ...]
    copy 2: [1, 2, 3, 4, 5, 6, ...]
    copy 3: [2, 3, 4, 5, 6, 7, ...]
    copy 4: [3, 4, 5, 6, 7, 8, ...]

Since a float is 4 bytes, every block of 4 consecutive samples can now be found starting on a 16-byte boundary in one of the copies.

It’s now a simple case of referencing the correct copy of the input array and switching to an aligned load (`_mm_load_ps`).

    int convolve_sse_in_aligned(float* in, float* out, int length,
            float* kernel, int kernel_length)
    {
        float kernel_block[4] __attribute__ ((aligned (16)));
        float in_aligned[4][length] __attribute__ ((aligned (16)));
        __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
        __m128 data_block __attribute__ ((aligned (16)));
        __m128 prod __attribute__ ((aligned (16)));
        __m128 acc __attribute__ ((aligned (16)));

        // Reverse the kernel and repeat each value across a 4-vector
        for(int i=0; i<kernel_length; i++){
            kernel_block[0] = kernel[kernel_length - i - 1];
            kernel_block[1] = kernel[kernel_length - i - 1];
            kernel_block[2] = kernel[kernel_length - i - 1];
            kernel_block[3] = kernel[kernel_length - i - 1];
            kernel_reverse[i] = _mm_load_ps(kernel_block);
        }

        /* Create a set of 4 aligned arrays
         * Each array is offset by one sample from the one before
         */
        for(int i=0; i<4; i++){
            memcpy(in_aligned[i], (in+i), (length-i)*sizeof(float));
        }

        for(int i=0; i<length-kernel_length; i+=4){
            acc = _mm_setzero_ps();
            for(int k=0; k<kernel_length; k+=4){
                int data_offset = i + k;
                for (int l = 0; l < 4; l++){
                    data_block = _mm_load_ps(in_aligned[l] + data_offset);
                    prod = _mm_mul_ps(kernel_reverse[k+l], data_block);
                    acc = _mm_add_ps(acc, prod);
                }
            }
            _mm_storeu_ps(out + i, acc);
        }

        // Need to do the last value as a special case
        int i = length - kernel_length;
        out[i] = 0.0;
        for(int k=0; k<kernel_length; k++){
            out[i] += in_aligned[0][i+k] * kernel[kernel_length - k - 1];
        }

        return 0;
    }

A quick test demonstrates that the speedup from the aligned loads overwhelms the cost of making the extra copies.

The time with the same input and kernel is now about 5.4μs.

One final tweak is to fix the length of the kernel loop, replacing `kernel_length` with a constant. This allows the compiler to unroll the whole of the kernel loop and yields a time of about 4.0μs (code here).

Inspecting the generated assembly for the inner loop suggests that we’ve pretty much reached the limit of what is possible: most of the time is spent on an aligned load, followed by a 4-vector multiply and a 4-vector add, the essence of a convolution.

Running on larger inputs suggests that the speedup scales well with array size (I tested up to 32,768 samples). The repository includes a Python script that you can play with to convince yourself.