pyFFTW coming together

It has been some time since I posted about pyFFTW; the last post was well over a year ago, when I introduced the wisdom functionality. Despite that, development has been plodding along steadily (and quiet releases made!) and with the 0.9.1 release it has now reached the stage where it satisfies much of my wishlist.

Please go and check out the latest docs for all the features (the tutorial gives a pretty good overview). The main big improvement since the last post is the pretty neat interfaces module, which offers a drop-in replacement for both numpy.fft and scipy.fftpack. Though not all of the functionality of those modules is provided by code within pyFFTW, the namespace is completed by importing parts of numpy.fft or scipy.fftpack as appropriate. There is some slowdown when using pyfftw.interfaces, but this can be largely alleviated with the cache functionality.

Both of the above interfaces build on the simplified interface for constructing FFT objects, pyfftw.builders, which is itself of potential interest to end users who want to squeeze out every drop of performance without getting bogged down in the details of creating pyfftw.FFTW objects directly.

In addition, some nice helper functionality has been added to pyfftw.FFTW objects. Mainly, the objects can now be called directly to yield the output, updating the arrays as desired, with normalisation optionally applied to inverse FFTs. This makes creating and calling a pyfftw.FFTW object as simple as:

>>> import pyfftw
>>> a = pyfftw.n_byte_align_empty(4, 32, dtype='complex128')
>>> fft = pyfftw.builders.fft(a)
>>> a[:] = [1, 2, 3, 4]
>>> fft() # returns the output
array([ 10.+0.j,  -2.+2.j,  -2.+0.j,  -2.-2.j])
>>> fft([5, 6, 7, 8])
array([ 26.+0.j,  -2.+2.j,  -2.+0.j,  -2.-2.j])

Though in the above example a is explicitly aligned to 32 bytes, the builder will by default handle the alignment automatically (the optimum alignment is inspected from the CPU), and the resulting FFTW object will align all array updates so that the alignment is maintained. In the above example, the update array is neither explicitly aligned nor of the correct dtype for the FFTW object; both corrections are handled automatically by the call.

That is, in a nutshell, most of the important improvements, though there are loads of little usability enhancements and bug fixes. Please, everyone, go and have a play with it and report bugs back to the github page.

Posted in Programming | 2 Comments

Squishing Matlab mex files into Octave

Common wisdom says that the mex files that Matlab builds are good for Matlab, and Matlab only. Since I don't have trivial access to a Matlab installation but needed the very neat and useful Field II package, this was not wholly satisfactory for me.

Field II is a package for simulating an ultrasound field. It doesn’t matter what it does really, except that it only comes with precompiled mex files (no source :(). These mex files (in my case, the 64-bit version with extension .mexa64) are really just poorly disguised ELF shared libraries, so all the usual library inspection tools work – ldd, nm, objdump – as well as all manner of nefarious library hacking.

It turns out that getting the library to work actually wasn’t that big a deal. Octave handily provides a Matlab-compatible API, so it implicitly has a chunk of code that does the right stuff; the only problem then is one of ABI compatibility. A quick glance through the mex documentation suggests that only standard types, or pointers to exotic types, are passed around on the stack, and it also dictates that the Matlab internal objects are opaque, so the ABI of the Octave library is probably going to be compatible. (Another learning point for me: be nice to all potential users and keep function parameters clean of exotic types that aren’t references – thank you Mathworks!)

Before we even have a hope of linking the library though, we must first find out what’s missing. A peer into the distributed mex file tells us what we need to worry about:

$ ldd Mat_field.mexa64

gives us, among other things, the missing libraries (Mat_field.mexa64 is the mex file of interest):

libmx.so => not found
libmex.so => not found
libmat.so => not found

We can easily create a set of dummy libraries to keep the linker happy with these:

gcc -shared -lc -o libmx.so
gcc -shared -lc -o libmex.so
gcc -shared -lc -o libmat.so

Now we only need to tie it all together with the Octave libs, whence the missing symbols will come. You can see what Octave is doing when it builds an oct file by passing the -v (verbose) flag to mkoctfile. Since we don’t actually need to compile anything, just relink, we hope the following will work (fortunately, none of the distributed mex files had the .mex extension, so I could claim it myself; otherwise there might well be a name collision that needs working around)…


# Create usual .so file name
ln -s Mat_field.mexa64 libMatfield.so

ld -shared -Bsymbolic -Bsymbolic-functions -o Mat_field.mex -L/usr/lib/x86_64-linux-gnu/octave/3.6.2 -L. -loctinterp -lMatfield -rpath `pwd`

Unfortunately, the distributed mex file references a couple of symbols for functions that are not documented by Mathworks. These seemed to be version-specific implementations of documented functions, so I just reimplemented them as wrappers of the documented functions (and by extension, of those implemented in Octave). In my case, the missing functions were mxCreateDoubleMatrix_700 and mxGetString_700 (Octave tells you the missing symbols when you try to use the library). The following code did the trick:

#include "mex.h"

int mxGetString_700(const mxArray *pm, char *str, mwSize strlen)
{
    return mxGetString(pm, str, strlen);
}

mxArray *mxCreateDoubleMatrix_700(mwSize m, mwSize n, mxComplexity ComplexFlag)
{
    return mxCreateDoubleMatrix(m, n, ComplexFlag);
}

After saving that file as undocumented_funcs.c, it can be compiled using the mkoctfile helper program and then linked in as before.

mkoctfile -c undocumented_funcs.c
ld -shared -Bsymbolic -Bsymbolic-functions -o Mat_field.mex -L/usr/lib/x86_64-linux-gnu/octave/3.6.2 -L. -loctinterp -lMatfield -rpath '$ORIGIN' undocumented_funcs.o

Success! You use it exactly as you would the original mex file. The '$ORIGIN' argument to -rpath means the library is found relative to the mex file itself, so you can twiddle the paths in Octave and it all works fine.

All this was under Linux. I’ve no idea what happens if you try something similar under Windows.

Posted in Engineering, Programming | Leave a comment

Speedy fast 1D convolution with SSE

I’ve done a bit of coding in the past with SSE instructions and achieved quite a significant speedup. I’ve also been playing recently with OpenCL as a means of implementing a fast, cross-device version of the Dual-Tree Complex Wavelet Transform (DT-CWT).

Since I lack a GPU capable of supporting OpenCL, I’m rather limited to targeting the CPU. This is a useful task in itself, as I would like a DT-CWT library that is fast on any platform, making use of whatever hardware exists.

OpenCL implements a vector datatype, which the CPU compilers will try to map to SSE instructions. If no vector instructions are available, the compiler should just unroll the vector and perform scalar operations on each element (as would likely happen on most GPUs).

At the core of the DT-CWT is a convolution. The faster the convolution can be done, the faster the DT-CWT is done. This means a good starting point for a DT-CWT implementation is producing an efficient implementation of a convolution.

Given the difficulty of writing and debugging OpenCL, and the need to initially target a CPU, it made sense to me to create, in the first instance, a pure C version of an efficient convolution.

All the code is included on a github repository, with the meat in convolve.c.

As a starting point, we can write a basic naive convolution:

int convolve_naive(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    for(int i=0; i<=length-kernel_length; i++){

        out[i] = 0.0;
        for(int k=0; k<kernel_length; k++){
            out[i] += in[i+k] * kernel[kernel_length - k - 1];
        }
    }

    return 0;
}

This is compiled with the following options:
gcc -std=c99 -Wall -O3 -msse3 -mtune=core2

Running this with a 1024 sample input vector and a 16 sample kernel, it takes about 24.5μs per loop (on my rather old Core2 Duo).
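
As a quick sanity check (not part of the repository – just a numpy transcription), the naive loop above computes exactly what numpy calls a ‘valid’ mode convolution:

```python
import numpy as np

def convolve_naive(in_, kernel):
    # Direct Python transcription of the C loop above.
    length, kernel_length = len(in_), len(kernel)
    out = np.zeros(length - kernel_length + 1)
    for i in range(length - kernel_length + 1):
        for k in range(kernel_length):
            out[i] += in_[i + k] * kernel[kernel_length - k - 1]
    return out

in_ = np.random.randn(1024)
kernel = np.random.randn(16)

# Agrees with numpy's valid-mode convolution
assert np.allclose(convolve_naive(in_, kernel),
                   np.convolve(in_, kernel, mode='valid'))
```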

The first modification is to reimplement this with SSE instructions. We can do this by computing 4 output values in parallel. If the kernel is of length N, then we create N 4-vectors and copy each element of the kernel into every element of the corresponding 4-vector. We can then compute 4 output values in parallel. For this, we require that the input length is a multiple of 4.

Note that we need to compute the last value in the output separately, as the output length is not a multiple of 4. We could also extend this principle to inputs whose length is not a multiple of 4 (but I don’t here!).

I use the SSE intrinsics (specifically, SSE3), included with
#include <pmmintrin.h>
#include <xmmintrin.h>

The __m128 type is used to denote an SSE 4-vector that can be used with the intrinsics.

int convolve_sse_simple(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    float kernel_block[4] __attribute__ ((aligned (16)));

    __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
    __m128 data_block __attribute__ ((aligned (16)));

    __m128 prod __attribute__ ((aligned (16)));
    __m128 acc __attribute__ ((aligned (16)));

    // Reverse the kernel and repeat each value across a 4-vector
    for(int i=0; i<kernel_length; i++){
        kernel_block[0] = kernel[kernel_length - i - 1];
        kernel_block[1] = kernel[kernel_length - i - 1];
        kernel_block[2] = kernel[kernel_length - i - 1];
        kernel_block[3] = kernel[kernel_length - i - 1];

        kernel_reverse[i] = _mm_load_ps(kernel_block);
    }

    for(int i=0; i<length-kernel_length; i+=4){

        // Zero the accumulator
        acc = _mm_setzero_ps();

        /* After this loop, we have computed 4 output samples
         * for the price of one.
         * */
        for(int k=0; k<kernel_length; k++){

            // Load a 4-float data block. This needs to be an unaligned
            // load (_mm_loadu_ps) as we step one sample at a time.
            data_block = _mm_loadu_ps(in + i + k);
            prod = _mm_mul_ps(kernel_reverse[k], data_block);

            // Accumulate the 4 parallel values
            acc = _mm_add_ps(acc, prod);
        }
        _mm_storeu_ps(out+i, acc);

    }

    // Need to do the last value as a special case
    int i = length - kernel_length;
    out[i] = 0.0;
    for(int k=0; k<kernel_length; k++){
        out[i] += in[i+k] * kernel[kernel_length - k - 1];
    }

    return 0;
}

We can easily verify that the output from this routine is identical to that from the naive implementation.

We’ve immediately achieved a substantial speedup, going from 24.5μs to 10.8μs (2.3x).

If we also impose a constraint on the size of the kernel, making it a multiple of 4, we can specify this with a literal, allowing the compiler to partially unroll the inner-most loop.

    for(int i=0; i<length-kernel_length; i+=4){

        acc = _mm_setzero_ps();

        for(int k=0; k<kernel_length; k+=4){

            int data_offset = i + k;

            for (int l = 0; l < 4; l++){

                data_block = _mm_loadu_ps(in + data_offset + l);
                prod = _mm_mul_ps(kernel_reverse[k+l], data_block);

                acc = _mm_add_ps(acc, prod);
            }
        }
        _mm_storeu_ps(out+i, acc);

    }

This tweaks the time a bit lower, down to about 9.9μs per loop.

We’re still not done though! Each load from the dataset is an unaligned load (_mm_loadu_ps). Aligned loads using SSE are significantly faster than unaligned loads. To make the load aligned, the address of the data to be copied in must be a multiple of 16 (with AVX, Intel’s latest vector instruction set incarnation, I believe this has been increased to 32).

Unaligned loads are a necessity in the code so far as we step over every sample, meaning we never know what the byte-alignment of any given block of 4 floats will be. By somehow persuading every load to be aligned, we might be able to get more speed.

We can do this by making 4 copies (well, almost whole copies) of the input array, each with a differing alignment, such that every possible set of 4 floats can be found on the crucial 16-byte alignment.

If we start with the original data, with the numbers showing the sample numbers:
original: [0, 1, 2, 3, 4, 5, ...]

We can create 4 copies as follows:
copy 1: [0, 1, 2, 3, 4, 5, ...]
copy 2: [1, 2, 3, 4, 5, 6, ...]
copy 3: [2, 3, 4, 5, 6, 7, ...]
copy 4: [3, 4, 5, 6, 7, 8, ...]

Since a float is 4 bytes, every possible block of 4 consecutive samples now starts on a 16-byte boundary in one of the copies.

It’s now a simple case of referencing the correct copy of the input array and switching to an aligned load (_mm_load_ps).

int convolve_sse_in_aligned(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    float kernel_block[4] __attribute__ ((aligned (16)));
    float in_aligned[4][length] __attribute__ ((aligned (16)));

    __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
    __m128 data_block __attribute__ ((aligned (16)));

    __m128 prod __attribute__ ((aligned (16)));
    __m128 acc __attribute__ ((aligned (16)));

    // Repeat the kernel across the vector
    for(int i=0; i<kernel_length; i++){
        kernel_block[0] = kernel[kernel_length - i - 1];
        kernel_block[1] = kernel[kernel_length - i - 1];
        kernel_block[2] = kernel[kernel_length - i - 1];
        kernel_block[3] = kernel[kernel_length - i - 1];

        kernel_reverse[i] = _mm_load_ps(kernel_block);
    }

    /* Create a set of 4 aligned arrays
     * Each array is offset by one sample from the one before
     */
    for(int i=0; i<4; i++){
        memcpy(in_aligned[i], (in+i), (length-i)*sizeof(float));
    }

    for(int i=0; i<length-kernel_length; i+=4){

        acc = _mm_setzero_ps();

        for(int k=0; k<kernel_length; k+=4){

            int data_offset = i + k;

            for (int l = 0; l < 4; l++){

                data_block = _mm_load_ps(in_aligned[l] + data_offset);
                prod = _mm_mul_ps(kernel_reverse[k+l], data_block);

                acc = _mm_add_ps(acc, prod);
            }
        }
        _mm_storeu_ps(out+i, acc);

    }

    // Need to do the last value as a special case
    int i = length - kernel_length;
    out[i] = 0.0;
    for(int k=0; k<kernel_length; k++){
        out[i] += in_aligned[0][i+k] * kernel[kernel_length - k - 1];
    }

    return 0;
}

A quick test demonstrates that the speed up from the aligned load overwhelms any increase in time due to extra copies being made.

The time with the same input and kernel is now about 5.4μs.

One final tweak is to fix the length of the kernel loop, replacing kernel_length with a constant. This allows the compiler to unroll the whole of the kernel loop and yields a time of about 4.0μs (code here).

Inspecting the generated assembly for the inner loop suggests that we’ve pretty much reached the limit of what is possible, with most of the time being spent doing an aligned load, followed by a 4-vector multiplication, followed by a 4-vector add – the essence of a convolution.

Running on larger arrays suggests that the speedup scales well (I tested up to 32,768 samples). The repository includes a python script that you can play with to convince yourself.
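
If you don’t fancy compiling anything, a stripped-down sketch of such a timing harness (timing numpy’s own convolution as a stand-in for the C routines – this is not the repository’s actual script) looks something like:

```python
import timeit
import numpy as np

def time_convolve(length, kernel_length=16, repeats=10):
    # Best-of-N timing of a valid-mode convolution, in seconds per call.
    in_ = np.random.randn(length)
    kernel = np.random.randn(kernel_length)
    return min(timeit.repeat(
        lambda: np.convolve(in_, kernel, mode='valid'),
        number=100, repeat=repeats)) / 100

for length in (1024, 4096, 32768):
    print('%d: %.1f us per call' % (length, time_convolve(length) * 1e6))
```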

Posted in Engineering, Programming | 2 Comments

Science in policy, an engineer’s take

I went to a talk on Thursday evening hosted by CSaP. It was Mark Henderson talking about his new book, The Geek Manifesto. The talk was interesting and Mark had lots of good points to make about science in government policy (specifically, the lack of science in policy). His basic points boiled down to:

  1. Science should be better at promoting itself in government policy.
  2. Policy creation should be more scientific in its process.

At the end of the talk I tried to ask a question picking up on something that was raised during the discussion. Mark made the point that science should just be one of the factors that is invoked during policy making, the others being such things as voter wishes, human rights, ethics etc.

In the long and rambling question, I tried and failed to articulate the point that science should be the overarching framework for policy, and that these other factors are just parameters within that framework. Science should not be subservient to other points; rather, these other points need to be made to fit within the scientific policy framework. The question was accompanied by lots of shaking heads and a response that reiterated Mark’s original point. This got me thinking about how better to explain what I am talking about.

The basic point is that to talk about science as being one of several factors in decision making is rather missing the point about science. It is the only reliable way we have of building knowledge about the world. For this reason, it makes perfect sense that any mechanism that attempts to define the world needs to do so from an exclusively scientific perspective. Any attempt to do otherwise is misguided at best and dishonest at worst. Science makes no claims about the nature of the knowledge it discovers, and it makes no claim about the tractability of the discovery process.

The core of it though, and I think the point where I got lost initially, is that policy making isn’t really the acquisition of knowledge (and hence science) at all; at its heart, it is engineering. What we have is a massively multivariate optimisation problem with some poorly defined cost function. It is this cost function that needs to be debated, and it encapsulates all those issues that were argued need to be considered in parallel to science.

The dimensions of the optimisation problem correspond to policy parameters – laws, taxation, incentives etc. The cost function then reflects how those laws are translated into real world consequences, giving a metric of “goodness”. This means that ethics, maintenance of democracy, not locking everyone up etc are necessarily encapsulated in the cost function. I would argue this cost function is something akin to a utilitarian type total happiness metric.

Of course, such a cost function is inordinately difficult both to define and to measure in the most general sense, but I expect the problem can be considered as a whole set of subdomains with reasonable separation – e.g. the economy, social welfare, foreign policy etc.

Science then does have something to say about measuring the cost function. A whole discipline will arise around honestly quantifying the impact of a given policy change. Done properly, slowly but surely a body of knowledge will develop around how certain outputs can be achieved through policy change, as well as a body of skills and knowledge about how to measure and trial policy changes.

Any attempt to say, trample on human rights, is prevented by a huge negative impact to the cost function.

Of course, the real problem, and the essence I think of the political problem that underpins a lack of scientific scrutiny in policy making, is that there is a strong political will to not ever define the cost function carefully. Never properly defining what you’re trying to achieve is one sure way of stopping people telling you you haven’t achieved it.

This brings me to my final point – wouldn’t it be wonderful if politics were no longer about defining policy, but about defining that cost function? Civil servants could then go away and optimise policy to improve the cost function, drawing in new research and optimisation techniques as they become available.

I discussed this at length with my housemate and he made the point that the only cost function that matters is one’s own personal cost function. Whether politicians and civil servants can escape this trap dictates whether such a utopia is possible.

Posted in Engineering, Life | 4 Comments

The Wisdom of FFTW

Since the last post on my python wrappers for FFTW, I’ve advanced the code substantially.

It now supports all the basic transforms using the same unified pythonic interface, in which the transform is inferred from the dtype. In addition, I’ve added support for importing and exporting wisdom. Wisdom is FFTW’s way of remembering plans that have already been created, thereby speeding up the planning process in future. In particular, the slow planning schemes like FFTW_MEASURE will benefit on the first run if the wisdom can be loaded from disk.

The wisdom routines don’t actually write to disk at present; this is because the nice API feature of FFTW that makes this trivial wasn’t added until FFTW 3.3, which is not widely distributed yet. I’ve written the code for this, but it is commented out for now. The wisdom is exported as a tuple of strings, which can be pickled and saved as necessary. I suppose the strings could be written to disk directly too, but I’ve not tried this. There may be some problems with unicode conversions (the output from FFTW is not unicode), but I’m happy to be proven wrong on this.

My next goal is to implement the numpy fft interface, making pyfftw a drop-in replacement for numpy.fft. The one small problem I’ve encountered so far is that numpy.fft will happily run over repeated axes, which FFTW doesn’t seem to like (at least, using my wrappers). I may just ignore this limitation – who is likely to use it anyway? (serious question!)
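
For the curious, the repeated-axes behaviour is plain numpy (nothing pyfftw-specific): fftn simply applies the transform along each listed axis in turn, so a repeated axis means transforming that axis twice.

```python
import numpy as np

a = np.random.randn(8, 8) + 1j * np.random.randn(8, 8)

# fftn over axes [0, 0] is the same as two successive FFTs along axis 0.
assert np.allclose(np.fft.fftn(a, axes=[0, 0]),
                   np.fft.fft(np.fft.fft(a, axis=0), axis=0))
```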

As usual, code on github, release on python package index, and docs here.

Posted in Engineering, Programming | 7 Comments

The joys of Cython, Numpy, and a nice FFTW api

This is about my new FFTW python wrapper.

The FFT routines shipped with Numpy are rather slow and have been the performance bottleneck in my code for some time. Last week I decided I needed to move to FFTW for some of the new code I was writing, at least for the prototyping stage – FFTW is GPL, which limits its use when it comes to distribution (though it is possible to buy a license, and apparently the Intel Math Kernel Library provides an FFTW-compatible API, which makes the code more widely useful).

I looked at an existing set of python wrappers, but didn’t really like the interface. The issues I had with it were as follows:

  1. It carried over the requirement of FFTW that a different library is used for each data type, so a different interface was used for complex64, complex128 and complex256.
  2. It could not handle arbitrary striding of arrays. This rather breaks the wonderful way in which Numpy handles views into memory, whereby sub-arrays can be created that look and work like normal arrays but are not contiguous in memory.
  3. There didn’t seem to be a way to choose arbitrary axes over which to take the DFT. Numpy’s fftn handles this with an axes argument (which is just a list of axes).
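
Point 2 is about numpy’s strided views (a generic numpy illustration, nothing specific to either set of wrappers):

```python
import numpy as np

a = np.arange(16, dtype='complex128').reshape(4, 4)

# A view onto every other column: same memory, non-contiguous strides.
view = a[:, ::2]
assert not view.flags['C_CONTIGUOUS']

# numpy's own FFT handles the strided view transparently; a wrapper
# that insists on contiguous input forces a copy here instead.
assert np.allclose(np.fft.fft(view, axis=0),
                   np.fft.fft(view.copy(), axis=0))
```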

Anyway, the upshot of my difficulties was that I decided to write my own set of wrappers. It also gave me a good little project for working with Cython, which I needed to know about for some other things.

I had the core of what I needed written in a day, solving the second two of the issues above. This was in no small part down to just how fantastically nice Numpy is, as well as the neat fit it has to the guru interface of FFTW. I can only assume that the other wrapper writers didn’t look in much detail at that interface. Basically, there is a clear and simple translation to be made between the strides attribute of a Numpy array and the arguments to the FFTW guru planner. I actually got too confused by the ‘lesser’ advanced interface to do anything useful with it. I think the FFTW people are doing themselves a disservice by calling it the ‘guru’ interface – it sounds hard then!

A bit more work later and I had the unified interface for all the complex Numpy types supported by FFTW (which happens to be all those supported by Numpy on my platform), as well as a pretty comprehensive test suite and documentation. So far, I only have the complex DFT enabled, but the code should be sufficiently flexible to extend easily to the real DFT and other FFTW routines.

The docs are here, the code is here, and the python package index page is here.

One final point… Cython is just wonderful. One can write Python, then C(ish), then Python again, all in the same file, compile it, and end up with a proper python module, all with fantastic distutils support. That is how extensions should be written. Why is anyone still using Matlab?

Edit: In case anyone was wondering, my crude benchmark puts FFTW about 5 times faster than the numpy fft functions.

Another edit: The wrappers are now somewhat more capable, supporting real transforms and multi-threaded mode.

Posted in Engineering, Programming | Tagged , , , , , | 45 Comments

How to peel a beetroot

After you’ve roasted it in foil for an hour, stick a fork in one end, hold it vertically up, and scrape it downwards with a teaspoon.

Posted in Engineering, Food, Life | Leave a comment