A Personal Case for the EU

Prior to the UK referendum vote to leave the EU on June 23rd, my personally espoused view on the EU was rather academic. My internal argument was that the EU had its problems but was on the whole a positive force: free trade is good, free movement of people and capital is needed to support it, and weighing up all the positives and negatives I judged the EU a net positive. That said, I could understand that others would take a different view, and that perhaps a more global outlook, combined with access to the European free market, would be positive. Ultimately though, I thought the stability of remaining trumped just about everything else.

Waking up to the results on June 24th, I was deeply shocked, saddened and upset. This is not a normal response to a purely academic outcome – I was clearly much more emotionally invested in the concept of the EU than I had realised.

To me, I discovered, the EU is about so much more than its trade agreements and open borders. It is about a shared political identity. The vision of a Europe free from conflict, in which every country works towards goals that matter for the next century and beyond. A vision in which petty tribalism is put aside in favour of our shared humanity. A vision in which divisions across continents are as relevant to the political discussion as divisions across countries.

It was put to me yesterday that as a northerner, I should be more concerned with the North/South divide in the UK than EU membership. Putting aside the obvious irrelevance of the North/South UK divide to the EU issue (though in no way diminishing the importance of the problem), it is precisely this thinking that I detest so much. The problems of inequality and disharmony are magnified enormously on the global scale compared to the UK. It is on the global scale we should seek to address them.

The EU is an imperfect entity (though contrary to popular opinion it has changed over time, largely for the better) that represents and embodies this aspiration. To a large extent, we now have a continent of equal opportunity. For the half a billion EU citizens, there is nothing to stop them seeking the best opportunities across a continent, and they can be confident that fundamental principles of political stability, anti-corruption and environmental standards will be fit for purpose everywhere (and yes, I’m well aware in places this aspiration falls a little short, but it’s only a matter of time before the new member states reach parity).

This makes us all richer. Not just richer in the financial sense, but richer culturally as well as providing us with a richer shared humanity.

I look forward to the day that Turkey (and Russia and …) will join the EU, because then we will have another state able to claim to share the ideals of the EU – peace, the rule of law and outward-looking positivity – and another country that can join the level playing field. Moreover, I look forward to a world in which the state can be considered largely an administrative region and every person enjoys the same rights, freedoms and opportunities as I do in Western Europe. Nationalistic tribalism needs to be a thing of the past. Sure, we can celebrate our cultural distinctions, but they should never be used to divide.

I do not identify as a European because I was born in Europe. I identify as a European because I share a common cause with hundreds of millions of fellow Europeans across a continent in a way that has made great things possible.

Posted in Life, politics | 1 Comment

Running Vivado under Ubuntu 15.10 (Wily Werewolf)

It seems that my fresh shiny installation of Ubuntu Wily isn’t liked much by the recent versions of Vivado. After a little too long, I managed to get Vivado working just fine, though it wasn’t trivial.

Basically, the problem comes down to the fact that Ubuntu 14.04 is the most recent version supported by Vivado. The solution is to run Vivado in a 14.04 chroot, which, with schroot, can be made pretty seamless.

The following shows how I got it working, as much for my reference as anything else. See man pages, as well as the Ubuntu and Debian docs on chroot and schroot for more info on the various aspects of the problem. You’ll obviously need schroot installed as a prerequisite.

We use Vivado 2015.3 here, but anything else should be much the same with suitable tweaks to paths and so on.

Essentially, install a new root file system that can be accessed through a chroot with something like:

sudo debootstrap --variant=buildd --arch amd64 trusty /var/chroot/vivado_chroot http://www.mirrorservice.org/sites/archive.ubuntu.com/ubuntu/

This installs a 64-bit bootstrap installation of Ubuntu 14.04 Trusty Tahr into /var/chroot/vivado_chroot. You can change the installation directory, the arch or the mirror as you require or desire.

Add something like the following to the end of /etc/schroot/schroot.conf, substituting your own username in the users lines:

[vivado-trusty]
description=Vivado inside Ubuntu Trusty
type=directory
directory=/var/chroot/vivado_chroot
users=your_username
root-users=your_username
# Preserve the environment, for e.g. X
preserve-environment=true

Now append the following to your fstab so all the relevant system directories are visible (since the purpose is not specifically to sandbox Vivado, there is no problem with exposing everything through the chroot):

# Chroot setup
/proc /var/chroot/vivado_chroot/proc none rbind 0 0
/dev /var/chroot/vivado_chroot/dev none rbind 0 0
/sys /var/chroot/vivado_chroot/sys none rbind 0 0
/tmp /var/chroot/vivado_chroot/tmp none rbind 0 0
/home /var/chroot/vivado_chroot/home none rbind 0 0
/media /var/chroot/vivado_chroot/media none rbind 0 0

At this point I installed lsb. I’m not sure this is necessary for running Vivado itself, but some of the support tools may fail quietly if it’s not present (specifically, this may be a problem for the license acquisition step). This can be done from the chroot, accessing it as root:

sudo schroot -c vivado-trusty
apt-get install lsb

Now you should have the chroot set up properly and can enter it with schroot -c vivado-trusty.

At this point, I simply used a previously installed version of Vivado, though it would also be a good time to install Vivado from scratch if you haven’t already.

If you use a previously installed version, it needs to be visible to the chroot. In my case it was installed under /home (my root partition being comparatively small), so it was already visible inside the chroot through the mounts set up above.

It was necessary for me to run install_fnp.sh (inside Xilinx/Vivado/2015.3/bin/unwrapped/lnx64.o) which sets up the “trusted” server (or whatever it’s called).

One important thing that proved problematic for the generation of the license was that the ethernet device has a different name in 15.10 (due to some persistent naming changes). The Vivado license manager expects the older style naming of “eth0”, “eth1” etc to look up the MAC address. Without changing the ethernet device name, every attempt to get a new license presented me with a greyed-out box.

I reverted this by adding net.ifnames=0 biosdevname=0 to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, so the line now looks something like:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash net.ifnames=0 biosdevname=0"

and running sudo update-grub. At this point, the license manager was able to generate an html page that allowed me to generate a new license.

This all got Vivado up and running. For a bit of polish I created two additional scripts to make running Vivado simpler. In ~/bin I created a script called vivado_in_env, containing:

#!/bin/bash
(
    source /opt/Xilinx/Vivado/2015.3/settings64.sh
    vivado "$@"
)

and alongside it a script called vivado, containing:

#!/bin/bash
schroot -c vivado-trusty -- vivado_in_env "$@"

chmod both of those to be +x and you can now run vivado from outside the chroot and it will all automagically work. You shouldn’t even notice it’s running in a chroot.

(If you’re interested, the parentheses in vivado_in_env create a sub-shell that prevents the annoying environment leakage that would otherwise break other commands once source /opt/Xilinx/Vivado/2015.3/settings64.sh has been run.)

Posted in Uncategorized | 5 Comments

From tests to specifications (and everything in between)

I came to a really neat realisation a little while ago: when using test-driven development (TDD), the tests for your software define the specification of your software. This may be obvious, but it really highlights the fundamental difference between the agile approach to software and more specification-first approaches (like the waterfall model) – with agile methodologies, the spec of the written software is defined only by the tests. Done properly, there is no disconnect between the specification and the implementation, only between the backlog and the specification.

I’ve recently been looking at behaviour driven development (BDD) and how it fits with TDD and general good practice in development. There’s loads of bluster around BDD and no end of frameworks and new languages that define special new ways of generating your integration tests such that normal people can read the tests and yet they still magically actually perform the test. I was always uncomfortable, though, with the notion that BDD and TDD were actually all that different when it comes down to it; I couldn’t see what was offered by the various tools.

I came across this lovely blog post, which very much clarified my own thoughts (and confirmed a few too). The central point of the post, which I’ll restate here, is BDD is all about communication. Given that, BDD is just TDD with proper communication.

There are TDD purists who will argue that if you’re doing more than unit testing (e.g. you’re doing integration testing too), you’re not doing TDD. I would argue tosh: the whole of your code should benefit from a genuine test driven approach. At the highest level, the tests should be derived from the user stories, and as subsequent requirements are introduced, they are manifested as tests that should be satisfied, which may well in turn depend on further capabilities described by more tests. At some point (depending on the size of the project), you reach the unit test stage. As already described, this hierarchy is then your specification.

So how does one fit BDD into all this? Well, it simply becomes a matter of the process by which tests are defined. A test is always the satisfying of a need, from the highest level that the end user probably cares about (which maps to the user stories), down to the lowest level that the developer cares about (e.g. there should be a frobinator method on this class which frobinates the two inputs). But here’s the point: all the language used to describe those tests is in terms of what behaviour is required, which is exactly what BDD is about!

In Python, my implementation strategy for this is simply to use unittest and then describe every test set and individual test in terms of what the software should do. So something like “There should be a widget which calculates how many tins of custard I need to buy” or “This class should have a dot product method that returns the dot product of itself with an argument.” This makes it crystal clear what the intended behaviour is, and just thinking along these documentation lines clarifies the purpose of the test in a way that never happened when I simply wrote tests to satisfy some hypothetical functionality.
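A minimal sketch of this style (the Vector class and its tests here are hypothetical examples of my own, not from any real project):

```python
import unittest


class Vector:
    """A hypothetical 2D vector used to illustrate the style."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def dot(self, other):
        return self.x * other.x + self.y * other.y


class TestVector(unittest.TestCase):
    """The Vector class should support basic geometric operations."""

    def test_dot_product(self):
        """This class should have a dot product method that returns the
        dot product of itself with an argument."""
        a = Vector(1, 2)
        b = Vector(3, 4)
        self.assertEqual(a.dot(b), 11)
```

The docstrings read as behaviour statements, so running the test suite doubles as checking the software against its plain-English description.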

In effect, the docstring becomes a first class component of the tests. Without the descriptive behaviour driven docstring, the test is invalid.

Now, I think this is great for various reasons, but two are worth pointing out: firstly, it totally removes the disconnect between BDD and TDD and secondly, it acts as a really neat guide for architecting the software. I found myself neatly guided through the design process by describing my software using tests. I think it’s because the process of defining tests forces one to break the expected functionality into chunks. If you find later that your architecture can be improved, then great – isn’t refactoring (which neatly maps to rearchitecting in the general sense) one of the central tenets of TDD?

This all leads to an interesting result, particularly relevant in the context of medical software, which lives in the world of IEC 62304 – a standard that fits neatly with the waterfall model, but not necessarily so well with agile techniques. When one has the testing system described as above, one already has a plain English representation of the specification – and better than that, all the tests that define it, and by extension an implementation of the software that satisfies the spec perfectly! It’s a small step to write a Sphinx plugin that auto-generates the spec on demand.
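As a rough illustration of how such a spec might be pulled out of the tests (the test case and the spec_from_testcase helper are hypothetical sketches of mine; a real Sphinx plugin would hook into its documenter machinery instead):

```python
import unittest


class TestCustardCalculator(unittest.TestCase):
    """There should be a widget which calculates how many tins of
    custard I need to buy."""

    def test_round_up(self):
        """It should round partial tins up to a whole tin."""
        pass


def spec_from_testcase(case):
    """Collect the behaviour-driven docstrings of a TestCase subclass
    into a plain-English specification fragment."""
    # Normalise whitespace so multi-line docstrings read as sentences.
    lines = [' '.join(case.__doc__.split())]
    for name in sorted(dir(case)):
        if name.startswith('test'):
            doc = getattr(case, name).__doc__
            if doc:
                lines.append('  - ' + ' '.join(doc.split()))
    return '\n'.join(lines)


print(spec_from_testcase(TestCustardCalculator))
```

Each test class contributes a heading and each test method a bullet point, so the generated document mirrors the test hierarchy described above.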

There’s an interesting series of blog posts (beginning here) I was reading in which agile methodologies for medical software are discussed. One solution (described as “Agile methods, solution 2”) is supported very nicely by what I describe above.

Anyway, back to writing some tests…

Posted in Engineering, Programming | Leave a comment

Keeping accounts in the cloud

And so it has come to pass that Gnucash no longer suffices as my accounting software. It saddens me as I think the project is great and for many applications it would still be a great solution. There is not one overwhelming flaw with it, but a series of small deficiencies that occur simply because it’s a project run by volunteers with limited time.

The experience of using Gnucash has been great – I know far more about bookkeeping as a consequence of having to think hard about how to do it properly in Gnucash than I expect I would have if I’d jumped straight to a more modern solution. I’ve had a few problems that sunk time, including the handling of VAT just being more effort than it should be, but the final problem that brought me to a decision was the difficulty in exporting a suitable set of reports for the accountant at the end of the year. It eventually worked, but not without much effort and various frustrating bugs in layout. I dare say I could spend a while learning Scheme and writing my own reports, but the time has come to throw money at the problem.

For me, a cloud solution just seems to be the right way to do this. Web based solutions generally provide all the benefits of low up-front cost, device independent access and continuous upgrades along with perfect cross-platform support, which as a Linux user I truly value.

A cursory investigation flagged up three possible options for me as a fledgling micro business (albeit with a few complicated transactions to handle): Freeagent, Xero and Kashflow. Freeagent was the first I looked into – my (naive as it turned out) assumption was that being UK based, Freeagent would better handle the UK tax situation, though given the apparent pervasiveness of Xero, that was the one I tried first.

It was a little frustrating getting Xero set up – primarily because I had to fully grasp the notion of all the start up balances (still learning!), but I got there in a reasonable couple of hours. The online help was really excellent here. Another hour later and I had the historical data since the last year end (actually slightly earlier to begin on a new VAT quarter). What really impressed me was the single click to get the VAT return that I’d spent 3 hours fiddling with in Gnucash earlier (and, pleasingly, it was exactly what I’d written in my return to HMRC!). So much for a non-UK company having difficulty with UK tax.

I’m very impressed with Xero. I initially felt pretty constrained by all the transactions happening through actions (sales, purchases, etc), which was at odds with me being used to the general ledger being the top level concept in Gnucash, which invoices and bills influence. I had to think for a few minutes about how I wanted to do something in this new way, but my conclusion is that Xero actually presents it better. The only accounts that one can easily modify are the bank accounts (though of course, that carries an implicit change in some other account because it is a true double entry system behind the scenes), and I wasn’t convinced at first that was flexible enough. I now think it probably is for almost all the situations I’m ever going to need.

The invoices in Xero are (almost) delightful. Once the data is entered, a beautiful invoice is rendered (with almost no work on my part beyond adding a logo and payment details). One click sends it off to the intended recipient (as well as me if I so desire). I played around a bit with modifying the contacts (the general term for people you might interact with, i.e. an invoice recipient) and invoice settings and they seem really quite flexible – multiple recipients, alternative email content etc. The one issue I had, which may well be down to me not knowing how to do it, was having to enter every line on the invoice afresh – there was no attempt to autofill based on what I’d written on previous invoices or the line above, nor the ability to copy a line. This is something Gnucash does better :).

Overall Xero will fit well with my needs and the interface is simple, clean and intuitive (to me, with a reasonable understanding of double entry bookkeeping).

I thought I ought to compare with an alternative, so I signed up to the free trial of Freeagent. Kashflow I’m sure is great – I was put off by something, though I’ve forgotten what. Kashflow also seems not to do much in the way of bank feeds. Yodlee sounds like a dodgy solution to bank feeds, and that seriously puts me off Freeagent; Xero has a partner agreement with HSBC, so they seem able to do bank feeds “properly”, though Yodlee is an option there too. If I get a chance to play with Kashflow I’ll report back.

The Freeagent sign up process was very quick and simple, but left me wondering how I insert the opening balances or change the chart of accounts. After a bit of searching I found the relevant help page, but I do think that should have been a somewhat guided process at sign-up – the system wasn’t useful to me until that essential information had been entered. It is also possible to change at least some of the accounts – like income and expenditure accounts. I could not find out how to change the asset accounts though. This is substantially less flexible than Xero and seems oddly restrictive.

On the whole Freeagent seems slightly less slick than Xero. For example, it’s very prescriptive in how I need to enter a date. In Xero, I just type 3/4 and it gets interpreted on the fly as 3rd April 2014 (so I could edit it if necessary). In Freeagent, I get a complaint that the date was mis-entered. This is superficially a minor issue, but something that is just unnecessary and IMO shows a lack of attention to detail. It was a more general perception as well – the interface in Xero is a lovely balance of presenting and inferring the right information at the right time; once I’d worked out the general layout, everything seemed to be where I expected it to be. In Freeagent, things just don’t seem as intuitive. That’s probably not a wholly fair assessment, but it was certainly a feeling I had. Despite the two being very similar in the way they operate, I still feel pretty lost in Freeagent. I’ve no doubt that will fade with time, but the feeling went very fast with Xero.

On the whole, I actually think both Xero and Freeagent would be great for my purposes. The extra flexibility of Xero might help in the long run, but many people wouldn’t need that. Despite that, my opinion so far is I’ll be going with Xero. That’s not to say that Freeagent isn’t very good, I just think Xero is fantastic.

Posted in Uncategorized | Tagged , | Leave a comment

pyFFTW coming together

It has been some time since I posted about pyFFTW, the last being well over a year ago when I introduced the wisdom functionality. Despite that, development has been plodding along steadily (and quiet releases made!) and with the 0.9.1 release it has now reached the stage where it satisfies much of my wishlist.

Please go and check out the latest docs for all the features (the tutorial gives a pretty good overview). The main big improvement since the last post is the pretty neat interfaces module, which offers a drop-in replacement for both numpy.fft and scipy.fftpack. Though not all of the functionality of those modules is provided by code within pyFFTW, the namespace is completed by importing parts of numpy.fft or scipy.fftpack as appropriate. There is some slowdown in using pyfftw.interfaces, but this can be largely alleviated with the cache functionality.

Both the above interfaces build on the simplified interface for constructing FFT objects, pyfftw.builders, which is itself of potential interest to end users who want to squeeze out every drop of performance without getting bogged down in the details of creating pyfftw.FFTW objects directly.

In addition, some nice helper functionality has been added to pyfftw.FFTW objects. Mainly, the objects can now be called to yield the suitable output, updating the arrays as desired, with normalisation optionally applied to inverse FFTs. This makes creating and calling a pyfftw.FFTW object as simple as:

>>> import pyfftw
>>> a = pyfftw.n_byte_align_empty(4, 32, dtype='complex128')
>>> fft = pyfftw.builders.fft(a)
>>> a[:] = [1, 2, 3, 4]
>>> fft() # returns the output
array([ 10.+0.j,  -2.+2.j,  -2.+0.j,  -2.-2.j])
>>> fft([5, 6, 7, 8])
array([ 26.+0.j,  -2.+2.j,  -2.+0.j,  -2.-2.j])

Though in the above example, a is set to be aligned to 32 bytes, the builder will by default handle that automatically and the FFTW object will consequently align all the array updates to make sure the alignment is maintained (the optimum alignment is inspected from the CPU). In the above example, the update to the array is not explicitly aligned and is of the incorrect dtype for the FFTW object. The correction is automatically handled by the call.

That is, in a nutshell, most of the important improvements, though there are loads of little usability enhancements and bug fixes. Please, everyone, go and have a play with it and report back bugs to the github page.

Posted in Programming | 8 Comments

Squishing Matlab mex files into Octave

Common wisdom says that the mex files that Matlab builds are good for Matlab, and Matlab only. Not having trivial access to an installation of Matlab and needing to access the very neat and useful Field II package, this was not wholly satisfactory for me.

Field II is a package for simulating an ultrasound field. It doesn’t matter what it does really, except that it only comes with precompiled mex files (no source :(). These mex files (in my case, the 64-bit version with extension .mexa64) are really just poorly disguised elf shared libraries so all the usual library inspection tools work – ldd, nm, objdump, as well as all manner of nefarious library hacking.

It turns out that it actually wasn’t that big a deal to get the library to work. Octave handily provides a Matlab API, so implicitly has a chunk of code that does the right stuff; the only problem then is one of ABI compatibility. Now, a quick glance through the Mex documentation seems to suggest that only standard types, or pointers to exotic types, are passed around on the stack, and also that the Matlab internal objects are opaque, so the ABI for the Octave library is probably going to be compatible (another learning point for me: be nice to all potential users and keep function parameters clean of exotic types that aren’t references – thank you Mathworks!).

Before we even have a hope of linking the library though, we must first find out what’s missing. A peer into the distributed mex file tells us what we need to worry about:

$ ldd Mat_field.mexa64

gives us, among other things, the missing libraries (Mat_field.mexa64 is the mex file of interest):

libmx.so => not found
libmex.so => not found
libmat.so => not found

We can easily create a set of dummy libraries to keep the linker happy with these:

gcc -shared -lc -o libmx.so
gcc -shared -lc -o libmex.so
gcc -shared -lc -o libmat.so

Now we only need to tie it all together with the octave libs whence the missing symbols will come. You can see what octave is doing when it builds an oct file by passing the -v (verbose) flag to mkoctfile. Since we don’t actually need to compile everything, we just want to relink, so we hope the following will work (fortunately, none of the distributed mex files had the .mex extension, so I could claim it myself, otherwise there might well be a name collision that needs working around)…

# Create usual .so file name
ln -s Mat_field.mexa64 libMatfield.so

ld -shared -Bsymbolic -Bsymbolic-functions -o Mat_field.mex -L/usr/lib/x86_64-linux-gnu/octave/3.6.2 -L. -loctinterp -lMatfield -rpath '$ORIGIN'

Unfortunately, the distributed mex file uses a couple of symbols for functions that are not documented by Mathworks. These functions seemed to be version-specific implementations of documented functions, so I just reimplemented them as wrappers of the documented functions (and by extension, those that are implemented in Octave). In my case, the missing functions were mxCreateDoubleMatrix_700 and mxGetString_700 (Octave tells you the missing symbols when you try to use the library). The following code did the trick:

#include "mex.h"

int mxGetString_700(const mxArray *pm, char *str, mwSize strlen)
{
    return mxGetString(pm, str, strlen);
}

mxArray *mxCreateDoubleMatrix_700(mwSize m, mwSize n, mxComplexity ComplexFlag)
{
    return mxCreateDoubleMatrix(m, n, ComplexFlag);
}

Saving that file as undocumented_funcs.c, this can be compiled using the mkoctfile helper program, and then linked in as before.

mkoctfile -c undocumented_funcs.c
ld -shared -Bsymbolic -Bsymbolic-functions -o Mat_field.mex -L/usr/lib/x86_64-linux-gnu/octave/3.6.2 -L. -loctinterp -lMatfield -rpath '$ORIGIN' undocumented_funcs.o

Success! You use it exactly as you would the original mex file. The ‘$ORIGIN’ argument to -rpath means you can twiddle the paths in Octave and it all works fine.

All this was under Linux. I’ve no idea what happens if you try something similar under Windows.

Edit 25/07/2019

It seems a relatively recent change in the way -rpath is handled means that the above approach will result in an error about being unable to find our little placeholder libraries (libmx.so, libmex.so and libmat.so). This is because -rpath now sets the RUNPATH property of the library rather than RPATH. One result of this (aside from the desirable precedence of LD_LIBRARY_PATH) is that dependent libraries do not inherit the additional search location.

The solution is to either force the old behaviour, or to update the rpath on the original library with the following:

patchelf --set-rpath '$ORIGIN' libMatfield.so

This should now work as expected.

Posted in Engineering, Programming | 23 Comments

Speedy fast 1D convolution with SSE

I’ve done a bit of coding in the past with SSE instructions and achieved quite a significant speedup. I’ve also been playing recently with OpenCL as a means of implementing a fast, cross-device version of the Dual-Tree Complex Wavelet Transform (DT-CWT).

Since I lack a GPU capable of supporting OpenCL, I’m rather limited to targeting the CPU. This is a useful task in itself, as I would like a DT-CWT library that is fast on any platform, making use of whatever hardware exists.

OpenCL implements a vector datatype, which the CPU compilers will try to map to SSE instructions. If no vector instructions are available, the compiler should just unroll the vector and perform scalar operations on each element (as would likely happen on most GPUs).

At the core of the DT-CWT is a convolution. The faster the convolution can be done, the faster the DT-CWT is done. This means a good starting point for a DT-CWT implementation is producing an efficient implementation of a convolution.

Given the difficulty of writing and debugging OpenCL, and the need to initially target a CPU, it made sense to me to create, in the first instance, a pure C version of an efficient convolution.

All the code is included on a github repository, with the meat in convolve.c.

As a starting point, we can write a basic naive convolution:

int convolve_naive(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    for(int i=0; i<=length-kernel_length; i++){

        out[i] = 0.0;
        for(int k=0; k<kernel_length; k++){
            out[i] += in[i+k] * kernel[kernel_length - k - 1];
        }
    }
    return 0;
}

This is compiled with the following options:
gcc -std=c99 -Wall -O3 -msse3 -mtune=core2

Running this with a 1024 sample input vector and a 16 sample kernel, it takes about 24.5μs per loop (on my rather old Core2 Duo).

The first modification is to reimplement this with SSE instructions. We can do this by computing 4 output values in parallel. If the kernel is of length N, then we create N 4-vectors. We then copy each element of kernel into each element of the corresponding 4-vector. We can then compute 4 output values in parallel. For this, we require that the input vector is a multiple of 4.

Note that we need to compute the last value in the output separately, as the output length is not a multiple of 4. We could also extend this principle to inputs whose length is not a multiple of 4 (but I don’t here!).

I use the SSE intrinsics (specifically, SSE3), included with
#include <pmmintrin.h>
#include <xmmintrin.h>

The __m128 type is used to denote an SSE 4-vector that can be used with the intrinsics.

int convolve_sse_simple(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    float kernel_block[4] __attribute__ ((aligned (16)));

    __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
    __m128 data_block __attribute__ ((aligned (16)));

    __m128 prod __attribute__ ((aligned (16)));
    __m128 acc __attribute__ ((aligned (16)));

    // Reverse the kernel and repeat each value across a 4-vector
    for(int i=0; i<kernel_length; i++){
        kernel_block[0] = kernel[kernel_length - i - 1];
        kernel_block[1] = kernel[kernel_length - i - 1];
        kernel_block[2] = kernel[kernel_length - i - 1];
        kernel_block[3] = kernel[kernel_length - i - 1];

        kernel_reverse[i] = _mm_load_ps(kernel_block);
    }

    for(int i=0; i<length-kernel_length; i+=4){

        // Zero the accumulator
        acc = _mm_setzero_ps();

        /* After this loop, we have computed 4 output samples
         * for the price of one.
         * */
        for(int k=0; k<kernel_length; k++){

            // Load 4-float data block. This needs to be an unaligned
            // load (_mm_loadu_ps) as we step one sample at a time.
            data_block = _mm_loadu_ps(in + i + k);
            prod = _mm_mul_ps(kernel_reverse[k], data_block);

            // Accumulate the 4 parallel values
            acc = _mm_add_ps(acc, prod);
        }
        _mm_storeu_ps(out+i, acc);
    }

    // Need to do the last value as a special case
    int i = length - kernel_length;
    out[i] = 0.0;
    for(int k=0; k<kernel_length; k++){
        out[i] += in[i+k] * kernel[kernel_length - k - 1];
    }

    return 0;
}

We can easily verify that the output from this routine is identical to that from the naive implementation.

We’ve immediately achieved a substantial speedup, going from 24.4μs to 10.8μs (2.3x).

If we also impose a constraint on the size of the kernel, making it a multiple of 4, we can specify this with a literal, allowing the compiler to partially unroll the inner-most loop.

    for(int i=0; i<length-kernel_length; i+=4){

        acc = _mm_setzero_ps();

        for(int k=0; k<kernel_length; k+=4){

            int data_offset = i + k;

            for (int l = 0; l < 4; l++){

                data_block = _mm_loadu_ps(in + data_offset + l);
                prod = _mm_mul_ps(kernel_reverse[k+l], data_block);

                acc = _mm_add_ps(acc, prod);
            }
        }
        _mm_storeu_ps(out+i, acc);
    }


This tweaks the time a bit lower, down to about 9.9μs per loop.

We’re still not done though! Each load from the dataset is an unaligned load (_mm_loadu_ps). Aligned loads using SSE are significantly faster than unaligned loads. To make the load aligned, the address of the data to be copied in must be a multiple of 16 (with AVX, Intel’s latest vector instruction set incarnation, I believe this has been increased to 32).

Unaligned loads are a necessity in the code so far as we step over every sample, meaning we never know what the byte-alignment of any given block of 4 floats will be. By somehow persuading every load to be aligned, we might be able to get more speed.

We can do this by making 4 copies (well, almost whole copies) of the input array, each with a differing alignment, such that every possible set of 4 floats can be found on the crucial 16-byte alignment.

If we start with the original data, with the numbers showing the sample numbers:
original: [0, 1, 2, 3, 4, 5, …]

We can create 4 copies as follows:
copy 1: [0, 1, 2, 3, 4, 5, …]
copy 2: [1, 2, 3, 4, 5, 6, …]
copy 3: [2, 3, 4, 5, 6, 7, …]
copy 4: [3, 4, 5, 6, 7, 8, …]

Since a float is 4 bytes, this means that every block of 4 consecutive samples now starts on a 16-byte boundary in exactly one of the copies.

It’s now a simple case of referencing the correct copy of the input array and switching to an aligned load (_mm_load_ps).

int convolve_sse_in_aligned(float* in, float* out, int length,
        float* kernel, int kernel_length)
{
    float kernel_block[4] __attribute__ ((aligned (16)));
    float in_aligned[4][length] __attribute__ ((aligned (16)));

    __m128 kernel_reverse[kernel_length] __attribute__ ((aligned (16)));
    __m128 data_block __attribute__ ((aligned (16)));

    __m128 prod __attribute__ ((aligned (16)));
    __m128 acc __attribute__ ((aligned (16)));

    // Repeat the kernel across the vector
    for(int i=0; i<kernel_length; i++){
        kernel_block[0] = kernel[kernel_length - i - 1];
        kernel_block[1] = kernel[kernel_length - i - 1];
        kernel_block[2] = kernel[kernel_length - i - 1];
        kernel_block[3] = kernel[kernel_length - i - 1];

        kernel_reverse[i] = _mm_load_ps(kernel_block);
    }

    /* Create a set of 4 aligned arrays
     * Each array is offset by one sample from the one before
     * */
    for(int i=0; i<4; i++){
        memcpy(in_aligned[i], (in+i), (length-i)*sizeof(float));
    }

    for(int i=0; i<length-kernel_length; i+=4){

        acc = _mm_setzero_ps();

        for(int k=0; k<kernel_length; k+=4){

            int data_offset = i + k;

            for (int l = 0; l < 4; l++){

                data_block = _mm_load_ps(in_aligned[l] + data_offset);
                prod = _mm_mul_ps(kernel_reverse[k+l], data_block);

                acc = _mm_add_ps(acc, prod);
            }
        }

        _mm_storeu_ps(out+i, acc);
    }

    // Need to do the last value as a special case
    int i = length - kernel_length;
    out[i] = 0.0;
    for(int k=0; k<kernel_length; k++){
        out[i] += in_aligned[0][i+k] * kernel[kernel_length - k - 1];
    }

    return 0;
}

A quick test demonstrates that the speedup from the aligned loads overwhelms any increase in time due to the extra copies being made.

The time with the same input and kernel is now about 5.4μs.

One final tweak is to fix the length of the kernel loop, replacing kernel_length with a constant. This allows the compiler to unroll the whole of the kernel loop and yields a time of about 4.0μs (code here).

Inspecting the generated assembly of the inner loop suggests that we’ve pretty much reached the limit of what is possible, with most of the time being spent doing an aligned load, followed by a 4-vector multiplication, followed by a 4-vector add – the essence of a convolution.

Running on larger arrays suggests that the speedup scales well (I tested up to 32,768 samples). In the repository is a Python script that you can play with to convince yourself.

Posted in Engineering, Programming | 3 Comments

Science in policy, an engineer’s take

I went to a talk on Thursday evening hosted by CSaP. It was Mark Henderson talking about his new book, The Geek Manifesto. The talk was interesting and Mark had lots of good points to make about science in government policy (specifically, the lack of science in policy). His basic points boiled down to:

  1. Science should be better at promoting itself in government policy.
  2. Policy creation should be more scientific in its process.

At the end of the talk I tried to ask a question picking up on something that was raised during the discussion. Mark made the point that science should just be one of the factors that is invoked during policy making, the others being such things as voter wishes, human rights, ethics etc.

In the long and rambling question, I tried and failed to articulate the point that science should be the overarching framework for policy, and that these other factors are just parameters within that framework. Science should not be subservient to other points; rather, these other points need to be made to fit within the scientific policy framework. The question was accompanied by lots of shaking heads and a response that reiterated Mark’s original point. This got me thinking about how better to explain what I am talking about.

The basic point is that to talk about science as being one of several factors in decision making rather misses the point about science. It is the only reliable way we have of building knowledge about the world. For this reason, it makes perfect sense that any mechanism that attempts to define the world needs to do so from an exclusively scientific perspective. Any attempt to do otherwise is misguided at best and dishonest at worst. Science makes no claims about the nature of the knowledge it discovers, and it makes no claim about the tractability of the discovery process.

The core of it though, and I think the point where I got lost initially, is that policy making isn’t really the acquisition of knowledge (and hence science) at all; at its heart, it is engineering. What we have is a massively multivariate optimisation problem with some poorly defined cost function. It is this cost function that needs to be debated, and it encapsulates all those issues that were argued need to be considered in parallel to science.

The dimensions of the optimisation problem correspond to policy parameters – laws, taxation, incentives etc. The cost function then reflects how those laws are translated into real world consequences, giving a metric of “goodness”. This means that ethics, maintenance of democracy, not locking everyone up etc are necessarily encapsulated in the cost function. I would argue this cost function is something akin to a utilitarian type total happiness metric.

Of course, such a cost function is something that is inordinately difficult to both define and measure in the most general sense, but I expect the problem can be considered as a whole set of subdomains with reasonable separation – e.g. the economy, social welfare, foreign policy etc.

Science then does have something to say about measuring the cost function. A whole discipline will arise around honestly quantifying the impact of a given policy change. Done properly, slowly but surely a body of knowledge will develop around how certain outputs can be achieved through policy change, as well as a body of skills and knowledge about how to measure and trial policy changes.

Any attempt to say, trample on human rights, is prevented by a huge negative impact to the cost function.

Of course, the real problem – and, I think, the essence of the political problem that underpins a lack of scientific scrutiny in policy making – is that there is a strong political will to never define the cost function carefully. Never properly defining what you’re trying to achieve is one sure way of stopping people from telling you that you haven’t achieved it.

This brings me to my final point – wouldn’t it be wonderful if politics became no longer about defining policy, but about defining that cost function. Civil servants could then go away and optimise policy to improve the cost function, drawing in new research and optimisation techniques as they become available.

I discussed this at length with my housemate, and he made the point that the only cost function that matters is one’s own personal cost function. Whether politicians and civil servants can escape this trap would dictate whether such a utopia is possible.

Posted in Engineering, Life | 5 Comments

The Wisdom of FFTW

Since the last post on my python wrappers for FFTW, I’ve advanced the code substantially.

It now supports all the basic transforms using the same unified pythonic interface, in which the transform is inferred from the dtype. In addition, I’ve added support for importing and exporting wisdom. Wisdom is FFTW’s way of remembering plans that have already been created, thereby speeding up the planning process in future. In particular, the slow planning routines like FFTW_MEASURE will benefit on their first run if the wisdom can be loaded from disk.

The wisdom routines don’t actually write to disk at present; this is because the nice API feature of FFTW that makes this trivial wasn’t added until FFTW 3.3, which is not widely distributed yet. I’ve written the code for this, but commented it out for now. The wisdom is exported as a tuple of strings, which can be pickled and saved as necessary. I suppose the strings could be written to disk directly too, but I’ve not tried this. There may be some problems with unicode conversions (which the output from FFTW is not), but I’m happy to be proven wrong on this.

My next goal is to implement the numpy fft interface, making pyfftw a drop-in replacement for numpy.fft. The one small problem I’ve encountered so far is that numpy.fft will happily run over repeated axes, which FFTW doesn’t seem to like (at least, using my wrappers). I may just ignore this limitation – who is likely to use it anyway? (serious question!)

As usual, code on github, release on python package index, and docs here.

Posted in Engineering, Programming | 12 Comments

The joys of Cython, Numpy, and a nice FFTW api

This is about my new FFTW python wrapper.

The FFT routines shipped with Numpy are rather slow and have been the performance bottleneck in my code for some time. Last week I decided I needed to move to FFTW for some of the new code I was writing, at least for the prototyping stage – FFTW is GPL, which limits its use when it comes to distribution (though it is possible to buy a license, and apparently the Intel Math Kernel Library uses the FFTW API, which means the code is more widely useful).

I looked at an existing set of python wrappers, but didn’t really like the interface. The issues I had with it were as follows:

  1. It carried over the requirement of FFTW that a different library is used for each data type, so a different interface was used for complex64, complex128 and complex256.
  2. It could not handle arbitrary striding of arrays. This rather breaks the wonderful way in which Numpy can handle views into memory, in which sub-arrays can be created that look and work like normal arrays, but whose dimensions are not contiguous in memory.
  3. There didn’t seem to be a way to choose arbitrary axes over which to take the DFT. Numpy’s fftn handles this with an axes argument (which is just a list of axes).

Anyway, the upshot of my difficulties was that I decided to write my own set of wrappers. It also gave me a good little project for working with Cython, which I needed to know about for some other things.

I had the core of what I needed written in a day, solving the second two of the issues above. This was in no small part down to just how fantastically nice Numpy is, as well as the neat fit it has to the guru interface to FFTW. I can only assume that the other wrapper writers didn’t look in too much detail at that interface. Basically, there is a clear and simple translation to be made between the strides attribute of a Numpy array and the arguments to the FFTW guru planner. I actually got too confused by the ‘lesser’ advanced interface to do anything useful with it. I think the FFTW people are doing themselves a disservice by calling it the ‘guru’ interface – it makes it sound hard!

A bit more work later and I had the unified interface for all the complex Numpy types supported by FFTW (which happens to be all those supported by Numpy on my platform), as well as a pretty comprehensive test suite and documentation. So far, I only have the complex DFT enabled, but the code should be sufficiently flexible to extend easily to the real DFT and other FFTW routines.

The docs are here, the code is here, and the python package index page is here.

One final point… Cython is just wonderful. One can write in Python, and C(ish), and Python again, all in the same file, then compile it and end up with a proper Python module, all with fantastic distutils support. That is how extensions should be written. Why is anyone still using Matlab?

Edit: In case anyone was wondering, my crude benchmark puts FFTW about 5 times faster than the numpy fft functions.

Another edit: The wrappers are now somewhat more capable, supporting real transforms and multi-threaded mode.

Posted in Engineering, Programming | Tagged , , , , , | 45 Comments