The joys of Cython, Numpy, and a nice FFTW api

This is about my new FFTW python wrapper.

The FFT routines shipped with Numpy are rather slow and have been the performance bottleneck in my code for some time. Last week I decided I needed to move to FFTW for some of the new code I was writing, at least for the prototyping stage – FFTW is GPL, which limits its use when it comes to distribution (though it is possible to buy a license, and apparently the Intel Math Kernel Library implements the FFTW api, which means code written against FFTW is more widely useful).

I looked at an existing set of python wrappers, but didn’t really like the interface. The issues I had with it were as follows:

  1. It carried over the requirement of FFTW that a different library is used for each data type, so a different interface was used for complex64, complex128 and complex256.
  2. It could not handle arbitrary striding of arrays. This rather breaks the wonderful way Numpy handles views into memory, where a sub-array can be created that looks and works like a normal array even though its dimensions are not contiguous in memory.
  3. There didn’t seem to be a way to choose arbitrary axes over which to take the DFT. Numpy’s fftn handles this with an axes argument (which is just a list of axes). (A short Numpy snippet illustrating points 2 and 3 follows this list.)
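To make points 2 and 3 concrete, here is the kind of thing plain Numpy already does happily (this is just Numpy itself, not any wrapper):

    import numpy as np

    a = np.zeros((128, 128), dtype='complex128')
    v = a[::2, 32:96]    # a non-contiguous view: every other row, a column slice
    print(v.strides)     # the strides no longer look like those of a fresh array

    # fftn transforms over whichever axes you ask for
    spectrum = np.fft.fftn(v, axes=(0,))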

Anyway, the upshot of my difficulties was that I decided to write my own set of wrappers. It also gave me a good little project for working with Cython, which I needed to know about for some other things.

I had the core of what I needed written in a day, solving the second two of the issues above. This was in no small part down to just how fantastically nice Numpy is, as well as the neat fit it has to the guru interface of FFTW. I can only assume that the other wrapper writers didn’t look in much detail at that interface. Basically, there is a clear and simple translation to be made between the strides attribute of a Numpy array and the arguments to the FFTW guru planner. I actually got too confused by the ‘lesser’ advanced interface to do anything useful with it. I think the FFTW people are doing themselves a disservice by calling it the ‘guru’ interface – the name just makes it sound hard!
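As a rough illustration of that translation (just a sketch of the idea, not the actual wrapper code; the helper name and the choice to reuse the input strides for the output are mine):

    import numpy as np

    def strides_to_guru_dims(arr, axes):
        # Build (n, input_stride, output_stride) triples in units of elements,
        # which is essentially what the FFTW guru planner wants. For simplicity,
        # this sketch assumes the output has the same strides as the input.
        itemsize = arr.itemsize
        dims = []      # the axes that are transformed
        howmany = []   # the remaining axes, which FFTW loops over
        for ax in range(arr.ndim):
            triple = (arr.shape[ax], arr.strides[ax] // itemsize,
                      arr.strides[ax] // itemsize)
            (dims if ax in axes else howmany).append(triple)
        return dims, howmany

    a = np.empty((64, 128), dtype='complex128')
    view = a[::2, 10:74]    # a strided, non-contiguous view
    print(strides_to_guru_dims(view, axes=(1,)))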

A bit more work later and I had the unified interface for all the complex Numpy types supported by FFTW (which happens to be all those supported by Numpy on my platform), as well as a pretty comprehensive test suite and documentation. So far, I only have the complex DFT enabled, but the code should be sufficiently flexible to extend easily to the real DFT and other FFTW routines.

The docs are here, the code is here, and the python package index page is here.

One final point… Cython is just wonderful. One can write in python, and C (ish) and python again all in the same file, and then compile it, and end up with a proper python module, all with fantastic distutils support. That is how extensions should be written. Why is anyone still using Matlab?

Edit: In case anyone was wondering, my crude benchmark puts FFTW about 5 times faster than the numpy fft functions.

Another edit: The wrappers are now somewhat more capable, supporting real transforms and multi-threaded mode.

About Henry Gomersall

I'm an engineer wanting to make things actually useful. I'm someone who wants to drive technology and ideas to be helpful for everyone. I'm someone who realises the disconnect between technology and utility, and I hope to correct that...

45 Responses to The joys of Cython, Numpy, and a nice FFTW api

  1. AMCDawes says:

    Excellent timing on this post, I just started to port some of my old c code to python for simulating beam propagation. It is mostly fft’s (and inverse fft’s) and since I used FFTW in the c code, I wanted to use it in python as well.

    • AMCDawes says:

      Just to follow up, I’ve installed pyFFTW on both macs that I use (an older macpro and a new macbook pro). The install is straightforward once the FFTW libraries are in place (these compile easily with gcc).

      However, I only see a small speed improvement over numpy. For example: “test_time (__main__.Complex64FFTWTest) … One run: 31.06 ms (versus 32.39 ms for numpy.fft)” is a typical ratio. What architecture were you using to see a ~5x speedup? Do you happen to know which FFTW compile options were used for your libraries? I’ve been playing with the various options and haven’t seen a drastic change from these initial results.

  2. AMCDawes says:

    Wow, 5ms would be great. I’ve done it with SSE, SSE2 and threaded. So far they only differ by a few ms from the numpy version. This is a dual boot mac with Ubuntu on the other (better?) half, so I’ll try it on the Ubuntu side just to see. I imagine there is another magic compiler flag I’m missing. I see similar times (25-30 ms) on both my macbook pro (core 2 duo) and an 8-core Xeon Mac Pro (2008).

    • That’s frustrating. Let me know what the Ubuntu result looks like…

      Have you tried the fink binaries? http://www.fftw.org/install/mac.html

    • Have you made any progress on this?

      • AMCDawes says:

        Yes, I had to install everything on the ubuntu side of my machine (it didn’t have scipy/numpy etc). I’m using the intel compiler just on a whim so it’s taken a bit to sort out the various ways to do that correctly.

        My first run of your test suite (just moments ago) shows 6.76 ms (compared to 20.73 ms for numpy)!

        That is more like what I expected. This is a bad experiment in one sense because I changed platforms and compilers, but it sounds like your results are comparable using gcc? I guess I’m running my code in linux from now on (not that I needed another reason).

        • I’d still be really interested to know what happens with a Mac when you use the fink version of fftw. I’d like to know whether it works or not! If you get a chance, your input would be much appreciated! (I don’t have a mac to play with)

          • AMCDawes says:

            I tested macports first since it is easier to install (Fink on Lion is compile-from-source only). More information about macports is here: http://www.macports.org/ – there are three packages to install: fftw-3, fftw-3-long, and fftw-3-single. This puts all of the relevant libraries in /opt/local/lib on a mac.

            The pyFFTW benchmark: One run: 6.15 ms (versus 26.22 ms for numpy.fft)

            This is now comparable to what I saw in Ubuntu with locally-compiled FFTW libraries. Macports lists their compile options for FFTW so I may try to compile locally again with those settings as a sanity check.

            Caveat: my earlier results were tested in 32-bit python 2.7.2 and these new results are in 64-bit python 2.7.2 (both from the EPD suite). Perhaps that has a bigger impact than the compiler flags! I should have realized this earlier.

            • 32-bit windows shows a substantial difference as well, so I doubt it’s entirely down to the number of bits! Anyway, it’s good to know it’s possible to get it working. 🙂 Thanks!

  3. Pingback: Fast Python Fourier Transforms on Mac « The Daily Photon

  4. Rich says:

    Hey, this is great, thanks for writing this! Any plans on implementing importing/exporting of FFTW wisdom?

    • Yeah, I’ve just pushed updates to github for importing and exporting of wisdom. I can’t push the new packages to pypi at the moment as there seems to be a bug with the ssh upload stuff and I can’t be bothered to work around it right now. I suggest building from the github source if you need it immediately.

      • Also, I have the code mostly written (but commented out) for doing export to and import from files directly. I haven’t included it as it requires FFTW 3.3 which is not widely distributed yet. If you need to export to a file, I suggest pickling the exported object. I haven’t tried to see if the string can be saved and then loaded with the FFTW file wisdom functions. There may be some subtlety in that the string is not a unicode string, so certain operations on it may cause breakage, but I haven’t experimented.
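        For what it’s worth, the pickling route looks something like this (a minimal sketch; the filename is arbitrary):

        import pickle
        import pyfftw

        # export_wisdom() returns a tuple of strings, one per precision
        wisdom = pyfftw.export_wisdom()
        with open('fftw_wisdom.pickle', 'wb') as f:
            pickle.dump(wisdom, f)

        # ...and then in a later session:
        with open('fftw_wisdom.pickle', 'rb') as f:
            pyfftw.import_wisdom(pickle.load(f))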

      • Rich says:

        That was really quick. Thanks again!

  5. Pingback: The Wisdom of FFTW | Engineering the world

  6. 4fire says:

    Hello, I am starting Python for Computer Vision and I think at this time, Python 3.2 is good. I know FFTW is a very important library for CV and Image Processing. Could you show me how to build your pyFFTW with Python 3.2, or release a pyFFTW built with Python 3.2?

  7. 4fire says:

    Yes, I tried and encountered an error:
    File “setup.py”, line 81
    print “build %s from template %s” %(outfile, tmplfile)
    Syntax error: invalid syntax

    • What did you try?

      So, to build for windows, do the following:

      Download the source, either from pypi, or from github (but you’ll need to cythonise the code first from github).

      Download the fftw dlls for your system and the header file from here (they’re in a zip file): http://www.fftw.org/install/windows.html and place them in the pyfftw directory. The files are libfftw3-3.dll, libfftw3l-3.dll, libfftw3f-3.dll and libfftw3.h.

      Then run:
      python setup.py bdist_msi
      or if you got the code from github:
      python cython_setup.py bdist_msi

      That should place a .msi file in the dist directory, which you should be able to install.
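      Incidentally, the error you pasted comes from the Python 2 style print statement on line 81 of setup.py, which Python 3.2 rejects (the setup scripts currently target Python 2). Purely for illustration, the two forms are:

      # Python 2 syntax, as used in setup.py – a SyntaxError under Python 3:
      print "build %s from template %s" % (outfile, tmplfile)
      # The Python 3 equivalent would be:
      print("build %s from template %s" % (outfile, tmplfile))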

  8. 4fire says:

    Thank you. I did as you said but encountered an error:
    error: Unable to find vcvarsall.bat.
    Could you help me? And I wonder, if it is easy for you, why don’t you release a compiled version of pyFFTW built with Python 3.2?

    • You’ll need to get your compiler set up. I strongly recommend using mingw in conjunction with msys. Read this (specifically the section on non-MS compilers) and then this (specifically the comments). I’m not sure it’s necessary to force the compiler to mingw if you only have that installed, but YMMV.

      Also, would you mind awfully replying to the thread rather than creating a new one. Cheers!

    • In answer to your second question, it’s not easy for me. Although 3.2 is released, it’s currently more important to track 2.7, which I do and will do. I don’t use 3.2 under Windows at all (and barely at all with 2.7), and so it’s something else that I need to maintain and think about (and I have enough of those!). If you’d like to pay me to maintain a 3.2 version, I will certainly do so.

      • 4fire says:

        OK, I am very sorry if I disturbed you that much. I read all the links you showed me but the result is the same. I use 32-bit Windows and on my system there is only MinGW with msys. I ran python to build pyFFTW both from the Windows command line and from within msys, but the results are the same. Maybe it is not the right time for Python 3.2 to welcome pyFFTW. Thank you anyway.

        • It’s ok, I don’t mind you asking, just telling it like it is! But seriously, feel free to report back on the build if you get it working.

          The problem you are encountering is that distutils is trying to use the MSVC compiler, which it can’t find. Google is your friend on this one (I expect you haven’t followed point 4, though you’ll need to change the path so it uses the 3.2 installation of python, and ignore point 5 as it’s irrelevant).

          The point is, so far this isn’t a problem with the extension, but with your ability to build it. I’m happy to help, but you need to make some effort yourself. I’ll have a look at python 3.2 next week if I get a chance.

          I accept that there is a bit of a learning curve, but it’s all good stuff!

  9. Jhl says:

    Could you please make a build of pyFFTW for Python 2.6? And is there anything special I need to do if I attempt to build it myself?

    • Is this for Windows? Once all the build system is set, it’s not difficult to build, but it’s a bit of a faff to set up if you’re not familiar. Under Linux, it’s trivial.

      The easiest way to get it working (imo) is to install mingw. Take a look at some of the other links and comments I posted in comments up the page in response to 4fire. Saves me repeating myself 🙂

    • I’ll try to add a 2.6 version for the next release. I haven’t tested it yet, but I think it will probably work.

  10. Great work! I got a 5x speedup too. I was agonising about how to deal with the three precisions (single, double, quad) in wisdom, and I see your wrappers do it all! This is a much nicer and more complete wrapper than the other pyfftw3, and so far only loses 10us for the convenience.
    I have an optimisation problem in relation to spectrograms (e.g. matplotlib.specgram). The problem here is many small FFTs, e.g. 4096 x 512 samples out of a 1M sample array (allowing for overlap). This would be great to do with multiple cores, but initialization overhead means that 512-sample FFTs show no gain in multicore.
    So the solution is to break the 1M sample array into (say) 4 chunks and dispatch chunks to 4 cores. This brings me to the question – how to share these numpy arrays between processes efficiently –
    ipython> time a = multiprocessing.sharedctypes.RawArray('f', np.zeros(2**20)) takes 0.5 sec
    which is longer than all the ffts etc. Other Process options want to pickle the args, which is just as slow or worse.

    • I unfortunately don’t have much experience of the multiprocessing package. However, this might actually be a situation in which python threading is a reasonable approach. In a nutshell, with the current code base, if you create a new FFTW object for each thread of execution, and enable threads in pyfftw (by passing a value greater than 1 to the threads argument), you should be able to call each object with different arrays and everything should work as expected (the caveats being: 1. the outputs don’t clobber each other and 2. the transform does not overwrite the input). I say should because I’ve never actually tried it (feel free to write and submit a test suite for this!).

      It’s necessary to make the threads argument greater than 1 in order that the GIL be released during execution, which means each Python thread will continue whilst another is executing an FFT. On reflection, this is not ideal as it might be the case that you want FFTW to internally run single threaded, but in multiple Python threads (as you’ve highlighted!). I’ve submitted this as an issue to fix at some point.

      I’d be very interested to know how doing it this way compares to doing each FFT serially but with the FFTW object threads argument maxed out (or more).

      Again, although FFTW supports concurrent execution of the same plan with different arrays, the pyfftw.FFTW objects keep track internally of the arrays that are used, so at least one FFTW object is required per concurrent thread.

      Any questions, please ask. I’ll be fascinated to know how you get on with this (as well as any resultant bug reports).
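      Something along these lines is what I have in mind (an untested sketch; the shapes, alignment and thread counts are just placeholders):

      import threading
      import numpy as np
      import pyfftw

      n_workers = 4
      shape = (64, 1024)

      # One FFTW object, with its own input/output arrays, per Python thread.
      # threads > 1 means the GIL is released while FFTW executes.
      plans = []
      for _ in range(n_workers):
          a = pyfftw.n_byte_align_empty(shape, 16, dtype='complex128')
          b = pyfftw.n_byte_align_empty(shape, 16, dtype='complex128')
          plans.append((a, b, pyfftw.FFTW(a, b, axes=(1,), threads=2)))

      data = (np.random.randn(n_workers, *shape)
              + 1j * np.random.randn(n_workers, *shape))
      results = [None] * n_workers

      def worker(i):
          a, b, fft = plans[i]
          a[:] = data[i]          # each thread only touches its own arrays
          fft()
          results[i] = b.copy()

      workers = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
      for t in workers:
          t.start()
      for t in workers:
          t.join()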

    • The other point to think about when you get into hardcore optimisations is how you deal with the SSE alignment. Experimentation shows that in some instances it is faster to copy the array into aligned memory and do the FFT on that than to construct an unaligned FFTW object. If you use the __call__ interface of an FFTW object that was constructed with an aligned input array (and so uses SSEn where it can), then the input array will automatically be copied in if it’s found not to be aligned. To have more precise control over this, you’d need to use the update_arrays method (which will raise an exception if the new array is not aligned correctly).

      I spent quite a bit of time thinking about this in the context of AVX as well (which is also supported by FFTW). Imagine your input array is aligned on either 16 or 32-byte boundaries, but the FFTW object was constructed with 32-byte aligned arrays (and so supports AVX). The best policy is not clear; I can imagine that it might be better to copy and continue to use AVX. Equally, it might be better to enforce only SSE (by constructing with 16, but not 32-byte aligned arrays) and use the same plan with no copies for all arrays. I guess the only solution is to be aware of all this when trying to do this kind of optimisation and do some tests to see.
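      In code, the aligned-versus-unaligned distinction looks roughly like this (a sketch only; the size is a placeholder):

      import numpy as np
      import pyfftw

      # Plan with 16-byte aligned arrays so the plan can use SSE.
      a = pyfftw.n_byte_align_empty(4096, 16, dtype='complex128')
      b = pyfftw.n_byte_align_empty(4096, 16, dtype='complex128')
      fft = pyfftw.FFTW(a, b)

      new_input = np.random.randn(4096) + 0j   # alignment of this array is not guaranteed

      if pyfftw.is_n_byte_aligned(new_input, 16):
          # Aligned: swap it in directly with no copy.
          fft.update_arrays(new_input, b)
          fft.execute()
      else:
          # Not aligned: the __call__ interface copies it into the aligned internal array.
          fft(new_input)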

      • Thanks for your advice about the GIL and nogil! I successfully applied it to achieve a 2.5x speed improvement of a number cruncher, using threading and wrapping the key loop in cython with nogil.
        It was a way to ease into cython, and something I needed in clustering a 20M instance data set with 16 numerical attributes (16 dimensional distance calculation).
        Because of the 1GB+ array, it was much easier to multithread than try to pass the data.
        Hope to apply the same idea to multiple ffts eventually, although it turns out that the fft part is only a small fraction of the time, once I used fftw3.

        • Great! 🙂 You know, there is little point in rewriting the FFT stuff if that’s what you’re thinking of. I suspect that the time penalty in reclaiming the GIL in order to execute python code will be small compared to the time spent outside of it (though, as always, the proof is in the testing!).

  11. Marko Havu says:

    Thanks for making pyFFTW available! It’s a great improvement over NumPy’s fftpack, which is really slow on large datasets. For example, I have a list of four NumPy float32 arrays with lengths 504001, 819001, 2457001, and 1575001. With numpy.fft.rfft the FFT of these takes almost 4 hours. With Mathematica, the same transformation takes less than 8 seconds. Replacing numpy.fft.rfft with pyfftw.interfaces.numpy_fft.rfft gives the transformation in 23 minutes, which is a tenfold improvement in performance. That’s pretty good, but there’s still some way to go to match Mathematica (or MATLAB, which has similar performance and uses FFTW3).

    Is there a way to get still better performance out of pyFFTW? I think it might be an alignment issue. Since pyFFTW seems to call is_n_byte_aligned and n_byte_align_empty, I first thought that can’t be the case, but then I noticed that pyfftw.simd_alignment returns 4, even though I compiled FFTW3 with SSE2 support. Any ideas?

    • Hi Marko. It sounds like something funny is going on with numpy.fft.rfft on your machine; you should not be seeing that kind of slowdown!
      Are you sure that most of that time is not the planning stage? Have you enabled any flags? As a starting point, I would try saving the wisdom: use pyfftw.export_wisdom() to get it and then save a pickle of the resultant tuple, importing the wisdom again with pyfftw.import_wisdom(). Repeated calls should also be much faster during one session if it’s the planning stage that’s taking a while. Also, make sure you enable the cache if you’re using pyfftw.interfaces.
      If you really want raw speed, use pyfftw.FFTW objects directly (they can be created with pyfftw.builders). This gives you much tighter control over the creation of arrays and so on. I intended pyfftw.interfaces for use in legacy situations or where one didn’t want to maintain any state.
      Matlab uses FFTW behind the scenes, so I would expect pyfftw to be at least as fast as Matlab (I think the Matlab code is akin to pyfftw.interfaces).
      Your point about simd_alignment is interesting. pyfftw.simd_alignment actually finds the alignment by querying the cpuid register, so it has nothing to do with FFTW. If your CPU has SSE and you’re getting a simd_alignment of 4, this is a bug and I’d be much obliged if you could send me some more info (CPU, OS etc).
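      Concretely, the two approaches look something like this (a minimal sketch; the array length just echoes your example):

      import numpy as np
      import pyfftw
      import pyfftw.interfaces.cache
      import pyfftw.interfaces.numpy_fft
      import pyfftw.builders

      # The cache keeps the underlying FFTW objects alive between calls,
      # so the planning cost is only paid once per shape/dtype.
      pyfftw.interfaces.cache.enable()

      x = np.random.randn(504001).astype('float32')
      X = pyfftw.interfaces.numpy_fft.rfft(x)   # first call plans; later calls reuse the plan

      # For tighter control, build an FFTW object once and call it repeatedly:
      rfft_obj = pyfftw.builders.rfft(x)
      X2 = rfft_obj()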

      • Marko Havu says:

        Thank you for the tips. All I did so far was substitute pyfftw.interfaces.numpy_fft.rfft for numpy.fft.rfft. I haven’t tried any flags yet. I’ll try tweaking those and saving the wisdom first, but I don’t mind going the pyfftw.FFTW way either.

        I’m running 64-bit Python 2.7.3 and GCC 4.2.1 on a Mac with Mac OS X 10.6.8 and Intel Core 2 Duo CPU. I can email you the cProfile output, if that is of any use.

        • I’m not sure what the problem with the cpu_id lookup is. Would you mind running a couple of quick tests on your machine for me? Using https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/cpu.h as a template, create a couple of small test C files. The first should take lines 24 to 67 and create a bit of output using printf (and of course #include ) depending on the tests (lines 62, 64, 66).
          The second should check whether the preprocessor has any of the necessary variables defined on line 23 – so just a simple bit of code that prints some output surrounded by that #if statement.

          That would be really helpful! I don’t have access to any kind of Mac hardware. Cheers!

          • Marko Havu says:

            On i386, __i386__ is defined, and on x86_64, __amd64__ is defined. Your simd_alignment function seems to return 16 in both cases, as it should.

            • hmmm but not when you use it from within pyfftw? (it just returns an integer that it looks up from that function)

              • Marko Havu says:

                That seems to be the case. I tried it also with all the gcc options distutils uses, but I still get 16 from simd_alignment directly and 4 from pyfftw.simd_alignment in Python. Beats me.

                • Well, thanks for pointing this out and your help in trying to find out what’s going on. I’ll try to borrow a Mac to play with.

                • Hi Marko. I had a friend try to replicate the simd_alignment issue with no success (it worked just fine). Without ready access to a Mac that fails I’m somewhat stymied. If you want to do some more investigations, that would be much appreciated, but I don’t think it’s crucial given recent systems seem to work fine.

      • Marko Havu says:

        Quick update: after exporting and importing the wisdom, the FFT takes only about 3.2 seconds! Does pyFFTW use the system wisdom file, if it exists?

        • No, not at present. Though it would be simple to add that capability, I’m not entirely sure I like the idea (pyfftw is very much a user level tool designed to abstract away the FFTW backend). I have been meaning to add the write to file capability – the code is written but I delayed exposing it because it requires a later version of FFTW than was common at the time. Perhaps now is a reasonable time…

          I’m pleased you’ve got an appreciable speedup! 🙂
