I needed a method for copying data from one Numpy array to another such that the byte alignment is correct. This is so that I can use SSE instructions properly on the resultant arrays (64-bit machines seem to reliably return arrays with the correct 16-byte alignment, so this problem is largely 32-bit specific). I would have thought that copying an array in Numpy byte by byte would take much the same amount of time regardless of the method used. The following suggests that this isn't true.

Let's say I have an array of 8-byte doubles of size 584×256 (an apparently arbitrary size that happens to be the size of the datasets I'm working with). I want to copy those arrays into a new array with the correct alignment, such that the first byte of the new array lies on a 16-byte boundary. This should ideally correct for an arbitrary byte offset.
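As a sketch of what "correcting an arbitrary byte offset" means in practice, here is a hypothetical helper (not from the original post) that over-allocates a byte buffer and slices into it at whatever offset puts the data pointer on a 16-byte boundary:

```python
import numpy as np

def aligned_copy(b, alignment=16):
    """Copy `b` into a fresh buffer whose first byte lies on an
    `alignment`-byte boundary (hypothetical helper for illustration)."""
    n = b.nbytes
    # Over-allocate so that some offset < alignment is guaranteed to work.
    raw = np.zeros(n + alignment, dtype=np.int8)
    offset = (alignment - raw.ctypes.data % alignment) % alignment
    dest = raw[offset:offset + n]
    # Byte-for-byte copy of b's data into the aligned slice.
    dest[:] = b.view(np.int8).ravel()
    return dest.view(b.dtype).reshape(b.shape)

x = np.random.randn(584, 256)
y = aligned_copy(x)
print(y.ctypes.data % 16)   # 0: the copy starts on a 16-byte boundary
```

The over-allocation by `alignment` bytes is what makes an arbitrary prior misalignment correctable: whatever address the allocator hands back, one of the first 16 byte offsets lands on a boundary.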

Here are four different methods for copying an array. The first three do as required and copy the array byte by byte with an arbitrary offset less than 16; the fourth method copies the array without changing the type (which is a reasonable method if the data item size is a factor of 16 and the prior alignment is a multiple of that item size).

The timer objects are defined as follows:

from timeit import Timer

timer_a = Timer(stmt="numpy.frombuffer(a.data, dtype='int8')[offset:offset-16]=numpy.frombuffer(b.data, dtype='int8')[:]", setup='import numpy; b = numpy.random.randn(584,256); a = numpy.zeros(149504+2); offset=8')

timer_b = Timer(stmt='a.data[offset:offset-16]=b.data[:]', setup='import numpy; b = numpy.random.randn(584,256); a = numpy.zeros(149504+2); offset=8')

timer_c = Timer(stmt='a.data[offset:offset-16]=b.data', setup='import numpy; b = numpy.random.randn(584,256); a = numpy.zeros(149504+2); offset=8')

timer_d = Timer(stmt='a[1:-1]=b.flatten()[:]', setup='import numpy; b = numpy.random.randn(584,256); a = numpy.zeros(149504+2); offset=8')

The only difference between timer_b and timer_c is the inclusion of the [:] after b.data within the stmt string. The timers are run with:

timer_a.timeit(number=1000)/1000
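Before looking at the numbers, it's worth a quick sanity check that the byte-wise copy in timer_a's statement really reproduces the data (this uses the same array sizes and offset as the setup strings above):

```python
import numpy as np

b = np.random.randn(584, 256)
a = np.zeros(149504 + 2)   # two spare doubles = 16 spare bytes
offset = 8

# The timer_a statement: copy b's bytes into a's buffer, shifted by `offset`.
np.frombuffer(a.data, dtype='int8')[offset:offset - 16] = \
    np.frombuffer(b.data, dtype='int8')[:]

# Reinterpret the destination bytes as doubles and compare.
copied = np.frombuffer(a.data, dtype='int8')[offset:offset - 16].view(np.float64)
print(np.array_equal(copied, b.ravel()))   # True
```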

timer_a and timer_c come out pretty much the same (although they didn't seem to initially when I started writing this post!), at about 0.17 ms per copy. timer_b, with its oh-so-minor change over timer_c, takes about 0.87 ms per copy, which is a pretty substantial difference. This minor change doesn't seem to have the same impact on timer_a.

timer_d comes out at about 0.4ms per copy.

Changing the offset to be some number that isn’t a multiple of 8 slows down the copies, as might be expected.
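This is consistent with where the destination actually starts: with the byte view from timer_a, the destination begins exactly `offset` bytes past a's base address, so an offset that isn't a multiple of 8 leaves each double straddling a word boundary. A quick check (the base address itself is an allocator detail, but the relative offset is deterministic):

```python
import numpy as np

a = np.zeros(149504 + 2)
base = a.ctypes.data
for offset in (8, 3):
    dest = np.frombuffer(a.data, dtype='int8')[offset:offset - 16]
    # The destination starts exactly `offset` bytes past a's base address.
    print(offset, dest.ctypes.data - base)   # prints "8 8" then "3 3"
```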

Interesting.

This post is probably mostly just for my benefit to remind myself what I did. It doesn’t seem quite as profound now as it did when I started!