Tuesday, June 4, 2013

Faster image filters in Python with Parakeet

In the early days of computer vision — before Big Data, small data, or even very much data at all — it was popular to manually construct vision algorithms out of neighborhood operations. A neighborhood operation is an image transformation which replaces each pixel with some simple function of the surrounding pixels' values. For example, if you replace each pixel with the median of its neighborhood, you get back an image which looks similar but is somewhat less detailed and (usefully) less noisy. Alternatively, instead of the median value, you can replace each pixel with the minimum or maximum of its neighbors. These two operations turned out to be useful enough to warrant getting their own names. Windowed maximum, which smears bright parts of the image and makes them grow outward, is called dilation. Windowed minimum, which makes images look pitted and creepy, is called erosion.

Dog with low self-regard Dilation Erosion

Using only successive applications of dilation and erosion it is possible to express a wide array of interesting image transformations. The composition of these two operators was considered important enough to warrant becoming a field unto itself called mathematical morphology. Morphological image transformations are still widely used for preprocessing, feature extraction, and if you squint you'll even see their likeness in the max-pooling operation used for subsampling by convolutional neural networks.

Python Implementation

Now that I've hopefully convinced you of their usefulness, let's see how to implement morphological image filters in Python. As a first cut, let's just loop over all the positions (i,j) in an image and at each pixel we'll loop over the surrounding k×k square of pixels to pick out the maximum. Technically, we could use a more complicated neighborhood (i.e. a surrounding circle of pixels or really any shape at all), but the humble square will do for this post.

Simple enough, but how quickly does it run? Anyone who has tried writing numerically intensive code in pure Python knows the answer will probably fall somewhere between "not well" and "would have been faster on pen and paper". Running this algorithm on a 1024×768 image with a 7×7 neighborhood on my MacBook Air takes almost half a minute. Pathetic!

Python programmers have become trained to reach for a compiled library whenever they encounter a significant performance bottleneck. In this case, SciPy comes with a full suite of morphological operators that have been implemented efficiently in C. Performing the same dilation using SciPy takes only 34 milliseconds. That's an approximately 750X performance gap between Python and native code.

Runtime Compilation à la Parakeet

So, Python's bytecode interpreter is hopelessly slow, but is there any alternative to relying on lower-level languages for performance? In my ideal world, all of the code I write would sit comfortably inside a .py file. Toward this end, I've been working on Parakeet, a runtime compiler which uses Python functions as a template for dynamically generated native code. By specializing functions for distinct argument types and then performing high-level optimizations on the typed code, Parakeet is able to run code orders of magnitude faster than Python. It's important to note that Parakeet is not capable of speeding up or even executing arbitrary programs — it's really a runtime compiler for an array oriented subset of Python. Luckily, all of the operations used to implement dilation (array indexing, looping, etc..) fit within this subset.

To get Parakeet to accelerate a Python function you have to:

  • Wrap that function in Parakeet's @jit decorator.
  • Once wrapped, any calls to the function get routed through Parakeet's type specializer, creating different versions of the function for distinct input types.
  • Each typed version of a function is then extensively optimized and turned into native code using LLVM.

If I stick an @jit decorator above dilate_naive, I find that it now runs in only 51 milliseconds. That is, almost 500X faster than Python, though still a lot slower than the SciPy implementation. Not bad for code that at least looks like it's written in Python.

Can We Get Faster Than SciPy?

If you poke around the source of SciPy morphology operators, you'll discover that SciPy doesn't actually use the naive algorithm above. SciPy's sagacious authors took note of the fact that windowed minima/maxima are separable filters, meaning they can be computed more efficiently by first performing 1D transformations on all the rows of the image, followed by 1D transformations on all the columns. Below is a Python implementation of this two-pass dilation which only inspects 2k neighboring pixels per output pixel, unlike the less efficient code above which has to look at k2 neighbors.

When I wrap this version with Parakeet's @jit decorator, it only takes 17 milliseconds. That's 2X faster than the precompiled SciPy version!

Comparison with Numba and PyPy

Out of curiosity, I wanted to see how Parakeet's performance compares with the better known projects that are attempting to accelerate Python: Numba and PyPy.

Numba is a runtime compiler which also aims to speed up numerical code by type specializing functions for distinct argument types. Like Parakeet, Numba uses LLVM for code generation on the back end and aims to speed up algorithms which consume and manipulate NumPy arrays. Numba is more ambitious in that it seeks to support all of Python, whereas Parakeet carves a subset of the language which is amenable to efficient compilation.

Switching from Parakeet to Numba at first seemed simple, merely a matter of replacing @parakeet.jit with @numba.autojit. However, though the dilation benchmark managed to execute successfully, it was slower than CPython! To add insult to injury, Numba's compilation times were an order of magnitude slower than Parakeet. Since I know that Numba's authors are no chumps, I emailed Mark Florisson to figure out how I was misusing their compiler.

It turns out that when Numba is confronted with any construct it doesn't support directly, it switches to a (very) slow code path. From my perspective, it seemed like it had taken a half-day off and gone for a long lunch. Parakeet is very different in this regard: if Parakeet can't execute something efficiently it complains loudly during compilation and gives up.

In this case, it was the min and max builtins which were giving Numba trouble. Once I changed the filter implementation to use an inline if-expression instead — xrange(stop_i if stop_i >= m else m) instead of xrange(min(stop_i, m)) — then Numba ran with compile and execution times uncomfortably close to Parakeet's. I'm feeling the heat.

PyPy differs dramatically from both Parakeet and Numba in that rather than generating native code from within the Python interpreter it is a completely distinct implementation of the Python language. PyPy uses trace compilation to generate native code. Getting numerical code to run in PyPy can be tricky since it currently only supports a rudimentary subset of NumPy via a reimplementation called NumPyPy. The project seems to be making steady progress, but due to the daunting scope of NumPy, there's still a lot of basic functionality missing.

The source for the timings below is in the Parakeet repository. Parakeet and Numba both perform type specialization and compilation upon the first function invocation, so that time is show below separate from the execution time of a second function call. CPython's bytecode compilation is so fast as to be negligible, so I didn't even attempt to time it. Finding out how long PyPy takes to generate code from a hot path would be interesting but I have no idea how to access that kind of information, so I also left PyPy's compile times as "n/a".

Algorithm Compile Time Execution Time
SciPy Separable O(k) n/a 34ms
CPython Naive O(k2) n/a 25,480ms
PyPy Naive O(k2) n/a 657ms
Numba Naive O(k2) 286ms 61ms
Parakeet Naive O(k2) 180ms 51ms
CPython Separable O(k) n/a 7,724ms
PyPy Separable O(k) n/a 429ms
Numba Separable O(k) 407ms 19ms
Parakeet Separable O(k) 238ms 17ms

Parakeet wins on performance over Numba by the thinnest margin. When Mark (the Numba developer) ran the same benchmark on a different computer, he actually saw Numba coming in slightly ahead. Maybe the difference doesn't even rise above the level of statistical noise? Either way, both Numba and Parakeet seem like reasonable choices for generating type-inferred native code from Python functions.

A crucial distinction between the two projects is that Parakeet has been designed as a domain specific language. Parakeet is embedded array-oriented language within Python but is not and never will be itself Python. If you wanted to, for example, create user-defined objects inside of compiled code, with Parakeet you're out of luck. The advantage of this approach is that Parakeet can guarantee that whatever it compiles will have reasonably good performance. Numba, on the other hand, can technically execute arbitrary Python but still has some implicit language boundaries demarcating what will or won't run efficiently.

Wither Parallelism?

A nice property of image filtering algorithms that I have completely ignored in this post is that they are usually perfectly parallelizable. Parakeet started out as a parallelizing compiler and only recently got stripped down to generating single core code. Parallelism is coming back in a cleaner form, so the next post will hopefully be about doing image filtering in Parakeet an order of magnitude (or two) faster.

For now, you can try out Parakeet and it might speed up your code. Or it might fail mysteriously — it's still a work in progress and you've been warned!

14 comments:

  1. Have you tried shedskin compiler on your code?

    https://code.google.com/p/shedskin/

    I would be interested in the results.

    ReplyDelete
  2. Hey phsorx,

    I haven't tried shedskin, though I'm curious how both shedskin and Nuitka would perform. My suspicion is they would do very badly until they start doing something intelligent to unbox scalars instead of representing them as PyObjects. I'm leaving my computer for a few days but I'll check out the static Python compilers when I get back.

    ReplyDelete
    Replies
    1. actually, shedskin never boxes a scalar..

      Delete
  3. This looks good, I'm interested in something like this that would provide image effects + work with cairo images (precalc BGRA).

    ReplyDelete
  4. Do you have an array version lying around? I'd like to try it in NumPyPy.

    ReplyDelete
    Replies
    1. I'm not sure what you mean by an "array version". As for NumPyPy, check out the example code I linked to.

      Delete
  5. This is very nice work and a great comparison of approaches. It's nice to have other projects to compare to that are recognizing the power of using LLVM from CPython. Perhaps we can collaborate more in the future.

    One correction to make. Numba does have a "no_python" mode which seems to be exactly the spirit of Parakeet --- complain loudly if you have to resort to using Python objects in the middle.

    Does your "jit" do run-time detection of types like Numba's autojit? It would be interesting to compare type-inference approaches between Numba and Parakeet.

    ReplyDelete
    Replies
    1. Hey Travis,

      Thanks a lot! Sorry that I didn't know about no_python when I wrote the post. I agree that the spirit is the same.

      As for type inference: Parakeet uses an abstract interpreter to infer local & return types from inputs and specializes function bodies for those inferred types. Parakeet also does shape inference in a similar fashion.

      Delete
  6. Didn't know about the "no_python" mode -- thanks!

    ReplyDelete
  7. GPUs have specific hardware optimizations for neighborhood operations and parallel reductions. Check out PyCUDA and PyOpenCL. Also the copperhead library is pretty amazing! It compiles Python into an AST and then into a CUDA function and comes with some great real-world examples.

    https://github.com/copperhead/copperhead

    ReplyDelete
    Replies
    1. Hi Michel,

      PyCUDA is great-- but it's very low level compared with Parakeet & Numba. You still usually end up having to write CUDA code in an inline string.

      Copperhead, on the other hand, is very neat, albeit a little limited semantically (i.e. purely function, HM type inference, etc..).
      I wrote a little bit about it in this older post: http://www.phi-node.com/2013/01/a-journey-through-parakeet.html

      Delete
  8. My first question would be why your version beats scipy's by a factor of two.

    Is it because scipy version is more generic?

    ReplyDelete
    Replies
    1. Hey Usov,

      From a brief exploration of the Scipy source it seems that the call chain is:

      1) morphology.py: grey_dilation
      2) filters.py: min_or_max_filter
      3) nd_image.c: Py_MinOrMaxFilter1D
      4) ni_filters.c: NI_MinOrMaxFilter1D

      That last function is where the actual work happens. Aside from some copies to buffers it seems to do essentially the same as the Numba/Parakeet code above:

      /* iterate over all the array lines: */
      do {
      /* copy lines from array to buffer: */
      if (!NI_ArrayToLineBuffer(&iline_buffer, &lines, &more))
      goto exit;
      /* iterate over the lines in the buffers: */
      for(kk = 0; kk < lines; kk++) {
      /* get lines: */
      double *iline = NI_GET_LINE(iline_buffer, kk) + size1;
      double *oline = NI_GET_LINE(oline_buffer, kk);
      for(ll = 0; ll < length; ll++) {
      /* find minimum or maximum filter: */
      double val = iline[ll - size1];
      for(jj = -size1 + 1; jj <= size2; jj++) {
      double tmp = iline[ll + jj];
      if (minimum) {
      if (tmp < val)
      val = tmp;
      } else {
      if (tmp > val)
      val = tmp;
      }
      }
      oline[ll] = val;
      }
      }
      /* copy lines from buffer to array: */
      if (!NI_LineBufferToArray(&oline_buffer))
      goto exit;
      } while(more);

      Delete
    2. On the first look there isn't anything dramatically wrong with this code.

      Additional "if(minimum)" should be handled by branch prediction machinery in the CPU.
      Another difference is that scipy version allocates line buffers and copies data to/from them while parakeet version must be doing calculations in-place. But this shouldn't result in factor of two difference in the performance.

      Can it be the difference between optimization levels? I imagine parakeet uses instruction set of the CPU it's running on, while distro package is compiled for i686(?).

      Delete