Parallel and cluster computing with MPI4Py

First off, I'd ask your pardon for the lull in postings - this spring has been insane. It has been very much fun - traveling the world, talking about my research and meeting people I only knew electronically - and also very intense.

To break the lull, I thought I'd try to pick up what I did last summer: parallel computing on clusters. It's been a bit of blog chatter about SAGE and how SAGE suddenly has transformed from a Really Good Idea to something that starts to corner out most other systems in usability and flexibility.

Matlab? SciPy bundled with SAGE and the Python integration seems to be at least as good, if not better.
Maple? Mathematica? Maxima? Singular? GAP? SAGE interfaces with all those that it doesn't emulate.

And also, it allows you to install and interface with almost anything that has a Python module written. Specifically, a SAGE installation comes complete with OpenMPI and MPI4Py, allowing for multi-processor parallelity - either on an SMP machine, or on a cluster that runs some sort of MPI system. So, using the Python and MPI4Py that arrives with SAGE, it should be possible to start experimenting with parallel programming; and the rest of the SAGE integration should make it easy to start playing with, say, parallel algorithms for commutative or homological algebra.

But as a first step, I thought I'd do a "Hello world" for mathematics. Computing the Mandelbrot fractal.

Now, a Mandelbrot fractal is reasonably easy to compute: we step through all pixels in our image, and we repeat the transformation [tex]z_{n+1} = z_n^2+c[/tex] for c the complex plane point corresponding to the pixel, and [tex]z_0=0[/tex]. We see whether this iteration diverges within a limited amount of steps.

I'm going to, for benchmarking reasons, fix some parameters. I'll have a 600x600 picture, spanning the area [-2.5,1.5]x[-2.0,2.0] of [tex]\mathbb R^2 = \mathbb C[/tex]. I'll allow 80 iterations, and after that, I want to see whether the square of the magnitude is larger than 4.0.

At the core of the iteration lies the function

def step(z,c):
        return z**2+c

and for a specific point [tex]c=x+yi[/tex], we can compute the number of iterations needed with the function

def point(c,n,d2):
        zo = 0.0
        zn = zo
        i = 0
        while abs(zn)**2 < d2 and i
Repeating this for each point in the span we look at, modified for the granularity our pixels force, and choosing a color for each iteration number, we'll get the familiar Mandelbrot fractal. I choose to assign colors using the hue in the HSV color space. This gives us rich, nice colors with a continuous and neat progression through them. So, given the number of iterations at a point, and the maximal number of iterations allowed, we compute the corresponding color by

def colnorm((r,g,b)):
        return (int(255*r)-1,int(255*g)-1,int(255*b)-1)

def col(n,max):
        if n == max:
                return (0,0,0)
        return colnorm(colorsys.hsv_to_rgb(1.0-float(n)/max,1.0,1.0))

the function colnorm is in there to convert RGB triples with values in [0,1] to RGB triples with values in [0,255].

And from here it's just a matter of interfacing with the right kind of image writing library - I chose the PIL library - and then go through all pixels, computing the color implied at each pixel, and placing it in the image.

Run like this, the benchmark fractal takes some 12 seconds on our workgroup's workhorse, and 8 seconds on a single node of the Jena Beowulf cluster.

The resulting fractal is
Mandelbrot fractal

As for parallelization, the Mandelbrot fractal is a typical example of what's called an embarrassingly parallel problem: each pixel in an image is completely independent from all other pixels, and so we could just let one single processor loose on each pixel, and get things done a lot faster than if we'd work through the pixels one after the other.

So, to parallelize it, we'll need some way of distributing the work among the processors, and some way of gathering the pieces up once we're done. In MPI4Py, initialization and finalization of the MPI library is taken care of under the hood, in the import of the library and in the closing of the program.

My first idea was to use the builtin Scatter and Gather functions. The idea was something like the following:

def main():
   comm = MPI.COMM_WORLD
   rowids = range(h)
   ids = comm.Scatter(rowids)
   pixs = compute_mandel(ids)
   pixels = comm.Gather(pixs)
   img = generate_image(pixels)"filename.png")

but this fails already at the Scatter line. The thing is that you can't just hand Scatter a list of things you want distributed, you'll need to feed comm.Scatter with a list of lists, one for each recipient. This particular property of comm.Scatter is not actually described anywhere I could find on the web - the documentation for MPI4Py is minuscule at best, and this is specific to the Python interface setup, so the MPI standard documentation doesn't give it away either.

The next problem that appears is that if we just divide the rows of the picture evenly - first processor gets rows 1,2,3,...,n; second gets n+1,...,2n and so on - then some processor will get the bulk of the actual Mandelbrot set - which is by definition the pixels that take the most iterations to compute - so the speedup by pouring more processing power on the problem won't be as impressive as it could be. The processors that take care of regions outside the set will finish quickly and then just idle, and the processors that take care of most of the set will churn through as if they had computed the set alone.

There are very many ways to deal with this issue. The approach I'm taking is to use an interlacing pattern to distribute the rows. With n processors, each processor computes every n rows of the picture, thus getting easy and difficult rows reasonably evenly distributed. Also, with this easy a distribution scheme, there's no reason to compute the rows centrally and distributing them with the MPI communication scheme.

Hence, the complete program ends up with the following source code:

from mpi4py import MPI
import Image
import colorsys
from math import ceil

w = 600
h = 600
its = 80
d2 = 4.0

xmax = 1.5
xmin = -2.5
ymax = 2.0
ymin = -2.0

def step(z,c):
        return z**2+c

def point(c,n,d2):
        zo = 0.0
        zn = zo
        i = 0
        while abs(zn)**2 < d2 and i
We can run this program on a single-processor machine by issuing

$SAGE_ROOT/local/bin/mpirun -np 1 $SAGE_ROOT/local/bin/python

and on a multiprocessor machine by replacing -np 1 with the number of processors we'd want to utilize.

On the quad kernel workhorse my workgroup uses, which currently has two processors fully utilized by group cohomology computations, I get the following results (approximates):

# proc mean walltime
1 12s 0.7
2 9s 0.7
3 11s 2

So - once I used up the free processors, the wall time shoots up, and the spread shoots up. So working on a utilized SMP machine might not be the way to go.

But I do have access to the Jena Beowulf cluster, since the lecture course I audited on cluster computing. And I'm getting much better results there. Over there, I run jobs by writing a job script

#$ -j y
#$-l hostshare=1.0
#$-q fast.q

mpiexec python

and submit it to the job scheduler, to work on - say - 3 processors with

qsub -pe lam7 3 mandel.job

I added commands to time the computation part - stripping out the book-keeping and image writing once the fractal is computed, and thus can get timings for the parallel computation.

As a result, I get the following:

# proc mean walltime
1 8.33
2 4.34 0.020
3 2.96 0.033
4 2.32 0.031
5 2.00 0.060
6 1.54 0.019
7 1.29 0.040
8 1.22 0.028
9 1.04 0.017
10 1.00 0.030