First off, I must ask your pardon for the lull in postings - this spring has been insane. It has been great fun - traveling the world, talking about my research and meeting people I previously only knew electronically - but also very intense.
To break the lull, I thought I'd pick up what I did last summer: parallel computing on clusters. There has been a bit of blog chatter about SAGE, and about how SAGE has suddenly transformed from a Really Good Idea into something that is starting to outdo most other systems in usability and flexibility.
Matlab? SciPy comes bundled with SAGE, and the Python integration seems to be at least as good, if not better.
Maple? Mathematica? Maxima? Singular? GAP? SAGE interfaces with all
those that it doesn't emulate.
It also allows you to install and interface with almost anything that has a Python module written for it. Specifically, a SAGE installation comes complete with OpenMPI and MPI4Py, allowing for multi-processor parallelism - either on an SMP machine, or on a cluster running some sort of MPI system. So, using the Python and MPI4Py that arrive with SAGE, it should be possible to start experimenting with parallel programming; and the rest of the SAGE integration should make it easy to start playing with, say, parallel algorithms for commutative or homological algebra.
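Just to show how little ceremony is involved, here is a minimal mpi4py "hello world" sketch - nothing more than the rank and size queries that the Mandelbrot program below also uses:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process' id, running from 0 to size-1
size = comm.Get_size()   # total number of processes started by mpirun

print "Hello from process %d of %d" % (rank, size)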
But as a first step, I thought I'd do a "Hello world" for mathematics: computing the Mandelbrot fractal.
Now, a Mandelbrot fractal is reasonably easy to compute: we step through all pixels in our image, and for each pixel we repeat the transformation [tex]z_{n+1} = z_n^2+c[/tex], with c the point of the complex plane corresponding to the pixel and [tex]z_0=0[/tex], and we see whether this iteration diverges within a limited number of steps.
For benchmarking reasons, I'm going to fix some parameters. I'll use a 600x600 picture, spanning the area [-2.5,1.5]x[-2.0,2.0] of [tex]\mathbb R^2 = \mathbb C[/tex]. I'll allow at most 80 iterations, bailing out as soon as the square of the magnitude grows larger than 4.0.
At the core of the iteration lies the function
def step(z,c):
    return z**2+c
and for a specific point [tex]c=x+yi[/tex], we can compute the number of
iterations needed with the function
def point(c,n,d2):
    zo = 0.0
    zn = zo
    i = 0
    while abs(zn)**2 < d2 and i < n:
        zn = step(zn,c)
        i += 1
    return i
Repeating this for each point in the area we look at - quantized to the granularity our pixels force - and choosing a color for each iteration count, we get the familiar Mandelbrot fractal. I choose to assign colors using the hue in the HSV color space, which gives rich, nice colors with a continuous and neat progression through them. So, given the number of iterations at a point and the maximal number of iterations allowed, we compute the corresponding color by
def colnorm((r,g,b)):
    return (int(255*r),int(255*g),int(255*b))

def col(n,max):
    if n == max:
        return (0,0,0)
    return colnorm(colorsys.hsv_to_rgb(1.0-float(n)/max,1.0,1.0))
The function colnorm is in there to convert RGB triples with values in [0,1] into RGB triples with integer values in [0,255].
And from here it's just a matter of interfacing with the right kind of image writing library - I chose the PIL library - and then going through all pixels, computing the color implied at each pixel, and placing it in the image.
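A minimal sketch of that serial rendering loop - assuming the constants w, h, its, d2, xmin, xmax, ymin and ymax from the full listing further down, the point and col functions above, and a purely illustrative output filename - could look like this:
import Image   # PIL's image module

def render(filename):
    img = Image.new("RGB", (w, h))
    for py in range(h):
        y = ymin + py * (ymax - ymin) / h
        for px in range(w):
            x = xmin + px * (xmax - xmin) / w
            n = point(complex(x, y), its, d2)   # iterations before escape, capped at its
            img.putpixel((px, py), col(n, its))
    img.save(filename)

render("mandel-serial.png")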
Run like this, the benchmark fractal takes some 12 seconds on our
workgroup's workhorse, and 8 seconds on a single node of the Jena
Beowulf cluster.
As for parallelization, the Mandelbrot fractal is a typical example of what's called an embarrassingly parallel problem: each pixel in the image is completely independent of all other pixels, so we could just let a single processor loose on each pixel and get things done a lot faster than if we worked through the pixels one after the other.
So, to parallelize it, we need some way of distributing the work among the processors, and some way of gathering the pieces up once we're done. In MPI4Py, initialization and finalization of the MPI library are taken care of under the hood - at import of the library and at program exit.
My first idea was to use the builtin Scatter and Gather functions. The
idea was something like the following:
def main():
    comm = MPI.COMM_WORLD
    rowids = range(h)
    ids = comm.Scatter(rowids)
    pixs = compute_mandel(ids)
    pixels = comm.Gather(pixs)
    img = generate_image(pixels)
    img.save("filename.png")
but this fails already at the Scatter line. The thing is that you can't just hand Scatter a list of things you want distributed; you need to feed comm.Scatter a list of lists, one for each recipient. This particular property of comm.Scatter is not actually described anywhere I could find on the web - the documentation for MPI4Py is minuscule at best, and since this is specific to the Python interface, the MPI standard documentation doesn't give it away either.
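For what it's worth, here is a hedged sketch of what a working scatter of row indices would look like with the pickle-based, lowercase scatter of more recent mpi4py versions (h being the image height as before):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # one sub-list per recipient: chunk the h row indices into size pieces
    chunks = [range(h)[i::size] for i in range(size)]
else:
    chunks = None

# every process gets exactly one of the sub-lists back
myrows = comm.scatter(chunks, root=0)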
The next problem that appears is that if we just divide the rows of the picture evenly - the first processor gets rows 1,2,3,...,n; the second gets n+1,...,2n and so on - then some processor will get the bulk of the actual Mandelbrot set - which by definition consists of the pixels that take the most iterations to compute - and so the speedup from pouring more processing power onto the problem won't be as impressive as it could be. The processors that take care of regions outside the set will finish quickly and then just idle, while the processors that take care of most of the set will churn through almost as if they were computing the set alone.
There are many ways to deal with this issue. The approach I'm taking is to use an interlacing pattern to distribute the rows: with n processors, each processor computes every nth row of the picture, so easy and difficult rows end up reasonably evenly distributed. Also, with this simple a distribution scheme, there is no reason to compute the row assignments centrally and distribute them with the MPI communication scheme.
Hence, the complete program ends up with the following source code:
from mpi4py import MPI
import Image
import colorsys
from math import ceil
w = 600
h = 600
its = 80
d2 = 4.0
xmax = 1.5
xmin = -2.5
ymax = 2.0
ymin = -2.0
def step(z,c):
    return z**2+c

def point(c,n,d2):
    zo = 0.0
    zn = zo
    i = 0
    while abs(zn)**2 < d2 and i < n:
        zn = step(zn,c)
        i += 1
    return i
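The listing goes on with the colnorm and col coloring helpers unchanged from above, and ends with the MPI-driven main routine that computes the interlaced rows and collects them at process 0. A minimal sketch of such a routine, written against the pickle-based lowercase gather of current mpi4py versions and with a placeholder output filename, could look like this:
def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    # interlacing: process rank handles rows rank, rank+size, rank+2*size, ...
    myrows = {}
    for py in range(rank, h, size):
        y = ymin + py * (ymax - ymin) / h
        myrows[py] = [col(point(complex(xmin + px * (xmax - xmin) / w, y), its, d2), its)
                      for px in range(w)]
    # collect everybody's rows at process 0 and assemble the image there
    allrows = comm.gather(myrows, root=0)
    if rank == 0:
        img = Image.new("RGB", (w, h))
        for part in allrows:
            for py, row in part.items():
                for px, rgb in enumerate(row):
                    img.putpixel((px, py), rgb)
        img.save("mandel.png")

main()
Sending whole rows keyed by row number keeps the bookkeeping trivial, at the cost of some pickling overhead in the gather.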
We can run this program on a single-processor machine by issuing
$SAGE_ROOT/local/bin/mpirun -np 1 $SAGE_ROOT/local/bin/python mandel.py
and on a multiprocessor machine by replacing -np 1 with the number
of processors we'd want to utilize.
On the quad-core workhorse my workgroup uses, which currently has two of its cores fully utilized by group cohomology computations, I get the following (approximate) results:
# proc | mean walltime | std.dev (s)
1      | 12s           | 0.7
2      | 9s            | 0.7
3      | 11s           | 2
So, once I use up the free cores, both the wall time and the spread shoot up. Working on an already-loaded SMP machine might not be the way to go.
But I do have access to the Jena Beowulf cluster, ever since the lecture course on cluster computing I audited, and I'm getting much better results there. There, I run jobs by writing a job script
#$ -j y
#$ -l hostshare=1.0
#$ -q fast.q
mpiexec python mandel.py
and submit it to the job scheduler, to work on - say - 3 processors with
qsub -pe lam7 3 mandel.job
I added commands to time the computation part - stripping out the book-keeping and the image writing that happen once the fractal is computed - and can thus get timings for just the parallel computation.
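Concretely, that timing is just a matter of wrapping the compute loop in MPI's wall-clock timer; a sketch, with compute_my_rows as a hypothetical stand-in for the per-process loop from the program above:
from mpi4py import MPI

t0 = MPI.Wtime()            # wall-clock time before the per-process compute loop
myrows = compute_my_rows()  # hypothetical stand-in for the interlaced Mandelbrot loop
t1 = MPI.Wtime()

print "rank %d: computed its rows in %.3f s" % (MPI.COMM_WORLD.Get_rank(), t1 - t0)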
As a result, I get the following:
# proc | mean walltime (s) | std.dev (s)
1      | 8.33              |
2      | 4.34              | 0.020
3      | 2.96              | 0.033
4      | 2.32              | 0.031
5      | 2.00              | 0.060
6      | 1.54              | 0.019
7      | 1.29              | 0.040
8      | 1.22              | 0.028
9      | 1.04              | 0.017
10     | 1.00              | 0.030