Suck-less scientific Python

Part 2: Efficient number-crunching

By Dion, Tue 29 November 2016, in category Blog

Python, Science

This example shows how simple Fortran-style loops over arrays can be implemented efficiently in Python. Since many Python operations carry a significant overhead compared to a low-level language as e.g. C or Fortran, Python is often generally classified as slow. However, satisfying speeds can be achieved if the Python code makes use of packages like Cython, Numba, or NumPy. Also, it is important to keep in mind that Python is considerably easier to learn and faster to write than e.g. C. In many situations, slower runtimes can be compensated by faster development times (which is a good thing, since your time is usually more valuable than your computer’s).

As an example for tight, simple loops, we will implement a numerical solver for a two-dimensional shallow-water model using finite differences. Explicit finite differences require multiple loops over the entire domain in each time step, so the complexity of the algorithm is at least $n^2$ in two spatial dimensions.

The Model

The unforced shallow-water equations for geostrophic flow in 2 dimensions read:

$$\frac{\partial}{\partial t} u = -g \frac{\partial}{\partial x}h$$$$\frac{\partial}{\partial t} v = -g \frac{\partial}{\partial y}h$$$$\frac{\partial}{\partial t} h + \frac{\partial}{\partial x} (hu) + \frac{\partial}{\partial y} (hu) = 0$$

where $u$ denotes the velocity field in $x$-direction, $v$ the velocity in $y$-direction, and $h$ the layer height of the water column. $g$ is the gravitational acceleration.

These equations are discretized by

$$ u^{n+1}_{i,j} = u^n_{i,j} - g\frac{\Delta t}{\Delta x} (h_{i,j+1} - h_{i,j})$$$$ v^{n+1}_{i,j} = v^n_{i,j} - g\frac{\Delta t}{\Delta x} (h_{i+1,j} - h_{i,j})$$$$ h^{n+1}_{i,j} = h^n_{i,j} - \frac{\Delta t}{\Delta x} (f^e_{i,j} - f^w_{i,j}) - \frac{\Delta t}{\Delta y} (f^n_{i,j} - f^s_{i,j})$$

with

$$f^e_{i,j} = u^+_{i,j} h_{i,j} + u^-_{i,j} h_{i,j+1}$$$$f^n_{i,j} = v^+_{i,j} h_{i,j} + v^-_{i,j} h_{i+1,j}$$$$f^w_{i,j} = u^+_{i,j} h_{i,j-1} + u^-_{i,j} h_{i,j}$$$$f^s_{i,j} = v^+_{i,j} h_{i-1,j} + v^-_{i,j} h_{i,j}$$

and

$$u^+_{i,j} = 0.5 (u_{i,j} + |u_{i,j}|)$$$$u^-_{i,j} = 0.5 (u_{i,j} - |u_{i,j}|)$$

(analogous for $v$)

The system of three coupled partial differential equations is thus solved by explicitly discretizing the domain in each dimension and in time. The used discretization schemes are explicit Euler in time, and an upwinding scheme in space. The only boundary condition is no-normal-flow (i.e., $u=0$ on the western and easter boundaries, and $v=0$ in the north and south). To ensure stability, the layer height is artificially smoothed in every time step:

$$ h^*_{i,j} = (1-\varepsilon)~h_{i,j} + \frac{\varepsilon}{4} (h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}) $$

with a smoothing coefficient $\varepsilon$.

Set-up

Let’s import all the packages we’ll need:

In [1]:
%matplotlib inline
%load_ext cython
import numpy as np
import numba
import matplotlib.pyplot as plt
import seaborn as sns
sns.set("talk")

from matplotlib import animation
plt.rcParams['animation.html'] = 'html5'

Now, we create the numerical grid ($100 \times 100$ cells) and define some constants and the initial conditions (a Gaussian bump of water in the center of the domain):

In [2]:
yy, xx = np.mgrid[-50:50:100j,-100:100:100j].astype(float)
dx = xx[0,1] - xx[0,0]
dy = yy[1,0] - yy[0,0]

d = 10
g = 9.81
dt = 0.5 * min(dx,dy) / np.sqrt(g*d)
t1 = 10
eps = 0.1

h0 = d + np.exp(-(xx**2 + yy**2)/20)
u0 = np.zeros_like(xx)
v0 = np.zeros_like(xx)

Naive Python implementation

This implementation uses explicit loops over the domain, and checks explicitly for boundary cells:

In [3]:
def iterate_py(h,u,v):
    h_new = np.zeros_like(h)
    u_new = np.zeros_like(u)
    v_new = np.zeros_like(v)
    
    n1, n2 = h.shape
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                u_new[i,j] = 0
            else:
                u_new[i,j] = u[i,j] - g*dt/dx * (h[i,j+1] - h[i,j])
            if i == n1-1:
                v_new[i,j] = 0
            else:
                v_new[i,j] = v[i,j] - g*dt/dy * (h[i+1,j] - h[i,j])
                
    u_plus = .5 * (u_new + np.abs(u_new))
    u_minus = .5 * (u_new - np.abs(u_new))
    v_plus = .5 * (v_new + np.abs(v_new))
    v_minus = .5 * (v_new - np.abs(v_new))
    
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                f_e = 0
            else:
                f_e = u_plus[i,j] * h[i,j] + u_minus[i,j] * h[i,j+1]
            if i == n1-1:
                f_n = 0
            else:
                f_n = v_plus[i,j] * h[i,j] + v_minus[i,j] * h[i+1,j]
            if j == 0:
                f_w = 0
            else:
                f_w = u_plus[i,j-1] * h[i,j-1] + u_minus[i,j-1] * h[i,j]
            if i == 0:
                f_s = 0
            else:
                f_s = v_plus[i-1,j] * h[i-1,j] + v_minus[i-1,j] * h[i,j]           
            h_new[i,j] = h[i,j] - dt/dx*(f_e - f_w) - dt/dy*(f_n - f_s)
    
    h_filter = np.zeros_like(h_new)
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                h_e = h_new[i,j-1]
            else:
                h_e = h_new[i,j+1]
            if i == n1-1:
                h_n = h_new[i-1,j]
            else:
                h_n = h_new[i+1,j]
            if j == 0:
                h_w = h_new[i,j+1]
            else:
                h_w = h_new[i,j-1]
            if i == 0:
                h_s = h_new[i+1,j]
            else:
                h_s = h_new[i-1,j]
            h_filter[i,j] = (1-eps) * h_new[i,j] + .25*eps*(h_e+h_n+h_w+h_s)
            
    return h_filter, u_new, v_new

Let’s test how long the solution takes, and animate the solution:

In [4]:
def solve_shallow_water(iterate_fun,t1=t1):
    t = 0
    h = h0
    u = u0
    v = v0
    sol = []
    while t < t1:
        h, u, v = iterate_fun(h,u,v)
        sol.append(h)
        t += dt
    return sol
In [5]:
%timeit solve_shallow_water(iterate_py)
1 loop, best of 3: 9.95 s per loop
In [6]:
def animate_shallow_water(sol):
    fig = plt.figure(figsize=(8,6))
    ax = plt.gca()
    ax.set_aspect("equal")
    ax.set_xlim((xx.min(),xx.max()))
    ax.set_ylim((yy.min(),yy.max()))
    cs = plt.pcolormesh(xx,yy,sol[0],vmin=9.9,vmax=10.1,cmap="RdBu_r")
    plt.colorbar(cs,orientation="horizontal")
    plt.close()

    def animate(i):
        ax.set_title("Layer height at t = {:.1f}s".format(dt*i), y=1.1)
        cs.set_array(sol[i][:-1,:-1].flatten())
        return (cs,)

    return animation.FuncAnimation(fig, animate, frames=len(sol), interval=20, blit=True)

sol = solve_shallow_water(iterate_py)
animate_shallow_water(sol)
Out[6]:

The solution looks nice, but a runtime of 10 s is just too slow for such a short simulation time.

Using Numba

Numba is a quite recent project by Continuum Analytics (the developers of Anaconda Python. Its main module is a just-in-time compiler (jit) that aims at making math-heavy Python code more efficient by compiling it to machine instructions before execution. Using numba is very simplpe; just apply the jit decorator to the function you want to get compiled. In this case, the function code is exactly the same as before:

In [7]:
@numba.jit
def iterate_numba(h,u,v):
    h_new = np.zeros_like(h)
    u_new = np.zeros_like(u)
    v_new = np.zeros_like(v)
    
    n1, n2 = h.shape
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                u_new[i,j] = 0
            else:
                u_new[i,j] = u[i,j] - g*dt/dx * (h[i,j+1] - h[i,j])
            if i == n1-1:
                v_new[i,j] = 0
            else:
                v_new[i,j] = v[i,j] - g*dt/dy * (h[i+1,j] - h[i,j])
                
    u_plus = .5 * (u_new + np.abs(u_new))
    u_minus = .5 * (u_new - np.abs(u_new))
    v_plus = .5 * (v_new + np.abs(v_new))
    v_minus = .5 * (v_new - np.abs(v_new))
    
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                f_e = 0
            else:
                f_e = u_plus[i,j] * h[i,j] + u_minus[i,j] * h[i,j+1]
            if i == n1-1:
                f_n = 0
            else:
                f_n = v_plus[i,j] * h[i,j] + v_minus[i,j] * h[i+1,j]
            if j == 0:
                f_w = 0
            else:
                f_w = u_plus[i,j-1] * h[i,j-1] + u_minus[i,j-1] * h[i,j]
            if i == 0:
                f_s = 0
            else:
                f_s = v_plus[i-1,j] * h[i-1,j] + v_minus[i-1,j] * h[i,j]           
            h_new[i,j] = h[i,j] - dt/dx*(f_e - f_w) - dt/dy*(f_n - f_s)
    
    h_filter = np.zeros_like(h_new)
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                h_e = h_new[i,j-1]
            else:
                h_e = h_new[i,j+1]
            if i == n1-1:
                h_n = h_new[i-1,j]
            else:
                h_n = h_new[i+1,j]
            if j == 0:
                h_w = h_new[i,j+1]
            else:
                h_w = h_new[i,j-1]
            if i == 0:
                h_s = h_new[i+1,j]
            else:
                h_s = h_new[i-1,j]
            h_filter[i,j] = (1-eps) * h_new[i,j] + .25*eps*(h_e+h_n+h_w+h_s)
            
    return h_filter, u_new, v_new

Let’s see how it performs:

In [8]:
%timeit solve_shallow_water(iterate_numba)
The slowest run took 29.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 27.1 ms per loop

Wow! A speedup by a factor of about 400, just by applying a decorator to the function. Note how the first call to the function takes much longer than subsequent calls, because Numba has to compile it first.

Even though the speedup is substantial, using Numba still has a few drawbacks. First of all, in more complicated functions, I often received compilation errors - Numba still only supports a subset of the Python and NumPy capabilities - and speedups are not always that dramatic. And second of all, in order for Numba to do its magic, it is necssary to write your function in a verbose, exlicit manner. This might seem natural when coming from a C or Fortran background, but denies the access to the elegant side of Python.

Using Cython

Cython is a project that allows the user to add static types to Python variables, which enables efficient compilation of the typed parts of your script to C. When using Cython from a Jupyter notebook, you can make use of the convenient %%cython magic, which marks that the respective code cell should be compiled with Cython. An implementation could look like this:

In [9]:
%%cython -c=-O3
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
cpdef iterate_cython(np.ndarray[np.float_t,ndim=2] h,
                     np.ndarray[np.float_t,ndim=2] u,
                     np.ndarray[np.float_t,ndim=2] v,
                     double g, double dx, double dy, double dt, double eps):
    cdef np.ndarray[np.float_t,ndim=2] h_new = np.zeros_like(h)
    cdef np.ndarray[np.float_t,ndim=2] u_new = np.zeros_like(u)
    cdef np.ndarray[np.float_t,ndim=2] v_new = np.zeros_like(v)
    cdef np.ndarray[np.float_t,ndim=2] u_plus, u_minus, v_plus, v_minus
    cdef np.ndarray[np.float_t,ndim=2] h_filter
    cdef np.float_t f_e, f_n, f_w, f_s
    cdef np.float_t h_e, h_n, h_w, h_s
    cdef np.int_t i, j, n1, n2
    
    n1, n2 = h.shape[0], h.shape[1]
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                u_new[i,j] = 0
            else:
                u_new[i,j] = u[i,j] - g*dt/dx * (h[i,j+1] - h[i,j])
            if i == n1-1:
                v_new[i,j] = 0
            else:
                v_new[i,j] = v[i,j] - g*dt/dy * (h[i+1,j] - h[i,j])
                
    u_plus = .5 * (u_new + np.abs(u_new))
    u_minus = .5 * (u_new - np.abs(u_new))
    v_plus = .5 * (v_new + np.abs(v_new))
    v_minus = .5 * (v_new - np.abs(v_new))
    
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                f_e = 0
            else:
                f_e = u_plus[i,j] * h[i,j] + u_minus[i,j] * h[i,j+1]
            if i == n1-1:
                f_n = 0
            else:
                f_n = v_plus[i,j] * h[i,j] + v_minus[i,j] * h[i+1,j]
            if j == 0:
                f_w = 0
            else:
                f_w = u_plus[i,j-1] * h[i,j-1] + u_minus[i,j-1] * h[i,j]
            if i == 0:
                f_s = 0
            else:
                f_s = v_plus[i-1,j] * h[i-1,j] + v_minus[i-1,j] * h[i,j]           
            h_new[i,j] = h[i,j] - dt/dx*(f_e - f_w) - dt/dy*(f_n - f_s)
    
    h_filter = np.zeros_like(h_new)
    for i in range(n1):
        for j in range(n2):
            if j == n2-1:
                h_e = h_new[i,j-1]
            else:
                h_e = h_new[i,j+1]
            if i == n1-1:
                h_n = h_new[i-1,j]
            else:
                h_n = h_new[i+1,j]
            if j == 0:
                h_w = h_new[i,j+1]
            else:
                h_w = h_new[i,j-1]
            if i == 0:
                h_s = h_new[i+1,j]
            else:
                h_s = h_new[i-1,j]
            h_filter[i,j] = (1-eps) * h_new[i,j] + .25*eps*(h_e+h_n+h_w+h_s)
            
    return h_filter, u_new, v_new
In [10]:
iterate_cython_wrapper = lambda h, u, v: iterate_cython(h,u,v,g,dx,dy,dt,eps)
%timeit solve_shallow_water(iterate_cython_wrapper)
10 loops, best of 3: 46.6 ms per loop

So, in this case, Cython is in fact a bit slower than Numba, and requires more changes to the code. However, Cython is generally applicable to more cases than Numba.

Using NumPy

In order to showcase how a more elegant and quite efficient solution could look, consider the following implementation with NumPy. It makes heavy use of array slicing and padding to enforce the boundary conditions, which is fast with NumPy. It also avoids all explicit loops, so the overall code turns out to be much shorter and more readable:

In [11]:
def iterate_numpy(h,u,v):
    h_pad = np.pad(h,1,"edge")
    
    u_new = u - g*dt/dx * (h_pad[1:-1,2:] - h)
    v_new = v - g*dt/dy * (h_pad[2:,1:-1] - h)
    
    u_new_pad = np.pad(u_new,1,"constant")
    v_new_pad = np.pad(v_new,1,"constant")
                
    u_plus = .5 * (u_new_pad + np.abs(u_new_pad))
    u_minus = .5 * (u_new_pad - np.abs(u_new_pad))
    v_plus = .5 * (v_new_pad + np.abs(v_new_pad))
    v_minus = .5 * (v_new_pad - np.abs(v_new_pad))
    
    f_e = u_plus[1:-1,1:-1] * h + u_minus[1:-1,1:-1] * h_pad[1:-1,2:]
    f_n = v_plus[1:-1,1:-1] * h + v_minus[1:-1,1:-1] * h_pad[2:,1:-1]
    f_w = u_plus[1:-1,:-2] * h_pad[1:-1,:-2] + u_minus[1:-1,:-2] * h
    f_s = v_plus[:-2,1:-1] * h_pad[:-2,1:-1] + v_minus[:-2,1:-1] * h
         
    h_new = h - dt/dx*(f_e - f_w) - dt/dy*(f_n - f_s)
    h_new_pad = np.pad(h,1,"reflect")
    h_filter = (1-eps) * h_new + .25*eps*(h_new_pad[:-2,1:-1]+h_new_pad[1:-1,:-2]+\
                                          h_new_pad[2:,1:-1]+h_new_pad[1:-1,2:])
    return h_filter, u_new, v_new
In [12]:
%timeit solve_shallow_water(iterate_numpy)
10 loops, best of 3: 105 ms per loop

So, this implementation is still about 100 times faster than pure Python, while being short and elegant, which is why we all love Python, right?

Using NumPy + Cython

Often, performance can be enhanced even further when using both NumPy and Cython:

In [13]:
%%cython -c=-O3
import numpy as np
cimport numpy as np

cpdef iterate_numcy(np.ndarray[np.float_t,ndim=2] h
                    , np.ndarray[np.float_t,ndim=2] u
                    , np.ndarray[np.float_t,ndim=2] v
                    , double g, double dx, double dy, double dt, double eps):
    cdef np.ndarray[np.float_t,ndim=2] h_new, u_new, v_new
    cdef np.ndarray[np.float_t,ndim=2] h_pad, u_new_pad, v_new_pad, h_new_pad
    cdef np.ndarray[np.float_t,ndim=2] h_filter
    cdef np.ndarray[np.float_t,ndim=2] f_e, f_n, f_w, f_s
    
    h_pad = np.pad(h,1,"edge")
    
    u_new = u - g*dt/dx * (h_pad[1:-1,2:] - h)
    v_new = v - g*dt/dy * (h_pad[2:,1:-1] - h)
    
    u_new_pad = np.pad(u_new,1,"constant")
    v_new_pad = np.pad(v_new,1,"constant")
                
    u_plus = .5 * (u_new_pad + np.abs(u_new_pad))
    u_minus = .5 * (u_new_pad - np.abs(u_new_pad))
    v_plus = .5 * (v_new_pad + np.abs(v_new_pad))
    v_minus = .5 * (v_new_pad - np.abs(v_new_pad))
    
    f_e = u_plus[1:-1,1:-1] * h + u_minus[1:-1,1:-1] * h_pad[1:-1,2:]
    f_n = v_plus[1:-1,1:-1] * h + v_minus[1:-1,1:-1] * h_pad[2:,1:-1]
    f_w = u_plus[1:-1,:-2] * h_pad[1:-1,:-2] + u_minus[1:-1,:-2] * h
    f_s = v_plus[:-2,1:-1] * h_pad[:-2,1:-1] + v_minus[:-2,1:-1] * h
         
    h_new = h - dt/dx*(f_e - f_w) - dt/dy*(f_n - f_s)
    h_new_pad = np.pad(h,1,"reflect")
    h_filter = (1-eps) * h_new + .25*eps*(h_new_pad[:-2,1:-1]+h_new_pad[1:-1,:-2]+\
                                          h_new_pad[2:,1:-1]+h_new_pad[1:-1,2:])
    return h_filter, u_new, v_new
In [14]:
iterate_numcy_wrapper = lambda h, u, v: iterate_numcy(h,u,v,g,dx,dy,dt,eps)
%timeit solve_shallow_water(iterate_numcy_wrapper)
10 loops, best of 3: 110 ms per loop

In this case, it seems that Cython does not grant an additional speedup compared to the NumPy implementation (I guess because all expensive computations happen inside NumPy anyway).

Conclusion

Comparison between all implementations:

Implementation Runtime (s)
Python 9.95
NumPy + Cython 0.110
NumPy 0.105
Cython 0.046
Numba 0.027

Numba delivered the best performance on this problem, while still being easy to use. The (in my opinion) most elegant code is the implementation in pure NumPy (which is, by the way, also the most portable). I would thus recommend starting from there, and only use the big guns (Numba / Cython) when needed.

Oh, also, never use explicit loops over arrays in pure Python.


Before we close - here’s a longer animation of the shallow-water model, made with the Numba implementation. Enjoy!

In [15]:
sol = solve_shallow_water(iterate_numba,t1=50)
animate_shallow_water(sol)
Out[15]: