Parallelism and Performance¶
This document describes how metrust achieves its performance characteristics: the tiered execution model, the use of Rayon parallel iterators throughout the Rust core, the GIL-release strategy at the Python boundary, and the algorithmic choices that matter more than raw thread count.
1. The Three Performance Tiers¶
When benchmarking metrust against MetPy, it is important to understand what is actually being compared. There are three distinct tiers of execution cost:
| Tier | Description | Overhead |
|---|---|---|
| T1 | Raw Rust FFI -- call the `wx-math` or `metrust::calc` function directly from Rust with `&[f64]` slices. No Pint, no Python, no NumPy. | None |
| T2 | metrust + Pint wrappers -- the user-facing Python API. NumPy arrays enter through PyO3, Pint unit objects are attached to the result. | PyO3 array conversion + Pint wrapper |
| T3 | MetPy + Pint -- the baseline. Pure-Python loops or NumPy vectorized ops, every intermediate result wrapped in Pint Quantity objects. | Full Python + Pint |
**Why T2 vs T3 is the fair comparison.** Users of both libraries work with unit-aware arrays. Comparing T1 against T3 would overstate the advantage because it eliminates the Pint/NumPy overhead that real users of metrust also pay. T2 vs T3 is apples-to-apples: both accept and return Pint Quantities backed by NumPy arrays, both validate units, and both produce identical scientific results. The difference is that T2's inner loop runs compiled Rust with Rayon parallelism instead of interpreted Python.
2. Rayon Parallel Iterators¶
Rayon is a data-parallelism library for Rust.
Replacing .iter() with .par_iter() (or .into_par_iter()) is often
the only source-level change needed to distribute work across all available
cores. Rayon uses a work-stealing thread pool, so load balancing is
automatic even when per-element work varies (e.g. wet-bulb iteration
converges faster for some inputs than others).
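Rayon's `par_iter` hides all of the thread management. As a rough illustration of what it replaces (a sketch only -- Rayon's real implementation adds work stealing, splitting heuristics, and a shared thread pool), a hand-rolled chunked map over std threads looks like this:

```rust
use std::thread;

/// Apply `f` to every element of `input`, splitting the slice into one
/// contiguous chunk per worker thread. `par_iter().map()` does this
/// automatically, and rebalances uneven chunks via work stealing.
fn parallel_map(input: &[f64], n_threads: usize, f: fn(f64) -> f64) -> Vec<f64> {
    let chunk_len = (input.len() + n_threads - 1) / n_threads;
    let mut out = vec![0.0; input.len()];
    thread::scope(|s| {
        // Scoped threads may borrow `input` and `out`; all joins happen
        // before `scope` returns.
        for (in_chunk, out_chunk) in input.chunks(chunk_len).zip(out.chunks_mut(chunk_len)) {
            s.spawn(move || {
                for (x, y) in in_chunk.iter().zip(out_chunk.iter_mut()) {
                    *y = f(*x);
                }
            });
        }
    });
    out
}
```

With static chunking like this, one slow chunk (e.g. slowly converging wet-bulb inputs) stalls the whole call; Rayon's work stealing is what removes that failure mode.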
2.1 The 28 Thermo Array Bindings (src/py_thermo.rs)¶
The file `src/py_thermo.rs` contains 28 `*_array` functions that follow a uniform pattern:
```rust
fn potential_temperature_array<'py>(
    py: Python<'py>,
    pressure: PyReadonlyArray1<f64>,
    temperature: PyReadonlyArray1<f64>,
) -> PyResult<Bound<'py, PyArray1<f64>>> {
    let p = pressure.as_slice()?;
    let t = temperature.as_slice()?;
    let result: Vec<f64> = py.allow_threads(|| {
        p.par_iter()
            .zip(t.par_iter())
            .map(|(&p, &t)| metrust::calc::thermo::potential_temperature(p, t))
            .collect()
    });
    Ok(result.into_pyarray(py))
}
```
Every array binding:

- Extracts zero-copy `&[f64]` slices from the incoming NumPy arrays.
- Releases the Python GIL with `py.allow_threads()` (see section 3).
- Maps a scalar Rust function over the input slices using `par_iter().zip()`.
- Collects the results into a `Vec<f64>` and converts back to a NumPy array.
The 28 functions cover: `potential_temperature`, `equivalent_potential_temperature`, `saturation_vapor_pressure`, `saturation_mixing_ratio`, `wet_bulb_temperature`, `virtual_temp`, `dewpoint_from_rh`, `rh_from_dewpoint`, `mixing_ratio`, `vapor_pressure`, `density`, `exner_function`, `mixing_ratio_from_relative_humidity`, `relative_humidity_from_mixing_ratio`, `relative_humidity_from_specific_humidity`, `specific_humidity_from_dewpoint`, `dewpoint_from_specific_humidity`, `specific_humidity`, `mixing_ratio_from_specific_humidity`, `specific_humidity_from_mixing_ratio`, `dewpoint` (from vapor pressure), `virtual_potential_temperature`, `temperature_from_potential_temperature`, `lcl_pressure`, `saturation_equivalent_potential_temperature`, and more.
2.2 Wind Functions (crates/wx-math/src/dynamics.rs)¶
The dynamics module in wx-math parallelizes at the row level for gradient
computations and at the element level for derived quantities:
- Gradient operators (`gradient_x`, `gradient_y`, `laplacian`): each row of the output grid is computed independently via `(0..ny).into_par_iter()`. A row's x-gradient reads only from that row's data, making it embarrassingly parallel.
- Derived fields (`divergence`, `vorticity`, `stretching_deformation`, `shearing_deformation`, `total_deformation`, `absolute_vorticity`, `ageostrophic_wind`, `shear_vorticity`, `absolute_momentum`, `kinematic_flux`, `wind_speed`, `wind_direction`, `wind_components`): these use `.par_iter().zip()` over the already-computed gradient arrays, parallelizing the element-wise combination.
- Advection and frontogenesis: computed from gradient arrays that are themselves parallel, then combined element-wise with parallel iterators.
- Q-vector convergence: parallel element-wise combination of Q1/Q2 gradient results.
2.3 Grid Composites (crates/wx-math/src/composite.rs)¶
The composite module is where parallelism has the largest impact. Functions
like compute_cape_cin, compute_srh, compute_shear, and
compute_lapse_rate operate on 3D model grids (shape [nz][ny][nx]) and
produce 2D output fields. Each output grid point is an independent
sounding analysis:
```rust
let results: Vec<(f64, f64, f64, f64)> = (0..n2d)
    .into_par_iter()
    .map(|idx| {
        let j = idx / nx;
        let i = idx % nx;
        // Extract vertical column, run CAPE/CIN solver ...
    })
    .collect();
```
This pattern appears 11 times in `composite.rs`:

- `compute_cape_cin` -- CAPE, CIN, LCL, LFC for every column
- `compute_srh` -- Storm Relative Helicity with Bunkers motion
- `compute_shear` -- Bulk wind shear between two height levels
- `compute_lapse_rate` -- Environmental lapse rate between height layers
- Additional composite parameters (STP, EHI, SCP use simpler element-wise loops since they operate on pre-computed 2D fields)
2.4 Smoothing Functions (crates/metrust/src/calc/smooth.rs)¶
All smoothing filters parallelize by row using `par_chunks_mut`:

```rust
// Gaussian: pass 1 (smooth along rows)
temp.par_chunks_mut(nx).enumerate().for_each(|(j, row)| {
    for i in 0..nx { /* convolve row j */ }
});

// Gaussian: pass 2 (smooth along columns)
out.par_chunks_mut(nx).enumerate().for_each(|(j, row)| {
    for i in 0..nx { /* convolve column through row j */ }
});
```
This applies to:

- `smooth_gaussian` -- separable Gaussian (two parallel passes)
- `smooth_rectangular` -- SAT-based box filter (parallel lookup phase)
- `smooth_circular` -- disk filter (parallel over interior rows)
- `smooth_n_point` / `smooth_window` -- generic kernel convolution (parallel over interior rows)
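The separable trick itself can be shown in miniature. The sketch below (illustrative, not the metrust source: normalized kernel, edge-clamped 1-D pass) is what each row of pass 1 runs; pass 2 runs the same convolution down each column.

```rust
/// Build a normalized 1-D Gaussian kernel of half-width r.
fn gaussian_kernel(sigma: f64, r: usize) -> Vec<f64> {
    let mut k: Vec<f64> = (0..=2 * r)
        .map(|i| {
            let x = i as f64 - r as f64;
            (-x * x / (2.0 * sigma * sigma)).exp()
        })
        .collect();
    let sum: f64 = k.iter().sum();
    for w in &mut k {
        *w /= sum; // normalize so a constant field passes through unchanged
    }
    k
}

/// One edge-clamped 1-D convolution pass along a row.
fn convolve_row(row: &[f64], kernel: &[f64]) -> Vec<f64> {
    let r = kernel.len() / 2;
    (0..row.len())
        .map(|i| {
            kernel
                .iter()
                .enumerate()
                .map(|(k, w)| {
                    // index i + k - r, clamped to the valid range
                    let idx = (i + k).saturating_sub(r).min(row.len() - 1);
                    w * row[idx]
                })
                .sum::<f64>()
        })
        .collect()
}
```

Two such 1-D passes cost O(nx * ny * kernel_width) total, versus O(nx * ny * kernel_width^2) for a direct 2-D convolution.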
2.5 Gradient Functions (Row-Parallel)¶
The wx-math versions of `gradient_x`, `gradient_y`, and `laplacian` parallelize by row:

```rust
let rows: Vec<Vec<f64>> = (0..ny).into_par_iter().map(|j| {
    let mut row = vec![0.0; nx];
    for i in 0..nx {
        // centered/forward/backward difference for this row
    }
    row
}).collect();
```
Each row is allocated and computed independently, then flattened into the output vector. This avoids any synchronization between threads.
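A sequential sketch of one row body makes the independence claim concrete. This is illustrative code, not the wx-math source: uniform grid spacing `dx` is assumed, and `nx >= 2`.

```rust
/// x-gradient of a row-major ny x nx grid with uniform spacing dx.
/// Each output row reads only the matching input row, which is why the
/// outer row loop is safe to hand to into_par_iter().
fn gradient_x(data: &[f64], ny: usize, nx: usize, dx: f64) -> Vec<f64> {
    let mut out = vec![0.0; ny * nx];
    for j in 0..ny {
        let row = &data[j * nx..(j + 1) * nx];
        let out_row = &mut out[j * nx..(j + 1) * nx];
        out_row[0] = (row[1] - row[0]) / dx; // forward difference at left edge
        out_row[nx - 1] = (row[nx - 1] - row[nx - 2]) / dx; // backward at right edge
        for i in 1..nx - 1 {
            out_row[i] = (row[i + 1] - row[i - 1]) / (2.0 * dx); // centered
        }
    }
    out
}
```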
3. py.allow_threads() -- Releasing the Python GIL¶
CPython's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously. By default, when Python calls into a PyO3 extension, the GIL is held for the duration of the call.
Every array function in py_thermo.rs wraps the Rayon computation in `py.allow_threads(|| { ... })`. This:

- Releases the GIL before entering the Rayon parallel region.
- Allows other Python threads (e.g. an asyncio event loop, a GUI thread, or another concurrent metrust call) to run while the Rust computation proceeds on all available cores.
- Re-acquires the GIL when the closure returns, before converting the result back to a NumPy array.
Without `allow_threads`, Rayon would still use multiple OS threads internally, but any other Python thread would be blocked until the Rust call completed. With it, metrust computations are fully concurrent with the rest of the Python process.
4. The SAT Algorithm¶
The `smooth_rectangular` function uses a Summed Area Table (SAT) to achieve O(1) per-pixel box filtering regardless of kernel size.
How it works¶
**Phase 1: Build the SAT (sequential).**

A padded (ny+1) x (nx+1) array is filled so that `SAT[j][i]` contains the sum of all values in the rectangle `[0..j-1, 0..i-1]` of the input grid:

```rust
sat_val[pidx] = val
    + sat_val[(pj - 1) * pnx + pi]
    + sat_val[pj * pnx + (pi - 1)]
    - sat_val[(pj - 1) * pnx + (pi - 1)];
```
A companion `sat_nan` table tracks NaN counts the same way. This phase is inherently sequential because each cell depends on its top and left neighbors.
**Phase 2: Parallel lookup.**

For each interior point (j, i), the sum over the size x size window is computed with four SAT lookups:

```rust
let sum = sat_val[br] - sat_val[tr] - sat_val[bl] + sat_val[tl];
let count = (y2 - y1 + 1) * (x2 - x1 + 1);
row[i] = sum / count as f64;
```

This phase is parallelized by row with `par_chunks_mut`.
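Putting the two phases together, a minimal end-to-end sketch (edge-clamped window, no NaN bookkeeping, and sequential where the real code parallelizes phase 2 by row) looks like:

```rust
/// Box-filter a row-major ny x nx grid with a (2r+1) x (2r+1) window,
/// clamped at the grid edges.
fn box_filter(data: &[f64], ny: usize, nx: usize, r: usize) -> Vec<f64> {
    let (pny, pnx) = (ny + 1, nx + 1);
    let mut sat = vec![0.0; pny * pnx];
    // Phase 1 (sequential): sat[(j+1, i+1)] = sum of data[0..=j, 0..=i].
    for j in 0..ny {
        for i in 0..nx {
            sat[(j + 1) * pnx + (i + 1)] = data[j * nx + i]
                + sat[j * pnx + (i + 1)]
                + sat[(j + 1) * pnx + i]
                - sat[j * pnx + i];
        }
    }
    // Phase 2: four lookups per output point. This row loop is the part
    // the real implementation hands to par_chunks_mut.
    let mut out = vec![0.0; ny * nx];
    for j in 0..ny {
        let (y1, y2) = (j.saturating_sub(r), (j + r).min(ny - 1));
        for i in 0..nx {
            let (x1, x2) = (i.saturating_sub(r), (i + r).min(nx - 1));
            let sum = sat[(y2 + 1) * pnx + (x2 + 1)]
                - sat[y1 * pnx + (x2 + 1)]
                - sat[(y2 + 1) * pnx + x1]
                + sat[y1 * pnx + x1];
            out[j * nx + i] = sum / ((y2 - y1 + 1) * (x2 - x1 + 1)) as f64;
        }
    }
    out
}
```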
Why this beats parallelizing the naive approach¶
The naive box filter is O(size^2) per pixel. For size = 21, that is 441 additions per pixel. Even with Rayon distributing the work across 32 cores, each core still does 441 ops/pixel.
The SAT approach does exactly 4 additions and 1 division per pixel,
regardless of kernel size. The sequential SAT build is O(n) where n is the
total number of grid points, but it is a single pass with excellent cache
locality. For a 1000x1000 grid with size = 21:
- Naive parallel: ~441M ops (spread across cores)
- SAT: ~1M ops for the build + ~4M ops for the lookup = ~5M total
The SAT is algorithmically superior by nearly two orders of magnitude, and the lookup phase still parallelizes across rows.
5. Grid Composite Parallelism¶
Functions like compute_cape_cin and compute_srh iterate over every
horizontal grid column in a 3D model grid. For a typical HRRR domain
(1799 x 1059 = ~1.9M grid points) or a WRF domain (e.g., 600 x 900 =
540K points), each column requires:
- Extracting a vertical profile (40-60 levels)
- Running an iterative sounding analysis (moist adiabat integration, buoyancy integration for CAPE, Bunkers storm motion for SRH)
Each column is completely independent -- no data is shared between columns.
This makes it trivially parallelizable with into_par_iter().
Scaling: On a 32-core Ryzen, compute_cape_cin processes approximately
540K columns/sec. A 600x900 grid (540K columns) completes in about
1 second. A 200x200 subset (40K columns) runs in ~74ms.
The per-column work is highly variable: columns with zero CAPE skip the integration loop entirely, while columns with complex thermodynamic profiles require many iteration steps. Rayon's work-stealing scheduler handles this imbalance automatically.
6. Barnes/IDW Parallelism¶
The inverse_distance_to_points function in wx-math/interpolate.rs
implements Barnes, Cressman, and IDW interpolation using a brute-force
approach: for each target grid point, it iterates over all observation
stations, computes the distance, and accumulates the weighted sum.
**Why brute-force beats KD-tree for small station counts:**
Typical surface observation networks have fewer than 10,000 stations (METAR/ASOS in the CONUS: ~2,000; global SYNOP: ~8,000). At these sizes:
- A KD-tree adds O(n log n) build cost plus per-query overhead for tree traversal, pointer chasing, and cache misses.
- Brute-force is O(n * m) where n = stations and m = grid points, but the inner loop is a tight sequence of multiply-add operations with perfect sequential memory access.
- For n < 10,000 and m ~ 2M, the station array stays cache-resident for the whole scan (~80KB for 10K stations x 8 bytes -- larger than a typical 32KB L1, but comfortably inside L2), so the inner loop almost never stalls on main memory.
The outer loop over grid points is parallelizable (each grid point is
independent), and when combined with Rayon this yields excellent throughput:
1.9M grid points interpolated from ~2,000 stations in 1.7 seconds
(compared to scipy's griddata at ~726 seconds for the equivalent
operation).
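A sketch of the per-grid-point inner loop for the plain IDW case (the `idw_at` helper name is hypothetical; the real `inverse_distance_to_points` also implements Barnes and Cressman weights, and Rayon parallelizes the outer loop over grid points):

```rust
/// Inverse-distance-squared estimate at (gx, gy) from all stations.
/// Each station is (x, y, value). Brute force: one tight multiply-add
/// pass over the whole station array, in sequential memory order.
fn idw_at(gx: f64, gy: f64, stations: &[(f64, f64, f64)]) -> f64 {
    let (mut num, mut den) = (0.0, 0.0);
    for &(sx, sy, v) in stations {
        let d2 = (sx - gx).powi(2) + (sy - gy).powi(2);
        if d2 < 1e-12 {
            return v; // grid point coincides with a station
        }
        let w = 1.0 / d2; // inverse-distance-squared weight
        num += w * v;
        den += w;
    }
    num / den
}
```

The outer loop (one `idw_at` call per grid point) has no shared mutable state, which is what makes it a clean `into_par_iter()` target.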
7. What Doesn't Parallelize Well¶
Not every function benefits from Rayon. Several categories of operations see limited or no speedup from threading:
Memory-bandwidth-bound operations¶
wind_speed computes sqrt(u^2 + v^2) per element. This is roughly
3 FLOPs per element, but requires reading 16 bytes (two f64s) and writing
8 bytes. On modern hardware, a single core can saturate the memory bus
for such simple operations. Adding more cores just creates contention for
the same memory bandwidth. (The wx-math version still uses par_iter
because for very large grids, the NUMA topology of multi-socket systems
can benefit from parallel reads.)
Already-fast operations limited by cache performance¶
Gaussian smoothing with a separable kernel is O(nx * ny * kernel_width),
and with small sigma values the kernel width is small (e.g., 9 points for
sigma=1). The bottleneck is the column pass (pass 2), which accesses
memory with stride nx -- poor cache locality. Adding more threads does
not help because each thread's column accesses compete for the same cache
lines. The separable decomposition itself is the optimization; beyond
that, further speedup requires cache-oblivious algorithms or tiling.
Serial dependencies¶
Some computations are inherently sequential:

- SAT build: Each cell depends on its top and left neighbors, so the prefix-sum scan must proceed in order. (The lookup phase is parallel.)
- Moist adiabatic lapse rate (`moist_lapse`): Each pressure level's temperature depends on the result at the previous level because the moist adiabat is integrated step-by-step (the lapse rate depends on the current temperature, which is the result of the previous step). A single column cannot be parallelized. However, multiple columns can be processed in parallel (as in `compute_cape_cin`).
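The shape of that serial loop can be sketched with a placeholder lapse function (`lapse` below is a stand-in, not the real moist-adiabatic formula): each explicit Euler step consumes the temperature produced by the previous one, so there is no iteration-level parallelism to extract.

```rust
/// Integrate temperature along a pressure profile with explicit Euler
/// steps. `lapse(t, p)` stands in for the moist-adiabatic dT/dp; the
/// loop-carried dependency on `t` is what makes one column serial.
fn integrate_profile(t0: f64, pressures: &[f64], lapse: fn(f64, f64) -> f64) -> Vec<f64> {
    let mut temps = Vec::with_capacity(pressures.len());
    let mut t = t0;
    temps.push(t);
    for w in pressures.windows(2) {
        let dp = w[1] - w[0];
        t += lapse(t, w[0]) * dp; // depends on t from the previous iteration
        temps.push(t);
    }
    temps
}
```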
8. Memory Layout¶
All grids in metrust use row-major (C-order) layout, matching NumPy's default. The indexing convention is `data[j * nx + i]`, where j is the row (y-index) and i is the column (x-index). This means:
- Consecutive elements in a row are contiguous in memory. The row pass in separable Gaussian smoothing accesses memory sequentially, achieving full cache-line utilization (8 f64 values per 64-byte cache line).
- Consecutive elements in a column are strided by `nx`. The column pass in separable Gaussian smoothing jumps `nx * 8` bytes between accesses. For a 1000-column grid, this is an 8KB stride -- large enough to miss L1 cache on every access for small kernels.
- Grid composites access columns of a 3D array: `data[k * ny * nx + j * nx + i]` extracts the vertical profile at grid point `(j, i)`. The stride between vertical levels is `ny * nx`, which for a 1000x1000 grid is 8MB per level -- fully non-local in cache. This is why `extract_column` copies the column into a contiguous `Vec` before processing.
Cache-line alignment matters most for the column pass. When Rayon assigns
rows to threads, adjacent rows (assigned to the same thread) share cache
lines for column access, improving prefetch hit rates. The par_chunks_mut
pattern naturally groups adjacent rows.
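A minimal version of the column copy described above (illustrative; the real `extract_column` signature may differ):

```rust
/// Copy the vertical profile at horizontal point (j, i) out of a
/// row-major [nz][ny][nx] grid into a contiguous Vec. The source reads
/// are strided by ny * nx elements; paying for this copy once makes the
/// solver's many subsequent passes over the column cache-friendly.
fn extract_column(data: &[f64], nz: usize, ny: usize, nx: usize, j: usize, i: usize) -> Vec<f64> {
    (0..nz).map(|k| data[k * ny * nx + j * nx + i]).collect()
}
```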
9. Benchmarks¶
Real numbers on a 32-core AMD Ryzen Threadripper (64 threads with SMT):
Element-wise thermodynamic functions (1M elements)¶
| Function | Time | Throughput |
|---|---|---|
| `potential_temperature` | 1.8 ms | 550 M elements/sec |
| `equivalent_potential_temperature` | 3.2 ms | 310 M elements/sec |
| `wet_bulb_temperature` | 7.3 ms | 137 M elements/sec |
| `saturation_vapor_pressure` | 1.1 ms | 900 M elements/sec |
`wet_bulb_temperature` is slower because it uses an iterative Newton-Raphson
solver internally -- each element requires 3-8 iterations depending on the
input values. Despite this, Rayon distributes the variable-cost work
effectively via work stealing.
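The variable per-element cost comes from the early-exit pattern typical of Newton solvers. A generic sketch (toy target function in the test, not the actual wet-bulb system):

```rust
/// Newton-Raphson with early exit on convergence. Returns the root and
/// the number of iterations used; the iteration count depends on the
/// starting point, which is why per-element cost varies across an array.
fn newton(f: fn(f64) -> f64, df: fn(f64) -> f64, x0: f64, tol: f64, max_iter: usize) -> (f64, usize) {
    let mut x = x0;
    for k in 0..max_iter {
        let fx = f(x);
        if fx.abs() < tol {
            return (x, k); // early exit: converged
        }
        x -= fx / df(x);
    }
    (x, max_iter)
}
```

When such a solver runs inside `par_iter().map()`, elements that converge in 3 iterations finish early and their threads steal work from neighbors still iterating.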
Grid composites¶
| Computation | Grid size | Time |
|---|---|---|
| CAPE/CIN (surface-based) | 200 x 200 (40K columns) | 74 ms |
| SRH (0-3 km) | 200 x 200 | 18 ms |
| Bulk shear (0-6 km) | 200 x 200 | 12 ms |
Smoothing (1000 x 1000 grid)¶
| Filter | Parameters | metrust | MetPy | Speedup |
|---|---|---|---|---|
| Rectangular | size=21, 1 pass | 3.6 ms | 267 ms | 74x |
| Gaussian | sigma=3 | 8.2 ms | 180 ms | 22x |
| 9-point | 1 pass | 2.1 ms | 45 ms | 21x |
The rectangular filter's outsized speedup comes from the SAT algorithm (O(1) per pixel vs O(size^2)), not just parallelism.
Barnes interpolation¶
| Scenario | metrust | scipy | Speedup |
|---|---|---|---|
| 2,000 stations to 1.9M grid points | 1.7 s | 726 s | 427x |
10. The GPU Question¶
A common question is whether these workloads would benefit from GPU acceleration (CUDA, OpenCL, Metal). The answer is generally no, for three reasons:
Branching in iterative solvers¶
Wet-bulb temperature, moist adiabatic lapse rate, LCL finding, and CAPE integration all use iterative solvers with data-dependent branching (early exit on convergence, conditional switches between dry and moist regimes). GPU architectures use SIMT (Single Instruction, Multiple Thread) execution where threads in a warp must execute the same instruction. Divergent branches cause warp serialization, dramatically reducing throughput.
Memory-bandwidth-bound simple operations¶
Simple operations like potential_temperature and wind_speed are already
memory-bandwidth-bound on CPU. GPU global memory bandwidth is higher
(~900 GB/s vs ~50 GB/s for DDR5), but the data must first be transferred
from CPU to GPU memory over PCIe (~32 GB/s). For a 1M-element array
(8 MB), the PCIe transfer alone takes ~0.25 ms -- comparable to the entire
CPU computation time. The transfer overhead eliminates any bandwidth
advantage for typical meteorological array sizes.
f64 requirement for scientific accuracy¶
Meteorological calculations require f64 (double precision) throughout. Pressure ranges from 1013 hPa to 0.1 hPa, temperature differences in CAPE integration can be as small as 0.01 K, and iterative solvers accumulate rounding errors over many steps. Consumer GPUs (NVIDIA GeForce, AMD Radeon) have severely reduced f64 throughput -- typically 1/32 of their f32 rate. Only datacenter GPUs (A100, H100) offer full-rate f64, and these are not available on typical meteorologist workstations.
The practical conclusion: A 32-core CPU running Rayon with compiled Rust code is the optimal hardware target for these workloads. The computation is fast enough that the dominant cost in a real workflow is I/O (downloading GRIB2 data, reading from disk), not number-crunching.