Why Bayesian MCMC Parallelization Works on CPUs but Struggles on GPUs

Christopher Peters · Bayesian Computing · March 2026

Why Does Stan Parallelize Well on CPUs but Not GPUs?

Because Stan's within-chain MCMC workload is branch-heavy, memory-irregular, and driven by reverse-mode autodiff, it aligns naturally with CPU architecture and poorly with GPU architecture. Stan's reduce_sum works well on CPUs because CPU threads handle irregular control flow and shared-memory reductions efficiently. GPUs excel when many lanes execute the same operations with predictable memory access, which is not how HMC and NUTS typically behave inside a single chain.

CPU vs GPU for Stan-Style MCMC
Property                             | CPU                            | GPU
-------------------------------------+--------------------------------+---------------------------
Branch-heavy likelihood code         | Good fit                       | Warp divergence
Reverse-mode autodiff memory access  | Handles irregular access well  | Poorly coalesced
Single-chain speedup                 | Often strong                   | Usually limited
Best Bayesian use case               | Stan + reduce_sum              | Many parallel chains / VI
Key framing: This is not mainly an implementation gap. It is an architectural mismatch between how Stan computes gradients inside HMC and what GPUs are designed to accelerate.

Background

If you've ever profiled a complex Stan model, say a multivariate stochastic volatility model with Johnson SU innovations and leverage effects, you know where the time goes. Posterior sampling is dominated by repeated likelihood and gradient evaluation. Stan's reduce_sum helps by splitting that work across CPU threads inside a single chain, and for many models that gives a real wall-clock improvement.

That naturally raises the next question: if eight CPU threads help, why not thousands of GPU cores? GPUs dominate deep learning workloads and can deliver extraordinary throughput. So it is tempting to assume that a Stan likelihood should also become dramatically faster on a GPU.

The catch is that MCMC with reverse-mode autodiff is not a neural-network training workload. The internal structure of the computation matters, and that structure favors CPUs much more than GPUs.

What reduce_sum Actually Does

Stan's reduce_sum partitions your log-likelihood across observations or groups of observations. Each CPU thread evaluates a partial log-likelihood and returns a scalar contribution, which Stan sums into the total target density. This is within-chain parallelization: one Markov chain, many CPU cores.

// jsu_lpdf is a user-defined Johnson SU log-density, not a Stan built-in
real partial_sum(array[] real slice,
                 int start, int end,
                 vector mu, vector sigma, vector nu, vector tau) {
  real lp = 0;
  // slice is indexed from 1, while the shared arguments keep their
  // global index n in start:end
  for (n in start:end)
    lp += jsu_lpdf(slice[n - start + 1] | mu[n], sigma[n], nu[n], tau[n]);
  return lp;
}

// In the model block:
target += reduce_sum(partial_sum, y, grainsize,
                     mu, sigma, nu, tau);

The reason this works well on CPUs is simple. Threads share memory, parameter passing is cheap, reductions are cheap, and the chunk count maps naturally to the number of available cores. In the right model, eight cores can reduce runtime substantially without rewriting the inference algorithm itself.
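The pattern is easy to sketch outside Stan. The following Python toy (all names here are illustrative, not Stan's internals) shows the same structure, assuming a normal log-density in place of the JSU one: each worker computes a scalar partial sum over a chunk, and the chunks are reduced into one total.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def partial_logpdf(y_chunk, mu, sigma):
    # each worker evaluates a partial normal log-density and returns a scalar
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (y - m) ** 2 / (2 * s * s)
               for y, m, s in zip(y_chunk, mu, sigma))

def reduce_sum_like(y, mu, sigma, grainsize=2):
    # split observations into chunks of roughly `grainsize`, as reduce_sum does
    chunks = [(y[i:i + grainsize], mu[i:i + grainsize], sigma[i:i + grainsize])
              for i in range(0, len(y), grainsize)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda c: partial_logpdf(*c), chunks)
    return sum(parts)  # reduction into the total target density

y = [0.1, -0.3, 0.7, 1.2]
lp = reduce_sum_like(y, mu=[0.0] * 4, sigma=[1.0] * 4)
```

Because each chunk returns a single scalar, the cross-thread communication is trivial, which is a large part of why the CPU version scales so cleanly.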

Why CPUs Are a Natural Fit

Each chunk of a Stan likelihood often contains special functions like log, sinh, and Phi, numerical stability guards, constraint logic, and reverse-mode autodiff bookkeeping. Those computations are not just arithmetic. They involve branches, dynamic memory use, and irregular tape traversal.

CPUs are built for exactly that kind of work. They have strong branch prediction, large caches, low-latency shared memory, and cores that can execute complex control flow independently. That matters because the bottleneck in many Stan models is not raw floating-point throughput in isolation. It is the total cost of evaluating a messy, branchy, gradient-producing likelihood over and over.

CPU advantage: Each thread is doing similar high-level work, but internally it may take different branches, allocate different autodiff structures, and traverse memory irregularly. CPUs tolerate that well.

The GPU Architectural Mismatch

GPUs are optimized for throughput when many lanes do the same thing at the same time. Their execution model depends on groups of threads moving in lockstep. When all of those threads follow the same path through the code and read nearby memory locations, performance is excellent. When they diverge, performance falls apart quickly.

1. Autodiff Tape Construction Fights GPU Memory

Stan uses reverse-mode automatic differentiation. During the forward pass it builds a computational graph, then during the backward pass it walks that structure in reverse to compute gradients. That process creates irregular memory access patterns and uses dynamic allocation in ways that GPUs generally dislike.

GPU memory systems reward coalesced access: nearby threads reading nearby addresses. But autodiff tape traversal is often the opposite: each thread chases its own path through a dynamically constructed graph. That means more stalls, weaker cache behavior, and poor effective throughput.
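To make the pointer chasing concrete, here is a deliberately minimal reverse-mode autodiff sketch in Python (not Stan's actual tape, which replays operations in reverse recording order): the forward pass builds a graph of nodes, and the backward pass follows parent pointers through dynamically allocated objects.

```python
class Var:
    # minimal reverse-mode node: a value plus parent edges recorded at runtime
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

def backward(out):
    # walk the dynamically built graph in reverse: pointer chasing through
    # scattered heap objects, not the contiguous, coalesced reads GPUs want
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.grad += node.grad * local
            stack.append(parent)

x, y = Var(2.0), Var(3.0)
z = x * y + x   # forward pass records the graph
backward(z)     # reverse pass: dz/dx = y + 1 = 4, dz/dy = x = 2
```

Every `Var` lives wherever the allocator put it, and the backward walk visits them in graph order rather than memory order. That access pattern is exactly what defeats coalescing.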

2. Branch Divergence Kills Warp Efficiency

Stan's math code is full of branching for valid parameter regions, numerical safety, and special-case behavior. A JSU log-density is a good example: it combines nonlinear transforms, guards, and special functions whose execution paths can vary across observations and parameter values.

On a CPU, that can be manageable. On a GPU, one divergent lane can force the entire warp to serially execute both branches. Stack enough of those branches inside a gradient evaluation and the theoretical advantage of many cores shrinks quickly.
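A toy cost model makes the penalty visible. The sketch below (an illustration of the lockstep execution rule, not a GPU simulator) counts the operations a 32-lane warp performs: if any lane takes the "then" path and any lane takes the "else" path, the warp pays for both.

```python
def warp_execute(values):
    # lanes move in lockstep: a data-dependent branch makes the warp
    # execute BOTH paths, masking out the inactive lanes on each
    ops = 0
    mask = [v > 0 for v in values]
    if any(mask):        # "then" path runs for the whole warp
        ops += len(values)
    if not all(mask):    # "else" path also runs if any lane diverged
        ops += len(values)
    return ops

uniform   = warp_execute([1.0] * 32)         # one path: 32 ops
divergent = warp_execute([1.0, -1.0] * 16)   # both paths: 64 ops
```

One divergent branch doubles the work; several nested guards inside a log-density compound multiplicatively.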

3. HMC and NUTS Are Sequential at the Outer Level

Even if you could perfectly parallelize one likelihood evaluation, the sampler itself is still adaptive and sequential. HMC and NUTS run leapfrog steps, build trajectory trees, and make U-turn decisions dynamically. Each step depends on the result of earlier steps. That outer control flow is not something you can simply fan out across GPU threads.

for each NUTS iteration:
  build trajectory adaptively
  for each leapfrog step:
    evaluate gradient
    update position and momentum
  decide whether to keep expanding or stop

So the GPU opportunity is limited to pieces of the gradient calculation, not the full sampling loop. That sharply limits how much a single Stan chain can benefit.
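The inner leapfrog loop shows the dependency chain most clearly. Here is a minimal Python leapfrog integrator for a standard-normal target (the potential, step size, and step count are illustrative): each update consumes the state produced by the previous one, so the loop itself cannot be fanned out across threads; only the gradient call inside it can.

```python
def grad_U(q):
    # gradient of U(q) = q^2 / 2, the negative log-density of a standard normal
    return q

def leapfrog(q, p, eps, n_steps):
    # every line depends on the line before it: inherently sequential
    p = p - 0.5 * eps * grad_U(q)      # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                # full position step
        p = p - eps * grad_U(q)        # full momentum step
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)      # final half step for momentum
    return q, p

q1, p1 = leapfrog(q=1.0, p=0.5, eps=0.01, n_steps=100)
```

A quick sanity check is that the Hamiltonian H = (q² + p²) / 2 is nearly conserved across the trajectory, which is what makes HMC proposals accepted at high rates.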

Where GPUs Do Work Well for Bayesian Inference

This does not mean GPUs are useless for Bayesian computation. It means they shine when the algorithm is reformulated around GPU strengths rather than forcing Stan-style within-chain MCMC onto them.

Between-chain parallelism is the clearest example. Frameworks like NumPyro on JAX can run many chains simultaneously with a traced, compiled computation graph. That gives high throughput across chains, even if it does not dramatically accelerate the mixing time of one chain.
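The shape of that workload is easy to sketch with NumPy (a vectorized random-walk Metropolis toy, not NumPyro's actual implementation): every chain performs identical operations each iteration, and the accept/reject "branch" becomes a branch-free masked select, which is exactly the pattern accelerators handle well.

```python
import numpy as np

def run_chains(n_chains=1024, n_steps=500, step=0.5, seed=0):
    # every chain does the SAME operations each iteration, so the whole
    # batch vectorizes cleanly across chains
    rng = np.random.default_rng(seed)
    q = rng.normal(size=n_chains)                 # one state per chain
    for _ in range(n_steps):
        prop = q + step * rng.normal(size=n_chains)
        log_accept = 0.5 * (q ** 2 - prop ** 2)   # standard-normal target
        accept = np.log(rng.uniform(size=n_chains)) < log_accept
        q = np.where(accept, prop, q)             # branch-free masked update
    return q

samples = run_chains()
```

Swap NumPy for a traced, compiled array framework and the same structure runs on a GPU: the parallelism is across chains, not within one chain's trajectory.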

Variational inference is another good fit because it turns posterior approximation into an optimization problem with stable graph structure and batch-friendly gradients. Simulation-based inference methods also work well when the GPU-heavy part is neural density estimation rather than classical MCMC.

Important distinction: GPU Bayesian workflows often improve throughput across many chains or replace sampling with optimization. That is different from making one Stan chain dramatically faster.

Practical Implications

If you are running a computationally heavy Stan model today, CPU threading with reduce_sum is often close to the most practical hardware acceleration you can get without changing frameworks. That is not a disappointing compromise. It reflects a genuine fit between the workload and the hardware.

If you truly need GPU acceleration, the realistic path is usually one of three options: port the model to a JAX-based system like NumPyro, move to variational inference, or restructure the problem around simulation-based inference. Each option changes not just the hardware target, but the inferential workflow itself.

Stan still retains major advantages: mature diagnostics, battle-tested HMC behavior, and a modeling language built around careful statistical computation rather than just accelerator compatibility. That tradeoff is often worth preserving.

Key Takeaway: reduce_sum works on CPUs because Stan's MCMC workload is branch-heavy, memory-irregular, and sequentially controlled. GPUs can still be excellent for Bayesian inference, but usually only when the method changes to match GPU architecture instead of trying to force Stan's single-chain execution model onto it.
