The No Free Lunch theorem reminds us that no single model dominates across all possible data-generating processes. In practice, the question isn't "which model is best?" but "which model is best-matched to the structure in my data?"
This post demonstrates three complementary results with a fully reproducible R reprex:
- Case 1: When the data has describable group structure, a Bayesian hierarchical model (BHM) produces lower out-of-sample prediction error.
- Case 2: When the signal is complex and unstructured, a Bayesian neural network (BNN) wins.
- Case 3: Their ensemble is competitive across both scenarios — hedging against model misspecification.
The key insight: for problems where structure can be described, Bayesian hierarchical models will tend to produce lower out-of-sample prediction error, because that structure constitutes significant prior information. When you can't describe the structure, flexible learners like BNNs earn their keep — and ensembling the two provides a robust hedge.
We evaluate models using the Continuous Ranked Probability Score (CRPS), a strictly proper scoring rule that assesses both calibration and sharpness of the full predictive distribution. Lower CRPS is better.
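For a predictive CDF F and a realized observation y, the CRPS is defined as:

```latex
\mathrm{CRPS}(F, y) \;=\; \int_{-\infty}^{\infty} \bigl( F(z) - \mathbf{1}\{ y \le z \} \bigr)^{2} \, dz
```

For sample-based forecasts such as MCMC draws, this integral is estimated from the empirical CDF of the draws, which is what scoringRules::crps_sample() computes by default.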
Setup
library(cmdstanr)
library(posterior)
library(scoringRules)
library(dplyr)
library(tibble)
library(ggplot2)
set.seed(42)
We compile two Stan models — a hierarchical linear model (group-varying intercepts and slopes) and a small Bayesian neural network (one hidden layer with tanh activations, no group structure).
The Two Models
Bayesian Hierarchical Model
Group-varying intercept + slope, partial pooling through a shared hyperprior. This is the standard mixed-effects specification:
parameters {
  real alpha_0;                 // population intercept
  real beta_0;                  // population slope
  real<lower=0> sigma_alpha;    // sd of group intercepts
  real<lower=0> sigma_beta;     // sd of group slopes
  vector[J] alpha_raw;          // group deviations (non-centered)
  vector[J] beta_raw;
  real<lower=0> sigma;          // residual sd
}
transformed parameters {
  vector[J] alpha = alpha_0 + sigma_alpha * alpha_raw;
  vector[J] beta = beta_0 + sigma_beta * beta_raw;
}
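The raw/transformed split is the non-centered parameterization: it is algebraically identical to drawing the group effects directly from the population distribution, but gives HMC an easier geometry when groups are only weakly informed:

```latex
\alpha^{\mathrm{raw}}_{j} \sim \mathcal{N}(0, 1), \qquad
\alpha_{j} = \alpha_{0} + \sigma_{\alpha}\, \alpha^{\mathrm{raw}}_{j}
\;\Longleftrightarrow\;
\alpha_{j} \sim \mathcal{N}\bigl(\alpha_{0},\, \sigma_{\alpha}^{2}\bigr),
```

and analogously for the slopes beta_j.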
Bayesian Neural Network
One hidden layer with H = 8 units, tanh activation, no group structure at all. Everything is learned from the signal in x:
parameters {
  vector[H] w1;           // input→hidden weights
  vector[H] b1;           // hidden biases
  vector[H] w2;           // hidden→output weights
  real b2;                // output bias
  real<lower=0> sigma;    // residual sd
}
model {
  vector[N] mu;           // per-observation mean
  ...
  for (n in 1:N) {
    vector[H] h = tanh(w1 * x[n] + b1);
    mu[n] = b2 + dot_product(w2, h);
  }
  y ~ normal(mu, sigma);
}
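Written out, the network's mean function is a linear combination of H tanh basis functions of x:

```latex
\mu(x) \;=\; b_{2} + \sum_{h=1}^{H} w_{2,h}\, \tanh\bigl( w_{1,h}\, x + b_{1,h} \bigr).
```

By the universal approximation property, a single hidden layer of this form can approximate any continuous function on a bounded interval arbitrarily well, given enough hidden units.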
Case 1: Grouped Linear DGP — BHM Wins
We simulate data from exactly the generative model that the BHM encodes: J = 10 groups, each with its own intercept and slope, drawn from a population distribution:
J <- 10; n_per <- 30; N1 <- J * n_per
alpha_true <- rnorm(J, 0, 1.5)
beta_true <- rnorm(J, 2, 0.8)
d1 <- tibble(
  g = rep(1:J, each = n_per),
  x = runif(N1, -2, 2)
) %>%
  mutate(
    mu = alpha_true[g] + beta_true[g] * x,
    y = rnorm(n(), mu, 0.5)
  )
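In notation, the Case 1 generative process is:

```latex
\alpha_{j} \sim \mathcal{N}(0,\, 1.5^{2}), \quad
\beta_{j} \sim \mathcal{N}(2,\, 0.8^{2}), \quad
y_{ij} \sim \mathcal{N}\bigl( \alpha_{j} + \beta_{j}\, x_{ij},\; 0.5^{2} \bigr),
\quad j = 1, \dots, 10.
```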
The BHM is well-specified: it matches the group structure and the linear functional form. The BNN has no access to group labels — it must learn everything from x alone. With 10 groups worth of systematic heterogeneity in both intercept and slope, the BNN is structurally handicapped.
Case 2: Complex Nonlinear DGP — BNN Wins
Now we flip the script. The true signal is a sum of sines — no group structure at all:
N2 <- 300  # sample size (assumed equal to Case 1's N1; not defined in the excerpt)
d2 <- tibble(
  g = rep(1:J, length.out = N2),
  x = runif(N2, -3, 3)
) %>%
  mutate(
    mu = 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
    y = rnorm(n(), mu, 0.4)
  )
The group labels are arbitrary — there's no group-level heterogeneity in the DGP. The signal is entirely nonlinear in x. The BHM fits group-varying linear functions, which fundamentally cannot represent sin and cos. The BNN's tanh hidden layer can approximate this smooth nonlinearity.
Case 3: Ensemble — Robust Across Both DGPs
The ensemble mixes posterior predictive draws 50/50 from each model:
# Pool posterior predictive draws: half from BHM, half from BNN
n_draws <- ncol(yrep_hier)
ens_idx <- sample(c(TRUE, FALSE), n_draws, replace = TRUE)
yrep_ens <- yrep_hier
yrep_ens[, !ens_idx] <- yrep_bnn[, !ens_idx]
This is equal-weight model averaging: a fixed 50/50 mixture, rather than Bayesian model averaging proper, which would weight the models by marginal likelihood or stacking weights. The ensemble's predictive distribution is a mixture of the two models' posterior predictive distributions.
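To see that the column swap really yields a 50/50 mixture, here is a minimal self-contained sketch with synthetic draw matrices (the yrep_* names mirror the reprex; the values here are toys, not model output):

```r
set.seed(42)
n_test <- 5; n_draws <- 1000
yrep_hier <- matrix(rnorm(n_test * n_draws, mean = 0), n_test, n_draws)  # toy "BHM" draws
yrep_bnn  <- matrix(rnorm(n_test * n_draws, mean = 3), n_test, n_draws)  # toy "BNN" draws
ens_idx <- sample(c(TRUE, FALSE), n_draws, replace = TRUE)  # one coin flip per draw
yrep_ens <- yrep_hier
yrep_ens[, !ens_idx] <- yrep_bnn[, !ens_idx]
rowMeans(yrep_ens)  # each row mean sits near the mixture mean (0 + 3) / 2 = 1.5
```

Because each posterior draw is kept from one model with probability one half, every test point's predictive distribution becomes the equal-weight mixture of the two models' predictives.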
Summary: CRPS Comparison
Lower CRPS indicates better out-of-sample predictive performance.
# CRPS results (lower = better out-of-sample prediction):
#
# DGP BHM BNN Ensemble
# Grouped linear 0.31★ 1.07 0.49
# Complex nonlinear 0.89 0.21★ 0.40
#
# ★ = lowest CRPS (best out-of-sample predictions)
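Each table entry is an average of per-observation CRPS values over the held-out test set. With a matrix of posterior predictive draws in hand (rows = test points, columns = draws), scoringRules makes this a one-liner; the inputs below are toys standing in for the fitted models' output:

```r
library(scoringRules)
set.seed(42)
n_test <- 50; n_draws <- 2000
y_test <- rnorm(n_test)                                     # toy held-out observations
yrep   <- matrix(rnorm(n_test * n_draws), n_test, n_draws)  # toy predictive draw matrix
mean(crps_sample(y_test, yrep))  # average CRPS across the test set
```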
The Takeaway
- When you can describe the structure — groups, hierarchies, known functional forms — encode it in a Bayesian hierarchical model. That structure is information, and information reduces prediction error.
- When the structure is unknown or too complex to specify — use a flexible model like a BNN. Better to learn the function than to misspecify it.
- When you're uncertain which regime you're in — ensemble. You pay a small premium over the best specialist, but you're insured against the worst case.
This is particularly relevant in portfolio optimization and financial modeling, where some aspects of the data (sector effects, seasonal patterns, cross-asset correlations) have known structure worth encoding, while other aspects (regime changes, tail events, nonlinear interactions) resist explicit specification. The practitioner's job is to match model complexity to problem structure — and to ensemble where structure is ambiguous.
Full Reprex
The complete, self-contained R script is available at
bhm_bnn_ensemble_reprex.R.
It requires cmdstanr, scoringRules, and standard tidyverse packages. Runtime is approximately 2–3 minutes on a modern laptop.