Quantile Regression
Beyond the mean: Modeling conditional quantiles for robust regression and heteroskedastic data.
Introduction
In traditional linear regression, we attempt to model the conditional mean by assuming that errors are normally distributed with zero mean and constant variance.
Unfortunately, real-world data rarely satisfy these assumptions. Data are often heteroskedastic, contaminated with outliers, or drawn from skewed or heavy-tailed distributions. In these cases, linear regression may not be the best model to use.
Quantile regression is a useful alternative in these situations. Instead of the conditional mean, it models the conditional median (or any other quantile) of the response variable, and by fitting several quantiles it describes the whole conditional distribution rather than a single summary of it.
The Pinball Loss Function
The difference between quantile regression and linear regression lies solely in the loss function. In quantile regression, a pinball loss function is used, defined for the residual $u = y - \hat{y}$ as:

$$\rho_\tau(u) = u\,\bigl(\tau - \mathbb{1}\{u < 0\}\bigr)$$

or equivalently:

$$\rho_\tau(u) = \begin{cases} \tau\, u & \text{if } u \ge 0 \\ (\tau - 1)\, u & \text{if } u < 0 \end{cases}$$
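The pinball loss is straightforward to implement directly. Below is a minimal NumPy sketch (the function name `pinball_loss` is our own):

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Mean pinball loss at quantile level tau."""
    u = np.asarray(y) - np.asarray(y_hat)  # residuals
    # tau * u when u >= 0 (under-prediction), (tau - 1) * u when u < 0
    return np.mean(np.maximum(tau * u, (tau - 1) * u))
```

Note that for $\tau = 0.5$ this reduces to half the mean absolute error, which is why median regression and least-absolute-deviations regression coincide.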
Interactive Visualization
Drag the slider to change the quantile ($\tau$) and see how the loss function changes shape. Notice how $\tau=0.5$ yields the symmetric absolute error (median regression), while extreme values penalize over/under-predictions asymmetrically.
Intuition: If we want to model the 0.25 quantile (a lower quantile), the loss function penalizes points lying below the line more heavily (weight $1 - \tau = 0.75$) than points above it (weight $\tau = 0.25$). Optimization balances these weights so that approximately 25% of points fall below the line.
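To make this concrete, here is a small sketch (our own construction) that minimizes the mean pinball loss for $\tau = 0.25$ over a grid of constant predictions; the minimizer lands at the sample 25th percentile, with about a quarter of the points below it:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=2_000)
tau = 0.25

# grid of candidate constant predictions
grid = np.linspace(-3, 3, 601)
losses = [np.mean(np.maximum(tau * (y - g), (tau - 1) * (y - g))) for g in grid]
best = grid[np.argmin(losses)]

print(best, np.quantile(y, tau))  # the two agree up to grid resolution
print(np.mean(y < best))          # roughly 0.25
```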
Examples: When to use Quantile Regression
1. Heteroskedasticity
Notice how quantile regression accurately captures the heteroskedasticity by estimating conditional quantiles separately. The spread of the quantiles increases as X increases.
# Generate data with heteroskedasticity
import numpy as np

np.random.seed(42)
n = 1000
X = np.random.uniform(0, 10, n)
y = 2 * X + X * np.random.normal(0, 0.5, n)  # noise standard deviation grows with X
2. Skewed Distribution (Log-Normal)
Notice how quantile regression accurately captures the skewness by estimating each conditional quantile individually, unlike mean-based intervals, which are forced to be symmetric around the fitted line.
# Generate skewed errors from log-normal distribution
skewed_errors = np.random.lognormal(mean=0, sigma=0.8, size=n)
skewed_errors = skewed_errors - np.mean(skewed_errors) # Center errors
y = 2 * X + skewed_errors
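The asymmetry is easy to verify on the error distribution itself: for centered log-normal errors, the upper quantiles sit much farther from the median than the lower ones (a small sketch of our own):

```python
import numpy as np

rng = np.random.default_rng(42)
errors = rng.lognormal(mean=0, sigma=0.8, size=100_000)
errors -= errors.mean()  # center errors, as above

q10, q50, q90 = np.quantile(errors, [0.1, 0.5, 0.9])
upper_spread = q90 - q50  # distance from median to upper tail
lower_spread = q50 - q10  # distance from median to lower tail
print(upper_spread / lower_spread)  # well above 1: the right tail is much longer
```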
3. Heavy-Tailed Distribution (Cauchy)
The Cauchy distribution has undefined mean and variance. Linear regression (least squares) breaks down under the extreme outliers, but quantile regression, which is robust to outliers, remains stable.
# Generate errors from Cauchy distribution
cauchy_errors = np.random.standard_cauchy(size=n) * 2
y = 2 * X + cauchy_errors
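The underlying reason is visible even without regression: across repeated Cauchy samples, the sample mean jumps around wildly while the sample median stays put (our own illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
means, medians = [], []
for _ in range(200):
    sample = rng.standard_cauchy(1_000) * 2
    means.append(sample.mean())        # no population mean exists: unstable
    medians.append(np.median(sample))  # median is well-defined and stable

print(np.std(medians))  # small
print(np.std(means))    # much larger
```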
Prediction Interval vs Confidence Interval
We compare the prediction interval derived from linear regression with the corresponding interval formed by conditional quantiles estimated with quantile regression. Quantile regression provides a more robust estimate of the conditional distribution.
Note that confidence intervals (CI) estimate the range for the mean response, whereas prediction intervals (PI) predict the range for a single, new observation.
Formula for prediction interval:

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}}$$

Formula for confidence interval:

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}}$$
The difference is the "+1" inside the square root. The prediction interval is always wider, as it accounts for the variance of the error term $\epsilon$ in addition to the uncertainty in the coefficient estimates.
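The comparison can be computed directly for simple linear regression. The sketch below (simulated data, scipy for the t critical value) builds both intervals at a point $x_0$ and confirms the prediction interval is wider:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2 * x + rng.normal(0, 1, n)

# ordinary least squares by hand
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))  # residual std error

x0 = 5.0
leverage = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, n - 2)

ci_half = t_crit * s * np.sqrt(leverage)      # confidence interval half-width
pi_half = t_crit * s * np.sqrt(1 + leverage)  # prediction interval: note the +1
print(ci_half, pi_half)
```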
Derivation of the Quantile Loss Function
We want to find a prediction $\hat{y}$ that minimizes the expected loss for a given quantile $\tau \in (0,1)$.
Step 1: Define asymmetric absolute loss
Penalize under- and over-predictions with different positive weights $c_1$ and $c_2$:

$$L(y, \hat{y}) = \begin{cases} c_1\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\ c_2\,(\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$
Step 2: Express expected loss
With $f$ the density of $Y$:

$$\mathbb{E}[L] = c_1 \int_{\hat{y}}^{\infty} (y - \hat{y})\, f(y)\, dy + c_2 \int_{-\infty}^{\hat{y}} (\hat{y} - y)\, f(y)\, dy$$
Step 3: Take derivative with respect to $\hat{y}$
Using Leibniz's rule for differentiation under the integral sign (the boundary terms vanish because both integrands are zero at $y = \hat{y}$):

$$\frac{d}{d\hat{y}}\,\mathbb{E}[L] = -c_1 \int_{\hat{y}}^{\infty} f(y)\, dy + c_2 \int_{-\infty}^{\hat{y}} f(y)\, dy$$

This simplifies to:

$$\frac{d}{d\hat{y}}\,\mathbb{E}[L] = -c_1\,\bigl(1 - F(\hat{y})\bigr) + c_2\, F(\hat{y})$$

where $F$ is the CDF of $Y$.
Step 4: Set derivative to zero for optimal prediction

$$-c_1\,\bigl(1 - F(\hat{y})\bigr) + c_2\, F(\hat{y}) = 0 \quad\Longrightarrow\quad F(\hat{y}) = \frac{c_1}{c_1 + c_2}$$
Step 5: Set this equal to the desired quantile
For $\hat{y}$ to be the $\tau$-quantile, we require:

$$\frac{c_1}{c_1 + c_2} = \tau$$
Step 6: Choose convenient constants
A convenient normalization is to set $c_1 + c_2 = 1$, which gives:

$$c_1 = \tau, \qquad c_2 = 1 - \tau$$
Step 7: The resulting loss function
Substituting these constants back into our original loss function:

$$L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\ (1 - \tau)\,(\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$

which is exactly the pinball loss $\rho_\tau(y - \hat{y})$.
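As a sanity check on the derivation, we can minimize the expected pinball loss by Monte Carlo for a distribution whose quantiles we know in closed form; for $\mathrm{Exp}(1)$ the $\tau$-quantile is $-\ln(1-\tau)$ (our own verification sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.9
y = rng.exponential(1.0, size=50_000)

# grid-minimize the Monte Carlo estimate of E[pinball loss]
grid = np.linspace(0.0, 5.0, 501)
losses = [np.mean(np.maximum(tau * (y - g), (tau - 1) * (y - g))) for g in grid]
best = grid[np.argmin(losses)]

print(best, -np.log(1 - tau))  # both close to the theoretical 0.9-quantile
```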
Implementation in Python
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

# Add constant to predictor variables
X_with_const = sm.add_constant(X)

# Fit quantile regression for multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
model = QuantReg(y, X_with_const)

for tau in quantiles:
    # Fit the model at quantile tau
    result = model.fit(q=tau)

    # Make predictions
    predictions = result.predict(X_with_const)

    # Print summary
    print(result.summary())