MLE & Fisher Information
Understanding Maximum Likelihood Estimation and Fisher Information: the foundation of statistical inference and the mathematical framework that connects data to parameter estimation.
The Likelihood Function
Imagine we toss a coin and observe H, H, H, T, H. What is the probability of heads? A common approach is to take the sample mean (0.8), but is that optimal, and does it generalize to other distributions or to multiple parameters?
Maximum Likelihood Estimation provides a general framework. The key insight is that the underlying parameter \(\theta\) is fixed (the frequentist view), so the only randomness comes from the data-generating process. The likelihood of observing data \(x_1, \dots, x_n\) is

\[ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) \]

And the log-likelihood function:

\[ \ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta) \]
All data points are assumed to be i.i.d. (independent and identically distributed).
Coin Tossing Example
For Bernoulli trials (coin flips), the probability mass function is:

\[ f(x; \theta) = \theta^{x} (1 - \theta)^{1 - x}, \quad x \in \{0, 1\} \]

where \(\theta = P(\text{Heads})\). The likelihood intuitively asks: "If \(P(\text{Heads}) = \theta\), what's the probability of observing our data?" For the sequence H, H, H, T, H this gives \(L(\theta) = \theta^{4}(1-\theta)\).
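As a quick numerical sketch of this idea (the grid search below is an illustrative choice, not part of the original text), we can evaluate the Bernoulli likelihood of the H, H, H, T, H sequence across candidate values of θ:

```python
import numpy as np

# Observed coin flips: H = 1, T = 0 (the H, H, H, T, H sequence)
data = np.array([1, 1, 1, 0, 1])

def likelihood(theta, data):
    """L(theta) = product of theta^x * (1-theta)^(1-x) over the data."""
    return np.prod(theta**data * (1 - theta)**(1 - data))

# Evaluate the likelihood on a grid of theta values and locate the maximum
grid = np.linspace(0.001, 0.999, 999)
values = np.array([likelihood(t, data) for t in grid])
theta_hat = grid[np.argmax(values)]
print(theta_hat)  # 0.8 -- the sample mean, as the text anticipated
```

The grid maximizer lands on 0.8, matching the intuitive sample-mean answer from the introduction.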
Maximum Likelihood Estimation
MLE finds the parameter value that maximizes the likelihood of observing our data:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \ell(\theta) \]

This can be found by setting the derivative to zero or by numerical optimization. The log-likelihood is preferred because:
- The product becomes a sum (easier mathematically)
- It has the same maximizer as the likelihood (log is monotone increasing)
- Its derivatives are simpler
- It is numerically stable (a product of many small probabilities underflows)
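The numerical-stability point is easy to demonstrate. In this sketch (the simulated data set and its size are hypothetical choices), the raw product of a few thousand Bernoulli probabilities underflows in double precision while the log-likelihood stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.8
data = rng.binomial(1, theta, size=2000)  # 2000 Bernoulli(0.8) draws

# Raw likelihood: a product of 2000 factors, each at most 0.8
raw_product = np.prod(theta**data * (1 - theta)**(1 - data))

# Log-likelihood: a sum of 2000 moderate negative terms
log_sum = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

print(raw_product)  # 0.0 -- underflows in float64
print(log_sum)      # finite and perfectly usable for optimization
```

The product is on the order of \(e^{-1000}\), far below the smallest representable double, so it collapses to exactly zero; the log-sum is around \(-1000\) and poses no problem.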
Visualizing MLE with Normal Distribution
Consider estimating the mean μ of a normal distribution with known variance σ² = 4. We have 10 data points from N(μ=5, σ=2).
The MLE finds the parameter value that makes our observed data most probable. For normal distributions, this coincides with the sample mean.
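A minimal sketch of this setup (the random seed and the grid search are illustrative assumptions): draw 10 points from N(5, 2²), maximize the log-likelihood in μ, and confirm the maximizer agrees with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10)  # 10 points from N(5, 2^2)

def log_likelihood(mu, data, sigma=2.0):
    """Normal log-likelihood in mu, dropping constants that don't involve mu."""
    return -np.sum((data - mu) ** 2) / (2 * sigma**2)

# Maximize over a fine grid of candidate means
grid = np.linspace(0.0, 10.0, 10001)
mu_hat = grid[np.argmax([log_likelihood(m, data) for m in grid])]
print(mu_hat, data.mean())  # the grid maximizer matches the sample mean
```

Since the log-likelihood is a concave quadratic in μ, the maximizer is the sample mean regardless of σ², which is why the known-variance assumption does not change the estimate.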
Fisher Information
Fisher Information measures how much information our data carries about an unknown parameter:

\[ I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \ln f(X; \theta)\right)^{2}\right] = -E\left[\frac{\partial^{2}}{\partial \theta^{2}} \ln f(X; \theta)\right] \]

The score function \(\frac{\partial}{\partial \theta} \ln f(X;\theta)\) indicates whether θ should be larger or smaller. Fisher Information is the variance of the score, since \(E[\text{score}] = 0\) under regularity conditions.
- Low information: a flat likelihood around its maximum → Fisher Information is small
- High information: a sharply peaked likelihood → Fisher Information is large
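The "variance of the score" view can be checked by simulation. In this sketch (θ = 0.3 and the sample size are arbitrary choices), the Bernoulli score is \(x/\theta - (1-x)/(1-\theta)\); its mean should be near 0 and its variance near \(1/(\theta(1-\theta))\):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3

# Score of a single Bernoulli observation: d/dtheta ln f(x; theta)
x = rng.binomial(1, theta, size=200_000)
score = x / theta - (1 - x) / (1 - theta)

print(score.mean())  # ~ 0: the score has mean zero
print(score.var(), 1 / (theta * (1 - theta)))  # both ~ 4.76
```

Both identities fall out of the Bernoulli log-likelihood directly: \(E[\text{score}] = \theta/\theta - (1-\theta)/(1-\theta) = 0\) and \(\text{Var}(\text{score}) = 1/\theta + 1/(1-\theta) = 1/(\theta(1-\theta))\).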
Properties of Maximum Likelihood Estimators
Under regularity conditions, MLEs have desirable asymptotic properties:
| Property | Description | Mathematical Form |
|---|---|---|
| Consistency | Converges to true parameter | \(\hat{\theta} \xrightarrow{P} \theta\) as \(n \to \infty\) |
| Asymptotic Normality | Approximately normal for large n | \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, I^{-1}(\theta))\) |
| Efficiency | Achieves minimum variance bound | Attains the Cramér-Rao Lower Bound |
Cramér-Rao Lower Bound
The CRLB states that for any unbiased estimator \(\hat{\theta}\), the variance satisfies:

\[ \text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)} \]

where \(I_n(\theta) = n\,I(\theta)\) is the Fisher information for n i.i.d. observations. The MLE achieves this bound asymptotically.
For Bernoulli distribution: \(I(\theta) = \frac{1}{\theta(1-\theta)}\), so \(\text{Var}(\hat{\theta}) \geq \frac{\theta(1-\theta)}{n}\).
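This bound can be verified empirically. A sketch under assumed values (θ = 0.3, n = 50, and the replication count are arbitrary): the Bernoulli MLE is the sample mean, and its variance across many replications sits right at θ(1−θ)/n:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 0.3, 50, 20_000

# The Bernoulli MLE is the sample mean; estimate its variance over many replications
estimates = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
empirical_var = estimates.var()
crlb = theta * (1 - theta) / n  # = 0.0042

print(empirical_var, crlb)  # both ~ 0.0042
```

Here the MLE attains the bound exactly, not just asymptotically, because the sample mean is unbiased for θ and its variance is exactly θ(1−θ)/n.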