MLE & Fisher Information

Understanding Maximum Likelihood Estimation and Fisher Information - the foundation of statistical inference and the mathematical framework that connects data to parameter estimation.

The Likelihood Function

Imagine we're tossing a coin and observe H, H, H, T, H. What's the probability of seeing heads? A natural estimate is the sample mean (4/5 = 0.8), but is this the best estimator, and does the approach generalize to other distributions, or to multiple parameters?

Maximum Likelihood Estimation provides such a general framework. The key insight is that the underlying parameter is fixed (the frequentist view), so all randomness comes from the data-generating process. The likelihood function is the joint density of the observed data, viewed as a function of the parameter θ:

\[L(\theta) = \prod_{i=1}^{n} f(x_i;\theta)\]

And the log-likelihood function:

\[l(\theta) = \sum_{i=1}^{n} \ln f(x_i;\theta)\]

All data points are assumed to be i.i.d. (independent and identically distributed).

Coin Tossing Example

For Bernoulli trials (coin flips), the probability mass function is:

\[f(x;\theta) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0, 1\}\]

The likelihood intuitively asks: "If P(Heads) = θ, what's the probability of observing our data?"
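To make this concrete, here is a minimal sketch (plain Python, no external libraries) that evaluates the Bernoulli likelihood and log-likelihood for the H, H, H, T, H data from the introduction:

```python
import math

# Coin flips from the running example: H, H, H, T, H (1 = heads, 0 = tails)
data = [1, 1, 1, 0, 1]

def likelihood(theta, xs):
    """L(theta) = product of theta^x * (1 - theta)^(1 - x) over the sample."""
    L = 1.0
    for x in xs:
        L *= theta ** x * (1 - theta) ** (1 - x)
    return L

def log_likelihood(theta, xs):
    """l(theta) = sum of x*ln(theta) + (1 - x)*ln(1 - theta) over the sample."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

print(likelihood(0.8, data))   # ~0.0819 -- higher than any nearby theta
print(likelihood(0.5, data))   # 0.03125
print(math.isclose(math.log(likelihood(0.8, data)), log_likelihood(0.8, data)))  # True
```

Evaluating at θ = 0.8 (the sample mean) gives a larger likelihood than, say, θ = 0.5, which is exactly what "most probable parameter" means here.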


Maximum Likelihood Estimation

MLE finds the parameter that maximizes the likelihood of observing our data:

\[\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)\]

This can be found by setting the derivative to zero or using optimization algorithms. The log-likelihood is preferred because:

  • Product becomes sum (easier mathematically)
  • Same maximum as likelihood function
  • Simpler derivatives
  • Numerically stable (sums of logs avoid floating-point underflow from products of many small probabilities)
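The optimization view above can be sketched numerically. This uses a simple grid search over θ rather than calculus; the data and the grid resolution are illustrative assumptions:

```python
import math

data = [1, 1, 1, 0, 1]  # H, H, H, T, H from the running example

def neg_log_likelihood(theta):
    # Minimizing -l(theta) is equivalent to maximizing L(theta)
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in data)

# Grid search over theta in (0, 1); a derivative-based solver would also work
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = min(grid, key=neg_log_likelihood)

print(theta_hat)              # 0.8 -- agrees with the sample mean
print(sum(data) / len(data))  # 0.8
```

For Bernoulli data the closed-form answer is the sample mean, so the numerical optimizer simply recovers it; the same grid/solver pattern applies when no closed form exists.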

Visualizing MLE with Normal Distribution

Consider estimating the mean μ of a normal distribution with known variance σ² = 4. We have 10 data points from N(μ=5, σ=2).


The MLE finds the parameter value that makes our observed data most probable. For normal distributions, this coincides with the sample mean.
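A quick numerical check of that claim, assuming the section's setup (10 simulated draws from N(μ=5, σ=2) with known variance); the seed and grid bounds are illustrative:

```python
import math
import random

random.seed(0)
data = [random.gauss(5, 2) for _ in range(10)]  # 10 draws from N(mu=5, sigma=2)
sigma2 = 4.0  # known variance

def log_likelihood(mu):
    # l(mu) = sum of log-densities of N(mu, sigma^2) at each observation
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in data)

grid = [i / 1000 for i in range(2000, 8001)]  # candidate means in [2, 8]
mu_hat = max(grid, key=log_likelihood)

print(mu_hat)                 # grid maximizer of the log-likelihood
print(sum(data) / len(data))  # sample mean -- agrees to grid resolution
```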

Fisher Information

Fisher Information measures how much information our data carries about an unknown parameter:

\[I(\theta) = E\left[\left(\frac{d}{d\theta} \ln f(X;\theta)\right)^2 \bigg| \theta\right] = E\left[-\frac{d^2}{d\theta^2} \ln f(X;\theta) \bigg| \theta\right]\]

The score function \(\frac{d}{d\theta} \ln f(X;\theta)\) indicates whether θ should be larger or smaller. Fisher Information is the variance of this score (since E[score] = 0).

Low Information: a flat likelihood barely constrains the parameter → Fisher Information near 0
High Information: a sharply peaked likelihood pins down the parameter precisely → large Fisher Information
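The zero-mean score and the variance identity can be checked exactly for a Bernoulli variable, where the expectation is just a two-term sum (θ = 0.3 is an arbitrary illustrative value):

```python
theta = 0.3  # arbitrary illustrative value

def score(x):
    # d/dtheta ln f(x; theta) for the Bernoulli pmf
    return x / theta - (1 - x) / (1 - theta)

# Exact expectation over x in {0, 1}, weighting by P(X = 1) = theta
mean_score = theta * score(1) + (1 - theta) * score(0)
var_score = theta * score(1) ** 2 + (1 - theta) * score(0) ** 2 - mean_score ** 2

print(mean_score)                 # ~0: the score has zero mean
print(var_score)                  # ~4.7619
print(1 / (theta * (1 - theta)))  # ~4.7619: matches I(theta) = 1/(theta(1-theta))
```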

Properties of Maximum Likelihood Estimators

Under regularity conditions, MLEs have desirable asymptotic properties:

| Property | Description | Mathematical Form |
|---|---|---|
| Consistency | Converges to the true parameter | \(\hat{\theta} \xrightarrow{P} \theta\) as \(n \to \infty\) |
| Asymptotic Normality | Approximately normal for large n | \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, I^{-1}(\theta))\) |
| Efficiency | Achieves the minimum-variance bound | Attains the Cramér-Rao Lower Bound |
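Consistency is easy to see in simulation; the true θ, seed, and sample sizes below are illustrative assumptions:

```python
import random

random.seed(2)
theta = 0.6  # true parameter (illustrative)

# The Bernoulli MLE is the sample mean; it drifts toward theta as n grows
for n in [10, 100, 1000, 10000]:
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    print(n, sum(flips) / n)
```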

Cramér-Rao Lower Bound

The CRLB states that the variance of any unbiased estimator \(\hat{\theta}\) satisfies:

\[\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)}\]

where \(I_n(\theta)\) is the Fisher information for n observations. MLE achieves this bound asymptotically.

For Bernoulli distribution: \(I(\theta) = \frac{1}{\theta(1-\theta)}\), so \(\text{Var}(\hat{\theta}) \geq \frac{\theta(1-\theta)}{n}\).
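A simulation sketch of this bound (the true θ, sample size, and number of replications are illustrative): for Bernoulli data the MLE is the sample mean, which is unbiased, so its variance should sit right at θ(1-θ)/n — here the bound is attained exactly, not just asymptotically.

```python
import random

random.seed(1)
theta, n, trials = 0.3, 50, 20000  # illustrative values

# Each replication: draw n flips and record the MLE (the sample mean)
estimates = [sum(random.random() < theta for _ in range(n)) / n
             for _ in range(trials)]

mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials

crlb = theta * (1 - theta) / n  # 0.0042
print(var_est)   # empirical variance of the MLE, close to the bound
print(crlb)      # 0.0042
```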
