# Why does a 95% CI not imply a 95% chance of containing the mean?

On the Stack Exchange website there is a question about confidence intervals, see here: Why does a 95% CI not imply a 95% chance of containing the mean?

My personal answer can be accessed here.

Why does a 95% CI not imply a 95% chance of containing the mean?

There are many issues to be clarified in this question and in the majority of the given responses. I shall confine myself only to two of them.

a. What is a population mean? Does a true population mean exist?

The concept of population mean is model-dependent. As all models are wrong, but some are useful, this population mean is a fiction that is defined just to provide useful interpretations. The fiction begins with a probability model.

The probability model is defined by the triplet

$(\mathcal{X}, \mathcal{F}, P),$

where $\mathcal{X}$ is the sample space (a non-empty set), $\mathcal{F}$ is a $\sigma$-field of subsets of $\mathcal{X}$, and $P$ is a probability measure defined on $\mathcal{F}$ (it governs the data behavior). For simplicity, consider only the discrete case. The population mean is defined by

$\mu = \sum_{x \in \mathcal{X}} xP(X=x),$

that is, it represents the central tendency under $P$ and it can also be interpreted as the center of mass of all points in $\mathcal{X}$, where the weight of each  $x \in \mathcal{X}$ is given by $P(X=x)$.
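As a small numerical illustration of this definition, here is a minimal sketch. The fair six-sided die is a hypothetical choice of $P$, and the names `pmf` and `population_mean` are mine, not part of any library.

```python
from fractions import Fraction


def population_mean(pmf):
    """Return the sum of x * P(X = x) over a finite sample space.

    `pmf` maps each point x of the sample space to its probability.
    """
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())


# Hypothetical probability measure: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mu = population_mean(pmf)
print(mu)  # 7/2
```

Exact fractions are used so that the check on the total probability is not affected by floating-point rounding.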

In probability theory, the measure $P$ is considered known, so the population mean is accessible through the simple operation above. In practice, however, the probability $P$ is rarely known, and without it one cannot describe the probabilistic behavior of the data. As we cannot fix a single probability $P$ to explain the data behavior, we instead set a family $\mathcal{M}$ of probability measures that could possibly govern (or explain) the data behavior. Then the classical statistical model emerges:

$(\mathcal{X}, \mathcal{F}, \mathcal{M}).$

The above model is said to be a parametric model if there exists $\Theta \subseteq \mathbb{R}^p$ with $p< \infty$ such that $\mathcal{M} \equiv \{P_\theta: \ \theta \in \Theta\}$. Let us consider just the parametric model in this post.

Notice that, for each probability measure $P_\theta \in \mathcal{M}$, there is a corresponding mean

$\mu_\theta = \sum_{x \in \mathcal{X}} x P_\theta(X=x).$

That is, there is a family of population means $\{\mu_\theta: \ \theta \in \Theta\}$ that depends tightly on the definition of $\mathcal{M}$. The family $\mathcal{M}$ is defined by limited humans, and therefore it may not contain the true probability measure that governs the data behavior. In fact, the chosen family will rarely contain the true measure; moreover, this true measure may not even exist. As the concept of a population mean depends on the probability measures in $\mathcal{M}$, the population mean is model-dependent.

The Bayesian approach considers a prior probability over the subsets of $\mathcal{M}$ (or, equivalently, $\Theta$), but in this post I will concentrate only on the classical version.
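To see a family of population means in a concrete case, here is a minimal sketch. It assumes a binomial family with $n = 4$ trials and success probability $\theta$, which is my own illustrative choice of $\mathcal{M}$; the helper names are hypothetical.

```python
from math import comb


def binomial_pmf(n, theta):
    """P_theta(X = x) for x = 0, ..., n under a binomial model."""
    return {x: comb(n, x) * theta**x * (1 - theta) ** (n - x) for x in range(n + 1)}


def population_mean(pmf):
    """Sum of x * P_theta(X = x) over the sample space."""
    return sum(x * p for x, p in pmf.items())


n = 4
# One population mean per member of the family: mu_theta changes with theta.
means = {theta: population_mean(binomial_pmf(n, theta)) for theta in (0.1, 0.5, 0.9)}
for theta, mu_theta in means.items():
    print(f"theta = {theta}: mu_theta = {mu_theta:.3f}")
```

In this family each value of $\theta$ yields its own mean, $\mu_\theta = n\theta$. A different choice of $\mathcal{M}$ would give a different family of means, which is the sense in which the population mean is model-dependent.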

b. What is the definition and the purpose of a confidence interval?

As mentioned above, the population mean is model-dependent and provides useful interpretations. However, we have a family of population means, because the statistical model is defined by a family of probability measures (each probability measure generates a population mean). Therefore, based on an experiment, inferential procedures should be employed in order to estimate a small set (interval) containing good candidates for the population mean. One well-known procedure is the $(1-\alpha)$ confidence region, which is defined by a set $C_\alpha$ such that, for all $\theta \in \Theta$,

$P_\theta(C_\alpha(X) \ni \mu_\theta) \geq 1-\alpha$  and $\inf_{\theta\in \Theta} P_\theta(C_\alpha(X) \ni \mu_\theta) = 1-\alpha,$

where $P_\theta(C_\alpha(X) = \varnothing) = 0$ (see Schervish, 1995). This is a very general definition and encompasses virtually any type of confidence interval. Here, $P_\theta(C_\alpha(X) \ni \mu_\theta)$ is the probability that $C_\alpha(X)$ contains $\mu_\theta$ under the measure $P_\theta$. This probability should always be greater than or equal to $1-\alpha$, with equality attained in the worst case over $\theta \in \Theta$.

Remark: The reader should notice that it is not necessary to make assumptions about the state of reality: the confidence region is defined for a well-defined statistical model without making reference to any “true” mean. Even if the “true” probability measure does not exist or is not in $\mathcal{M}$, the confidence region definition will work, since the assumptions are about statistical modelling rather than the states of reality.

On the one hand, before observing the data, $C_\alpha(X)$ is a random set (or random interval) and the probability that “$C_\alpha(X)$ contains the mean $\mu_\theta$” is at least $(1-\alpha)$ for all $\theta \in \Theta$. This is a very desirable feature for the frequentist paradigm.
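To make this pre-data statement concrete, here is a minimal simulation sketch. It assumes a model that is not part of the discussion above: $n$ independent normal observations with known standard deviation $\sigma$, and the usual interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$. The function names and the values of $\mu$ tried are illustrative only.

```python
import random
from statistics import NormalDist, fmean


def confidence_interval(sample, sigma, alpha):
    """Known-variance normal interval: mean +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def coverage(mu, sigma=1.0, n=10, alpha=0.05, reps=5000, seed=0):
    """Estimate P_mu(C_alpha(X) contains mu) by repeated sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = confidence_interval(sample, sigma, alpha)
        hits += lower <= mu <= upper
    return hits / reps


# The coverage statement must hold for every member of the family,
# so we check several values of mu rather than a single "true" one.
estimates = {mu: coverage(mu) for mu in (-3.0, 0.0, 10.0)}
for mu, estimate in estimates.items():
    print(f"mu = {mu}: estimated coverage = {estimate:.3f}")
```

Under these assumptions the estimated coverage should sit close to $1-\alpha = 0.95$ for every value of $\mu$ tried. The probability statement is about the random set $C_\alpha(X)$ under each $P_\theta$, not about any particular realized interval.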

On the other hand, after observing the data $x$, $C_\alpha(x)$ is just a fixed set and the probability that  “$C_\alpha(x)$ contains the mean $\mu_\theta$” should be in $\{0,1\}$ for all $\theta \in \Theta$.

That is, after observing the data $x$, we cannot employ probabilistic reasoning anymore. As far as I know, there is no theory to treat confidence sets for an observed sample (we are working on it and we are getting some nice results). For the time being, the frequentist must believe that the observed set (or interval) $C_\alpha(x)$ is one of the $(1-\alpha)100\%$ sets that contain $\mu_\theta$ for all $\theta\in \Theta$.
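The post-data situation can be sketched in the same spirit. The example below assumes a normal model with known $\sigma = 1$ and uses made-up observations; the data, the candidate means, and the function name are all hypothetical.

```python
from statistics import NormalDist, fmean


def confidence_interval(sample, sigma, alpha):
    """Known-variance normal interval: mean +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


# Made-up observed data x; once it is fixed, so is C_alpha(x).
observed = [4.1, 5.3, 4.8, 5.9, 5.0]
lower, upper = confidence_interval(observed, sigma=1.0, alpha=0.05)

# For each candidate mean, "C_alpha(x) contains mu" is simply true or false.
indicators = {mu: int(lower <= mu <= upper) for mu in (3.0, 5.0, 7.0)}
print(f"observed interval: ({lower:.3f}, {upper:.3f})")
print(indicators)
```

Nothing random remains in this computation: for each candidate mean the answer is 0 or 1, and the value $1-\alpha$ does not appear anywhere in the result.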

PS: I invite any comments, reviews, critiques, or even objections to my post. Let’s discuss it in depth. As I am not a native English speaker, my post surely contains typos and grammar mistakes.

Reference:

Schervish, M. (1995), Theory of Statistics, Second ed, Springer.

## 2 thoughts on “Why does a 95% CI not imply a 95% chance of containing the mean?”

1. Thanks for the interesting post!

As you say, we cannot talk about the probability that the interval contains the true value of the parameter. Hence, people typically interpret it by saying that if we repeat the experiment $n$ times, with $n$ large, then in approximately $100(1-\alpha)\%$ of the repetitions we will cover the true value of the parameter.

Of course, in general we won’t repeat such an experiment, and hence this interpretation loses most of its appeal to me. I think that a slight variation on it is the following idea, which I’ve seen only in one or two places so far: Imagine we are frequentist statisticians and perform $n$ different analyses. They do not have to be related at all; one can be about the weather forecast, and another about a medication for flu. Imagine that in analysis $i$ we create an interval for the parameter $\theta_i$ with confidence $(1-\alpha)$. Denote this interval by $I_i$. Then, by the law of large numbers (assuming the experiments are independent),

$\frac{1}{n}\sum_{i=1}^n I(\theta_i \in I_i) \longrightarrow 1-\alpha.$

This means that in approximately $100(1-\alpha)\%$ of the analyses we will cover the true value of the parameter. This is much more appealing to me, because we will in fact perform many analyses in our lives.

Of course, to start with, this assumes the model is right, which is another discussion where foundations play an important role…

2. Hi Rafael,

Thanks for your comments. I think it is possible to study convergence in distribution without postulating anything about true probabilities, true models, and the like. We have to represent the convergence in terms of the family of probabilities $\mathcal{M}$. That is, the true probability measure does not need to be inside $\mathcal{M}$; it does not even need to exist.