Why does a 95% CI not imply a 95% chance of containing the mean?

On the Stack Exchange website there is a question about confidence intervals; see here: Why does a 95% CI not imply a 95% chance of containing the mean?

My personal answer can be accessed here.

Why does a 95% CI not imply a 95% chance of containing the mean?

There are many issues to be clarified in this question and in the majority of the given responses. I shall confine myself only to two of them.

 a. What is a population mean? Does a true population mean exist?

The concept of population mean is model-dependent. As all models are wrong, but some are useful, this population mean is a fiction that is defined just to provide useful interpretations. The fiction begins with a probability model.

The probability model is defined by the triplet

(\mathcal{X}, \mathcal{F}, P),

where \mathcal{X} is the sample space (a non-empty set), \mathcal{F} is a family of subsets of \mathcal{X} (a \sigma-field) and P is a probability measure defined over \mathcal{F} (it governs the data behavior). For simplicity, consider only the discrete case. The population mean is defined by

\mu = \sum_{x \in \mathcal{X}} xP(X=x),

that is, it represents the central tendency under P. It can also be interpreted as the center of mass of all points in \mathcal{X}, where the weight of each x \in \mathcal{X} is given by P(X=x).
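As a minimal sketch of this definition, the following Python snippet computes the weighted sum for a discrete distribution. The fair six-sided die and the helper name `population_mean` are illustrative assumptions, not part of the original argument.

```python
from fractions import Fraction


def population_mean(pmf):
    """Return sum over x of x * P(X = x) for a pmf given as {x: P(X = x)}."""
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())


# Hypothetical example: a fair six-sided die, using exact fractions.
fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
mu = population_mean(fair_die)
print(mu)  # 7/2
```

Changing the weights P(X=x) moves the center of mass, which is exactly why the mean depends on the measure P.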

In probability theory, the measure P is considered known, so the population mean is accessible through the simple operation above. In practice, however, the probability P is rarely known, and without P one cannot describe the probabilistic behavior of the data. As we cannot fix a precise probability P to explain the data behavior, we set a family \mathcal{M} of probability measures that possibly govern (or explain) the data behavior. Then, the classical statistical model emerges

(\mathcal{X}, \mathcal{F}, \mathcal{M}).

The above model is said to be a parametric model if there exists \Theta \subseteq \mathbb{R}^p with p< \infty such that \mathcal{M} \equiv \{P_\theta: \ \theta \in \Theta\}. Let us consider just the parametric model in this post.

Notice that, for each probability measure P_\theta \in \mathcal{M}, there is a corresponding mean

\mu_\theta = \sum_{x \in \mathcal{X}} x P_\theta(X=x).

That is, there is a family of population means \{\mu_\theta: \ \theta \in \Theta\} that depends tightly on the definition of \mathcal{M}. The family \mathcal{M} is chosen by humans with limited knowledge, and therefore it may not contain the true probability measure that governs the data behavior. Actually, the chosen family will rarely contain the true measure; moreover, this true measure may not even exist. As the concept of a population mean depends on the probability measures in \mathcal{M}, the population mean is model-dependent.

The Bayesian approach considers a prior probability over the subsets of \mathcal{M} (or, equivalently, \Theta), but in this post I will concentrate only on the classical version.
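To make the family of means concrete, here is a small sketch that assumes a binomial model with n = 10 trials (my choice for illustration; the post does not fix any particular family). Each \theta gives its own measure P_\theta and hence its own mean \mu_\theta.

```python
from math import comb


def binomial_pmf(n, theta):
    """P_theta(X = x) for x = 0, ..., n under a Binomial(n, theta) model."""
    return {x: comb(n, x) * theta**x * (1 - theta) ** (n - x) for x in range(n + 1)}


def mean_under(pmf):
    """Population mean: sum over x of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())


n = 10  # assumed number of trials
thetas = [0.1, 0.3, 0.5, 0.9]  # a few members of the parameter space
means = {theta: mean_under(binomial_pmf(n, theta)) for theta in thetas}

for theta, mu_theta in means.items():
    print(f"theta = {theta}: mu_theta = {mu_theta:.4f}")
```

For this assumed family the sum reduces to \mu_\theta = n\theta, so a different choice of \mathcal{M} would give a different set of candidate means.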

 b. What is the definition and the purpose of a confidence interval?

As mentioned above, the population mean is model-dependent and provides useful interpretations. However, we have a family of population means, because the statistical model is defined by a family of probability measures (each probability measure generates a population mean). Therefore, based on an experiment, inferential procedures should be employed to estimate a small set (interval) containing good candidates for the population mean. One well-known procedure is the (1-\alpha) confidence region, which is defined by a set C_\alpha such that, for all \theta \in \Theta,

P_\theta(C_\alpha(X) \ni \mu_\theta) \geq 1-\alpha  and \inf_{\theta\in \Theta} P_\theta(C_\alpha(X) \ni \mu_\theta) = 1-\alpha,

where P_\theta(C_\alpha(X) = \varnothing) = 0 (see Schervish, 1995). This is a very general definition and encompasses virtually any type of confidence interval. Here, P_\theta(C_\alpha(X) \ni \mu_\theta) is the probability that C_\alpha(X) contains \mu_\theta under the measure P_\theta. This probability should always be greater than or equal to 1-\alpha; equality occurs in the worst case.
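The coverage statement can be checked numerically. The sketch below is only an illustration under assumptions of my own: a normal model with known standard deviation (a continuous case, whereas the text above restricts itself to the discrete one), the usual z-interval for the mean, and a fixed random seed. It estimates P_\theta(C_\alpha(X) \ni \mu_\theta) by simulation for a few values of the mean.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, alpha):
    """(1 - alpha) z-interval for the mean when sigma is assumed known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def estimated_coverage(mu, sigma, n, alpha, reps, rng):
    """Fraction of simulated samples whose interval contains mu."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma, alpha)
        hits += lower <= mu <= upper
    return hits / reps


rng = random.Random(0)  # fixed seed so the run is reproducible
alpha = 0.05
coverage = {
    mu: estimated_coverage(mu, sigma=1.0, n=10, alpha=alpha, reps=5000, rng=rng)
    for mu in (-2.0, 0.0, 5.0)
}

for mu, cov in coverage.items():
    print(f"mu = {mu}: estimated coverage = {cov:.3f}")
```

In this assumed model the coverage is the same for every value of the mean, so the estimates should all be close to 1-\alpha; the probability refers to the random interval, before any data are observed.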

Remark: Readers should notice that it is not necessary to make assumptions about the state of reality: the confidence region is defined for a well-defined statistical model without making reference to any “true” mean. Even if the “true” probability measure does not exist or is not in \mathcal{M}, the confidence region definition still works, since the assumptions are about statistical modelling rather than the states of reality.

On the one hand, before observing the data, C_\alpha(X) is a random set (or random interval) and the probability that “C_\alpha(X) contains the mean \mu_\theta” is at least (1-\alpha) for all \theta \in \Theta. This is a very desirable feature for the frequentist paradigm.

On the other hand, after observing the data x, C_\alpha(x) is just a fixed set and the probability that “C_\alpha(x) contains the mean \mu_\theta” is either 0 or 1, that is, it lies in \{0,1\} for all \theta \in \Theta.
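A small sketch of this point, with a made-up observed sample and, again, an assumed normal model with known standard deviation: once the interval is computed from the data, asking whether it contains a given candidate mean is a yes/no question.

```python
from statistics import NormalDist, fmean

# Hypothetical observed sample; sigma is assumed known.
x = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]
sigma, alpha = 1.0, 0.05

# The realized interval C_alpha(x): a fixed pair of numbers, no longer random.
z = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z * sigma / len(x) ** 0.5
observed = (fmean(x) - half_width, fmean(x) + half_width)


def contains(interval, mu):
    """Return 1 if mu lies in the fixed interval, 0 otherwise."""
    return int(interval[0] <= mu <= interval[1])


candidates = [3.0, 4.5, 5.0, 5.5, 7.0]  # candidate values of mu_theta
indicators = {mu: contains(observed, mu) for mu in candidates}
print(observed)
print(indicators)
```

Every entry of `indicators` is 0 or 1; no intermediate value such as 0.95 can appear for a fixed interval and a fixed candidate mean.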

That is, after observing the data x, we can no longer employ probabilistic reasoning. As far as I know, there is no theory to treat confidence sets for an observed sample (we are working on it and we are getting some nice results). For the time being, the frequentist must believe that the observed set (or interval) C_\alpha(x) is one of the (1-\alpha)100\% of sets that contain \mu_\theta, for all \theta \in \Theta.

PS: I invite any comments, reviews, critiques, or even objections to my post. Let’s discuss it in depth. As I am not a native English speaker, my post surely contains typos and grammar mistakes.

Reference:

Schervish, M. (1995), Theory of Statistics, Second ed, Springer.

Why is studying measure theory important to statisticians?

Measure theory, like many other branches of mathematics, is very important for formalizing and understanding the theory of statistics more deeply, mainly for theoreticians. It deals with how to measure parts of a set of interest.

In statistics, we always use random variables to make inferences about the unknown quantities of interest (parameters). These random variables are just (measurable) functions that map the elements of an abstract set to the real line (i.e., X: \Omega \to \mathbb{R}, where X is a random variable, \Omega is an abstract set and \mathbb{R} is the set of real numbers), since it is much easier to work with the real line than with an abstract set. The probability space is the triplet

(\Omega,\mathcal{F},\mu),

where \mathcal{F} is a collection of subsets of \Omega and \mu is a set function that gives a probability for each set in \mathcal{F}, i.e., \mu(\varnothing) = 0, \ \mu(\Omega) = 1 and, if A, B \in \mathcal{F} are disjoint, then \mu(A\cup B) = \mu(A) + \mu(B) (it is in fact countably additive, but for ease of presentation I consider only finite additivity).
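These properties can be verified directly on a toy example. The sketch below assumes a three-point set \Omega with made-up weights and takes \mathcal{F} to be all subsets of \Omega; the names `omega`, `weights` and `power_set` are mine.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical finite sample space with exact point masses.
omega = ("a", "b", "c")
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}


def power_set(points):
    """All subsets of a finite set, playing the role of the family F."""
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return [frozenset(s) for s in subsets]


def mu(event):
    """Set function: the probability of an event is the sum of its point masses."""
    return sum((weights[w] for w in event), Fraction(0))


events = power_set(omega)


def is_finitely_additive():
    """Check mu(A | B) == mu(A) + mu(B) for every disjoint pair A, B in F."""
    return all(
        mu(a | b) == mu(a) + mu(b)
        for a in events
        for b in events
        if not (a & b)
    )


print(mu(frozenset()), mu(frozenset(omega)), is_finitely_additive())
```

On a finite set, finite additivity is all there is to check; the countable version only matters when \mathcal{F} contains infinitely many disjoint sets.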

The statistical model is closely related to the probability model; the difference is in the third term of the triplet (\Omega,\mathcal{F},\mu). When the probability measure \mu is unknown, we may find a family of probability measures that possibly fit the observed data adequately, say \mathcal{P}. The statistical model is then defined as

(\Omega,\mathcal{F},\mathcal{P})           (*)

The inferential process is the procedure of finding a subfamily (possibly one element) of \mathcal{P} that contains all the “best” cases according to some criterion. Knowing measure theory, one can propose coherent methodologies for estimation, prediction and so on. Note that any statistical model can be written as (*).

Many controversies in statistical hypothesis testing would be avoided if all the quantities involved were formally defined; for instance, the informal definition of p-values gives rise to many fruitless discussions (see Patriota, 2013).

When you are proposing a new statistical methodology, you have to keep in mind the main theorems and lemmas of measure theory in order to give your proposal a solid theoretical basis.
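As one possible illustration of picking a subfamily of \mathcal{P}, the sketch below assumes a binomial family indexed by a grid of \theta values, hypothetical data (7 successes in 10 trials), and a likelihood-ratio cutoff of 0.15. The criterion and the cutoff are arbitrary choices of mine; the text does not prescribe any particular one.

```python
from math import comb


def likelihood(theta, successes, n):
    """Binomial likelihood of theta given the observed number of successes."""
    return comb(n, successes) * theta**successes * (1 - theta) ** (n - successes)


# The family P, indexed by a grid of theta values (an assumed discretization).
grid = [i / 100 for i in range(1, 100)]

successes, n = 7, 10  # hypothetical observed data
lik = {theta: likelihood(theta, successes, n) for theta in grid}

# One-element subfamily: the member of the grid that maximizes the likelihood.
best = max(lik, key=lik.get)

# Larger subfamily: members whose likelihood is at least 15% of the maximum
# (the 0.15 cutoff is an arbitrary, illustrative criterion).
subfamily = [theta for theta in grid if lik[theta] >= 0.15 * lik[best]]

print(best, min(subfamily), max(subfamily))
```

Other criteria would select other subfamilies of the same \mathcal{P}; the model (*) stays the same and only the selection rule changes.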

References:

Patriota, AG (2013). A classical measure of evidence for general null hypotheses, Fuzzy Sets and Systems, In Press http://dx.doi.org/10.1016/j.fss.2013.03.007