Why does a 95% CI not imply a 95% chance of containing the mean?

On the Stack Exchange website there is a question about confidence intervals; see here: Why does a 95% CI not imply a 95% chance of containing the mean?

My personal answer can be accessed here.

Why does a 95% CI not imply a 95% chance of containing the mean?

There are many issues to be clarified in this question and in the majority of the given responses. I shall confine myself only to two of them.

 a. What is a population mean? Does a true population mean exist?

The concept of a population mean is model-dependent. Since all models are wrong, but some are useful, this population mean is a fiction, defined just to provide useful interpretations. The fiction begins with a probability model.

The probability model is defined by the triplet

(\mathcal{X}, \mathcal{F}, P),

where \mathcal{X} is the sample space (a non-empty set), \mathcal{F} is a \sigma-field of subsets of \mathcal{X} and P is a probability measure defined over \mathcal{F} (it governs the data behavior). For simplicity, consider only the discrete case. The population mean is defined by

\mu = \sum_{x \in \mathcal{X}} xP(X=x),

that is, it represents the central tendency under P and it can also be interpreted as the center of mass of all points in \mathcal{X}, where the weight of each  x \in \mathcal{X} is given by P(X=x).
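
As a toy illustration (not in the original question): take \mathcal{X} = \{1, 2, \dots, 6\} with P the uniform measure of a fair die; then

\mu = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = 3.5,

the center of mass of six equally weighted points. Note that \mu need not belong to \mathcal{X}.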

In probability theory, the measure P is considered known, and therefore the population mean is accessible through the simple operation above. In practice, however, the probability P is hardly ever known. Without a probability P, one cannot describe the probabilistic behavior of the data. As we cannot set a precise probability P to explain the data behavior, we set a family \mathcal{M} of probability measures that possibly govern (or explain) the data behavior. Then the classical statistical model emerges:

(\mathcal{X}, \mathcal{F}, \mathcal{M}).

The above model is said to be a parametric model if there exists \Theta \subseteq \mathbb{R}^p with p < \infty such that \mathcal{M} \equiv \{P_\theta: \ \theta \in \Theta\}. Let us consider only the parametric model in this post.

Notice that, for each probability measure P_\theta \in \mathcal{M}, there is a corresponding mean

\mu_\theta = \sum_{x \in \mathcal{X}} x P_\theta(X=x).

That is, there is a family of population means, \{\mu_\theta: \ \theta \in \Theta\}, which depends tightly on the definition of \mathcal{M}. The family \mathcal{M} is defined by limited humans, and therefore it may not contain the true probability measure that governs the data behavior. In fact, the chosen family will hardly ever contain the true measure; moreover, this true measure may not even exist. Since the concept of a population mean depends on the probability measures in \mathcal{M}, the population mean is model-dependent.
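
A simple illustration (hypothetical, for concreteness): take \mathcal{X} = \{0, 1\} and \mathcal{M} = \{\mbox{Bernoulli}(\theta): \ \theta \in (0,1)\}; then \mu_\theta = 0 \cdot (1-\theta) + 1 \cdot \theta = \theta, and the family of population means is the whole interval (0,1). A different choice of \mathcal{M} over the same \mathcal{X} would produce a different family of means.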

The Bayesian approach considers a prior probability over the subsets of \mathcal{M} (or, equivalently, of \Theta), but in this post I will concentrate only on the classical version.

 b. What is the definition and the purpose of a confidence interval?

As mentioned above, the population mean is model-dependent and provides useful interpretations. However, we have a family of population means, because the statistical model is defined by a family of probability measures (each probability measure generates a population mean). Therefore, based on an experiment, inferential procedures should be employed in order to estimate a small set (an interval) containing good candidates for the population mean. One well-known procedure is the (1-\alpha) confidence region, which is defined by a set C_\alpha such that, for all \theta \in \Theta,

P_\theta(C_\alpha(X) \ni \mu_\theta) \geq 1-\alpha \quad \mbox{and} \quad \inf_{\theta\in \Theta} P_\theta(C_\alpha(X) \ni \mu_\theta) = 1-\alpha,

where P_\theta(C_\alpha(X) = \varnothing) = 0 (see Schervish, 1995). This is a very general definition that encompasses virtually any type of confidence interval. Here, P_\theta(C_\alpha(X) \ni \mu_\theta) is the probability that C_\alpha(X) contains \mu_\theta under the measure P_\theta. This probability must always be at least 1-\alpha, with equality attained in the worst case.
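
For a concrete instance (an illustration that is not part of the general definition above, and which steps outside the discrete case for familiarity): under the normal model \mathcal{M} = \{N(\theta, \sigma^2): \ \theta \in \mathbb{R}\} with \sigma^2 known, the usual interval

C_\alpha(X) = \left[\bar{X} - z_{1-\alpha/2}\tfrac{\sigma}{\sqrt{n}}, \ \bar{X} + z_{1-\alpha/2}\tfrac{\sigma}{\sqrt{n}}\right]

satisfies P_\theta(C_\alpha(X) \ni \mu_\theta) = 1-\alpha exactly for every \theta \in \Theta, because \sqrt{n}(\bar{X} - \mu_\theta)/\sigma has a standard normal distribution under each P_\theta.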

Remark: The reader should notice that it is not necessary to make assumptions about the state of reality; the confidence region is defined for a well-defined statistical model without reference to any “true” mean. Even if the “true” probability measure does not exist or is not in \mathcal{M}, the confidence region definition still works, since the assumptions are about statistical modelling rather than states of reality.

On the one hand, before observing the data, C_\alpha(X) is a random set (or random interval), and the probability that “C_\alpha(X) contains the mean \mu_\theta” is at least (1-\alpha) for all \theta \in \Theta. This is a very desirable feature for the frequentist paradigm.

On the other hand, after observing the data x, C_\alpha(x) is just a fixed set, and the probability that “C_\alpha(x) contains the mean \mu_\theta” is in \{0,1\} for all \theta \in \Theta.

That is, after observing the data x, we cannot employ probabilistic reasoning anymore. As far as I know, there is no theory that treats confidence sets for an observed sample (we are working on it and we are getting some nice results). For now, the frequentist must believe that the observed set (or interval) C_\alpha(x) is one of the (1-\alpha)100\% sets that contain \mu_\theta for all \theta \in \Theta.
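
The pre-data/post-data contrast can be made tangible with a small simulation (a minimal sketch assuming a normal model with known variance; the specific numbers are hypothetical and not part of the argument above):

import numpy as np

rng = np.random.default_rng(0)
mu_theta, sigma, n = 10.0, 2.0, 25        # hypothetical mean and known sigma under P_theta
half = 1.96 * sigma / np.sqrt(n)          # half-width of the usual 95% interval

# Before the data: C_alpha(X) is a random interval; over many replications
# it contains mu_theta about 95% of the time, for every theta.
xbar = rng.normal(mu_theta, sigma / np.sqrt(n), size=100_000)   # sampling distribution of the mean
coverage = np.mean((xbar - half <= mu_theta) & (mu_theta <= xbar + half))
print(f"pre-data coverage ~ {coverage:.3f}")                    # close to 0.95

# After the data: C_alpha(x) is a fixed interval; it either contains
# mu_theta or it does not, so the "probability" is 0 or 1.
x = rng.normal(mu_theta, sigma, size=n)
lo, hi = x.mean() - half, x.mean() + half
print(f"observed interval = [{lo:.2f}, {hi:.2f}]; contains mu_theta? {lo <= mu_theta <= hi}")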

PS: I invite any comments, reviews, critiques, or even objections to my post. Let’s discuss it in depth. As I am not a native English speaker, my post surely contains typos and grammar mistakes.

Reference:

Schervish, M. (1995), Theory of Statistics, Second ed, Springer.

Statistical Hypothesis Testing (Probability × Possibility)

In statistics, the main tool for modeling uncertainty is certainly the probability measure. Other measures, such as possibility, necessity, impossibility, and plausibility, are not part of the menu in statistics, probability and related courses (physics, biology and so on). This fact is indeed a very strong limitation that we, statisticians, have to deal with. Its price is a shallow understanding of statistical thinking and modeling, since a classical statistical model is a meta-probabilistic object and hence more general tools are required to understand it in depth. In general, the probability rules are justified in terms of frequencies (Laplace), game theory with its own definition of “coherence” (Ramsey, de Finetti, Savage, Lindley and Kadane), or desiderata (de Finetti, Richard Cox, Jaynes and Paris), and so on. Basically, at the core of all these arguments a strong linearity constraint is embedded, namely:

1. the frequency justification is based on counting frequencies;

2. the probabilistic “coherence” definition always depends on arbitrary linear rules;

3. the more basic axioms (desiderata) always assume a strictly increasing constraint on the functions involved.

All of these imposed constraints can easily be disputed as immanent attributes of coherent reasoning. Coherent reasoning is much broader than any set of quantitative specifications: it is qualitative rather than quantitative, and it is much more related to aesthetics than otherwise. Therefore, no axiomatization of coherence should be considered the final word.

It should be clear that probability can be used for modeling uncertainty, but it should not be imposed as the unique tool. As Professor Zadeh wisely put it, “a problem arises when ‘can’ is replaced with ‘should,’ as in the following dictum of a noted Bayesian, Professor D. V. Lindley”:

The only satisfactory description of uncertainty is probability. By this I mean that every uncertainty statement must be in the form of a probability; that several uncertainties must be combined using the rules of probability; and that the calculus of probabilities is adequate to handle all situations involving uncertainty… probability is the only sensible description of uncertainty and is adequate for all problems involving uncertainty. All other methods are inadequate… anything that can be done with fuzzy logic, belief functions, upper and lower probabilities, or any other alternative to probability can better be done with probability (Lindley 1987).

The study of plausibility measures can significantly broaden our views on the modeling of uncertain events. An attentive statistician will note that, when testing a very restrictive hypothesis (e.g., the Hardy-Weinberg equilibrium or any physical law; formally, a very restrictive hypothesis is written as H_0: \theta \in \Theta_0, where \mbox{dim}(\Theta_0) < \mbox{dim}(\Theta) and \Theta is the model parameter space), the best probability estimate is zero; that is, the probability of a very restrictive hypothesis is zero. As the probability is zero, should we claim that this very restrictive hypothesis is impossible? Of course not: since all (probability) models are approximations, “zero-probability” events do occur in practice (the examples abound and I am not going to enumerate them). In these “zero-probability” cases, we can still have a value indicating how discrepant this very restrictive hypothesis is with the observed data, but such a value is not attained within the probabilistic framework. If a positive probability is set for a very restrictive hypothesis, then many paradoxes emerge (on Bayes factors, see Lavine and Schervish, 1999).
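
To make the dimensional claim concrete (an illustration consistent with the text above): in the Hardy-Weinberg example, the genotype probabilities (\theta_1, \theta_2, \theta_3) live on the two-dimensional simplex \Theta = \{\theta_i \geq 0, \ \theta_1 + \theta_2 + \theta_3 = 1\}, whereas H_0 states that (\theta_1, \theta_2, \theta_3) = (p^2, \ 2p(1-p), \ (1-p)^2) for some p \in [0,1], a one-dimensional curve inside \Theta. Since \mbox{dim}(\Theta_0) = 1 < 2 = \mbox{dim}(\Theta), any prior that is absolutely continuous on the simplex assigns probability zero to this curve, no matter what data arrive.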

In statistical hypothesis testing, those who are called Bayesians are not willing to test very restrictive hypotheses, since, as mentioned above, the probabilities of such hypotheses are always zero. The other school (the frequentist or likelihoodist school, or simply the classical school) handles the problem in a very different way: it considers that probability can describe some uncertain events, but not all of them.

If we perform an experiment, the observed data may ideally be modeled by a probability measure; in practice, however, we do not know which probability measure actually governs the data behavior. We can build a family of candidate probability measures and then choose one (or a small set of) probability measure(s) that may be appropriate for modeling the data. The two schools act as follows:

a) Bayesian statisticians impose another probability measure over the initial family of probability measures, called the prior probability measure, and then compute the so-called posterior probability measure. The posterior probability measure is then used for testing any type of hypothesis regarding the initial family.

b) Classical statisticians treat the problem by assigning full possibility to all elements of the initial family. They use the so-called p-value as a measure of evidence for testing any type of hypothesis regarding the initial family (a p-value sketch for the Hardy-Weinberg hypothesis is given below). There are also several other methods (confidence intervals, most powerful tests, and so on).
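
To fix ideas, here is a minimal sketch of the classical route for the Hardy-Weinberg hypothesis mentioned above (the genotype counts are hypothetical, and the chi-square goodness-of-fit test is just one standard choice of test):

import numpy as np
from scipy.stats import chi2

counts = np.array([298, 489, 213])     # hypothetical counts of genotypes AA, Aa, aa
n = counts.sum()
p_hat = (2 * counts[0] + counts[1]) / (2 * n)    # allele frequency estimated under H_0

# Expected counts under Hardy-Weinberg: n * (p^2, 2p(1-p), (1-p)^2)
expected = n * np.array([p_hat**2, 2 * p_hat * (1 - p_hat), (1 - p_hat)**2])

stat = ((counts - expected) ** 2 / expected).sum()
p_value = chi2.sf(stat, df=1)          # df = (3 categories - 1) - 1 estimated parameter
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")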

The p-value is not even a plausibility measure over the parameter space (the class of plausibility measures includes many important measures, such as probability and possibility), and this produces some logical “problems”; the p-value has only a quasi-possibilistic behavior. In a very recent paper, “A classical measure of evidence for general null hypotheses”, to appear in Fuzzy Sets and Systems, a possibility measure is proposed for testing very general hypotheses (including very restrictive ones) that is free of logical contradictions.

Therefore, possibility measures can be used to test very restrictive hypotheses, while probability cannot handle them. Possibility measures can be justified in terms of game theory (Dubois and Prade), and we can also create a definition of coherence that matches the possibility rules. The limit of the probability rules is not the limit of coherent reasoning; indeed, the limit of any quantitative artifact is not the limit of coherent reasoning.
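
For concreteness, here is a toy sketch of one well-known possibilistic construction, the relative likelihood (an illustration only; it is not necessarily the measure proposed in the paper cited above):

from scipy.stats import binom

n, x = 100, 58                     # hypothetical binomial data

def lik(theta):
    return binom.pmf(x, n, theta)  # likelihood of theta given the observed data

sup_lik = lik(x / n)               # the binomial MLE is x/n, so this is the supremum

# Pos(Theta_0) = sup over Theta_0 of lik(theta) / sup over Theta of lik(theta):
# a maxitive measure, so a point hypothesis can receive positive possibility
# even though its probability under any absolutely continuous prior is zero.
pos_mle  = lik(0.58) / sup_lik     # H_0: theta = 0.58 (the MLE), possibility 1
pos_half = lik(0.50) / sup_lik     # H_0: theta = 0.5, a value strictly between 0 and 1
print(f"Pos(theta = 0.58) = {pos_mle:.3f}, Pos(theta = 0.5) = {pos_half:.3f}")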

References:

Lavine, M., Schervish, M.J. (1999). Bayes factors: what they are and what they are not, The American Statistician, 53, 119-122.

Lindley, D.V. (1987). The probability approach to the treatment of uncertainty in Artificial Intelligence and expert systems, Statistical Science, 2, 17-24.