The p-value is not a probability conditional on the null hypothesis H_0

On 13 March 2013, Professor Andrew Gelman wrote a post on his blog about some common misunderstandings of p-values. In the middle of the post, Gelman referred the reader to his paper Gelman (2013) for further discussion of the topic. What I am going to comment on here is the third sentence of that paper, which reads:

“The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research …”

The formal view of the p-value as a probability conditional on the null is NOT mathematically correct, since the concept of conditional probability is not employed in the definition of a p-value. This confusion arises when p-values are defined informally and, as we shall see, a formal definition illuminates the issue.

Informal definition

I use uppercase letters for random quantities and lowercase letters for observed quantities. Denote the null hypothesis by H_0, let X be the data, and let T be a statistic satisfying the following condition: the more incompatible the observed data x are with the null H_0, the larger the observed value t. The p-value is the probability of observing a statistic T at least as extreme as the observed one, under the null hypothesis. This is commonly stated as

P(T > t; \mbox{under } H_0) \mbox{ or } P(T>t | H_0).

Both definitions are mathematically imprecise, since "under H_0" and "given H_0" are not properly defined. It should be clear that "under the null hypothesis" means that the probability distributions stated in H_0 are adequate to fit the data. Notice that if H_0 states more than one probability measure, the definitions above are ambiguous (no single probability measure is specified). Informal definitions of this type are certainly a main cause of the controversies in the literature, and the obvious way to avoid them is to make the definitions formal by using proper tools.

Formal definition

Consider the statistical model (\mathcal{X}, \mathcal{F}, \mathcal{P}), where \mathcal{X} is the sample space, \mathcal{F} is a sigma-field of subsets of \mathcal{X}, and \mathcal{P} is a family of probability measures that could possibly fit the observed data. The null hypothesis states that a subfamily \mathcal{P}_{0} \subset \mathcal{P} contains probability measures that are not discrepant with the observed data; this is written symbolically as H_0: P \in \mathcal{P}_{0}.

Again, let X be the data and let T be a statistic satisfying the condition above: the more incompatible the observed data x are with the null H_0, the larger the observed value t. Notice that such a statistic T depends on the null hypothesis, so it should be read as T_{H_0} rather than T; see, for instance, the likelihood ratio statistic. This remark helps us to understand some paradoxical features of p-values (for more on this, see Patriota (2013)). As mentioned above, the p-value is the probability of observing a statistic T at least as extreme as the observed one, under the null hypothesis. This is formally written as

P(T> t; \mbox{under } H_0) = \sup_{P \in \mathcal{P}_0} P(T_{H_0}>t).

Now we have a precise definition of a p-value. As you can see, it is not a conditional probability: there is no prior distribution over \mathcal{P} or \mathcal{P}_0, and the concept of conditional probability is not applied at all. Moreover, p-values of different hypotheses should not be directly compared, since different statistics induce different metrics (Patriota, 2013).

The p-value is built over the best probability measure stated in H_0. Thus, if the event \{T > t\} has a very small probability even under the best case described in H_0, then all probability measures described in H_0 must be rejected.

Notice that, if at least one probability measure listed in the null hypothesis is adequate to explain the observed data, then we should not reject the null. In order to reject the null hypothesis, we must pick the best choice on the list and verify whether the set \{T>t\} has low probability, as the sketch below illustrates. This procedure has little to do with integrals over the null parameter space.
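To make the procedure concrete, here is a minimal Python sketch for a finite null family and a finite sample space. The function and argument names are mine, for illustration only; each candidate measure is represented by a function returning the probability of a single sample point.

# Minimal sketch: the p-value as a sup over a finite null family.
def p_value(null_family, sample_space, statistic, t_obs):
    """sup over P in the null family of P(T > t_obs)."""
    tail_probs = []
    for P in null_family:
        # P({x : T(x) > t_obs}) under this candidate measure
        tail_probs.append(sum(P(x) for x in sample_space if statistic(x) > t_obs))
    # over a finite family the sup is attained, so max suffices
    return max(tail_probs)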

Example

It is easier to see this by taking the full family to be \mathcal{P} = \{P_0, P_1, P_2, P_3\} and the restricted null set to be \mathcal{P}_0 = \{P_0, P_1\}; that is, we want to verify whether P_0 or P_1 can fit the data (H_0: P \in \mathcal{P}_0). For simplicity, consider a Bernoulli experiment with P_0(S)=0.1, P_1(S)=0.4, P_2(S)=0.6 and P_3(S)=0.9, where S denotes the success event.

Consider an independent Bernoulli sample of size n=5, and assume that we observed the vector x = (1,1,1,1,0). The likelihood function is:

L(p,x) = p^{\sum x}(1-p)^{n-\sum x}.
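For the observed x we have \sum x = 4 and n = 5, so the likelihood reduces to L(p,x) = p^{4}(1-p). Evaluating it at the four candidate values (a worked step, only to show where the numbers below come from):

L(0.1,x) = 0.00009, \quad L(0.4,x) = 0.01536, \quad L(0.6,x) = 0.05184, \quad L(0.9,x) = 0.06561.

Hence the sup over the null family \{0.1, 0.4\} is 0.01536 and the sup over the full family is 0.06561.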

Our null hypothesis is H_0: P \in \{P_0,P_1\}, so we can consider the following (likelihood ratio) statistic:
T(x) = -2\log\bigg( \frac{\sup_{p \in \{0.1, 0.4\}} L(p,x)}{\sup_{p \in \{0.1, 0.4, 0.6, 0.9\}} L(p,x)}\bigg)
and its observed value is t = -2\log\big(\frac{0.01536}{0.06561}\big) = 2.903923. Notice that the only configuration that produces T(x) > t is x = (1,1,1,1,1), so:

1. Considering P_0, the probability of \{T>t\} is

p_0 = P_0(\{x \in \{0,1\}^5: T(x)>t \})= P_0(X = (1,1,1,1,1)) = 0.1^{5}.

2. Considering P_1, the probability of \{T>t\} is

p_1 = P_1(\{x \in \{0,1\}^5: T(x)>t \}) = P_1(X = (1,1,1,1,1))=0.4^5.

According to our definition, the p-value is 0.4^5 = 0.01024. We can guarantee that, for every probability measure listed in \mathcal{P}_0, the probability of observing a statistic T at least as extreme as the observed one is at most 0.4^5.
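To double-check these numbers, here is a short Python sketch that enumerates all 2^5 = 32 outcomes, evaluates the likelihood ratio statistic, and takes the sup of the two tail probabilities. It merely re-derives the figures above; the variable names are mine.

from itertools import product
from math import log

null_ps = [0.1, 0.4]            # success probabilities stated in H_0
full_ps = [0.1, 0.4, 0.6, 0.9]  # the full family

def likelihood(p, x):
    s = sum(x)
    return p**s * (1 - p)**(len(x) - s)

def T(x):
    # likelihood ratio statistic: -2 log(sup_{H_0} L / sup_{full} L)
    num = max(likelihood(p, x) for p in null_ps)
    den = max(likelihood(p, x) for p in full_ps)
    return -2 * log(num / den)

t = T((1, 1, 1, 1, 0))  # observed value: 2.903923...

def tail_prob(p):
    # P({x : T(x) > t}) under an i.i.d. Bernoulli(p) sample of size 5
    return sum(likelihood(p, x) for x in product((0, 1), repeat=5) if T(x) > t)

print(tail_prob(0.1))  # 1e-05   = 0.1**5
print(tail_prob(0.4))  # 0.01024 = 0.4**5
print(max(tail_prob(0.1), tail_prob(0.4)))  # the p-value: 0.01024

Only x = (1,1,1,1,1) satisfies T(x) > t, so the tail probabilities reduce to 0.1^5 and 0.4^5, and the sup is 0.4^5, exactly as computed above.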

It is easy to see that if at least one of P_0(T_{H_0}>t) or P_1(T_{H_0}>t) is not small, then we should not reject H_0. On the other hand, if both P_0(T_{H_0}>t) and P_1(T_{H_0}>t) are small, then we can reject the null, since under both measures the observed data produce an extreme statistic (the notion of "extreme" is governed by the measures restricted to the null). Note that the max operator does this job quite well. P-values are defined to measure the discrepancy between H_0 and the data, so it is pretty clear to me that if at least one choice listed in H_0 is not bad for modelling the observed data, then you should not reject the null.

Here is a task for those of you who still believe in the "conditional" interpretation: define precisely the notation P(T>t | H_0) and provide an interpretation. Does it match the purpose of a p-value?

References:

Gelman, A. (2013). P values and statistical practice. Epidemiology, 24(1), 69-72. DOI: 10.1097/EDE.0b013e31827886f7.

Patriota, A. G. (2013). A classical measure of evidence for general null hypotheses. Fuzzy Sets and Systems. DOI: 10.1016/j.fss.2013.03.007.
