# The p-value is not a probability conditional on the null hypothesis H0

On 13 March 2013, Professor Andrew Gelman wrote a post on his blog about some common misunderstandings of p-values, see here. In the middle of his post, Gelman referred the reader to his paper Gelman (2013) for further discussion of the topic. What I am going to comment on here is the third sentence of that paper, which reads:

“The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research …”

The formal view of the p-value as a probability conditional on the null is NOT mathematically correct, since the concept of conditional probability is not employed in the definition of a p-value. This confusion arises when p-values are informally defined and, as we shall see, a formal definition illuminates the issue.

# Informal definition

I use uppercase letters for random quantities and lowercase letters for observed quantities. Denote the null hypothesis by $H_0$, let $X$ be the data, and let $T$ be a statistic satisfying the following condition: the more incompatible the observed data $x$ are with the null $H_0$, the larger the observed value $t$. The p-value is the probability of observing a statistic $T$ at least as extreme as the observed one, under the null hypothesis. This is commonly stated as

$P(T > t; \mbox{under } H_0) \mbox{ or } P(T>t | H_0).$

Both are mathematically imprecise definitions, since “under $H_0$” and “given $H_0$” were not properly defined. It should be clear that “under the null hypothesis” means that the probability distributions stated in $H_0$ are adequate to fit the data. Notice that if $H_0$ states more than one probability measure, the above definitions are meaningless. Such informal definitions are certainly the main cause of controversies in the literature, and the obvious way to avoid them is to make them formal by using proper tools.

# Formal definition

Consider the statistical model $(\mathcal{X}, \mathcal{F}, \mathcal{P})$, where $\mathcal{X}$ is the sample space, $\mathcal{F}$ is a sigma-field of subsets of $\mathcal{X}$, and $\mathcal{P}$ is a family of probability measures that possibly fit the observed data. The null hypothesis states that a subfamily $\mathcal{P}_{0} \subset \mathcal{P}$ contains probability measures that are not discrepant with the observed data; this is translated into symbols as $H_0: P \in \mathcal{P}_{0}$.

Again, let $X$ be the data and let $T$ be a statistic satisfying the condition above: the more incompatible the observed data $x$ are with the null $H_0$, the larger the observed value $t$. Notice that such a statistic $T$ depends on the null hypothesis; therefore, it should be read as $T_{H_0}$ rather than $T$ (see, for instance, the likelihood ratio statistic). This remark helps us understand some paradoxical features of p-values (for more on this, see Patriota (2013)). As stated above, the p-value is the probability of observing a statistic $T$ at least as extreme as the observed one, under the null hypothesis. This is formally written as

$P(T> t; \mbox{under } H_0) = \sup_{P \in \mathcal{P}_0} P(T_{H_0}>t).$

Now we have a precise definition of a p-value. As you can see, it is not a conditional probability: there is no prior distribution over $\mathcal{P}$ or $\mathcal{P}_0$, and the concept of conditional probability is not applied at all. Moreover, p-values of different hypotheses should not be directly compared, since different statistics induce different metrics (Patriota, 2013).

The p-value is built on the best probability measure stated in $H_0$. Thus, if even under the best case described in $H_0$ the event $\{T > t\}$ has a very small probability, then all probability measures described in $H_0$ must be rejected.

Notice that if at least one probability measure listed in the null hypothesis is adequate to explain the observed data, then we should not reject the null. In order to reject the null hypothesis, we must pick the best choice on the list and verify whether the set $\{T>t\}$ has low probability. This procedure has little to do with integrals over the null parameter space.
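The procedure just described can be sketched in code, assuming the null family is finite so that the supremum becomes a maximum. All function names and the tabulated toy numbers below are mine, purely illustrative:

```python
# Sketch of the procedure: pick the most favourable measure in the null
# family P0 and check whether the exceedance event {T > t} still has
# low probability under it.  Illustrative names, not from the post.

def p_value(t_obs, null_family, exceed_prob):
    """sup over P in P0 of P(T_H0 > t_obs); a max when P0 is finite."""
    return max(exceed_prob(P, t_obs) for P in null_family)

# Toy usage with tabulated exceedance probabilities for two measures.
table = {"P0": 1e-5, "P1": 0.01024}
p = p_value(2.9, table, lambda P, t: table[P])
print(p)   # 0.01024
```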

# Example

It is easier to see this by taking the full family as $\mathcal{P} =\{P_0,P_1, P_2,P_3\}$ and the restricted null set as $\mathcal{P}_0=\{P_0,P_1\}$; that is, we want to verify whether $P_0$ or $P_1$ can fit the data (i.e., $H_0: P \in \mathcal{P}_0$). For simplicity, consider a Bernoulli experiment with success probabilities $P_0(S)=0.1$, $P_1(S)=0.4$, $P_2(S)=0.6$ and $P_3(S)=0.9$.

Consider an independent Bernoulli sample of size $n=5$, and assume that we observed the vector $x = (1,1,1,1,0)$. The likelihood function is:

$L(p,x) = p^{\sum x}(1-p)^{n-\sum x}.$

Our null hypothesis is $H_0: P \in \{P_0,P_1\}$, then we can consider the following (likelihood ratio) statistic:
$T(x) = -2\log\bigg( \frac{\sup_{p \in \{0.1, 0.4\}} L(p,x)}{\sup_{p \in \{0.1, 0.4,0.6, 0.9\}} L(p,x)}\bigg)$
and its observed value is $t= -2\log\big(\frac{0.01536}{0.06561}\big) \approx 2.903923$. Notice that the only configuration that produces $T(x)> t$ is $x=(1,1,1,1,1)$; then:
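These numbers are easy to check by brute force over the $2^5$ possible outcomes; below is a minimal Python sketch (the helper names `lik` and `T_stat` are mine, not from the post):

```python
from itertools import product
from math import log

def lik(p, x):
    """Bernoulli likelihood L(p, x) = p^sum(x) * (1-p)^(n - sum(x))."""
    s = sum(x)
    return p**s * (1 - p)**(len(x) - s)

def T_stat(x, null=(0.1, 0.4), full=(0.1, 0.4, 0.6, 0.9)):
    """Likelihood ratio statistic for H0: P in {P0, P1}."""
    num = max(lik(p, x) for p in null)
    den = max(lik(p, x) for p in full)
    return -2 * log(num / den)

x_obs = (1, 1, 1, 1, 0)
t = T_stat(x_obs)
exceed = [x for x in product((0, 1), repeat=5) if T_stat(x) > t]
print(round(t, 6))   # 2.903923
print(exceed)        # [(1, 1, 1, 1, 1)]
```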

1. Considering $P_0$ we have that the probability of $\{T>t\}$ is

$p_0 = P_0(\{x \in \{0,1\}^5: T(x)>t \})= P_0(X = (1,1,1,1,1)) = 0.1^{5}.$

2. Considering $P_1$ we have that the probability of $\{T>t\}$ is

$p_1 = P_1(\{x \in \{0,1\}^5: T(x)>t \}) = P_1(X = (1,1,1,1,1))=0.4^5.$

According to our definition, the p-value is $0.4^5$. We can guarantee that, for every probability measure listed in $\mathcal{P}_0$, the probability of observing a statistic $T$ at least as extreme as the observed one is at most $0.4^5$.
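This p-value can be verified numerically by enumerating all $2^5$ outcomes, summing $P(X=x)$ over the exceedance set under each null measure, and taking the maximum. A short Python sketch (the helper names are mine, not from the post):

```python
from itertools import product
from math import log

def lik(p, x):
    """Bernoulli likelihood; for a binary vector x this is P_p(X = x)."""
    s = sum(x)
    return p**s * (1 - p)**(len(x) - s)

def T_stat(x, null=(0.1, 0.4), full=(0.1, 0.4, 0.6, 0.9)):
    """Likelihood ratio statistic for H0: P in {P0, P1}."""
    num = max(lik(p, x) for p in null)
    den = max(lik(p, x) for p in full)
    return -2 * log(num / den)

t = T_stat((1, 1, 1, 1, 0))
outcomes = list(product((0, 1), repeat=5))

# P(T > t) under each null measure.
p0 = sum(lik(0.1, x) for x in outcomes if T_stat(x) > t)  # 0.1**5
p1 = sum(lik(0.4, x) for x in outcomes if T_stat(x) > t)  # 0.4**5
p_value = max(p0, p1)
print(p_value == 0.4**5)   # True
```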

It is easy to see that if at least one of $P_0(T_{H_0}>t)$ and $P_1(T_{H_0}>t)$ is not small, then we should not reject $H_0$. On the other hand, if both $P_0(T_{H_0}>t)$ and $P_1(T_{H_0}>t)$ are small, then we can reject the null, since under both measures the observed data produce an extreme statistic (the notion of “extreme” is governed by the measures restricted to the null). Note that the max operator does this job quite well. P-values are defined to detect a discrepancy between $H_0$ and the data, so it is pretty clear to me that if at least one choice listed in $H_0$ is not bad for modelling the observed data, then you should not reject the null.

Here is my task for those of you who still believe in the “conditional” interpretation: define precisely the notation $P(T>t \mid H_0)$ and provide an interpretation. Does it match the purpose of a p-value?

# References

Gelman, A. (2013). P values and statistical practice. Epidemiology, 24(1), 69–72. doi: 10.1097/EDE.0b013e31827886f7.

Patriota, A. G. (2013). A classical measure of evidence for general null hypotheses. Fuzzy Sets and Systems. doi: 10.1016/j.fss.2013.03.007.