On 13 March 2013, Professor Andrew Gelman wrote a post on his blog about some common misunderstandings of p-values. In the middle of his post, Gelman referred the reader to his paper Gelman (2013) for more discussion of the topic. What I want to comment on here is the third sentence of this paper, which reads:

“The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research …”

The formal view of the p-value as a probability conditional on the null is **NOT** mathematically correct, since the concept of conditional probability is not employed in the definition of a p-value. This confusion arises when p-values are defined informally and, as we shall see, a formal definition illuminates the issue.

**Informal definition**

I use uppercase letters for random quantities and lowercase letters for observed quantities. Define the null hypothesis by $H_0$ and let $X$ be the data and $T(X)$ be a statistic satisfying an initial condition: the more incompatible the data are with the null $H_0$, the larger $T$ is. The p-value is the probability of observing another statistic $T(X)$ at least as extreme as the observed one, $t = T(x)$, under the null hypothesis. This is commonly stated as

$$p = P(T > t;\ \text{under } H_0) \quad \text{or} \quad p = P(T > t \mid H_0).$$

They are both mathematically imprecise definitions, since “under $H_0$” or “given $H_0$” were not properly defined. It should be clear that “under the null hypothesis” means that the probability distributions stated in $H_0$ are adequate to fit the data. Notice that if $H_0$ states more than one probability measure, the above definitions are meaningless. Informal definitions of this type are certainly the main cause of the controversies in the literature, and the obvious way to avoid them is to make the definitions formal by using proper tools.

**Formal definition**

Consider the statistical model $(\mathcal{X}, \mathcal{F}, \mathcal{P})$, where $\mathcal{X}$ is the sample space, $\mathcal{F}$ is the Borel sigma-field and $\mathcal{P}$ is a family of probability measures that possibly fit the observed data. The null hypothesis states that a subfamily $\mathcal{P}_0 \subset \mathcal{P}$ contains probability measures that are not discrepant with the observed data; this is mathematically translated into the symbols $H_0: P \in \mathcal{P}_0$.

Again: let $X$ be the data and $T(X)$ be a statistic satisfying an initial condition: the more incompatible the data are with the null $H_0$, the larger $T$ is. **Notice that** such a statistic should depend on the null hypothesis; therefore, it should be read as $T_{H_0}$ rather than $T$ (see, for instance, the likelihood ratio statistic). This latter remark helps us to understand some *paradoxical* features of p-values (for more on this see Patriota (2013)). As mentioned above, the p-value is the probability of observing another statistic at least as extreme as the observed one, under the null hypothesis. This is formally written as

$$p = \sup_{P \in \mathcal{P}_0} P\big(T_{H_0}(X) > t\big), \qquad t = T_{H_0}(x).$$

Now we have a precise definition of a p-value. As you can see, it is not a conditional probability, since there is no prior distribution over $\mathcal{P}$ or $\mathcal{P}_0$ and the concept of conditional probability is not applied at all. Moreover, p-values of different hypotheses should not be directly compared, since different statistics induce different metrics (Patriota, 2013).
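To make the dependence of the statistic on the null concrete, here is one common way of writing the likelihood ratio statistic mentioned above. The symbol $L(P; x)$, for the likelihood of the measure $P$ at the observed data $x$, is notation assumed here for illustration.

$$T_{H_0}(x) = -2 \log \frac{\sup_{P \in \mathcal{P}_0} L(P; x)}{\sup_{P \in \mathcal{P}} L(P; x)}$$

The numerator is maximised over the null family only, so changing the null changes the statistic itself, not just the measures used to evaluate its tail probability.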

The p-value is built over the best probability measure stated in $H_0$. Thus, if even in the best case described in $H_0$ we have a very small probability for the event $\{T_{H_0} > t\}$, then all probability measures described in $H_0$ must be rejected.

Notice that, if at least one probability measure listed in the null hypothesis is adequate to explain the observed data, then we should not reject this null. In order to reject the null hypothesis, we must pick the best choice on the list and verify if the set {T>t} has low probability. This procedure has little to do with integrals over the null **parameter** space.

**Example**

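It is easier to see this by taking the full family as $\mathcal{P} = \{P_0, P_1, P_2\}$ and the restricted null set as $\mathcal{P}_0 = \{P_1, P_2\}$, i.e., we want to verify whether $P_1$ or $P_2$ can fit the *data* (i.e., $H_0: P \in \{P_1, P_2\}$). For simplicity, consider the Bernoulli experiment with $P_0 \equiv \mathrm{Ber}(\theta_0)$, $P_1 \equiv \mathrm{Ber}(\theta_1)$, and $P_2 \equiv \mathrm{Ber}(\theta_2)$.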

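Consider an independent Bernoulli sample of size $n = 5$ and assume that we observed from the experiment the vector $x = (1, 1, 1, 1, 0)$. The likelihood function is:

$$L(\theta; x) = \prod_{i=1}^{5} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{4} (1 - \theta).$$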

1. Considering $P_1$, we have that the probability of $\{T > t\}$ is $p_1 = P_1(T > t)$.

2. Considering $P_2$, we have that the probability of $\{T > t\}$ is $p_2 = P_2(T > t)$.

According to our definition, the p-value is $p = \max\{p_1, p_2\}$. We can guarantee that, for all probability measures listed in $H_0$, the probability of observing a statistic $T$ at least as extreme as the observed one is at most $p$.
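The computation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three success probabilities (0.9, 0.5, 0.6) are hypothetical values chosen for the sketch, not the ones used in the example, and the likelihood ratio statistic is assumed as the choice of $T$. The sample space is small enough to enumerate all $2^5$ outcomes exactly.

```python
from itertools import product
from math import log

# Hypothetical success probabilities (assumed for illustration only).
THETA = {"P0": 0.9, "P1": 0.5, "P2": 0.6}  # full family
NULL = ("P1", "P2")                        # measures listed in the null


def likelihood(theta, x):
    """Bernoulli likelihood of the 0/1 sequence x under success probability theta."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)


def lr_statistic(x):
    """Likelihood ratio statistic: larger values mean x is less compatible with the null."""
    best_null = max(likelihood(THETA[name], x) for name in NULL)
    best_full = max(likelihood(theta, x) for theta in THETA.values())
    return -2 * log(best_null / best_full)


def tail_probability(theta, t_obs, n):
    """Exact probability of {T > t_obs} under Ber(theta), by enumerating all samples."""
    return sum(
        likelihood(theta, y)
        for y in product((0, 1), repeat=n)
        if lr_statistic(y) > t_obs
    )


x_obs = (1, 1, 1, 1, 0)
t_obs = lr_statistic(x_obs)

# One tail probability per measure in the null, then the maximum over them.
per_measure = {name: tail_probability(THETA[name], t_obs, len(x_obs)) for name in NULL}
p_value = max(per_measure.values())

print(per_measure)
print(p_value)
```

With these assumed values, the only sample more extreme than the observed one is the all-ones vector, so the p-value is $\max\{0.5^5, 0.6^5\} = 0.07776$. No prior over the null and no conditioning appears anywhere in the computation: each tail probability is evaluated under a single fixed measure, and the maximum picks the measure in the null most favourable to the data.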

It is easy to see that if at least one of $p_1$ or $p_2$ is not small, then we should not reject $H_0$. On the other hand, if both $p_1$ and $p_2$ are small, then we can reject this null, since for both measures the observed *data* produce an extreme statistic (the extreme part is governed by the measures restricted to the null). Note that the max operator does this job quite well. P-values are defined to verify a discrepancy between $H_0$ and the *data*, so it is pretty clear to me that if at least one choice listed in $H_0$ is not bad for modelling the observed data, then you should not reject the null.

**Here is my task for those of you who still believe in the “conditional” interpretation:** Define precisely the notation $P(T > t \mid H_0)$ and provide an interpretation. Does it match the purpose of the p-value?

**References:**

Gelman (2013). P values and statistical practice, *Epidemiology*, 24(1), 69-72. Doi: 10.1097/EDE.0b013e31827886f7.

Patriota (2013). A classical measure of evidence for general null hypotheses, *Fuzzy Sets and Systems*. Doi: 10.1016/j.fss.2013.03.007.