On 13 March 2013, Professor Andrew Gelman wrote a post on his blog about some common misunderstandings of p-values (see here). In the middle of his post, Gelman referred the reader to his paper Gelman (2013) for further discussion of the topic. What I am going to comment on here is the third sentence of that paper, which reads:
“The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research …”
The formal view of the p-value as a probability conditional on the null is NOT mathematically correct, since the concept of conditional probability is not employed in the definition of a p-value. This confusion arises when p-values are defined informally and, as we shall see, a formal definition illuminates the issue.
Informal definition
I use uppercase letters for random quantities and lowercase letters for observed quantities. Define the null hypothesis by $H_0$, let $X$ be the data, and let $T \equiv T(X)$ be a statistic satisfying the following condition: the more incompatible the data $x$ are with the null $H_0$, the larger $t \equiv T(x)$ is. The p-value is the probability of observing a statistic $T$ at least as extreme as the observed one $t$, under the null hypothesis. This is commonly stated as

$$p = P(T \geq t;\ H_0) \qquad \text{or} \qquad p = P(T \geq t \mid H_0).$$
Both are mathematically imprecise definitions, since "under $H_0$" and "given $H_0$" are not properly defined. It should be clear that "under the null hypothesis" means that the probability distributions stated in $H_0$ are adequate to fit the data. Notice that if $H_0$ states more than one probability measure, the above definitions are meaningless. These kinds of informal definitions are certainly the main cause of controversies in the literature, and the obvious way to avoid them is to make them formal by using proper tools.
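To make the ambiguity concrete, here is a minimal illustration of my own (not taken from Gelman's paper): suppose the null lists two candidate measures,

$$H_0:\ P \in \{P_1, P_2\}, \qquad \text{with } P_1(T \geq t) \neq P_2(T \geq t) \text{ in general}.$$

Then the symbol $P(T \geq t \mid H_0)$ does not pick out a single number from the ingredients at hand; it would only do so if extra structure, such as a prior weight on $P_1$ and $P_2$, were supplied.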
Formal definition
Consider the statistical model $(\Omega, \mathcal{F}, \mathcal{P})$, where $\Omega$ is the sample space, $\mathcal{F}$ is the (Borel) sigma-field of subsets of $\Omega$, and $\mathcal{P}$ is a family of probability measures that possibly fit the observed data. The null hypothesis states that a subfamily $\mathcal{P}_0 \subseteq \mathcal{P}$ contains probability measures that are not discrepant with the observed data; this is mathematically translated into the symbols

$$H_0:\ P \in \mathcal{P}_0.$$
Again: let $X$ be the data and $T_{H_0} \equiv T_{H_0}(X)$ be a statistic satisfying the same condition: the more incompatible the data $x$ are with the null $H_0$, the larger $t_{H_0} \equiv T_{H_0}(x)$ is. Notice that such a statistic should depend on the null hypothesis; therefore, it should be read as $T_{H_0}$ rather than $T$ (see, for instance, the likelihood ratio statistic). This latter remark helps us to understand some paradoxical features of p-values (for more on this, see Patriota (2013)). As mentioned above, the p-value is the probability of observing a statistic at least as extreme as the observed one, under the null hypothesis. This is formally written as

$$p(x) = \sup_{P \in \mathcal{P}_0} P\big(T_{H_0} \geq t_{H_0}\big).$$
Now we have a precise definition for a p-value. As you can see, it is not a conditional probability, since there is no prior distribution over $\mathcal{P}$ or $\mathcal{P}_0$ and the concept of conditional probability is not applied at all. Moreover, p-values of different hypotheses should not be directly compared, since different statistics induce different metrics (Patriota, 2013).
The p-value is built over the best probability measure stated in $\mathcal{P}_0$. Thus, if even under the best case described in $\mathcal{P}_0$ we have a very small probability for the event $\{T_{H_0} \geq t_{H_0}\}$, then all probability measures described in $\mathcal{P}_0$ must be rejected.
Notice that if at least one probability measure listed in the null hypothesis is adequate to explain the observed data, then we should not reject the null. In order to reject the null hypothesis, we must pick the best choice on the list and verify whether the event $\{T_{H_0} \geq t_{H_0}\}$ has low probability under it. This procedure has little to do with integrals over the null parameter space.
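For a finite null set, the procedure boils down to a maximum of tail probabilities. The sketch below is only an illustration of the definition given above, under the assumption that $\mathcal{P}_0$ is a finite list; the names `tail_prob` and `null_measures` are placeholders, not anything prescribed by the text.

```python
# Minimal sketch: p(x) = sup_{P in P_0} P(T_H0 >= t_obs), here a max over a
# finite list of candidate null measures. `tail_prob(P, t_obs)` is assumed to
# return P(T_H0 >= t_obs) for the measure P; both names are illustrative.

def p_value(t_obs, null_measures, tail_prob):
    return max(tail_prob(P, t_obs) for P in null_measures)
```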
Example
It is easier to see this by taking the full family to be $\mathcal{P} = \{P_\theta : \theta \in (0,1)\}$ and the restricted null set to be $\mathcal{P}_0 = \{P_{\theta_1}, P_{\theta_2}\}$, i.e., we want to verify whether $P_{\theta_1}$ or $P_{\theta_2}$ can fit the data (i.e., $H_0:\ P \in \{P_{\theta_1}, P_{\theta_2}\}$). For simplicity, consider the Bernoulli experiment, with the family $\mathcal{P}$ of Bernoulli measures on $\Omega = \{0,1\}^n$ (equipped with its usual sigma-field $\mathcal{F}$) and the null set $\mathcal{P}_0 = \{P_{\theta_1}, P_{\theta_2}\}$.
Consider an independent Bernoulli sample of size $n = 5$ and assume that we observed the vector $x = (1,1,1,1,0)$. The likelihood function is

$$L(\theta; x) = \theta^{4}(1-\theta).$$
1. Considering $P_{\theta_1}$, the probability of the event $\{T_{H_0} \geq t_{H_0}\}$ is $p_1 = P_{\theta_1}(T_{H_0} \geq t_{H_0})$.
2. Considering $P_{\theta_2}$, the probability of the event $\{T_{H_0} \geq t_{H_0}\}$ is $p_2 = P_{\theta_2}(T_{H_0} \geq t_{H_0})$.
According to our definition, the p-value is $p = \max\{p_1, p_2\}$. We can guarantee that, for all probability measures listed in $\mathcal{P}_0$, the probability of observing a statistic at least as extreme as the observed one is at most $\max\{p_1, p_2\}$.
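As a concrete sanity check, the sketch below computes these quantities with $T = \sum_i X_i$ as the test statistic and illustrative null values $\theta_1 = 0.5$ and $\theta_2 = 0.2$; both the choice of statistic and the two numerical values are my own assumptions for this example, not values fixed by the text.

```python
from math import comb

# Illustrative computation of the max-over-the-null p-value for the Bernoulli
# example. The statistic T = sum(x) and the null values theta_1 = 0.5 and
# theta_2 = 0.2 are assumptions chosen for concreteness.

def tail_prob(theta, n, t_obs):
    """P_theta(T >= t_obs), where T ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(t_obs, n + 1))

x = (1, 1, 1, 1, 0)            # observed sample
n, t_obs = len(x), sum(x)      # n = 5, observed statistic t = 4

p1 = tail_prob(0.5, n, t_obs)  # ~0.1875
p2 = tail_prob(0.2, n, t_obs)  # ~0.0067
p_value = max(p1, p2)          # ~0.1875
```

With these particular (assumed) values, $p_1$ alone is not small, so the max rule would not reject the null, which matches the discussion that follows.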
It is easy to see that if at least one of $p_1$ or $p_2$ is not small, then we should not reject $H_0$. On the other hand, if both $p_1$ and $p_2$ are small, then we can reject this null, since for both measures the observed data produce an extreme statistic (the extremeness is governed by the measures restricted to the null). Note that the max operator does this job quite well. P-values are defined to verify a discrepancy between $H_0$ and the data, so it is pretty clear to me that if at least one choice listed in $\mathcal{P}_0$ is not bad for modelling the observed data, then you should not reject the null.
Here is my task for those of you who still believe in the "conditional" interpretation: define precisely the notation $P(T \geq t \mid H_0)$ and provide an interpretation. Does it match the purpose of the p-value?
References:
Gelman, A. (2013). P Values and Statistical Practice. Epidemiology, 24(1), 69–72. DOI: 10.1097/EDE.0b013e31827886f7.
Patriota, A. G. (2013). A classical measure of evidence for general null hypotheses. Fuzzy Sets and Systems. DOI: 10.1016/j.fss.2013.03.007.