My response to “What do you learn from p=.05?” This example from Carl Morris will blow your mind:

Carl Morris presented three hypothetical scenarios, with different sample sizes, for an election race between two candidates, Mr. Allen and Mr. Backer. A sample of n voters is taken and Y is the number of voters favoring Allen. Letting \theta be the proportion of voters favoring Allen, he would like to test H_0: \theta \leq 0.5 against H_1: \theta > 0.5. The three scenarios are

  1. Y = 15 and n = 20,
  2. Y = 115 and n = 200,
  3. Y = 1046 and n = 2000.

The p-values are about 0.021 in all three scenarios and the confidence intervals (CIs) are:

  1. [0.560,0.940],
  2. [0.506,0.640],
  3. [0.501,0.545].
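As a sanity check, the three p-values can be reproduced with an exact one-sided binomial tail evaluated at \theta = 1/2 (a minimal sketch; the function name is mine):

```python
from math import comb

def binom_tail_half(y, n):
    # exact P(Y >= y) for Y ~ Binomial(n, 1/2), using big-integer arithmetic
    # so that large n (e.g., n = 2000) does not underflow
    return sum(comb(n, k) for k in range(y, n + 1)) / 2**n

for y, n in [(15, 20), (115, 200), (1046, 2000)]:
    print(f"Y = {y}, n = {n}: p = {binom_tail_half(y, n):.4f}")
```

All three tail probabilities come out near 0.021, matching the scenarios above.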

He asked which of the three scenarios is most encouraging to candidate Allen (see the article). Andrew Gelman discussed this on his blog.

I argue here that comparing observed confidence intervals with observed p-values is not appropriate, since confidence intervals are random intervals and, as such, are subject to random variability. Their observed values alone do not mean much without some measure of their dispersion. It is like comparing the observed values of two estimators without regard to their standard errors or other measures of precision. P-values can also be regarded as random variables, so identical observed p-values should be compared together with a measure of their variability.

For instance,  let H_0: \theta \in M_0 be the null hypothesis. A p-value is defined by

p(T(x),M_0) = \sup_{\theta \in M_0} P_{\theta}(T(X) \geq T(x)),

where T(x) is the observed value of the test statistic T(X), X = (X_1, \ldots, X_n) is the random sample, and P_\theta is the joint probability measure of the statistical model.

Define p(x) := p(T(x), M_0); then p(X) is a random variable whose distribution depends on M_0, \theta and n (and, of course, on the adopted statistical model).

It is possible to compute, e.g., E_\theta(p(X)^k) = m(k, \theta). Then, by plugging in an estimate of \theta, we get one possible measure of variability:

m(2, \hat{\theta}) - m(1, \hat{\theta})^2.

Other measures can be implemented by using this method.
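As an illustration of this plug-in idea, the moments m(1, \hat{\theta}) and m(2, \hat{\theta}) can be approximated by Monte Carlo in the binomial election example. This is a sketch under my own choices of simulation settings, using the exact binomial tail at \theta = 1/2 as p(x):

```python
import random
from math import comb

def tail_probs(n):
    # precompute p(y) = P(Y >= y) under theta = 1/2 for y = 0..n (exact big integers)
    c = [comb(n, k) for k in range(n + 1)]
    total = 2**n
    out = [0.0] * (n + 1)
    acc = 0
    for y in range(n, -1, -1):
        acc += c[y]
        out[y] = acc / total
    return out

def p_moments(theta_hat, n, reps=5000, seed=1):
    # Monte Carlo estimates of m(1, theta) and m(2, theta) at theta = theta_hat
    rng = random.Random(seed)
    p = tail_probs(n)
    ps = []
    for _ in range(reps):
        y = sum(rng.random() < theta_hat for _ in range(n))  # Binomial(n, theta_hat) draw
        ps.append(p[y])
    m1 = sum(ps) / reps
    m2 = sum(v * v for v in ps) / reps
    return m1, m2 - m1 * m1  # plug-in mean and variance of p(X)

for y, n in [(15, 20), (115, 200)]:
    mean, var = p_moments(y / n, n)
    print(f"n = {n}: E[p(X)] ~ {mean:.3f}, Var[p(X)] ~ {var:.5f}")
```

The two scenarios have the same observed p-value, but the simulated dispersion of p(X) differs with n, which is the point of the argument above.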

Notice that, if a problem occurs at the first theory level, you go to a meta-theory level to solve it; if a problem occurs at the meta-theory level, you go to a meta-meta-theory level, and so on and so forth.

It is too easy to find “apparent holes” in classical statistical theory, since it is a language with a huge number of concepts that go far beyond probabilistic knowledge. Unfortunately, the general recipe is: “if it appears to be probabilistically incoherent, it must be incoherent in a broad sense and should be avoided”. This recipe is intellectually weak. If you do not use an appropriate language to treat these concepts, which require other, non-probabilistic tools, you are doomed to interpret the classical concepts in a very narrow way, as seems to be the rule nowadays.

Conditional probabilities?

Just a little provocation:

The definition of conditional probability is similar to the definition of division: let x be the number such that P(A \cap B) = P(B)x.

As 0 \leq P(A \cap B) \leq P(B) \leq 1, and provided P(B) > 0, this value x is always a well-defined number between 0 and 1 that can be recovered from P(A \cap B) and P(B). We can understand this number x as the value of a function of the events A and B, x = f(A,B), since this number varies with A and B.

It is possible to show that, for fixed B, f(A,B) is a probability measure in its first argument, that is, f(.,B) is a probability measure: f(\varnothing, B) = 0, f(\Omega, B) = 1 and, if C and D are disjoint sets, then f(C \cup D, B) = f(C,B) + f(D,B). Define P(A|B) = f(A,B).
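These axioms can be checked mechanically on a toy finite space. A minimal sketch, assuming a fair six-sided die as the probability space (my own illustration):

```python
from fractions import Fraction

# A toy finite probability space: a fair six-sided die.
omega = frozenset(range(1, 7))

def P(A):
    return Fraction(len(A), 6)

B = frozenset({2, 4, 6})  # conditioning event with P(B) > 0

def f(A, B):
    # the unique number x with P(A & B) = P(B) * x
    return P(A & B) / P(B)

# f(., B) satisfies the three probability axioms:
assert f(frozenset(), B) == 0                 # f(empty, B) = 0
assert f(omega, B) == 1                       # f(Omega, B) = 1
C, D = frozenset({1, 2}), frozenset({5, 6})   # disjoint events
assert f(C | D, B) == f(C, B) + f(D, B)       # finite additivity
```

Exact rational arithmetic (Fraction) is used so the additivity check is an equality, not a floating-point approximation.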

The interpretation of P(A|B) as “the probability of A given that the event B has occurred” seems to be fictional, since P(A|B) is just a number such that P(A \cap B) = P(B)P(A|B).

One can also begin from “below”, by first defining the conditional probability and then building a joint probability space. In this case, the interpretation of P(A|B) as “the probability of A given that B has occurred” seems more justifiable.

Let us explore this a little more. If you first define the function P(.|B), then you must have a sigma-field of events for the argument of P(.|B): for each fixed B, this function must be a probability measure, hence you must specify the list of sets measurable by P(.|B), i.e., its domain. The symbol “|B” initially seems to mean that the probability P(.|B) was built by using the *information* contained in B, and you could write \mu_B(.) = P(.|B) instead. Well, this is how any type of probability is built: likelihood functions, joint probabilities, marginal probabilities and so forth. The problem is how to justify a probability space for the conditioning events themselves, since they may not be measurable in the probabilistic sense. For instance, the probability measure \mu_B may be built by employing some deterministic laws, such as differential equations; in this case, B contains our knowledge about differential equations, about the relations among the elements of interest, and so on. Can such a B be measurable in terms of probabilities? Some conceptual discussion is needed, and maybe this is not the right place for it.

Well, you want to start from the “conditional” probability \mu_B(.) = P(.|B), which is not really a conditional probability in the usual sense, since B might not be measurable in terms of probability laws, and arrive at a “joint” probability measure.

Let us assume that B is a measurable event in terms of probabilities. You must expand the initial measurable space to build a joint measurable space. First, you build all probability spaces (O_B, F_B, \mu_B) with \mu_B(B) = 1 for every “conditional” set B \in K (a non-pathological sigma-field of the “conditional” sets), and, finally, you must define a probability measure Q(.) over the “conditional” sets B in K. Then you define W(A \& B) to be Q(B)\mu_B(A); naturally, \mu_B(.) and Q(.) must both have special behaviors, otherwise W is not well defined; this is just an informal description.

As we saw, it is much easier to start from “above” than from “below” to build conditional probabilities. Is the interpretation “the probability of A given that the event B has occurred” for P(A|B) a fiction? Well, we can argue that all linguistic artifacts are fictions, even this post, but some are useful and others are not.

PS: It is just a thought provoking note, please do not be angry…

PPS: In mathematics, a definition (according to Suppes' theory) must be eliminable and non-creative. This means that all results obtained with a specific definition should be attainable without it; otherwise, contradictions may emerge from the creative, non-eliminable definition. The sentence “a definition is eliminable” means that its definiendum can be replaced by its definiens. The definition of P(A|B) must comply with these criteria; however, P(A|B) = \frac{P(A \cap B)}{P(B)} is not eliminable, since we cannot substitute the definiendum P(A|B) by a definiens when P(B) = 0. On the other hand, we can define P(A|B) as a number x \in [0,1] such that P(A \cap B) = x P(B).


P. Suppes. Introduction to Logic. Wadsworth International Group (1957).

e-value and s-value: possibility rather than probability measures

Let H be a statistical null hypothesis (it contains statements about the probability distribution of the observable data x). In classical statistics, a p-value can be employed to test this null hypothesis; see, for instance, this post. In the Bayesian paradigm, the posterior distribution is used; however, if H is a sharp hypothesis (i.e., it is formed by a set of measure zero), then the posterior probability of H given the observed data is zero. Let \pi(.|x) be a posterior probability; it is clear that the following sentence is false:

\pi(H|x) = 0 \Rightarrow “H is impossible to occur, given x”.

That is, zero posterior probability does not mean impossibility of the null hypothesis. In order to measure possibility and impossibility of a hypothesis, one needs other measures, e.g., the e-value and the s-value. The former is built under the Bayesian paradigm and the latter under the classical one.

The e-value and the s-value (notation: ev(.|x) and s(.|x), respectively) have the same behavior: they are possibility measures rather than probability measures. They provide a degree of contradiction between the observed data x and the null hypothesis H and have the following interpretations:

1. “s(H|x) = 1 \Rightarrow x does not contradict H”,
2. “s(H|x) = 0 \Rightarrow x fully contradicts H”,
3. “s(H'|x) < s(H''|x) \Rightarrow x contradicts H' more than H''”.

It is possible to have s(H|x) = ev(H|x) = 1 and \pi(H|x) = 0 for the very same data and hypothesis. It just means that the observed data bring information that does not contradict a hypothesis formed by a set of measure zero. For the s-value, if the maximum likelihood estimate lies in the null set, then s(H|x) = 1. For the e-value, if the mode of the posterior distribution lies in the null set, then ev(H|x) = 1. It is straightforward to show that either s(H|x) = 1 or s(\neg H|x) = 1, and the same holds for the e-value, where \neg H is the negation of H.

In order to accept/reject a hypothesis H (assuming that the universe of hypotheses is closed), one should also compute the s/e-value of the negation of H, that is:

4. if s(H|x) = 1 and s(\neg H|x) = a, one can accept H if a is sufficiently small;
5. if s(H|x) = b and s(\neg H|x) = 1, one can reject H if b is sufficiently small;
6. if a (or b) is not sufficiently small, then more data are needed to reach a decision.

By this prescription, one will never accept a hypothesis formed by a set of Lebesgue measure zero (for both the s- and e-values).
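The key behavior can be illustrated numerically. The following toy computation is my own sketch, using a simple relative-likelihood index in the spirit of the s-value rather than the exact definitions in the papers below; the binomial model and the sharp hypothesis H: \theta = 1/2 are also my own choices:

```python
from math import comb

def lik(theta, y, n):
    # binomial likelihood L(theta) for y successes in n trials
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def s_sketch(theta0, y, n):
    # relative likelihood of the sharp hypothesis H: theta = theta0;
    # equals 1 exactly when the MLE y/n lies in the null set
    return lik(theta0, y, n) / lik(y / n, y, n)

# sharp hypothesis H: theta = 1/2, n = 20 trials
print(s_sketch(0.5, 10, 20))  # MLE = 0.5 lies in H -> index is 1.0
print(s_sketch(0.5, 15, 20))  # MLE = 0.75 -> H is partially contradicted
```

Note that a continuous posterior would give \pi(H|x) = 0 in both cases, while this possibility-style index distinguishes them.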


Pereira, C.A.B., Stern, J.M., Wechsler, S. (2008). Can a significance test be genuinely Bayesian? Bayesian Analysis, 3(1), 79-100.

Patriota, A.G. (2013). A classical measure of evidence for general null hypotheses. Fuzzy Sets and Systems, 233, 74-88.

See also the comments on the blog.

A non-parametric statistical test to compare clusters with applications in functional magnetic resonance imaging data

Statistical inference of functional magnetic resonance imaging (fMRI) data is an important tool in neuroscience investigation. One major hypothesis in neuroscience is that the presence or not of a psychiatric disorder can be explained by the differences in how neurons cluster in the brain. Therefore, it is of interest to verify whether the properties of the clusters change between groups of patients and controls. The usual method to show group differences in brain imaging is to carry out a voxel-wise univariate analysis for a difference between the mean group responses using an appropriate test and to assemble the resulting ‘significantly different voxels’ into clusters, testing again at cluster level. In this approach, of course, the primary voxel-level test is blind to any cluster structure. Direct assessments of differences between groups at the cluster level seem to be missing in brain imaging. For this reason, we introduce a novel non-parametric statistical test called analysis of cluster structure variability (ANOCVA), which statistically tests whether two or more populations are equally clustered. The proposed method allows us to compare the clustering structure of multiple groups simultaneously and also to identify features that contribute to the differential clustering. We illustrate the performance of ANOCVA through simulations and an application to an fMRI dataset composed of children with attention deficit hyperactivity disorder (ADHD) and controls. Results show that there are several differences in the clustering structure of the brain between them. Furthermore, we identify some brain regions previously not described to be involved in the ADHD pathophysiology, generating new hypotheses to be tested. The proposed method is general enough to be applied to other types of datasets, not limited to fMRI, where comparison of clustering structures is of interest.

Links: Original paper, free version

On scale-mixture Birnbaum-Saunders distributions


We present for the first time a justification on the basis of central limit theorems for the family of life distributions generated from scale-mixture of normals. This family was proposed by Balakrishnan et al. (2009) and can be used to accommodate unexpected observations for the usual Birnbaum-Saunders distribution generated from the normal one. The class of scale-mixture of normals includes normal, slash, Student-t, logistic, double-exponential, exponential power and many other distributions. We present a model for the crack extensions where the limiting distribution of total crack extensions is in the class of scale-mixture of normals. Moreover, simple Monte Carlo simulations are reported in order to illustrate the results.

link to the article

Bias correction in a multivariate normal regression model with general parameterization


This paper develops a bias correction scheme for a multivariate normal model under a general parameterization. In the model, the mean vector and the covariance matrix share the same parameters. It includes many important regression models available in the literature as special cases, such as (non)linear regression, errors-in-variables models, and so forth. Moreover, heteroscedastic situations may also be studied within our framework. We derive a general expression for the second-order biases of maximum likelihood estimates of the model parameters and show that it is always possible to obtain the second-order bias by means of ordinary weighted least-squares regressions. We enlighten such general expression with an errors-in-variables model and also conduct some simulations in order to verify the performance of the corrected estimates. The simulation results show that the bias correction scheme yields nearly unbiased estimators. We also present an empirical illustration.

Link to the article

Multivariate elliptical models with general parameterization


In this paper we introduce a general elliptical multivariate regression model in which the mean vector and the scale matrix have parameters (and/or covariates) in common. This approach unifies several important elliptical models, such as nonlinear regressions, mixed-effects models with nonlinear fixed effects, errors-in-variables models, and so forth. We discuss maximum likelihood estimation of the model parameters and obtain the information matrix, both observed and expected. Additionally, we derive the generalized leverage as well as the normal curvatures of local influence under some perturbation schemes. An empirical application is presented for illustrative purposes.

Link to the article