In this article, we discuss the exactness property of permutation tests, which is closely related to how pp-values are computed from the permutations. This article is a summary of the paper by Phipson and Smyth (2010).

Traditional formulation of a statistical test

A statistical test aims at determining whether some observed data can be considered to be strong evidence in favor of a so-called alternative hypothesis HaH_a compared to a so-called null hypothesis H0H_0. To do that, a test statistic TT is defined such that:

  • its observed value under H0H_0 can be computed once you observed some data;
  • large values of the statistic are evidence in favor of HaH_a.
  • you know (an approximation of) the distribution of TT under the null hypothesis, also known as the null distribution.

Once such a test statistic is available and we observe some data, we can denote by tobst_\mathrm{obs} the value of the test statistic computed from the observed data and define the so-called p-value as the null hypothesis tail probability: p=H0(Ttobs). p_\infty = \mathbb{P}_{H_0} \left( T \ge t_\mathrm{obs} \right). The p-value pp_\infty is by definition uniformly distributed on (0,1)(0,1) under the null hypothesis. Hence, we can define the so-called significance level α(0,1)\alpha \in (0,1) and decide to reject H0H_0 in favor of H1H_1 when pαp_\infty \le \alpha. By doing this, the probability of wrongly rejecting H0H_0, also known as the probability of type I errors, is simply: H0(pα)=α. \mathbb{P}_{H_0} \left( p_\infty \le \alpha \right) = \alpha. The significance level α\alpha therefore matches by design the probability of type I errors, which means that choosing α\alpha allows to control the probability of type I errors. We say that the test is exact.

When the null distribution of TT is not known, you do not have access to pp_\infty. A possible solution is then to resort to resampling techniques to approach the null hypothesis. Once you get an approximate null distribution, the question is how do you compute a p-value that provides an exact statistical test. There are two approaches to this problem: the first estimates the true p-value pp_\infty from the approximate null distribution while the second proposes an alternative definition of the p-value that can be straightforwardly computed. Let us expand on both approaches.

The two-sample problem in a permutational framework

We start with two samples X1,,Xnxiid𝒟(θx)X_1, \dots, X_{n_x} \stackrel{iid}{\sim} \mathcal{D}(\theta_x) and Y1,,Ynyiid𝒟(θy)Y_1, \dots, Y_{n_y} \stackrel{iid}{\sim} \mathcal{D}(\theta_y). We want to know whether the two distributions are the same or not on the basis of the two samples we collected. In this parametric setting, it boils down to testing the following hypotheses: H0:θx=θyvsθxθy. H_0: \theta_x = \theta_y \quad \mbox{vs} \quad \theta_x \neq \theta_y. Let TT be a statistic that depends on the two samples which is suited for elucidating this test, i.e.:

  • you can compute its observed value under the null hypothesis once you observed some data;
  • large values of the statistic are evidence in favor of the alternative hypothesis.

Now, for performing the test, one should also know (an approximation of) the distribution of TT under the null hypothesis, also known as the null distribution, from which the p-value associated to this test can be computed for instance.

If one knows the exact null distribution, then there is no need to resort to permutations. However, if the null distribution is not known, permutations come in handy for approaching it.

The idea is that, under the null hypothesis, the two samples come from the same distribution. Hence, we can pull them together as one big sample of size n=nx+nyn = n_x + n_y generated by that common distribution. At this point, we can split the pooled sample into two random subsets of size nxn_x and nyn_y respectively, and use them to compute a value of the statistic TT. If we repeat many times this splitting strategy, say mm times, we end up with mm values of the statistic from which we can compute the empirical distribution, known as the permutation distribution, which approaches the null distribution.

Permutation p-value as an unbiased estimator of pp_\infty

Let tobst_\mathrm{obs} be the value of the statistic computed from the original two samples, mm be the number of permutations used to approach the null distribution and BB be a random variable that counts the number of test statistic values greater than or equal to tobst_\mathrm{obs}.

By definition, the random variable BB follows a binomial distribution of size mm and rate of success pp_\infty. Hence, we can define the following unbiased estimator of pp_\infty: p̂=Bm. \widehat{p_\infty} = \frac{B}{m}.

However, when one uses this estimator of the p-value for the purpose of hypothesis testing, the resulting test is not exact. Let us investigate why.

First, remember that the true p-value pp_\infty is a random variable itself, in the sense that its value changes as soon as tobst_\mathrm{obs} changes i.e. each time the whole experiment is reconducted. Hence, the probability of wrongly rejecting the null hypothesis using p̂\widehat{p_\infty} reads: (p̂α)=(p̂α|p)fp(p)dp=01(p̂α|p)dp, \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \int_\mathbb{R} \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) f_{p_\infty}(p) dp = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) dp, because pp_\infty is uniformly distributed on (0,1)(0,1) under the null hypothesis.

Next, notice that p̂\widehat{p_\infty} can only take on a finite set of values {0,1m,2m,,m1m,1}\left\{ 0, \frac{1}{m}, \frac{2}{m}, \dots, \frac{m-1}{m}, 1 \right\}. Hence, we have for any b0,1,,mb \in 0, 1, \dots, m: (p̂=bm)=01(p̂=bm|p)dp=01(mb)pb(1p)mbdp=1m+1. \mathbb{P} \left( \widehat{p_\infty} = \frac{b}{m} \right) = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} = \left. \frac{b}{m} \right| p \right) dp = \int_0^1 \binom{m}{b} p^b (1-p)^{m-b} dp = \frac{1}{m + 1}.

We can therefore deduce that: (p̂α)=mα+1m+1α. \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \frac{\lfloor m \alpha \rfloor + 1}{m + 1} \neq \alpha.

The following R code shows graphically that using p̂\widehat{p_\infty} as p-value does not provide an exact test:

alpha <- seq(0.01, 0.1, by = 0.01)
m <- c(10, 100, 1000)
p1 <- crossing(alpha, m) %>% 
  mutate(
    p = (floor(m * alpha) + 1) / (m + 1), 
    mf = paste("m =", m)
  ) %>% 
  ggplot(aes(alpha, p, color = mf)) + 
  geom_point() + 
  geom_abline(aes(intercept = 0, slope = 1)) + 
  labs(
    x = "Significance level",
    y = "Probability of wrongly rejecting H0"
  ) + 
  facet_wrap(vars(mf)) + 
  scale_color_viridis_d() + 
  scale_y_continuous(limits = c(0, 0.1)) + 
  coord_equal() + 
  theme_bw()
fig <- p1 %>% 
  plotly::ggplotly() %>% 
  plotly::hide_legend()
htmlwidgets::saveWidget(
  widget = fig, 
  file = "exactness-fig1.html", 
  selfcontained = rmarkdown::pandoc_available("1.12.3")
)
htmltools::tags$iframe(
  src = "exactness-fig1.html",
  scrolling = "no", 
  seamless = "seamless",
  frameBorder = "0",
  width = "100%", 
  height = 400
)

Permutation p-value as the tail probability of a resampling distribution

An alternative strategy is to define the p-value by looking at the random variable BB instead of the test statistic TT. While the p-value is the tail probability of the null distribution, in the context of randomization tests, we can define it as the tail probability of the distribution of BB. Given a fixed number mm of sampled permutations, recall that the random variable BB counts the number of test statistic values larger than or equal to tobst_\mathrm{obs}. Hence, an alternative equivalent definition of the p-value is given by the so-called exact permutation p-value: pe=H0(Bb), p_e = \mathbb{P}_{H_0} \left( B \le b \right), where bb is the observed number of test statistics larger than or equal to tobst_\mathrm{obs} (using the observed sample of permutations that was drawn).

Let BtB_t be a random variable that counts the total number of possible distinct test statistic values exceeding tobst_\mathrm{obs} and recall that mtm_t is the total number of possible distinct permutations. We denote by pt=Bt+1mt+1, p_t = \frac{B_t + 1}{m_t + 1}, the permutation p-value when the exhaustive list of all permutations is used.

As we have seen before, it is straightforward to show that BtB_t follows a discrete uniform distribution on the integers 0,,mt0, \dots, m_t and that, conditional on Bt=btB_t = b_t, the random variable BB follows a binomial distribution of size mm and rate of success ptp_t. We can thus write: pe=bt=0BtH0(Bb|Bt=bt)H0(Bt=bt)=1mt+1bt=0BtFB(b;m,bt+1mt+1), p_e = \sum_{b_t=0}^{B_t} \mathbb{P}_{H_0} \left( B \le b | B_t = b_t \right) \mathbb{P}_{H_0} \left( B_t = b_t \right) = \frac{1}{m_t + 1} \sum_{b_t=0}^{B_t} F_B \left( b; m, \frac{b_t + 1}{m_t + 1} \right), where FB(;m,bt+1mt+1)F_B \left( \cdot; m, \frac{b_t + 1}{m_t + 1} \right) is the cumulative probability function of the binomial distribution of size mm and probability of success bt+1mt+1\frac{b_t + 1}{m_t + 1}.

This can be computationally intense to compute for large values of mtm_t, in which case one might use the following integral approximation: peb+1m+100.5mt+1FB(b;m,pt)dpt. p_e \approx \frac{b+1}{m+1} - \int_0^{\frac{0.5}{m_t+1}} F_B (b; m, p_t) dp_t.

This approximation shows that the exact p-value pep_e is upper bounded by pu=b+1m+1, p_u = \frac{b+1}{m+1}, which happens to be the exact p-value in the case of sampling permutations without replacement.

Comparison by empirical evidence

In flipr, you perform a permutation test using the estimator p̂\widehat{p_\infty} of the p-value pp_\infty by setting type == "estimate". This provides a non-exact test but an unbiased estimate of pp_\infty. You perform a permutation test using the permutation p-value pep_e by setting type == "exact". This provides an exact test. You also have the possibility of using the upper bound p-value pup_u using type == "upper_bound".

The following R code runs simulations to empirically estimate the probability of wrongly rejecting the null hypothesis. The generative model for both samples is the standard normal distribution. Sample sizes are set to n1=n2=5n_1 = n_2 = 5. We draw 2020 permutations for each test. We used a significance level of 5%5\%.

# General setup
nreps <- 1e4
n1 <- 5
n2 <- 5
set.seed(12345)
sim <- map(sample.int(.Machine$integer.max, nreps, replace = TRUE), ~ {
    list(
      x = rnorm(n = n1, mean = 0, sd = 1),
      y = rnorm(n = n2, mean = 0, sd = 1),
      s = .x
    )
  })

# Cluster setup
cl <- makeCluster(detectCores(logical = FALSE))
clusterEvalQ(cl, {
  library(tidyverse)
  library(flipr)
  null_spec <- function(y, parameters) {
    map(y, ~ .x - parameters)
  }
  stat_functions <- list(stat_t)
  stat_assignments <- list(delta = 1)
  nperms <- 20
  alpha <- 0.05
})

alpha_estimates <- pbapply::pblapply(sim, function(.l) {
    pf <- PlausibilityFunction$new(
      null_spec = null_spec,
      stat_functions = stat_functions,
      stat_assignments = stat_assignments,
      .l$x, .l$y,
      seed = .l$s
    )
    pf$set_nperms(nperms)
    pf$set_alternative("left_tail")
    pf$set_pvalue_formula("exact")
    pv_exact <- pf$get_value(0)
    pf$set_pvalue_formula("upper_bound")
    pv_upper_bound <- pf$get_value(0)
    pf$set_pvalue_formula("estimate")
    pv_estimate <- pf$get_value(0)
    c(
      exact       = pv_exact       <= alpha,
      upper_bound = pv_upper_bound <= alpha,
      estimate    = pv_estimate    <= alpha
    )
  }, cl = cl) %>%
  transpose() %>%
  simplify_all() %>%
  map(mean)
stopCluster(cl)
as_tibble(alpha_estimates)
#> # A tibble: 1 × 3
#>    exact upper_bound estimate
#>    <dbl>       <dbl>    <dbl>
#> 1 0.0507      0.0507   0.0978

This simulation provides numerical evidence that using the permutation p-value pep_e yields an exact test. This is by design since pep_e is a genuine probability hence it is uniformly distributed on (0,1)(0,1) under the null hypothesis. We also observe that the upper bound pup_u is very close to the exact p-value pep_e.

Of course, one might criticize this simulation in which the number of permutations mm has been chosen small on purpose to prove our point. In effect, when mm is larger, all formulae to compute the p-value yield an asymptotically exact test. Quoting Phipson and Smyth (2010), while this is true, the urgent need to avoid small mm disappears when the exact p-value pep_e is used in place of p̂\widehat{p_\infty}, because pep_e ensures a valid statistical test regardless of the sample sizes or the number of permutations. When exact p-values are used, the only penalty for choosing mm small is a loss of statistical power to reject the null hypothesis. This is as it should be: more permutations should generally provide greater power.

References

Phipson, Belinda, and Gordon K Smyth. 2010. “Permutation p-Values Should Never Be Zero: Calculating Exact p-Values When Permutations Are Randomly Drawn.” Statistical Applications in Genetics and Molecular Biology 9 (1). https://doi.org/10.2202/1544-6115.1585.