Frequentist hypothesis testing
Posted by Jeremy Minton

I am trying to learn more about frequentist statistics. This post is an attempt for me to clarify some frameworks and mental models to better structure the selection of tools and processes written about in the field. If I’ve made any mistakes, oversights or omissions in my generalisation, please get in touch.

The basic process of hypothesis testing is: data -> test statistic -> p-value -> retain/reject hypothesis

The test statistic is a quantity derived from a sample that can be used to perform a hypothesis test. In a sense it is a numerical summary of the data.

The p-value is the probability of observing a test statistic at least as extreme as the one computed from the sample, assuming the null hypothesis is true. It is more interpretable than the raw test statistic.

The hypothesis is rejected if the p-value falls below a chosen significance level, and retained otherwise.
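
As a concrete illustration of the whole pipeline, here is a minimal sketch using a one-sample t-test of zero mean. The data, null model, and significance level are all placeholder choices for illustration, not anything from the discussion above.

```python
import numpy as np
from scipy import stats

# Placeholder data: twenty observations from some measurement process.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.4, scale=1.0, size=20)

# Test statistic: the one-sample t statistic for H0: mean = 0.
t_stat = data.mean() / (data.std(ddof=1) / np.sqrt(len(data)))

# p-value: probability, under H0, of a statistic at least this extreme
# (two-sided, using the t distribution with n - 1 degrees of freedom).
p_value = 2 * stats.t.sf(abs(t_stat), df=len(data) - 1)

# Decision: reject H0 when the p-value falls below the significance level.
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < alpha}")
```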

Test statistic

A mapping from observations to a scalar:

\[T: \mathcal{X}^n \rightarrow \mathbb{R}.\]
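
Concretely, a test statistic is nothing more than a function from a sample to a scalar. A sketch with two hypothetical examples, one classical and one ad hoc:

```python
from typing import Callable

import numpy as np

# A test statistic is a map T: X^n -> R; sample in, scalar out.
TestStatistic = Callable[[np.ndarray], float]

def standardised_mean(x: np.ndarray) -> float:
    """The classical one-sample t statistic."""
    return float(x.mean() / (x.std(ddof=1) / np.sqrt(len(x))))

def sample_range(x: np.ndarray) -> float:
    """A less conventional, but equally valid, scalar summary."""
    return float(x.max() - x.min())
```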

The distribution of the test statistic must be calculable under the sampling distribution, but beyond this there seem to be no further requirements, although some useful properties include:

  1. completeness
  2. consistency
  3. sufficiency
  4. unbiasedness
  5. minimum mean square error
  6. low variance
  7. robustness
  8. computational convenience.

With this breadth of choice and modern compute, I assume almost any statistical model could be constructed for the experimental setup and almost any statistic of interest chosen.
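
As a sketch of what this could look like in practice, the following Monte Carlo approach approximates a right-tailed p-value for an arbitrary statistic under a fully specified null model, sidestepping any analytical derivation. The null model (standard normal) and statistic (the sample median) are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_p_value(statistic, observed, null_sampler, n_sims=10_000):
    """Approximate a right-tailed p-value by simulating the statistic's
    null distribution rather than deriving it analytically."""
    t_obs = statistic(observed)
    t_null = np.array(
        [statistic(null_sampler(len(observed))) for _ in range(n_sims)]
    )
    return float(np.mean(t_null >= t_obs))

# Illustrative setup: H0 says the data are standard normal, and the
# statistic is the sample median, whose null distribution is awkward
# to derive analytically.
observed = rng.normal(0.5, 1.0, size=30)
p = simulated_p_value(np.median, observed, lambda n: rng.normal(0.0, 1.0, n))
print(f"simulated p-value: {p:.3f}")
```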

Alas, the beginner texts and courses I have found seem dominated by the classical z-, t- and chi-squared tests, at the expense of this general case. While these are broadly applicable, especially in the experimental and population sciences, this focus seems to miss a more abstract understanding and a broader solution space.

P-value and rejecting the hypothesis

The p-value is a mapping of the test statistic to a probability space, for interpretability.

It is defined as “the probability that a new sample will be at least as extreme as the observed sample” under the null hypothesis; or,

\[\textrm{p-value} = \int_{\Gamma_{T(\mathbf{\bar{X}})}} p\left(T(\mathbf{X}) = \psi | H_0\right)d\psi,\]

where \(p\) is the probability density function of \(T(\mathbf{X})\), the test statistic of a randomly drawn sample, and \(\Gamma_{T(\mathbf{\bar{X}})}\) is the critical region: the test statistic values defined to be at least as extreme as the sample test statistic, \(T(\mathbf{\bar{X}})\).

This definition has two key parts. First, the probability density function, \(p\), of the test statistic on another sample. This requires an appropriate sampling distribution to be selected for the data generation process. The distribution of the test statistic must then be calculable, either exactly or approximately, under this sampling distribution.

Second, the critical region, \(\Gamma_{T(\mathbf{\bar{X}})}\). It is the region of the test statistic that is at least as extreme as the test statistic of the observed sample. In the general case this means applying a partial ordering to the range of the test statistic mapping. For the practical cases of one-sided right- and left-tailed test statistic distributions, this means the region greater than or less than the sample statistic, respectively. It is a little more complex for two-sided test statistics, where the critical region lies above one threshold or below another. I do not know of any more complex examples than this, although it would not be difficult to construct a (likely useless) test statistic with a more complex critical region.
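
A sketch of these three common critical regions, assuming a standard normal null distribution (an illustrative choice) so the two-sided region can be folded using symmetry:

```python
from scipy import stats

t_obs = 1.8  # illustrative value for the sample test statistic

# Right-tailed: the critical region is {t : t >= t_obs}.
p_right = stats.norm.sf(t_obs)

# Left-tailed: the critical region is {t : t <= t_obs}.
p_left = stats.norm.cdf(t_obs)

# Two-sided: the critical region is {t : |t| >= |t_obs|}; for a
# symmetric null this folds into twice the upper-tail probability.
p_two_sided = 2 * stats.norm.sf(abs(t_obs))

print(p_right, p_left, p_two_sided)
```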

The hypothesis is then rejected if the p-value falls below the chosen significance level, and retained otherwise.

Future posts

  1. Work an example of a classic parametric test.
  2. Investigate what non-parametric tests are.
  3. Work an example of a generalised likelihood ratio test, and find out what limits exist on the statistical models that determine the likelihoods.
  4. Understand the interplay between the null hypothesis informing the test statistic and the alternative hypothesis defining the critical region.