# Statistical significance

**Statistical significance** is a concept from statistics in which an observed result has a relatively small probability of occurring merely "by chance" and thus serves (through an inference) as evidence in favor of a particular conclusion. It arises in statistical procedures collectively known as hypothesis testing.

## Surprise!

Statistical significance is closely related to the everyday notion of a "surprising" outcome. As an illustration, consider repeatedly spinning a coin on a flat surface to determine whether it is "fair" (i.e., just as likely to come to rest showing "heads" as "tails"). Because modern coins are designed and manufactured to exacting standards, they are generally assumed to be fair. If we spin the coin, say, 20 times and observe that it comes up heads 10 times, then it is clearly reasonable to retain the assumption that the coin is fair, since the observed proportion of heads is exactly one-half. If, however, the coin comes up heads all 20 times (or, indeed, tails all 20 times), then it is clearly reasonable to conclude that it is *not* a fair coin. These two statements are surely intuitively obvious, but what is the basis for this intuition? In particular, what if we observed 7 heads or 15? What should we conclude in those cases?

Our impression of whether the coin is fair is based on how *surprising* the results are. Given the assumption that the coin is fair, getting 10 heads in 20 spins is not surprising at all, since it is exactly the number of heads we "expect" to get from a fair coin. The farther the observed number of heads gets from 10, however, the more surprised we are. Getting 20 heads in 20 spins is so surprising to most people that they would no longer believe that the coin is fair. Somewhere between 10 and 20 heads (or, at the other extreme, somewhere between 10 and 0 heads), there must be a point at which the assumption of fairness becomes untenable. This is the point at which the results become "statistically significant". It is also the point at which the opposite conclusion, that the coin is unfair (or biased), becomes acceptable. (Note, therefore, that the real purpose of the experiment is not to convince us *that the coin is fair*, but the opposite: to convince us *that the coin is biased*. The observed evidence either convinces us or leaves us unconvinced; the experiment can never actually convince us that the coin is fair. This asymmetry is due to issues related to inductive reasoning and the burden of proof.)

Treating our simple example slightly more formally, we can calculate the probability of seeing different numbers of heads in 20 spins of a fair coin (using the binomial distribution, for those who are interested) and use these probabilities to make a decision about *our* coin. The less likely our observed results would be when spinning a fair coin, the less believable the fairness of our coin becomes. Stated more simply, the less likely a fair coin would do what our coin did, the less we believe that we're using a fair coin. The statistical term for the point at which results become significant is the "significance level", denoted mathematically by the Greek letter alpha (α) and typically chosen to be a small probability such as 0.05. That is, if a fair coin has less than a 5% chance of giving the results we observed, we conclude that our coin is biased.
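The binomial calculation mentioned above can be sketched directly. The short program below (a minimal illustration, not part of the original text) computes the two-sided probability that a fair coin would give a result at least as far from 10 heads as the one observed, and compares it to α = 0.05:

```python
from math import comb

def two_sided_p_value(heads: int, n: int = 20, p: float = 0.5) -> float:
    """Probability that a fair coin gives a result at least as far
    from the expected count (n*p) as the observed number of heads."""
    expected = n * p
    deviation = abs(heads - expected)
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if abs(k - expected) >= deviation
    )

alpha = 0.05
for observed in (10, 7, 15, 20):
    pv = two_sided_p_value(observed)
    verdict = "significant: conclude biased" if pv < alpha else "not significant"
    print(f"{observed:2d} heads: p = {pv:.6f} -> {verdict}")
```

This also answers the question posed earlier: 7 heads is unremarkable (p ≈ 0.26), while 15 heads is just surprising enough (p ≈ 0.04) to cross the conventional 0.05 threshold.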

## A more real-world example

To see how these ideas are used in real-world studies, consider a randomized, double-blind, comparative experiment designed to test whether a new drug to treat a disease is more effective than an older, standard drug. In other words, patients are randomly assigned to receive either the new drug or the old one, and neither the patients nor the experimenters know which drug they have gotten (although this information is, of course, kept track of through the use of, say, numerical codes). The degree of improvement the new drug provides relative to the old is said to be *statistically significant* if it is so large that it is unlikely to have occurred by chance alone.

To understand what this means, consider the following facts:

- Different patients may respond differently to the same treatment. (There is variability across patients.)
- Some patients may respond better to one drug than another, but which drug is better may be different for different people. (This is another type of patient variability.)
- Even assuming the previous two points, it may be the case that the *average* effectiveness of the new drug if it were given to the entire population of patients with this disease is no better than the *average* effectiveness of the old one if *it* were given to the same population. (There may be no overall benefit to using the new drug over the old.)

Therefore, *if any improvement at all* is observed for the new drug over the old, there are two possible explanations:

- There is actually **no overall benefit** to using the new drug, and the improvement we saw occurred only because we "just happened to" assign a bunch of people who would do better on the new drug to the group that actually received the new drug, and/or we happened to assign a bunch of people who would do worse on the old drug to the group that actually received the old drug. (In other words, the difference was due solely to which patients ended up in which group.)
- There **is an overall benefit** to using the new drug (in the population), and our sample results are simply reflecting this fact.

The first explanation, while not very satisfying, is the one that we must initially assume; we can only accept the second explanation if the statistical evidence is sufficiently strong. As in the coin spinning example, this evidence takes the form of a probability calculation. In particular, we calculate the conditional probability (in this context called a "*p*-value") that we would get similar results (more precisely, results *at least as extreme* as what we observed, using the same sample sizes) *assuming* that there is, in fact, no benefit to the new drug. The smaller this probability is (the larger the amount of the improvement, or the larger the sample sizes used in the experiment), the less convincing explanation #1 becomes, and the more convincing explanation #2 becomes. (If the *p*-value drops below the significance level, α, then the results are designated "statistically significant" and explanation #2 is accepted.)
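One concrete way to compute such a *p*-value is a permutation test, which directly simulates explanation #1: if group assignment alone produced the difference, the group labels are interchangeable, so we can reshuffle them and see how often chance produces a difference at least as large as the one observed. The sketch below uses entirely made-up improvement scores for illustration; the permutation approach is one of several valid ways to carry out such a test:

```python
import random

random.seed(1)

# Hypothetical improvement scores (higher = better); invented for illustration.
new_drug = [7.1, 6.4, 8.0, 5.9, 7.5, 6.8, 7.9, 6.2]
old_drug = [5.8, 6.1, 5.2, 6.6, 5.5, 6.0, 5.1, 6.3]

observed_diff = sum(new_drug) / len(new_drug) - sum(old_drug) / len(old_drug)

# Under explanation #1, group labels don't matter, so reshuffle them
# many times and count how often chance alone yields a difference
# at least as extreme as the one observed.
pooled = new_drug + old_drug
n_new = len(new_drug)
trials = 20_000
at_least_as_extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = (sum(pooled[:n_new]) / n_new
            - sum(pooled[n_new:]) / (len(pooled) - n_new))
    if diff >= observed_diff:
        at_least_as_extreme += 1

p_value = at_least_as_extreme / trials
print(f"observed difference: {observed_diff:.2f}, p = {p_value:.4f}")
```

With these invented data the reshuffled labels almost never reproduce a gap as large as the observed one, so the *p*-value falls well below 0.05 and explanation #2 would be accepted.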

Note, by the way, that just because an "effect" (the benefit of using the new drug over the old, in this case) is large enough to be statistically significant, that doesn't mean it is actually large in an absolute, real-world sense. For example, the new drug might only give a small benefit (that is nonetheless significant because of the sample size used in the experiment) that is outweighed by considerations of cost or side-effects.

In any case, accepting explanation #2, even though one has not actually tested the drug on every member of the population, makes the conclusion an inference and not a logical deduction. The decision could be wrong, even if the results are highly (statistically) significant. Explanation #1 could, in fact, be the truth. This is why any conclusion based on the statistical analysis of data is fundamentally subject to error (the kind of error being discussed here is usually called a "type I error" or "alpha error"). By carefully controlling for other sources of error (experimenter bias, data collection errors, etc.), and performing a valid statistical analysis, one can quantify the *probability* of making such an error — something that ad hoc (and most other unscientific) explanations cannot hope to achieve.
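That the type I error probability really is controlled at α can be checked by simulation. The sketch below (illustrative only, returning to the simpler coin example) repeatedly "spins" a genuinely fair coin 20 times and records how often the test wrongly declares significance; because the binomial distribution is discrete, the long-run false-positive rate comes out somewhat below the nominal 0.05:

```python
import random
from math import comb

random.seed(0)

def two_sided_p_value(heads: int, n: int = 20, p: float = 0.5) -> float:
    """Two-sided binomial p-value for an observed number of heads."""
    expected = n * p
    deviation = abs(heads - expected)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1)
               if abs(k - expected) >= deviation)

# Precompute the p-value for each possible head count.
pvals = {k: two_sided_p_value(k) for k in range(21)}

alpha = 0.05
trials = 20_000
false_positives = 0
for _ in range(trials):
    # Spin a truly fair coin 20 times; any "significant" result here
    # is, by construction, a type I error.
    heads = sum(random.random() < 0.5 for _ in range(20))
    if pvals[heads] < alpha:
        false_positives += 1

rate = false_positives / trials
print(f"simulated type I error rate: {rate:.3f} (nominal alpha = {alpha})")
```

The simulated rate hovers around 0.04, confirming that a valid test bounds the probability of the error discussed above by the chosen α.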