A Panorama of Statistics: Perspectives, Puzzles and Paradoxes in Statistics
Eric Sowey, School of Economics, The University of New South Wales, Sydney, Australia
Peter Petocz, Department of Statistics, Macquarie University, Sydney, Australia
This book is a stimulating panoramic tour – quite different from a textbook journey – of the world of statistics in both its theory and practice, for teachers, students and practitioners. At each stop on the tour, the authors investigate unusual and quirky aspects of statistics, highlighting historical, biographical and philosophical dimensions of this field of knowledge. Each chapter opens with perspectives on its theme, often from several points of view. Five original and thought‐provoking questions follow. These aim at widening readers’ knowledge and deepening their insight. Scattered among the questions are entertaining puzzles to solve and tantalising paradoxes to explain. Readers can compare their own statistical discoveries with the authors’ detailed answers to all the questions.
The writing is lively and inviting, the ideas are rewarding, and the material is extensively cross-referenced.
A Panorama of Statistics:
- Leads readers to discover the fascinations of statistics.
- Is an enjoyable companion to an undergraduate statistics textbook.
- Is an enriching source of knowledge for statistics teachers and practitioners.
- Is unique among statistics books today for its memorable content and engaging style.
Lending itself equally to reading through and to dipping into, A Panorama of Statistics will surprise teachers, students and practitioners by the variety of ways in which statistics can capture and hold their interest.
14
The normal distribution: history, computation and curiosities
When we began our statistical studies at university, last century, we bought, as instructed, a booklet of ‘Statistical Tables’. This booklet contained all the standard tables needed for a conventional undergraduate degree programme in statistics, including the ‘percentage points’ of a variety of standard distributions, critical values for various statistical tests, and a large array of random numbers for sampling studies. We were soon made aware of one particular table, titled ‘Areas under the standard normal curve’, and were left in no doubt that we would be referring to it frequently. We used this booklet of tables in classwork throughout our studies, and had clean copies of it issued to us at every statistics examination. From our vantage point today, that booklet of Statistical Tables has become a rather quaint historical artefact from the mid‐20th century, at a time when calculators were the size of typewriters (both of these, too, being artefacts of that era).
That vital standard normal area table was to be found also in the Appendix of every statistics textbook on the market at that time. That is still the case today. It suggests that this printed tabulation from the past is still being consulted by students and, perhaps, also by professional statisticians. Need this still be so?
‐‐‐oOo‐‐‐
Before considering this question, let’s look at the history of the normal distribution and of the construction of this enduring table.
The normal distribution of values of the variable x has probability density function (pdf):
f(x) = (1/(σ√(2π))) exp(–(x – μ)²/(2σ²)),   –∞ < x < ∞,
with parameters μ (the population mean) and σ (the population standard deviation). It is the most commonly met probability distribution in theoretical statistics, because it appears in such a variety of important contexts. If we follow the historical evolution of this distribution through the 18th and 19th centuries, we shall discover these contexts in a (somewhat simplified) sequence.
The normal distribution (though not yet with that name) first appeared in 1733, in the work of the English mathematician Abraham de Moivre. De Moivre discovered that the limiting form of the (discrete) binomial distribution, when the number of trials of the binomial experiment becomes infinitely large, is the pdf that we today call the (continuous) normal distribution. Stigler (1986), pages 70–85, conveys, in some analytical detail, the satisfaction de Moivre gained from his hard‐won discovery.
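De Moivre's limiting result is easy to see numerically. The following sketch is our own illustration, not drawn from the book; it compares exact binomial probabilities for 100 tosses of a fair coin with the matching normal density.

```python
# A minimal sketch comparing the exact binomial pmf with its normal
# approximation (mean np, standard deviation sqrt(np(1-p))) for n = 100, p = 0.5.
import math

def binom_pmf(k, n, p):
    """Exact probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

for k in (40, 45, 50, 55, 60):
    print(f"k={k}: binomial {binom_pmf(k, n, p):.5f}   normal {normal_pdf(k, mu, sigma):.5f}")
```

The agreement is already close at n = 100 and improves as n grows, which is the content of de Moivre's limit.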
In a book on planetary motion, published in 1809, the German mathematician Carl Friedrich Gauss presented his pioneering work on the method of least squares estimation. It was in this context that he proposed the normal distribution as a good theoretical model for the probability distribution of real‐world random errors of measurement. Now, in a context different from de Moivre’s, Gauss rediscovered the pdf of the normal distribution by asking the question (translated into modern terminology): for what symmetric continuous probability distribution is the mean of a random sample the maximum likelihood estimator of the population mean? How he derived the pdf algebraically is sketched in Stigler (1986), pages 139–143.
It’s worth mentioning here that Gauss’s choice of a model for random errors of measurement was not the only candidate historically considered for that purpose. Gauss’s contemporary, the French mathematician Pierre Simon Laplace, had independently been looking for a suitable model since about 1773. However, he tackled the problem in a way that was the converse of Gauss’s. Gauss simply sought the symmetric function whose mean was ‘best’ estimated by the sample mean. Happily, he hit upon a criterion of a ‘best’ estimator that produced a model with other far‐reaching virtues as well.
Laplace’s approach was to search ingeniously among mathematical functions having the desired graphical profile – unimodal, symmetric and with rapidly declining tails – and to worry afterwards about how the parameters would be estimated. During the following dozen years, he came up with several promising models, but they all ultimately proved mathematically intractable when it came to parameter estimation. Some of the candidate functions that Laplace investigated are on view in Stigler (1986), chapter 3.
Though Laplace was familiar with the pdf of the normal distribution from de Moivre’s work, he somehow never thought of considering it as a model for random measurement errors and, after 1785, he turned his attention to other topics.
Then, in early 1810 – shortly after Gauss’s rediscovery of the normal distribution in 1809 – Laplace found the normal distribution turning up in his own work. This was, again, in a different context – his early proof of what we know today (in a more general form) as the Central Limit Theorem (CLT). You will find a statement of the CLT in the Overview of CHAPTER 12. To recapitulate: under very general conditions, the mean of a sample from a non‐normal distribution is approximately normally distributed if the sample size is large.
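A quick simulation makes the CLT's claim concrete. This sketch is our own, with an arbitrary choice of a strongly skewed exponential population and of sample sizes; as n grows, the simulated sampling distribution of the mean becomes more nearly symmetric.

```python
# A minimal sketch of the CLT: sample means from a skewed exponential(1)
# population. As n grows, the gap between the mean and the median of the
# simulated sampling distribution shrinks towards 0 (a rough symmetry check).
import random
import statistics

random.seed(1)

def sample_means(n, reps=10_000):
    """Means of `reps` samples of size n from an exponential(1) population."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

for n in (2, 10, 50, 200):
    means = sample_means(n)
    gap = statistics.fmean(means) - statistics.median(means)
    print(f"n = {n:3d}   mean - median of sample means = {gap:.4f}")
```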
These several important achievements prompted many 19th century mathematicians to call the normal the Gauss‐Laplace distribution. As time passed, however, that name gave way to simply the Gaussian distribution.
The name ‘normal’ for this distribution first came before a wide public when Francis Galton used it in his 1889 book Natural Inheritance (online at [14.1]) – giving chapter 5 the title ‘Normal Variability’, while still occasionally using an earlier name, the ‘law of frequency of error’. He and many of his scientific contemporaries were excited to confirm that biologically determined real‐world variables, such as species‐specific size and weight, are often approximately normally distributed. Thus, yet a fourth context for the significance of this distribution was identified. This last context is, logically, quite closely related to the normal as a model for random errors of measurement, as Gauss had earlier proposed.
Galton was so moved by his and his predecessors’ discoveries that he imbued the normal distribution with an almost mystical character (see QUESTION 14.3). He writes (chapter 5, page 66): ‘I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self‐effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway.’ (See QUESTION 22.1(b) for a 20th century tribute in only slightly less lyrical terms.)
From Galton’s time onwards, English‐speaking statisticians began to use the term ‘normal distribution’ routinely. Continental statisticians, on the other hand, continued for many years to refer to it as the ‘Gaussian distribution’.
For broader historical detail on this major strand in the history of statistics, we recommend the introductory chapter of Patel and Read (1996).
‐‐‐oOo‐‐‐
What about the calculation of normal probabilities? As for any continuous probability distribution, (standard) normal probabilities are represented by areas under the (standard) normal curve. The equation of the standard normal curve is y = (1/√(2π)) exp(–z²/2). There is a practical obstacle to evaluating areas under this curve by straightforward application of integral calculus. This is because (as has long been known) there is no closed‐form expression for the indefinite integral of the right‐hand‐side function in the above equation – that is, no solution in terms only of constants; variables raised to real powers; trigonometric, exponential or logarithmic functions; and the four basic operators + – × ÷.
Happily, there are several alternative algorithms for numerically approximating the definite integral between any two given values of x, and thus finding the corresponding normal area to any desired degree of accuracy. In earlier times, such a calculation would be slowly and laboriously done by hand, with much checking of numerical accuracy along the way. Nowadays, numerical approximation software does the job swiftly and effortlessly.
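To make the point concrete, here is a minimal sketch (not the algorithm behind any historical table) that approximates a standard normal area with Simpson's rule and checks it against the error function built into Python's standard library.

```python
# Approximate the area under the standard normal curve between a and b with
# Simpson's rule, then compare with the value obtained from math.erf.
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_area(a, b, steps=1000):
    """Composite Simpson's rule with an even number of steps."""
    h = (b - a) / steps
    total = phi(a) + phi(b)
    for i in range(1, steps):
        total += phi(a + i * h) * (4 if i % 2 else 2)
    return total * h / 3

print(normal_area(-1.96, 1.96))                                    # about 0.9500
print(0.5 * (math.erf(1.96 / math.sqrt(2)) - math.erf(-1.96 / math.sqrt(2))))
```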
Fortunately, the standard normal area table suffices for evaluating areas under any normal curve, since all normal curves have the same shape, relative to their location and spread. A random variable that has a normal distribution with a general mean μ and general standard deviation σ can be standardised (i.e. transformed into a standard normal random variable, having mean 0 and standard deviation 1) by subtracting μ and dividing by σ.
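In code, standardisation is a one‐line step. Here is a minimal sketch, with made‐up numbers (mean 100, standard deviation 15), of finding P(85 < X < 130) using only the standard normal cumulative area function.

```python
# Standardise a general normal probability so that only the standard normal
# cumulative area function is needed.
import math

def std_normal_cdf(z):
    """Cumulative area under the standard normal curve up to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15           # illustrative values only
a, b = 85, 130

z_a = (a - mu) / sigma        # standardise: subtract mu, divide by sigma
z_b = (b - mu) / sigma
print(std_normal_cdf(z_b) - std_normal_cdf(z_a))   # about 0.819
```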
Normal area tables were first calculated in the late 18th century, and for a variety of purposes. In 1770–71 in Basel, Daniel Bernoulli compiled a table of areas under the function y = exp(–x²/100), essentially a multiple of a normal density function, for approximating binomial probabilities. In 1799 in Strasbourg, Chrétien Kramp prepared a similar table to aid astronomical calculations of refraction. An overview of normal area tables published between 1786 and 1942, with insights on the different algorithms used, is given in David (2005). It is interesting to note that the normal area tables produced in 1903 by the Australian‐English statistician William Sheppard (1863–1936) have remained unsurpassed in scope and accuracy, with later reproductions of these tables differing mainly in their layout.
‐‐‐oOo‐‐‐
Do students of statistics still need booklets of statistical tables?
There is no doubt that they were indispensable until the microcomputer (also called a personal computer or desktop computer) became ubiquitous about 35 years ago. Thereafter, they became merely convenient, but even that is hardly the case today. We can see why by tracing the evolutionary thread of computing devices over the past century. (We are omitting mention of mainframe and minicomputers, because these were not generally available to undergraduate statistics students.)
During this period, the time interval between successive significant innovations in computing technology has been shortening, the computing power of the devices has been growing, and their physical scale has been shrinking – from the ‘transportable’, to the ‘portable’, to the ‘mobile’. For the practising statistician, complex computation without a mechanical calculator was unthinkable in the period 1900–1940. Electrically driven calculators gradually took over in the years 1940–1970. These were, in turn, superseded by hand‐held electronic calculators (first solely hard‐wired, then later programmable) over the period 1970–1985. These devices, with further refinements such as graphics capability, then co‐existed with the evolution of computationally much more powerful, but bulkier, microcomputers (from around 1970) and laptop computers (from around 1990). Smaller‐sized netbooks emerged in 2007, and tablet computers in 2010. Today, mobile ‘smart’ phones and (even smaller) ‘wearable’ devices represent the latest reductions in scale.
From this overview, we see that it was only with the arrival of programmable calculators, around 1980, that devices powerful enough for automated statistical computation first became cheap enough for students to afford one for their own personal use.
Today, web‐enabled and app‐equipped tablets and phones have comprehensively displaced programmable calculators for the standard repertoire of statistical functions and analyses. With steady growth in the range of statistical apps being made available, and seemingly endless expansion of statistical resources (including applets) on the web, these highly mobile personal devices can be very efficient tools for routine statistical computing, including finding ‘areas under the normal curve’.
Moreover, applets and apps introduce two improvements over printed standard normal area tables. The first is direct computation of any area under any normal distribution, using the option of specifying the mean and standard deviation of the required normal distribution. The second is graphical representation of the area calculated – that is, the probability. The first offers only convenient flexibility, but the second is an important aid for students learning about the normal distribution and finding normal probabilities for the first time. As teachers, we know that this step in a first course in statistics is often a real stumbling block; visual representation in this, and many other contexts, can be a significant key to learning.
Here are two examples of these types of software. David Lane’s online textbook of statistics, HyperStat [14.2] includes an applet for finding the normal area corresponding to the value of a standard normal variable (or vice versa) [14.3]. A similar function is available in the modestly‐priced StatsMate app (see [14.4]), developed by Nics Theerakarn for a tablet computer or mobile phone. Both sources also provide routines for a wide array of other statistical calculations, as their websites reveal.
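Readers who work in a programming environment rather than with an app can produce much the same display themselves. Here is a minimal sketch, assuming the scipy and matplotlib packages are installed; the numerical values are illustrative only.

```python
# Compute a normal probability directly for a chosen mean and standard
# deviation, and shade the corresponding area under the curve.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

mu, sigma, a, b = 100, 15, 85, 130
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
y = norm.pdf(x, mu, sigma)

plt.plot(x, y)
mask = (x >= a) & (x <= b)
plt.fill_between(x[mask], y[mask], alpha=0.4)             # the shaded probability
prob = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)
plt.title(f"P({a} < X < {b}) = {prob:.3f}")
plt.show()
```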
We recognise that the limited availability and high expense of modern mobile technology in some countries may preclude access to the convenience this technology offers. For those to whom it is available and affordable, however, the era of statistical tables is, surely, past.
Questions
QUESTION 14.1 (A)
Attributes of the normal distribution, some of which may surprise you.
- As everyone knows, the standard normal distribution has points of inflection at z = –1 and +1. But where do the tangents at these points cut the z‐axis?
- On the standard normal distribution, what is interesting about the z‐value, z = 0.35958?
- Suppose we wish to draw a standard normal distribution accurately to scale on paper so that the curve will be 1 mm above the horizontal axis at z = 6. How large a piece of paper will be required? [Hint: how high will the curve be above the horizontal axis at the mode?]
QUESTION 14.2 (B)
William Sheppard’s (1903) article, ‘New tables of the probability integral’, gives cumulative areas under the standard normal curve to 7 decimal places for values of z from 0.00 to 4.50 and to 10 decimal places for values of z from 4.50 to 6.00. Why did he carry out his calculations to so many decimal places? Can you suggest any situation where knowledge of these values to such accuracy would be useful?
QUESTION 14.3 (B)
In the Overview, we quoted Galton’s lyrical description of the normal distribution as that ‘wonderful form of cosmic order’. Galton’s jubilation came from observing two phenomena. Firstly, that the distribution of measured values of variables which could be interpreted, in some way, as random errors (i.e. deviations from some biological or technical standard, or ‘norm’), seems to be (approximately) normal. And secondly, that the distribution of sample means drawn from non‐normal distributions becomes, as the sample size increases (‘the huger the mob’), more and more like the normal (‘the more perfect its sway’) – which is what the Central Limit Theorem (CLT) declares. However, there are exceptions – measured variables with distributions that do not look normal, and sampling distributions of sample means that do not conform to the CLT. Give examples of these two kinds of exceptions. Can they be brought within the ‘cosmic order’? Do you think Galton was wildly overstating his case?
QUESTION 14.4 (B)
Some non‐normal data have a distribution that looks more like a normal after a logarithmic transformation is applied. Other non‐normal data look more like a normal after a reciprocal transformation is applied. What data characteristic(s), in each case, suggest that the mentioned transformation will be effective?
QUESTION 14.5 (B)
Sketch on the same set of axes the frequency curve of a standard normal distribution and of a chi‐squared distribution with one degree of freedom. Do these two curves intersect? (Note: the chi‐squared distribution with one degree of freedom is the distribution of the square of a single standard normally‐distributed variable.)
References
- David, H.A. (2005). Tables related to the normal distribution: a short history. The American Statistician 59, 309–311.
- Patel, J.K. and Read, C.B. (1996). Handbook of the Normal Distribution, 2nd edition. CRC Press.
- Sheppard, W. (1903). New tables of the probability integral. Biometrika 2, 174–190.
- Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.
ONLINE
- [14.1] Galton, F. (1889). Natural Inheritance. Macmillan. At http://galton.org, click on Collected Works, then on Books.
- [14.2] http://davidmlane.com/hyperstat/index.html
- [14.3] http://davidmlane.com/normal.html
- [14.4] http://www.statsmate.com
15
The pillars of applied statistics I – estimation
There are two pillars of statistical theory, upon which all applied work in statistical inference rests. In this chapter we shall focus on estimation while, in the next chapter, we shall look at hypothesis testing. Among the most famous of past statisticians, Ronald Fisher, Jerzy Neyman and Egon Pearson (whose names appear in FIGURE 22.2) laid the foundations of modern methods of statistical inference in the 1920s and 1930s. They polished procedures for estimation proposed by earlier thinkers, and invented terminology and methods of their own. This was an era of fast‐moving developments in statistical theory.
For relevant historical background, a valuable resource is Jeff Miller’s website at [15.1], titled Earliest Known Uses of Some of the Words of Mathematics. The entry for ‘Estimation’ informs us that the terms ‘estimation’ and ‘estimate’, together with three criteria for defining a good estimator – ‘consistency’, ‘efficiency’ and ‘sufficiency’ – were first used by Fisher (1922), online at [15.2]. Fisher defined the field in a way that sounds quite familiar to us today: ‘Problems of estimation are those in which it is required to estimate the value of one or more of the population parameters from a random sample of the population.’ In the same article, he presented ‘maximum likelihood’ as a method of (point) estimation with some very desirable statistical properties.
Neyman, who subsequently pioneered the technique of interval estimation, referred to it as ‘estimation by interval’, and used the term ‘estimation by unique estimate’ for what we now call point estimation. It was Pearson who introduced the modern expression ‘interval estimation’.
These, and many earlier, pioneers of (so‐called) Classical estimation devised parameter estimators that use sample data alone. In addition to maximum likelihood, Classical techniques include method of moments, least squares, and minimum‐variance unbiased estimation.
The pioneers’ successors pursued more complex challenges.
A salient example is devising estimators that are more efficient than Classical estimators because they synthesise the objective (i.e. factual) information in sample data with additional subjective information (e.g. fuzzy knowledge about parameter magnitudes) that may be available from other sources. For example, if we want to estimate the mean of a Poisson model of the distribution of children per family in Australia, we can expect to do so more efficiently by amending a Classical estimation formula to incorporate the ‘common knowledge’ that the mean number of children per family in Australia is somewhere between 1 and 3.
The most extensive array of techniques for pooling objective and subjective information to enhance the quality of estimation goes by the collective name Bayesian estimation. Methods of Bayesian estimation were first proposed in the 1950s, by the US statistician Leonard Savage. See CHAPTER 20 for an overview of the Bayesian approach.
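As an illustration only (not taken from the book, and with invented data), here is how such pooling looks in the simplest conjugate Bayesian set-up for a Poisson mean: a gamma prior roughly encoding ‘the mean is somewhere between 1 and 3’ is combined with the sample, and the posterior mean becomes the estimate.

```python
# Gamma(alpha, beta) is the conjugate prior for a Poisson mean: the posterior is
# Gamma(alpha + sum(x), beta + n), with posterior mean (alpha + sum(x)) / (beta + n).
alpha, beta = 8.0, 4.0                   # prior mean 2, prior sd about 0.7
data = [1, 2, 0, 3, 2, 1, 2, 4, 1, 2]    # hypothetical children-per-family counts

n, total = len(data), sum(data)
classical = total / n                    # maximum likelihood: the sample mean
bayesian = (alpha + total) / (beta + n)  # pools prior belief with the data

print(f"classical estimate {classical:.3f}, Bayesian (posterior mean) estimate {bayesian:.3f}")
```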
Here is a second example of progress in estimation.
The Classical theory of interval estimation of any parameter – say, a population mean – necessarily assumes a specific model (e.g. normal, exponential, Poisson) for the distribution of the population data. As you may know, constructing a confidence interval for a population mean requires a value both for the sample mean (or whichever other point estimator of the population mean is to be used) and for the standard error of the sample mean (or other estimator). Often, the theoretical standard error involves unknown parameters, so it is necessary, in practice, to work with an estimated standard error. The Classical approach uses the properties of the specific model chosen for the data to derive an estimator of the standard error.
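As a concrete reminder (standard textbook material rather than a quotation from this chapter), the Classical 95% confidence interval for a normal population mean, with the standard error estimated from the sample, takes the form

```latex
\bar{x} \;\pm\; t_{0.025,\;n-1}\,\frac{s}{\sqrt{n}},
\qquad \text{where } s/\sqrt{n} \text{ is the estimated standard error of } \bar{x}.
```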
But how is such a model fixed upon in the first place? It may be suggested by theoretical principles of the field of knowledge within which the data are being analysed, or by the general appearance of the summarised sample data (e.g. a histogram or a scatter diagram).
Yet, what if the field of knowledge has nothing to say on the choice of a model for the data – and what if, moreover, the summarised sample data look quite unlike any of the well‐established statistical models? We would then need to find some way of estimating the standard error of the sample mean without having any explicit model as a basis. In 1979, the US statistician Bradley Efron invented a very effective way of doing just that. His estimation procedure is known as bootstrapping. In the simplest version of this procedure, the sampling distribution of the sample mean is approximated by repeated sampling (termed ‘resampling’) from the original sample. A bootstrapped standard error of the sample mean can be constructed from this distribution and, thus, a bootstrapped confidence interval for the population mean can be obtained. For a fuller, non‐technical explanation (including an illustration of resampling), see Wood (2004) or Diaconis and Efron (1983).
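Here is a minimal sketch of the simplest version of the bootstrap, with made-up data; the resampling scheme follows the description above, while the particular numbers and the 5,000 replications are our own choices.

```python
# Bootstrap the standard error of the sample mean and a 95% percentile
# confidence interval by resampling the original sample with replacement.
import random
import statistics

random.seed(42)
sample = [12.1, 9.8, 15.3, 11.0, 8.7, 13.9, 10.4, 14.2, 9.1, 12.8]

reps = 5000
boot_means = []
for _ in range(reps):
    resample = random.choices(sample, k=len(sample))   # draw with replacement
    boot_means.append(statistics.fmean(resample))

se = statistics.stdev(boot_means)                      # bootstrapped standard error
boot_means.sort()
ci = (boot_means[int(0.025 * reps)], boot_means[int(0.975 * reps)])
print(f"bootstrapped SE = {se:.3f}, 95% percentile CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```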
Given the complexity of the real world, and the ever‐increasing number of fields in which statistical methods are being applied, it is hardly surprising that countless situations have turned up where no well‐defined model for the data is evident, or where statisticians are unwilling to assume one. This explains the enormous growth in popularity of estimation by bootstrapping over the past twenty years.
Lastly, a third direction in which estimation has moved in the post‐Classical period: statisticians’ willingness to use biased estimators.
In Classical estimation, whenever it came to a conflict between the criteria of unbiasedness and efficiency in the choice of a ‘best’ estimator, the unbiased estimator was inflexibly preferred over the more efficient, but biased, one. QUESTION 6.5 illustrates such a conflict.
A more flexible way of resolving this kind of conflict is to see if there is an estimator that compromises between the conflicting criteria – that is, an estimator which is a little biased, yet rather more efficient than the corresponding unbiased estimator. There are several paths to finding such useful biased estimators. One is the method of minimum mean square error (MMSE) estimation. A context in which this is effective is seen in QUESTION 15.4.
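The compromise is usually judged through the mean square error. The decomposition below is standard theory rather than a quotation from this chapter, and it is the quantity that MMSE estimation minimises:

```latex
\operatorname{MSE}(\hat{\theta})
  = E\!\left[(\hat{\theta}-\theta)^{2}\right]
  = \operatorname{Var}(\hat{\theta}) + \bigl[\operatorname{Bias}(\hat{\theta})\bigr]^{2},
\qquad \operatorname{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta.
```

An estimator carrying a small bias can therefore still have a smaller MSE than an unbiased one, provided its variance is sufficiently smaller.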
Unfortunately, the method of MMSE estimation is not immune to breakdown, even in some quite simple contexts. See QUESTION 15.5(a) for an example.
Classical estimation methods break down, too, though more rarely. Maximum likelihood estimation, for instance, fails in any context where the likelihood function increases without limit, and thus has no maximum. A (quite technical) example is given in Konijn (1963).
You have read of paradoxes in earlier chapters. So it should come as no surprise that there are paradoxes to be found (and resolved!) in the theory of estimation as well. There are two in the answer to QUESTION 15.5 (b).
Questions
QUESTION 15.1 (B)
Suppose you are asked this question: ‘I’ve noticed that the best estimator of the population mean is the sample mean; the best estimator of the population median is the sample median; and the best estimator of the population variance is the sample variance. Is that a pattern I can rely on for finding best estimators?’ How would you answer?
QUESTION 15.2 (B)
Given a sample mean and an initial numerical 95% confidence interval for the unknown population mean of a normal distribution, N(μ, σ2), based on that sample mean, what is the probability that a replication (i.e. an independent repetition of the same sampling process) gives a sample mean that falls within the initial confidence interval? [For simplicity, assume that the value of σ2 is known.]
QUESTION 15.3 (B)
We wish to estimate the mean μ of a normal distribution N(μ, σ²). Suppose we have two independent random samples from this distribution: one sample has size n₁ and sample mean x̄₁, and the other has size n₂ and sample mean x̄₂. As an estimator of μ, is it better to use the average of the sample means, (x̄₁ + x̄₂)/2, or, alternatively, the mean of the pooled data, (n₁x̄₁ + n₂x̄₂)/(n₁ + n₂)?
QUESTION 15.4 (C)
When estimating the variance σ² of a normal population with unknown mean from a sample of size n with mean x̄, we know (see, for example, QUESTION 6.5) that Σ(xᵢ – x̄)²/(n – 1) is an unbiased estimator. But what sort of biased estimator is Σ(xᵢ – x̄)²/(n + 1), and why might we prefer to use it?
QUESTION 15.5 (C)
- In the case of the normal distribution N(μ, σ²), with σ² assumed known, consider estimators of μ of the form cx̄ (c a constant), where x̄ is the sample mean. Find the value of c that will make cx̄ the MMSE estimator of μ, and show that this value of c means that here the method of MMSE estimation has failed.
- In 1961, in the Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, two US statisticians, Willard James and Charles Stein, published a very counterintuitive – and, indeed, paradoxical – theoretical result loosely to do with MMSE estimation. What is this result? And why is it paradoxical?
References
- Diaconis, P. and Efron, B. (1983). Computer‐intensive methods in statistics. Scientific American 248(5), 116–130.
- Konijn, H.S. (1963). Note on the non‐existence of a maximum likelihood estimate. Australian Journal of Statistics 5, 143–146.
- Wood, M. (2004). Statistical inference using bootstrap confidence intervals. Significance 1, 180–182.
ONLINE
- [15.1] http://jeff560.tripod.com/mathword.html
- [15.2] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, Series A 222, 309–368. At http://digital.library.adelaide.edu.au/dspace/handle/2440/15172
16
The pillars of applied statistics II – hypothesis testing
In this chapter, we are looking at hypothesis testing – that peculiarly statistical way of deciding things. Our focus is on some philosophical foundations of hypothesis testing principles in the frequentist, rather than the Bayesian, framework. For more on the Bayesian framework, see CHAPTER 20.
The issues discussed here all relate to a single test. In the next chapter, we investigate some matters that may complicate the interpretation of test results when multiple tests are performed using the same set of data.
Let us begin with a brief refresher on the basics of hypothesis testing.
The first step is to set up the null hypothesis. Conventionally, this expresses a conservative position (e.g. ‘there is no change’), in terms of the values of one or more parameters of a population distribution. For instance, in the population of patients with some particular medical condition, the null hypothesis may be: ‘mean recovery time (μ1) after using a new treatment is the same as mean recovery time (μ2) using the standard treatment’. This is written symbolically as H0: μ1 = μ2.
Then we specify the alternative (or ‘experimental’) hypothesis. For instance, ‘mean recovery time using the new treatment is different from mean recovery time using the standard treatment’. We write this symbolically as H1: μ1 ≠ μ2. Though we would usually hope that the new treatment generally results in a shorter recovery time (μ1 < μ2), it is conventional, in clinical contexts, to test with a two‐sided alternative. We must also specify an appropriate level of significance (usually 0.05, but see the answer to QUESTION 16.1), and a suitable test statistic – for example, the one specified by the two‐sample t‐test of a difference of means.
Before proceeding, we must confirm the fitness for purpose of the chosen test. The two‐sample t‐test assumes that the data in each group come from a normal distribution. If this is not a reasonable assumption, we may need to transform the data to (approximate) normality (see QUESTION 14.4).
Next, we collect data on the recovery times of some patients who have had the new treatment, and some who have had the standard treatment. It is important for the validity of conclusions from statistical testing that the data collected are from patients sampled randomly within each of the two groups. In performing the test, we gauge how different the mean recovery times actually are.
On the (null) hypothesis that the mean recovery times after the new and the standard treatments are equal, we calculate the probability (called the ‘p‐value’) of finding a difference of means (in either direction) at least as large as the one that we actually observe in our sample data. If this p‐value is large, the sample data show no real conflict with the null hypothesis that the two sets of recovery times have the same population mean. However, we cannot be certain about this conclusion; it might be that the means really are not the same, and that our sample data are quite unusual. If, on the other hand, this p‐value is small, we would be inclined to the interpretation that the means are not the same and our null hypothesis is incorrect. Again, we cannot be certain; it might be that the means really are the same, and our sample data are quite unusual.
How small is ‘small’ for the p‐value? When the p‐value is less than the pre‐specified significance level. In that case, when we conclude that there is a difference between the population means, we could say equivalently that there is a (statistically) significant difference between the sample means. It is important to understand that ‘significant’, here, means ‘unlikely to have arisen by chance, if the null hypothesis is true’. It does not necessarily mean ‘practically important’ in the real‐world context of the data.
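For readers who like to see the procedure end to end, here is a minimal sketch with invented recovery times, assuming the scipy package is available; it carries out the two-sample t-test described above.

```python
# Two-sample t-test of H0: mu1 = mu2 against a two-sided alternative,
# at the 0.05 significance level, on hypothetical recovery times (days).
from scipy import stats

new_treatment = [11, 14, 9, 12, 10, 13, 8, 12, 11, 10]
standard_treatment = [14, 16, 12, 15, 13, 17, 14, 15, 16, 13]

t_stat, p_value = stats.ttest_ind(new_treatment, standard_treatment)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Statistically significant difference between the mean recovery times.")
else:
    print("Insufficient evidence to reject the null hypothesis.")
```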
As we indicated earlier, any conclusion we draw from a hypothesis test may be wrong. By choosing to use a 0.05 significance level, we admit a 5% chance of rejecting the null hypothesis when it is, in fact, true. Thus, were we to carry out a particular test multiple times with different data, we could expect to make this ‘type I error’ one time in every 20, when the null hypothesis is true. Mirroring the ‘type I error’ is the ‘type II error’, where we fail to reject the null hypothesis when it is the alternative hypothesis that is, in fact, true. In practice, it is rarely possible to fix the chance of making a type II error. In this example, that is because we do not know the actual amount by which μ1 and μ2 differ. All we know is that they are not equal. This explains why fixing the chance of making a type I error is the focus of the test procedure, even where the type II error may be the more practically important one to avoid.
These are the salient theoretical aspects of a test of a single statistical hypothesis, as presented in introductory textbooks. The procedure seems polished and easy to implement. However, with a little historical background, we shall see that it is by no means uncontroversial. Nor are the results always straightforward to interpret.
‐‐‐oOo‐‐‐
Statistical methods for testing hypotheses were developed in the 1920s and 1930s, initially by Ronald Fisher, and subsequently by Jerzy Neyman and Egon Pearson – the same three statisticians whose pioneering role in the theory of estimation we highlighted in CHAPTER 15. In their work on testing hypotheses, these three introduced the terms null hypothesis, alternative hypothesis, critical region, statistical significance, power and uniformly most powerful test, which are today familiar to every statistician.
There is an important philosophical difference between the approaches of Fisher, on the one hand, and Neyman and Pearson on the other. At the time, it caused a great deal of polemical controversy and personal acrimony between these proponents.
Fisher developed what he called the theory of significance testing, which focuses exclusively on what we termed, above, the null hypothesis. The purpose of significance testing, said Fisher, is to reach a conclusion about the truth of this hypothesis. To proceed, begin by tentatively assuming it is true. Then, under this assumption, compare (a) the probability of getting the test data (or data more extreme than the test data), with (b) a reference value, chosen at the statistician’s discretion. As already mentioned, the former probability is nowadays called the ‘p‐value’ and the latter reference value is called the ‘level of significance’. After some reflection, Fisher came to the view that it is quite appropriate for the level of significance to be chosen subjectively, even after the p‐value has been calculated.
If the p‐value is smaller than the level of significance, the test data are unlikely to have been generated under the stated hypothesis. Accordingly, the hypothesis is rejected. Only then, said Fisher, is there a search for another hypothesis to replace the one that has been rejected.
It is worth noting that Fisher offered no theoretical criteria for judging the merits of the test statistic he put forward in each of the hypothesis testing contexts he studied, in marked contrast to his theoretical work on estimation (see CHAPTER 15). This omission by Fisher was remedied by Neyman and Pearson.
In their alternative to Fisher’s approach, labelled hypothesis testing, Neyman and Pearson argued that the testing procedure should keep the null and alternative hypotheses simultaneously in view. The purpose of hypothesis testing, they said, is to make a decision between these two hypotheses. They reasoned that a suitable decision procedure could be developed from appraising the relative risks (i.e. probabilities) of both kinds of possible decision errors mentioned above – namely, rejecting the null when it is true (the type I error), and failing to reject the null when the alternative is true (the type II error).
Fixing the risk of the type I error is achieved in the same way as Fisher did, for Fisher’s ‘level of significance’ is nothing but the probability of rejecting a true null hypothesis. However, Neyman and Pearson viewed the choice of level of significance as restricted to a set of standard values (e.g. 0.05, 0.01, 0.001), rather than being open to discretion, as Fisher advocated. Fixing the risk of the type II error is not routinely feasible, since the alternative hypothesis is not specified in exact numerical terms. Nevertheless, that risk can be tabulated for each of a set of parameter values, corresponding to a range of alternative hypotheses. Then, supposing that several competing test statistics are available for the test in question, an optimal selection can be made among them by choosing the one that has – for a given probability of type I error – the smallest type II error risk across that set of alternatives.
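A sketch of such a tabulation (our own illustration, for the simple case of a two-sided z-test on a normal mean with known standard deviation and significance level 0.05):

```python
# Tabulate the type II error risk (beta = 1 - power) of a two-sided z-test of
# H0: mu = mu0 at the 0.05 level, for a range of true departures delta from mu0.
import math

def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def type_ii_error(delta, n, sigma):
    """P(fail to reject H0) when the true mean is mu0 + delta; alpha fixed at 0.05."""
    z_crit = 1.960                                   # upper 2.5% point of N(0, 1)
    shift = delta * math.sqrt(n) / sigma
    power = std_normal_cdf(-z_crit + shift) + std_normal_cdf(-z_crit - shift)
    return 1 - power

for delta in (0.1, 0.25, 0.5, 1.0):
    print(f"delta = {delta:4.2f}   beta = {type_ii_error(delta, n=30, sigma=1):.3f}")
```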
Biau et al. (2010), online at [16.1], contrast the approaches of Fisher and of Neyman and Pearson at greater length. Lehmann (1993) reviews the two approaches in insightful detail and concludes that, in practice, they are complementary. For some statisticians, this is sufficient justification to declare the modern textbook account of testing to be a unification of the two approaches. Other statisticians are unconvinced, maintaining that, in theoretical terms, the two approaches will always be philosophically incompatible. Thus, they refer – rather negatively – to the modern textbook treatment as a hybrid, rather than a unification, of the two approaches.
‐‐‐oOo‐‐‐
It seems to us that when most applied statisticians are at work, they rarely give much thought to the philosophical foundations on which their techniques rest. This lack of attention applies, generally speaking, also to statistics educators.
So, it is a rare statistics course where students learning about hypothesis testing are invited to reflect on questions such as the following. What is the worth of hypothesis tests carried out on non‐random samples, such as ‘voting’ data submitted by readers of online publications? How many missing data values may be tolerated before a standard hypothesis test is no longer worth doing? How common is it to have a uniformly most powerful test? What should be done if such a test is unavailable? What are the essential differences between the Bayesian and the frequentist approaches to hypothesis testing? Why, in testing, is the principal focus of attention the null hypothesis – that is, the ‘no change, no effect or no difference’ situation – which, some argue, is in practice almost never true?
Sometimes, the type II error is more serious than the type I error. This is the case, for instance, in mass screening for cancer, where the type II error of a test on an individual is deciding that the person doesn’t have cancer when, in fact, he or she actually does. So, is it ever possible to directly control the size of the type II error in testing?
Knowing how to answer these and similar philosophical questions is very important in developing a deep and secure understanding of statistics and its techniques. With such an understanding, it is easier to recognise and respond to valid criticisms of hypothesis testing.
Here is an example of such a criticism: a null hypothesis (that assigns a specific numerical value to a parameter) can always be rejected with a large enough sample. If, however, this finding represents a (type I) decision error, then this interpretation of the test result will send the investigator off in the wrong direction. That fundamentally limits the usefulness of every technique of hypothesis testing nowadays, since huge samples (‘big data’) are becoming common in more and more fields (e.g. banking, climatology, cosmology, meteorology, online commerce, telecommunications and analysis of social media). In such situations, it is more fruitful to replace testing by interval estimation, which is as meaningful with ‘big data’ as with ‘little data’.
There are, in fact, further reasons to prefer an interval estimate to a hypothesis test beyond the likely breakdown of testing in the case of big data. A sustained critique of hypothesis testing has evolved over at least 50 years in the literature of statistics in psychology. This critique has several strands.
Firstly, there is the evidence that test results are too often given erroneous interpretations through faulty understanding of the theory. Here are two examples of common mistakes: the failure to reject the null hypothesis means that the null hypothesis is certainly true; the p‐value is the probability that the null hypothesis is false.
Secondly, there is the view that a confidence interval achieves more than a hypothesis test. The reasoning runs like this. Given the endpoints of an appropriate confidence interval, the result of an associated hypothesis test can be deduced (see QUESTION 16.2) but, given the result of a hypothesis test, we cannot deduce the endpoints of the associated confidence interval. Thus, the confidence interval gives more information and, at the same time, it is less liable to misinterpretation.
Krantz (1999) reviews and comments on these critiques and several others. He concludes that, while a few criticisms are misjudged, most are merited, yet he hesitates to recommend the total abandonment of hypothesis testing in favour of interval estimation. More recent writers are not so reluctant. A textbook by Cumming (2012), aimed at psychologists and other scientists who produce meta‐analyses, shows how statistical inference can be carried out properly without hypothesis tests. The future implications for statistics education are reviewed in Cumming et al. (2002), online at [16.2].
Questions
QUESTION 16.1 (A)
Most statistical hypothesis tests are carried out using a significance level of 5%. But where does this choice of numerical value (almost a statistical icon!) come from?
QUESTION 16.2 (B)
In many contexts, estimation and hypothesis testing can be viewed as two sides of the same coin. Let’s explore this proposition. Suppose you have constructed a 95% confidence interval for the mean μ of a population, based on a random sample of data. Now you decide that you would have preferred to use your sample to carry out a test of μ = μ0 against a two‐sided alternative. How can you use the confidence interval to obtain a result for the test? Could you use this confidence interval if you wanted to carry out a test against a one‐sided alternative?
QUESTION 16.3 (B)
Consider the familiar test on the value of the population mean of a normal distribution with known variance, against the two‐sided alternative. The power curve for this test has been described as resembling ‘an upside‐down normal curve’. To what extent is this description correct?
QUESTION 16.4 (B)
If a hypothesis test is carried out using a 5% level of significance against a specific alternative with a power of 90%, and the null hypothesis is rejected, what is the probability that it is actually true?
QUESTION 16.5 (B)
Writing in about 1620 about the game of rolling three dice – at a time when there was as yet little in the way of formal probability theory – Galileo reported that gamblers experienced in this game told him that a total of 9 and 10 can each be obtained in six ways. Yet it was their perception, formed over the long run, that 10 is slightly more likely than 9. (Galileo’s resolution of this puzzle is set out in the answer to QUESTION 11.1.)
Let us examine how this perception could form over the long run, using modern statistical methods. If the dice are fair, the theoretical probabilities of 9 and 10 are 25/216 and 27/216, respectively. In a long run of rolls of three dice, one would expect the empirical probabilities of observing a total of 9 or 10 to closely approximate these respective values.
How many rolls of three dice would be necessary to conclude with reasonable confidence that a total of 10 is more likely than 9?
Interpret this question as follows: for a test at the 5% significance level of the null hypothesis that the ratio of chances is 1 : 1, what sample size would give a 90% power of rejecting this, in favour of the alternative that the ratio of chances is 27 : 25?
References
- Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta‐Analysis. Routledge.
- Krantz, D. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association 94, 1372–1381.
- Lehmann, E.L. (1993). The Fisher, Neyman‐Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association 88, 1242–1249.
ONLINE
- [16.1] Biau, D.J., Jolles, B.M. and Porcher, R. (2010). P value and the theory of hypothesis testing: an explanation for new researchers. Clinical Orthopaedics and Related Research 468(3), 885–892. At http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2816758/
- [16.2] Cumming, G., Fidler, F. and Thomason, N. (2002). The statistical re‐education of psychology. Proceedings of the Sixth International Conference on Teaching Statistics (ICOTS6). At http://www.stat.auckland.ac.nz/~iase/publications/1/6c3_cumm.pdf
17
‘Data snooping’ and the significance level in multiple testing
It is a fundamental precept of applied statistics that the scheme of analysis is to be planned in advance of looking at the data. This applies to all kinds of procedures. Let’s take fitting a statistical model as an example.
The point, in this context, is to ensure as far as possible that the model is vulnerable to rejection by the data. If the data were inspected first, they might suggest a form of model to the investigator, who might then become attached – and even committed – to that model. He or she might then, even subconsciously, twist the data or direct the analysis so that the initially favoured model also comes out best in the end. This kind of subtly biased analysis is especially likely when persuasion (whether social, political or commercial) is the ultimate purpose of the model builder’s activity.
To avoid such bias, it is important that the model’s form and structure be specified in the greatest detail possible before the data are examined. The data should then be fitted to the model, rather than the model fitted to the data. (For more on fitting a model, see CHAPTER 13.)
What applies to modelling also applies to hypothesis testing: the hypotheses to be tested should be formulated before looking at the data. If the choice of hypothesis (or of statistical analysis, generally) is made after looking at the data, then the process is described as data snooping.
In this chapter, we explore the statistical consequences of data snooping in hypothesis testing – in particular, when multiple tests are done using the same set of data. We show why statisticians should be wary of this practice and, yet, why it is almost impossible to avoid.
Data snooping is one of several kindred practices that go by different names in the statistical literature. Selvin and Stuart (1966) distinguish data snooping, data hunting and data fishing, which they refer to collectively as data dredging. Martha Smith’s website [17.1], ‘Common mistakes in using statistics’, has a nice section on data snooping. She writes ‘Data snooping can be done professionally and ethically, or misleadingly and unethically, or misleadingly out of ignorance.’ We hope to influence you to keep to the first of these alternatives in your statistical work.
We can summarise the textbook procedure for testing a single hypothesis like this. A null hypothesis is set up, expressing a conservative (e.g. ‘no change’) position – for example, that a particular parameter has the value zero. This is the hypothesis that is to be tested. At the same time, an alternative hypothesis is set up in contrast to the null hypothesis – for example, that the parameter is greater than zero. This is the hypothesis which will be adopted if the null hypothesis is rejected by the test. The null hypothesis is always defined in exact numerical terms, while the alternative is, in general, numerically open‐ended.
Evidence is collected in the form of real‐world data. If this evidence is unlikely to have arisen if the null hypothesis were true, then the null hypothesis is formally ‘rejected’ – otherwise, the formal conclusion is ‘the evidence is not strong enough to reject the null hypothesis’.
‐‐‐oOo‐‐‐
Scientific investigations rarely limit themselves to a single hypothesis. Let’s return to our clinical example in CHAPTER 16. Rather than collecting data solely on the recovery times of patients after treatment, we (as medical researchers) will usually gather much more information at the same time: patients’ age, sex, body mass index (BMI), blood pressure and pulse rate will be recorded; blood will be taken and cholesterol, glucose and insulin levels measured; and subjective assessments of the patients’ state of mind will be obtained via questionnaires. After all, recovery time may depend on many more variables than just the mode of treatment used. Ultimately, we will have a sizable database.
Then, as well as testing whether or not recovery times are different for the two modes of treatment, we may also want to test whether each of the other variables that we have available is related to recovery time – maybe as part of a comprehensive model, maybe as separate tests on individual variables. In this process, we will probably use particular data sets in the database multiple times. Let’s say that, in all, we do 25 tests, each with significance level 0.05.
Now we may have a new problem with our testing procedure. Assume, for the sake of illustration, that, in fact, none of the measured variables in the database is related to recovery time. Although for each test there is only a 5% chance of a type I error occurring, there is, in the complete set of 25 tests, a higher probability of at least one type I error occurring. If the tests were independent (which is not so realistic here), this probability would be 1 – 0.95²⁵ ≈ 0.72. With dependent tests, the probability is likely to be lower, but still well above 0.05. Thus, there is a probability of up to 0.72 of discovering a ‘significant’ result at least once in the 25 tests. Yet, any finding of significance is, by our assumption, illusory.
This illustration shows how multiple tests of hypothesis, performed using a common data set, can inflate the chance of making a type I error to a quite unacceptable level. To counteract this, we could adjust the significance level of each individual test so that the overall significance level remains at 0.05. A straightforward way to achieve this is to divide 0.05 by the number of tests to be done (this is known technically as a Bonferroni adjustment – see QUESTION 17.3). Here, 0.05/25 = 0.002, so an individual test result will be significant if the p‐value is less than 0.002.
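The arithmetic of the last two paragraphs can be checked in a few lines; this is a sketch under the simplifying assumption of 25 independent tests with every null hypothesis true.

```python
# Family-wise error rate for 25 independent tests, before and after a
# Bonferroni adjustment of the per-test significance level.
m, alpha = 25, 0.05

fwer_unadjusted = 1 - (1 - alpha) ** m           # P(at least one type I error)
bonferroni_level = alpha / m                     # adjusted per-test level
fwer_adjusted = 1 - (1 - bonferroni_level) ** m

print(f"unadjusted family-wise error rate : {fwer_unadjusted:.2f}")   # about 0.72
print(f"Bonferroni per-test level         : {bonferroni_level:.3f}")  # 0.002
print(f"adjusted family-wise error rate   : {fwer_adjusted:.3f}")     # just under 0.05
```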
However, if a very large number of tests is to be carried out, this approach can produce a quite dramatically low value for the significance level of a single test. That will make it very difficult ever to reject the null hypothesis. Thus, Michels and Rosner (1996), writing in The Lancet about a situation involving 185 planned tests using the same database, where the overall significance level was to be held to 0.05, say: ‘It defies any modicum of commonsense to require a significance level of 0.00027 from a study.’
And this is not the end of the story. Suppose that, after doing our 25 initial tests, we notice that, for female patients over the age of 70, the new treatment seems to work much better than the standard treatment. So, we carry out a further hypothesis test on this subset of our sample, and find a p‐value of 0.00015 – surely a significant result, even assessed against the adjusted significance level of 0.002, and one that we could promote among practitioners as evidence for preferring use of the new treatment with older women.
But is this latest result, in truth, significant? Well, how many tests will have been done, explicitly or implicitly, when we consider our study concluded? Let’s see: there are the initial 25 that we carried out earlier, plus this latest one. Now, let’s suppose we had picked out females over 70 as one of (say) eight subgroups to test (two sexes and four age groups). True, we didn’t actually do the other seven tests, because it was fairly obvious by inspection that there would be no significant result. Also, what about the other combinations of explanatory variables that we reviewed (BMI, blood pressure, higher than usual lipid levels, etc.)? Maybe we should allow for testing these in the subgroups as well. But then, allowing for all these extra tests, the adjusted significance level would get too small, and our result for females over 70 would (unfortunately) no longer be significant. Better, then, to forget about our many exploratory investigations that turned out insignificant, and report only those that are significant. We are much more likely to get such a report published!
Now we have slipped over the line into the statistically unethical behaviour that data snooping can represent! As the American Statistical Association says in its Ethical Guidelines for Statistical Practice, online at [17.2]: ‘Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.’
An honest (and professionally defensible) strategy is to do and report only the tests that we specify in advance, adjusting the significance level we use in each test for the number of tests. There is no objection to undertaking the fishing expeditions that tempt us as we proceed with our pre‐specified agenda of tests. Indeed, it may be hard to resist their allure. However, the results of fishing expeditions should be reported as such, and the process described, perhaps, as ‘hypothesis generation’ rather than ‘hypothesis testing’.
A valid way to test a hypothesis thrown up in a fishing expedition is to seek out a new set of data. Alternatively, if we initially have a large set of data (meaning many subjects, rather than many variables), we could divide the set randomly into two parts, using one part to generate hypotheses and the other to test them.
In recent years, there has been a movement in many professional fields (notably in medicine) towards evidence‐based practice. It ought to be a matter of deep public concern if the unethical pursuit of data snooping were widespread, for it would raise the suspicion that much of the evidence behind evidence‐based practice was, in fact, statistically insignificant. Indeed, there are outspoken researchers who claim that this state of affairs is already real, rather than merely speculative. A striking example is given by Ioannidis (2005), online at [17.3], in a publication provocatively titled ‘Why most published research findings are false’. Another such example is given by Simmons et al. (2011). A web search of either title opens a deluge of supportive scientific commentary.
Questions
QUESTION 17.1 (B)
A classic example of data snooping in the scientific literature appears in an article that examines the rhythms of metabolic activity of a mythical animal. The author started with a set of randomly generated data representing its metabolic activity, and used standard time series techniques to analyse them. What is the animal, who is the author, and what are the conclusions of the study? And what was the point of doing it?
QUESTION 17.2 (B)
In the context of cyber security, the term ‘data snooping’ has another meaning. What is this, and what relation does it have to statistical data snooping?
QUESTION 17.3 (B)
The Bonferroni adjustment for multiple testing consists of lowering the significance level for each individual test to α/n (where n is the number of tests carried out, explicitly or implicitly) in order to achieve an overall significance level of at most α for the group of tests as a whole. Explain, in the simplest case of just two tests, how the Bonferroni adjustment works when the tests are dependent (as will be the case when they are carried out using data on exactly the same set of variables).
QUESTION 17.4 (B)
In the lead up to the Soccer World Cup in 2010, Paul the Octopus displayed his psychic powers by correctly predicting the outcomes of seven final‐round games involving Germany, and then the final between Spain and The Netherlands. More detail about his feat is available online at [17.4]. Paul carried out his predictions by choosing between two identical containers of food marked with the flags of the competing countries. If this were set up as a hypothesis testing situation, what would be the null and alternative hypotheses? What is the p‐value from the test? To what extent does the result give evidence for Paul’s psychic abilities? In what sense is this result connected with data snooping?
QUESTION 17.5 (C)
In a situation where we carry out a large number of hypothesis tests, there is an unacceptably high chance of finding at least one ‘significant’ result by chance. Using a Bonferroni adjustment, the overall significance level remains low, but at the expense of requiring very small levels of significance for each individual test. In 1995, two Israeli statisticians put forward a compromise approach to tackling the problem of multiple testing. Who were the statisticians, and what did they propose?
References
- Michels, K. and Rosner, B. (1996). Data trawling: to fish or not to fish. The Lancet 348(9035), 1152–53.
- Selvin, H. and Stuart, A. (1966). Data dredging procedures in survey analysis. The American Statistician 20(3), 20–23.
- Simmons, J.P., Nelson, L.D. and Simonsohn, U. (2011). False‐positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 1359–1366.
ONLINE
- [17.1] http://www.ma.utexas.edu/users/mks/statmistakes/datasnooping.html
- [17.2] http://www.amstat.org/committees/ethics/index.html
- [17.3] Ioannidis, J. (2005). Why most published research findings are false. PLoS Medicine 2(8), e124. At http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
- [17.4] http://en.wikipedia.org/wiki/Paul_the_Octopus
18
Francis Galton and the birth of regression
It has not been very long since the centenary of the death of one of the founders of our discipline – Francis Galton (1822–1911), student of medicine and mathematics, tropical explorer and geographer, scientist and, above all, statistician. In this chapter, we shall bring to mind something of this truly remarkable man and his statistical contributions. A comprehensive account of Galton’s life and work can be had from his own Memories of My Life (1908) and from Karl Pearson’s three volume The Life, Letters and Labours of Francis Galton (1914‐24‐30). These books, as well as a large collection of Galton’s scientific writings, can be read in facsimile at the website [18.1]. A further biography of Galton is cited in CHAPTER 22.
Galton was born into a well‐off manufacturing and banking family who were much involved with scientific and literary matters. Members of his immediate family were, in particular, interested in things statistical, his grandfather ‘loving to arrange all kinds of data in parallel lines of corresponding lengths, and frequently using colour for distinction’, and his father ‘eminently statistical by disposition’ (Memories of My Life, pages 3, 8). His half‐cousin was the naturalist Charles Darwin, author of On the Origin of Species (1859). Galton showed his high intelligence early. On the day before he turned five years old, he wrote a letter to his sister (quoted in Terman, 1917):
‘My dear Adèle, I am 4 years old and I can read any English book. I can say all the Latin Substantives and Adjectives and active verbs besides 52 lines of Latin poetry. I can cast up any sum in addition and can multiply by 2, 3, 4, 5, 6, 7, 8, 10. I can also say the pence table. I read French a little and I know the clock.’
The story of Galton’s early life – particularly his travels and explorations in Eastern Europe and Africa – is a colourful one. After his marriage at age 31 to Louisa Jane Butler, he settled down to a life of scientific studies that included such diverse areas as anthropology, anthropometry, psychology, photography, fingerprint identification, genetics and heredity.
One of his experiments concerned the sizes of seeds, and it turned out to be particularly important statistically, for it led to the birth of the concept of regression. Galton sent several country friends a carefully selected set of sweet pea seeds. Each set contained seven packets of ten equal‐sized seeds, with diameters from 15 to 21 hundredths of an inch. Each friend planted the seven packets in separate beds, grew the seeds following instructions, and collected and returned the ripe seeds from the new generation of plants.
Galton first reported the results in an article in Nature in 1877, and summarised them in 1886 in his far‐reaching paper Regression towards mediocrity in hereditary stature. In this paper (page 246), Galton states:
‘It appeared from these experiments that the offspring did not resemble their parent seeds in size, but to be always more mediocre [today we would say ‘middling’] than they – to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were very small. … The experiments showed further that the mean filial regression towards mediocrity was directly proportional to the parental deviation from it.’
And, in the appendix to the paper (page 259), he writes more specifically:
‘It will be seen that for each increase of one unit on the part of the parent seed, there is a mean increase of only one‐third of a unit in the filial seed; and again that the mean filial seed resembles the parental when the latter is about 15.5 hundredths of an inch in diameter. Taking then 15.5 as the point towards which filial regression points, whatever may be the parental deviation … from that point, the mean filial deviation will be in the same direction, but only one‐third as much.’
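Expressed as an equation (our paraphrase in modern notation, not Galton’s own), the relation described in this passage is a straight line with slope one‐third passing through the ‘point of mediocrity’:

mean filial size = 15.5 + (1/3) × (parental size − 15.5),

with both sizes measured in hundredths of an inch.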
Galton then repeated his heredity investigation with human heights. He obtained data on the heights of 930 adult children and their 205 pairs of parents, from family records that he collected by offering prizes. Since women are generally shorter than men, he adjusted the female heights to male equivalents by asking his ‘computer’ (in those days, a person!) to multiply them by 1.08. Using only large families, with six or more adult children, he tabulated the average height of a child in each family against the mid‐parental height (the average of father’s and adjusted mother’s heights), and found essentially the same results as he had with his seed experiment. He determined that the ‘level of mediocrity’ (the point where the average height of all children equals the average mid‐parental height of all parents) in the population was 68¼ inches, and then defined what he called the ‘law of regression’ for this context (page 252): ‘… the height‐deviate of the offspring is, on the average, two‐thirds of the height‐deviate of its mid‐parentage.’
Galton had used the term ‘regression’ for the first time the year before, when he presented these results in person at a meeting of the Anthropological Institute. He had previously used the term ‘reversion’, but abandoned it because it suggested that the offspring went all the way back to the average of the parents, rather than only part of the way.
FIGURE 18.1, below, is reproduced from the facsimile of Galton’s 1886 paper, from which we have been quoting, at www.galton.org. It shows adult child height on the horizontal axis, and mid‐parental height on the vertical axis (today, following the convention of putting the ‘dependent’ variable on the vertical axis, we would reverse these axes). The numbers of observations are shown in small digits within the diagram, and the ellipse represents a locus of roughly equal frequencies, in this case connecting the values 3 or 4. About this, Galton wrote (page 254): ‘I then noticed … that lines drawn through entries of the same value formed a series of concentric and similar ellipses.’
In modern terminology, the ellipses represent contours parallel to the base of a three‐dimensional bivariate normal distribution. Galton implicitly attributed a normal distribution to the measurement errors in his data. The line through N represents the regression of child height on mid‐parental height (see QUESTION 18.5), and the line through M, the regression of mid‐parental height on child height. We can see that these two lines are not the same – a point that did not escape Galton.
Initially Galton thought of his discovery of ‘regression towards mediocrity’ as simply a characteristic of heredity. However, by the time he published his book Natural Inheritance in 1889 he understood it for what it really is – a statistical artefact, that is, a change signalled by a fitted regression line that does not necessarily represent a change in the real world. This artefact appears not only in work (such as Galton’s) with heredity data – it is quite general in contexts involving repeated measures.
Consider a situation where measurements are made on two occasions (call them ‘before’ and ‘after’) on a particular attribute of the same population (or of closely similar populations, such as the heights of parents and of their adult children). We are referring here only to attribute populations that are stable, in the particular sense that the ‘before’ and ‘after’ (population) means are equal, and the ‘before’ and ‘after’ (population) variances are equal. We suppose, moreover, that all measurements are subject to random (normally distributed) measurement error – that is, they are not perfectly correlated between the two occasions. Then, when these repeated measurements are regressed on one another by the method of least squares, it is easy to show algebraically that the slope of the fitted regression line is always less than one.
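Here, for interested readers, is a sketch of that algebra (our addition, under the stated assumptions). Writing r for the correlation between the ‘before’ and ‘after’ measurements, the least squares slope of ‘after’ regressed on ‘before’ is

slope = Cov(before, after) / Var(before) = r × SD(after) × SD(before) / Var(before) = r,

since the two standard deviations are equal by assumption. Because the measurements are not perfectly correlated, r is strictly less than 1 in absolute value, and so, therefore, is the slope.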
It is this property of the regression line that produces the phenomenon of regression towards the mean (to give it its modern name). An introduction to the concept of regression towards the mean by Martin Bland is online at [18.2], a good non‐technical account can be found in Freedman, Pisani and Purves (2007), chapter 10, and a particularly interesting historical perspective is given in Stigler (1999), chapter 9.
If unrecognised for what it is, this artefact is likely to lead to false interpretations of regression‐based results in experimental studies of a kind that is very common in medicine, epidemiology and psychology. There is now an extensive literature showing how this artefact may be adjusted for or circumvented in such studies. A book‐length presentation for non‐statisticians is Campbell and Kenny (1999).
Galton’s studies of ‘regression towards mediocrity’ represent the beginnings of what we know today as regression analysis. Galton chose the term ‘regression’ with great care for the quite specific notion he sought to describe, though this is now mostly unknown or forgotten. Regression was not Galton’s only contribution to statistics. Far from it; he made statistical contributions in at least a dozen fields, as well as introducing the fundamental statistical idea of correlation. The past hundred years have seen his huge contribution grow, through the work of countless others – possibly beyond even his wildest imaginings.
Questions
QUESTION 18.1 (B)
A student regresses weight in kilograms on height in inches for a group of adult males. Having recorded the results, he decides that it was silly to mix metric and imperial units, and converts the heights to centimetres (using 1 inch = 2.54 cm). Now he can regress weight in kilograms on height in centimetres. Which of the following results will be the same for the second regression as for the first: the intercept coefficient, the slope coefficient, the value of r² (the coefficient of determination)?
QUESTION 18.2 (B)
When the coefficient of determination, r², equals 1, all the points in an (X,Y) data scatter lie on the least squares regression line of Y on X. When r² = 0, the least squares regression line of Y on X is horizontal. Sketch the scatter of (X,Y) data points (1,3), (3,3), (5,3), (7,3), (9,3). For the regression of Y on X based on these data, is r² equal to 1 or to 0?
QUESTION 18.3 (A)
As a statistician, Galton often carried out statistical estimation (though the actual term was introduced several decades later by R.A. Fisher). Perhaps his strangest activity was to estimate the bodily measurements of ‘Hottentot Ladies’ on his expedition to South‐West Africa (now Namibia) in 1850–1852. How did he carry out this estimation process?
QUESTION 18.4 (B)
The plot of a novel published in 2000 by an English writer: a postgraduate student decides to give up postmodern literary theory and write, instead, a biography about a famous (though fictional) biographer who left notes on three (real) subjects, identified only as CL, FG and HI. FG is Francis Galton, but who are the other two, who is the author of the book and what is its title?
QUESTION 18.5 (B)
In Galton’s diagram (FIGURE 18.1, above), the regression line ON of child height on mid‐parental height is defined geometrically (N is the point where the tangent to the ellipse is horizontal). Would calculation using the usual least‐squares approach result in the identical regression line? Can you explain why or why not?
References
- Campbell, D.T. and Kenny, D.A. (1999). A Primer on Regression Artifacts. Guilford Press.
- Freedman, D., Pisani, R. and Purves, R. (2007). Statistics, 4th edition. Norton.
- Galton, F. (1877). Typical laws of heredity II. Nature 15, 512–514.
- Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland 15, 246–263.
- Stigler, S.M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
- Terman, L. (1917). The intelligence quotient of Francis Galton in childhood. American Journal of Psychology 28, 209–215.
ONLINE
- [18.1] http://galton.org (both the Galton papers from the print references are available here)
- [18.2] http://www-users.york.ac.uk/~mb55/talks/regmean.htm
19
Experimental design – piercing the veil of random variation

As we highlighted in CHAPTER 2, the study of variation is at the heart of statistics. In almost all fields of mathematics, variation means non‐random (i.e. systematic) variation. Statisticians, however, take account not only of non‐random variation, but also of random (i.e. chance) variation in the real‐world data they work with. This contrast, indeed, distinguishes statistics from mathematics. What’s more, the two types of variation that statisticians deal with are almost always present simultaneously. Sometimes, it is the influence of the random variation which is dominant in a particular data set – as, for example, in day‐to‐day movements in the price of a particular share on the stock exchange. Sometimes, it is the other way round – as, for example, in the monthly value of sales of ice cream in a particular city, where the regular seasonal pattern city‐wide dominates random local variation.

It is useful, for what follows, to think of the patternless chance variation as being overlaid, like a veil, on some underlying pattern of systematic variation. A prime goal of statistical analysis is to get behind this veil of random variation in the data, so as to have a clearer picture of the underlying pattern (i.e. the form) of systematic variation in the variable or variables of direct interest. This goal is pursued with reference not just to the data at hand, but also (by using appropriate techniques of statistical inference) to the population from which the data came. Where more than one variable is of direct interest, there is an additional motive for getting behind the veil – to identify the degree of stability (i.e. the strength) of the pattern of relations between the variables.

It follows that the veil of random variation is actually a kind of obstructive nuisance. In most real‐world settings, there is also a second kind of nuisance variation. It is the variation of systematic variables that are not of direct interest, but whose influence is nevertheless present. Statisticians aim in various ways to neutralise the impact of both these kinds of ‘nuisance variation’, so that they can get on with their real objectives – to study the form and strength of the systematic variation in the variables that are of direct interest.

‐‐‐oOo‐‐‐

All practical statistical studies fall into one of two categories: non‐experimental and experimental. Non‐experimental studies are sometimes termed observational studies (a rather inexpressive term, since the word ‘observation’ also turns up in reports on experimental studies!). In an experimental context, the effects of some intervention by the experimenter on a set of experimental units (which may be animate subjects or inanimate objects) are recorded. These data are then analysed to determine whether or not it is likely that the intervention affects the experimental units in some systematic fashion. In a non‐experimental context, by contrast, data on variables of interest are collected in the real world, however they occur; there is no intervention. Intuitively, it should be clear that there is greater potential to neutralise nuisance variation successfully when one can (at least partially) control both the source and the intensity of that nuisance variation – which is what a well‐designed intervention is intended to do.

We now focus on experimental contexts.
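To make the ‘veil’ metaphor concrete before we turn to a specific experimental example, here is a minimal simulated illustration in Python (our own sketch; the straight‐line pattern and the noise level are arbitrary choices, not anything from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic component: a straight-line pattern in x.
x = np.linspace(0, 10, 50)
systematic = 2.0 + 0.5 * x

# Random component: the 'veil' of chance variation overlaid on the pattern.
observed = systematic + rng.normal(scale=1.5, size=x.size)

# Fitting a line by least squares recovers the underlying pattern approximately.
slope, intercept = np.polyfit(x, observed, deg=1)
print(f"estimated line: y = {intercept:.2f} + {slope:.2f} x")  # close to y = 2 + 0.5 x
```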
Suppose we are interested to know whether a theoretical scale of difficulty that is used to classify particular cases of some task is valid – that is, that the tasks labelled ‘easy’ are actually found by people to be easy, and that those labelled ‘difficult’ are actually found to be difficult. To investigate this, we might take as a null hypothesis that the theoretical scale is not valid – that is, tasks labelled ‘easy’ and ‘difficult’ are actually perceived in much the same way. Then we are interested to see if the data we collect will reject this hypothesis in favour of the one‐sided alternative hypothesis.

Let’s take a specific context – Sudoku puzzles – and consider a tentative approach to developing the hypothesis test. If you are unfamiliar with Sudoku puzzles, there are countless websites where you will find them described.

Choose, say, 60 experimental subjects, and give each a Sudoku puzzle to solve, where these puzzles are drawn from a pool containing puzzles labelled ‘easy’, ‘medium’, ‘hard’ or ‘diabolical’. For subsequent analysis, we shall proxy the four states of the ‘theoretical level of difficulty’ by numbers on an ordinal scale. It is common in such contexts to use values in arithmetic progression (e.g. 1, 2, 3, 4, respectively), though it could reasonably be argued that values spaced progressively more widely (e.g. in geometric progression) will more realistically proxy the increasing theoretical level of difficulty of the four kinds of Sudokus. Next, we record how many minutes it takes each subject to complete his or her puzzle.

In this way, we collect a numerical observation for each subject on the variable ‘theoretical level of difficulty’ and on the variable ‘time taken’. If the scatter plot of time taken against (increasing) theoretical level of difficulty for these 60 subjects has a positive slope, it suggests that ‘time taken’ varies directly with ‘difficulty’. We would interpret this result to mean that the theoretical scale of Sudoku difficulty is valid.

If, on the other hand, the scatter plot is roughly horizontal (i.e. has zero slope), this suggests that there is no systematic relation between ‘difficulty’ and ‘time taken’. We would interpret this to mean that the theoretical scale of Sudoku difficulty is not valid.

These interpretations, specific to the sample of 60 subjects involved, could be generalised for the population of all Sudoku solvers by applying a formal significance test to the slope of a line of best fit to the sample scatter plot. If the slope of this line were significantly greater than zero, the null hypothesis (‘the theoretical scale of difficulty is not valid’) would be rejected.

Any formal significance test mentioned in this experimental context has, as its theoretical foundation, a statistical model of the experiment. It is worth recalling, from CHAPTER 13, that such a model includes both a deterministic component (comprising one or more systematic variables that influence the time taken to solve a Sudoku) and a random component (which we have likened here to an overlaid veil).

The foregoing interpretations would be entirely valid if the slope in the scatter plot reflected solely an intrinsic population relation between time taken and theoretical level of difficulty. Unfortunately, this proposition is not necessarily true – and not only on account of random variation. Why not?

Because, apart from the theoretical level of difficulty (our focus variable), there are also several systematic nuisance variables in this setting that have not been taken into account. Here is one such variable: how much prior experience, on a binary scale (more experienced/less experienced), each subject has in solving Sudokus. To see how this categorical variable could influence the test outcome, consider two contrasting scenarios.

If the more experienced solvers all happened to get easy puzzles, and the less experienced solvers all got diabolical ones, the scatter plot of time taken against theoretical level of difficulty would – already for this reason alone – have a positive slope.

Now suppose that the more experienced solvers all happened to be assigned diabolical Sudokus, and the less experienced solvers all assigned easy ones. Then it could well be that the two groups take, on average, roughly the same amount of time to solve their puzzles. In that case – and already for this reason alone – the slope of the scatter plot might be close to zero, or even negative.

In other words, trend shape in the scatter plot could be the consequence of uncontrolled systematic nuisance variation, rather than a reflection of some intrinsic population relation solely between time taken and level of difficulty.

‐‐‐oOo‐‐‐

How might either of these two ‘extreme’ allocations of puzzles to subjects arise? If the experimenter is the one who does the allocation, there is always a risk of bias (even if it is only unconscious bias), and it is not hard to think of reasons why this might be so. A simple way to counter this risk is to take the allocation out of human hands, and use a computer‐generated set of random numbers (see CHAPTER 11) to randomly divide the set of subjects into four groups. A group is then chosen at random from the four, and all members of that group are assigned an easy Sudoku. The next group is randomly selected and assigned a medium Sudoku, and so on.

Now the tentative experimental approach we described initially has been improved. We have created, albeit in a simple way, a designed experiment that has neutralised (to a large extent) the influence of the ‘prior experience’ nuisance variable. How has it been neutralised? By nullifying its systematic influence in the statistical model of the experiment. In informal language, you can think of this as deleting the variable from the deterministic component of the model and adding it into the random component. In the technical language of statistics, the influence of the ‘prior experience’ variable has been randomised.

The resulting improved procedure is called a completely randomised design with one factor. Now that the ‘prior experience’ nuisance variable has been effectively (we trust!) dealt with, the single factor relates to the variable of interest in the intervention used. In this case, the intervention is assigning a Sudoku puzzle to be solved, and the factor is the level of difficulty of the puzzle. This approach can be generalised to two (or more) factors, where each subject does two (or more) different tasks with parallel theoretical scales of difficulty, e.g. a 9 × 9 Sudoku puzzle and a 6 × 6 Sudoku puzzle.

You will have noticed that, in order to form the four groups, the randomised design just described needs no knowledge of the actual level of experience that each subject has. If the levels of experience are, in fact, known, then a more efficient design is available (i.e. one which is more likely to lead to rejection of an incorrect null hypothesis). As always in statistical inference, the more correct information that is brought to bear on a problem, the more reliable the inference. The more efficient design is a randomised block design.

In the present context of a randomised block design with a single factor, the ‘blocks’ are two internally homogeneous groups of subjects – ‘more experienced solvers’ and ‘less experienced solvers’ – which must be set up first. The setting‐up process is called blocking. By an ‘internally homogeneous group of subjects’, we mean a group having less variability within the group – in the subjects’ puzzle‐solving experience – than in the population of all the subjects taken together. By an extension of this definition, if the population is divided into two non‐overlapping groups, each of which is internally homogeneous, the variability within each group is likely to be less than the variability between the groups.

Sudokus of all four levels of difficulty are then assigned at random to the subjects within each block. It is the lesser variability within the blocks, with regard to the subjects’ puzzle‐solving experience, relative to the variability between the blocks that gives this design its advantage. In the technical language of statistics, we say that the randomised block design avoids confounding the effect of the subjects’ puzzle‐solving experience with the effect of the level of difficulty of the Sudoku puzzle itself.

When there are two or more systematic nuisance variables to neutralise by blocking, a randomised block design can become very complicated, and may require a very large number of subjects for reliability of statistical hypothesis tests. Such a large number of subjects may be prohibitively expensive to seek out. For the case of one factor of interest and two nuisance variables, a more efficient experimental design is available – that is, one which requires fewer subjects than the corresponding randomised block design. It is called the Latin square design. For more on this design, showing also how it controls the influence of a nuisance variable, see QUESTION 19.3.

To this point, we have been describing experimental designs for testing a null hypothesis in an experimental context involving a single relationship of direct interest – in our example, the relationship of level of puzzle difficulty to time taken to solve it. However, these same designs can be applied to testing a null hypothesis comparing two relationships, to assess whether they are, or are not, significantly different. Two examples of such contexts are: deciding which of two chemical processes for producing a particular compound provides the better quality product; and deciding whether a new drug is, or is not, more effective than an existing drug for treating a particular illness.

We might now go on to describe the statistical tests that are appropriate to testing hypotheses under each of these experimental designs, and to discuss some of the more elaborate designs that have been devised. However, the technicalities involved would quickly take us beyond the intended purpose of this Overview. A good introductory treatment can be found in chapter 11 of Davies et al. (2005). A well‐regarded tertiary‐level textbook, with a bias to engineering applications, is Montgomery (2013).

Questions
QUESTION 19.1 (B)
The pioneering statistical ideas and methods for the design of experiments are due to R.A. Fisher, one of the founders of modern statistical inference. On page 11 of his path‐breaking treatise, Fisher (1935), he introduced his subject in this memorable way:

‘A lady declares that by tasting a cup of tea made with milk, she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested … Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order … Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.’

Fisher did not entirely invent this setting – it refers to an actual occurrence. Who was the ‘lady tasting tea’? And what were the real circumstances on which Fisher’s account is based?

QUESTION 19.2 (B)
Where is the Rothamsted Agricultural Research Station? What part did it have in the development of modern statistical inference prior to 1940 – in particular, by R.A. Fisher?

QUESTION 19.3 (B)
A scientist is interested in studying a particular agricultural relation – how crop yield varies with different amounts of a new chemical fertiliser, measured in grams per square metre. It is not adequate, for this purpose, simply to sow the crop in several plots of ground, then apply differing amounts of fertiliser to each plot, and then measure the weight of crop harvested from each plot. That is because there are inevitably other nuisance variables in the background that also affect the experimental outcome.

When there are two nuisance variables (for instance, the amount of moisture in the soil and the depth at which the seeds are sown), it is statistically efficient to use a Latin square experimental design. What is special about a Latin square design, and what is Latin about it?

QUESTION 19.4 (B)
What is a ‘placebo’? In what kinds of experimental contexts is a placebo useful? Are there situations where there is a caveat on the use of a placebo? And, just by the way, what is the linguistic connection between the words ‘placebo’ and ‘caveat’?

QUESTION 19.5 (B)
Name three disciplines from the physical, biological or social sciences where relationships among variables of interest are most commonly examined via experimental studies. Name three scientific disciplines where relationships among variables are most commonly studied non‐experimentally. Name three scientific disciplines where experimental studies and non‐experimental (also called ‘observational’) studies are both common. Is anything interesting revealed by this review?

References
PRINT
- Davies, M., Francis, R., Gibson, W. and Goodall, G. (2005). Statistics 4, 3rd edition. Hodder Education (in the UK school textbook series MEI Structured Mathematics).
- Fisher, R.A. (1935). The Design of Experiments. Oliver and Boyd.
- Montgomery, D.C. (2013). Design and Analysis of Experiments, 8th edition. Wiley.
20
In praise of Bayes

It is hard for us today to capture the intensity of the intellectual struggles that past pioneers in any field of knowledge engaged in as, with insight, creativity and sheer hard work, they laid the foundations of that field. However, we can improve our understanding of these struggles if we have some historical knowledge. That is why there are vignettes from the history of statistics in many places in this book.

CHAPTER 22, in particular, gives a broad perspective over some 400 years on the development of statistical inference. In the main, this is a history of frequentism in statistics.

Frequentism is a conceptual framework for statistical theory which takes its name from one of its fundamental axioms – that probability is best defined objectively as an empirical relative frequency. Unfortunately for any hope of a tidy intellectual evolution of the field, some 18th century statistical thinkers saw scope for an alternative framework for statistical theory, using as a fundamental axiom the subjective definition of probability. This conceptual framework has become known as Bayesianism, as we explain below.

Today, frequentism and Bayesianism are thriving as rival paradigms, both for designing theoretical techniques and for interpreting the results of applying those techniques to data. In this chapter, we look at the origins of Bayesianism and show why Bayesian inference is sometimes (its practitioners would say ‘always’) more appealing than the frequentist alternative.

‐‐‐oOo‐‐‐

It all began in 1654, the year that Blaise Pascal sought the aid of his great mathematical contemporary, Pierre de Fermat, to solve at last a fundamental question that had been studied only partly successfully for centuries: given an observed real‐world situation (call it the ‘cause’), where each of the possible outcomes (call it an ‘effect’) is a chance event, how can we systematically assign a quantitative measure – a probability – to the chance of occurrence of any one of these ‘effects’ of the observed ‘cause’?

During the following century, several alternative approaches to answering this fundamental ‘probability problem’ emerged. A major obstacle to arriving at a comprehensive general solution was that measurement is an elusive notion in the context of probability. Mathematical principles for assigning probabilities objectively were devised. However, the self‐evident fact that, in daily life, people commonly make their own subjective assessments of probabilities could not be ignored; yet there seemed to be no systematic principles that governed the formulation of such subjective probabilities. Worse still, there was no reason why subjective and objective probability assessments of the same event would be consistent with one another. Thus, by 1760, the hoped‐for comprehensive solution of the ‘probability problem’ was still rather in disarray.

At the same time, little headway had been made with another fundamental problem, dubbed the ‘inverse probability problem’. The fact that it was easy to state made the seeming intractability of its solution all the more galling to those who struggled with it.

The inverse probability problem can be expressed straightforwardly like this.
Given an observed chance outcome (call it the ‘effect’) of some real‐world situation (call it a ‘cause’), and knowing the full set of possible real‐world situations (‘causes’) that could have given rise to this outcome (‘effect’), how can we systematically assign a probability to the chance that the observed ‘effect’ came from a particular one of the set of real‐world ‘causes’ that could have produced that ‘effect’?

If you contrast the relevant wording of the first and fourth paragraphs of this subsection, the reason for the name ‘inverse probability’ problem should be clear.

In 1763, a remarkable paper on probability was presented at a meeting of the Royal Society in London. It had been written by the Reverend Thomas Bayes (1702–1761), who had earned his living as a church minister in the English town of Tunbridge Wells. In company with many amateur mathematicians and scientists of that era, his research was done in his spare time and was unpaid. Bayes’ paper was titled ‘An Essay towards solving a Problem in The Doctrine of Chances’ (he was referring to an early text on probability, The Doctrine of Chances, published by Abraham de Moivre in 1718). Bayes’ Essay was essentially complete (though perhaps not yet polished) at his death, when it came into the hands of his literary executor, Richard Price.

Though Bayes’ discussion was difficult to grasp, Price understood that the problem on which Bayes had made progress was, in effect, the inverse probability problem. Price thought this important enough to bring it to the attention of the Royal Society, of which he was a member (as Bayes had been, too). You can see the original version of the essay online at [20.1].

To thinkers who followed Bayes, it seemed that Bayes had implicitly achieved more than to propose a constructive path to solving the inverse probability problem. He had also shown that there was scope, in practice, for synthesising objective and subjective numerical probabilities (so troublesomely distinct as concepts). Bayes’ discussion implied that an initial subjectively‐evaluated probability could be ‘revised’ in the light of further objective probability information from the real world, thus producing a probability assessment that was a meaningful blend of both evaluations.

Many advances in probability theory – and, indeed, in statistical inference – grew out of Bayes’ Essay over the next 150 years. There is a comprehensive overview in a technical book by Dale (1999). Bayes would be astonished!

Here we shall focus only on a single very important formula in modern probability theory that can be traced back in spirit to Bayes’ Essay, though it does not actually appear there. This formula – now variously called Bayes’ formula, Bayes’ rule or Bayes’ theorem – is central to solving problems in inverse probability.

All such problems involve conditional probabilities. A conditional probability is the probability that an event A occurs, given that the ‘conditioning’ event B has occurred. This is written as P(A|B), and is formally defined as P(AB)/P(B) – the ratio of the probabilities of the joint event and the conditioning event (this definition requires that P(B) is not equal to zero). From this, we may write P(AB) = P(A|B)P(B), a useful way of expressing the probability of a joint event.

All problems in inverse probability involve inversion of event and conditioning event. Let’s illustrate this inversion with a problem that Bayes himself posed.
As a man of the Church, Bayes was interested in the question, ‘What is the probability that God exists, given all that I see around me in the extant world?’ Bayes realised that P(world exists | God exists) = 1 (since God can make whatever He likes), but what Bayes wanted to know was P(God exists | world exists).

Bayes’ formula, which expresses the relation of a conditional probability to the corresponding inverse conditional, can be written in various forms. We shall illustrate one of these in the context of forensic probabilities. A court of law is concerned with whether a suspect is guilty (G) or innocent (I), given the presence at the scene of a crime of some form of evidence (E), such as a fingerprint, a bloodstain or a DNA sample.

With some reasonable assumptions, we can usually evaluate P(E|G) and also P(E|I), the probability of the evidence being present given the guilt, or the innocence, of the suspect. But what the court actually aims to assess is the inverse of this, P(G|E), the probability that the suspect is guilty given the evidence. To find an appropriate expression, we need a few steps of simple algebra.

We begin with the identity in terms of joint probabilities:

P(GE) = P(EG)

We can rewrite this equation as

P(G|E)P(E) = P(E|G)P(G)   (1)

Similarly, since

P(IE) = P(EI)

we can write

P(I|E)P(E) = P(E|I)P(I)   (2)

Then we can form the ratio of the left hand sides and the right hand sides of equations (1) and (2), cancelling the (non‐zero) term P(E), to show Bayes’ formula in the following form:

P(G|E)/P(I|E) = [P(E|G)/P(E|I)] × [P(G)/P(I)]

In this version, the formula has an interesting theoretical (as well as practical) interpretation. The second term on the right hand side is termed the prior odds of guilt – that is, the odds of guilt before any evidence is considered. (You may recall that the odds of an event is the ratio of the probability that the event occurs to the probability that it does not occur. Odds is a measure of chance alternative to the usual 0 to 1 scale of probability.)

The first term on the right hand side is the ratio of the probability that the evidence is present, given that the suspect is guilty, to the corresponding probability, given that he is innocent. It is referred to as the likelihood ratio for the presence of the evidence. The term on the left hand side is the odds of being guilty, rather than innocent, given the evidence that has been considered. This is referred to as the posterior (or revised) odds of guilt.

In summary, Bayes’ formula can be expressed as:

Posterior odds = Likelihood ratio × Prior odds.

To see how the formula is applied in a legal context, consider this scenario. A suspect is on trial for a murder committed in Australia. A bloodstain found at the murder scene is of type AB–. The victim did not have this blood type, but the suspect does have AB– blood. What can we conclude from this piece of evidence?

First, we can make an assessment of the prior odds of guilt. Suppose there are only 50 people who could conceivably have been responsible for the murder; thus, we shall take the prior odds of guilt as 1/50 or 0.02. Next, we can consider the strength of the evidence. In Australia, only around 1% of people have this rarest type of blood. So P(E|G) = 1, while P(E|I) = 0.01, and the likelihood ratio is 100, representing a moderate strength of evidence. The posterior odds of guilt is thus 100 × 0.02 = 2, indicating that, after taking the evidence into account, the suspect is twice as likely to be guilty as innocent.
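For readers who like to tinker, here is a minimal Python sketch (our illustration, not part of the original text) that reproduces this calculation and shows how further evidence would be chained; the likelihood ratio of 20 for a hypothetical second piece of evidence is a made‐up value.

```python
def update_odds(prior_odds, likelihood_ratio):
    """Bayes' formula in odds form: posterior odds = likelihood ratio x prior odds."""
    return likelihood_ratio * prior_odds

def odds_to_probability(odds):
    """Convert odds in favour of an event to a probability on the 0-to-1 scale."""
    return odds / (1 + odds)

# Prior odds of guilt: roughly 1 in 50 plausible suspects.
prior_odds = 0.02

# Bloodstain evidence: P(E|G) = 1, P(E|I) = 0.01, so the likelihood ratio is 100.
posterior_odds = update_odds(prior_odds, likelihood_ratio=1 / 0.01)
print(posterior_odds)                       # 2.0 -- guilt twice as likely as innocence
print(odds_to_probability(posterior_odds))  # about 0.67

# A further, independent piece of evidence (hypothetical likelihood ratio of 20)
# would simply multiply the current odds again.
print(update_odds(posterior_odds, likelihood_ratio=20))  # 40.0
```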
A posterior odds of 2 is quite an increase on the prior odds!

An important strength of this technique is that it can be applied repeatedly to take account of further independent kinds of evidence – for instance, a witness report that the murderer was a man, or the discovery of a fingerprint on the murder weapon. In each step, the current odds of guilt is multiplied by the likelihood ratio of the evidence, to produce a revised posterior odds that the suspect is guilty.

However, the technique evidently cannot proceed without an initial estimate of the probability of guilt. It may be difficult to obtain agreement on such a prior probability. An objective ‘frequentist’ argument could provide a starting point, as in our explanation above. However, for many events, such an initial assessment of chances has to be made in terms of subjective probability, because there is simply no other reasonable way to determine their probability. For example, the probability that a particular swimmer will win a gold medal at the next Olympic Games can, in principle, only be assessed subjectively. To many people, including some statisticians, this seems ‘unscientific’, and so the entire Bayesian procedure for revising odds is dismissed. This seems quite an extreme reaction, given that – whatever element of subjectivity is injected first – that initial element is progressively synthesised with likelihoods evaluated from multiple pieces of accrued objective evidence.

In the last century, the notions that prior probabilities can be assessed subjectively and that, more generally, the theory of statistical inference should embrace subjective probabilities were, regrettably, the cause of frostiness and even acrimony among academic statisticians on many occasions. Rather than collaborating intellectually, Bayesians and frequentists took refuge in separate ‘camps’, each side proclaiming the virtues of their own stance and criticising the other. One of us (PP) recalls, as a young academic, attending a conference at which a speaker announced that he would be presenting a Bayesian analysis of a problem. On hearing this, about half the audience stood up and left the room!

Today, the rift is no longer so wide. As the strengths of Bayesian techniques are more widely understood, closer engagement of the ‘camps’ in applied statistical work is on the horizon.

Indeed, the Bayesian approach to statistics has had many notable successes. One spectacular example was the inferential approach taken by the team of British cryptologists, led by Alan Turing, that ultimately broke the code of the German Enigma message‐enciphering machines during World War II. Winston Churchill claimed that Turing thus made the biggest single contribution to the Allied victory, and historians have estimated that the work of his team shortened the war by at least two years.

The methods of Turing’s team, and many other practical successes of Bayes’ formula, are described by Sharon McGrayne (2012) in her book, The Theory That Would Not Die. She shows strikingly how a simple rule that, in effect, formalises the notion of learning from experience has been applied to a vast range of areas of human activity, from rational discussion about the existence of God to more efficient ways of keeping spam out of your mailbox.

Questions
QUESTION 20.1 (A)
Where is Thomas Bayes buried, and what important statistical institution is located nearby?

QUESTION 20.2 (B)
In this chapter’s Overview, we considered the following scenario: a suspect is on trial for a murder committed in Australia. A bloodstain found at the murder scene is of type AB–. The victim did not have this blood type but the suspect does have AB– blood.

Knowing this information, the prosecutor says to the jury: ‘In Australia, only around 1% of people have type AB– blood. Hence, the chance that the blood came from someone else is very small – only around 1%. So the suspect is fairly certain to be the murderer, with a probability of about 99%.’ What is wrong with the prosecutor’s argument?

QUESTION 20.3 (B)
In textbooks of advanced probability you will find a section devoted to so‐called ‘urn problems’. These are problems that involve selecting balls at random from a collection of different numbers of coloured balls in an urn, as a statistical model for certain real‐life sampling situations. (An urn is an opaque vase‐like container with a narrow top. In its statistical role, it is another one of the physical artefacts that we write about in CHAPTER 25.)

Many urn problems can be instructively solved using Bayes’ formula. Here is one example. An urn contains ten balls, each of which is either red or black. One ball is selected at random and found to be red. What is the probability that it was the only red ball in the urn? [You will need to make an assumption about the process by which the urn was initially filled with red and black balls.]

QUESTION 20.4 (B)
Working with his team of code breakers at Bletchley Park in England during World War II, Alan Turing developed the idea of a scale for measuring strength of evidence. What type of scale was this? What was the unit on this scale? And what was the origin of the name Turing coined for this unit of evidence?

QUESTION 20.5 (B)
In the frequentist approach to interval estimation, a confidence interval for a parameter (e.g. the population mean) is constructed using a procedure that captures the true population mean a specified percentage of the time, in repeated sampling. Suppose that a 95% confidence interval for a population mean is found to be (2.5, 3.5). Can we conclude that there is a 95% probability that the population mean is between 2.5 and 3.5?

What is the usual term for the Bayesian analogue of a confidence interval? What differences in interpretation are there between a numerical confidence interval calculated using the frequentist approach and one using the Bayesian approach?

References
PRINT
- Dale, A.I. (1999). A History of Inverse Probability from Thomas Bayes to Karl Pearson. Springer.
- McGrayne, S. (2012). The Theory That Would Not Die. Yale University Press.
ONLINE
21
Quality in statistics
It seems to be obvious that statistics is a strictly quantitative discipline. However, that is not so, as we shall explain.
Certainly, statistics is a way of arriving at an understanding of the world using techniques for analysing numerical quantities, either measured or counted. ‘Numerical detective work’ is the way the great US statistician John Tukey described statistical analysis in his renowned book Exploratory Data Analysis.
Before the 19th century, statistics was literally ‘state‐istics’, that is, a description of the state (i.e. the nation) – a description which, moreover, focused heavily on qualitative (i.e. non‐numerical) analysis. Questions about a country’s productivity, wealth and well‐being were answered by analyses based on observed characteristics (without necessarily including any measurements), such as its progress in agriculture and industry and its accomplishments in the arts and architecture. An interesting historical essay by de Bruyn (2004) illustrates how this worked in practice.
Do qualitative analyses still have a place in modern statistics? Indeed, they do. Beginners in statistics may form the impression that it concerns itself only with quantitative data and quantitative analyses. However, qualitative data and qualitative analyses are a vital part of statistics, too.
What exactly do the terms ‘quantitative’ and ‘qualitative’ mean in this context? Dictionaries usually define these words by referring back to the terms ‘quantity’ and ‘quality’. The Macquarie Dictionary defines ‘quantity’ as ‘an amount or measure’, and ‘quality’ as ‘a characteristic, property or attribute’. Statisticians distinguish data on a quantitative variable from data on a qualitative variable by saying that the former are values, whereas the latter are states.
The quantitative/qualitative distinction is a very basic one; there are more elaborate ways of classifying variables. One such classification scheme was devised in 1946 by the US psychologist Stanley Stevens. His scheme puts variables into four classes: categorical (also called ‘nominal’), ordinal, interval and ratio.
Categorical data are associated with a fixed set of non‐overlapping categories. Examples of a categorical variable are city of birth and marital status. Ordinal data (as the name suggests) are assigned a place in an ordered scale according to some criterion. Examples of an ordinal variable are military rank and a composer’s opus numbers (that record the order of composition of musical works, without reference to the time elapsed between their dates of publication). Interval data are numerical values that have a precise position on a continuous scale, with an arbitrary zero. They are a step up from ordinal data, in that one can say how much further along a scale one item is than another. Examples of an interval variable are longitude and temperature in degrees Celsius. Finally, ratio data are numerical values that have a precise position on a continuous scale with an absolute zero. Examples of ratio variables are length and weight.
It should be clear from these definitions that interval and ratio variables are quantitative variables. Further, a categorical variable is clearly a qualitative variable. But what can we say about an ordinal variable? Is it quantitative or qualitative? This is a perplexing question, for some ordinal data appear to be quantitative (opus numbers, in our example), while others seem to be qualitative (e.g. military rank).
This issue has caused a great deal of controversy in statistics, especially in regard to psychological data. Psychologists routinely collect ordinal data in their experiments, and are accustomed to assigning numerical ranks to their observations before analysing them. Think, for instance, about the following behavioural question and its numerically ranked responses: Do you smoke? (often 1, sometimes 2, rarely 3, never 4).
These rank data look like interval data, but not all statistical calculations (e.g. the arithmetic mean) that are valid with interval data are meaningful with rank data. Why? Because a rank coding of responses is an essentially arbitrary choice. After all, if rarely or never smoking were regarded as personally exceptionally beneficial, then the four responses might, for instance, be coded 1, 2, 4, 8. The subtleties of accommodating ordinal variables in statistical analyses are explained in more detail in a (fairly technical) paper by Velleman and Wilkinson (1993).
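Here is a small Python sketch (our own illustration, with made‐up responses) of how the choice of codes matters: the mean shifts its apparent position among the response categories when the coding changes, while the median points to the same category (‘rarely’) under either coding.

```python
import numpy as np

# Hypothetical responses to 'Do you smoke?' from ten people.
responses = ["often", "sometimes", "rarely", "never", "never",
             "rarely", "sometimes", "never", "often", "never"]

coding_a = {"often": 1, "sometimes": 2, "rarely": 3, "never": 4}  # arithmetic codes
coding_b = {"often": 1, "sometimes": 2, "rarely": 4, "never": 8}  # alternative (doubling) codes

for label, coding in [("codes 1,2,3,4", coding_a), ("codes 1,2,4,8", coding_b)]:
    values = np.array([coding[r] for r in responses])
    # The mean depends on the arbitrary coding (2.8 versus 4.6 here);
    # the median falls on the code for 'rarely' under either coding.
    print(label, "mean:", values.mean(), "median:", np.median(values))
```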
Let’s look now at some qualitative aspects of modern statistical work that go beyond simply including qualitative variables in analyses.
There are, for example, qualitative issues in defining a qualitative variable – that is, defining the ‘states’ (or ‘categories’) of the qualitative variable. In many practical contexts, this can be complicated, and even controversial. For example, in Australia, a person is officially defined as ‘employed’ if he or she performed at least one hour of paid work in the week prior to the official employment survey. Such a definition has an obvious political implication. It enables the government to report a higher total employment than would be the case if a more stringent definition were adopted. In the world of sport, we distinguish amateurs from professionals. The categorisation of an athlete as ‘amateur’ was (until the 1980s) an indispensable requirement for participation in the Olympic Games. Not surprisingly, the definition of ‘amateur’ in this context became a matter of the sharpest dispute.
A remarkable book by Bowker and Star (2000) shows how the definition of categories plays an important role in the outcomes of statistical investigations, with some striking examples involving medical and racial classification.
There are also qualitative issues in including a quantitative variable in a statistical analysis, for we need first to decide exactly what to measure and how to measure it.
For example, if we are carrying out a study comparing the effectiveness of two different approaches to learning statistics – a traditional classroom course and an online course – we might initially think of basing conclusions on students’ final examination results. However, we know that it is not straightforward to measure the outcomes of learning in this way. This leads us to a consideration of other variables or combinations of variables that might do better. We may, for instance, choose to compare students’ attitudes towards statistics at the beginning and at the end of their studies, using an instrument such as the Survey of Attitudes Towards Statistics, developed by Candace Schau (online at [21.1]). Again, in a medical context, if we wish to compare two treatments for brain tumours, we might measure survival times, or we might put more emphasis on the quality of life during the survival time, and assess this using a quality‐of‐life survey – see Carr et al. (eds) (2002).
In both of these situations, you will notice that the alternative assessments proposed will produce ordinal data. As already noted above, we would need to be watchful that the statistical analyses applied to these data were practically meaningful.
Sometimes it happens that an investigator finds it too challenging to choose a measure for some particular real‐world variable, and so decides simply not to measure it at all. Then, regrettably, the influence of that variable may just be ignored. Consider cost‐benefit analysis (already mentioned in CHAPTER 9). Faced with a proposal to ‘develop’ some land by harvesting the trees growing on it, and then building houses on the cleared land, it may be difficult for a project‐assessment authority to measure the benefits of continuing to have trees growing in that particular location – benefits in terms of, say, their prevention of soil erosion, or their appeal as parkland. It is all too tempting to make the decision to approve or to disallow the proposal on the basis of only those economic variables that can be measured easily, such as the costs of harvesting the trees and of building the dwellings, and then weighing these costs against the benefit, evaluated solely as the amount that can be earned from sale of the timber and the dwellings.
As the Nobel Prize‐winning economist Joseph Stiglitz has written: ‘What we measure affects what we do. If we have the wrong metrics, we will strive for the wrong things.’
Questions
QUESTION 21.1 (A)
Language text (which is qualitative information) can be analysed using frequency counts of letters, words or phrases (that is, in a quantitative way) to attempt to resolve such matters as authorship disputes. Statisticians who contribute in the field of English textual analysis soon learn the order of letters by their frequency of occurrence in English prose. The first 12 letters of this ordered set have been used as a phrase in a variety of contexts. What is this phrase? Can you give a context in which it has been used?
QUESTION 21.2 (A)
FIGURE 21.1 is part of an historic map of London from the late 19th century showing by different shadings (originally, colourings) the socio‐economic status of each household. Who created the map? How were the data collected? What current statistical marketing technique is its direct descendent?
QUESTION 21.3 (B)
We have seen the perplexing position of ordinal variables, lying between the quantitative and the qualitative in statistical analyses. Let’s examine this further. Suppose two groups of people – A and B – are suffering from the same illness. Those in group A receive treatment T1, and those in group B receive treatment T2. Afterwards, each person is asked to respond on a five‐point scale – strongly disagree, disagree, undecided, agree, strongly agree – to the statement ‘the treatment I received was completely effective’. These responses can be numerically coded as 1, 2, 3, 4, 5. We can use these response data to test whether people who receive T1 have the same perception of the effectiveness of their treatment as those who receive T2.
(a) If this hypothesis is tested using a chi‐squared test of independence, what assumption is being made about the nature of the response variable? What if an independent‐samples t‐test is used? What test would be more appropriate than either of these?
(b) Do any of the tests in part (a) throw light on whether the two treatments are equally effective?
QUESTION 21.4 (B)
In 1973, The American Statistician published a paper on sampling with this intriguing title: ‘How to get the answer without being sure you’ve asked the question’. What is the name for the type of sampling that the authors were describing, and in what situations might this type of sampling be useful?
QUESTION 21.5 (B)
Statisticians have made many contributions in military settings. A famous example is an investigation, during World War II, of the survivability of military aircraft hit by enemy fire. Which eminent statistician estimated the probabilities of an aircraft surviving a single hit on different parts of its body? What aspect of the damage data did he particularly notice, and what insightful contribution did that lead to for improving aircraft survivability under fire?
References
- Bowker, G. and Star, S. (2000). Sorting Things Out: Classification And Its Consequences. MIT Press.
- Carr, A., Higginson, I. and Robinson, P. (eds, 2002). Quality of Life. Wiley.
- de Bruyn, F. (2004). From Georgic poetry to statistics and graphs. The Yale Journal of Criticism 17, 107–139.
- Stiglitz, J. (2009). Towards a better measure of well‐being. Financial Times, London, 13 September.
- Tukey, J.W. (1977). Exploratory Data Analysis. Addison‐Wesley.
- Velleman, P.F. and Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician 47, 65–72.
ONLINE
22
History of ideas: statistical personalities and the personalities of statisticians
How much do you know about the historical development of today’s theory and practice of statistics?
The modern field of statistics is the cumulative intellectual achievement of hundreds of gifted thinkers over at least the past 400 years and, particularly, since about 1860. To learn about the history of ideas in statistics is to discover the names of those gifted statistical personalities. The scholarly literature of statistics may concentrate on the ideas and give the names only passing regard, but we should not take this as a signal that the names are unimportant. The names are important – not in themselves but, rather, for who they were, these energetic and creative builders of modern statistics. Knowing something of the personalities of these statisticians, we can hope for insights on ‘how they did it’.
In this hope, we statisticians are certainly not unique. It has long been popular to seek, in the personalities and life‐paths of the gifted, clues to their remarkable achievements – whether the gifted are thinkers (philosophers, historians, scientists, trainers, etc.) or doers (political leaders, explorers, engineers, athletes, etc.).
Sometimes, this pursuit is disappointing. The lives of the composer Mozart and the painter Rembrandt, for instance, offer few insights on how they created the works of genius that we treasure today. However, there are many other historical personalities whose lives convey much about the sparks that ignited their great achievements. Our insights come principally from two kinds of sources: their own informal writings and revelatory exchanges with their intellectual peers (e.g. personal diaries and private letters); and public documents (e.g. reports of debates and controversies, and biographical essays).
All of these sources are, unfortunately, far removed from the settings in which statistics is studied and practised today. Unless steered to these sources by a teacher or reference book, or propelled in their direction by incidental curiosity, few statisticians come upon them. The history of statistical ideas remains a little‐visited byway.
Does it matter? You be the judge!
The history of statistical ideas since about 1600 is a grand saga of intellectual endeavour. It tells of achievements in these major areas: how to conceptualise, measure and analyse chance in human experience; how to detect authentic ‘big picture’ meanings in detailed real‐world (and, therefore, chance‐laden) data; and – from the knowledge gained in those inquiries – how to evolve a set of inductive principles for generalising the detected meanings, as reliably as possible, to wider contexts.
Without a historical perspective, one has little idea which concepts and principles are recent and which long‐established, or which were easily established and which were, for a long time, intractable. Many statistics textbooks so neglect a historical perspective that it could well appear to beginning students that the entire body of theory was conceived just recently, and delivered soon afterwards – perhaps by a stork?
In fact, the evolution of modern statistics has been a slow journey with deeply human dimensions.
Three aspects of this journey are worth your attention. First, when, and in what practical circumstances, pivotal ideas were born; second, how challenging it often was for the pioneers of theory just to frame clearly the questions that they wanted to answer; and third, how much rethinking was called for before satisfactory solutions were arrived at. Once you have some perspective over these matters – and, especially, the last one – you will, we feel sure, be cheered by just how much easier is the path to the same knowledge today.
FIGURE 22.1 gives a schematic view of some of the major stages in the laying of the dual foundations on which modern statistical inference rests, together with the names of the scholars with whom the important progressive ideas are associated. These foundations are probability theory, and methods of statistical data summarisation and display. We should emphasise that the contributions of all those mentioned in this table are far more extensive than what is shown. Only contributions relevant to the themes of the table are included here.
FIGURE 22.2 shows how the structure of statistical inference was erected, after 1860, on the dual foundations in FIGURE 22.1.
To enrich the story traced out in FIGURE 22.2, you may like to browse some of the following references. For each scholar mentioned, there are two items. The first summarises (without too much technicality) his contributions to statistical inference; the second offers insights into his personality and life‐path. We have selected these references from a profusion of material, much of it dating from the last 25 years, a period which has seen a wealth of new research in the history of statistics.
Galton | 1. Forrest (1974) | 2. Galton (1908), online at [22.1]
Pearson | 1. Magnello (2009) | 2. Porter (2004)
Gosset | 1. Plackett (ed, 1990) | 2. McMullen (1939)
Fisher | 1. Zabell (2001) | 2. Box (1978)
Neyman & Pearson | 1. Lehmann (1993) | 2a. Reid (1982); 2b. O’Connor and Robertson (2003), online at [22.2]
Savage | 1. Lindley (1980), online at [22.3] | 2. O’Connor and Robertson (2010), online at [22.4]
Tukey | 1. Brillinger (2002), online at [22.5] | 2. Anscombe (2003), online at [22.6]
Once you’ve caught the ‘bug’ for the history of ideas in statistics, where can you turn to go on exploring these ideas more generally? That depends, of course, on where you are currently in your knowledge of the discipline. Here are some suggestions.
If you are currently involved in undergraduate studies, you’ll find the book by Stigler (1986) particularly readable on the history of probability theory in the 18th and 19th centuries (the ideas of Bernoulli, de Moivre, Gauss and Laplace), the birth of the normal distribution (the work of de Moivre and Gauss), and the creation of correlation and regression theory (by Edgeworth, Galton and Karl Pearson).
For earlier ideas on chance and probability (from Galileo, Pascal, Fermat and Bernoulli in the 16th and 17th centuries), the engagingly written book by David (1962) can be thoroughly recommended. The contributions of some 20th century statistical pioneers are reviewed in lively fashion by Salsburg (2001).
Short but very informative biographies of most of the 19th and 20th century statisticians mentioned in FIGURES 22.1 and 22.2, together with those of a couple of dozen other statistical pioneers, can be found in the MacTutor History of Mathematics online archive (see the index of names at [22.7]).
If you are a postgraduate student in statistics, or in a field where statistical ideas are important, there are many gems to be found in the following three books. Hacking (1990) shows how the evolution of ideas about probability advanced 19th century European society and culture. Stigler (1999) is a collection of 22 stimulating, and sometimes quirky, essays with settings ranging from the 17th to the 20th century.
Also very valuable is Weisberg (2014), which (unusually in this field) is written by an applied statistician. It presents a non‐mathematical perspective over the history of probability and statistics, from Pascal’s beginnings to the situation today. What makes the book especially interesting is that it elicits, from this history, challenges for tomorrow. These include how to repair a growing gap between the activities, in their increasingly separated worlds, of academic researchers and statistical practitioners in government and business. Academics push forward the frontiers of statistical theory using established quantitative concepts of probability. Practitioners, however, confront very diverse situations, in which the element of uncertainty often has qualitative dimensions not fully captured by formalised probability theory, making the application of academic statistical techniques a fraught matter. The author gives evidence for his views from contemporary statistical practice (e.g. in education, pharmacology, medicine and business). He also urges theorists and practitioners to re‐engage, holding up as a model R.A. Fisher’s fruitful melding of his contributions to theory and practice (see also the answer to QUESTION 19.2).
By the way, it can be very rewarding to dip into the original works of the statistical pioneers. English translations of most works by the French and German pioneers are available, if reading Latin, French or German is not among your skills. You may be surprised, for example, how directly and informally Galton and Karl Pearson share their thinking with the reader, even while they are still feeling their way towards their eventual technical achievements.
Finally, for the committed enthusiast about the history of statistical ideas, we recommend exploring Peter Lee’s extensive website, Materials for the History of Statistics, online at [22.8]. Among the diverse links brought together on this website, there are all kinds of unusual things to be discovered. Cited on this website, but meriting separate mention for its rich detail, is John Aldrich’s website, Figures from the History of Probability and Statistics, online at [22.9]. The contributions of even more statisticians of the past, worldwide, can be found in Heyde and Seneta (eds, 2001). Among the editors and compilers just mentioned, John Aldrich and Eugene Seneta have made their own extensive, and always engaging, scholarly contributions on the history of probability and statistics. Further recent writers who cover this field broadly and whose works repay seeking out are Lorraine Daston, Gerd Gigerenzer, Anders Hald, Robin Plackett and Oscar Sheynin.
When it comes to appreciating the innovative ideas of contemporary statisticians, the best source is often the perspective of the innovator him‐ or herself. Such perspectives can come to light nicely in informal conversations with the innovators, which are recorded and then transcribed for publication. Since the mid‐1980s, published ‘Conversations’ with leading contemporary statisticians have brought some remarkable ideas and personalities to life on the printed (or digitised) page. The journal Statistical Science has included a Conversation in most issues – you can search past issues online at [22.10]. There are (shorter) Conversations also in many issues of the non‐technical magazine Significance, published jointly by the Royal Statistical Society and the American Statistical Association.
Questions
QUESTION 22.1 (A)
Some statisticians have unexpected hobbies and interests. For example, the eminent British statistician, Maurice Kendall (1907–1983), applied his literary enthusiasm to writing an experimental‐design pastiche of Henry Wadsworth Longfellow’s poem, Hiawatha (reprinted as Kendall, 2003). W. Edwards Deming (1900–1993), the US statistician who promoted statistical quality control internationally, composed church music. And Persi Diaconis, Professor of Statistics at Stanford University, USA, is an expert conjuror.
- Maurice Kendall is associated with another literary contribution of some repute, this time as ghost writer. It contains a now quite famous statement about the nature of statistics, to the effect that it is not the numbers that matter but, rather, what you do with them. Where did this statement first appear, and who was the publicly credited author?
- Which US statistician, active during the 20th century, had typography as a hobby, and how did he apply that hobby to celebrating the importance of the normal distribution to scientific observation and experimentation?
QUESTION 22.2 (A)
What is the historical context of the diagram in FIGURE 22.3? Who constructed the original version of the diagram, and for what purpose? What is the name of this type of diagram?
QUESTION 22.3 (A)
- An arithmetic mean combines all the numerical values of the data in calculating the average, while finding a median requires only that the values be arranged in order of size. But what type of average involves first ordering the values and then combining some of them? In what situation would such an average be useful?
- In 1972, John Tukey was a co‐author of a major study of different sample estimators of the central value of a symmetric population distribution. What was the title of this study, what was its objective, and what did it have to say about the average asked about in part (a)?
QUESTION 22.4 (B)
The scene is an English country livestock fair about a hundred years ago. A large animal is displayed, and there is a competition to guess its weight when it has been slaughtered and dressed (i.e. prepared for cooking). For a small sum, anyone can submit a guess and compete for the prize for the most accurate guess. Later, a famous statistician examines the recorded guesses and writes a short article based on them. Who was the statistician, what type of animal was the centrepiece of the competition, and what was the statistician’s conclusion?
QUESTION 22.5 (B)
- The man who called himself ‘Student’ in almost all of his scholarly publications was William Sealy Gosset. What was Gosset’s day job when he partly solved the problem of finding the exact probability density function of what we know as Student’s t‐distribution? Why did he publish his result (Student, 1908) under a pseudonym?
- But should it really be Student’s t‐distribution? Here is some historical perspective to clarify this question. Given a random variable, X, normally distributed as N(μ, σ²), with σ² unknown, and given the mean, x̄, of a random sample of size n from this population, we know the t‐statistic for testing the null hypothesis H₀: μ = 0 as t = x̄/(s/√n), where s² = Σ(xᵢ − x̄)²/(n − 1). In his 1908 paper, however, Student (i.e. Gosset) found the exact distribution of a different statistic, which he denoted by z, namely, z = x̄/s, where he defined s² as Σ(xᵢ − x̄)²/n. Comparing the t‐ and z‐statistics here, we see that t = z√(n − 1). Several years later, R.A. Fisher, wanting a general test statistic that would unify tests on a single mean, on the difference of two means, on a regression coefficient and on the difference of two regression coefficients, worked out the exact distribution of the t‐statistic (exactly as defined above) – Fisher, himself, used the letter ‘t’ to denote the statistic – and published it in 1923. This exact distribution is a function involving a single parameter, which Fisher named ‘the degrees of freedom’. In symbols, his result was equivalent to the distribution of t = z√ν, where ν is the number of degrees of freedom. Whereas Gosset’s z‐statistic suits only the test on a single mean, Fisher’s t‐statistic generalises to suit each of the above tests, provided the appropriate numerical value is used for the degrees of freedom.
So, if Fisher, with his t‐distribution, gave a complete solution for a class of hypothesis tests when the population variance is unknown, why is it today called Student’s t‐distribution, rather than Fisher’s t‐distribution?
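As an aside, the algebraic relation t = z√(n − 1) set out above is easily verified numerically. Here is a minimal Python sketch (the sample values are arbitrary, chosen only for illustration):

```python
import math
import numpy as np

# An arbitrary illustrative sample -- any set of numbers would do.
x = np.array([1.2, -0.4, 0.8, 2.1, -1.3, 0.5, 1.7, -0.2])
n = len(x)
xbar = x.mean()

s_t = x.std(ddof=1)   # divisor n - 1, as in the modern t-statistic
s_z = x.std(ddof=0)   # divisor n, as Gosset defined s in 1908

t = xbar / (s_t / math.sqrt(n))
z = xbar / s_z
print(t, z * math.sqrt(n - 1))   # the two printed values coincide
```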
References
- Box, J.F. (1978). R.A. Fisher: The Life of a Scientist. Wiley.
- David, F.N. (1962). Games, Gods and Gambling – A History of Probability and Statistical Ideas. Griffin.
- Forrest, D.W. (1974). Francis Galton: The Life and Work of a Victorian Genius. Elek.
- Hacking, I. (1990). The Taming of Chance. Cambridge University Press.
- Heyde, C.C. and Seneta, E. (eds, 2001). Statisticians of the Centuries. Springer.
- Kendall, M.G. (2003). Hiawatha designs an experiment. Teaching Statistics 25, 34–35.
- Lehmann, E.L. (1993). The Fisher, Neyman‐Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association 88, 1242–1249.
- Magnello, M.E. (2009). Karl Pearson and the establishment of mathematical statistics. International Statistical Review 77, 3–29.
- McMullen, L. (1939). ‘Student’ as a man. Biometrika 30, 205–210. Reprinted in Pearson, E.S. and Kendall, M.G. (eds, 1970). Studies in the History of Statistics and Probability. Griffin.
- Plackett, R.L. (ed, 1990). ‘Student’ – A Statistical Biography of William Sealy Gosset. Clarendon Press, Oxford.
- Porter, T.M. (2004). Karl Pearson: The Scientific Life in a Statistical Age. Princeton University Press.
- Reid, C. (1982). Neyman – from Life. Springer.
- Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Freeman.
- Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.
- Stigler, S.M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press.
- Student [W.S. Gosset] (1908). The probable error of a mean. Biometrika 6, 1–25.
- Weisberg, H.I. (2014). Willful Ignorance: The Mismeasure of Uncertainty. Wiley.
- Zabell, S.L. (2001). Ronald Aylmer Fisher, pp. 389–397 in Heyde and Seneta (eds, 2001).
ONLINE
- [22.1] Galton, F. (1908). Memories of My Life. Methuen. At http://galton.org
- [22.2] O’Connor, J.J. and Robertson, E.F. (2003). Egon Sharpe Pearson. MacTutor History of Mathematics Archive. At http://www‐history.mcs.st‐andrews.ac.uk/Biographies/Pearson_Egon.html
- [22.3] Lindley, D.V. (1980). L.J. Savage – his work in probability and statistics. Annals of Statistics 8, 1–24. At https://projecteuclid.org/euclid.aos/1176344889
- [22.4] O’Connor, J.J. and Robertson, E.F. (2010). Leonard Jimmie Savage. MacTutor History of Mathematics Archive. At http://www‐history.mcs.st‐and.ac.uk/Biographies/Savage.html
- [22.5] Brillinger, D.R. (2002). John Wilder Tukey (1915–2000). Notices of the American Mathematical Society 49, 193–202. At http://www.ams.org/notices/200202/fea‐tukey.pdf
- [22.6] Anscombe, F.R. (2003). Quiet contributor: the civic career and times of John W. Tukey. Statistical Science 18, 287–310. At https://projecteuclid.org/euclid.ss/1076102417
- [22.7] http://www‐history.mcs.st‐andrews.ac.uk/HistTopics/Statistics.html
- [22.8] http://www.york.ac.uk/depts/maths/histstat
- [22.9] http://www.economics.soton.ac.uk/staff/aldrich/Figures.htm
- [22.10] https://projecteuclid.org/all/euclid.ss
24
Statistical ‘laws’
When people speak of ‘the law of gravity’, they are generally referring to what is more exactly called ‘Newton’s Law of Universal Gravitation’. This law states that the gravitational force (that is, the mutual attraction) between any two physical bodies is directly proportional to the product of their individual masses and inversely proportional to the square of the distance between them.
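In symbols, F = G m₁m₂/r², where m₁ and m₂ are the masses of the two bodies, r is the distance between them, and G is the universal gravitational constant.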
Why would such a scientific relationship be called a ‘law’? An analogy, while imperfect, may be helpful. Think about the word ‘law’ as it is used in parliament.
A law is a rule of behaviour that parliament has agreed is binding on people everywhere in society. Parliamentarians agree on what behaviour should become law only after having clear evidence of the expected social benefits of the law. Similarly, a physical law is a rule of behaviour that scientists have agreed to regard as binding on physical matter everywhere in nature. Scientists agree on what behaviour of matter should be called a law only after having clear evidence of its major scientific importance.
To this defining characteristic of a scientific law, we can add five more. When a scientific law represents a relationship between variables, that relationship can be expressed in simple terms: it relates the ‘response’ variable to just a few ‘stimulus’ variables. The relationship is usually causal: it implies not only a correlational connection between the stimulus variables and the response variable, but also a direct determining mechanism (for more on correlation and causation, see the answers to QUESTIONS 8.2 and 9.1). The relationship is stable – that is, the determining mechanism is unchanging over time and/or place. And because those who are able to identify such simple and stable causal relationships in an otherwise complex and turbulent world are, for that reason, quite remarkable scientists, scientific laws are mostly named in their honour – that is, they are eponymous, as the name ‘Newton’s Law’ illustrates (see CHAPTER 23 for more on eponymy).
Finally, because there can be no formal proof that a scientific law is universally true, even long‐established scientific laws are always vulnerable to being shown to be only approximations. In other words, they may need modification as observation becomes more acute, measurement becomes more accurate, and confirmatory experiments are conducted in more unusual or extreme situations. Newton’s Law is, again, a good example. Newton’s account of gravitational attraction implies that this force operates instantaneously, regardless of distance. This suffices as an excellent basis for Earth‐bound physics. However, this notion was contradicted by Einstein’s Theory of Relativity, a theory now empirically well confirmed over interplanetary distances.
Many other eponymous physical laws were established prior to 1900, including Boyle’s Law, Ohm’s Law, Hooke’s Law, and Kepler’s Laws. The 20th century was an era of huge growth in the social and behavioural sciences. It was natural, then, for scholars to ponder whether there are laws in these sciences, too. One way they could seek an answer was to search empirically for ‘law‐like’ relationships (that is, simple and stable relations among variables), using statistical methods.
Of course, a strong statistical correlation, together with a stable regression model, does not necessarily signal that a direct causal mechanism has been identified. However, it certainly is a constructive first step in that direction. Thereafter, one can theorise about a plausible general causal mechanism to explain the stability of the statistical findings. Just as important, one can map out the limits beyond which the causal mechanism is not expected to apply. In this way, a new law may be tentatively proposed, to be subjected to further tests for confirmation. Examples of this approach are given in Ehrenberg (1968).
‐‐‐oOo‐‐‐
Not all scientific laws represent relationships between variables. There are also statistically discovered laws that relate to the frequency distribution of just a single variable. It turns out that, for certain measured variables, the relative frequency of their repeated measurement in the real world is very well approximated by some standard probability model (see CHAPTER 13 for more on probability models). Indeed, that is precisely why such probability models became ‘standard’!
If a model’s fit remains close when applied to repeated measurement data on a particular variable collected in widely different settings, statisticians may ‘promote’ the model to the status of a probability law for that variable.
Take, for instance, the ‘random error of measurement’ of some fixed quantity. This is the variable for which the first probability law in the history of statistics was designated. It was Gauss who, in 1809, first proposed the probability model; later, it became known as the ‘normal law of error’. It was soon well confirmed that repeated measurement of some fixed quantity produces a roughly symmetric distribution of random measurement errors, x, which is well approximated by a normal probability distribution of the form:
f(x) = (1/(σ√(2π))) exp(−x²/(2σ²)),
where the parameter σ is the population standard deviation. We note that, in this context, it is reasonable to set the parameter μ (the population mean) to zero, since errors are equally likely to be positive or negative.
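As a rough illustration of this idea (though not, of course, Gauss’s derivation), here is a minimal simulation sketch in Python, in which each measurement error is the sum of many small, independent, symmetric disturbances:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each 'measurement' of a fixed quantity is perturbed by the sum of 40 small,
# independent, symmetric disturbances.
true_value = 10.0
disturbances = rng.uniform(-0.05, 0.05, size=(100_000, 40))
errors = (true_value + disturbances.sum(axis=1)) - true_value

# The simulated errors are roughly symmetric about zero, and the proportion
# lying within one standard deviation of zero is close to the normal's 68%.
sigma = errors.std()
print(f"mean error             = {errors.mean():+.5f}")
print(f"share within one sigma = {np.mean(np.abs(errors) < sigma):.3f}")
```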
When Gauss proposed the normal as a probability model, its elaborate form must have astonished many. It can equally astonish beginning students of statistics today. How could such an unobvious and abstruse mathematical function (they must wonder) ever have been hit upon? That seems to us a perfectly understandable reaction, if students are introduced to the mathematical function without any background.
It will be helpful background if students come to see that the function we know as the normal distribution was not plucked out of the air. It was already known to Gauss from work by de Moivre, several decades earlier, on the limiting form of the binomial distribution as the number of trials increases indefinitely. Gauss favoured it in 1809 precisely because – as we mention in CHAPTER 14 – it is the only symmetric continuous probability density function for which the mean of a random sample has the desirable property of being the maximum likelihood estimator of the population mean. The normal also underpins appealing statistical properties of many statistical tools, including point and interval estimators, and significance tests.
However, the normal probability model has its limitations! Though it has served statisticians superbly for over 200 years, the normal is not necessarily the best probability model for every symmetrically distributed unimodal variable.
Nor is it necessarily the best probability model for the mean of a large sample drawn randomly from any non‐normal population – despite what the Central Limit Theorem (CLT) promises. This powerful theorem (explained in CHAPTER 12) greatly widened the scope of the normal as a probability model after Gauss first established that model in 1809. Yet, the CLT has its limitations, too!
‐‐‐oOo‐‐‐
Let’s look now at a variable that will lead us to a probability law which is not the normal distribution. This is a law that has become increasingly significant over the past 50 years.
Since the 1960s, there have been many statistical studies of Stock Exchange data. Among the statistics studied was the daily average of relative price changes, over all the shares in the category labelled ‘speculative’ (i.e. shares liable to frequent strong stochastic shocks to their prices). It was soon noticed that the frequency distribution of these average short‐term relative price changes had many extreme values, both positive and negative. These empirical distributions were unimodal, and roughly symmetrical about a mean of zero, but the frequency in their tails was greater than a normal distribution would imply. In other words, the tails of the empirical distributions were ‘fatter’ (or ‘heavier’) than those of the normal (QUESTION 14.1 (c) shows just how ‘thin’ are the tails of the normal).
At first, attempts to model the mean of relative price changes proceeded by treating the extreme values as (alien) outliers, deleting these outliers from the data set and fitting the normal as a probability model to the central data values. Case‐by‐case explanations were then contrived, to account for the size and frequency of the outliers. The results of this piecemeal approach were not very satisfactory.
In 1963, Benoit Mandelbrot (1924–2010), a French‐American mathematician, proposed a new approach to the modelling challenge. He drew on a family of probability distributions identified by the French mathematician Paul Lévy (1886–1971), known as ‘stable distributions’. We have more to say about Mandelbrot’s work shortly. First it is useful to have a brief look at what exactly a stable distribution is, and what Lévy found out about this family.
Lévy’s explorations in this area were a by‐product of his research, over the years 1920–1935, on proofs of the CLT under progressively relaxed conditions. Recall, from CHAPTER 12, that the CLT says ‘if you draw a sample randomly from a population that is not normally distributed, the sample mean will nevertheless be approximately normally distributed, and the approximation will improve as the sample size increases’. We note that there is one condition that cannot be relaxed. The CLT is valid only for a population with a finite variance.
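For readers who like to see the CLT at work, here is a minimal simulation sketch in Python (assuming only numpy and scipy), using a heavily skewed population that does have a finite variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps, n = 20_000, 100

# Population: exponential -- heavily skewed, but with finite variance.
population_draws = rng.exponential(scale=1.0, size=50_000)

# Means of 20,000 samples, each of size 100, from that population.
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# The skewness of the sample means is far smaller than that of the
# population itself -- they are already close to normal in shape.
print("skewness of population   :", round(stats.skew(population_draws), 2))
print("skewness of sample means :", round(stats.skew(sample_means), 2))
```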
Means of random samples from a normal population have, of course, a normal distribution for every sample size. One might see this property of normal means as a sort of ‘trivial’ case of the CLT. But Lévy saw it differently! What he focused on was that here we have a case where the sample means have the same form of distribution as the individual sample values for every sample size. That made him wonder whether there might be a definable family of probability distributions that all have this property of ‘stability’.
Other mathematicians (Poisson and Cauchy in the 19th century and Pólya, Lévy’s contemporary) had discovered some individual members of the family, but it was Lévy who (in two short papers in 1923) elegantly characterised the entire family of stable distributions, both symmetric and non‐symmetric. Today, his results can be found in many advanced textbooks of probability and statistics.
Here is a sketch, without mathematical derivations, of some attributes of the family of stable probability distributions that Lévy discovered.
All the stable distributions are unimodal. They are unified by a ‘characteristic’ parameter (call it α) which lies in the range 0 < α ≤ 2. Stability is defined in a quite specific sense: if sample values are all drawn independently from the same stable distribution – say, the distribution with characteristic parameter value α* – then the sample mean will have the distribution with characteristic parameter value α* at every sample size.
Only two symmetric stable distributions have an explicit form of probability density function (pdf): the normal (corresponding to α = 2) and the Cauchy (corresponding to α = 1). For all others, a probability is defined formally in terms of the convergent sum to infinity of a rather forbidding algebraic expression. In practice, these probabilities are calculated by evaluating that sum up to any desired level of accuracy.
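In modern software, these numerical evaluations are built in. As an illustrative sketch (assuming a reasonably recent version of SciPy, whose levy_stable distribution takes the characteristic parameter as its first shape argument and a skewness parameter, here set to 0 for symmetry, as its second):

```python
from scipy import stats

# Upper-tail probabilities P(X > 3), computed numerically, for three
# symmetric stable laws. Note: in this parametrisation, alpha = 1 is the
# standard Cauchy, while alpha = 2 is a normal whose scale differs from
# that of the standard normal.
for alpha in (2.0, 1.5, 1.0):
    print(f"alpha = {alpha}: P(X > 3) = {stats.levy_stable.sf(3, alpha, 0):.4f}")
```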
None of the stable distributions (except the normal) has a finite variance. It follows that none of the stable distributions with α in the range 0 < α < 2 conforms to the CLT.
Lastly, here is the property that made the stable distributions so particularly interesting to Mandelbrot: all the distributions with α in the range 0 < α < 2 have fatter tails than the normal.
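As a concrete illustration of the stability property, take the Cauchy case (α = 1): the mean of n independent standard Cauchy values is again standard Cauchy, whatever the value of n. A minimal Python sketch, comparing empirical quantiles of such means with the quantiles of the standard Cauchy itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Means of 20,000 samples, each of n = 50 standard Cauchy values.
n, reps = 50, 20_000
means = rng.standard_cauchy((reps, n)).mean(axis=1)

# The quantiles of these means track the standard Cauchy's own quantiles --
# there is no drift towards normality, however large n is made.
probs = [0.75, 0.90, 0.95, 0.99]
print("empirical  :", np.round(np.quantile(means, probs), 2))
print("theoretical:", np.round(stats.cauchy.ppf(probs), 2))
```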
If you would like to read about the mathematics of stable distributions, we suggest you start with the accessible account of basic ideas in Borak et al. (2010), online at [24.1]. A rather more advanced, yet invitingly written, overview is given in chapter 9 of Breiman (1992).
‐‐‐oOo‐‐‐
Let us pause for a moment to look at the Cauchy distribution – a symmetric stable distribution that is perhaps less familiar to you than the normal. The Cauchy distribution has no finite mean or variance. That explains why its two parameters – a measure of centrality and a measure of spread – are positional measures. Its pdf is:
f(x) = {πλ[1 + ((x − θ)/λ)²]}⁻¹,
where π = 3.14159…, θ is the median and λ is the semi‐interquartile range. The standard Cauchy distribution is given by setting θ = 0 and λ = 1. Its pdf is f(x) = [π(1 + x²)]⁻¹.
The standard Cauchy distribution is graphed together with the normal distribution having the same median and semi‐interquartile range (μ = 0, σ = 1.4827) in FIGURE 24.1.
You can see that beyond about ± 3, the Cauchy has fatter tails than the normal. This is the property that makes the Cauchy more useful than the normal for modelling financial data having multiple extreme values.
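A quick numerical check of this claim, using scipy and the same matching of median and semi‐interquartile range as in FIGURE 24.1:

```python
from scipy import stats

# P(|X| > 3) under the standard Cauchy, and under the normal with the same
# median and semi-interquartile range (mu = 0, sigma = 1.4827).
cauchy_tail = 2 * stats.cauchy.sf(3)
normal_tail = 2 * stats.norm.sf(3, loc=0, scale=1.4827)
print(f"Cauchy: {cauchy_tail:.3f}   normal: {normal_tail:.3f}")
```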
‐‐‐oOo‐‐‐
It was not until 40 years after Lévy announced the entire family of stable distributions – the first class of fat‐tailed distributions to be identified – that the importance of this theoretical work was recognised in applied research. As already mentioned, it was the study by Mandelbrot (1963) that first demonstrated the need to accommodate the fat tails of empirical distributions of average short‐term relative changes in share prices. A better fitting model than the normal distribution was needed. Since the fat‐tailed stable distributions were to hand, Mandelbrot tried them out: this was their pioneering empirical application.
In his 1963 article, Mandelbrot, a mathematician, writes mostly in algebraic terms about the modelling issues he confronted. You may find it easier to read Fama (1963). Eugene Fama, an economist and Mandelbrot’s younger colleague, commends the practical significance of Mandelbrot’s ideas to the journal’s readership of economists. Fama continued in this research direction (see, for example, Fama, 1965), even after Mandelbrot turned to different topics (most famously, the geometry of fractals). In 2013, Fama shared the Nobel Prize in Economics, in part for the lasting impact that these successful early studies have had on the evolution of financial mathematics and statistics.
Since this early work on modelling aspects of speculative share prices, many other risk‐related financial variables, including some that are non‐symmetrically distributed, have been found to have fat‐tailed distributions. These, too, have been effectively modelled by members of the family of stable distributions.
To this literature can be added a remarkably extensive array of stable models of fat‐tailed variables in physics, geology, climatology, engineering, medicine and biology. There is also an array of multiple regression models where there are grounds for assigning the random disturbance a non‐normal stable distribution, rather than the more usual normal distribution.
With so many fat‐tailed variables, in such diverse contexts, well modelled by stable distributions, there is ample evidence for ‘promoting’ them to the status of stable laws.
There are, however, fat‐tailed variables – especially non‐symmetric ones – for which the stable laws do not provide a well‐fitting model. For these cases, there are now several other theoretical distributions which may be deployed instead. They include the lognormal distribution, the generalised hyperbolic distribution, the geometric distribution and the power laws investigated in QUESTION 24.3.
For the important statistical law called the ‘law of large numbers’ and two misconceived laws – the ‘law of averages’ and the ‘law of small numbers’ – see the answer to QUESTION 3.1.
Questions
QUESTION 24.1 (B)
The eminent French mathematician, probabilist, engineer and philosopher of science Henri Poincaré (1854–1912) taught at the Ecole Polytechnique in Paris at the peak of his career. At this time, he published a highly regarded textbook of probability (Poincaré, 1896). In chapter 10 of this book, Poincaré demonstrates algebraically how Gauss, in 1809, first obtained the probability density function of the normal law of error. To lighten the detailed mathematics, Poincaré interpolates an anecdote about himself and a colleague, the physicist Gabriel Lippmann. Perhaps they had been discussing the scientific community’s lack of interest in the true nature of disciplinary foundations, for Poincaré writes, about the normal law of error, ‘Everyone believes in it, Mr. Lippmann once told me, since empiricists suppose it’s a mathematical theorem and mathematicians that it’s an experimentally determined fact.’
Which of these, would you say, is the true nature of the normal law of error?
QUESTION 24.2 (B)
In a large collection of (say, five‐digit) random numbers, you would expect that the digits 1, 2, 3, 4 … 9, 0 would turn up with roughly equal frequency as the leading (that is, first) digit of those numbers. Surprisingly, however, in many real‐world collections of numbers (for example, the serial numbers of business invoices or the money amounts on electricity bills), some leading digits are actually more likely than others (zero is excluded as a leading digit). This phenomenon was stumbled upon in the 19th century from an incidental observation that earlier pages of books of logarithms were more ‘worn’ than later pages.
What probability distribution is applicable, in these circumstances, for the frequency of occurrence of leading digits? How does knowledge of this distribution assist auditors looking for accounting fraud?
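A tiny Python sketch makes the non‐uniformity easy to see. It tallies the leading digits of an illustrative collection of numbers (here the first 200 powers of 2, standing in for something like a file of invoice amounts):

```python
from collections import Counter

# Leading digits of the first 200 powers of 2 -- an illustrative stand-in
# for a 'real-world' collection of numbers.
numbers = [2 ** k for k in range(1, 201)]
counts = Counter(int(str(n)[0]) for n in numbers)

for digit in range(1, 10):
    print(digit, counts[digit])
```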
QUESTION 24.3 (B)
Newton’s Law of Gravitation is an instance of a power law. Many physical, economic and social variables have a frequency distribution that follows a power law. What, in general, is a power law?
More recently, aspects of website traffic on the internet have been found to follow a power law. Give an example in this context.
QUESTION 24.4 (C)
A particular power law that relates to discrete data is Zipf’s Law. This was initially proposed as a probability law for the relative frequency with which individual words appear in an extended prose text, regardless of language. Who was Zipf, and what is the functional relationship that bears his name?
Subsequently, Zipf’s Law has been found to apply in many other situations as well, notably the rank‐size distribution of the cities in any particular country. For your country, write down at least the first 15 major cities, in order by population size (measured in thousands), ranking the largest city 1, the next largest 2 and so on. Next, create a scatter diagram of the data, with the X‐axis showing the logarithm of population and the Y‐axis the logarithm of the city rank. What do you see? If the software is available, fit a least squares regression line to the scatter. What do you find? Relate your finding to Zipf’s Law.
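If you would like to automate the fitting step, here is a minimal Python sketch. The population figures are purely hypothetical placeholders (in thousands), to be replaced by the real figures for your own country:

```python
import numpy as np

# Hypothetical city populations (in thousands), already ordered from largest
# to smallest -- replace these placeholders with real figures.
population = np.array([5000, 2600, 1700, 1300, 1050, 880, 760, 660, 590,
                       530, 480, 440, 410, 380, 360], dtype=float)
rank = np.arange(1, len(population) + 1)

# Least-squares line of log(rank) on log(population); an approximately
# straight scatter on these axes is the pattern to look for.
slope, intercept = np.polyfit(np.log10(population), np.log10(rank), 1)
print(f"fitted slope = {slope:.2f}, intercept = {intercept:.2f}")
```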
QUESTION 24.5 (B)
In the 1950s, a French husband‐and‐wife team of psychologists/statisticians published their discovery of a remarkable phenomenon, that famous sports people are statistically significantly more likely to have been born when the planet Mars was in particular positions in the sky – the so‐called ‘Mars effect’. Who were they? Has their discovery a record of confirmation since then that would justify giving it the status of a law?
References
- Breiman, L. (1992). Probability. Society for Industrial and Applied Mathematics (this book is a reprint of the original Addison‐Wesley publication of 1968).
- Ehrenberg, A.S. (1968). The elements of lawlike relationships. Journal of the Royal Statistical Society, Series A 131, 280–302.
- Fama, E.F. (1963). Mandelbrot and the stable Paretian hypothesis. The Journal of Business 36, 420–429.
- Fama, E.F. (1965). The behavior of stock market prices. The Journal of Business 38, 34–105.
- Mandelbrot, B. (1963). The variation of certain speculative prices. The Journal of Business 36, 394–419.
- Poincaré, H. (1896). Calcul des Probabilités, Gauthier‐Villars (2nd edition in 1912).
ONLINE
- [24.1] Borak, S., Misiorek, A. and Weron, R. (2010). Models for heavy‐tailed asset returns. SFB649 Discussion Paper 2010‐049, Humboldt University, Berlin. Downloadable at http://hdl.handle.net/10419/56648. (Note: this paper also appears as chapter 1 in Cizek, P., Härdle, W. and Weron, R. (eds, 2011). Statistical Tools for Finance and Insurance, 2nd edition. Springer).