PERSONAL HISTORY OF BAYESIAN STATISTICS
Professor of Statistics, Universities of Wisconsin-Madison and
6. TO THE STARS
IN THE NINETIES
The human race
will go to the stars (James A. Koutsky, 1940-1994)
James A. Koutsky, Professor of Chemical Engineering, University of
In their highly influential
JASA 1990 paper, Alan Gelfand and Adrian Smith projected the
Bayesian paradigm towards the stars when they recommended Markov
Chain Monte Carlo (MCMC) simulations as a way of computing Bayesian
estimates and inferences for the parameters in a wide range of
complicated sampling models, in situations where it was well-nigh
impossible to achieve a solution using ordinary Monte Carlo or
Importance Sampling techniques. Alan and Adrian are to be
congratulated, throughout the ages, for their wonderful insights.
MCMC is a popular
special case of the Metropolis-Hastings Algorithm which was first
introduced in the Journal of Chemical Physics in 1953 with
the objective of facilitating calculations by fast computing
machines. The algorithm was generalised by the Canadian
mathematician W. Keith Hastings, who introduced it into the
Statistics literature in 1970 in a widely cited paper in
Biometrika. Acceptance Sampling is another important special
case which can be employed in some situations where MCMC is
difficult to implement. See the source papers by Nicholas
Metropolis, Arianna and Marshall Rosenbluth, and Augusta and
Edward Teller, and by W. Keith Hastings, and the easy-to-read
article by Siddhartha Chib and Edward Greenberg in The American Statistician (1995).
MCMC is similar to the
Gibbs Sampling approach developed by Stuart and Donald Geman in
their 1984 paper ‘Stochastic Relaxation, Gibbs Distributions, and
the Bayesian Restoration of Images’ in IEEE Transactions on
Pattern Analysis and Machine Intelligence. The authors reported
the first ever proof of the convergence of the simulated annealing
algorithm in this, the highest cited paper in Engineering.
Stuart Geman is James
Manning Professor of Applied Mathematics at Brown, and Donald Geman
is an internationally eminent professor at Johns Hopkins.
W. Keith Hastings was
awarded his Ph.D. at the University of Toronto in 1962. His thesis
topic was Invariant Fiducial Distributions, and his Ph.D.
supervisors were Don Fraser and Geoffrey Watson.
Hastings wrote his
seminal 1970 paper while an Associate Professor of Mathematics at UT,
and after giving consulting advice to John Valeau, a UT Professor of
Hastings received his
tenure from the University of Victoria in British Columbia in 1974.
Now retired, he is still living in Victoria. It is unclear to me
whether Keith has been made aware of the full impact of his work
upon Bayesian inference.
One of the consequences of Gelfand and Smith’s (albeit somewhat
politically charged) recommendation was that Bayesian techniques
could now be potentially used to analyse a large number of detailed
models whose justification was based upon specialised scientific or
economic theory, including many of the complex theoretical models
that have been devised by experts in many other subjects. More
recently it has also been used to calculate the Deviance Information
Criterion DIC for model comparison, as an alternative to the
occasionally prohibitively complicated maximum likelihood
procedures which are needed to calculate Akaike’s criterion AIC.
While the Bayesian MCMC
‘cult’ has proved to some to be somewhat all-constraining, MCMC and
its extensions have enormously benefited the Bayesian paradigm in
all sorts of ways. Areas of application which have benefited from
MCMC include medical diagnosis, ecology, geology, computer science,
artificial intelligence, machine learning, genetics, astrophysics,
archaeology, psychometrics, educational performance, and sports
modeling. During the 1990s, many Bayesians focussed their energies
on applying MCMC to medical and pharmaceutical data sets, and others
to various models in Econometrics.
Suppose that two vectors
of parameters θ and ξ possess a joint posterior density (given the
realizations of the observations from a specified sampling model)
which is proportional to h(θ,ξ ) as a function of θ and ξ. Then the
MCMC simulations may, if technically possible, be completed as follows:
1. Simulate a θ vector from
the conditional distribution of θ given the latest simulated vector for ξ.
2. Simulate a ξ vector from
the conditional distribution of ξ given the latest simulated vector for θ.
3. Keep cycling between (1)
and (2) until you have achieved enough simulations of (θ, ξ) for the
purposes that you wish to put them to.
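As a minimal sketch of the two-step cycle above, consider a toy target that is not taken from the text: a standard bivariate normal distribution for scalar θ and ξ with correlation ρ, for which both conditional distributions are normal. The function names, the value ρ = 0.8, the seed, and the burn-in length of one thousand are all illustrative assumptions, and the sketch is not a reproduction of any author's software.

```python
import random


def gibbs_bivariate_normal(rho, n_keep, burn_in=1000, xi_init=0.0, seed=0):
    """Cycle between theta | xi and xi | theta for a standard bivariate normal.

    Toy target (an assumption for illustration): zero means, unit variances,
    correlation rho, so each conditional is N(rho * other, 1 - rho**2).
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    xi = xi_init  # an initial value is needed for xi only
    draws = []
    for it in range(burn_in + n_keep):
        theta = rng.gauss(rho * xi, sd)  # step 1: theta given latest xi
        xi = rng.gauss(rho * theta, sd)  # step 2: xi given latest theta
        if it >= burn_in:  # discard the burn-in cycles
            draws.append((theta, xi))
    return draws


def simulation_average(g, draws):
    """Average g(theta, xi) over the retained simulations."""
    return sum(g(t, x) for t, x in draws) / len(draws)


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(rho=0.8, n_keep=20000)
    # A bounded function: the indicator that both components are positive.
    both_positive = simulation_average(
        lambda t, x: 1.0 if (t > 0 and x > 0) else 0.0, draws
    )
    print("estimated P(theta > 0, xi > 0):", both_positive)
```

For this toy target the estimate can be checked against the known orthant probability 1/4 + arcsin(ρ)/(2π), which is roughly 0.398 when ρ = 0.8. Successive cycles are correlated, so the simulation error is larger than it would be for the same number of independent draws.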
The conditional distributions in (1) and (2) need to belong to simple families of
distributions from which these simulations are possible. To
implement this procedure you will need to specify an initial vector
for ξ. Then discard, say, the first thousand of your simulations for
(θ, ξ) as a ‘burn-in’ period, in order to minimise the
dependency of your subsequent simulated vectors upon your initial
vector for ξ. Finally, pretend that all further simulations for (θ,
ξ) are ordinary Monte Carlo simulations, and complete your
calculations accordingly. The posterior expectation of any bounded function g(θ, ξ) of θ and ξ may be
computed by averaging g(θ, ξ) with respect to a sufficiently large
number of your further simulations for (θ, ξ). The marginal
posterior density of any scalar component λ of (θ, ξ) may be
computed, often quite efficiently, as follows:
1. For each fixed λ, average h(θ, ξ)
with respect to your further simulations for the remaining
elements of (θ, ξ). This computes an ‘unnormalised marginal
posterior density’ of λ.
2. Use another numerical
routine to integrate your unnormalised density across all possible
values of λ. Then the normalised density may be computed by dividing
the unnormalised density by your integral. If the parameter λ is
unbounded then its posterior moments, if they exist, may be computed
by reference to this density and the appropriate numerical
convergence’ of these procedures is theoretically guaranteed, it is
typically quite difficult to investigate whether the standard errors
of simulation are finite. Indeed MCMC can take a prohibitively long
time to converge and sometimes never appears to do so. It may also
be difficult to check whether the procedure has converged or not,
since it can appear to almost converge before diverging all over the
place. In such circumstances it is best to let your simulations run
for a while longer to see whether they settle down again. Any lack
of convergence could be exaggerated by rounding errors in your
In its more general
form, MCMC addresses k subvectors of the model parameters, and
refers to successive simulations from the conditional posterior
distributions of the individual subvectors, given the remaining k-1
subvectors. The procedure converges best in models which have been
parsimoniously parametrized and shown to fit the data well via a
preliminary data analysis. For poor fitting models, or models with
too many parameters, you may well find yourself iterating until the
cows come home, and you might wish to go to the beach to take a
This simple account is
based on my experiences and those of my research associate Orestis
Papasouliotis when he was analysing multivariate hierarchical ANCOVA
models and the Scottish sex offender data. His methodology was
reported in 2000 in the first four chapters of Orestis’ University
of Edinburgh Ph.D. thesis, and in our 2002 article in the
Encyclopedia of Environmetrics, which was reproduced in the
on-line John Wiley library in 2006.
My account is different
in several respects from the somewhat tentative suggestions made by
Gelfand and Smith in their seminal 1990 paper. Following that
monumental leap for mankind, at least one handsome Bayesian was
reportedly occasionally given to blinking rather more than usual
while he was confirming the convergence of his simulations! However,
the MCMC methodology has been considerably refined and generalised
over the years. See, for example, the excellent books authored by the
Brazilian duo Dani Gamerman and Hedibert Lopes, and by Wally Gilks,
Sylvia Richardson and David Spiegelhalter.
Arnoldo Frigessi, Fabio
Martinelli and Julian Stander applied these ideas to Markov random
fields in their 1997 paper in Biometrika. Julian Stander and
Yuzhi Cai’s 2008 paper about quantile self-exciting threshold
autoregressive models is also worth a read. Just grab the right
issue of the Journal of Time Series Analysis from your
Julian Stander leads a
small Bayesian group in my hometown of Plymouth, Devon, where he is
a Reader in Statistics. I was born in Flete House, Yealmpton, near
the estuary of the Erme, in 1948, shortly after Adrian Smith was
born in nearby Dawlish, on the estuary of the Exe, where the waves
lash against the windows while the train chugs around the coast.
Adrian was raised in the delightful seaside village of Teignmouth.
Julian’s contributions add colour to the Bayesian Devonian
tradition, and his applied Bayesian colleague David Wright, who
lives in Ivybridge, has previously assisted both him and me with our
research endeavours. Dennis Lindley lives just over the border, in
Somerset, where the cider is pronounced ‘zider’ but isn’t as strong
as the Buckfastleigh scrumpy in Devonshire. Folk from the West
Country are sometimes known as ‘Janners’.
The coast train to Plymouth at
[Author's Note: On the fifth of February 2014,
part of the seawall at Dawlish was destroyed during a storm, and the
trains to Plymouth were cancelled until further notice. Maybe the
Gods of Synchronicity have indeed been reading my manuscript.]
Before using MCMC you
should always check that your problem cannot be solved analytically
or by numerical integration, or sufficiently well approximated by a
more readily computable technique that might involve a multivariate
normal or a conditional Laplacian approximation. Nowadays, you don’t
always need to include MCMC in your paper to get it accepted. Please
remember that while MCMC can be used to calculate theoretical
solutions, it can’t compensate for any limitations in your
During the 1990s, a broad spectrum of fascinating practical
applications and fresh theoretical innovations came continuously
into view. Their diversity created an impressively complex kaleidoscope
through which to view the success of the Bayesian paradigm.
In their 1991 article in the Canadian Journal
of Statistics, Joe Gastwirth, Wesley Johnson, and Dana Reneau
described an excellent Bayesian analysis of some AIDS/HIV data, and
in their 1994 paper in Biometrics, Scott Zeger and Peter Diggle
reported a more complicated empirical hierarchical Bayes approach to
another AIDS/HIV problem. The collection and analysis of AIDS/HIV
data was co-ordinated during the 1980s by Steve Lagakos of Harvard
University, and he continued to do so until his tragic death in a
car accident in 2010. Steve believed that one of the key roles of
statisticians was to emphasise what you COULDN’T conclude from
haphazardly collected AIDS/HIV data. Bayesians should always beware
the possibly profound influences of confounding variables,
influences which could, for example, be concealed by the
complexities of an apparently all-encompassing hierarchical model.
For instance, it is now well-known in the at-risk community that,
while usually highly beneficial, anti-viral drugs such as AZT can
cause a number of serious physical conditions which were previously
thought to have been caused by AIDS itself. When constructing your
sampling model it will sometimes be important to take this
phenomenon into account.
Scott Zeger is Professor
of Biostatistics at Johns Hopkins University in Baltimore. More
recently his work has been on Bayesian models for the etiology of
children’s pneumonia, and on methods for personalised medicine,
which at Johns Hopkins they call ‘personalised health’.
I recently watched a
performance of the musical ‘Hairspray’ which is set in Baltimore,
which I once frequently visited. It was fun identifying with all the
Stephen King-style characters once again, in particular Link Larkin
and Edna Turnblad. Scott’s totally all-American, of course.
Michael Lavine wrote two
single-authored papers in JASA in 1991, on the sensitivity of
Bayesian inference, and on Bayesian robustness. Mark Berliner
published a paper in JASA on likelihood and Bayesian
inferences in chaotic systems.
In 1992, Nhu Le and Jim
Zidek co-authored an important paper in the Journal of
Multivariate Analysis titled ‘Interpolation with
uncertain covariances: A Bayesian alternative to kriging’. In 1995,
Jim Zidek and Constance van Eeden incorporated their group Bayesian
procedures for estimating the exponential mean as a component of
their review of the Wald theory in the edited volume Statistical
Decision Theory and Related Topics. Jim Zidek first
learnt about the nuances of the Bayesian paradigm during his
sabbatical year at University College London during 1971-2. After a
tentative start during the 1970s, he became one of the leading
applied Bayesians in Canada.
Jim Zidek FRSC
Constance van Eeden is
nowadays an honorary professor of statistics at the University of
British Columbia. She co-authored three Bayesian papers with Alec
Charras during the early 1990s, and she has also made many
Bayes-related contributions to decision theory. She is regarded as
the grande dame of Canadian Statistics.
Canada is indeed rich
in Bayesians. For example, Irwin Guttman’s areas of interest include
statistical inference, design problems, variable selection, and
medical diagnosis. Now a Professor Emeritus at the University of
Toronto, Irwin worked in the Department of Statistics there during
the 1970s and 80s when it was teeming with Don Fraser’s highly
political structural fiducialists. In between time, he served as
Chairman of the once-celebrated Department of Statistics at SUNY at
Buffalo. Irwin is a compulsive Bayesian, a pal of Norman Draper, and
one of our most likeable personalities.
Prem Goel and N.
Sreenivas Iyengar co-edited the splendid volume Bayesian
Analysis in Statistics and Econometrics in 1992. This was based
on the papers presented at an Indo-American Bayesian workshop in
Bangalore, India, in 1988.
Prem Goel advising a student
Prem Goel is Professor
of Statistics at Ohio State. His many areas of research interest
include Bayesian hierarchical modeling for non-linear dynamic
systems, and image processing for automatic pattern recognition of
vehicles in airborne and high-resolution satellite images. He is
In his outstanding 1992
paper in Statistica Sinica, Jun Shao proposed some ingenious
empirical Bayes estimators for several heteroscedastic variances in
the linear statistical model. He moreover investigated the
invariance, robustness, asymptotic, and mean squared error
properties of his estimators. This was a very thorough study out the
In their 1992 JASA
papers, Ludwig Fahrmeir used posterior modes to extend the Kalman
filter to non-linear multivariate dynamic models, Isabella
Verdinelli and Jay Kadane co-authored the paper ‘Bayesian Design for
Maximising Information and Outcome’, and Guido Consonni and Piero
Veronese proposed some conjugate priors for exponential families
with quadratic variance functions.
Still in 1992, John Hsu and
I reported a Bayesian procedure in the Annals of Statistics
for drawing inferences about the elements of the p × p covariance
matrix C of a multivariate normal distribution with a non-conjugate
prior distribution with a very flexible parametric structure when
contrasted with the quite restrictive parameterization of the
conjugate inverted Wishart distribution first proposed by Gwyn Evans
The matrix logarithm A of a positive definite covariance matrix C
possesses the same eigenvectors as C, but its eigenvalues are equal
to the logs of the corresponding eigenvalues of C. Let q=p(p+1)/2,
and let α denote a q × 1 vector consisting of the diagonal and upper
triangular elements of A, arranged in some specified order. Then α
is unconstrained in q-dimensional real space. John Hsu and I took
the prior distribution of α to be multivariate normal with specified
mean vector μ and covariance matrix D. There are q(q+3)/2 prior
parameters in this specification, when compared with the q+1
parameters appearing in the inverted Wishart distribution.
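The matrix logarithm transformation can be illustrated numerically. The following sketch is not the authors' code: it assumes a small hand-picked covariance matrix, and it uses a plain cyclic Jacobi eigenvalue routine (a swapped-in textbook technique) so that no external library is needed. All function names are hypothetical.

```python
import math


def jacobi_eigh(S, sweeps=50, tol=1e-24):
    """Eigenvalues and eigenvectors (columns of V) of a symmetric matrix S."""
    p = len(S)
    A = [row[:] for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:  # stop once the off-diagonal mass is negligible
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if A[i][j] == 0.0:
                    continue
                # Rotation angle chosen to annihilate A[i][j].
                theta = (A[j][j] - A[i][i]) / (2.0 * A[i][j])
                t = math.copysign(1.0, theta) / (
                    abs(theta) + math.sqrt(theta * theta + 1.0)
                )
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(p):  # A <- A J (columns i and j)
                    aki, akj = A[k][i], A[k][j]
                    A[k][i], A[k][j] = c * aki - s * akj, s * aki + c * akj
                for k in range(p):  # A <- J^T A (rows i and j)
                    aik, ajk = A[i][k], A[j][k]
                    A[i][k], A[j][k] = c * aik - s * ajk, s * aik + c * ajk
                for k in range(p):  # accumulate eigenvectors: V <- V J
                    vki, vkj = V[k][i], V[k][j]
                    V[k][i], V[k][j] = c * vki - s * vkj, s * vki + c * vkj
    return [A[i][i] for i in range(p)], V


def sym_matrix_function(S, f):
    """Apply f to the eigenvalues of S while keeping its eigenvectors."""
    vals, V = jacobi_eigh(S)
    p = len(S)
    return [
        [sum(V[i][k] * f(vals[k]) * V[j][k] for k in range(p)) for j in range(p)]
        for i in range(p)
    ]


def matrix_log(C):
    """Matrix logarithm A of a positive definite covariance matrix C."""
    return sym_matrix_function(C, math.log)


def vech_upper(A):
    """Diagonal and upper triangular elements of A, row by row (one possible order)."""
    return [A[i][j] for i in range(len(A)) for j in range(i, len(A))]


def prior_parameter_counts(p):
    """Return q, q(q+3)/2 (normal prior on alpha) and q+1 (inverted Wishart)."""
    q = p * (p + 1) // 2
    return q, q * (q + 3) // 2, q + 1


if __name__ == "__main__":
    C = [[2.0, 1.0], [1.0, 2.0]]  # illustrative covariance matrix
    alpha = vech_upper(matrix_log(C))
    print("alpha:", alpha)
    print("q, normal-prior count, inverted-Wishart count:", prior_parameter_counts(2))
```

Applying `sym_matrix_function` with `math.exp` to the matrix logarithm recovers the covariance matrix, which is why any real-valued α corresponds to a valid positive definite C. The parameter counts returned for p = 2 are q = 3, nine for the multivariate normal prior on α, and four for the inverted Wishart.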
In the special case
where the prior mean matrix of A assumes intra-class form, we
propose specifying two further prior parameters, so that there are
in total four prior parameters. This expresses more flexible degrees
of belief than the inverted Wishart prior formulation employed by
Chan-Fu Chen in his 1979 JRSSB paper, which invokes three
We applied our general
prior specification to the situation where the observation vectors
constitute a random sample from a multivariate normal distribution
with zero mean vector and covariance matrix C. However, with minor
algebraic adjustments (e.g. replacing n by n-1), our procedure can
also be employed in the situation where the unknown population mean
vector θ is a priori uniformly distributed over p-dimensional
explicit approximate Bayesian inferences and exact importance
sampling procedures (which are based on simulations from a
generalised multivariate t-distribution) refer to the mathematical
physicist Richard Bellman’s recursive solution of a Volterra
integral equation, a second-order Taylor Series approximation to the
log-likelihood of α, and a spectral representation, in terms of its
eigenvalues and eigenvectors, of the maximum likelihood estimate of A.
In their insightful
2013 paper in the Journal of Computational and Graphical
Statistics, Xinwei Deng and Kam-Wah Tsui generalise our
multivariate normal approximation to the likelihood of α to the
situation where the sample covariance is non-singular, and show that
the quadratic form in their special case prior density behaves like
a roughness penalty. Their posterior estimate of C simultaneously
regularizes the smallest and largest eigenvalues of the covariance
In 1996, Tom Y.M.
Chiu, Kam-Wah Tsui and I extended our 1992 paper in an article in
JASA by formulating the matrix logarithmic covariance model
for n independent multivariate normal vectors with different
covariance matrices. We indicated that the model could be
analysed using Bayesian, as well as maximum likelihood, methods.
relating to the matrix logarithm of a covariance matrix have since
been referenced by a number of authors in the Econometrics
literature, and applied and extended to spatial processes, random
effects models and multivariate time series models, some of which
provide alternatives to the classic stochastic volatility models, of
the type described by Neil Shephard and Stephen Pitt in their 1996
paper in Bayesian Statistics 6.
See, for example, p224 of Introduction to Spatial
Econometrics by James LeSage and Kelley Pace, the review of
multivariate stochastic volatility by Manabu Asai, Michael McAleer
and Jun Yu in Econometric Reviews (2006), and the matrix exponential
GARCH time series models investigated by Hiroyuki Kawakatsu in the
Journal of Econometrics (2006).
In their 2013 CIRJE discussion paper ‘Matrix exponential
stochastic volatility with cross leverage’, Tsunehiro Ishihara,
Yasuhiro Omori, and Manabu Asai extend our Bayesian ideas in quite
brilliant fashion, while describing the 1996 paper by Chiu et al as
I have frequently been ridiculed for
suggesting that the methodology for the linear model approach with
unequal variances described in my 1975 Technometrics paper
yielded, as a special case, an early approach for handling
stochastically volatile data. However, several authors have now
shown that multivariate extensions of my hierarchical 1975 model
can handle stochastically volatile data in even more general
terms. It's time to stop laughing, Nick Polson. You too, Neil
In 1993, Julian Besag and
Peter Green read a paper to the Royal Statistical Society about
modern Bayesian computational methods in spatial statistics. It was
Julian Besag F.R.S.
(1945-2010) was known chiefly for his work in spatial statistics,
with applications in epidemiology, image analysis, and agricultural
science, and in Bayesian computation using MCMC. I recall drinking
with him in Newcastle, Leamington, Oxford, Madison, and San
Francisco (see Author’s Notes below), and him advising me in
the early 1970s that ‘the overriding problem with Bayesian inference
is that the model’s never right’. I will always remember him. Both
of us had a tendency for speaking the honest truth, and I,
personally, have no problem with that.
Also in 1993, the
eminent medical statistician Jane Hutton co-authored ‘Bayesian
sample size calculations and prior beliefs about sexual abuse’ with
R.G. Owens, and ‘A Bayesian analysis for case control studies in
cancer epidemiology’ with Deborah Ashby and Magnus McGee. She
published a further article about Bayesian epidemiology with Deborah
Ashby in 1996. In 2012, she, Lorna Barclay, and Jim Smith reported
their experiences while embellishing a Bayesian network using a
chain event graph, and also co-authored a paper entitled ‘Chain
graphs for informed missingness’ in Bayesian Analysis. Jane
always mixes high quality mathematics with practical common sense.
In their 1993 papers in
JASA, Kung-Sik Chan reported on the asymptotic behaviour of
the Gibbs sampler, Richard McKelvey and Thomas Palfrey described
their Bayesian sequential study of learning in games, and Jim Berger
and Dongchu Sun reported their Bayesian analysis of the Poly-Weibull
Still in 1993, the
International Society for Bayesian Analysis (ISBA) held
their first ever conference, in the Hotel Nikko in San Francisco,
concurrently with the annual meetings of the Institute of
Mathematical Statistics (IMS) and the American Statistical
J. Stuart Hunter wrote
that he would address the conference dinner with the following
You asked for a ‘title of
my presentation’. I do not plan to do more than confess my
Bayesianism, and say a few words of greetings as the president of
Since then, ISBA has evolved
into a broadly-based interdisciplinary organization, linking many
areas of science, medicine and socio-economics. The Society’s
electronic journal Bayesian Analysis has become a popular and
much-respected resource, thanks to the efforts of Rob Kass, Brad
Carlin, and several other leading statisticians including Angelika
van der Linde. Bayesian Analysis currently has the sixth
highest impact in the list of 117 Statistics and Probability
journals. ISBA later took over the organizational responsibilities
for the Bayesian Valencia Conferences, but with Jose Bernardo still
playing a leading role.
Tom Leonard, Arnold
Zellner and other Bayesians attending the inaugural ISBA
conference in the Hotel Nikko in San Francisco in 1993.
The four other Bayesians are Gordon Kaufmann, Wes Johnson,
Carl Morris and Shanti Gupta
Brad Carlin and his
colleagues made numerous and various important contributions during
the 1990s. They include an article with Stuart Klugman about
Hierarchical Bayesian Whittaker graduation, which appeared in the
Scandinavian Actuarial Journal, a discussion paper with
Kathleen Chaloner, Tom Louis and Frank Rhame on elicitation,
monitoring and analysis for an AIDS/HIV trial, an application of
MCMC to model choice with Sid Chib, ‘Bayesian Tobit modeling of
longitudinal ordinal clinical trial compliance data with
non-ignorable missingness’, with Mary Kathryn Cowles and John
Connett, which related to a lung health study, and an excellent
1996 comparative review in Statistical Science, with Mary
Cowles, of convergence diagnostics for MCMC, where blinking was not
Brad Carlin and Tom
Louis’s bestselling text Bayesian Methods for Data Analysis
combines the best elements of Bayesian and Fisherian Statistics. And
Brad is doubtlessly the most humorous musical Bayesian in the entire
world, as evidenced by his Bayesian Songbook.
In 1994, the much
respected Indian statistician Prakash Laud was appointed Acting
Director and Professor in the Division of Biostatistics at the
University of Missouri. His medical interests include the treatment
of injuries, and breast cancer and osteoporosis screening. He
specialises in parametric and semi-parametric Bayesian techniques
for generalised linear and mixed effects models, models for time to
event data, and genetic association studies.
In two joint papers
published in 1994 in JRSSB and Biometrika, Nick Polson
and Gareth Roberts investigated the geometric convergence of the
Gibbs sampler, and Bayes factors for discrete observations from
diffusion processes. In the same year, Eric Jacquier, Nick Polson
and Peter Rossi published their wonderful invited discussion paper
‘Bayesian Analysis of Stochastic Volatility Models’. This seminal
contribution was named one of the most influential articles in the
20th Anniversary issue of the Journal of Business
and Economic Statistics.
applications were important. The authors used their highly complex MCMC computations for the analysis of stocks and portfolios. Maybe
their models were simple enough to facilitate a more algebraically
explicit approximate Bayesian analysis. It might, for example, be
possible to find some stochastic approximations which yield easily
applicable updating formulae of the Kalman type.
Nick Polson obtained
his Ph.D. in 1988 from the University of Nottingham, where he was
supervised by Adrian Smith. He is currently Robert Law Jr. Professor
of Econometrics and Statistics at the University of Chicago. He is
also an excellent gossip. I don’t know what historians would do
without the likes of Professor Polson.
Jon Forster and Allan Skene published some computing algorithms in Statistics and
Computing in 1994 for the marginal posterior densities of the
parameters of multinomial distributions. In 1998, Jon Forster and
Fred Smith reported their model-based inferences from categorical
survey data subject to non-ignorable nonresponse in JRSSB.
Jon Forster has an
outstanding track record in the Bayesian analysis of categorical and
multivariate ordinal data, with a view to applications in the Social
Sciences. He is Professor of Mathematics at the University of
Southampton, where he is a very fine teacher of Bayesian technology.
In their 1994 papers in
JASA, Thomas Severini showed how to derive approximate
Bayesian inferences when the prior information is summarised by a
system of interval estimates, Ming-Hui Chen developed a procedure
for importance weighted marginal posterior density estimation, and
Daniel Phillips and Adrian Smith constructed faces using
hierarchical template modeling.
During the mid-1990s there were
only a handful of active Bayesian statisticians in Germany. In 1995,
Ludwig Fahrmeir and Gerhard Tutz helped to spread Bayesian ideas
with their book Multivariate Statistical Modeling based on
Generalised Linear Models. In 1995, Angelika van der Linde
published her Bayesian interpretation of smoothing splines in
Test, and in 2000 she reported her reference priors for
smoothing and shrinkage parameters, in the Journal of Planning
Angelika is a well-known
and highly accomplished Bayesian, with an excellent perspective on
life. She recently retired as Extraordinary Professor of Mathematics
at the University of Bremen.
former colleague Angelika van der Linde (Oldenburg, 2012)
Ludwig Fahrmeir is
Emeritus Professor of Statistics at the University of Munich. His
recent research interests include Bayesian inference,
regularisation, smoothing and prediction, and applications in many
areas, including childhood morbidity, forest health, and survival
In 1997, Katja Ickstadt
of the University of Dortmund published an important joint paper in
JASA with Nicky Best and Robert Wolpert, on spatial Poisson
regression for health and exposure data. Katja has published
numerous Bayesian papers on medical topics ever since, and nowadays
is one of the leading Bayesian statisticians in Germany.
In 1995, James A. Smith,
Michael Goldstein, Peter Craig and Allan Seheult read an invited
paper to the Royal Statistical Society about their linear Bayes
approach for matching hydrocarbon reservoir history. The expert
knowledge of the reservoir engineer was incorporated. The
contributions of the fledgling first co-author were, according to
one of his co-authors, close to zilch (and reportedly even less than
zilch!). This was apparently since he was quite frequently
side-tracked, when working on his PC, by his socially important
obligations as a Boy Scout cub leader.
Christopher Bishop’s 1995 text Neural Networks for Pattern Recognition took many
previously published advanced Bayesian techniques into Artificial
Bishop was to take
numerous pre-existing Bayesian semi-parametric and hierarchical
Bayes techniques even further into Machine Intelligence in his 2006
book Pattern Recognition and Machine Learning. The positive
impact of this transfer of knowledge upon the disciplines of
Artificial Intelligence and Machine Intelligence has been enormous.
It’s rather like the way the Moors handed over much of their vast
knowledge of the ancient Greek and Roman cultures to the Christians
during the 11th century, in the Spanish city of Toledo.
Christopher Bishop is
to be congratulated on his tremendous perceptions and insights which
have led to immense advances in Machine Intelligence, following very
much in the spirit of Alan Turing. He is a Distinguished Scientist
at Microsoft Research Ltd. in Cambridge, England, and Professor of
Computer Science at the University of Edinburgh.
Michael I. Jordan has
made similarly impressive efforts. He is Pehong Chen Distinguished
Professor in the Departments of Electrical Engineering and Computer
Science, and Statistics at the University of California at Berkeley.
In 1995 and 1996, Rob Kass published two major review articles in JASA, the first
with Adrian Raftery on Bayes factors and the second with Larry
Wasserman on prior distributions.
Kass was to later
co-author two influential reviews on Statistics in neuroscience. He
was the founding editor-in-chief in 2006 of ISBA’s interdisciplinary
journal Bayesian Analysis.
In their 1995 JASA
paper, Guido Consonni and Piero Veronese proposed using hierarchical
partition models to combine results from several binomial
experiments. In similar spirit, Peter Green proposed a method for
model determination in Biometrika (1995) that constructs
Markov chain samplers which jump reversibly between parameter
spaces of different dimensionality. The convergence problems were
phenomenal. Nevertheless, Peter’s paper has received over 3000
citations in the scientific literature. He certainly seemed to make
Guido Consonni is
Professor of Statistics at the Catholic University of Milan. He and
Fabrizio Ruggeri, a research director with the National Research
Council in Milan and a previous president of ISBA, are two of the
leading Bayesians in Italy.
Piero Veronese is
Professor of Statistics at Bocconi University. His research
interests include the main theoretical issues that Bayesians should
be involved in. As a sportsman, he believes that ‘the important
thing is to spend your free time outdoors’. He is fond of mountain
skiing, trekking with snowshoes and crampons, and fantastic
experiences in the Algerian desert near Tassili n’Ajjer.
In their 1995 JASA
paper, Michael Escobar and Mike West proposed their Bayesian
methodology for estimating densities by mixtures, and, in his 1995
JRSSB article, Peter Green combined his adventures with
reversible jump MCMC with Bayesian inference in complex stochastic
systems and spatial processes, forensic genetics, Bayesian semi-parametrics,
and graphical procedures.
An erstwhile President
of the Royal Statistical Society, and the recipient of Guy Medals in
silver and bronze, Peter is an Emeritus Professor of Statistics at
the University of Bristol where he worked for many years with Bernie
Silverman. Their high quality book Non-Parametric
Regression and Generalised Linear Models: A Roughness Penalty Approach
was published in 1994.
Sylvia Richardson and
Wally Gilks reported two Bayesian methods in 1993 and 1994, in
Statistics in Medicine and the American Journal of
Epidemiology, for analysing conditional independence models for
epidemiological data. In 1997, Sylvia Richardson and Peter Green
co-authored a widely-cited paper in JRSSB concerning their
Bayesian analysis of mixtures with unknown numbers of components.
Since then, Sylvia has become a very keen proponent of MCMC and
other stochastic algorithms, which she applies to many areas of
genetics and medicine, including genomics and meta-analysis. She
advocated analysing Bayesian hierarchical models using the MCMC and
acceptance sampling procedures in WinBUGS.
Sylvia Richardson, who
is one of the leading French Bayesians, held the Chair of
Biostatistics at Imperial College London until 2012. She is now
Professor of Biostatistics and Director of the MRC Biostatistics
Unit at the University of Cambridge. She’d met a number of similarly
enthusiastic Bayesians during the 1970s while she was a lecturer in
Statistics at the University of Warwick.
Chris Glasbey and
Graham Horgan published their book Image Analysis for the
Biological Sciences in 1995 with John Wiley. The authors
worked at BIOSS (Biomathematics and Statistics Scotland) which is
housed in the King’s Buildings, Edinburgh, and applied their
methodology to microscopy, medical image systems, and remote
sensing. Chris was soon to be a Doctor of Science and Honorary
Professor of the University of Edinburgh. He was later elected
Fellow of the Royal Society of Edinburgh for his somewhat Bayesian
contributions, many of which found application in agriculture. Chris
also worked with the Roslin Institute e.g. with Caroline Robinson
who used a Bayesian template method to estimate the amount of meat
In 1996, the eminent Irish
statistician Adrian Raftery published ‘Approximate Bayes factors and
accounting for model uncertainty in generalised linear models’ in
Biometrika. In their 1997 article in Applied Statistics,
Chris Volinsky, David Madigan, Adrian Raftery, and Richard Kronmal
used their Bayesian model averaging procedures for proportional
hazards models to assess the risk of strokes. A wonderful
Adrian Raftery is
Professor of Statistics and Sociology at the University of
Washington in Seattle. He was the world’s most cited mathematician
for the entire decade 1995-2005. In 2012 he was awarded the
prestigious Parzen Prize by Texas A&M University. His citation
mentioned his Bayesian applications in probabilistic forecasting,
model-based clustering and classification, time series, image
analysis, sociology, demography, environmental sciences, and health
sciences. He was recently elected to the Royal Irish Academy.
In their 1996 JASA
papers, Valen Johnson reported his Bayesian analysis of multi-rater
ordinal data, with an application to automated essay grading,
Michael Newton, Claudia Czado and Rick Chappell described their
Bayesian inferences for semi-parametric binary regression, and
Cinzia Carota, Giovanni Parmigiani and Nick Polson investigated some
diagnostic measures for model criticism.
Also in 1996, Keith Abrams, Deborah Ashby, and Doug Errington reported their Bayesian
analysis of Weibull survivor time models in Lifetime Data
Analysis, together with their applications to cancer trials. In
his 1990 Ph.D. thesis, John Hsu found that mixtures of Weibull
distributions gave a better fit to cancer survival data. In 1997,
Fayers, Ashby and Parmar published a biostatistics tutorial on
Bayesian monitoring in clinical trials.
Deborah Ashby is a
Bayesian biostatistician with interests in many areas of medicine.
She received her O.B.E. in 2009 and was elected to the Academy of
Medical Sciences in 2012. She holds the Chair in Medical Statistics
and Clinical Trials at Imperial College London.
Again in 1996, James
Bennett, Amy Racine-Poon and Jon Wakefield described how MCMC can be
employed for the analysis of non-linear hierarchical models. Their
paper appeared in Markov Chain Monte Carlo in Practice
(edited by Wally Gilks, Sylvia Richardson and David Spiegelhalter).
In that same year,
Stephen Walker and Jon Wakefield reported their Bayesian
semi-parametric approach, in Bayesian Statistics 5, for the
population modeling of a monotonic dose response curve.
In their 1997 paper in
JASA, Peter Müller and Gary Rosner applied a Bayesian
population model with non-linear hierarchical mixture priors to
blood count data.
The Austrian Bayesian
Peter Müller is Professor of Statistics at the University of Texas
at Austin, and a past president of ISBA. He works on semi-parametric
Bayesian inference, design problems, biomedical research, dependence
structures, graphical models, high throughput genomic data, and
population pharmacokinetic and pharmacodynamic models.
Radford Neal published
his book Bayesian Learning for Neural Networks in 1996.
Radford currently holds the Canada Research Chair in Statistics and
Machine Learning at the University of Toronto, and he has since
published numerous high quality Bayesian papers, many with a view to
applications in Machine Intelligence, but some with more general
In 1996, Stuart Coles
and Elwyn Powell reviewed the ongoing developments of Bayesian
methods in extreme value modeling in the International
Stuart Coles and Antony
Davison co-authored their book An Introduction to Statistical
Modeling of Extreme Values in 2001.
Stuart is currently an
Associate Professor of Statistics at the University of Padua.
In 1997, Jon Wakefield, Leon
Aarons and Amy Racine-Poon co-authored a Bayesian approach to
pharmacokinetic/pharmacodynamic modeling in Case Studies in
Bayesian Statistics (edited by Brad Carlin and six equally
One of the co-authors’
principal aims was to discover, for a particular drug, the
relationship between dose administered, drug concentrations in the
body and efficacy/toxicity. They derived a sophisticated three-stage
hierarchical model from a set of differential equations, and this
helped them to achieve their objectives.
Amy Racine-Poon worked
for the Pharma division of Novartis in Basel, Switzerland. She has
co-authored a number of high quality applied Bayesian papers which
also refer to elegant mathematical theory, and she ranks highly
among mainland European Bayesians.
Jon Wakefield is
Professor of Statistics and Biostatistics at the University of
Washington in Seattle. He has authored important Bayesian applied
papers in numerous areas, including genetic epidemiology and
genome-wide association studies. He is always genuinely concerned
about the frequency properties of his Bayesian procedures, and
should perhaps, like Brad Carlin, be regarded in philosophical terms
as a ‘Bayesian-Fisherian’ statistician.
Jon Wakefield is yet
another of Sir Adrian Smith’s highly successful former Ph.D.
influence across the discipline was by the mid-1990s becoming
enormous. After assuming a number of important leadership roles, he
is currently Vice-Chancellor of London University, and also deputy
head of the U.K. Statistics Authority. Adrian has supervised 41
successful Bayesian Ph.D. students altogether, most of whom have
gone on to achieve greater heights. They include Michael Goldstein,
Uri Makov, Allan Skene, Lawrence Pettit, John Naylor, Ewart Shaw,
Susan Hills, Nick Polson, David Spiegelhalter and Mike West, a
phenomenal achievement. Both John Naylor and Ewart Shaw provided
Adrian with remarkably sound computing expertise during the 1980s
and before Bayesian MCMC came into vogue. They, for example,
developed a computer package known as Bayes 4 which employed some
reassuringly convincing, algebraically expressed, approximate
Bayesian techniques. Bayes 4 has recently been incorporated into
Ewart Shaw’s larger package BINGO. Maybe Sir Adrian should be
regarded as the Sir Isaac Newton of modern Bayesian Statistics.
Also in 1997, Bob Mau
and Michael Newton proposed using MCMC, in their article in the
Journal of Computational and Graphical Statistics, when
addressing phylogenetic inference for binary data on dendrograms.
In the same year, Bruce
Craig, Michael Newton, Robert Garrott, John Reynolds and J. Ross
Wilcox proposed using MCMC, in their Biometrics paper, to
analyse aerial survey data on Florida manatees. In 1996, Bruce, the
son of University of Wisconsin Dean Judy Craig, had won an ENAR
student paper prize for his endeavours.
Michael Newton, who
hails from Nova Scotia, Canada, is currently director of the
Biostatistics program and co-director of the Cancer Genetics program
at the University of Wisconsin-Madison, and is one of our most
brilliant younger middle-aged Bayesians. He was the recipient of the
George Snedecor Award and the COPSS Presidents Award, in 1997 and
2004, and he has received several further top honours.
In 1997, Edward George
and Robert McCulloch reported their Bayesian approaches to variable
selection in Statistica Sinica.
Ed George is Universal
Furniture Professor of Statistics at the University of Pennsylvania.
He is interested in the Bayes/empirical Bayes compromise, and his
application areas include business, Bayesian ensemble learning, and
serial genetics. He has made many wonderful contributions.
Rob McCulloch is
Katherine Dusak Miller Professor of Econometrics and Statistics at
the University of Chicago. Similarly prolific, he is also interested
in machine learning.
John Kent reviewed the
literature of Bayesian methods for image analysis by deformable
templates in 1997 in The Proceedings in the Art and Science of
Image Analysis. The observation vectors are usually modelled by
a mixture of multivariate normal distributions with fixed locations
and simple covariance structures and (assumption sensitive) prior
distributions are then assigned to the mixing probabilities and a
single dispersion parameter. John co-authored a paper with Duncan
Lee and Kanti Mardia in the same proceedings, where they used a
related Bayesian approach to tag cardiac MR images.
In their 1997 papers in
JASA, Iain Weir reported his fully Bayesian reconstructions
for single photon emission computed tomography data, Michael Evans,
Zvi Gillula, Irwin Guttman and Tim Schwarz described their Bayesian
analysis of stochastically ordered distributions of categorical
variables, Cindy Christiansen and Carl Morris analysed their
hierarchical Poisson regression models, and Jim Albert and Sid Chib
proposed their Bayesian tests and diagnostics in conditionally
independent hierarchical models.
Newton Bowers, James Hickman, Cecil Nesbitt, Donald Jones and Hans
Gerber published their magnum opus Actuarial Mathematics in
James C. Hickman
(1927-2006) was the predominant force in introducing Bayesian
inference into the actuarial sciences. He was the erstwhile Dean of
the University of Wisconsin-Madison Business School. I first met him
and his lovely wife in Iowa City in 1972. He was a man of vision who
was always ready to incorporate subjective information into his
analyses e.g. when trying to set the insurance premiums for Jumbo
jets in the era before any Jumbos had crashed.
Christian Robert co-authored
six important Bayesian papers during 1998. He and Constantinos
Goutis used Kullback-Leibler projections to make Bayesian choices
between competing generalised linear models. To cap that, Christian
combined with Mike Titterington to develop some reparametrization
strategies for hidden Markov models, and Bayesian approaches to
maximum likelihood estimation. Christian also published four joint
papers during 1998 on MCMC and its applications to Bayesian
Christian Robert has
been an impressive advocate of the Bayesian paradigm ever since, as
epitomised by his influential book The Bayesian Choice. He is
currently Professor of Statistics at Université Paris Dauphine. Like
his compatriot Sylvia Richardson, he was very much influenced by his
earlier good times with the Bayesian school at the University of
Also in 1998, Malay Ghosh, Kannan Natarajan, Tom Stroud, and Brad Carlin reported some
MCMC procedures in JASA for the analysis of sample survey
data via generalised linear models for small area estimation. They
extended their general theorem to the case of spatial models and
reviewed the related literature.
Malay Ghosh and Glen
Meeden had published their important text Bayesian Methods for
Finite Populations in 1997. Ghosh, an eminent Indian
statistician who was born in Bengal, is a Distinguished Professor at
the University of Florida. He served from 1996 to 2001 on the US
Census Advisory Committee.
Glen Meeden is Professor
of Statistics at the University of Minnesota. He has published
extensively in Bayesian inference and decision theory. When I first
met Glen, at Ann Arbor, Michigan in 1978, he exclaimed, ‘Oh, you’re the
guy who obtained that neat estimate for the mean of a normal
Glen was presumably
referring to my note in Biometrika 1974, when I modified the
usual Bayes estimate of a normal mean by reference to a prior with
infinitely thick tails. If I’d used the posterior mean rather than
the mode, then my highly robustified generalised Bayes estimate
would have been even more convincing. Maybe a keen student somewhere
would like to work out the algebra.
Tom Stroud has now
retired from Queen’s University, Kingston, Ontario, after a very
productive career in applied Bayesian Statistics during which he
worked with Louis Broekhoven in his department’s STATLAB.
Louis made various
contributions to spline theory and Bayesian applications. He also
contributed much of the expertise during the preparation of our
joint paper with Jim Low in the American Journal of
Obstetrics and Gynaecology (1981). Louis was a former student of
Florence David’s at UCL, where he and a cohort of further
postgraduates were expected to grind out endless asymptotic
Florence once famously
declared ‘You’re not getting into my car, George Box!’ after George
had criticised her talk on a similarly boring topic to the Royal
In their JASA 1998
papers, Claudia Tebaldi and Mike West reported their Bayesian
inferences for network traffic data, Jim Dickey and Thomas Jiang
described their filtered-variate prior distributions for histogram
smoothing, and Babette Brumback and John Rice presented a
prestigious discussion paper on smoothing spline models for the
analysis of nested and crossed samples of curves.
The volume Maximum
Entropy and Bayesian Methods was co-edited in Garching, Germany
in 1998 by Wolfgang von der Linden, Volker Dose, Rainer Fischer and
Roland Preuss as a contribution to the fundamental theories of
This outstanding volume
includes a dedication to Edwin Thompson Jaynes (1922-98) by G. Larry
Bretthorst. Jaynes was born and raised in Iowa.
A. A paper by Fischer, Jacob,
von der Linden and Dose on the Bayesian reconstruction of electron
energy distributions from emission line intensities.
B. An article by Richard
Silver of the Los Alamos National Laboratory in New Mexico on
quantum entropy regularization.
C. A Bayesian reflection on
surfaces by David R. Wolf.
Larry Bretthorst is a
professor in the Department of Chemistry and Radiology at Washington
University in St. Louis. He has published numerous exciting papers
e.g. on Bayesian spectrum analysis.
In 1999, Bob Mau, Michael
Newton and Bret Larget reported their most recent MCMC procedures
for Bayesian phylogenetic inferences, in Biometrics.
Bob Mau is a senior
scientist in the Genome Evolutionary Laboratory at the University of
Wisconsin-Madison, and he has co-authored several Bayesian articles
in his area of specialism.
In their 1999 paper in
Statistics in Medicine, Luke Tierney and Antonietta Mira of
the University of Minnesota in Minneapolis developed some adaptive
strategies which adjust the MCMC algorithm to a particular context
based upon information obtained during sampling together with
information provided by the problem. The authors used their adaptive
MCMC analysis of a pharmacokinetic model to investigate the plasma
concentrations of the drug cadralazine in cardiac failure patients.
Also in 1999, the New
Zealand Bayesian Russell Millar used WinBUGS, with
Renate Meyer, to fit a state-space surplus production model.
Russell Millar is
Associate Professor of Statistics at the University of Auckland, and
editor of the Australian and New Zealand Journal of Statistics.
Russell has co-authored several papers on the Bayesian state-space
modeling of fisheries dynamics, including surplus production and
age-structured models. He moreover published his book Maximum
Likelihood and Inference with John Wiley in 2011.
In the same year, Paula Macrossan and five co-authors from the Queensland University of
Technology reported their Bayesian neural network for prediction in
the Australian dairy industry to the Third International
Symposium on Intelligent Data Analysis in the Netherlands. One
of the co-authors, Hussein Abbass worked in his university’s machine
learning centre. The authors were able to successfully predict dairy
daughter milk production from dairy dam, sire, herd and
To cap that, Steven MacEachern and Merlise Clyde reported their sequential
importance sampling simulations for semi-parametric Bayesian models
in the Canadian Journal of Statistics, and Merlise
developed some Bayesian model averaging and model search strategies
in Bayesian Statistics 6. Her model average procedures do
seem to be potentially quite sensitive to small changes in the
highly complex prior distributional assumptions.
Merlise Clyde is a
professor in the all-Bayesian Department of Statistical Science at
Duke University. She applies her techniques to problems in
proteomics, bioinformatics, astro-statistics, air pollution and
health effects, and environmental sciences.
Merlise is the current
(for 2013), enormously productive and somewhat scatter-brained,
President of the International Society for Bayesian Analysis (ISBA).
Not to be outdone,
Helio Migon and Dani Gamerman published their splendid text
Statistical Inference: An Integrated Approach, also in 1999.
Once postgraduate students at the University of Warwick, the ‘Boys
from Brazil’ are nowadays two of their country’s leading Bayesians.
In their papers in
JASA during 1999, Jim Berger and Julia Mortera proposed some
default Bayes factors for one-sided hypothesis testing, Alan Gelfand
and Sujit Sahu addressed identifiability problems, improper priors
and Gibbs sampling for non-linear models, and Florence Forbes and
Adrian Raftery co-authored the fascinating paper ‘Bayesian
morphology: Fast unsupervised Bayesian image analysis’.
A leading UCL Bayesian was very much involved in
the eventually very tragic 1999 Sally Clark case, when a 35 year old
solicitor was convicted of murdering her two babies, deaths which
had originally been attributed to sudden infant death syndrome.
Professor Philip Dawid,
an expert called by Sally Clark’s team during the appeals process,
pointed out that applying the same flawed method to the statistics
for infant murder in England and Wales in 1996 could suggest that
the probability of two babies in one family being murdered was 1
in 2,152,224,291 (that is, roughly 1 in 2.152 billion), a figure even more
outlandish than the 1 in 73 million suggested by the highly eminent
prosecution expert witness Professor Sir Roy Meadow, when the
relevant probability was a 2 in 3 chance of innocence. Sir Roy was
later struck off the medical register, but was reinstated upon
appeal. Sally Clark died in 2007, some four years after being released
from prison following her second appeal.
As the decade drew to a close, Robert Cowell, Phil Dawid, David
Spiegelhalter and Steffen Lauritzen combined their grey and silver
matter to co-author the outstanding book Probabilistic Networks and
Expert Systems, which earned them the prestigious 2001 DeGroot
Prize from ISBA. Here are their chapter headings:
Uncertainty and Probability
using Probabilistic Networks
Mixed Discrete Gaussian Networks
Multi-Stage Decision Networks
The authors’ epilogue
includes the standard conjugate analyses for discrete data models, a
discussion of Gibbs sampling, and an information section about
software on the worldwide web.
This was one of the most
exciting Bayesian books since Morris DeGroot published Optimal
Statistical Decisions in 1970. It is a must for all criminal
investigators and forensic scientists, and a good bedtime read for
all fledgling Bayesians.
This is a good point to
take a chronological break. I will discuss the Bayesian advances
during Anno Domini 2000 in the next chapter. Dennis Lindley’s
read paper will be pivotal in my discussion of the transition of
Bayesian ideas from the twentieth century to the next, and several
of the papers published in 2000 were very relevant to future
developments. These include the American Statistical Association
vignettes composed by Jim Berger and several other Bayesians.
The Bayesian history of the
twentieth century would not, however, be complete without a
discussion of the ways misguided versions of Bayes theorem were
misused in the O.J. Simpson murder and Adams rape trials during
attempts to quantify DNA evidence in the Courtroom.
Suppose more generally
that some traces of DNA are found at a crime scene, and that a
suspect is subsequently apprehended and charged. Then it might be
possible for the Court to assess the ‘prior odds’ Φ/(1-Φ) that the
trace came from the DNA of the defendant, where the prior
probability Φ refers to the other, human, evidence in the case.
Then according to an easy rearrangement of Bayes theorem, the
‘posterior odds of a perfect match’ multiplies the prior odds by a
Bayes factor R which is a ‘measure of evidence’ in the sense that it
measures the information provided by the forensic sciences relating
to their observations of the suspect’s DNA. This has frequently been
based on their observations of the suspect’s allele lengths during
15 purportedly independent probes.
In the all too frequent
special case where Φ is set equal to 0.5, the posterior probability
of a perfect match is λ=R/(R+1). In the United Kingdom, R turns out
to be a billion with remarkable regularity, in which case 1-λ,
loosely speaking the posterior probability that the trace at the
crime scene doesn’t come from the defendant’s DNA, is one in a
billion and one, and this is usually regarded as overwhelming
evidence against the defendant.
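The arithmetic in the last two paragraphs can be checked with a minimal sketch. It assumes nothing beyond the rearrangement of Bayes theorem described above (posterior odds equal R times prior odds); the function name and the use of exact fractions are illustrative choices rather than anything employed in the courtroom.

```python
from fractions import Fraction


def posterior_probability(prior, bayes_factor):
    """Posterior probability of a match, given a prior probability and a Bayes factor R.

    Illustrative helper (hypothetical name): posterior odds = R * prior odds.
    """
    prior = Fraction(prior)
    if not 0 < prior < 1:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1 - prior)                  # Phi / (1 - Phi)
    posterior_odds = Fraction(bayes_factor) * prior_odds
    return posterior_odds / (1 + posterior_odds)      # odds back to probability


# The special case in the text: Phi = 0.5 and R equal to a billion.
R = 10**9
lam = posterior_probability(Fraction(1, 2), R)

print(lam)        # lambda = R / (R + 1)
print(1 - lam)    # one in a billion and one
```

With Φ = 0.5 the prior odds are one, so the posterior odds are simply R, which is why the choice of prior deserves as much scrutiny as R itself.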
While this procedure is
being continuously modified, it has frequently been, at least in the
past, outrageously and unscientifically incorrect, for the following
reasons:
1. If there is no prior
evidence, then, according to Laplace’s Principle of Insufficient
Reason, Φ should be set equal to 1/N, where N is the size of the
population of suspects, for example the size, maybe around 25
million, of the population of eligible males in the United Kingdom.
In contrast, a prior
probability of 0.5 is often introduced into the courtroom via Erik
Essen-Möller’s shameless device of a ‘random man’, which is
just a mathematical trick, or by similarly fallacious arguments
which have even been advocated by some leading proponents of the
Bayesian paradigm. Essen-Möller’s formula was published in Vienna in
1938 around the time of the Nazi annexation of Austria. Maybe Hitler
was the random man!
Erik Essen-Möller later
made various influential contributions to the ‘genetic study’ of
psychiatry and psychology, including schizophrenia, and was much fêted
in his field. Jesus wept!
2. The genocrats don’t
represent R by a Bayes factor, but rather a likelihood ratio. In the
case where 15 DNA probes are employed, the combined likelihood ratio
R incorporates empirical estimates of 15 population distributions of
allele lengths. These non-parametric empirical estimates were,
during the 1990s, frequently derived from 15 small, non-random,
samples of the allele lengths and can be highly statistically
deficient in nature. See, for example, the excellent 1994 discussion
paper and review by Kathryn Roeder in Statistical Science.
3. The overall, combined
likelihood ratio R has typically been calculated by multiplying 15
individual likelihood ratios together. The multiplications would be
open to some sort of justification if the empirical evidence from
the 15 probes could be regarded as statistically independent. The
genocrats and forensic scientists seek to justify statistical
independence by the genetic independence which occurs in homogeneous
populations which are in a state of Hardy-Weinberg equilibrium, e.g.
when individuals choose their mates at random from all individuals
of opposite gender in the population. However, most populations are
highly heterogeneous and most people I know don’t choose their
partners at random. For example, most heterosexual people choose
heterosexual or bisexual partners.
homogeneity assumption can greatly inflate the overall likelihood
ratio R as a purported measure of evidence against the defendant.
For example, if the individual likelihood ratios are each equal to
four, then R is four to the power fifteen, which exceeds a billion,
in situations where the ‘true’ combined measure of evidence might be
During the trial
(1995) of the iconic American football star O.J. Simpson for the
murder of his long-suffering former wife Nicole, the celebrated prosecution
expert witness Professor Bruce Weir of North Carolina State
University attempted to introduce blood and DNA evidence into the
Courtroom via a similarly misguided misapplication of Bayes Theorem.
The scientific content of his evidence was refuted by our very own
Aussie Bayesian, Professor Terry Speed of the University of
California at Berkeley. Terry was assisted in his efforts by an
arithmetic error made by Professor Weir during the presentation of
the prosecution testimony. However, when Simpson was acquitted
in criminal court, the media largely attributed this to the allegation that a
police officer had planted the forensic evidence by throwing it over
a garden wall.
During the early 1990s,
the defendant in the Denis Adams rape case was convicted on the
basis of the high value of a purported likelihood ratio, even though
the victim firmly stated that he wasn’t the man who’d raped her.
Adams moreover appeared to have had a cast iron alibi, since several
witnesses said that he was with them many miles away at the time
of the crime.
The defence expert witness, the much-respected Professor Peter
Donnelly of the University of Oxford, asked the members of the jury
to assess their prior probabilities by referring to the human
evidence in the case, in a valiant attempt to counter the influence
of the large purported combined likelihood ratio. In retrospect,
Peter should have gone straight for the jugular by refuting the
grossly inflated likelihood ratio along the applied statistical
lines I describe above.
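The interplay between the prior and the combined likelihood ratio can be illustrated numerically. The sketch below is an illustration only: it takes the figures quoted earlier in this chapter (a population of about 25 million eligible males, and fifteen individual likelihood ratios each equal to four), while the smaller combined ratio of 10,000 is a purely hypothetical value chosen to show the sensitivity, and the function name is an assumption.

```python
from fractions import Fraction


def posterior_probability(prior, likelihood_ratio):
    """Posterior probability from a prior probability and a combined ratio R.

    Illustrative helper (hypothetical name): posterior odds = R * prior odds.
    """
    prior = Fraction(prior)
    prior_odds = prior / (1 - prior)
    posterior_odds = Fraction(likelihood_ratio) * prior_odds
    return posterior_odds / (1 + posterior_odds)


N = 25_000_000            # population of suspects quoted in the text
R = 4 ** 15               # fifteen ratios of four multiplied together

p_even = posterior_probability(Fraction(1, 2), R)      # 'random man' prior of 0.5
p_uniform = posterior_probability(Fraction(1, N), R)   # insufficient-reason prior 1/N

# Hypothetical smaller combined ratio, assumed purely for illustration.
R_hypothetical = 10_000
p_hypothetical = posterior_probability(Fraction(1, N), R_hypothetical)

print(R)                          # exceeds a billion
print(float(p_even))              # indistinguishable from certainty
print(float(p_uniform))           # about 0.977
print(float(p_hypothetical))      # well below one in a thousand
```

Under these assumptions the posterior probability equals R/(R+N-1) when Φ = 1/N, so even a ratio exceeding a billion leaves a doubt of roughly two per cent, and a deflated ratio leaves the defendant overwhelmingly likely to be innocent on the DNA evidence alone.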
When the Court of
Appeal later upheld Adams’ conviction, they strongly criticised the
use of prior probabilities as a mode for assessing human evidence
and effectively threw Bayes Theorem out of Court, in that particular
case at least. Adrian Smith, at that time the President of the Royal
Statistical Society, was not at all amused by that rude affront to
his raison d’être.
For viewpoints regarding the apparent misapplications of Bayes theorem
in legal cases, see the well-cited books by Colin Aitken and Franco
Taroni, and David Balding, and the 1997 Royal Statistical Society
invited discussion paper by L.A. Foreman, Adrian Smith and Ian Evett.
Colin Aitken is
Professor of Forensic Statistics at the University of Edinburgh.
Jack Good’s earlier ‘justification’ of the widespread use of Bayes
factors as measures of evidence is most clearly reported on page 247
of Colin and Franco’s widely cited book Statistical
Evaluation of Evidence for Forensic Scientists, and on
page 389 of Good’s 1988 paper ‘The Interface between Statistics and
Philosophy of Science’ in Statistical Science.
Jack Good invokes Themis,
the ancient Greek goddess of justice, who was said to be holding a
pair of scales on which she weighed opposing arguments. Colin Aitken
is a leading advocate of the use of Bayes factors in criminal cases,
and his officially documented recommendations to British Courts of
Law depend heavily on Good’s key conclusion that any sensible
additive weight of evidence must be the log of a Bayes factor. That
creates visions of the Goddess Themis weighing the logs of Bayes
factors on her scales and putting herself at loggerheads with the
Maybe I’m missing
something, though perhaps not. Good’s ‘justification’ seems to be
rather circuitous, and indeed little more than a regurgitation of
the additive property,
Posterior log-odds = Prior log-odds + log (Bayes factor),
which can of course be
extended to justify the addition of the log Bayes factors from
successive experiments. This simple rearrangement of Bayes Theorem
would, at first sight, appear to justify Good’s apparently seminal
conclusion. However, since Bayes factors frequently possess
counterintuitive properties (see Ch. 2), the entire idea of
assigning a positive probability to a ‘sharp’, i.e.
simple or only partly composite, null hypothesis, is open to serious
question in situations where the alternative hypothesis is
The approach to
multivariate binary discrimination described by John Aitchison and
Colin Aitken in their 1976 paper in Biometrika is much more
convincing, as they use a non-parametric kernel method to
empirically estimate the denominator in the Bayes factor. But a
posterior probability cannot justifiably be associated with it, even
as a limiting approximation.
Colin’s parametric Bayes
factors are of course still useful if they are employed as test
statistics, in which case they will always possess appealing
frequency properties. See , pp 162-163. It’s when you try to use
Bayes theorem to directly convert a Bayes factor into a posterior
probability that there’s been trouble at Mill. The trouble could be
averted by associating each of Colin’s Bayes factors with a Baskurt-Evans
style Bayesian p-value. Perhaps Colin should guide the legal
profession further, by writing another book on the subject.
In  and , Jack
Good used Bayes factors as measures of evidence when constructing an
ambiguously defined measure of explicativity, which is,
loosely speaking, ‘the extent to which one proposition or event
explains why another should be believed’. I again find Jack’s
mathematical formulation of an interesting, though not all
pervading, philosophical concept to be a bit too fanciful.
The co-authors of a 1997
R.S.S. invited paper on topics relating to the genetic evaluation of
evidence included Dr. Ian Evett of the British Forensic Science
Service and my nemesis Adrian Smith. They, however, chose not to
substantively reply, in their formal response to the discussants, to
the further searching questions which I included in my written
contribution to the discussion of their paper.
These were inspired by
my numerous experiences as a defence expert witness in U.S. Courts.
In 1992, I’d successfully challenged an alleged probability of
paternity of 99.99994% in Phillips, Wisconsin, and district
attorneys in the Mid-West used to settle paternity testing cases
when they heard I was coming. The 1992 ‘Rosie and the ten
construction workers case’, when I refuted a prior probability of
paternity of 0.5 and a related posterior probability of over 99.99%
in Decorah, Iowa, seemed to turn a Forensic Statistics conference in
Edinburgh in 1996 head over heels, and Phil Dawid and Julia Mortera
made lots of amusing jokes about it over dinner while Ian Evett and
Bruce Weir fumed in the background. I, however, always declined to
participate in the gruesome rape and murder cases in Chicago.
Nevertheless, during a
subsequent light-hearted public talk on a different topic at the
1997 Science Festival in Edinburgh, Adrian somewhat petulantly
singled me out as a person who disagreed with him. I’m glad that I
had the temerity to do so.
statisticians, and forensic scientists, should be ultra-careful not
to put the public or the remainder of their profession into a state
of mystification. Jimmie Savage had a habit of putting down
statisticians who questioned him, e.g. his very sharp, though
self-effacing, brother-in-law Frank Anscombe, who was an occasional
Bayesian, and John Tukey is said to have put pressure on George Box
to leave Princeton in 1960 for related reasons, after Tukey received
some of the same medicine when he visited Ronald Fisher for
afternoon tea. Holier-than-thou statisticians should realise that
they are likely to be wrong some of the time, at least in concept,
along with everybody else. As George Box once said, ‘We should
always be prepared to forgive ourselves when we screw up.’
Some further problems
inherent in the Bayesian evaluation of legal evidence are satirized
in Ch. 14: Scottish Justice of my self-published novel In
the Shadows of Calton Hill.
At the risk of provoking
further negative reactions, e.g. from the genetics and forensic
science professions, here are some possible suggestions for
resolving the immensely socially damaging DNA evidence situation:
1. Since most of the genetic
theory that underpins them is both suspect and subject to special
assumptions, combined likelihood ratios and misapplications of Bayes
theorem should be abandoned altogether. In the case where there are
fifteen DNA probes, an exploratory data analysis should be performed
of the 15 corresponding (typically non-independent) samples of the
allele lengths or their logs, and used to contrast the 15x1 vector X
of log allele lengths measured from the trace at the crime scene
with the vector Y of the log allele lengths measured from the
suspect’s DNA. The quality of the data collection would need to be
greatly improved to justify doing this.
2. A Bayesian analysis of
the n 15x1 vectors of the log-allele lengths should then be employed
to obtain a posterior predictive distribution for the vector Z of
the 15 log-allele lengths for a randomly chosen individual from the
reference population. Various predictive probabilities may then be
used to contrast the elements of X and Y, and some criterion would
need to be decided upon by the Courts to thereby judge whether X and
Y are close enough to indicate a convincing enough match.
3. As an initial suggestion,
it might be reasonable to assume that the n vectors of log-allele
lengths constitute a random sample from a multivariate normal
distribution with unknown and unconstrained mean vector μ and
covariance matrix C. If the prior distribution of μ and C is taken
to belong to the conjugate multivariate normal/inverted Wishart
family (e.g. , p290), then the posterior predictive distribution
of Z is generalised multivariate-t. This facilitates technically
feasible inferential Bayesian calculations for contrasting the
elements of X and Y, and prior information about μ and C can be
incorporated if available e.g. by reference to other samples.
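As a rough illustration of suggestion 3, the following Python sketch
computes the parameters of the multivariate-t posterior predictive
distribution of Z under a conjugate normal/inverted Wishart prior, and
evaluates its log density at two vectors. The function names, the
prior settings (mu0, kappa0, nu0, lambda0) and the synthetic data are
assumptions made purely for illustration; only the update formulas are
the standard conjugate results. It is not a forensic procedure, and it
does not settle which predictive criterion the Courts should adopt.

```python
import math
import numpy as np


def niw_posterior_predictive(samples, mu0, kappa0, nu0, lambda0):
    """Return (df, loc, scale) of the multivariate-t predictive for Z.

    Assumed prior: mu | C ~ N(mu0, C / kappa0), C ~ inverse-Wishart(nu0, lambda0).
    """
    samples = np.asarray(samples, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n, p = samples.shape
    ybar = samples.mean(axis=0)
    centred = samples - ybar
    scatter = centred.T @ centred                 # sum of squares about the mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n    # posterior mean of mu
    diff = (ybar - mu0).reshape(-1, 1)
    lambda_n = lambda0 + scatter + (kappa0 * n / kappa_n) * (diff @ diff.T)
    df = nu_n - p + 1                             # predictive degrees of freedom
    scale = lambda_n * (kappa_n + 1) / (kappa_n * df)
    return df, mu_n, scale


def mvt_logpdf(z, df, loc, scale):
    """Log density of a multivariate-t with the given df, location and scale."""
    p = loc.shape[0]
    dev = np.asarray(z, dtype=float) - loc
    maha = dev @ np.linalg.solve(scale, dev)      # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(scale)
    return (math.lgamma((df + p) / 2) - math.lgamma(df / 2)
            - 0.5 * p * math.log(df * math.pi) - 0.5 * logdet
            - 0.5 * (df + p) * math.log1p(maha / df))


# Synthetic stand-in for n = 200 reference vectors of 15 log-allele lengths.
rng = np.random.default_rng(0)
p = 15
reference = rng.normal(size=(200, p))
df, loc, scale = niw_posterior_predictive(
    reference, mu0=np.zeros(p), kappa0=1.0, nu0=p + 2.0, lambda0=np.eye(p))

# Hypothetical crime-scene vector X and suspect vector Y.
x = rng.normal(size=p)
y = x + 0.05 * rng.normal(size=p)
print("log predictive density at X:", mvt_logpdf(x, df, loc, scale))
print("log predictive density at Y:", mvt_logpdf(y, df, loc, scale))
```

Prior information from other samples, as mentioned in suggestion 3,
would enter through mu0, kappa0, nu0 and lambda0; the values used here
are deliberately vague placeholders.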
In the meantime,
thousands of potentially innocent people are still being falsely
convicted using versions of Bayes Theorem. This was easily the most
terrifying misapplication of the Bayesian paradigm of the twentieth
century, and it is something which the Bayesian profession should
not be proud of.
[In early January 2014,
the Bayesian forensics expert Professor David Balding of University
College London kindly advised me that many of the statistical
problems inherent in the evaluation of DNA evidence are now in the
process of being overcome. See, for example, David’s paper
‘Statistical Evaluation of Forensic DNA Profile Evidence’ (with
Christopher Steele, Ann. Rev. Stat. Appl., 2014).]