A
PERSONAL HISTORY OF BAYESIAN STATISTICS 

Thomas Hoskyns
Leonard 

Retired
Professor of Statistics, Universities of Wisconsin-Madison and
Edinburgh 

5. THE ENIGMATIC
EIGHTIES 

The real thing in this world is not where we stand, but in which direction we are moving (Julius Nyerere) 



The Mayflower Rose, by Fabio Cunha 

When I remember the
promising early years of the Reagan era in America, I recall my
metaphoric poem, The Mayflower Rose. Maybe Rose was
America itself. 

In what wondrous
dream
Do I suppose
I met the Mayflower
Rose?
Her petals turned
pink
In a hug and a
blink;
Her stem twisted in
the breeze,
When I fell to my
knees;
Her aura turned
heavenly and angelic,
As I plied her with
Dumnonian magic.
But when the
prickly thistle flew in,
Rose was gone in
the din
I twisted and
turned
As I drank like a
tank for ever and a day.
When Yank fought
Assyrian in the Gulf of Tears,
She, the Voice
behind the Screen,
Spoke as if I’d
never been.
Now she beams
across the mind waves
In my wondrous
dreams. 


The 1980s were made even more enigmatic by George Box’s seminal article [67], in which Box argued that while ‘coherent’ Bayesian inferences are appropriate given the assumed truth of a sampling model, the assumed model should always be checked against a completely general alternative using a frequency-based significance test together with other non-Bayesian data-analytic techniques. His not-so-original philosophy ‘All models are wrong, but some are useful’ has since become engrained in statistical folklore. Consequently, Bayesian researchers were forced to acknowledge that model-based developments do not usually completely solve the applied problem.
Box’s approach posed a serious enigma. Should Bayesians continue blandly with their time-honoured model-based research, or try to compete in the ‘model unknown’ arena, where they would need to develop new talents and techniques? Many Bayesians opted out by focussing on the first of these options, and others were faced with daunting technical and conceptual difficulties while seeking theoretical procedures that might justify their choices of parameter-parsimonious sampling model. However, the increased computational feasibility of model-based inferences made it easier for them to concentrate more of their energies on trying to do this.
Many economists persisted with the Savage–Lindley gobsmacking philosophy ‘A model should be as big as an elephant’ (see [68]) while failing to reasonably identify all the parameters in their, albeit highly meaningful, models from finite data sets. However, for sensibly parametrized nonlinear models, importance sampling continued to play a key model-based inferential role, and AIC and other information criteria were frequently used for model comparison.
When the sample space is continuous, Box referred to the prior predictive density p(y) of the n×1 observation vector y. This averages the hypothesised sampling density of y with respect to the prior distribution of the unknown parameter vector θ. The elements of y of course consist of the numerical realizations of the corresponding elements of a random vector Y. Box then recommended testing the hypothesised distribution against a general alternative by reference to the ‘significance probability’

α = prob[log p(Y) < log p(y)]

For this definition to make sense in general terms, ‘prob’ should be interpreted as shorthand for ‘prior predictive probability’ rather than sampling probability. Therefore α cannot in general be regarded as a classical significance probability. This criterion is heavily dependent upon the choice of prior distribution for θ, as evidenced by the various special cases considered in Box’s paper. In special cases, e.g. involving the linear statistical model, α depends upon the realization of a low-dimensional model-based sufficient statistic and not further upon the observations or residuals, a curious property.
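Box’s α can also be explored by simulation when p(y) is intractable. The sketch below is mine, not from Box’s paper: it assumes a hypothetical N(θ, σ²) sampling model with a N(μ₀, τ²) prior on θ, estimates log p(·) by Monte Carlo averaging of the likelihood over prior draws, and computes α as the fraction of prior predictive replicates Y whose estimated density falls below that of the observed y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not Box's example): y_i ~ N(theta, 1), theta ~ N(0, 2^2).
n, mu0, tau, sigma = 10, 0.0, 2.0, 1.0
y = rng.normal(3.0, sigma, size=n)    # illustrative "observed" data

def log_prior_predictive(data, m=2000):
    """Monte Carlo estimate of log p(data) = log E_prior[ f(data | theta) ]."""
    theta = rng.normal(mu0, tau, size=m)
    # log-likelihood of the whole data vector for each sampled theta
    ll = (-0.5 * ((data[:, None] - theta) / sigma) ** 2
          - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=0)
    c = ll.max()                       # log-mean-exp for numerical stability
    return c + np.log(np.mean(np.exp(ll - c)))

# alpha = prob[ log p(Y) < log p(y) ], with Y drawn from the prior predictive.
obs_lp = log_prior_predictive(y)
reps = 500
below = 0
for _ in range(reps):
    theta = rng.normal(mu0, tau)
    y_rep = rng.normal(theta, sigma, size=n)
    if log_prior_predictive(y_rep) < obs_lp:
        below += 1
alpha = below / reps                   # small alpha flags the model/prior pair
```

Because Y is drawn from the prior predictive rather than from a fixed sampling distribution, α depends heavily on the assumed prior for θ, exactly the sensitivity discussed above.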
Shortly before George read his paper to the Royal Statistical Society, I talked to him about the large sample size properties of α (see Author’s Notes below). These seemed to suggest that α was too heavily dependent on the prior assumptions for effective practical model checking. George nevertheless went proudly ahead as planned, his paper was very well received by the assembled Fellows, and his path-breaking philosophies have become engrained in applied Bayesian Statistics ever since.
Model testing procedures which adapted Box’s ideas by instead referring to a ‘posterior predictive p-value’ proved to be somewhat more appealing. See, for example, Andrew Gelman, Xiao-Li Meng and Hal Stern’s well-cited paper in Statistica Sinica (1996).
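A posterior predictive p-value averages the tail probability of a chosen discrepancy over the posterior rather than the prior. The following is a minimal sketch of my own, assuming a hypothetical conjugate normal model and a discrepancy (the standardised maximum residual) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conjugate setup: y_i ~ N(theta, 1), prior theta ~ N(0, 1).
n, mu0, tau, sigma = 20, 0.0, 1.0, 1.0
y = rng.normal(0.5, sigma, size=n)

# Exact conjugate posterior for theta.
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau**2 + y.sum() / sigma**2)

def T(data, theta):
    """Discrepancy sensitive to outliers: the standardised maximum residual."""
    return np.max(np.abs(data - theta)) / sigma

# p = Pr[ T(Y_rep, theta) >= T(y, theta) ], averaging over the posterior.
draws = 4000
theta = rng.normal(post_mean, np.sqrt(post_var), size=draws)
exceed = sum(T(rng.normal(t, sigma, size=n), t) >= T(y, t) for t in theta)
p_ppp = exceed / draws
```

Very small (or very large) values of p_ppp suggest that the model cannot reproduce the observed discrepancy; unlike Box’s α, the reference distribution is conditioned on the data through the posterior for θ.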
Nevertheless, the question remains as to how small a p-value would be needed to justify rejecting the hypothesised model. In conceptual terms, you might wonder whether it is reasonable to aspire to checking a model against a completely general alternative (including models with variously non-independent error terms) without any class of alternative models in mind. From a frequency perspective, any fixed-size test procedure which is reasonably powerful against some choices of alternative model might be less powerful against other choices. Moreover, a significance test is next to worthless if it possesses unconvincing power properties.
If you don’t think that flagging your hypothesised sampling models with tail probabilities provides the final answer, then that leaves you with residual analyses and the familiar criteria for model comparison, which nowadays include AIC, DIC and cross-validation. Assigning positive prior probabilities to several candidate models with several unknown parameters is not usually a good idea, unless the model parameters are estimated empirically, since the posterior probabilities will depend upon Bayes factors, which are subject to the sorts of paradoxes and sensitivities that I described in Ch. 2. However, in the empirical case, one possibility would be to estimate the sampling density by its estimated posterior mean value function.
In his 1980 article in the Annals of Statistics, Jim Berger reported Bayesian procedures which improved on standard admissible estimators in continuous exponential families. His books on Statistical Decision Theory were published in 1980 and 1985, and his 1984 monograph with Robert Wolpert is titled The Likelihood Principle: A Review and Generalisations. Jim has always analysed Bayesian-related problems with outstanding mathematical rigor and a strong regard for the frequency properties of the estimation, inferential, and decision-making procedures. He is currently an Arts and Sciences Distinguished Professor in the Department of Statistical Sciences at Duke University.
In their 1980 paper in the Annals of
Statistics, Connie Shapiro (now Connie Page) and Bob Wardrop
applied Dynkin’s identity to prove optimality of their sequential
Bayes estimates for the rate of a Poisson process, with a loss
function which involved costs per arrival and per unit time. They
investigated the frequency properties of their procedures in an
asymptotic situation.




Connie Page 

In the same year, Shapiro and Wardrop reported their Bayesian sequential estimation for one-parameter exponential families in an equally high-quality paper in JASA.
Now an Emeritus Professor at Wisconsin, and as ever an outstanding teacher, Bob Wardrop published his book Statistics: Learning in the Presence of Variation in 1995. Connie Page, who has always played an important leadership role in American Statistics, went on to focus on statistical consultancy at Michigan State. 



Tom's
Facebook friend Bob Wardrop with his three grandchildren, Lodi,
Wisconsin, 2013 

In 1981, Mike West, one of Adrian Smith’s many highly successful research students, published an important paper in JRSSB out of his Ph.D. thesis about robust sequential approximate Bayesian estimation. In 1983, Mike and Adrian reported an interesting application of the Kalman filter to the monitoring of renal transplants. Mike’s path-breaking joint paper in JASA (1985) with Jeff Harrison and Helio Migon set fresh horizons for the forecasting of count data within a dynamic generalised linear model framework that was easily expressible in terms of two or more stages of a hierarchical distributional structure. Mike West reported his many achievements in Bayesian forecasting much later, in his magnificent books with Harrison, and with Harrison and Pole.
In 1981, Dr. Jim Low of Kingston General Hospital, Ontario, and six coauthors, including Louis Broekhoven and myself, reported their 1978 Bayesian analysis of the Ontario foetal metabolic acidosis data in an invited discussion paper which was published in the American Journal of Obstetrics and Gynaecology. See [15], pp. 92–95, for a description. Various birth weight distributions were modelled using Edgeworth’s skewed normal distribution, which I thought that I was inventing for this purpose. 



Dr. James Low, Kingston General Hospital 

The many advances in Bayesian medical statistics between 1981 and 2006 were capably reviewed by Deborah Ashby and published online by Wiley in 2006. Important developments during the early 1980s include Jay Kadane and Nell Sedransk’s Bayesian motivations towards more ethical clinical trials, the methodology developed by Laurence Freedman and David Spiegelhalter for the assessment of subjective opinion concerning clinical trials, the clinical decision-support systems devised by David Spiegelhalter and Robin Knill-Jones, with very useful applications in gastroenterology, and the empirical Bayes estimates of cancer mortality rates which were proposed by Robert Tsutakawa, Gary Shoop and Carl Marienfield.
In his 1990 paper ‘Biostatistics and Bayes’ in Statistical Science, Norman Breslow reviewed the changing attitudes of biostatisticians towards the Bayesian and empirical Bayesian paradigms.
Peter Armitage, Geoffrey Berry and John Matthews are sympathetic towards Bayesian methods in medical statistics in their expository treatise Statistical Methods in Medical Research (4th edition, 2002).
In 2002, John Duffy and I coauthored an article in Statistics in Medicine concerning the Mantel-Haenszel model for several 2×2 contingency tables. Our Laplacian approximations facilitated a meta-analysis for data from several clinical trials, and we applied our methodology to various sets of ear infection data.
The leading Bayesian statistician Morris De Groot bravely led an ASA delegation to Taiwan to investigate the brutal murder in Taipei in 1981 of the Han Chinese martyr Chen Wen Chen, who was De Groot’s colleague in the Department of Statistics at Carnegie-Mellon University. De Groot was lucky to get out of Taiwan alive when the medical doctor in the delegation performed an autopsy which revealed the true cause of death. The Associated Press reporter Tina Chou was blacklisted, and hounded for many years afterwards for reporting the results of the autopsy. She has not been seen or heard of for at least a decade.
Wen Chen’s assassination by the axe-wielding Taiwanese secret police is regarded as a seminal event in the history of Taiwan. It would appear, according to various reports I have received down my grapevine, that the crime may have been brought about by some scary interstate academic intrigue within the United States, maybe even with undertones of professional jealousy. The apparent truth of the situation has put the fear of God in me for the last four decades. Morry De Groot’s endeavours on behalf of the ASA, with the assistance of his very supportive colleagues at Carnegie-Mellon, were extremely laudable. Nevertheless, American Bayesians should not be completely proud of what happened, if an unexpected email I received in Madison during the 1990s is to be believed. Chen Wen Chen’s case was reopened in Taiwan several years ago. It would, perhaps, still benefit from further investigation. 



Chen Wen Chen (1950–1981)
Han Chinese Martyr 



Morris De Groot (1931–1989) 

In 1981, I helped George Box to organize a special statistical year at the U.S. Army’s Mathematics Research Center (M.R.C.) at the University of Wisconsin-Madison. George’s motivation was to orchestrate a big debate about the Bayesian–frequency controversy, and concerning which philosophy to adopt when checking the specified sampling model. If he was looking for some sort of consensus of opinion that the frequency model-testing procedures he’d proposed in his 1980 paper [67] were the way to go, then that is what he eventually achieved, after some relatively courteous disagreement from the diehard Bayesians.
With these purposes in mind, a number of Bayesian and frequentist statisticians were invited to visit M.R.C. during the year. Their lengths of stay varied between a few weeks and a whole semester. Then, in December 1981, everybody was invited back to participate in an international conference in a beautiful auditorium overlooking Lake Mendota. The conference proceedings Scientific Inference, Data Analysis, and Robustness were edited by George, myself, and Jeff C.F. Wu, but I never knew how many copies were sold by Academic Press. Maybe the proceedings fell stillborn, David Hume-style, from the press.
The Mathematics
Research Center was housed at that time on the fifth, eleventh and
twelfth floors of the elegantly thin WARF building on the extreme
edge of the UW campus, near Picnic Point which protrudes from the
southeast shoreline of the still quite temperamental Lake Mendota.
The quite gobsmacking covert activities were not altogether confined
to the thirteenth floor.




The WARF Building, Madison, Wisconsin 

During the Vietnam War, M.R.C. was instead housed in Sterling Hall in the middle of the UW campus. However, on August 24, 1970, it was bombed by four young people as a protest against the University’s research connections with the U.S. Military. The bombing resulted in the death of a university physics researcher and injuries to three others.



The Bombing of Sterling Hall 

Visitors during the special year included, as I remember, Hirotugu Akaike, Don Rubin, Mike Titterington, Peter Green, Granville Tunnicliffe-Wilson, Irwin Guttman, George Barnard, Colin Mallows, Morris De Groot, Toby Mitchell, Bernie Silverman, Phil Dawid, and Michael Goldstein. However, Dennis Lindley was the Bayesian whom George Box really wanted to attract to Madison. Dennis was by then an expensive commodity on the American circuit, where he wasn’t always kowtowed to. He, for example, reportedly got into a classic confrontation with a Dean at Duke a year or so later when the Dean invited him to teach a basic-level course on frequentist statistics. 



Hirotugu Akaike (1927–2009) 

George emptied the military’s coffers to the tune of almost $50,000 in his ultimately successful efforts to tempt Dennis into visiting us for the first semester of 1981. George advised me that it’s always important to bury the hatchet with your long-term adversaries since we’re all really ‘one big happy family’, or words to that effect (he and Dennis had crossed swords over twenty years previously on a non-academic issue, at which time Dennis had reportedly ‘displayed an iron fist from behind a velvet glove’), and Dennis and his kindly wife Joan stayed in George Tiao’s enormous house on the icy west side of Madison for the duration of their visit.
After an awkward settling-in period, Dennis and George proceeded to debate the fundamental philosophical issues during a series of special seminars at M.R.C. While I thought that Dennis was winning, primarily because of the way his mathematical sharpness and sense of self-belief contrasted with George’s aura of practical relevance and imminent greatness, one of my students thought that George was treating him (Dennis) like a punch bag. Anyway, the debate petered out in April 1981, when Dennis departed in his usual gentlemanly style to continue his early retirement routine at another American college.
George presented a variation of his, albeit strangely flawed, 1980 paper [67] on prior predictive model-checking at M.R.C.’s international conference on the shores of Lake Mendota in December 1981. However, he’d previously declared victory when Dennis suddenly pulled out of the much-planned and expensively funded confrontation, for reasons best known to himself, by abruptly declining to attend the meeting.
The conference was
nevertheless highly successful and George received lots of pats on
the back for his refreshing reinterpretation of Bayesian
Statistics. Indeed, Bob Hogg was so thrilled about it that he
invited me to give a seminar on a similar topic in Iowa City. As I
was to discover afterwards, this was so that he could tease and
irritate one of his ‘coherently Bayesian’ colleagues with the good
news. What a soap opera, but I’m sure that the U.S. Army got value
for money.




Robert V. Hogg 

I still seriously
wonder whether some of the research produced at M.R.C. during the
special year was used by the U.S. Army Research Office to enhance
Reagan’s Star Wars program. With some of the stuff going on behind
closed doors involving the statistics of Pershing missiles, nuclear
missiles hitting silos, and whatever, I wouldn’t put it past them.
George Box’s academic life story is published in his autobiography The Accidental Statistician. Whilst never humble or an extremely brilliant mathematician, his forte lay in his ability to perceive simple practical solutions in situations where the mathematizers couldn’t see the wood for the trees. That was his saving grace, and at that he was supremely magnificent. By 1981, when he was 62, he had doubtlessly achieved his much-longed-for everlasting greatness. George did not, however, retire until many years afterwards, even though he was presented with a retirement cheque and a wooden rocking chair during a faculty reception in 1992.
In 1982, S. James Press impressed the statistical world with the second edition of his magnificently detailed book Applied Multivariate Analysis, which referred to both Bayesian and frequentist methods of inference in meaningful multidimensional models. Jim was to publish three further Bayesian books with John Wiley. The second of the trio was coauthored with Judith Tanur.
Jim is an Emeritus Professor of Statistics at the University of California at Riverside. He was appointed Head of the Statistics Department there when the long-exiled iconic English statistician Florence David retired from the position in 1977.
Jim founded the Bayesian Statistical Sciences Section of the American Statistical Association in 1992, in collaboration with Arnold Zellner and ISBA. He has always been a powerful figure in Bayesian Statistics, with an ultra-charming personality and a wry sense of humour. He learnt his trade as a coherent Bayesian while visiting UCL in 1971–72, but never lost sight of his practical roots. 



Irving Jack Good 

In their 1982 JASA paper, Jack Good and James Crook used the concept of the strength of a significance test in the context of contingency table analysis. The strength of a test averages the power with respect to the prior distribution. Bayesian decision rules can yield optimal fixed-size strength properties. See Lehmann’s Testing Statistical Hypotheses, p. 91, and exercise 3.10g on page 162 of [15]. When the null hypothesis is composite, the average of the probability of a Type 1 error with respect to a prior measure could be considered rather than the size.
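The strength of a test is easy to compute numerically once the power function is known. The sketch below is my own illustration, not Good and Crook’s example: it assumes a hypothetical size-0.05 two-sided z-test of H₀: θ = 0 based on a single N(θ, 1) observation, and averages its power over an assumed N(0, 2²) prior under the alternative.

```python
import math
import random

random.seed(2)

# Size-0.05 two-sided z-test of H0: theta = 0 from one N(theta, 1) observation.
z_crit = 1.959964  # 97.5% standard normal quantile

def power(theta):
    """Pr(|Z| > z_crit) when Z ~ N(theta, 1), computed via the normal CDF."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return (1.0 - Phi(z_crit - theta)) + Phi(-z_crit - theta)

# 'Strength' in Good and Crook's sense: power averaged with respect to a
# prior on theta under the alternative (here a hypothetical N(0, 2^2) prior).
m = 20000
strength = sum(power(random.gauss(0.0, 2.0)) for _ in range(m)) / m
```

At θ = 0 the power reduces to the size 0.05, and the strength lies between the size and one; a diffuse prior on θ pushes the strength up towards the test’s power at distant alternatives.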
Valen Johnson
pursued Good and Crook’s ideas further in his 2005 paper in JRSSB by
considering Bayes factors based on test statistics whose
distributions under the null hypothesis do not depend on the unknown
parameters. Johnson’s significance testing approach removes much of
the subjectivity that is usually associated with Bayes factors. In
his 2009 article in JASA with Jianhua Hu, the courageous coauthors
extended this methodology to model selection, a wonderful piece of
work which should influence Bayesian model selectors everywhere.
Valen Johnson is Professor of Statistics at Texas A&M University.
His applied Bayesian research interests include educational
assessment, ordinal and rank data, clinical design, image analysis,
and reliability analysis. More to the point, his theoretical ideas
are highly and refreshingly original, and I only wish that I’d
thought of them myself. He is a worthy candidate for the Bayesian
Hall of Fame, should that ever materialise.




Valen Johnson 

Debabrata Basu and Carlos Pereira published an important Bayesian paper in the Journal of Statistical Planning and Inference in 1982 on the problem of nonresponse when analysing categorical data.
Basu is famous for his
counterexamples to frequentist inference, and for Basu’s theorem
which states that any bounded complete sufficient statistic is
independent of any ancillary statistic.
S.H. Chew’s 1983 Econometrica paper ‘A generalization of the
quasilinear mean with applications to the measurement of income
inequality and decision theory resolving the Allais paradox’
affected the viability of some aspects of the foundations of
Bayesian decision theory.
Also in 1983, Carl
Morris published an important discussion paper in JASA on
parametric empirical Bayes inference, in which he shrunk the
unbiased estimators for several normal means towards a regression
surface. Similar ideas had been expressed by Adrian Smith in
Biometrika in 1973, using a hierarchical Bayesian approach, but
Carl derived some excellent mean squared error properties for his
estimators.
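The mechanics of shrinking unbiased estimators towards a regression surface can be sketched in a few lines. The example below is mine, not Carl Morris’s: it assumes hypothetical group means yᵢ ~ N(θᵢ, V) with known V, fits a regression surface by least squares, estimates the between-group variance A by a simple method of moments, and shrinks each raw mean towards its fitted value by the factor B = V/(V + Â).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: k group means y_i ~ N(theta_i, V), theta_i near a line in x.
k, V = 25, 1.0
x = np.linspace(0.0, 1.0, k)
theta = 1.0 + 3.0 * x + rng.normal(0.0, 0.5, size=k)
y = theta + rng.normal(0.0, np.sqrt(V), size=k)

# Fit the regression surface by least squares.
X = np.column_stack([np.ones(k), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Method-of-moments estimate of the between-group variance A.
resid_var = np.sum((y - fitted) ** 2) / (k - X.shape[1])
A_hat = max(resid_var - V, 0.0)

# Shrink each raw mean towards the regression surface.
B = V / (V + A_hat)                    # common shrinkage factor in [0, 1]
theta_eb = (1 - B) * y + B * fitted    # empirical Bayes point estimates
```

A fuller parametric empirical Bayes treatment, as in Morris’s paper, would also account for the uncertainty in Â and β when reporting interval estimates.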




Carl Morris 

In contrast to the proposals by Carl Morris and Adrian Smith, Amy Racine-Poon developed a Bayesian approach in her 1985 Biometrics paper to some nonlinear random effects models using a Lindley–Smith-style hierarchical prior, where she estimated the hyperparameters by marginal posterior modes using an adaptation of the EM algorithm. Her method performed rather well in small-sample simulations. In a real example, her method, when applied to only 40% of the observations, yielded similar results to those obtained from the whole sample.
In their 1981 exposition in JASA, John Deely and Dennis Lindley had proposed a ‘Bayes Empirical Bayes’ approach to some of these problems. They argued that many empirical Bayes procedures are really non-Bayesian asymptotically optimal classical procedures for mixtures. However, an outsider could be forgiven for concluding that the difference between ‘empirical Bayesian’ and ‘applied Bayesian’ is a question of semantics and a few technical differences, and certainly not something which is worth going to war over. 



John Deely 

John Deely is Emeritus Professor of Statistics at the University of Canterbury in Christchurch, and a continuing itinerant lecturer in the Department of Statistics at Purdue. He has graced the applied Bayesian literature with many fine publications, and is wonderfully frank and honest about everything.
At Valencia 6 in
1998, John was to provide me with several valuable Bayesian
insights, though he got plastered while I was discussing them with
Adrian Smith. Undeterred, I rose to the occasion during the final
evening song and comedy skits, and played Rev. Thomas Bayes
returning from Heaven in a white sheet for an interview with Tony
O’Hagan, only to get into an unexpected tangle with the microphone. 



Tom playing Rev. Thomas Bayes returning from Heaven
Valencia 6 Cabaret 1998.
The Master of Ceremonies
Tony O'Hagan played ping pong with Tom at UCL 

In similar spirit to the Deely–Lindley Bayes Empirical Bayes approach, Rob Kass and Duane Steffey’s ‘Bayesian hierarchical’ methodology for nonlinear models, which they published in JASA in 1989, referred to estimates for the hyperparameters which were calculated empirically from the data. Like Deely and Lindley, they used Laplacian approximations as part of their procedure.
The Kass–Steffey conditional Laplacian approach was to create the basis for many of the routines in the INLA package, some of which refer to marginal mode estimates for the hyperparameters, which was reported by Rue, Martino and Chopin in their extremely long paper in JRSSB, fully twenty years later in 2009.
Back in 1983, Bill DuMouchel and Jeffrey Harris published an important discussion paper in JASA about the combining of results of cancer studies in humans and other species. They employed a two-way interaction model for their estimated response slopes together with a hierarchical prior distribution for the unknown row, column, and interaction effects. Their methodology preceded more recent developments in Bayesian meta-analysis.
In his 1984 paper in the
Annals of Statistics, the statistician and psychologist Don
Rubin proposed three types of Bayesianly justifiable and relevant
frequency calculations which he thought might be useful for the
applied statistician. This followed Don’s introduction in the
Annals of Statistics (1981) of the Bayesian bootstrap,
which can, at least in principle, be used to simulate the
posterior distribution of the unknown parameters, in rather similar
fashion to Galton’s Quincunx.
The Bayesian bootstrap is similar operationally to the more standard frequency-based bootstrap. While the more standard bootstrap is presumably not as old as the Ark, it probably is as old as the first graduate student who fudged his answers by using it. It is however usually credited to Bradley Efron, and does produce widely useful results. Both bootstraps can be regarded as helpful rough-and-ready devices for scientists who are unwilling or unprepared to develop a detailed theoretical solution to the statistical problem at hand.
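Operationally, Rubin’s Bayesian bootstrap replaces the resampling of the standard bootstrap with random Dirichlet(1, …, 1) reweighting of the observed data. A minimal sketch, using an illustrative simulated sample and the mean as the quantity of interest:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10.0, 2.0, size=50)       # illustrative sample

def bayesian_bootstrap_mean(x, draws=4000):
    """Rubin's Bayesian bootstrap for the mean: each draw reweights the
    observed data with Dirichlet(1, ..., 1) weights, giving one draw from
    the posterior of the population mean."""
    w = rng.dirichlet(np.ones(len(x)), size=draws)   # draws x n weight matrix
    return w @ x

post = bayesian_bootstrap_mean(data)
lo, hi = np.percentile(post, [2.5, 97.5])   # 95% posterior interval
```

The standard bootstrap would instead draw multinomial counts with equal probabilities; the Dirichlet weights smooth those counts, which is why the two procedures give operationally similar answers.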
Don Rubin later
coauthored the highly recommendable book Bayesian Data Analysis
with Andrew Gelman, John Carlin and Hal Stern.
Andrew Gelman is Professor of Statistics and Political Science at Columbia University. He has coauthored three other books: 

Teaching Statistics: A Bag of Tricks,
Data Analysis Using Regression and Multilevel/Hierarchical Models,
Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do 


and he coedited A
Quantitative Tour of the Social Sciences with Jeronimo Cortina.
Gelman has received the Outstanding Statistical Applications Award from the ASA, the award for best article published in the American Political Science Review, and the COPSS award for outstanding contributions by a statistician under the age of 40. He has even researched arsenic in Bangladesh and radon in your basement. 



Andrew Gelman 

In 1985, John S.J. Hsu, nowadays Professor and Head of Statistics and Applied Probability at UCSB, made a remarkable computational discovery while working as a graduate student at the University of Wisconsin-Madison. He found that conditional Laplacian approximations to the marginal posterior density of an unknown scalar parameter θ can be virtually identical to the exact result, with three-decimal-place accuracy even way out in the tails of the exact density. He then proceeded to show that this was true in more general terms.
Suppose that θ and a vector ξ of further unknown parameters possess posterior density π(θ, ξ). Assume further that the parameters have already been suitably transformed to ensure that the conditional posterior distribution of ξ given θ is, roughly speaking, multivariate normal for each fixed θ. Then the conditional Laplacian approximation to the marginal posterior density of θ should be computed by reference to the following three steps:
1. For each θ, replace ξ in π(θ, ξ) by the vector which maximises π(θ, ξ) with respect to ξ for that value of θ.
2. For each θ, divide your maximised density by the square root of the determinant of the conditional posterior information matrix R(θ) of ξ given θ.
3. The function of θ thus derived is proportional to your final approximation. So integrate it numerically over all possible values of θ, and divide your function by this integral, thus ensuring that it integrates to unity.
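The three steps can be sketched numerically. The illustration below is my own, assuming a hypothetical model yᵢ ~ N(θ, e^{2ξ}) with flat priors on θ and the nuisance log standard deviation ξ; in this model step 1 has a closed form, and the conditional information in step 2 is estimated by a finite difference.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=12)
n = len(y)

def log_post(theta, xi):
    """Unnormalised log posterior for y_i ~ N(theta, exp(2*xi)),
    with flat priors on theta and xi (an illustrative assumption)."""
    S = np.sum((y - theta) ** 2)
    return -n * xi - 0.5 * S * np.exp(-2.0 * xi)

grid = np.linspace(y.mean() - 4.0, y.mean() + 4.0, 400)
approx = np.empty_like(grid)
for i, th in enumerate(grid):
    # Step 1: maximise over xi for this theta (closed form: exp(2*xi_hat) = S/n).
    S = np.sum((y - th) ** 2)
    xi_hat = 0.5 * np.log(S / n)
    # Step 2: divide by the square root of the conditional information R(theta),
    # here estimated by a central finite difference in xi.
    h = 1e-5
    R = -(log_post(th, xi_hat + h) - 2.0 * log_post(th, xi_hat)
          + log_post(th, xi_hat - h)) / h**2
    approx[i] = np.exp(log_post(th, xi_hat)) / np.sqrt(R)
# Step 3: normalise numerically so the approximation integrates to one.
approx /= np.sum(approx) * (grid[1] - grid[0])
```

In this particular model the approximate marginal is proportional to S(θ)^{-n/2}, the exact Student-t form, which illustrates the near-exactness of the approximation in well-behaved cases.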
Similar approximations are available for the prior predictive density of a vector of observations y, or the posterior predictive density of y given the realizations of a vector z of previous observations. I justified all of these approximations during 1981 via backwards applications of Bayes’ theorem. The results were published in JASA in my 1982 comment on a paper by Lejeune and Faulkenberry about predictive likelihood. 



John and Serene Hsu, Santa Barbara,
California (Christmas 2012) 

Following my three-month visit to the Lindquist Center of Measurement in Iowa City in the summer of 1984, the approximations described above to marginal posterior densities were published in 1986 by myself and Mel Novick in the Journal of Educational Statistics as part of our ‘Bayesian full-rank marginalization’ of two-way contingency tables, which was used to analyse the ‘Marine Corps’ psychometric testing data, and numerically validated by Jim Albert in his wonderful 1988 JASA paper. They were also employed by the now famous geologist and Antarctic explorer Jon Nyquist, then a fellow chess player at the University of Wisconsin, when he was estimating the depth of the Mid-Continental Rift. See Nyquist and Wang [69].
During my visit to Iowa City in 1984, I helped Shin-Ichi Mayekawa to develop an empirical Bayesian/EM algorithm approach to factor analysis for his 1985 Ph.D. thesis ‘Bayesian Factor Analysis’, which was supervised by Mel Novick. I don’t know whether this important work ever got published. Shin-Ichi is currently a highly accomplished professor in the Graduate School of Decision Science and Technology at the Tokyo Institute of Technology.
For various generalisations and numerical investigations of conditional Laplacian approximations, see ‘Bayesian Marginal Inference’, which was published in JASA in 1989 by John Hsu, Kam-Wah Tsui and myself; John Hsu’s 1990 University of Wisconsin Ph.D. thesis Bayesian Inference and Marginalization; and sections 5.1B and 5.12A of [15] and the further references reported therein, which include an application, with Christian Ritter, of the ‘Laplacian t approximation’ to the Ratkowsky MG chemical data set.
The conditional Laplacian approximation in my 1982 JASA note, and a Kass, Tierney and Kadane rearrangement thereof, were adopted as standard techniques in astronomy and astrophysics. After the two astrophysicists A.J. Cooke and Brian Espey visited the University of Edinburgh STATLAB in 1996, they published their paper [70] with Robert Carswell of the Institute of Astronomy, University of Cambridge, which concerned ionizing flux at high redshifts.
See also Tom Loredo’s Cornell University short course Bayesian Inference in Astronomy and Astrophysics, and the paper by Tom Loredo and D.Q. Lamb in the edited volume Gamma Ray Bursts (Cambridge University Press, 1992). Brian Espey is currently senior lecturer in Physics and Astrophysics at Trinity College Dublin.
Meanwhile, Luke Tierney and Jay Kadane, and later Tierney, Kass, and Kadane, investigated the asymptotic properties of these and related procedures when the sample size n is large, though without demonstrating the remarkable finite-n numerical accuracy which John Hsu and I illustrated in our joint papers. This was partly because their computations were computer-package oriented while ours referred to specially devised routines, and partly because we always employed a preliminary approximately normalising transformation of the nuisance parameters. In some cases Kass, Tierney and Kadane were able to prove saddlepoint accuracy of the conditional Laplacian approximations. However, saddlepoint accuracy does not necessarily imply good finite-n accuracy. In one of the numerical examples reported in our 1989 JASA paper, an overambitious saddlepoint approximation proposed by Tierney, Kass, and Kadane is way off target.
In their widely cited
1993 JASA paper, Norman Breslow and David Clayton used
Laplacian approximations to calculate very precise inferences for
the parameters in generalised mixed linear models.
During the 1990s
conditional Laplacian approximations were, nevertheless, to become
less popular than MCMC, which employed complicated simulations from
conditional posterior distributions when attempting, sometimes in
vain, to achieve the exact Bayesian solution. However, during the
early 21st century, the enthusiasm for time-consuming MCMC
began to wear off. For example, the INLA package developed by Håvard
Rue of the Norwegian University of Science and Technology in
Trondheim, and his associates, in 2008 (see Martino and Rue [71] for
a manual for the INLA program) implements integrated nested
Laplacian approximations for hierarchical models with some emphasis
on spatial processes, and uses posterior predictive probabilities and
cross-validation to check the adequacy of the proposed models. Rue
et al. are, for example, able to compute approximate Bayesian
solutions for the log-Gaussian doubly stochastic process. The INLA
package offers all sorts of attractive possibilities (see my Ch. 7
for more details).
The celebrated Bayesian
psychometrician Melvin R. Novick, of Lord and Novick fame, died of a
second major heart attack at age 54 while he was visiting ETS in
Princeton, New Jersey during May 1986. His wife Naomi was at his
side. He had experienced a debilitating heart attack in Iowa City in
early 1982 while I was on the way to visit him and Bob Hogg at the
University of Iowa to present a seminar on the role of coherence in
statistical modeling, and I was devastated to hear about his sudden
heart attack from Bob, who was thoroughly distraught, on my arrival.
In 2012, I was
contacted from New York by Charlie Lewis, a Bayesian buddy of Mel’s
at the American College Testing Program in Iowa City way back in
June 1972. At that time, I’d advised Charlie, in somewhat dramatic
circumstances, about Bayesian marginalization in a binomial
exchangeability model, and this enabled him to coauthor the paper
‘Marginal Distributions for the Estimation of Proportions in M
Groups’ with Mel Novick and Ming-Mei Wang in Psychometrika (1975)
which earned him his tenure at the University of Illinois. While the
coauthors briefly cited my A.C.T. technical report on the same
topic, they did not acknowledge my advice or offer me a
coauthorship. I was paid $400 plus expenses for my work at A.C.T.
that summer on a predoctoral fellowship.
Beware the ‘We cited
your technical report’ trick, folk. It happens all too often in
academia, and sometimes leads to misunderstandings and even to the
sidelining of the person with the creative ideas in favour of
authors in need of a meal ticket. It happened to me on another
occasion, when a gigantic practically-minded statistician desperate
for a publication wandered into my office, grabbed a technical
report from my desk, and walked off with it.
Charlie is
nowadays a presidential representative to ETS regarding the fairness
and validity of educational testing in the US. After an exchange of
pleasantries, forty years on, he came close to agreeing with me that 

‘Bayesians
never die, but their data analyses go on before them’. 



Charlie Lewis receiving a Graduate
Professor of the Year Award at Fordham University 

THE AXIOMS OF SUBJECTIVE
PROBABILITY: In 1986, Peter Fishburn, a mathematician
working at Bell Labs in Murray Hill, New Jersey, reviewed the
various ‘Axioms of Subjective Probability’, in a splendid article in
Statistical Science. These have been regarded by many Bayesians
as justifying their inferences, even to the extent that a scientific
worker can be regarded as incoherent if his statistical
methodology is not completely equivalent to a Bayesian procedure
based on a proper (countably additive) prior distribution i.e. if he
does not act like a Bayesian (with a proper prior).
This highly normative
philosophy has given rise to numerous practical and real-life
implications, for instance,
(1) Many empirical and
applied Bayesians have been ridiculed and even ‘cast out’ for being
at apparent variance with the official paradigm.
(2) Bayesians have at times
risked having their papers rejected when their methodology was not
perceived as being sufficiently coherent, even if their theory made
sense in applied terms.
(3) Careers have been put at
risk, e.g. in groups or departments which pride themselves on being
completely Bayesian.
(4) Some Bayesians have
wished to remain controlled by the concept of coherence even in
situations where the choice of sampling model is open to serious
debate.
(5) Many Bayesians still use
Bayes factors e.g. as measures of evidence or for model comparison,
simply because they are ‘coherent’, even in situations where they
defy common sense (unless of course a Bayesian p-value of the type
recently proposed by Baskurt and Evans is also calculated).
Before 1986, the
literature regarding the Axioms of Subjective Probability was quite
confusing, and some of it was downright incorrect. When put to the
question, some ‘coherent’ Bayesians slid from one axiom system to
another, and others expressed confusion or could only remember the
simple axioms within any particular system while conveniently
forgetting the more complicated ones. However, Peter Fishburn
describes the axiom systems in beautifully erudite terms, thereby
enabling a frank and honest discussion.
Fishburn refers to a
binary relation for any two events A and B which are expressible as
subsets A and B of the parameter space Θ. In other words, you are
required to say whether your parameter θ is more likely (Fishburn
uses the slightly misleading term ‘more probable’) to belong to A
than to B, or whether the reverse is true, or whether you judge the
two events to be equally likely. Most of the axiom systems assume
that you can do this for any pair of events A and B in Θ. This is in
itself a very strong assumption. For example, if Θ is finite and
contains just 30 elements then you will have over a billion subsets
and more than a billion billion ordered pairwise comparisons to consider.
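These counts are easily verified; a hypothetical back-of-the-envelope check:

```python
# With |Theta| = 30 there are 2**30 events (subsets of Theta), and a binary
# 'more likely than' judgement is in principle required for every ordered
# pair of distinct events.
k = 30
n_events = 2 ** k                    # 1,073,741,824: over a billion
n_pairs = n_events * (n_events - 1)  # about 1.15e18: over a billion billion
print(n_events, n_pairs)
```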
The question then
arises as to whether there exists a subjective probability
distribution p defined on events in Θ which is in agreement with all
your binary relations. You would certainly require your
probabilities to satisfy p(A) > p(B) whenever you think that θ
is more likely to belong to A than to B. It will turn out that some
further, extremely strong, assumptions are needed to ensure the
existence of such a probability distribution.
Note that when the
parameter space contains a finite number k of elements, a
probability distribution must, by definition, assign nonnegative
probabilities summing to unity to these k different elements. In
1931 De Finetti recommended attempting to justify representing your
subjective information and beliefs by a probability distribution, by
assuming that your binary relations satisfy a reasonably simple set
of ‘coherency’ axioms.
However, in 1959,
Charles Kraft, John Pratt, and the distinguished Berkeley algebraist
Abraham Seidenberg proved that De Finetti’s ‘Axioms of Coherence’
were insufficient to guarantee the existence of a probability
distribution on Θ that was in agreement with your binary relations,
and they instead required the binary relations to
satisfy a ‘strong additivity’ property that is horribly
complicated.
Indeed, the strong additivity
property requires you to contrast any m events with any
other m events, and to be able to do this for all m = 2, …, p. In Ch.
4 of his 1970 book Utility Theory for Decision Making,
Fishburn proved that if your binary relations are nontrivial then
strong additivity is a necessary and sufficient condition for the
existence of an appropriate probability distribution.
The strong additivity
property is valuable in the sense that it provides a mathematically
rigorous axiomatization of subjective probability when the parameter
space is finite, where the axioms do not themselves refer to
probability. However, in intuitive terms, it would surely be much
simpler to replace it by the more readily comprehensible assumption:


Axiom T: You can
represent your prior information and beliefs about θ by assigning
nonnegative relative weights (to subsequently be regarded as
probabilities), summing to unity, to the k elements or ‘outcomes’ in
Θ, in the knowledge that you can then calculate your relative weight
of any event A by summing the relative weights of the corresponding
outcomes in Θ.
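A minimal illustration of Axiom T, with purely hypothetical outcomes and weights:

```python
from fractions import Fraction

# Axiom T in miniature: non-negative weights, summing to unity, are assigned
# to the k outcomes in Theta; the relative weight of any event A is then the
# sum of the weights of its member outcomes.
weights = {"theta1": Fraction(1, 2),
           "theta2": Fraction(1, 4),
           "theta3": Fraction(1, 4)}
assert sum(weights.values()) == 1  # the weights form a probability distribution

def weight(event):
    """Relative weight (probability) of an event, given as a set of outcomes."""
    return sum(weights[outcome] for outcome in event)

A = {"theta1", "theta3"}
B = {"theta2"}
print(weight(A), weight(B))  # 3/4 and 1/4: theta is judged more likely to lie in A
```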
I do not see why we need
to contort our minds in order to justify using Axiom T. You should
just be able to look at it and decide whether you are prepared to
use it or not. When viewed from an inductive scientific perspective,
the strong additivity property just provides us with highly elaborate
window dressing. And if you don’t agree with even a tiny part of it,
then it doesn’t put you under any compulsion to concur with Axiom T.
The so-called concept of coherence is pie in the sky! You either
want to be a Bayesian (or act like one) or you don’t.
When the parameter
space Θ is infinite, probabilities can, for measure-theoretic
reasons, only be defined on subsets of Θ, known as events, which
belong to a σ-algebra, e.g. an infinite Boolean algebra of subsets of
Θ. Any probability distribution which assigns probabilities to
members of this σ-algebra must, by definition, satisfy Kolmogorov’s
‘countable additivity property’: in other words, your probabilities
should add up sensibly when you are evaluating the probability of
the union of any finite or infinite sequence of disjoint events.
Most axiom systems require you to state the binary relationship
between any two of the multitudinous events in the σ-algebra, after
considering which of them is more likely.
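Kolmogorov's countable additivity property, described verbally above, may be written as follows: for any sequence of pairwise disjoint events in the σ-algebra,

```latex
P\left(\bigcup_{i=1}^{\infty} A_i\right) \;=\; \sum_{i=1}^{\infty} P(A_i),
\qquad A_i \cap A_j = \emptyset \quad (i \neq j).
```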
Bruno De Finetti and
Jimmie Savage proposed various quite complicated axiom systems while
attempting to guarantee the existence of a probability distribution
that would be in agreement with your infinity of infinities of
binary relations. However, in his 1964 paper on qualitative probability
σ-algebras, in the Annals of Mathematical Statistics, the
brilliant Uruguayan mathematician Cesareo Villegas proved that a
rather strong property is needed, in addition to De Finetti’s four
basic axioms, in order to ensure countable additivity of your
probability distribution, and he called this the monotone
continuity property.
Trivialities can be more
essential than generalities, as my friend Thomas the Tank Engine
once said.
Unfortunately,
monotone continuity would appear to be more complicated in
mathematical terms than Kolmogorov’s countable additivity axiom, and
it would seem somewhat easier in scientific terms to simply decide
whether you want to represent your subjective information or beliefs
by a countably additive probability distribution, or not. If
monotone continuity is supposed to be a part of coherence, then this
is the sort of coherence that is likely to glue your brain cells
together. Any blame associated with this phenomenon falls on the
ways the mathematics was later misinterpreted, and we are all very
indebted to Cesareo Villegas for his wonderful conclusions.
Your backwards
inductive thought processes might well suggest that most other sets
of Axioms of Subjective Probability which are pulled out from under
a stone would by necessity include assumptions which are similar in
strength to strong additivity and monotone continuity, since they
would otherwise not be able to imply the existence of a countably
additive probability distribution on a continuous parameter space
which is in agreement with your subjectively derived binary
relations. See for example, my discussion in [8] of the five axioms
described by De Groot on pages 71–76 of Optimal Statistical
Decisions. Other axiom systems are brought into play by the
discussants of Peter Fishburn’s remarkable paper.
In summary, the Axioms
of Subjective Probability are NOT essential ingredients of the
Bayesian paradigm. They’re a torrent, rather than a mere sprinkling,
of proverbial Holy Water. If we put them safely to bed, then we can
broaden our wonderfully successful paradigm in order to give it even
more credibility in Science, Medicine, SocioEconomics, and wherever
people can benefit from it.
Meanwhile, some of the
more diehard Bayesian ‘High Priests’ have been living in
‘airy-fairy’ land. They have been using the axioms in their apparent
attempts to control our discipline from a narrow-minded
power base, and we should all now make a determined effort to break
free from these Romanesque constraints in our search for scientific
truth and reasonably evidencebased medical and social conclusions
which will benefit Society and ordinary people.
Please see Author’s
Notes (below) for an alternative view on these key issues, which
has been kindly contributed by Angelika van der Linde, who has
recently retired from the University of Bremen.
Cesareo Villegas
(19212001) worked for much of his career at the Institute of
Mathematics in Montevideo before moving to Simon Fraser University
where he became Emeritus Professor of Statistics after his
retirement. He published eight theoretical papers in the IMS’s
Annals journals, and three in JASA. He is well-known for his
development of priors satisfying certain invariance properties, was
actively involved with Jose Bernardo in creating the Bayesian
Valencia conferences, and seems to have been an unsung hero.
Bruno de Finetti's
contributions to the Axioms of Subjective Probability were far
surpassed by his (de Finetti’s) development of the key concept
‘exchangeability’, both in terms of mathematical rigor and
credibility of subsequent interpretation, and it is for the latter
that our Italian maestro should be longest remembered.




Bruno de Finetti 

Also in 1986, Michael
Goldstein published an important paper on ‘Exchangeable Prior
Structures’ in JASA, where he argued that expectation (or
prevision) should be the fundamental quantification of individuals’
statements of uncertainty and that inner products (or belief
structures) should be the fundamental organizing structure for the
collection of such statements. That’s an interesting possibility.
I’m sure that Michael enjoys chewing the rag with Phil Dawid.
Maybe Michael should be
known as ’Mr. Linear Bayes U.K.’. He and his coauthors, including
his erstwhile Ph.D. supervisor Adrian Smith, have followed in the
footsteps of Kalman and Bucy by publishing many wonderful papers on
linear Bayes estimation. See Michael’s overview in the
Encyclopedia of Statistical Sciences, and his book Bayes
Linear Statistics with David Wooff. 



Michael Goldstein 

The 1987 monograph
Differential Geometry in Statistical Inference by Amari,
Barndorff-Nielsen, Kass, Lauritzen and Rao is of fundamental
importance in Mathematical Statistics. In his 1989 article ‘The
Geometry of Asymptotic Inference’ in Statistical Science, Rob
Kass provides readers with a deep understanding of the ideas of Sir
Ronald Fisher and Sir Harold Jeffreys as they relate to Fisher’s
expected information matrix.
Rob Kass is one of the
most frequently cited mathematicians in academia, and he is one of
Jay Kadane’s proudly Bayesian colleagues at Carnegie Mellon
University.
Kathryn Chaloner, more
recently Professor of Statistics, Biostatistics, and Actuarial
Science at the University of Iowa, also made some important
contributions during the 1980s. These include her optimal Bayesian
experimental designs for the linear model and nonlinear models with
Kinley Larntz, and her Bayesian estimation of the variance
components in the unbalanced oneway model. More recently, she and
Chao-Yin Chen applied their Bayesian stopping rule for a single arm
study to a case study of stem cell transplantation. Then she and
several coauthors developed a Bayesian analysis for doubly censored
data that used a hierarchical ‘Cox’ model, and this was published in
Statistics in Medicine.




Kathryn Chaloner 

Kathryn was elected
Fellow of the American Association for the Advancement of Science in
2003. After graduating from Somerville College Oxford, she’d
obtained her Master’s degree from University College London in 1976
and her Ph.D. from Carnegie Mellon in 1982.
John Van Ryzin, a
pioneer of the empirical Bayes approach, died heroically from AIDS in
March 1987 at the age of 51, after continuing to work on his
academic endeavours until a few days before his death. He was the
coauthor, with Herbert Robbins, of ‘An Introduction to Statistics’,
an authority on medical statistics, survival analysis, and the
adverse effects of radiation and toxicity, and a member of the
science review panel of the Environment Protection Agency. A fine
family man, he was also a most impressive seminar speaker.
In 1987, Gregory Reinsel
and George Tiao employed random effects regression/time series
models to investigate trends and possible holes in the stratospheric
ozone layer at a time when it was being seriously affected by
polluting chemicals. Their approach, which they reported in an
outstanding paper in JASA, is effectively hierarchical
Bayesian.








Greg
Reinsel (1948–2005) 

George Tiao 


In 1988, Jim Albert,
previously one of Jim Berger’s Ph.D. students at Purdue, reported
some computational methods in JASA, based upon Laplacian
approximations to posterior moments, for generalised linear models
where the parameters are taken to possess a hierarchical prior
distribution. He, for example, analysed a binomial/beta exchangeable
logit model, which is a special case of the Leonard-Novick
formulation in our 1986 paper in the Journal of Educational
Statistics, and provided an alternative to my exchangeable prior
analysis for binomial logits in my 1972 Biometrika paper.
Jim has also published a
series of papers that assume Dirichlet prior distributions, and
mixtures thereof, for the cell probabilities in r×s contingency
tables, thereby very soundly extending the work of I.J. Good.



THE AXIOMS OF UTILITY:
In 1989, Peter Wakker, a leading Dutch Bayesian economist on the
faculty of Erasmus University in Rotterdam, published his magnum opus
Additive Representations of Preferences: A New Foundation of
Decision Analysis. Peter for example analysed ‘Choquet expected
utility’ and other models using a general trade off technique for
analysing cardinal utility and based on a continuum of outcomes. In
his later book Prospect Theory for Risk and Ambiguity he
analyses Tversky and Kahneman’s cumulative prospect theory as yet
another valid alternative to Savage’s expected utility. In 2013,
Peter was awarded the prestigious Frank P. Ramsey medal by the
INFORMS Decision Analysis Society for his high quality endeavours.
Peter has recently
advised me (personal communication) that modifications to Savage’s
expected utility which put positive premiums on the positive
components of a random monetary reward which are regarded as certain
to occur, and negative premiums on those negative components which
are thought to be certain to occur, are currently regarded as the
state of the art e.g. as it relates to Portfolio Analysis. For an
easy introduction to these ideas see Ch. 4 of my 1999 book [15].
In an important
special case, which my Statistics 775: Bayesian Decision and
Control students, including the Catalonian statistician Josep
Ginebra Molins and the economist Jean Deichmann, validated by a
small empirical study in Madison, Wisconsin during the 1980s, a
positive premium ε can be shown to satisfy ε = 2φ − 1 whenever a
betting probability φ, which should be elicited from the investor,
exceeds 0.5.
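A hypothetical one-function sketch of this premium rule (the elicitation of φ itself is assumed to have taken place):

```python
# Premium rule quoted in the text: for an elicited betting probability
# phi > 0.5, the positive premium is epsilon = 2*phi - 1.
def positive_premium(phi):
    if not 0.5 < phi <= 1:
        raise ValueError("the rule applies only when phi exceeds 0.5")
    return 2 * phi - 1

print(positive_premium(0.75))  # 0.5
```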
The preceding
easytounderstand approach, which contrasts with the ramifications
of Prospect theory, was axiomatized in 2007 by Alain Chateauneuf,
Jürgen Eichberger and Simon Grant in their article ‘Choice under
uncertainty with the best and the worst in mind: New additive
capacities’ which appeared in the Journal of Economic Theory.
While enormously
complex, the new axiom system is nevertheless the most theoretically
convincing alternative that I know of to the highly normative Savage
axioms. Their seven mathematically expressed axioms, which refer to
a preference relation on the space of monetary returns, may be
referred to under the following headings:
Axiom 0: Nontrivial
preferences
Axiom 1: Ordering (i.e. the
preference relation is complete, reflexive and transitive)
Axiom 2: Continuity
Axiom 3: Eventwise
monotonicity
Axiom 4: Binary comonotonic
independence
Axiom 5: Extreme events
sensitivity
Axiom 6: Null event
consistency
If you are able to
understand all these axioms and wish to comply with them, then that
puts you under some sort of obligation to concur with either the
Expected Utility Hypothesis, or the simple modification suggested in
Ch. 4 of [15], or obvious extensions of this idea. Alternatively, you
could just replace expected utility by whatever modification or
alternative best suits your practical situation. Note that, if your
alternative criterion is sensibly formulated, then it might, at
least in principle, be possible to devise a similarly complex axiom
system that justifies it, if you really wanted to. In many such
cases, the cart has been put before the proverbial horse. It’s
rather like a highly complex spiritual biblical prophecy being
formulated after the event to be prophesied has actually occurred.
Maybe the Isaiahs of decision theoretic axiomatics should follow in
the footsteps of Maurice Allais and focus a bit more on empirical
validations, and the scientific and socioeconomic implications of
the methodology which they are striving to self-justify by these
somewhat over-left-brained, potentially Nobel prize-winning
theoretical discourses.
In their 1989 paper in the
Journal of Econometrics, the Dutch econometrician Mark Steel
proposed, with Jean-François Richard, a Bayesian analysis of
seemingly unrelated regression equations that used a recursive
extended conjugate prior density. Mark Steel published several
papers on related topics during the next few years e.g. on robust
Bayesian and marginal Bayesian inferences in skewed sampling
distributions and elliptical regression models. Mark’s research
ideas are extremely imaginative and he has continued to publish them
until this very day. 



Mark Steel 

Mark Steel is currently
Professor of Statistics at the University of Warwick and also holds
the Chair of Excellence at the Carlos III University of
Madrid. His interests have now moved away from Econometrics, and
into mainstream theoretical and applied Bayesian Statistics. Mark
was awarded his Ph.D. by the Catholic University of Louvain in 1987.
His Ph.D. topic was A Bayesian Analysis of Exogeneity: A
Monte Carlo Approach.
Susie M. J. Bayarri and
Morris De Groot coauthored thirteen joint papers between 1987 and
1993 including ‘A Bayesian view of selection models’, ‘Gaining
weight: A Bayesian Approach’, and ‘What Bayesians expect of each
other’. This was doubtlessly one of the most prolific Bayesian
coauthorships of all time, and it helped Susie to become one of the
leading Bayesians in Spain and one of the most successful women
Bayesians in history.




Bayesettes: Alice Carriquiry, Susie
Bayarri and Jennifer Hill
(Archived from Brad Carlin's Collection)


The leading Brazilian
statistician Carlos Pereira published his Associate Professor thesis
in 1985, on the topic of the Bayesian and classical interpretations
of multidimensional hypothesis testing. He has published over 200
papers in scientific journals, many of them Bayesian, with important
applications in a number of disciplines, including medicine and
genetics. Carlos has also supervised 21 Ph.D. theses and 23 Masters
dissertations, and encouraged numerous other Bayesian researchers.
His book The Elements of Bayesian Inference was coauthored
with Marlos Viana and published in Portuguese in 1982. Other notable
Brazilian Bayesians include Carlos’s brother Basilio Pereira,
Gustavo Gilardoni, Heleno Bolfarine, Helio Migon, Dani Gamerman,
Jorge Achcar, Alexandra Schmidt, Hedibert Lopes
and Vinicius Mayrink.
The paper ‘Aspects of reparametrization in approximate Bayesian Inference’ by Jorge Achcar
and Adrian Smith in Bayesian and Likelihood Methods in Statistics
and Econometrics (1990) is particularly impressive.








Carlos
Pereira 

Vinicius
Mayrink 


Carlos Pereira’s
contributions during and beyond the 1980s epitomise the success of
the Bayesian approach in evolving into a paradigm that is socially
and practically relevant in the way it addresses data sets in the
context of their social, medical, and scientific backgrounds,
sometimes at a grass roots level, and draws logical conclusions that
would be unavailable from within the Fisherian and likelihood
paradigms. 

