A PERSONAL HISTORY OF BAYESIAN STATISTICS

Thomas Hoskyns Leonard

Retired Professor of Statistics, Universities of Wisconsin-Madison and Edinburgh

4: THE PRAGMATIC SEVENTIES (CONTINUED)

Let us be evidence-based, though not spuriously evidence-based







Tom Leonard (University of Warwick, 1976) 

Tom's father, Captain Cecil Leonard (1913-2001)


The first part of this chapter appears in the John Wiley online library.

Until 1972, much of the Bayesian literature was devoted to procedures which either emulated the frequentist approach or incorporated prior information via conjugate prior distributions, where the prior information could instead have been represented by ‘prior likelihoods’ requiring the specification of ‘prior observations and sample sizes’; see Antony Edwards’ 1972 book Likelihood. Consequently, the practical advantages of the Bayesian paradigm had not yet been fully exploited.
Dennis Lindley’s and Adrian Smith’s widely cited landmark paper ‘Bayes Estimates for the Linear Model’ [57] seemed to change all that in 1972. In the context of M-group regression, they assume between-group De Finetti-style prior exchangeability of the M vectors of regression parameters, in the homogeneous variance situation where the sampling variances of all the observations in the groups are assumed to be equal. Their shrinkage estimators for the vectors of regression coefficients are therefore special cases of the procedures described in a previously unpublished technical report by Lindley, which was based upon work he completed at the American College Testing Program (A.C.T.) in Iowa City. In this A.C.T. report, he addressed the heterogeneous variance situation by assuming that the M unequal variances corresponding to the different regressions were also a priori exchangeable.
The empirical validation so famously reported in the Lindley-Smith paper was based upon computations completed by Melvin Novick, Paul Jackson, Dorothy Thayer, and Nancy Cole at A.C.T. and reported in the British Journal of Mathematical and Statistical Psychology (1972). However, their validation related to the more general methodology developed in Lindley’s A.C.T. technical report rather than to the special homogeneous variance case described in the Lindley-Smith paper.
According to my long-term memory, and I suppose that my brain cells could have blown a few gaskets over the years, Paul Jackson, a lecturer at the University College of Wales in Aberystwyth, advised me at A.C.T. in June 1971, with a smile and a wink, that he’d handled a serious overshrinkage problem relating to Lindley’s estimates (see below) by setting an ‘arbitrarily small’ degrees-of-freedom hyperparameter equal to a small-enough-looking value which was nevertheless large enough to ensure that the ‘collapsing phenomenon’ became much less evident.
As I remember, Paul Jackson’s choice of hyperparameter ensured that the posterior estimates of the M sampling variances were substantially different, and this in turn ensured that the quasi-Bayesian estimates for the M regression vectors were also comfortingly different. It was a new experience for me, if I’ve remembered it correctly, to see how the more ingenious of our applied statisticians can handle awkward situations with a flick of their fingers.
Lindley and Smith described their hierarchical prior distribution in the following two stages:
Stage 1: The vectors of regression parameters constitute a random sample from a multivariate normal distribution with unspecified mean vector μ and covariance matrix C. The common sampling variance V possesses a rescaled inverted chi-squared distribution with specified degrees of freedom.
Stage 2: The first stage mean vector μ and covariance matrix C are a priori independent; μ is uniformly distributed over p-dimensional Euclidean space, and the covariance matrix C possesses an inverted Wishart distribution with specified degrees of freedom and p×p mean matrix.
The posterior mean vectors of the regression vectors, conditional on V and C, can then be expressed as matrix-weighted averages of the individual least squares vectors and a combined vector, which can itself be expressed as a matrix-weighted average of these least squares vectors. Lindley and Smith then make the conceptual error of estimating V and C via the joint posterior modes of V, C, μ, and the regression parameters. These are not, of course, either Bayes or generalised Bayes estimates.
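In symbols (the notation here is mine, for illustration, rather than Lindley and Smith’s own), the two-stage prior and the resulting conditional posterior means take the form:

```latex
\begin{align*}
\text{Stage 1:}\quad & \boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_M \mid \boldsymbol{\mu}, C
    \ \sim\ \text{i.i.d. } N_p(\boldsymbol{\mu}, C),
    \qquad V \sim \text{scaled inverse-}\chi^2, \\
\text{Stage 2:}\quad & \pi(\boldsymbol{\mu}) \propto 1 \ \text{on } \mathbb{R}^p,
    \qquad C \sim \text{inverse-Wishart},
\end{align*}
% so that, conditional on V and C,
\[
E(\boldsymbol{\theta}_i \mid V, C, \text{data})
    \;=\; W_i\,\hat{\boldsymbol{\theta}}_i + (I - W_i)\,\bar{\boldsymbol{\theta}},
\]
```

where θ̂ᵢ is the least squares vector for the ith group, θ̄ is itself a matrix-weighted average of the θ̂ᵢ, and the weight matrices Wᵢ depend upon V and C.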
Lindley and Smith did not apparently notice a ‘collapsing phenomenon’ that causes their estimates of the regression coefficients to radically over-adjust the least squares estimates. Marginal modal estimates of V and C may instead be calculated, and these will be asymptotically consistent for the ‘true’ V and C as M gets large, under the two-stage distributional assumptions described above.
A straightforward application of the EM algorithm, as described by Art Dempster, Nan Laird, and Don Rubin in JRSSB in 1977, would yield equations for these marginal estimates which include important extra terms, when compared with the Lindley-Smith equations, that do not vanish as M gets large. See the equations for the marginal maximum likelihood estimates of V and C described in Exercise 6.3 h on pages 271-272 of [15].
In his 1976 paper in Biometrika, Tony O’Hagan showed that joint and marginal posterior modes can be quite different in numerical terms. Moreover, in their reply to my trivially intended contribution to the discussion of their paper, Bradley Efron and Carl Morris [27] showed that the Lindley-Smith estimators of the location parameters have quite inferior mean squared error properties, in a special case, when compared with their empirical Bayes estimators.
In 1980, Bill Fortney and Bob Miller [58] sorted out M-group regression with a view to applications in business, and showed in a special case that O’Hagan’s procedures yield much better mean squared error properties for the estimators of the regression coefficients than the Lindley-Smith joint modal estimators. In their 1996 JASA paper, Li Sun et al. were to show, in a special case, that as M gets large the Lindley-Smith estimators have far worse properties than maximum likelihood over arbitrarily large regions of the parameter space, and for any choices of the prior parameter values.
Adrian Smith waited until 1973 to publish a numerical example of his M-group regression methodology, again in JRSSB.
I recall bumping into Adrian in the bar of, as I remember, the downtown Hilton hotel, during the ASA meetings in San Francisco in 1987. I was sober at the time, since I had just completed my ASA short course lectures on categorical data analysis for the day. During a convivial conversation with Adrian, we got to talking about the computational difficulties associated with the 1972 Lindley-Smith paper, which hadn’t been resolved during the following fifteen years, to my satisfaction at least. I recall Adrian smiling into his drink and, I’d put my hand on a Bible on this one, advising me to the effect that ‘he’d always stopped after the first stage in the iterations (for the posterior modes), since that provided the users with what they really wanted’.
The consequences of Adrian’s youthful enthusiasm appear to be evident in Table 2 of the Lindley-Smith paper, where the authors compare their estimates for a single regression vector, under a special prior formulation, with the Hoerl-Kennard ‘ridge regression’ estimates. Most of the, presumably incorrectly calculated, Lindley-Smith estimates of the regression coefficients shrink the least squares estimates much less towards zero than do the ridge regression estimates.
In 1977, Norman Draper and Craig Van Nostrand [59] reported detailed results which suggest that the ridge regression estimates, which refer to a graph of the ridge trace, themselves shrink too much when compared with Stein-style estimates. If Lindley and Smith had pursued their iterations to the limit, then their correctly calculated numerical results could well have been much closer than they claim to the less-than-desirable Hoerl-Kennard estimates.
In the orthonormal case, and as the number of unknown regression parameters gets large, the methodology employed by Sun et al. (JASA, 1996) may easily be generalised, in this ridge regression situation, to show that the correctly calculated Lindley-Smith joint posterior modal estimates have far worse mean squared error properties than least squares over arbitrarily large regions of the parameter space, and for any choices of the prior parameter values.
Draper and Van Nostrand moreover take ridge regression to task for other compelling reasons. Some of their criticism can also be directed at the Lindley-Smith estimates in this single regression situation.
[Legal Note: Adrian reported his Ph.D. research in the Lindley-Smith 1972 paper and his 1973 papers in JRSSB and Biometrika, and these continue to influence our literature and events in my personal history of Bayesian Statistics. I have not read his 1971 University of London Ph.D. thesis in recent years, and I hope that he included enough caveats in his presentation to ensure the authenticity of his thesis. As of 14th January 2014, I am still trying to resolve all these issues with Adrian, and I will amend this chapter as appropriate.]
In Table 1 of their paper, Lindley and Smith report the results of the Novick et al. ‘empirical validation’ of Lindley’s previous quasi-Bayesian estimates in the M-group regression situation. The attempted validation, which was completed at A.C.T., referred to the prediction from previous data of students’ grade point averages collected in 1969 at 22 different American colleges, and the authors measured ‘predictive efficiency’ by average mean squared error. Lindley and Smith claimed that their estimates yielded a predictive efficiency of 0.5502, when compared with 0.5596 for least squares. However, when a 25% subsample of the data was considered, their estimates gave a purported 0.5603 predictive efficiency, when compared with 0.6208 for least squares.
When proposing the vote of thanks after the presentation of Lindley and Smith’s paper to the Royal Statistical Society, the confirmed frequentist Robin Plackett concluded that their estimates ‘doubled the amount of information in the data’. This conclusion appears, in retrospect, to be quite misguided, for example because the empirical validation at A.C.T. was (according to my own perception) affected by Paul Jackson’s judicious ‘juggling’ of a prior parameter value during the application of an otherwise untested computer package, and because it was unreplicated.
The empirical validation may indeed have simply ‘struck lucky’. (This interpretation seems extremely plausible; just consider the unconvincing nature of the theory underpinning the forecasts.) Moreover, Novick et al. made their forecasts quite retrospectively, about two years after the grade point averages they were predicting had actually been observed and reported. This gave the authors considerable scope for selective reporting, and it is not obvious why they restricted their attention to only 22 colleges.
A celebrated Bayesian forecaster, who learnt the tricks of his trade at Imperial Chemical Industries, later advised me that ‘The only good forecasts are those made after the event’, and I am not totally convinced that he was joking! He also said, ‘I like exchangeability. It means that you can exchange your prior when you’ve seen your posterior.’ Unfortunately, that’s exactly what Paul Jackson would appear to have done when he changed the value of a degrees-of-freedom hyperparameter. In so doing, he would appear, to me at least, to have altered the course of the history of Bayesian Statistics.
Our senior, more experienced statisticians hold a privileged position in society; maybe they should try to reduce the professional pressures on graduate students and research workers to produce ‘the right result’, since these can even create situations where a wrong result is projected as the truth. This problem was examined as recently as 2nd September 2013 at the Royal Society of Edinburgh, when possible inaccuracies in the official Scottish health projections were discussed with members of the UK Statistics Authority.
I am surprised that Dennis and Adrian never formally retracted the empirical conclusions in their paper, but instead waited for other authors to refute them after further painstaking research. Thank goodness that Fortney and Miller [58] returned some sanity to M-group regression in their, albeit largely unpublicised, paper. I would never have known about their work if Bob Miller hadn’t mentioned it to me during the late 1980s.
It is possible to derive an approximate fully Bayesian solution to the M-group regression problem by reference to a matrix logarithmic transformation of the first stage prior covariance matrix and conditional Laplacian approximations. Maybe somebody will get around to that sometime.


Robert B. Miller was a Professor of Statistics and Business at the University of Wisconsin-Madison. He is currently Associate Dean of Undergraduate Studies in the Business School at Wisconsin, where he was a colleague of the late Dean James C. Hickman of Bayesian actuarial fame. The brilliant actuary and statistician Jed Frees has followed in Hickman and Miller’s footsteps, at times from a Bayesian perspective.


Various procedures for eliciting prior probabilities or entire prior distributions, e.g. from an expert or group of experts, were investigated during the early 1970s.
On pp. 71-76 of his 1970 book Optimal Statistical Decisions, Morris De Groot recommended calibrating your subjective probability by reference to an objective auxiliary experiment which generates a random number in the unit interval, e.g. by spinning a pointer, and he used this calibration procedure as part of an axiomatic development of subjective probability. His innocuous-looking fifth axiom is very strong in the way it links the assessor’s preference ordering on elements of the parameter space to the outcome of the auxiliary experiment, in a manner consistent with the first four axioms. See [16].
Parts of the extensive literature on the elicitation of prior
distributions were reviewed in JASA in 2005 by Paul
Garthwaite, Jay Kadane and Tony O’Hagan.
Paul Garthwaite is Professor of Statistics at the Open
University, where his research interests also include animal and
food nutrition, medicine, and psychology.
Jay Kadane is the Leonard J. Savage Professor of Statistics at Carnegie-Mellon University, and one of the world’s leading traditional Bayesians. Here is a brief synopsis of some of the key elements of the JASA review:
In 1972, Arnold Zellner proposed a procedure for assessing informative prior distributions for sets of regression coefficients. In 1973, James Hampton, Peter Moore and Howard Thomas reviewed the previous literature regarding the assessment of subjective probabilities. In 1973, Daniel Kahneman and Amos Tversky investigated various psychological issues associated with these procedures, and a 1974 experiment by the same authors demonstrated a phenomenon called ‘anchoring’, where an initial assessment, called an anchor, made by a subjective probability assessor is adjusted by a quantity which is too small when the person in question reassesses his probability. In his 1975 review, Robin Hogarth maintains that ‘assessment techniques should be designed both to be compatible with man’s abilities and to counteract his deficiencies’.
In a seminal paper published in JASA in 1974, Morris De Groot used the properties of recurrence and transience in Markov chains to determine conditions under which a group of experts can reach a consensus regarding the probabilities in a discrete subjective distribution. In doing so, he extended the Delphi Method, a management technique for reaching a consensus among a group of employees, which was reviewed and critiqued by Juri Pill in 1971.
De Groot’s ideas on combining and reconciling experts’ subjective distributions were pursued in 1979 by Dennis Lindley, Amos Tversky, and Rex V. Brown in their rather over-elaborate discussion paper ‘On the Reconciliation of Probability Assessments’, which they read to the Royal Statistical Society, and by the brilliant Brazilian Bayesian Gustavo Gilardoni and many others. However, beware the philosophy ‘Formalisations impede the natural man’, as Jim Dickey, or maybe it was me, once said.
The question arises as
to whether a subjective distribution for a parameter assuming values
in a continuous parameter space can be meaningfully elicited from a
subjectmatter expert in a consulting environment. It would appear
to be quite difficult to complete this task in its entirety in the
short time that is typically available, unless the distribution was
assumed to belong to a special class of distributions. Without this
type of assumption, it might be possible to elicit, say, the
expert’s subjective mean and standard deviation, and two or three
interval probabilities, in which case you could use the distribution
which maximises Shannon’s entropy measure given the specified
information. It would appear to be very difficult to completely
assess a joint subjective probability distribution for several
parameters, though it might be worth assuming that some
transformation of the parameters is multivariate normally
distributed.
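As a toy illustration of the maximum-entropy step, here is a sketch under simplifying assumptions (a discrete grid of support points and a single mean constraint, rather than the full set of moments and interval probabilities just mentioned; the function name is mine). The maximising distribution has the exponential-family form p_i ∝ exp(λx_i), and λ can be found by bisection:

```python
import math

def maxent_mean(support, target_mean, tol=1e-10):
    """Discrete maximum-entropy distribution with a specified mean.
    The solution has the exponential-family form p_i ∝ exp(lam * x_i);
    we solve for lam by bisection, since the implied mean is increasing in lam."""
    def mean_for(lam):
        w = [math.exp(lam * x) for x in support]
        z = sum(w)
        return sum(x * wi for x, wi in zip(support, w)) / z
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * x) for x in support]
    z = sum(w)
    return [wi / z for wi in w]

# With the target mean at the centre of a symmetric support, the
# maximum-entropy answer is uniform; an off-centre target tilts it.
p = maxent_mean([1, 2, 3, 4], 2.5)
```

With the target mean at the centre of a symmetric support the answer is uniform, as Shannon’s measure would demand; an off-centre target tilts the distribution exponentially.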
A deeper philosophical
question is whether or not a Bayesian statistician should try to
elicit subjective probabilities on a regular basis in a consulting
environment. In my opinion, it would at times be more beneficial for
the statistician to just analyse the observed statistical data,
while interacting with his client regarding the subjectmatter
background of the data.
In 1972 Mervyn Stone and Phil Dawid reported some ingenious
marginalization ‘paradoxes’ in Biometrika which seemed to
suggest that improper prior distributions do not always lead to
sensible marginal posterior distributions even if the (joint)
posterior distribution remains proper. In 1973, they and Jim Zidek
presented a fuller version [60] of these ideas to a meeting of the
Royal Statistical Society, and their presentation was extremely
amusing and enjoyable with each of the three coauthors taking turns
to speak.
The purported paradoxes
involve discussions between two Bayesians who use different
procedures for calculating a posterior in the presence of an
improper prior. I was too scared to participate in the debate that
took place after the presentation of the paper, because I felt that
the ‘paradoxes’ only occurred in pathological cases and that they
were at best implying that ‘You should regard your improper prior as
the limit of a sequence of proper priors, while ensuring that your
limiting marginal posterior distributions are intuitively sensible
when contrasted with the marginal posterior distributions
corresponding to the proper priors in the sequence’.
Phil Dawid asked me
afterwards why I hadn’t contributed to the discussion. I grunted,
and he said, ‘say no more.’
The Dawid, Stone, and Zidek ‘paradoxes’ have subsequently been heavily criticized by E.T. Jaynes [61], and by the Bayesian quantum physicist and probabilist Timothy Wallstrom [62] of the Los Alamos National Laboratory in New Mexico, who believes that ‘the paradox does not imply inconsistency for improper priors since the argument used to justify the procedure of one of the Bayesians (in the two-way discussion) is inappropriate’, and backs up this opinion with mathematical arguments.
Wallstrom has also
made seminal contributions to Bayesian quantum statistical inference
and density estimation, as discussed later in the current chapter,
and to the theory of stochastic integral equations.




Timothy Wallstrom 

Other very mathematical objections to the marginalization ‘paradoxes’ are raised by the highly respected mathematical statisticians Joseph Chang and David Pollard of Yale University in their 1997 paper ‘Conditioning as disintegration’ in Statistica Neerlandica. However, according to a more recent unpublished UCL technical report, Stone, Dawid and Zidek still believe that their paradoxes are well-founded.
Despite these potential caveats, Dennis Lindley took the opportunity during the discussion of the 1973 invited paper to publicly renounce the use of improper priors in Bayesian Statistics. By making this highly influential recantation, Dennis effectively negated (1) much of the material contained in his very popular pink volume, (2) several of the procedures introduced in his own landmark papers, (3) large sections of the time-honoured literature on inverse probability, including many contributions by Augustus De Morgan and Sir Harold Jeffreys, (4) many of Jeffreys’ celebrated invariance priors, and (5) a large amount of the work previously published by other highly accomplished Bayesians in important areas of application, including numerous members of the American School.
Sometime before the meeting, Mervyn Stone said that, as editor of JRSSB, he would ‘give a hard time to authors who submitted work to him that employed improper priors’. And so he did!


STOP PRESS! On 22nd January 2014, and after a
helpful correspondence with my friend and former student Michael
Hamada, who works at the Los Alamos National Laboratory in New
Mexico, I received the following email from Dr. Timothy Wallstrom: 

Dear Professor Leonard,
I was flattered to learn from Mike Hamada that
I am appearing in your Personal History of Bayesian Statistics,
both for the work on QSI with Richard Silver, and for my 2004
arXiv publication on the Marginalization Paradox.
I have since come to realize, however, that my 2004 paper on
the Marginalization Paradox, while technically correct, misses the
point. Dawid, Stone, and Zidek are absolutely correct. I realized
this as I was preparing a poster for the 2006 Valencia conference,
which I then hastily rewrote! Dawid and Zidek were at the
conference, and I discussed these ideas at length with them there,
and later with Stone via email. My new analysis was included in
the Proceedings; I've attached a copy. I also presented an
analysis at the 2007 MaxEnt conference in Sarasota, NY. That
writeup is also in the corresponding proceedings, and is posted
to the arXiv, at http://arxiv.org/abs/0708.1350 .
Thank you very much for noting my work, and I apologize for
the confusion. I need to revise my 2004 arXiv post; I'm encouraged
to do so now that I realize someone is reading it!
Best regards,
Tim Wallstrom
P. S. A picture of me presenting the work at the Maxent conference
(which is much better than the picture you have) is at




I would like to thank Timothy Wallstrom for advising me that he had already apparently resolved the long-standing Marginalisation Paradox controversies. It is good to hear that my colleagues at UCL were, after all, apparently correct. Quite fortuitously, Dr. Wallstrom's advice does not affect the conclusions and interpretations which I make in this Personal History regarding the viability and immense advantages of inverse probability and improper priors in Bayesian Statistics.


In 1971, Vidyadhar Godambe and Mary Thompson read a paper to the Royal Statistical Society entitled ‘Bayes, fiducial, and frequency aspects of statistical inference in survey sampling’, which was a bit controversial in general terms.
When seconding the
vote of thanks following the authors’ presentation, Mervyn Stone
described their paper as ‘like a plum pudding, soggy at the edges
and burnt in the middle’.
Mervyn’s reasons for saying this would again appear to have been quite esoteric; he thought that Godambe’s sampling models were focussing on a model space of measure zero. Vidyadhar’s sampling assumptions seemed quite reasonable to me at the time, though his conclusions seemed a bit strange. Most models are approximations, anyway. I of course totally agree with Mervyn that the fiducial approach doesn’t hold water.
If you use improper vague
priors in situations where there is no relevant prior information,
then your prior to posterior analysis will not always be justifiable
by ‘probability calculus’. Nevertheless, if you proceed very
carefully, and look out for any contradictions in your posterior
distributions, then your conclusions will usually be quite
meaningful in a pragmatic sense and in the way they reflect the
unsmoothed information in the data. Therefore most of the
improper prior approaches renounced and denounced by Lindley and
Stone are still viable in a practical sense. They are indeed key and
absolutely necessary cornerstones of the ongoing Bayesian
literature, which can be far superior in nonlinear situations, for
small to moderate sample sizes, to whatever else the frequentists
and fiducialists serve up.
The somewhat
inquisitional Bayesian ‘High Church’ activities in London around
that time perhaps even hinted of a slight touch of megalomania.
In contrast, John Aitchison was using statistical methods with strong Bayesian elements to good practical effect in Glasgow. During the early 1970s he focussed on the area of clinical medicine, e.g. by devising diagnostic techniques, via an application of Bayes’ Theorem, which reduced the necessity for exploratory surgery. For example, in his 1977 paper in Applied Statistics with Dik Habbema and Jim Kay, the authors considered different diagnoses of Conn’s syndrome while critically comparing two methods for model discrimination. See also p. 232 of Aitchison’s 1975 book Statistical Prediction Analysis with Ian Dunsmore. Aitchison’s medical work during the 1970s and beyond is reported in his book Statistical Concepts and Applications in Clinical Medicine with Jim Kay and Ian Lauder.




John Aitchison F.R.S.E. 

Aitchison was motivated
by Wilfred Card of the Department of Medicine at the University of
Glasgow. In 1970, Jack Good and Wilfred Card coauthored ‘The
Diagnostic Process with Special Reference to Errors’ in Methods
of Information in Medicine.
John Aitchison is a
longtime Fellow of the Royal Society of Edinburgh, and a man of
science. He was awarded the Guy Medal in Silver by the Royal
Statistical Society of London in 1988, and probably deserved to
receive the second ever Bayesian Guy Medal in Gold. The first had
been awarded to Sir Harold Jeffreys in 1962.
In 1974, Peter Emerson
applied Bayes theorem and decision theory to the prevention of
thromboembolism in a paper in the Annals of Statistics and
coauthored ‘The Application of Decision Theory to the Prevention of
Deep Vein Thrombosis following Myocardial Infarction’ in the
Quarterly Journal of Medicine, with Derek Teather and
Antony Handley.
Furthermore, Robin Knill-Jones, of the Department of Community Medicine at the University of Glasgow, reported his computations via Bayes’ theorem of diagnostic probabilities for jaundice in the Journal of the Royal College of Physicians of London. In his 1975 article in Biometrika, John Anderson derived a logistic discrimination function via Bayes’ theorem and used it to classify individuals according to whether or not they were psychotic, based upon the outcomes of a psychological test.
In their 1973 JASA paper, Stephen Fienberg and Paul Holland proposed empirical Bayes alternatives to Jack Good’s 1965 hierarchical Bayes estimators for p multinomial cell probabilities. They devised a ratio-unbiased data-based estimate for Good’s flattening constant α, and proved that the corresponding shrinkage estimators for the cell probabilities possess outstanding mean squared error properties for large enough p. Fienberg and Holland show that, while the sample proportions are admissible with respect to squared error loss, they possess quite inferior risk properties on the interior of the parameter space when p is large, a seminal result.
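As a hedged sketch of the kind of shrinkage involved (the function name is mine, and Fienberg and Holland estimated the flattening constant α from the data rather than fixing it):

```python
def flattened_proportions(counts, alpha):
    """Shrink raw multinomial cell proportions toward the uniform
    distribution by adding a flattening constant alpha to each cell."""
    n = sum(counts)
    p = len(counts)
    return [(x + alpha) / (n + p * alpha) for x in counts]

probs = flattened_proportions([8, 1, 1], alpha=1.0)
# The estimates still sum to one, and the extreme raw proportion
# 8/10 is pulled toward 1/3: probs[0] = 9/13.
```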
Stephen Fienberg is Maurice Falk University Professor of Statistics and Social Sciences in the Department of Statistics, the Machine Learning Department, CyLab and iLab at Carnegie-Mellon University.




Steve Fienberg 

I first met Steve at a conference on categorical data analysis at the University of Newcastle in 1977. He is one of the calmest, most intensely prolific scientists I have ever met.
Stephen was President of ISBA in 1996-97 and of the IMS in 1998-99. A Canadian hockey player by birth, he is an elected Fellow of the National Academy of Sciences and four further national academies. During the time that he served as Vice-President of York University in Toronto, he was affectionately known as the ‘American President’.
Paul Holland
holds the Frederick M. Lord Chair in Measurement and Statistics at
the Educational Testing Service in Princeton, following an eminent
career at Berkeley and Harvard.
Meanwhile, Thomas Ferguson generalised the conjugate distribution for multinomial cell probabilities in his 1973 lead article in the Annals of Statistics by proposing a conjugate Dirichlet process to represent the statistician’s prior beliefs about a sampling distribution which is not constrained to belong to a particular parametric family. His posterior mean value function for the sampling c.d.f., given a random sample of n observations, updates the prior mean value function Λ in the light of the information in the data. It may be expressed in the succinct weighted average form GF + (1-G)Λ, where F denotes the empirical c.d.f., G = n/(n+α), Λ is the prior mean value function of F, and α is the single ‘prior sample size’, or a priori degree of belief in the prior estimate. The posterior covariance kernel of the sampling c.d.f. can be expressed quite simply in terms of α, Λ, and F.
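Ferguson’s posterior mean value function is simple enough to sketch in a few lines (a toy illustration, with names of my own choosing; the prior mean value function here is that of a uniform distribution on [0, 1]):

```python
def dp_posterior_mean_cdf(data, prior_cdf, alpha, x):
    """Posterior mean of the sampling c.d.f. under a Dirichlet process
    prior: the weighted average G*F(x) + (1-G)*Lambda(x), G = n/(n+alpha)."""
    n = len(data)
    g = n / (n + alpha)
    f_n = sum(1 for d in data if d <= x) / n  # empirical c.d.f. F at x
    return g * f_n + (1 - g) * prior_cdf(x)

# Prior mean value function Lambda of a uniform distribution on [0, 1]:
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
est = dp_posterior_mean_cdf([0.1, 0.2, 0.9], uniform_cdf, alpha=2.0, x=0.5)
# Here G = 3/5 and F(0.5) = 2/3, so the estimate is 0.6*(2/3) + 0.4*0.5
```

As α tends to zero the estimate collapses onto the empirical c.d.f., and as α grows it leans on the prior mean value function, in keeping with α’s role as a ‘prior sample size’.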
While Ferguson’s estimates shrink the sample c.d.f. towards the corresponding prior estimate of the c.d.f., they do not invoke any posterior smoothing of the type advocated by Good and Gaskins in their 1971 paper in Biometrika.
David Blackwell and James MacQueen proved that Dirichlet processes can be expressed as limits of Polya urn schemes, while Charles Antoniak generalised Ferguson’s methodology by assuming a prior mixing distribution for α and Λ. Blackwell and MacQueen’s result appeared in 1973, and Antoniak’s in 1974, both in the Annals of Statistics.
It’s not obvious how
well the corresponding hierarchical and empirical Bayes procedures
work in practice. In my 1996 article in Statistics and
Probability Letters about exchangeable sampling distributions,
I show that α is unidentifiable from the data when Λ is
fixed, unless there are ties in the data.
Dale Poirier’s 1973 Ph.D. thesis in Economics at the University of Wisconsin-Madison was entitled Applications of Spline Functions in Economics. Dale has published many Bayesian papers in econometrics since, and he is currently Professor of Economics at the University of California at Irvine.
Dale Poirier is listed in Who’s Who in Economics as one of the major economists between 1700 and 1998.
In their 2005 survey of Bayesian categorical data analysis, Alan Agresti and David Hitchcock regard my 1972 and 1973 papers in Biometrika, and my 1975 and 1977 papers in JRSSB and Biometrics, as initiating a logistic transformation / first-stage multivariate normal / hierarchical prior approach to the analysis of counted data, which evolved from the seminal Lindley-Smith 1972 developments regarding the linear statistical model.




Alan Agresti 

In my 1972 paper, I derived shrinkage estimates for several binomial probabilities by constructing an exchangeable prior distribution for the log-odds parameters. In my all-time-favourite 1973 paper, I smoothed the probabilities in a grouped histogram by assuming a first order autoregressive process for the (multinomial) multivariate logits at the first stage of my hierarchical prior distribution. This type of formulation (maybe a second order autoregressive prior process would have yielded somewhat more appealing posterior inferences) provides a more flexible prior covariance structure in comparison with the Dirichlet formulation employed by Jack Good, and by Fienberg and Holland.


My 1973 paper influenced the literature on semiparametric density estimation, and preceded much of the literature on multilevel modelling for Poisson counts. My histogram smoothing method effectively addressed the problem of filtering the log-means of a series of independent Poisson counts, since the conditional distribution of the counts, given their sum, is multinomial.
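The conditioning fact in the last sentence is easy to check numerically. The sketch below (plain Python; the function names are mine) verifies that independent Poisson counts, given their sum, have exactly the multinomial probabilities with cell probabilities proportional to the Poisson means.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def conditional_poisson_pmf(counts, lams):
    """P(counts | their sum) for independent Poisson counts with means lams."""
    n = sum(counts)
    joint = math.prod(poisson_pmf(k, l) for k, l in zip(counts, lams))
    return joint / poisson_pmf(n, sum(lams))  # the sum is Poisson(sum(lams))

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)
    return coef * math.prod(p ** k for p, k in zip(probs, counts))

lams, counts = [2.0, 3.0, 5.0], [1, 2, 4]
lhs = conditional_poisson_pmf(counts, lams)
rhs = multinomial_pmf(counts, [l / sum(lams) for l in lams])
# lhs and rhs agree to numerical precision
```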
In my 1975 JRSSB paper, I constructed a threefold prior exchangeability model for the logit-space row, column, and interaction effects in an r × s contingency table. However, when analysing Karl Pearson's 14 × 14 fathers-and-sons occupational mobility table, I added an extra set of interaction effects along the diagonal, and referred to a fourfold prior exchangeability model instead, with interesting practical results which appealed to Henry Daniels and Irwin Guttman, if nobody else.
My Biometrics 1977 paper is the best of the bunch in technical terms. In this not particularly well-cited article, I analysed the Bradley-Terry model for Paired Comparisons, again by reference to an exchangeable prior distribution. I used my shrinkage estimators to analyse Stephen Fienberg's 'passive and active display' squirrel monkey data, and they indicated a different order of dominance for the monkeys in the sample than suggested by the maximum likelihood estimates. In my practical example, I used a uniform prior for the first-stage prior variance of my transformed parameters, rather than selecting a small value for the degrees of freedom of an inverted chi-squared prior.
The Editor of Biometrics, Foster Cady, was very keen to publish the squirrel monkey data, though he seemed much less concerned about the bothersome details of the statistical theory. The United States Chess Federation once showed an interest in using similar procedures for ranking chess players, but nothing came of it.
Unfortunately, my posterior modal estimates in my 1972, 1973 and 1975 papers all suffer from a Lindley-Smith-style over-shrinkage problem when the first-stage prior smoothing parameters are left unspecified during the prior assessment.
This difficulty was resolved in my Biometrics paper, where approximations to the unconditional posterior means of the model parameters were proposed, following suggestions in Ch. 2 of my 1973 University of London Ph.D. thesis (suggestions that also hold for more general hierarchical models). A numerical example is discussed in Appendix A. See Chs. 5 and 6 of [15] for some more precise techniques. Nowadays, the unconditional posterior means and marginal densities of any of the model parameters in my early papers can be computed exactly using modern Bayesian simulation methods.
In her 1978 JRSSB paper, Nan Laird, a lovely Scottish-American biostatistician at Harvard University, reported her analysis of a special case of the contingency table model in my 1975 paper, which just assumed exchangeability of the interaction effects. Nan used the EM algorithm to empirically estimate her single prior parameter by an approximate marginal posterior mode procedure. The EM algorithm may indeed be used to empirically estimate the prior parameters in any hierarchical model where the first stage of the prior distribution of an appropriate transformation of the model parameters is taken to be multivariate normal with some special covariance structure. Such models include those which incorporate a nonlinear regression function, and also time series models. Many of these generalisations were discussed in my 1973 Ph.D. thesis. If the observations are Poisson distributed, then I use my prior structure for the logs of their means.
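The general recipe can be made concrete with the simplest normal-normal sketch, a hypothetical toy of my own construction rather than Nan's procedure: the E-step computes posterior moments of the group effects given the current prior variance `tau2`, and the M-step updates `tau2` from those moments.

```python
def em_prior_variance(y, v, n_iter=200):
    """EM-style estimation of tau2 in the two-stage model
    y_i ~ N(theta_i, v_i), theta_i ~ N(mu, tau2), with mu
    estimated by the overall mean (a deliberate simplification).
    """
    mu = sum(y) / len(y)
    tau2 = 1.0  # starting value
    for _ in range(n_iter):
        # E-step: posterior variance and mean of each theta_i given tau2
        post_var = [1.0 / (1.0 / vi + 1.0 / tau2) for vi in v]
        post_mean = [pv * (yi / vi + mu / tau2)
                     for yi, vi, pv in zip(y, v, post_var)]
        # M-step: tau2 <- average of E[(theta_i - mu)^2 | data]
        tau2 = sum(pv + (pm - mu) ** 2
                   for pv, pm in zip(post_var, post_mean)) / len(y)
    return tau2
```

With y = [0, 1, 5] and unit sampling variances the iteration settles at tau2 of roughly 3.67, between zero shrinkage and the raw spread of the group means. The same alternation carries over when the first stage sits on logits or Poisson log-means rather than the observations themselves.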




Nan
Laird 

The techniques I
devised for Bayesian categorical data analysis during the 1970s have
been applied and extended by numerous authors including Arnold Zellner and Peter Rossi, Rob Kass and Duane Steffey, Jim Albert,
William Nazaret, Matthew Knuiman and Terry Speed, John Aitchison,
Costas Goutis, Jon Forster and Allan Skene, and by Michael West and
Jeff Harrison to time series analysis and forecasting. Jim Hickman
and Bob Miller extended my histogram smoothing ideas to the
graduation of mortality tables and to bivariate forecasting in
actuarial science. Many of the models incorporated into WinBUGS and
INLA can be regarded (at least by me!) as special cases or
extensions of my formulations for binomial and multinomial logits
and Poisson log-means.
Matthew Knuiman is
Winthrop Professor of Public Health at the University of Western
Australia where he teaches and researches biostatistics and
methodology, sometimes from a Bayesian viewpoint. The diversity and
quality of his applications in medicine are tremendously impressive.
William Nazaret, who
extended my methodology, in Biometrika in 1987, to the
analysis of threeway tables, with a bit of guidance from myself,
has a Ph.D. in Statistical Computing from Berkeley. After a successful subsequent career at AT&T Bell Labs, he became CEO of Intelligense in the U.S., then CEO for Digicel in El Salvador, and finally CEO for Maxcom Telecommunications S.A.B. de C.V., for whom he now serves as an external advisor. The life of a
Venezuelan polymath on a multiway table must be very interesting
indeed, particularly when he’s a fan of shrinkage estimators.
The British General Elections in February and October 1974 were both a very close call. In the first election, the Labour Party won 4 more seats than the Tories, but were 17 short of an overall majority. The Tory Prime Minister, the celebrated puff-chested yachtsman Ted Heath, remained stoically in Downing Street while trying to form a coalition with the Liberals. When the Liberal leader Jeremy Thorpe (later to go for an early bath after allegedly conspiring to kill his lover's dog) laughed in Heath's face, Harold Wilson (a future cigar-puffing President of the Royal Statistical Society) flourished his working-class pipe and returned to Downing Street as Labour Prime Minister. Then, in October 1974, Labour achieved an overall majority of 3 seats.
Wilson was an advocate
of shortterm forecasting but wasn’t obviously Bayesian, though he
had a strong sense of prior belief.
The BBC were ably assisted during the 1974 Election Nights by Phil Brown and Clive Payne, who used a variation of Bayesian ridge regression (see Lindley and Smith [57]) to forecast the final majorities. Brown and Payne reported their at-times-amusing experiences in their 1975 invited paper to the Royal Statistical Society. They went on to predict the outcomes during several further General Election nights, becoming entwined with British political history.




Philip
Brown 

In 1974, Abimbola Sylvester
Young, from Sierra Leone, was awarded his Ph.D. at UCL. The title of
his splendid Ph.D. thesis was Prediction analysis with some types
of regression function, and he was supervised by Dennis Lindley.
After an eminent career,
working with the Governments of Uganda and Malawi and various
international food, agriculture, and labour organisations, Abimbola
is currently a consultant statistician with the African Development
Bank. Jane Galbraith, when a tutor at UCL, reportedly decided that
he should be called Sylvester, because she could not pronounce
Abimbola.
Masanao Aoki's monumental 1975 book Optimal Control and System Theory in Dynamic Economic Analysis follows earlier developments by Karl Johan Åström in Introduction to Stochastic Control Theory. Aoki follows a Bayesian decision-theoretic approach and develops a backwards-and-forwards updating procedure for determining the optimal way of controlling dynamic systems. In so doing, he extended the Kalman-Bucy approach to Bayesian forecasting, which was pursued by Harrison and Stevens in their 1976 paper in JRSSB, and later by West and Harrison in their 1997 book Bayesian Forecasting and Dynamic Models. The books on Markov decision processes by the eminent American industrial engineer Sheldon Ross are also relevant to this general area. See Markov Decision Processes by Roger Hartley for an excellent review and various extensions.
Also in 1975, the economists Gary Chamberlain and Edward Leamer published a conceptually important article in JRSSB about the matrix weighted averages which are used to compute posterior mean vectors in the linear statistical model in terms of a least squares vector and a prior mean vector. Noting that the components of the posterior mean vector are not typically expressible as simple weighted averages of the corresponding least squares estimates and prior means, the authors used geometric arguments and multidimensional ellipsoids to compute bounds on the posterior means that illustrate this phenomenon further. The posterior smoothing implied by a nontrivial prior covariance matrix can be even more complex than indicated by these bounds, and sometimes initially quite counterintuitive.
Still in 1975, Bruce Hill of the University of Michigan
proposed a simple general approach in the Annals of Statistics
to inference about the tail of a sampling distribution. His results
were particularly simple for population distributions with thick
right tail behaviour of the Zipf type. The inferences were based on
the conditional likelihood of the parameters describing the tail
behaviour, given the observed values of the extreme order
statistics, and could be implemented from either a Bayesian or a
frequentist viewpoint. This followed a series of fascinating papers
by Hill concerning Zipf’s law and its important applications.
Norman Draper and Irwin Guttman published their Bayesian approach to two simultaneous measurement procedures in JASA in 1975. Known as 'Mr. Regression' to his students, Norman Draper has also published several important Bayesian papers. He was one of George Box's long-term confidants and co-authors at the University of Wisconsin-Madison, where he is now a much-admired Emeritus Professor of Statistics.
In 1976, Peter Freeman read his paper 'A Bayesian Analysis of the Megalithic Yard' to a full meeting of the Royal Statistical Society. His analysis of Alexander Thom's prehistoric stone circle data was well-received. As Thom was both a Scot and an engineer, he always meticulously recorded his archaeological measurements. However, it has always been debated whether or not the megalithic yard is a self-imagined artefact. Maybe the Stone Age men just paced out their measurements with their feet.
Even though he was an
accomplished Bayesian researcher and lecturer, Peter Freeman was given an
unfairly hard time by his senior colleagues at UCL during the late
1970s. Maybe he was too practical. In 1977, the University
administrators even refused to let him open the skylight in his
office, a key event in the history of Bayesian Statistics which
preceded Dennis Lindley’s early retirement shortly afterwards. When
Peter fled to Leicester to take up his duties as Chair of Statistics
in that forlorn neck of the woods, he discovered that Jack Good, a
previous candidate for the Chair, had made far too many eccentric
demands, including the way his personal secretary should be
shortlisted and interviewed by the administrators ‘specially for
him’. Peter is currently a highly successful statistical consultant
who works out of his cottage on the Lizard peninsula on the southern
tip of Cornwall with his second wife.
According to George Box, Sir Ronald Fisher (who'd dabbled with Bayesianism in his youth) always hated the obstructive administrators at UCL, together with the 'beefeaters' patrolling the Malet Street quadrangle. When one of the beefeaters manhandled a female colleague while she was trying to clamber through a window, Sir Ronald beat the rude fellow with his walking stick for 'not behaving like a gentleman.' 


Also in 1976, Richard Duda, Peter Hart and Nils Nilsson, of the Artificial Intelligence Center at the Stanford Research Institute, investigated existing inference systems in Artificial Intelligence when the evidence is uncertain or incomplete. In 'Subjective Bayesian Methods for Rule-Based Inference Systems', published in a technical report funded by the U.S. Department of Defense, they proposed a Bayesian inference method that realizes some of the advantages of both formal and informal approaches.
In 1977, Phil Dawid and
Jim Dickey published a pioneering paper in JASA concerning
the key issue of selective reporting. If you only report your best
practical results then this will inflate their apparent practical
(e.g. clinical) significance. The authors accommodated this problem
by proposing some novel likelihood and Bayesian inferential
procedures.
Jim Dickey is currently
an Emeritus Professor of Statistics at the University of Minnesota.
He has made many fine contributions to Bayesian Statistics, and his
areas of interest have included sharp null hypothesis testing,
scientific reporting, and special mathematical functions for
Bayesian inference. He has worked with other leading Bayesians at
Yale University, SUNY at Buffalo, University College London, and the
University College of Wales in Aberystwyth, and he is a fountain of
knowledge. I remember him giving a seminar about scientific
reporting at the University of Iowa in 1971. Bob Hogg, while at his
humorous best, got quite perplexed when Jim reported his posterior
inferences for an infinitely wide range of choices of his prior
parameter values.




James Mills Dickey 

Jim Smith's 1977 University of Warwick Ph.D. thesis was entitled Problems in Bayesian Statistics relating to discontinuous phenomena, catastrophe theory and forecasting. In this highly creative dissertation, Jim drew together Bayesian decision theory, the quite revolutionary ideas about discontinuous change expressed in the deep mathematics of Catastrophe theory by the French Fields Medal winner René Thom and at Warwick by Sir Christopher Zeeman, and the Bayesian justifications and extensions of the Kalman Filter which were reported by Harrison and Stevens in their 1976 paper in JRSSB. 



Jim Smith 

Discontinuous change can be brought about by the effects of one or more gradually changing variables, e.g. during the evolutionary process or during conversations which lead to a spat. Unfortunately, Thom's remarkable branch of bifurcation theory, which invokes cusp, swallowtail and butterfly catastrophes on mathematical manifolds, depends upon a Taylor-series-type approximation which is only locally accurate.
Jim's more precise results provided forecasters with considerable insights into the behaviour of time series which are subject to discontinuous change, sometimes because of the intervention of outside events. Discontinuities in the behaviour of time series can also be analysed using Intervention Analysis. See the 1975 book on this topic by Box and Tiao.
While still a postgraduate student, Jim proved Smith's Lemma ([15], p. 158), which tells us that the Bayes decision under a certain class of symmetric bounded loss functions must fall in a region determined by the decisions which are Bayes against some symmetric step loss function. This delightful lemma should be incorporated into any graduate course on Bayesian decision theory.
Jim has been a very 'coherent' Bayesian throughout his subsequent highly accomplished career. He was a lecturer at UCL during the late 1970s, before returning for a lifetime of imaginative rationality and sabre-rattling at the University of Warwick.
In their 1976, 1977 and 1978
papers in JASA and the Annals of Statistics,
Vyaghreshwarudu Susarla and John Van Ryzin reported their
influential empirical Bayes approaches to multiple decision problems
and to the estimation of survivor functions. Their brand of
empirical Bayes was proposed by Herbert Robbins in 1956 in the
Proceedings of the Third Berkeley Symposium, and should nowadays
be regarded as part of the applied Bayesian paradigm since it is
very similar in applied terms to semiparametric hierarchical Bayes
procedures which employ marginal posterior modes to estimate the
hyperparameters. When he visited us to give a seminar at the
University of Warwick during the early 1970s, Robbins, who always
had a chip on his shoulder, scrawled graffiti on the walls of
Kenilworth Castle.
In his 1978 paper in the Annals of Statistics, Malcolm Hudson of Macquarie University showed how extensions of Stein's method of integration by parts can be used to compute frequency risk properties for Bayes-Stein estimators of the parameters in multiparameter exponential families. His methodology led to a plethora of theoretical suggestions in the literature, e.g. for the simultaneous estimation of several Poisson means, many of which defy practical common sense.
In 1978, Irwin Guttman, Rudolf Dutter and Peter Freeman proposed a Bayesian procedure in JASA for detecting outliers. In the context of the linear statistical model, they assumed a priori that any given set of k observations has a uniform probability of being spurious, and this led to very convincing posterior inferences. Their approach provided an alternative to the earlier Box-Tiao methodology in Box and Tiao's 1973 book, and contrasted with the conurbation of ad hoc procedures proposed by Vic Barnett and Toby Lewis in their 1978 magnum opus. It should of course be emphasised that outliers can provide extremely meaningful scientific information if you stop to wonder how they got there. They should certainly not be automatically discarded, since this may unfairly inflate the practical significance of your conclusions.
Michael West [63] proposed a later Bayesian solution to the
outlier problem. One wonders, however, whether outliers should only
be considered during an informed preliminary exploratory data
analysis and a later analysis of residuals.
In 1979, Tony O'Hagan introduced the terms outlier prone and outlier resistant, in a dry though profound and far-reaching paper in JRSSB, when considering the tail behaviour of choices of sampling distribution which might lead to posterior inferences that are robust in the presence of outliers. He showed, for example, that the exponential power distributions proposed by Box and Tiao when pioneering Bayesian robustness in their 1973 book are in fact outlier prone, and do not therefore lead to sufficiently robust inferences.
A number of later approaches to Bayesian robustness were reviewed and reported in 1996 in Robustness Analysis. This set of lecture notes was co-edited by Jim Berger, Fabrizio Ruggeri, and five further robust statisticians. 



Fabrizio Ruggeri (ISBA President 2012) 

EXPLANATORY NOTES: Laplace double exponential distributions and finite mixtures of normal distributions are OUTLIER PRONE even though they possess thicker tails than the normal distribution. They are therefore not appropriately robust to the behaviour of outliers when employed as sampling distributions. The generalised t-distribution is however OUTLIER RESISTANT. Nevertheless, I would often prefer to assume that the sampling distribution is normal, after using a preliminary informative data analysis to either substantiate or discard each of the outliers, or maybe to use a similarly outlier-prone skewed normal distribution. A (symmetric) generalised t-distribution is awkward to use if there is an outlier in one of its tails, and a skewed generalised t-distribution depends upon four parameters which are sometimes difficult to efficiently identify from the data.
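A tiny numerical experiment makes the distinction concrete. This is a hypothetical sketch under a flat prior on a location parameter, not taken from any of the papers above: with one gross outlier in the sample, the posterior mean under a normal sampling distribution chases the outlier, while under a Student t sampling distribution (an outlier-resistant choice) it stays near the bulk of the data.

```python
import math

def posterior_mean(data, loglik, grid):
    """Posterior mean of a location parameter under a flat prior,
    computed by simple numerical integration over a grid."""
    logs = [sum(loglik(x - m) for x in data) for m in grid]
    mx = max(logs)  # subtract the max to stabilise the exponentials
    w = [math.exp(l - mx) for l in logs]
    return sum(wi * m for wi, m in zip(w, grid)) / sum(w)

data = [0.1, -0.2, 0.3, 0.0, 8.0]              # one gross outlier
grid = [i * 0.01 for i in range(-300, 1201)]   # location values -3.00 .. 12.00
normal_mean = posterior_mean(data, lambda e: -0.5 * e * e, grid)
student_mean = posterior_mean(
    data, lambda e: -2.0 * math.log(1.0 + e * e / 3.0), grid)  # t with 3 d.f.
```

Under the normal likelihood the posterior mean is the sample mean, 1.64, dragged well away from the four small observations; under the t likelihood the outlier is automatically downweighted and the posterior mean stays close to them.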
These issues are also relevant to the choices of prior distributions, e.g. the choices of conditional prior distributions under composite alternative hypotheses when formulating Bayes factors. They influence the viability of Bayesian procedures which have been applied to legal cases and genomics.
Jim Berger was an excellent
after dinner speaker at Bayesian conferences (he once teased Arnold
Zellner to the point where Arnold’s lovely wife Agnes wondered where
her husband had been the night before) and was only rivalled for his
wit and humour by Adrian Smith.
In 1978, I published, as an extension of my 1973 histogram smoothing method, possibly the first ever nonlinear prior informative method [64] for smoothing the common density f = f(t) of a sample of n independent observations, for values of t falling in a bounded interval (a,b). I did this by reference to a logistic density transform g = g(t) of the density. From a Bayesian perspective, the prior information about g might be representable by a Gaussian process for g on (a,b) with mean value function μ = μ(t) and covariance kernel K = K(s,t), which are taken to be specified for all values of s and t in (a,b). For example, a second-order autoregressive process for g can be constructed by taking its derivative g′ to possess the covariance structure of an Ornstein-Uhlenbeck stochastic process. Under such assumptions, it is technically difficult to complete a Bayesian analysis because the Radon-Nikodym derivative of the prior process is heavily dependent on the choice of dominating measure. For example, the function maximising the posterior Radon-Nikodym derivative of g will also depend upon the choice of dominating measure.
I therefore adopted a prior likelihood approach instead, by taking μ = μ(t) to represent the observed sample path, for t in (a,b), of a Gaussian process with mean value function g = g(t) and covariance kernel K. Under this assumption, the posterior likelihood of g, given the 'prior sample path' μ, may be maximised numerically, via the Calculus of Variations, yielding a posterior maximum likelihood estimate for g which is free from a choice of measure. In my paper, I completed this task, assuming the special covariance structure indicated above.
The prior log-likelihood functional of g can be expressed in terms of a roughness penalty involving integrals of the squares of the first and second derivatives of g − μ. Maximizing the posterior log-likelihood functional of g using the Calculus of Variations yields a fourth-order nonlinear differential equation, which I converted into a nonlinear Fredholm integral equation for g. I used an iterative technique to solve this equation, thus providing twice-differentiable estimates which depended upon the choices of μ and two prior smoothing parameters.
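In modern notation, the structure just described can be sketched as follows; this is a schematic reconstruction consistent with the verbal description, with λ₁ and λ₂ denoting the two smoothing parameters, rather than the paper's exact expression. The logistic density transform and the penalised log-likelihood functional in g are

```latex
% Schematic reconstruction, not the paper's exact notation.
\[
  f(t) = \frac{e^{g(t)}}{\int_a^b e^{g(s)}\,ds},
  \qquad
  L(g) = \sum_{i=1}^{n} g(t_i) - n \log \int_a^b e^{g(t)}\,dt
         - \tfrac{1}{2} \int_a^b
           \bigl[ \lambda_1 \{g'(t)-\mu'(t)\}^2
                + \lambda_2 \{g''(t)-\mu''(t)\}^2 \bigr]\,dt .
\]
```

Setting the first variation of L(g) to zero is what produces the fourth-order nonlinear differential equation mentioned above, the λ₂ term contributing the fourth derivative.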
With a small modification, the same methodology can be used to smooth the log of the intensity function of a non-homogeneous Poisson process. My approach bore similarities to the procedures that had been employed by Richard Bellman and others to solve boundary value problems in mathematical physics.
When proposing the
vote of thanks during the December 1977 meeting of the Royal
Statistical Society, Peter Whittle showed that my estimates were
also ‘limiting Bayes’, in the sense that they could also be derived
by limiting arguments from nonlinear equations for posterior mode
vectors which solved a discretized version of the problem, and he
generalised my nonlinear Fredholm equation to depend upon an
unrestricted choice of the prior mean value function μ and
covariance kernel K.




Peter
Whittle F.R.S. 

I had previously discarded this general approach because the
posterior mode functional is not uniquely defined in the limit.
Nevertheless, my special-case equations yielded the first ever
effectively Bayesian methodology for the nonlinear smoothing of a
sampling density. This was over twenty years after Peter Whittle
published his linear Bayes approach to this problem.
When seconding the
vote of thanks, Bernie Silverman contrasted my approach with his
density estimation/roughness penalty methodology for investigating
cot death which he published that year in Applied Statistics.




Sir
Bernard Silverman F.R.S. 

In 1977, Clevenson and Zidek had published a linear Bayes procedure in JASA for smoothing the intensity function of a non-homogeneous Poisson process, which didn't look that different from Whittle's 1958 density estimation method. If the arrival times in a Poisson process during a bounded time interval (a,b) are conditioned on the number of arrivals n, then they assume the probability structure of the order statistics from a distribution whose density appropriately normalises the intensity function of the Poisson process. The two smoothing problems are therefore effectively equivalent.
The hierarchical distributional assumptions in my R.S.S. paper introduced log-Gaussian doubly stochastic Poisson processes into the literature. David Cox showed a keen interest at the time, e.g. during one of his University of London seminars, and he saved my bacon during my application for tenure at the University of Wisconsin in 1979, when there was an unfair attempt to 'destroy me' emanating from the British Bayesian establishment after some shenanigans by an untenured UW assistant professor.
Peter Lenk of the University of Michigan in Ann Arbor has published various approximate fully Bayesian developments of my density estimation methodology, including the material in his 1985 Savage prize-winning Ph.D. thesis. Daniel Thorburn reported his approach to the problem, but with similar prior assumptions to my own, in his 1984 monograph A Bayesian Approach to Density Estimation. 



Peter Lenk 

In 1993 Richard Silver, Harry Martz and Timothy Wallstrom of the Los Alamos National Laboratory in New Mexico described another Bayesian approach [32] which refers to the theories of Quantum Physics. See also Timothy Wallstrom's 1993 book Quantum Statistical Inference. By approximating the logistic density transform by a linear combination of p specified functions, Silver et al. constrain the sampling distribution to belong to the p-parameter exponential family.
Chong Gu has extended
various special cases of my roughness penalty formulation by
reference to the theory of nonlinear smoothing splines and has even
shown that the computations are feasible in the multivariate case,
see [65]. Dennis Cox and Finbarr O’Sullivan [66] address the
asymptotic properties of these procedures.
In hindsight, it would have been preferable, in terms of both mathematical rigour and posterior smoothing, to take the prior covariance kernel K to adopt the infinitely differentiable exponential quadratic form that was assumed by Tony O'Hagan for an unknown regression function in his 1978 JRSSB paper.
I did this a decade or
two later with John Hsu in our 1997 paper in Biometrika,
which concerned the semiparametric smoothing of multidimensional
logistic regression functions and an analysis of the ‘mouse
exposure’ data. That was the paper which confirmed John’s tenure at
UCSB. He developed the hierarchical Bayes bit using some neat
algebraic tricks.
In 2012, Jaakko Riihimäki and Aki Vehtari managed to do the same thing for bivariate density estimation in their paper 'Laplace Approximations for Logistic Gaussian Process Density Estimation and Regression' in arXiv e-prints. The authors employed very accurate conditional Laplacian approximations on a grid, which they validated using MCMC. Their generalisation of my 1978 method is the current state of the art in Bayesian semiparametric univariate and bivariate density estimation. Congratulations to the authors, who are also in the machine learning business.
Also in 1978, the Dutch economists Teunis Kloek and Herman Van Dijk used Importance Sampling simulations in a seminal paper in Econometrica to calculate the Bayes estimates for the parameters in various equation systems. Importance sampling is a very efficient variation of Monte Carlo which converges to the correct solution with a bounded standard error of simulation, provided that the importance function is chosen appropriately, e.g. with tails at least as thick as those of the posterior density, so that the importance weights possess finite variance. It was used during the 1980s to calculate the Bayesian solutions to many multiparameter problems which were algebraically intractable. See the eminent Israeli economist Reuven Rubinstein's 1981 book Simulation and the Monte Carlo Method. For further applications in Economics see, for example, the 1988 paper by John Geweke in the Journal of Econometrics.
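The mechanics can be sketched in a few lines; this is a generic self-normalised importance sampler with hypothetical names, not the Kloek-Van Dijk implementation. The proposal's tails must dominate the posterior's for the weights to behave well.

```python
import math
import random

def importance_sampling_mean(log_post, draw, proposal_logpdf, n=20000, seed=0):
    """Estimate a posterior mean E[theta | data] by self-normalised
    importance sampling from an unnormalised log posterior."""
    rng = random.Random(seed)
    thetas = [draw(rng) for _ in range(n)]
    logw = [log_post(t) - proposal_logpdf(t) for t in thetas]
    m = max(logw)  # subtract the max to stabilise the exponentials
    w = [math.exp(lw - m) for lw in logw]
    return sum(wi * ti for wi, ti in zip(w, thetas)) / sum(w)

# Toy check: an N(1, 1) "posterior" with a deliberately wide N(0, 9) proposal
est = importance_sampling_mean(
    log_post=lambda t: -0.5 * (t - 1.0) ** 2,
    draw=lambda rng: rng.gauss(0.0, 3.0),
    proposal_logpdf=lambda t: -0.5 * (t / 3.0) ** 2 - math.log(3.0),
)
```

In practice a thick-tailed multivariate Student t proposal centred near the posterior mode plays the role of the wide normal here, for exactly the tail-domination reason noted above.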




Herman Van Dijk 

During the 1990s, Importance Sampling was largely superseded by Markov Chain Monte Carlo (MCMC), despite the notorious convergence difficulties experienced by MCMC's users and sufferers. When applying Importance Sampling, simulations from a generalised multivariate t-distribution with appropriately thick tails will produce excellent convergence when addressing a wide range of nonlinear models.
In 1979, Jose Bernardo of
the University of Valencia published two pioneering papers about
vague ‘reference priors’, which maximise Lindley’s measure of the
amount of information in the statistical experiment. In his
JRSSB invited paper, Jose demonstrated the importance of the
corresponding reference posteriors in Bayesian inference. In
his Annals of Statistics paper, he expressed Lindley’s
measure in terms of Kullback-Leibler divergence, and thereby
justified the measure in terms of a type of expected utility.
Jose’s 1979 papers
spawned a large literature on reference priors, including several
joint articles with Jim Berger. These procedures can, of course,
only be entertained as objective if the choice of sampling model is
itself objectively justifiable.
While the derivations
of reference priors can amount to separate, quite tedious, research
problems when analysing some sampling models, these vague priors do
lead to quite convincing alternatives to Jeffreys invariance priors.
They usually lead to similarly sensible posterior inferences.
Also in 1979, Chan-Fu Chen of SUNY at Albany published his hierarchical Bayesian procedure for the estimation of a multivariate normal covariance matrix in JRSSB. He employed a continuous mixture of inverted Wishart distributions in the prior assessment, and his hierarchical Bayes estimators shrunk the sample covariance matrix towards a matrix assuming intraclass form. Chan-Fu applied his method to stochastic multiple regression analysis, to excellent effect.
An event of immense
significance in the history of Bayesian Statistics took place as the
decade drew to a close, organized by Jose Bernardo (the local
coordinator), Morris De Groot, Dennis Lindley and Adrian Smith
under the auspices of the University of Valencia and the Spanish
Government. The event took place during May to June 1979 and was
subsequently named the First Valencia International Meeting of
Bayesian Statistics. Most of the leading Bayesians in the world met
in the Hotel Las Fuentes in Alcossebre, a beautiful resort with cold
water springs on the beach, on the Spanish Mediterranean Coast
between Valencia and Barcelona. There ensued a learned discourse
between some of the best minds in the discipline, intermingled with
swimming, partying, and romps on the beach. All in all, a jamboree
to be reckoned with!




The beach at Alcossebre 

The Proceedings of
Valencia 1 were published by the four organizers, and the
invited speakers included Hirotugu Akaike, Art Dempster, George
Barnard, I. Richard Savage (the brother of the late Jimmie Savage),
George Box, James Press, Seymour Geisser, Jack Good, and Arnold
Zellner.
I. Richard Savage was an impressive, accomplished, and all-inclusive man. His research interests at Yale included public policy and Bayesian statistics.








I. Richard
Savage 

Leonard 'Jimmie' Savage 


Twenty-five papers, prepared by a total of thirty-five authors and co-authors, were presented at Valencia 1 and discussed in the following fourteen sessions:
1. Foundations of Subjective Probability and Decision Making
2. Sequential learning,
discontinuities and change
3. Likelihood, sufficiency
and ancillarity
4. Approximations
5. Regression and time
series
6. Bayesian and nonBayesian
conditional inference
7. Personal and
interpersonal ethics
8. Sensitivity to models
9. Improving judgements
using feedback
10. Predictive sample reuse
11. Beliefs about beliefs
12. Bayesian nonparametric
theory
13. Coherence of models and
utilities
14. Hypothesis Testing
The highlights included (1) Bruce Hill's presentation of his paper 'On finite additivity, non-conglomerability, and statistical paradoxes', when he invited anybody who didn't agree with his axiomatics to explain how it felt to be a 'sure loser'; (2) Stephen Fienberg's trivialisation and demolition of Jeff Harrison's convoluted exposition about 'Discontinuity, decision and conflict'; (3) Morris De Groot's paper about improving predictive distributions; and (4) Dennis Lindley's paper on Laplacian approximations to posterior densities, which preceded the much more numerically accurate conditional Laplacian approximations which were to be developed by other Bayesians during the 1980s and synthesised into computer packages in the context of hierarchical modelling during the early part of the twenty-first century as a valid alternative to MCMC. 



George E. P. Box 

George Box and Herbert Solomon provided us with the most memorable moment of the entire conference during the farewell dinner when they sang 'There's No Theorem Like Bayes Theorem' to the tune of 'There's No Business Like Show Business'. Here is one of their refrains: 

There’s no theorem
like Bayes theorem
Like no theorem we
know
Just recall what
Pearson said to Neyman
Emerging from a
region of type B
“It’s difficult
explaining it to Lehmann
I fear it lacks
Bayes’ simplicity
There’s no haters
like Bayes haters
They spit when they
see a prior
Be careful when you
offer your posterior
They’ll try to kick
it through the door
But turn the other
cheek if it is not too sore
Of error they may
tire 



I became so carried away by
the festivities that I downed several shots of Cointreau, only to
puke up my seafood an hour or so later, all over the beach.
The Valencia
International Meetings have beneficially influenced Bayesian
Statistics ever since. The ninth in Jose Bernardo’s truly
magnificent series was held in Alicante, Spain during June 2010. 



Bayesians at Play (Archived from Brad Carlin’s Collection) 

