A PERSONAL HISTORY OF BAYESIAN STATISTICS

 

Thomas Hoskyns Leonard
 
Retired Professor of Statistics, Universities of Wisconsin-Madison and Edinburgh
 

4: THE PRAGMATIC SEVENTIES (CONTINUED)

 

Let us be evidence-based, though not spuriously evidence-based

 

 

Tom Leonard (University of Warwick, 1976)   Tom's father, Captain Cecil Leonard (1913-2001)
 

The first part of this chapter appears in the John Wiley on-line library

 

Until 1972, much of the Bayesian literature was devoted to procedures which either emulated the frequency approach or incorporated prior information via conjugate prior distributions, where the prior information could instead have been represented by ‘prior likelihoods’ which required the specification of ‘prior observations and sample sizes’. See Antony Edwards’ 1972 book Likelihood. Consequently, the practical advantages of the Bayesian paradigm had not so far been fully exploited.

    Dennis Lindley’s and Adrian Smith’s widely cited landmark paper ‘Bayes Estimates for the Linear Model’ [57] seemed to change all that in 1972. In the context of M-group regression, they assume between-group De Finetti-style prior exchangeability of the M vectors of regression parameters, in the homogeneous variance situation where the sampling variances of all the observations in the groups are assumed to be equal. Their shrinkage estimators for the vectors of regression coefficients are therefore special cases of the procedures described in a previously unpublished technical report by Lindley, which was based upon work which he completed at the American College Testing Program (A.C.T.) in Iowa City. In this A.C.T. report, he addressed the heterogeneous variance situation by assuming that the M unequal variances corresponding to the different regressions were also a priori exchangeable.

    The empirical validation so famously reported in the Lindley-Smith paper was based upon computations completed by Melvin Novick, Paul Jackson, Dorothy Thayer, and Nancy Cole at A.C.T. and reported in the British Journal of Mathematical and Statistical Psychology (1972). However, their validation related to the more general methodology developed in Lindley’s A.C.T. technical report rather than to the special homogeneous variance case described in the Lindley-Smith paper.


According to my long-term memory, and I suppose that my brain cells could have blown a few gaskets over the years, Paul Jackson, a lecturer at the University College of Wales in Aberystwyth, advised me at A.C.T. in June 1971, with a smile and a wink, that he’d handled a serious over-shrinkage problem relating to Lindley’s estimates (see below) by setting ‘an arbitrarily small’ degrees of freedom hyperparameter to be equal to a small-enough-looking value which was however large enough to ensure that the ‘collapsing phenomenon’ became much less evident.

    As I remember, Paul Jackson’s choice of hyperparameter ensured that the posterior estimates of the M sampling variances were substantially different, and this in turn ensured that the quasi-Bayesian estimates for the M regression vectors were also comfortingly different. It was a new experience for me, if I’ve remembered it correctly, to see how the more ingenious of our applied statisticians can handle awkward situations with a flick of their fingers.


    Lindley and Smith described their hierarchical prior distribution in the following two stages:

Stage 1: The vectors of regression parameters constitute a random sample from a multivariate normal distribution with unspecified mean vector μ and covariance matrix C. The common sampling variance V possesses a rescaled inverted chi-squared distribution with specified degrees of freedom.

Stage 2: The first stage mean vector μ and covariance matrix C are independent. The mean vector μ is uniformly distributed over p-dimensional Euclidean space. The covariance matrix C possesses an inverted Wishart distribution with specified degrees of freedom and p×p mean matrix.

    The posterior mean vectors of the regression vectors, conditional on V and C, can then be expressed as matrix weighted averages of the individual least squares vectors and a combined vector, which can itself be expressed as a matrix weighted average of these least squares vectors. Lindley and Smith then make the conceptual error of estimating V and C via the joint posterior modes of V, C, μ, and the regression parameters. These are not, of course, either Bayes or generalised Bayes estimates.
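    The matrix weighted average form can be illustrated numerically. The following Python sketch is not taken from the Lindley-Smith paper; it simply applies standard normal-theory results under the two-stage assumptions stated above, with V and C treated as known and μ given a uniform prior. The function name, argument names, and simulated data are hypothetical.

```python
import numpy as np

def conditional_posterior_means(X_list, y_list, V, C):
    """Posterior means of the M regression vectors, conditional on V and C.

    Assumes (as a sketch) y_j ~ N(X_j beta_j, V I), beta_j ~ N(mu, C),
    and a uniform second-stage prior for mu.
    """
    C_inv = np.linalg.inv(C)
    beta_hats, precisions, weights = [], [], []
    for X, y in zip(X_list, y_list):
        XtX = X.T @ X
        beta_hat = np.linalg.solve(XtX, X.T @ y)   # least squares vector for group j
        beta_hats.append(beta_hat)
        precisions.append(XtX / V)                 # sampling precision of beta_hat
        # Marginally, beta_hat_j ~ N(mu, C + V (X_j'X_j)^{-1}); invert for the weight.
        weights.append(np.linalg.inv(C + V * np.linalg.inv(XtX)))
    # Combined vector: matrix weighted average of the least squares vectors.
    mu_star = np.linalg.solve(sum(weights),
                              sum(W @ b for W, b in zip(weights, beta_hats)))
    # Each posterior mean: matrix weighted average of beta_hat_j and mu_star.
    post = [np.linalg.solve(P + C_inv, P @ b + C_inv @ mu_star)
            for P, b in zip(precisions, beta_hats)]
    return np.array(post), mu_star

# Hypothetical usage with simulated data for M = 4 groups and p = 2 coefficients.
rng = np.random.default_rng(0)
X_list = [rng.normal(size=(12, 2)) for _ in range(4)]
y_list = [X @ np.array([1.0, -2.0]) + rng.normal(size=12) for X in X_list]
post_means, combined = conditional_posterior_means(X_list, y_list, V=1.0, C=0.5 * np.eye(2))
print(post_means)
print(combined)
```

    As C grows, the weights move towards the individual least squares vectors; as C shrinks towards zero, every group is pulled onto the combined vector, which is the direction in which the ‘collapsing phenomenon’ operates.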

    Lindley and Smith did not apparently notice a ‘collapsing phenomenon’ that causes their estimates of the regression coefficients to radically over-adjust the least squares estimates. Marginal modal estimates of V and C may instead be calculated, and these will be asymptotically consistent for the ‘true’ V and C as M gets large and under the two-stage distributional assumptions described above.

    A straightforward application of the EM algorithm, as described by Art Dempster, Don Rubin, and Nan Laird in JRSSB in 1977, would yield equations for these marginal estimates which include important extra terms, when compared with the Lindley-Smith equations, which do not vanish as M gets large. See the equations for the marginal maximum likelihood estimates of V and C described in Exercise 6.3 h on pages 271-272 of [15].

    In his 1976 paper in Biometrika, Tony O’Hagan showed that joint and marginal posterior modes can be quite different in numerical terms. Moreover, in their reply to my trivially intended contribution to the discussion of their paper, Bradley Efron and Carl Morris [27] showed that the Lindley-Smith estimators of the location parameters have quite inferior mean squared error properties, in a special case, when compared with their empirical Bayes estimators.

    In 1980, Bill Fortney and Bob Miller [58] sorted out M-group regression with a view to applications in business, and showed in a special case that O’Hagan’s procedures yield much better mean squared error properties for the estimators of the regression coefficients than the Lindley-Smith joint modal estimators. In their 1996 JASA paper, Li Sun et al were to show, in a special case, that as M gets large the Lindley-Smith estimators have far worse properties than maximum likelihood over arbitrarily large regions of the parameter space and for any choices of the prior parameter values.

    Adrian Smith waited until 1973 to publish a numerical example of his M-group regression methodology, again in JRSSB.

    I recall bumping into Adrian in the bar in, as I remember, the downtown Hilton hotel, during the ASA meetings in San Francisco in 1987. I was sober at the time since I had just completed my ASA short course lectures on categorical data analysis for the day. During a convivial conversation with Adrian we got to talking about the computational difficulties associated with the 1972 Lindley-Smith paper, which hadn’t been resolved during the following fifteen years, to my satisfaction at least. I recall Adrian smiling into his drink and (I’d put my hand on a Bible on this one) advising me to the effect that ‘he’d always stopped after the first stage in the iterations (for the posterior modes), since that provided the users with what they really wanted’.

    The consequences of Adrian’s youthful enthusiasm appear to be evident in Table 2 of the Lindley-Smith paper, where the authors compare their estimates for a single regression vector, under a special prior formulation, with the Hoerl-Kennard ‘ridge regression’ estimates. Most of the, presumably incorrectly calculated, Lindley-Smith estimates of the regression coefficients shrink the least squares estimates much less towards zero than do the ridge regression estimates.

    In 1977, Norman Draper and Craig Van Nostrand [59] reported detailed results which suggest that the ridge regression estimates, which refer to a graph of the ridge trace, themselves shrink too much when compared with Stein-style estimates. If Lindley and Smith had pursued their iterations to the limit, then their correctly calculated numerical results could well have been much closer than they claim to the less-than-desirable Hoerl-Kennard estimates.

    In the orthonormal case, and as the number of unknown regression parameters gets large, the methodology employed by Sun et al (JASA, 1996) may be easily generalised, in this ridge regression situation, to show that the correctly calculated Lindley-Smith joint posterior modal estimates have far worse mean squared error properties than least squares over arbitrarily large regions of the parameter space and for any choices of the prior parameter values.

    Draper and Van Nostrand moreover take ridge regression to task for other compelling reasons. Some of their criticism can also be directed at the Lindley-Smith estimates in this single regression situation.

    [Legal Note: Adrian reported his Ph.D. research in the Lindley-Smith 1972 paper and his 1973 papers in JRSSB and Biometrika, and these continue to influence our literature and events in my personal history of Bayesian Statistics. I have not read his 1971 University of London Ph.D. thesis in recent years, and I hope that he included enough caveats in his presentation to ensure authenticity of his thesis. As of 14th January 2014, I am still trying to resolve all these issues with Adrian, and I will amend this chapter as appropriate].

    In Table 1 of their paper, Lindley and Smith report the results of the Novick et al ‘empirical validation’ of Lindley’s previous quasi-Bayesian estimates in the M-group regression situation. The attempted validation, which was completed at A.C.T., referred to the prediction from previous data of students’ grade point averages collected in 1969 at 22 different American colleges, and the authors measured ‘predictive efficiency’ by average mean squared error. Lindley and Smith claimed that their estimates yielded a predictive efficiency of 0.5502, when compared with 0.5596 for least squares. However, when a 25% subsample of the data is considered, their estimates gave a purported 0.5603 predictive efficiency, when compared with 0.6208 for least squares.

    When proposing the vote of thanks after the presentation of Lindley and Smith’s paper to the Royal Statistical Society, the confirmed frequentist Robin Plackett concluded that their estimates ‘doubled the amount of information in the data’. This conclusion appears, in retrospect, to be quite misguided, for example because the empirical validation at A.C.T. was (according to my own perception) affected by Paul Jackson’s judicious ‘juggling’ of a prior parameter value during the application of an otherwise untested computer package, and because it was unreplicated.

    The empirical validation may indeed have simply ‘struck lucky’ (this interpretation seems extremely plausible; just consider the unconvincing nature of the theory underpinning the forecasts). Moreover, Novick et al made their forecasts quite retrospectively, about two years after the grade point averages they were predicting had actually been observed and reported. This gave the authors considerable scope for selective reporting, and it is not obvious why they restricted their attention to only 22 colleges.

    A celebrated Bayesian forecaster, who learnt the tricks of his trade at Imperial Chemical Industries, later advised me that ‘The only good forecasts are those made after the event’, and I am not totally convinced that he was joking! He also said ‘I like exchangeability. It means that you can exchange your prior when you’ve seen your posterior.’ Unfortunately that’s exactly what Paul Jackson would appear to have done when he changed the value of a degrees of freedom hyperparameter. In so doing, he would appear, to me at least, to have altered the course of the history of Bayesian Statistics.

    Our senior, more experienced statisticians hold a privileged position in Society; maybe they should try to reduce the professional pressures on graduate students and research workers to produce ‘the right result’, since this can even create situations where a wrong result is projected as the truth. This problem was examined as recently as 2nd September 2013 at the Royal Society of Edinburgh, when possible inaccuracies in the official Scottish health projections were discussed with members of the UK Statistics Authority.

    I am surprised that Dennis and Adrian never formally retracted the empirical conclusions in their paper but instead waited for other authors to refute them after further painstaking research. Thank goodness that Fortney and Miller [58] returned some sanity to M-group regression, in their, albeit largely unpublicised, paper. I would never have known about their work if Bob Miller hadn’t mentioned it to me during the late 1980s.

    It is possible to derive an approximate fully Bayesian solution to the M-group regression problem by reference to a matrix logarithmic transformation of the first stage prior covariance matrix and conditional Laplacian approximations. Maybe somebody will get around to that sometime.

 
 

Robert B. Miller was a Professor of Statistics and Business at the University of Wisconsin-Madison. He is currently Associate Dean of Undergraduate Studies in the Business School at Wisconsin, where he was a colleague of the late Dean James C. Hickman of Bayesian actuarial fame. The brilliant actuary and statistician Jed Frees has followed in Hickman and Miller’s footsteps, at times from a Bayesian perspective.

 
 

Various procedures for eliciting prior probabilities or entire prior distributions, e.g. from an expert or group of experts, were investigated during the early 1970s.

    On pp. 71-76 of his 1970 book Optimal Statistical Decisions, Morris De Groot recommended calibrating your subjective probability by reference to an objective auxiliary experiment which generates a random number in the unit interval, e.g. by spinning a pointer, and he used this calibration procedure as part of an axiomatic development of subjective probability. His innocuous-looking fifth axiom is very strong in the way it links the assessor’s preference ordering on elements of the parameter space to the outcome of the auxiliary experiment in a manner consistent with the first four axioms. See [16].

    Parts of the extensive literature on the elicitation of prior distributions were reviewed in JASA in 2005 by Paul Garthwaite, Jay Kadane and Tony O’Hagan.

    Paul Garthwaite is Professor of Statistics at the Open University, where his research interests also include animal and food nutrition, medicine, and psychology.

    Jay Kadane is the Leonard J. Savage Professor of Statistics at Carnegie-Mellon University, and one of the world’s leading traditional Bayesians. Here is a brief synopsis of some of the key elements of the JASA review:


    In 1972, Arnold Zellner proposed a procedure for assessing informative prior distributions for sets of regression coefficients. In 1973, James Hampton, Peter Moore and Howard Thomas reviewed the previous literature regarding the assessment of subjective probabilities. In 1973, Daniel Kahneman and Amos Tversky investigated various psychological issues associated with these procedures, and a 1974 experiment by the same authors demonstrated a phenomenon called ‘anchoring’, where an initial assessment of a subjective probability, called an anchor, is adjusted by a quantity which is too small when the person in question reassesses his probability. In his 1975 review, Robin Hogarth maintains that ‘assessment techniques should be designed both to be compatible with man’s abilities and to counteract his deficiencies’.

    In a seminal paper published in JASA in 1975, Morris De Groot used the properties of recurrence and transience in Markov chains to determine conditions under which a group of experts can reach a consensus regarding the probabilities in a discrete subjective distribution. In doing so, he extended the Delphi Method, a management technique for reaching a consensus among a group of employees, which was reviewed and critiqued by Juri Pill in 1971.
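    The Markov chain connection can be sketched in a few lines of Python. The sketch assumes the linear pooling scheme usually associated with De Groot’s model, in which each expert repeatedly replaces his distribution by a weighted average of all the experts’ current distributions; the weight matrix, the opinions, and the function name are hypothetical. When the weight matrix behaves like the transition matrix of an irreducible, aperiodic chain, the iterates converge to a common row.

```python
import numpy as np

def degroot_consensus(P, F0, tol=1e-12, max_iter=10_000):
    """Iterate F <- P F until successive iterates agree to within tol.

    P  : k x k row-stochastic matrix; P[i, j] is the weight expert i
         gives to expert j's current opinion (an assumed input).
    F0 : k x m matrix; row i is expert i's initial discrete distribution.
    """
    P = np.asarray(P, dtype=float)
    F = np.asarray(F0, dtype=float)
    for _ in range(max_iter):
        F_new = P @ F
        if np.max(np.abs(F_new - F)) < tol:
            return F_new
        F = F_new
    raise RuntimeError("opinions did not converge; consensus may not exist for this P")

# Hypothetical example: two experts, two outcomes.
P = [[0.50, 0.50],
     [0.25, 0.75]]
F0 = [[0.2, 0.8],
      [0.5, 0.5]]
print(degroot_consensus(P, F0))
```

    In this example the stationary distribution of P is (1/3, 2/3), so both rows converge to the corresponding weighted average of the initial opinions.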


De Groot’s ideas on combining and reconciling experts’ subjective distributions were pursued in 1979 by Dennis Lindley, Amos Tversky, and Rex V. Brown in their rather over-elaborate discussion paper ‘On the Reconciliation of Probability Assessments’, which they read to the Royal Statistical Society, and by the brilliant Brazilian Bayesian Gustavo Gilardoni and many others. However, beware: ‘Formalisations impede the natural man’, as Jim Dickey, or maybe it was me, once said.

    The question arises as to whether a subjective distribution for a parameter assuming values in a continuous parameter space can be meaningfully elicited from a subject-matter expert in a consulting environment. It would appear to be quite difficult to complete this task in its entirety in the short time that is typically available, unless the distribution was assumed to belong to a special class of distributions. Without this type of assumption, it might be possible to elicit, say, the expert’s subjective mean and standard deviation, and two or three interval probabilities, in which case you could use the distribution which maximises Shannon’s entropy measure given the specified information. It would appear to be very difficult to completely assess a joint subjective probability distribution for several parameters, though it might be worth assuming that some transformation of the parameters is multivariate normally distributed.

    A deeper philosophical question is whether or not a Bayesian statistician should try to elicit subjective probabilities on a regular basis in a consulting environment. In my opinion, it would at times be more beneficial for the statistician to just analyse the observed statistical data, while interacting with his client regarding the subject-matter background of the data.



In 1972 Mervyn Stone and Phil Dawid reported some ingenious marginalization ‘paradoxes’ in Biometrika which seemed to suggest that improper prior distributions do not always lead to sensible marginal posterior distributions even if the (joint) posterior distribution remains proper. In 1973, they and Jim Zidek presented a fuller version [60] of these ideas to a meeting of the Royal Statistical Society, and their presentation was extremely amusing and enjoyable with each of the three co-authors taking turns to speak.

    The purported paradoxes involve discussions between two Bayesians who use different procedures for calculating a posterior in the presence of an improper prior. I was too scared to participate in the debate that took place after the presentation of the paper, because I felt that the ‘paradoxes’ only occurred in pathological cases and that they were at best implying that ‘You should regard your improper prior as the limit of a sequence of proper priors, while ensuring that your limiting marginal posterior distributions are intuitively sensible when contrasted with the marginal posterior distributions corresponding to the proper priors in the sequence’.

    Phil Dawid asked me afterwards why I hadn’t contributed to the discussion. I grunted, and he said, ‘say no more.’

    The Dawid, Stone, Zidek ‘paradoxes’ have subsequently been heavily criticized by E.T. Jaynes [61], and by the Bayesian quantum physicist and probabilist Timothy Wallstrom [62] of the Los Alamos National Laboratory in New Mexico, who believes that, ‘the paradox does not imply inconsistency for improper priors since the argument used to justify the procedure of one of the Bayesians (in the two-way discussion) is inappropriate’, and backs up this opinion with mathematical arguments.

    Wallstrom has also made seminal contributions to Bayesian quantum statistical inference and density estimation, as discussed later in the current chapter, and to the theory of stochastic integral equations.

 

Timothy Wallstrom

 

    Other very mathematical objections to the marginalization ‘paradoxes’ are raised by the highly respected mathematical statisticians Joseph Chang and David Pollard of Yale University in their 1997 paper ‘Conditioning as disintegration’ in Statistica Neerlandica. However, according to a more recent unpublished UCL technical report, Stone, Dawid and Zidek still believe that their paradoxes are well-founded.

    Despite these potential caveats, Dennis Lindley took the opportunity during the discussion of the 1973 invited paper to publicly renounce the use of improper priors in Bayesian Statistics. By making this highly influential recantation, Dennis effectively negated (1) much of the material contained in his very popular pink volume, (2) several of the procedures introduced in his own landmark papers, (3) large sections of the time-honoured literature on inverse probability, including many contributions by Augustus De Morgan and Sir Harold Jeffreys, (4) many of Jeffreys’ celebrated invariance priors, and (5) a large amount of the work previously published by other highly accomplished Bayesians in important areas of application, including numerous members of the American School.

     Sometime before the meeting, Mervyn Stone said that ‘as editor of JRSSB he would give a hard time to authors who submitted work to him that employed improper priors’. And so he did!

 
 
STOP PRESS! On 22nd January 2014, and after a helpful correspondence with my friend and former student Michael Hamada, who works at the Los Alamos National Laboratory in New Mexico, I received the following e-mail from Dr. Timothy Wallstrom:
 
Dear Professor Leonard,
 
I was flattered to learn from Mike Hamada that I am appearing in your Personal History of Bayesian Statistics, both for the work on QSI with Richard Silver, and for my 2004 arXiv publication on the Marginalization Paradox.

I have since come to realize, however, that my 2004 paper on the Marginalization Paradox, while technically correct, misses the point. Dawid, Stone, and Zidek are absolutely correct. I realized this as I was preparing a poster for the 2006 Valencia conference, which I then hastily rewrote! Dawid and Zidek were at the conference, and I discussed these ideas at length with them there, and later with Stone via email. My new analysis was included in the Proceedings; I've attached a copy. I also presented an analysis at the 2007 MaxEnt conference in Sarasota, NY. That write-up is also in the corresponding proceedings, and is posted to the arXiv, at http://arxiv.org/abs/0708.1350 .

Thank you very much for noting my work, and I apologize for the confusion. I need to revise my 2004 arXiv post; I'm encouraged to do so now that I realize someone is reading it!

Best regards,

Tim Wallstrom

P. S. A picture of me presenting the work at the Maxent conference (which is much better than the picture you have) is at
 
 
    I would like to thank Timothy Wallstrom for advising me that he had already apparently resolved the long-standing Marginalisation Paradox controversies. It is good to hear that my colleagues at UCL were, after all, apparently correct. Quite fortuitously, Dr. Wallstrom's advice does not affect the conclusions and interpretations which I make in this Personal History regarding the viability and immense advantages of inverse probability and improper priors in Bayesian Statistics.
 

    In 1971, Vidyadhar Godambe and Mary Thompson read a paper to the Royal Statistical Society entitled ‘Bayes, fiducial, and frequency aspects of statistical inference in survey sampling’ which was a bit controversial in general terms.

    When seconding the vote of thanks following the authors’ presentation, Mervyn Stone described their paper as ‘like a plum pudding, soggy at the edges and burnt in the middle’.

    Mervyn’s reasons for saying this would again appear to have been quite esoteric; he thought that Godambe’s sampling models were focussing on a model space of measure zero. Vidyadhar’s sampling assumptions seemed quite reasonable to me at the time, though his conclusions seemed a bit strange. Most models are approximations, anyway. I of course totally agree with Mervyn that the fiducial approach doesn’t hold water.


If you use improper vague priors in situations where there is no relevant prior information, then your prior to posterior analysis will not always be justifiable by ‘probability calculus’. Nevertheless, if you proceed very carefully, and look out for any contradictions in your posterior distributions, then your conclusions will usually be quite meaningful in a pragmatic sense and in the way they reflect the unsmoothed information in the data. Therefore most of the improper prior approaches renounced and denounced by Lindley and Stone are still viable in a practical sense. They are indeed key and absolutely necessary cornerstones of the ongoing Bayesian literature, which can be far superior in non-linear situations, for small to moderate sample sizes, to whatever else the frequentists and fiducialists serve up.

    The somewhat inquisitional Bayesian ‘High Church’ activities in London around that time perhaps even hinted at a slight touch of megalomania.

    In contrast, John Aitchison was using statistical methods with strong Bayesian elements to good practical effect in Glasgow. During the early 1970s he focussed on the area of Clinical Medicine, e.g. by devising diagnostic techniques, by an application of Bayes Theorem, which reduced the necessity for exploratory surgery. For example, in his 1977 paper in Applied Statistics with Dik Habbema and Jim Kay, the authors considered different diagnoses of Conn’s syndrome while critically comparing two methods for model discrimination. See also p. 232 of Aitchison’s 1975 book Statistical Prediction Analysis with Ian Dunsmore. Aitchison’s medical work during the 1970s and beyond is reported in his book Statistical Concepts and Applications in Clinical Medicine with Jim Kay and Ian Lauder.

 
John Aitchison F.R.S.E.
 

    Aitchison was motivated by Wilfred Card of the Department of Medicine at the University of Glasgow. In 1970, Jack Good and Wilfred Card co-authored ‘The Diagnostic Process with Special Reference to Errors’ in Methods of Information in Medicine.

    John Aitchison is a long-time Fellow of the Royal Society of Edinburgh, and a man of science. He was awarded the Guy Medal in Silver by the Royal Statistical Society of London in 1988, and probably deserved to receive the second ever Bayesian Guy Medal in Gold. The first had been awarded to Sir Harold Jeffreys in 1962.

    In 1974, Peter Emerson applied Bayes theorem and decision theory to the prevention of thrombo-embolism in a paper in the Annals of Statistics and co-authored ‘The Application of Decision Theory to the Prevention of Deep Vein Thrombosis following Myocardial Infarction’ in the Quarterly Journal of Medicine, with Derek Teather and Antony Handley.

    Furthermore, Robin Knill-Jones, of the Department of Community Medicine at the University of Glasgow, reported his computations via Bayes theorem of diagnostic probabilities for jaundice in the Journal of the Royal College of Physicians London. In his 1975 article in Biometrika, John Anderson derived a logistic discrimination function via Bayes theorem and used it to classify individuals according to whether or not they were psychotic, based upon the outcomes of a psychological test.


In their 1973 JASA paper, Stephen Fienberg and Paul Holland proposed Empirical Bayes alternatives to Jack Good’s 1965 hierarchical Bayes estimators for p multinomial cell probabilities. They devised a ratio-unbiased data-based estimate for Good’s flattening constant α, and proved that the corresponding shrinkage estimators for the cell probabilities possess outstanding mean squared error properties for large enough p. Fienberg and Holland show that, while the sample proportions are admissible with respect to squared error loss, they possess quite inferior risk properties on the interior of parameter space when p is large, a seminal result.
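The shrinkage form of these estimators can be sketched as follows. This Python fragment takes the flattening constant α as a user-supplied value; it does not reproduce the Fienberg-Holland data-based estimate of α, and the function name and example counts are hypothetical. Adding α to each of the p cell counts is equivalent to a weighted average of the sample proportions and the uniform value 1/p.

```python
import numpy as np

def flattened_estimates(counts, alpha):
    """Shrink multinomial sample proportions towards 1/p.

    counts : observed cell frequencies (length p).
    alpha  : flattening constant added to every cell (assumed known here).
    """
    x = np.asarray(counts, dtype=float)
    p, n = x.size, x.sum()
    return (x + alpha) / (n + p * alpha)

# Hypothetical example: three cells, one of them empty.
counts = [3, 0, 7]
print(flattened_estimates(counts, alpha=1.0))

# Equivalent weighted-average form, with weight w on the sample proportions.
n, p, alpha = 10.0, 3, 1.0
w = n / (n + p * alpha)
print(w * np.array(counts) / n + (1 - w) / p)
```

The empty cell receives a positive estimate, which is one practical motivation for flattening when p is large relative to the sample size.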


Stephen Fienberg is Maurice Falk University Professor of Statistics and Social Sciences in the Department of Statistics, The Machine Learning Department, Cylab and i-Lab at Carnegie-Mellon University.

 

Steve Fienberg

 

    I first met Steve at a conference about categorical data analysis at the University of Newcastle in 1977. He is one of the calmest and most intensely prolific scientists I have ever met.

    Stephen was President of ISBA in 1996-97 and of the IMS in 1998-99. A Canadian hockey player by birth, he is an elected Fellow of the National Academy of Sciences and four further national academies. During the time that he served as Vice-President of York University in Toronto, he was affectionately known as the ‘American President’.

    Paul Holland holds the Frederick M. Lord Chair in Measurement and Statistics at the Educational Testing Service in Princeton, following an eminent career at Berkeley and Harvard.

 

Meanwhile, Thomas Ferguson generalised the conjugate distribution for multinomial cell probabilities in his lead article in the Annals of Statistics by proposing a conjugate Dirichlet process to represent the statistician’s prior beliefs about a sampling distribution which is not constrained to belong to a particular parametric family. His posterior mean value function for the sampling c.d.f., given a random sample of n observations, updates the prior mean value function Λ in the light of the information in the data. It may be expressed in the succinct weighted average form GF+(1-G)Λ, with F denoting the empirical c.d.f. and G=n/(n+α). Here Λ is the prior mean value function of the sampling c.d.f., and α is the single ‘prior sample size’ or a priori degree of belief in the prior estimate. The posterior covariance kernel of the sampling c.d.f. can be expressed quite simply in terms of α, Λ, F.
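    The weighted average form GF+(1-G)Λ is easy to evaluate directly. The Python sketch below does so for an arbitrary prior c.d.f. supplied as a function; the function names, the data, and the uniform prior guess are hypothetical choices made purely for illustration.

```python
import numpy as np

def dp_posterior_mean_cdf(data, prior_cdf, alpha):
    """Return t -> G F(t) + (1 - G) Lambda(t), with G = n / (n + alpha).

    data      : observed sample of size n.
    prior_cdf : the prior mean value function Lambda, as a callable.
    alpha     : the 'prior sample size'.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    G = n / (n + alpha)

    def posterior_cdf(t):
        t = np.asarray(t, dtype=float)
        F = np.searchsorted(x, t, side="right") / n   # empirical c.d.f. at t
        return G * F + (1 - G) * prior_cdf(t)

    return posterior_cdf

# Hypothetical example: four observations, uniform prior guess on [0, 1].
uniform_cdf = lambda t: np.clip(t, 0.0, 1.0)
post = dp_posterior_mean_cdf([0.1, 0.4, 0.4, 0.9], uniform_cdf, alpha=4.0)
print(post(np.array([0.0, 0.25, 0.5, 1.0])))
```

    With α equal to the sample size, as in this example, the empirical c.d.f. and the prior guess receive equal weight. The resulting estimate retains the jumps of the empirical c.d.f., scaled by G.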

    While Ferguson’s estimates shrink the sample c.d.f. towards the corresponding prior estimate of the c.d.f., they do not invoke any posterior smoothing of the type advocated by Good and Gaskins in their 1971 paper in Biometrika.

    David Blackwell and James MacQueen proved that Dirichlet processes can be expressed as limits of Polya urn schemes, while Charles Antoniak generalised Ferguson’s methodology by assuming a prior mixing distribution for α Λ. Both methodologies were published in 1974 in the Annals of Statistics.

    It’s not obvious how well the corresponding hierarchical and empirical Bayes procedures work in practice. In my 1996 article in Statistics and Probability Letters about exchangeable sampling distributions, I show that α is unidentifiable from the data when Λ  is fixed, unless there are ties in the data.


    Dale Poirier’s 1973 Ph.D. thesis in Economics at the University of Wisconsin-Madison was entitled Applications of Spline Functions in Economics. Dale has published many Bayesian papers in Econometrics since, and he is currently Professor of Economics at the University of California at Irvine.

    Dale Poirier is listed in Who’s Who in Economics as one of the major economists between 1700 and 1998.

 

In their 2005 survey of Bayesian categorical data analysis, Alan Agresti and David Hitchcock regard my 1972 and 1973 papers in Biometrika, and my 1975 and 1977 papers in JRSSB and Biometrics, as initiating a logistic transformation/ first stage multivariate normal/ hierarchical prior approach to the analysis of counted data, which evolved from the seminal Lindley-Smith 1972 developments regarding the linear statistical model.

 
Alan Agresti
 
    In my 1972 paper, I derived shrinkage estimates for several binomial probabilities, by constructing an exchangeable prior distribution for the log-odds parameters. In my all-time-favourite 1973 paper, I smoothed the probabilities in a grouped histogram by assuming a first order autoregressive process for the (multinomial) multivariate logits at the first stage of my hierarchical prior distribution. This type of formulation (maybe a second order autoregressive prior process would have yielded somewhat more appealing posterior inferences) provides a more flexible prior covariance structure in comparison to the Dirichlet formulation employed by Jack Good, and by Fienberg and Holland.
 
 

    My 1973 paper influenced the literature of semi-parametric density estimation, and preceded much of the literature on multi-level modelling for Poisson counts. My histogram smoothing method effectively addressed the problem of filtering the log-means of a series of independent Poisson counts, since the conditional distribution of the counts, given their sum, is multinomial.

    In my 1975 JRSSB paper, I constructed a three-fold prior exchangeability model for the logit-space row, column, and interaction effects in an r x s contingency table. However, when analysing Karl Pearson’s 14 x 14 fathers and sons occupational mobility table, I added an extra set of interaction effects along the diagonal, and referred to a four-fold prior exchangeability model instead, with interesting practical results which appealed to Henry Daniels and Irwin Guttman, if nobody else.

    My Biometrics 1977 paper is the best of the bunch in technical terms. In this, not particularly well-cited, article, I analysed the Bradley-Terry model for Paired Comparisons, again by reference to an exchangeable prior distribution. I used my shrinkage estimators to analyse Stephen Fienberg’s ‘passive and active display’ squirrel monkey data, and they indicated a different order of dominance for the monkeys in the sample than suggested by the maximum likelihood estimates. In my practical example, I used a uniform prior for the first stage prior variance of my transformed parameters, rather than selecting a small value for the degrees of freedom of an inverted chi-squared prior.

    The Editor of Biometrics, Foster Cady, was very keen to publish the squirrel monkey data, though he seemed much less concerned about the bothersome details of the statistical theory. The United States Chess Federation once showed an interest in using similar procedures for ranking chess players, but it didn’t come to anything.

    Unfortunately, my posterior modal estimates in my 1972, 1973 and 1975 papers all suffer from a Lindley-Smith-style over-shrinkage problem in the situation where the first stage prior parameters (i.e. the smoothing parameters) are left unspecified during the prior assessment.

    This difficulty was resolved in my Biometrics paper, where approximations to the unconditional posterior means of the model parameters were proposed, following suggestions in Ch. 2 of my 1973 University of London Ph.D. thesis (that also hold for more general hierarchical models). A numerical example is discussed in Appendix A. See Chs. 5 and 6 of [15] for some more precise techniques. Nowadays, the unconditional posterior means and marginal densities of any of the model parameters in my early papers can be computed exactly using modern Bayesian simulation methods.

    In her 1978 JRSSB paper, Nan Laird, a lovely Scottish-American biostatistician at Harvard University, reported her analysis of a special case of the contingency table model in my 1975 paper, which just assumed exchangeability of the interaction effects. Nan used the EM algorithm to empirically estimate her single prior parameter by an approximate marginal posterior mode procedure. The EM algorithm may indeed be used to empirically estimate the prior parameters in any hierarchical model where the first stage of the prior distribution of an appropriate transformation of the model parameters is taken to be multivariate normal with some special covariance structure. These include models which incorporate a non-linear regression function, and also time series models. Many of these generalisations were discussed in my 1973 Ph.D. thesis. If the observations are Poisson distributed then I use my prior structure for the logs of their means.

 
Nan Laird
 

    The techniques I devised for Bayesian categorical data analysis during the 1970s have been applied and extended by numerous authors including Arnold Zellner and Peter Rossi, Rob Kass and Duane Steffey, Jim Albert, William Nazaret, Matthew Knuiman and Terry Speed, John Aitchison, Costas Goutis, Jon Forster and Allan Skene, and by Michael West and Jeff Harrison to time series analysis and forecasting. Jim Hickman and Bob Miller extended my histogram smoothing ideas to the graduation of mortality tables and to bivariate forecasting in actuarial science. Many of the models incorporated into WinBUGS and INLA can be regarded (at least by me!) as special cases or extensions of my formulations for binomial and multinomial logits and Poisson log-means.

    Matthew Knuiman is Winthrop Professor of Public Health at the University of Western Australia where he teaches and researches biostatistics and methodology, sometimes from a Bayesian viewpoint. The diversity and quality of his applications in medicine are tremendously impressive.

    William Nazaret, who extended my methodology, in Biometrika in 1987, to the analysis of three-way tables, with a bit of guidance from myself, has a Ph.D. in Statistical Computing from Berkeley. After a successful subsequent career at AT&T Bell Labs, he became CEO of Intelligense in the U.S., and then C.E.O. for Digicel in El Salvador, and finally C.E.O. for Maxcom Telecommunications S.A.B. de C.V., for whom he now serves as an external advisor. The life of a Venezuelan polymath on a multi-way table must be very interesting indeed, particularly when he’s a fan of shrinkage estimators.


The British General Elections in February and October 1974 were both a very close call. In the first election, the Labour Party won 4 more seats than the Tories, but were 17 short of an overall majority. The Tory Prime Minister, the celebrated puff-chested yachtsman Ted Heath, remained stoically in Downing Street while trying to form a coalition with the Liberals. When the Liberal leader Jeremy Thorpe (later to go for an early bath after allegedly conspiring to kill his lover’s dog) laughed in Heath’s face, Harold Wilson (a future cigar-puffing President of the Royal Statistical Society) flourished his working class pipe and returned to Downing Street as Labour Prime Minister. Then in October 1974 Labour achieved an overall majority of 3 seats.

    Wilson was an advocate of short-term forecasting but wasn’t obviously Bayesian, though he had a strong sense of prior belief.

    The BBC were ably assisted during the 1974 Election Nights by Phil Brown and Clive Payne who used a variation of Bayesian ridge regression (see Lindley and Smith [57]) to forecast the final majorities. Brown and Payne reported their, at times amusing, experiences, in their 1975 invited paper to the Royal Statistical Society. They went on to predict the outcomes during several further General Election nights while entwining with British political history.

 
Philip Brown
 

In 1974, Abimbola Sylvester Young, from Sierra Leone, was awarded his Ph.D. at UCL. The title of his splendid Ph.D. thesis was Prediction analysis with some types of regression function, and he was supervised by Dennis Lindley.

    After an eminent career, working with the Governments of Uganda and Malawi and various international food, agriculture, and labour organisations, Abimbola is currently a consultant statistician with the African Development Bank. Jane Galbraith, when a tutor at UCL, reportedly decided that he should be called Sylvester, because she could not pronounce Abimbola.


Masanao Aoki’s monumental 1975 book Optimal Control and System Theory in Dynamic Economic Analysis follows earlier developments by Karl Johan Åström in Introduction to Stochastic Control Theory. Aoki follows a Bayesian decision theoretic approach and develops a backwards-and-forwards updating procedure for determining the optimal way of controlling dynamic systems. In so doing, he extended the Kalman-Bucy approach to Bayesian forecasting, which was pursued by Harrison and Stevens in their 1976 paper in JRSSB, and later by West and Harrison in their 1997 book Bayesian Forecasting and Dynamic Models. The books on Markov decision processes, by the eminent American industrial engineer Sheldon Ross, are also relevant to this general area. See Markov Decision Processes by Roger Hartley for an excellent review and various extensions.

    Also in 1975, the economists Gary Chamberlain and Edward Leamer published a conceptually important article in JRSSB about the matrix weighted averages which are used to compute posterior mean vectors in the linear statistical model in terms of a least squares vector and a prior mean vector. Noting that the components of the posterior mean vector are not typically expressible as simple weighted averages of the corresponding least squares estimates and prior means, the authors used geometric arguments and multi-dimensional ellipsoids to compute bounds on the posterior means that illustrate this phenomenon further. The posterior smoothing implied by a non-trivial prior covariance matrix can be even more complex than indicated by these bounds and sometimes initially quite counter-intuitive.
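    The phenomenon can be illustrated numerically. The following is a minimal sketch, not taken from the Chamberlain-Leamer paper: it assumes the usual matrix weighted average (H + P)^{-1}(Hb + Pm), where b is the least squares vector with precision matrix H, and m is the prior mean vector with precision matrix P. The numerical values and the helper names are hypothetical and are chosen purely for illustration.

```python
def solve_2x2(a, v):
    """Solve the 2x2 linear system a x = v by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * v[0] - a[0][1] * v[1]) / det,
            (a[0][0] * v[1] - a[1][0] * v[0]) / det]


def mat_vec(a, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]


def posterior_mean(H, b, P, m):
    """Matrix weighted average (H + P)^{-1} (H b + P m)."""
    total = [[H[i][j] + P[i][j] for j in range(2)] for i in range(2)]
    Hb, Pm = mat_vec(H, b), mat_vec(P, m)
    return solve_2x2(total, [Hb[i] + Pm[i] for i in range(2)])


# Hypothetical inputs: correlated least squares precision, identity prior precision.
H = [[1.0, 0.9], [0.9, 1.0]]   # precision matrix of the least squares vector
b = [1.0, 0.0]                 # least squares estimates
P = [[1.0, 0.0], [0.0, 1.0]]   # prior precision matrix
m = [0.0, 0.0]                 # prior means

post = posterior_mean(H, b, P, m)
print(post)
```

    Here the least squares estimate and the prior mean of the second coefficient are both zero, yet its posterior mean is about 0.28, so it cannot be a simple weighted average of the two. The off-diagonal terms in H let the information about the first coefficient pull on the second.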


    Still in 1975, Bruce Hill of the University of Michigan proposed a simple general approach in the Annals of Statistics to inference about the tail of a sampling distribution. His results were particularly simple for population distributions with thick right tail behaviour of the Zipf type. The inferences were based on the conditional likelihood of the parameters describing the tail behaviour, given the observed values of the extreme order statistics, and could be implemented from either a Bayesian or a frequentist viewpoint. This followed a series of fascinating papers by Hill concerning Zipf’s law and its important applications.

    Norman Draper and Irwin Guttman published their Bayesian approach to two simultaneous measurement procedures in JASA in 1975. Known as ‘Mr. Regression’ to his students, Norman Draper has also published several important Bayesian papers. He was one of George Box’s long-term confidants and co-authors at the University of Wisconsin-Madison, where he is now a much-admired Emeritus Professor of Statistics.
 

In 1976, Peter Freeman read his paper ‘A Bayesian Analysis of the Megalithic Yard’ to a full meeting of the Royal Statistical Society. His analysis of Alexander Thom’s prehistoric stone circle data was well-received. As Thom was both a Scot and an engineer, he always meticulously recorded his archaeological measurements. However, it has always been debated whether or not the megalithic yard is a self-imagined artefact. Maybe the Stone Age men just paced out their measurements with their feet.

    Even though he was an accomplished Bayesian researcher and lecturer, Peter Freeman was given an unfairly hard time by his senior colleagues at UCL during the late 1970s. Maybe he was too practical. In 1977, the University administrators even refused to let him open the skylight in his office, a key event in the history of Bayesian Statistics which preceded Dennis Lindley’s early retirement shortly afterwards. When Peter fled to Leicester to take up his duties as Chair of Statistics in that forlorn neck of the woods, he discovered that Jack Good, a previous candidate for the Chair, had made far too many eccentric demands, including the way his personal secretary should be shortlisted and interviewed by the administrators ‘specially for him’. Peter is currently a highly successful statistical consultant who works out of his cottage on the Lizard peninsula on the southern tip of Cornwall with his second wife.

    According to George Box, Sir Ronald Fisher (who’d dabbled with Bayesianism in his youth) always hated the obstructive administrators at UCL, together with the ‘beefeaters’ patrolling the Malet Street quadrangle. When one of the beefeaters manhandled a female colleague while she was trying to clamber through a window, Sir Ronald beat the rude fellow with his walking stick for ‘not behaving like a gentleman.’

 
 

Also in 1976, Richard Duda, Peter Hart and Nils Nilsson, of the Artificial Intelligence Center at the Stanford Research Institute, investigated existing inference systems in Artificial Intelligence when the evidence is uncertain or incomplete. In ‘Subjective Bayesian Methods for Rule-Based Inference Systems’, published in a technical report funded by the U.S. Department of Defence, they proposed a Bayesian inference method that realizes some of the advantages of both formal and informal approaches.

    In 1977, Phil Dawid and Jim Dickey published a pioneering paper in JASA concerning the key issue of selective reporting. If you only report your best practical results then this will inflate their apparent practical (e.g. clinical) significance. The authors accommodated this problem by proposing some novel likelihood and Bayesian inferential procedures.

    Jim Dickey is currently an Emeritus Professor of Statistics at the University of Minnesota. He has made many fine contributions to Bayesian Statistics, and his areas of interest have included sharp null hypothesis testing, scientific reporting, and special mathematical functions for Bayesian inference. He has worked with other leading Bayesians at Yale University, SUNY at Buffalo, University College London, and the University College of Wales in Aberystwyth, and he is a fountain of knowledge. I remember him giving a seminar about scientific reporting at the University of Iowa in 1971. Bob Hogg, while at his humorous best, got quite perplexed when Jim reported his posterior inferences for an infinitely wide range of choices of his prior parameter values.

 
James Mills Dickey
 
    Jim Smith’s 1977 University of Warwick Ph.D. thesis was entitled Problems in Bayesian Statistics relating to discontinuous phenomena, catastrophe theory and forecasting. In this highly creative dissertation, Jim drew together Bayesian decision theory, the quite revolutionary ideas about discontinuous change expressed in the deep mathematics of Catastrophe theory by the French Fields medal winner René Thom and at Warwick by Sir Christopher Zeeman, and the Bayesian justifications and extensions of the Kalman Filter which were reported by Harrison and Stevens in their 1976 paper in JRSSB.
 
Jim Smith
 

    Discontinuous change can be brought about by the effects of one or more gradually changing variables e.g. during the evolutionary process or during conversations which lead to a spat. Unfortunately, Thom’s remarkable branch of bifurcation theory, which invokes cusp, swallowtail and butterfly catastrophes on mathematical manifolds, depends upon a Taylor Series-type approximation which is only locally accurate.

    Jim’s more precise results provided forecasters with considerable insights into the behaviour of time series which are subject to discontinuous change, sometimes because of the intervention of outside events. Discontinuities in the behaviour of time series can also be analysed using Intervention Analysis. See the 1975 book on this topic by Box and Tiao.

    While still a postgraduate student, Jim proved Smith’s Lemma ([15], p158) which tells us that the Bayes decision under a certain class of symmetric bounded loss functions must fall in a region determined by the decisions which are Bayes against some symmetric step loss function. This delightful lemma should be incorporated into any graduate course on Bayesian decision theory.

    Jim has been a very ‘coherent’ Bayesian throughout his subsequent highly accomplished career. He was a lecturer at UCL during the late 1970s, before returning for a lifetime of imaginative rationality and sabre-rattling at the University of Warwick.
 

In their 1976, 1977 and 1978 papers in JASA and the Annals of Statistics, Vyaghreshwarudu Susarla and John Van Ryzin reported their influential empirical Bayes approaches to multiple decision problems and to the estimation of survivor functions. Their brand of empirical Bayes was proposed by Herbert Robbins in 1956 in the Proceedings of the Third Berkeley Symposium, and should nowadays be regarded as part of the applied Bayesian paradigm since it is very similar in applied terms to semi-parametric hierarchical Bayes procedures which employ marginal posterior modes to estimate the hyperparameters. When he visited us to give a seminar at the University of Warwick during the early 1970s, Robbins, who always had a chip on his shoulder, scrawled graffiti on the walls of Kenilworth Castle.

    In his 1978 paper in the Annals of Statistics, Malcolm Hudson of Macquarie University showed how extensions of Stein’s method of integration by parts can be used to compute frequency risk properties for Bayes-Stein estimators of the parameters in multi-parameter exponential families. His methodology led to a plethora of theoretical suggestions in the literature e.g. for the simultaneous estimation of several Poisson means, many of which defy practical common sense.

    In 1978, Irwin Guttman, Rudolf Dutter and Peter Freeman proposed a Bayesian procedure in JASA for detecting outliers. In the context of the linear statistical model, they assumed a priori that any given set of k observations has a uniform probability of being spurious, and this led to very convincing posterior inferences. Their approach provided an alternative to the earlier Box-Tiao methodology in Box and Tiao’s 1973 book and contrasted with the conurbation of ad hoc procedures proposed by Vic Barnett and Toby Lewis in their 1978 magnum opus. It should of course be emphasised that outliers can provide extremely meaningful scientific information if you stop to wonder how they got there. They should certainly not be automatically discarded, since this may unfairly inflate the practical significance of your conclusions.

    Michael West [63] proposed a later Bayesian solution to the outlier problem. One wonders, however, whether outliers should only be considered during an informed preliminary exploratory data analysis and a later analysis of residuals.

    In 1979, Tony O’Hagan introduced the terms outlier prone and outlier resistant, in a dry though profound and far-reaching paper in JRSSB, when considering the tail behaviour of choices of sampling distribution which might lead to posterior inferences that are robust in the presence of outliers. He for example showed that the exponential power distributions proposed by Box and Tiao when pioneering Bayesian robustness in their 1973 book are in fact outlier prone, and do not therefore lead to sufficiently robust inferences.

    A number of later approaches to Bayesian robustness were reviewed and reported in 1996 in Robustness Analysis. This set of lecture notes was co-edited by Jim Berger, Fabrizio Ruggeri, and five further robust statisticians.

 
Fabrizio Ruggeri (ISBA President 2012)
 

EXPLANATORY NOTES: Laplace double exponential distributions and finite mixtures of normal distributions are OUTLIER PRONE even though they possess thicker tails than the normal distribution. They are therefore not appropriately robust to the behaviour of outliers when employed as sampling distributions. The generalised t-distribution is however OUTLIER RESISTANT. Nevertheless, I would often prefer to assume that the sampling distribution is normal, after using a preliminary informative data analysis to either substantiate or discard each of the outliers, or maybe to use a similarly outlier-prone skewed normal distribution. A (symmetric) generalised t-distribution is awkward to use if there is an outlier in one of its tails, and a skewed generalised t-distribution depends upon four parameters which are sometimes difficult to efficiently identify from the data.

    These issues are also relevant to the choices of prior distributions e.g. the choices of conditional prior distributions under composite alternative hypotheses when formulating Bayes factors. They influence the viability of Bayesian procedures which have been applied to legal cases and genomics.


Jim Berger was an excellent after dinner speaker at Bayesian conferences (he once teased Arnold Zellner to the point where Arnold’s lovely wife Agnes wondered where her husband had been the night before) and was only rivalled for his wit and humour by Adrian Smith.


In 1978, I published, as an extension of my 1973 histogram smoothing method, possibly the first ever non-linear prior informative method [64] for smoothing the common density f=f(t) of a sample of n independent observations for values of t falling in a bounded interval (a,b). I did this by reference to a logistic density transform g=g(t) of the density. From a Bayesian perspective, the prior information about g might be representable by a Gaussian process for g on (a,b) with mean value function μ=μ(t) and covariance kernel K=K(s,t), which are taken to be specified for all values of s and t in (a,b). For example, a second-order autoregressive process for g can be constructed by taking its first derivative g′ to possess the covariance structure of an Ornstein-Uhlenbeck stochastic process. Under such assumptions, it is technically difficult to complete a Bayesian analysis because the Radon-Nikodym derivative of the prior process is heavily dependent on the choice of dominating measure. For example, the function maximising the posterior Radon-Nikodym derivative of g will also depend upon the choice of dominating measure.
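    As a minimal sketch of the transform itself, assume that the logistic density transform takes the usual form f(t) = exp{g(t)} / ∫ exp{g(s)} ds, with the integral taken over (a,b). The quadratic choice of g, the grid size, and the function name below are hypothetical, and the integral is approximated by a simple midpoint rule rather than by anything used in [64].

```python
import math


def density_from_logit(g, a, b, n_grid=1000):
    """Map an unconstrained function g on (a, b) to a density f on a midpoint grid.

    Assumes f(t) = exp(g(t)) / integral of exp(g(s)) over (a, b),
    with the integral approximated by the midpoint rule.
    """
    h = (b - a) / n_grid
    mids = [a + (i + 0.5) * h for i in range(n_grid)]
    weights = [math.exp(g(t)) for t in mids]
    norm = sum(weights) * h          # midpoint approximation to the normalising integral
    return mids, [w / norm for w in weights], h


# Hypothetical g: any real-valued function yields a positive density integrating to one.
mids, f_vals, h = density_from_logit(lambda t: -8.0 * (t - 0.3) ** 2, 0.0, 1.0)
total_mass = sum(f_vals) * h

# Adding a constant to g leaves f unchanged, so g is only defined up to a constant.
_, f_shifted, _ = density_from_logit(lambda t: 5.0 - 8.0 * (t - 0.3) ** 2, 0.0, 1.0)
print(total_mass)
```

    The point of working with g rather than f is that g is unconstrained, so that a Gaussian process or a roughness penalty can be placed on it directly, while positivity and the unit-integral constraint on f are satisfied automatically.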

    I therefore adopted a prior likelihood approach instead, by taking μ=μ(t) to represent the observed sample path for t in (a,b) of a Gaussian process with mean value function g=g(t) and covariance kernel K. Under this assumption, the posterior likelihood of g given the ‘prior sample path’ μ, may be maximised numerically, via the Calculus of Variations, yielding a posterior maximum likelihood estimate for g which is free from a choice of measure. In my paper, I completed this task, assuming the special covariance structure indicated above.

    The prior log-likelihood functional of g can be expressed in terms of a roughness penalty involving integrals of the squares of the first and second derivatives of g-μ. Maximizing the posterior log-likelihood functional of g using the Calculus of Variations, yields a fourth order non-linear differential equation, which I converted into a non-linear Fredholm integral equation for g. I used an iterative technique to solve this equation, thus providing twice differentiable estimates which depended upon the choices of μ and two prior smoothing parameters.

    With a small modification, the same methodology can be used to smooth the log of the intensity function of a non-homogeneous Poisson process. My approach bore similarities to the procedures that had been employed by Richard Bellman and others to solve boundary value problems in mathematical physics.

    When proposing the vote of thanks during the December 1977 meeting of the Royal Statistical Society, Peter Whittle showed that my estimates were also ‘limiting Bayes’, in the sense that they could also be derived by limiting arguments from non-linear equations for posterior mode vectors which solved a discretized version of the problem, and he generalised my non-linear Fredholm equation to depend upon an unrestricted choice of the prior mean value function μ and covariance kernel K.

 
Peter Whittle F.R.S.
 

    I had previously discarded this general approach because the posterior mode functional is not uniquely defined in the limit. Nevertheless, my special-case equations yielded the first ever effectively Bayesian methodology for the non-linear smoothing of a sampling density. This was over twenty years after Peter Whittle published his linear Bayes approach to this problem.

    When seconding the vote of thanks, Bernie Silverman contrasted my approach with his density estimation/roughness penalty methodology for investigating cot death which he published that year in Applied Statistics.

 
 
Sir Bernard Silverman F.R.S.
 

    In 1977, Clevenson and Zidek had published a linear Bayes procedure in JASA for smoothing the intensity function of a non-homogeneous Poisson process, which didn’t look that different to Whittle’s 1958 density estimation method. If the arrival times in a Poisson process during a bounded time interval (a,b) are conditioned on the number of arrivals n, then they assume the probability structure of the order statistics from a distribution whose density appropriately normalises the intensity function of the Poisson process. The two smoothing problems are therefore effectively equivalent.
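    The equivalence can be sketched in a few lines. Conditional on n arrivals, the arrival times are simulated below as n independent draws from the normalised intensity, and are then sorted. This is only an illustration: the linear intensity 2 + 6t on (0,1), the bound, and the function name are hypothetical, and rejection sampling is used simply as a convenient way of drawing from the normalised intensity.

```python
import random


def sample_arrivals_given_n(intensity, a, b, n, bound, rng):
    """Draw the n arrival times of a Poisson process on (a, b), given exactly n arrivals.

    Each time is an independent draw from the density intensity(t) / integral(intensity),
    obtained by rejection sampling under the constant envelope `bound` >= intensity(t).
    Sorting the draws gives the order statistics, i.e. the arrival times.
    """
    times = []
    while len(times) < n:
        t = rng.uniform(a, b)
        if rng.uniform(0.0, bound) < intensity(t):
            times.append(t)
    return sorted(times)


rng = random.Random(2024)
arrivals = sample_arrivals_given_n(lambda t: 2.0 + 6.0 * t, 0.0, 1.0, 5000, 8.0, rng)

# The normalised intensity (2 + 6t) / 5 has mean 3/5 on (0, 1).
sample_mean = sum(arrivals) / len(arrivals)
print(sample_mean)
```

    Any density estimate computed from the sorted times therefore estimates the intensity function up to its normalising constant.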

    The hierarchical distributional assumptions in my R.S.S. paper introduced log-Gaussian doubly stochastic Poisson processes into the literature. David Cox showed a keen interest at the time e.g. during one of his University of London seminars, and he saved my bacon during my application for tenure at the University of Wisconsin in 1979 when there was an unfair attempt to 'destroy me' from out of the British Bayesian establishment after some shenanigans by an untenured UW assistant professor.

    Peter Lenk of the University of Michigan in Ann Arbor has published various approximate fully Bayesian developments, including the material in his 1985 Savage prize-winning Ph.D. thesis, of my density estimation methodology. Daniel Thorburn reported his approach to the problem, but with similar prior assumptions to my own, in his 1984 monograph A Bayesian Approach to Density Estimation.

 

Peter Lenk

 

    In 1993 Richard Silver, Harry Martz and Timothy Wallstrom of the Los Alamos National Laboratory in New Mexico described another Bayesian approach [32] which refers to the theories of Quantum Physics. See also Timothy Wallstrom’s 1993 book Quantum Statistical Inference. By approximating the logistic density transform by a linear combination of p specified functions, Silver et al constrain the sampling distribution to belong to the p-parameter exponential family.

    Chong Gu has extended various special cases of my roughness penalty formulation by reference to the theory of non-linear smoothing splines and has even shown that the computations are feasible in the multivariate case, see [65]. Dennis Cox and Finbarr O’Sullivan [66] address the asymptotic properties of these procedures.

    In hindsight, it would have been preferable, in terms of both mathematical rigour and posterior smoothing, to take the prior covariance kernel K to adopt the infinitely differentiable exponential quadratic form that was assumed by Tony O’Hagan for an unknown regression function in his 1978 JRSSB paper.

    I did this a decade or two later with John Hsu in our 1997 paper in Biometrika, which concerned the semi-parametric smoothing of multi-dimensional logistic regression functions and an analysis of the ‘mouse exposure’ data. That was the paper which confirmed John’s tenure at UCSB. He developed the hierarchical Bayes bit using some neat algebraic tricks.

    In 2012, Jaakko Riihimäki and Aki Vehtari managed to do the same thing for bivariate density estimation in their paper ‘Laplace Approximations for Logistic Process Density Estimation and Regression’ in arXiv e-prints. The authors employed very accurate conditional Laplacian approximations on a grid, which they validated using MCMC. Their generalisation of my 1978 method is the current state of the art in Bayesian semi-parametric univariate and bivariate density estimation. Congratulations to the authors, who are also in the machine learning business.


Also in 1978, the Dutch economists Teunis Kloek and Herman Van Dijk used Importance Sampling simulations in a seminal paper in Econometrica to calculate the Bayes estimates for the parameters in various equation systems. Importance sampling is a very efficient variation of Monte Carlo which is guaranteed to converge to the correct solution with a bounded standard error of simulation, if the importance function is chosen appropriately. It was used during the 1980s to calculate the Bayesian solutions to many multi-parameter problems which were algebraically intractable. See the eminent Israeli economist Reuven Rubinstein’s 1981 book Simulation and the Monte Carlo Method. For further applications in Economics see, for example, the 1988 paper by John Geweke in the Journal of Econometrics.

 
Herman Van Dijk
 

    During the 1990s, Importance Sampling was largely superseded by Markov Chain Monte Carlo (MCMC), despite the notorious convergence difficulties experienced by MCMC’s users and sufferers. When applying Importance Sampling, simulations from a generalised multivariate t-distribution with appropriately thick tails will produce excellent convergence when addressing a wide range of non-linear models.
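    A univariate toy version of this idea can be sketched as follows. It is not the Kloek-Van Dijk procedure: it simply assumes a binomial likelihood with 7 successes in 10 trials and a uniform prior on the success probability, and estimates the posterior mean of the log-odds by self-normalised importance sampling from a t-distribution with 5 degrees of freedom. The data, the location and scale of the importance function, and all function names are hypothetical.

```python
import math
import random


def log_target(theta, y=7, n=10):
    """Unnormalised log posterior density of the log-odds theta.

    Binomial likelihood with a uniform prior on p, so that p ~ Beta(y + 1, n - y + 1)
    and the density of theta is proportional to exp((y+1) theta) / (1 + exp(theta))^(n+2).
    """
    softplus = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))  # stable log(1 + e^theta)
    return (y + 1) * theta - (n + 2) * softplus


def draw_t(rng, df, loc, scale):
    """Draw from a location-scale t-distribution as normal / sqrt(chi-squared / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return loc + scale * z / math.sqrt(chi2 / df)


def log_t_kernel(theta, df, loc, scale):
    """Log of the t density up to an additive constant (enough for self-normalised weights)."""
    x = (theta - loc) / scale
    return -0.5 * (df + 1) * math.log1p(x * x / df)


def importance_mean(n_draws=20000, df=5, loc=math.log(2.0), scale=0.8, seed=1):
    """Self-normalised importance sampling estimate of E(theta | data), with effective sample size."""
    rng = random.Random(seed)
    draws = [draw_t(rng, df, loc, scale) for _ in range(n_draws)]
    log_w = [log_target(t) - log_t_kernel(t, df, loc, scale) for t in draws]
    top = max(log_w)                               # subtract the maximum before exponentiating
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    mean = sum(wi * t for wi, t in zip(w, draws)) / total
    ess = total ** 2 / sum(wi * wi for wi in w)
    return mean, ess


estimate, ess = importance_mean()
# Exact value for comparison: digamma(8) - digamma(4) = 1/4 + 1/5 + 1/6 + 1/7.
exact = 1 / 4 + 1 / 5 + 1 / 6 + 1 / 7
print(estimate, exact, ess)
```

    Because the t tails are thicker than those of the target, the importance weights stay bounded, which is what keeps the simulation error under control. The effective sample size gives a rough check on how well the importance function matches the posterior.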


In 1979, Jose Bernardo of the University of Valencia published two pioneering papers about vague ‘reference priors’, which maximise Lindley’s measure of the amount of information in the statistical experiment. In his JRSSB invited paper, Jose demonstrated the importance of the corresponding reference posteriors in Bayesian inference. In his Annals of Statistics paper, he expressed Lindley’s measure in terms of Kullback-Leibler divergence, and thereby justified the measure in terms of a type of expected utility.

    Jose’s 1979 papers spawned a large literature on reference priors, including several joint articles with Jim Berger. These procedures can, of course, only be entertained as objective if the choice of sampling model is itself objectively justifiable.

    While the derivations of reference priors can amount to separate, quite tedious, research problems when analysing some sampling models, these vague priors do lead to quite convincing alternatives to Jeffreys invariance priors. They usually lead to similarly sensible posterior inferences.

    Also in 1979, Chan-Fu Chen of SUNY at Albany published his hierarchical Bayesian procedure for the estimation of a multivariate normal covariance matrix in JRSSB. He employed a continuous mixture of inverted Wishart distributions in the prior assessment, and his hierarchical Bayes estimators shrunk the sample covariance matrix towards a matrix assuming intraclass form. Chan-Fu applied his method to stochastic multiple regression analysis, to excellent effect.


An event of immense significance in the history of Bayesian Statistics took place as the decade drew to a close, organized by Jose Bernardo (the local co-ordinator), Morris De Groot, Dennis Lindley and Adrian Smith under the auspices of the University of Valencia and the Spanish Government. The event took place during May to June 1979 and was subsequently named the First Valencia International Meeting of Bayesian Statistics. Most of the leading Bayesians in the world met in the Hotel Las Fuentes in Alcossebre, a beautiful resort with cold water springs on the beach, on the Spanish Mediterranean Coast between Valencia and Barcelona. There ensued a learned discourse between some of the best minds in the discipline, intermingled with swimming, partying, and romps on the beach. All in all, a jamboree to be reckoned with!

 
The beach at Alcossebre
 

    The Proceedings of Valencia 1 were published by the four organizers, and the invited speakers included Hirotugu Akaike, Art Dempster, George Barnard, I. Richard Savage (the brother of the late Jimmie Savage), George Box, James Press, Seymour Geisser, Jack Good, and Arnold Zellner.

I. Richard Savage was the brother of the late Jimmie Savage, and an impressive, accomplished, and all-inclusive man. His research interests at Yale included public policy and Bayesian statistics.

 
 

I. Richard Savage   Leonard 'Jimmie' Savage
 

Twenty-five papers, prepared by a total of thirty-five authors and co-authors, were presented at Valencia 1 and discussed in the following fourteen sessions:

1. Foundations of Subjective Probability and Decision Making

2. Sequential learning, discontinuities and change

3. Likelihood, sufficiency and ancillarity

4. Approximations

5. Regression and time series

6. Bayesian and non-Bayesian conditional inference

7. Personal and inter-personal ethics

8. Sensitivity to models

9. Improving judgements using feedback

10. Predictive sample reuse

11. Beliefs about beliefs

12. Bayesian non-parametric theory

13. Coherence of models and utilities

14. Hypothesis Testing


    The highlights included (1) Bruce Hill’s presentation of his paper ‘On finite additivity, non-conglomerability, and statistical paradoxes’, during which he invited anybody who didn’t agree with his axiomatics to explain how it felt to be a ‘sure loser’; (2) Stephen Fienberg’s trivialisation and demolition of Jeff Harrison’s convoluted exposition about ‘Discontinuity, decision and conflict’; (3) Morris De Groot’s paper about improving predictive distributions; and (4) Dennis Lindley’s paper on Laplacian approximations to posterior densities. Lindley’s paper preceded the much more numerically accurate conditional Laplacian approximations which were developed by other Bayesians during the 1980s and, in the context of hierarchical modelling, synthesised into computer packages during the early part of the twenty-first century as a valid alternative to MCMC.

 

George E. P. Box

 

    George Box and Herbert Solomon provided us with the most memorable moment of the entire conference during the farewell dinner, when they sang ‘Our Theorem is Bayes Theorem’ to the tune of ‘There’s No Business Like Show Business’. Here is one of their refrains:

 

There’s no theorem like Bayes theorem

Like no theorem we know

Just recall what Pearson said to Neyman

Emerging from a region of type B

“It’s difficult explaining it to Lehmann

I fear it lacks Bayes’ simplicity”

There’s no haters like Bayes haters

They spit when they see a prior

Be careful when you offer your posterior

They’ll try to kick it through the door

But turn the other cheek if it is not too sore

Of error they may tire

 
 

I became so carried away by the festivities that I downed several shots of Cointreau, only to puke up my seafood an hour or so later, all over the beach.

    The Valencia International Meetings have beneficially influenced Bayesian Statistics ever since. The ninth in Jose Bernardo’s truly magnificent series was held in Alicante, Spain during June 2010.

 

Bayesians at Play (Archived from Brad Carlin’s Collection)

 
 

Author’s Notes:  My invited paper [16] at the first Valencia conference was entitled ‘The roles of inductive modelling and coherence in Bayesian Statistics’. In particular, I emphasised that, while axiomatized Bayesian inference is arguably appropriate when the sampling model is specified, it is not necessary to be ‘coherent’ when the model is unknown, and that it is very important to refer to your inductive thought processes when considering possible models in relation to the data and the ‘scientific background of the data’.

    I was also perhaps the first ever Bayesian to publicly challenge the relevance of De Finetti-style axiom systems to Bayesian model-based inference, since I regarded these, rightly or wrongly, as at best tautologous with what they were implying, i.e. that any ‘coherent’ inferential procedure should be equivalent to a Bayesian procedure which assumes a proper prior distribution for the unknown parameters in the model. In 1979, however, I confined my attention to the strong assumptions inherent in the axiom system described on pages 71-76 of De Groot’s 1970 book, which calibrated the assessor’s preferences with the potential outcome of an objective auxiliary experiment.

    My paper was well received by the majority of the audience, most of whom were also experienced applied statisticians, and I received favourable comments from Jay Kadane, Bill DuMouchel, and Jim Dickey, and much later from Mark Steel.  However, Dennis Lindley and Adrian Smith seemed unduly upset and reacted a touch petulantly. I felt lucky not to be kicked in the posterior!

    However, I felt compelled by British Bayesian politics to leave for Wisconsin in August 1979, and I did not return to Europe in academic terms until some sixteen years later. In January 1981, my health was severely damaged by further, quite unanticipated, academic intrigue from o’er the Atlantic, and my career and publication record took a nosedive. Thank goodness I had tenure!

    The relevance of the De Finetti-style axiom systems will be considered further in the next chapter when I discuss the implications of the material reviewed in an article by Peter Fishburn.


My visits to Iowa have always been interesting, and quite dramatic. I held summer research appointments in Iowa City, at the American College Testing Program, in 1971 and 1972, and at the Lindquist Center of Measurement in 1984. In early 1982, I paid a seminar visit to the Department of Statistics of the University of Iowa, at Bob Hogg’s invitation. In 1984, I attended the conference celebrating the 50th anniversary of the creation of the Statistical Laboratory in Ames, Iowa by George Snedecor. I travelled there from Madison with Michael Hamada, Kam-Wah Tsui and Winson Taam and met my wonderful undergraduate mentor David Cox for the last time. Some of us also endeavoured to discover the only street in Ames which sported a bar, an Australian one as I remember. Then, in 1992 I was the expert witness in the since celebrated ‘Rosie and the ten construction workers’ disputed paternity case in Decorah, Iowa.

    I have also visited the eerie Mississippi towns of Marquette and McGregor, Iowa, on occasion, with my relatives. McGregor is set way back in time, as if out of a Hitchcock or Stephen King novel. In Marquette, we stayed in the hotel which hangs from a cliff way above the beautiful Mississippi and the historic pink elephant below. When, in 1964, Iowa passed the ‘liquor by the drink’ law, a group of entrepreneurs purchased a life-size grey elephant from the Grand Old Party and painted it pink, before opening the Pink Elephant Supper Club. In contrast, Iowa City always seemed to be set in a weird sort of ultra-American quasi-future, and Ames was in some wasteland.

 
 

In September 2013, I recovered my 1973 Ph.D. thesis Bayesian methods for the Simultaneous Estimation of Several Parameters from a box at the University of Edinburgh where it had been stored since my retirement in 2001. To my surprise, I discovered a column of numbers in Table 2.4.2 of my thesis which should have been used to modify the posterior modal estimates reported in my 1972 Biometrika paper, but which I delayed reporting until several years later. See Appendix A for further discussions of the contents of my Ph.D. thesis and of the applications later developed by several investigators in Animal Sciences, including Daniel Gianola, Jean-Louis Foulley and Rob Tempelman.

    Daniel Gianola has received six honorary doctorates for his pioneering Bayesian contributions to animal breeding. The Bayesian approach he and Rohan Fernando presented in 1984 to the 76th Annual Meeting of the American Society of Animal Science is regarded as seminal in animal breeding.

 
Daniel Gianola
 

    Dennis Lindley later praised and applauded the Gianola-Fernando discussion paper in a personal letter to Daniel Gianola. In his usual inimitable style, Dennis wrote,

Some of us Bayesians, despairing of our fellow statisticians, feel that the Bayesian 21st century will come about through the work of scientists, like yourselves, who see the sense in the Bayesian position. With such an admirable exposition as you describe, the advance may well occur in that way.

 
 
 
 
  © Thomas Hoskyns Leonard, 2014