Combining Human Judgment and Multiple Regression

Richard B. Darlington

Prediction is one of the fundamental purposes of science, and indeed the need to predict or estimate unknown values arises throughout life. Medical diagnosis is really a prediction problem, since the future usually reveals which diagnosis was correct. Predicting what consumers will buy is fundamental to virtually any business. Predicting the number of people who will retire or apply for unemployment benefits or welfare is necessary for rational public policy. Prediction arises in many other contexts as well.

An enormous amount of prediction is made by human judgment. Members of a parole board read an inmate's case history and attempt to predict whether he will commit more crimes if released. Physicians study a list of symptoms and circumstances to make a diagnosis. However, extensive research over the last 40 years points to a different approach. Suppose you have a database of at least 50 previous cases for which you have the data used in making the predictions, you know the ultimate outcome for those cases, and those cases are representative of the current sample of cases. Then you can enter those previous cases into a regression or a variant of regression, and generally use the results of that analysis to make more accurate predictions of future cases than can be made by human judgment alone. This research has been summarized most importantly by Meehl (1954), Sawyer (1966), Wiggins (1973, Chapter 5), and a four-author symposium (Theodore R. Sarbin, Paul E. Meehl, Robert R. Holt, and Hillel J. Einhorn) published on pages 362-395 of the Journal of Personality Assessment (vol. 50, no. 3) in 1986. Most of these authors are psychologists, but they summarize some 90 studies covering an enormous range of topics, from meteorology to medicine to economics. Since statistically based predictions can be made by computers, some of this work has gone under the name of artificial intelligence in the last decade.

This research does not suggest human judgment is generally unnecessary; rather it indicates that the most accurate predictions generally result from a predictive system in which human judgment and statistical analysis are mixed according to prescribed rules. This document describes some of those rules.

One of the central conclusions from this research is that unaided human judgment is surprisingly unreliable. Two parole boards given the same prisoner's folder will often reach opposite conclusions about releasing the prisoner. Or a panel of physicians given the same folder two months apart may come up with different diagnoses. This unreliability is one of the major factors lowering the accuracy of prediction by human judgment. If you ask the same subject-matter experts (e.g. parole-board members or physicians) to write down a set of mechanical rules a clerk could follow (e.g., a point system in which earning a college degree in prison counts for 20 points and getting in a fight in prison counts for -5 points, etc.), those mechanical rules often predict new cases more accurately than case-by-case judgments made by the very experts who created the rules. Or if a statistician takes a set of past predictions made by the experts, even with no information about the ultimate accuracy of those predictions, and uses regression or a variant to predict not ultimate outcome but merely the experts' predictions, the formula derived from that regression often predicts the ultimate outcomes for new cases more accurately than the experts themselves can do for those same cases. The reason for this superiority of mechanical prediction methods is that whatever its disadvantages, a mechanically-applied prediction formula does at least make the same prediction twice when given the same input information.

The 90-odd studies summarized in this literature overwhelmingly favored mechanical prediction over unaided human judgment even though few if any used the best and most recent methods of mechanical prediction. This document describes one of several new methods for combining human judgment with regression to yield predictions that are generally more accurate than can be achieved from either method alone.

The existence of these new methods leads me to a conclusion somewhat different from that of Robyn Dawes (see Dawes 1988 pp. 205-212 or especially Dawes 1979). After reviewing the literature cited above, Dawes advocated simply using human experts to create a prediction formula comparable to the mechanical parole formula mentioned above. He argued that this method generally does at least as well as regression-based methods, without requiring the sample data that regression requires. Dawes called his prediction formulas improper linear models. However, Dawes did not consider the use of new methods for combining human judgment with regression to yield predictions more accurate than can be obtained by either method alone. Such methods are described in detail by Darlington (1978). This document describes what is perhaps the simplest of these methods, which can be performed with an ordinary regression program.

The method to be described can best be understood by considering first another method that I do not recommend. I ask you to take the time to study the first five steps of this unrecommended method, because the recommended method keeps those five steps and changes only what follows them. The unrecommended method proceeds as follows:

1. Just as Dawes recommends, use subjective judgment to weight the predictor variables to predict the criterion variable Y. Let G, for "guess", denote the variable thus formed, and let bG denote the set of weights used to create G.

2. Use simple regression to predict Y from G. Let kG denote the simple regression slope found in that regression.

3. Multiply the entries in bG by kG. This does not change the relative sizes of the bG values, but does adjust them so that their overall sizes conform to simple regression. That is, after this multiplication there will be no tendency for either the highest or lowest values in G to either overestimate or underestimate Y. Let kG*bG denote this set of adjusted guessed weights.

4. Find the residuals e in the simple regression of step 2.

5. Use multiple regression to predict e from the set of original predictors. Let be denote the set of weights thus found.

6x. Add the weights found in steps 3 and 5, producing final weights kG*bG+be.

Step 6x is included here for expository purposes only; in practice I recommend replacing it by steps 6 and 7 below. The problem with step 6x is that its final weights kG*bG+be exactly equal the weights produced by ordinary multiple regression. Thus the extra work of creating subjective weights is entirely wasted. This problem can be avoided by multiplying the be weights by a constant ke before adding them to the weights found in step 3. Thus the final weights are kG*bG+ke*be. The constant ke is always less than 1, so it gives less weight to the multiple regression in step 5, thereby leaving the final weights closer to the adjusted guessed weights found in step 3. In fact if the guessed weights were fairly accurate then ke may well turn out to be 0, thus leaving the final weights exactly equal to the weights of step 3.
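The equivalence between step 6x and ordinary multiple regression can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original method description: the data are synthetic, the seed and the guessed weights are arbitrary, and the helper name ols is hypothetical. Every regression here includes an intercept, which is an assumption of the sketch.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (intercept, slopes)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
N, P = 60, 4
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=N)

bG = np.array([1.0, 1.0, -1.0, 0.0])    # step 1: guessed weights (arbitrary here)
G = X @ bG                              # step 1: the "guess" variable G
a, slope = ols(G.reshape(-1, 1), y)     # step 2: simple regression of Y on G
kG = slope[0]
adjusted = kG * bG                      # step 3: adjusted guessed weights
e = y - (a + kG * G)                    # step 4: residuals of the simple regression
_, be = ols(X, e)                       # step 5: regress residuals on the predictors
weights_6x = adjusted + be              # step 6x: kG*bG + be

_, b_ols = ols(X, y)                    # ordinary multiple regression of Y on X
print(np.allclose(weights_6x, b_ols))   # the two sets of weights should agree
```

The agreement is algebraic rather than a property of this particular sample: regression weights are linear in the dependent variable, so regressing e = Y - a - kG*G on the predictors yields the ordinary regression weights minus kG*bG.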

A formula for ke was suggested by Stein (1960). Letting N denote the sample size, P denote the number of predictor variables, and Re denote the multiple correlation found in step 5, Stein's formula is

ke = 1 - [(P-2)/(N+2-P)]*[(1-Re^2)/Re^2]

This formula may yield ke < 0; if so, then Stein suggests setting ke = 0. Therefore we can replace step 6x above by the following two steps:

6. Compute ke = 1 - [(P-2)/(N+2-P)]*[(1-Re^2)/Re^2]; if ke < 0 then set ke = 0.

7. Compute the final weights b = kG*bG+ke*be.
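The seven steps can be sketched as a single function. This is an illustration under stated assumptions rather than a definitive implementation: it uses Python with NumPy, the function names fit and combine are hypothetical, the data are synthetic, and every regression includes an intercept. The text does not say how to set the intercept of the final prediction formula, so the sketch returns only the predictor weights.

```python
import numpy as np

def fit(X, y):
    """Least squares with an intercept; returns (intercept, slopes, R squared)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef[0], coef[1:], r2

def combine(X, y, bG):
    """Blend guessed weights bG with regression weights; returns (b, kG, ke)."""
    N, P = X.shape
    G = X @ bG                                # step 1: guess variable
    a, slope, _ = fit(G.reshape(-1, 1), y)    # step 2: simple regression of Y on G
    kG = slope[0]
    e = y - (a + kG * G)                      # step 4: residuals
    _, be, Re2 = fit(X, e)                    # step 5: regress residuals on predictors
    if not Re2 > 0:                           # guards against Re^2 = 0 or an undefined value
        ke = 0.0
    else:                                     # step 6: Stein's constant, truncated at 0
        ke = max(0.0, 1.0 - (P - 2) / (N + 2 - P) * (1.0 - Re2) / Re2)
    return kG * bG + ke * be, kG, ke          # steps 3 and 7: final weights

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
N, P = 200, 4
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=N)

good_guess = np.array([2.0, 1.0, -1.0, 0.0])  # roughly the right pattern
bad_guess = np.array([0.0, 0.0, 0.0, 1.0])    # all weight on an irrelevant predictor

for name, bG in [("good", good_guess), ("bad", bad_guess)]:
    b, kG, ke = combine(X, y, bG)
    print(name, "ke =", round(ke, 3), "weights =", np.round(b, 3))
```

With the bad guess, the residuals remain highly predictable, so ke should come out close to 1 and the final weights should sit close to the ordinary regression weights.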

To understand this method, suppose first that the guessed weights are very accurate. Then the residuals found in step 4 will not be predictable from the predictors, so Re will be low. But the lower Re, the lower ke, and so the closer the final weights will be to the adjusted guessed weights of step 3. At the other extreme, if the guessed weights are highly inaccurate, then the residuals of step 4 will be predictable, Re will be high, ke will be near 1, and the result of steps 6 and 7 will approximate the result of step 6x. And as mentioned earlier, use of step 6x is equivalent to ordinary multiple regression. Thus the output of step 7 is in effect a weighted average of the adjusted guessed weights and ordinary regression weights, where the relative importance of the two sets of weights is determined by the accuracy of the guessed weights as measured by Re.

The formula for ke is also structured reasonably with respect to N and P. Small N and/or large P lowers ke, thus raising the importance of the guessed weights relative to the computed regression weights. That's reasonable because small N and large P raise the standard errors of regression weights, so they should be given lower importance under those conditions.
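The behavior of ke with respect to Re, N, and P can be checked with a little arithmetic. The sketch below, in Python, evaluates the step 6 formula at a few illustrative values; the function name stein_ke and the particular values of N, P, and Re^2 are assumptions chosen only for the example.

```python
def stein_ke(N, P, Re2):
    """ke from step 6, truncated at 0. Re2 is the squared multiple correlation."""
    if Re2 <= 0:
        return 0.0
    ke = 1.0 - (P - 2) / (N + 2 - P) * (1.0 - Re2) / Re2
    return max(0.0, ke)

# Illustrative values only: vary Re^2 first, then N, then P.
cases = [(50, 6, 0.05), (50, 6, 0.20), (50, 6, 0.60), (200, 6, 0.20), (50, 14, 0.20)]
for N, P, Re2 in cases:
    print(f"N={N:3d}  P={P:2d}  Re^2={Re2:.2f}  ke={stein_ke(N, P, Re2):.3f}")
```

At N = 50 and P = 6, raising Re^2 from 0.05 to 0.20 to 0.60 moves ke from 0 to about 0.65 to about 0.94. Holding Re^2 at 0.20, raising N to 200 lifts ke to about 0.92, while raising P to 14 drives it back to 0.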

References

Darlington, Richard B. (1978). Reduced-variance regression. Psychological Bulletin, 85, 1238-1255.

Dawes, Robyn M. (1979). The robust beauty of improper linear models. American Psychologist, 34, 571-582.

Dawes, Robyn M. (1988). Rational Choice in an Uncertain World. San Diego: Harcourt Brace Jovanovich.

Meehl, Paul E. (1954). Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. Minneapolis: University of Minnesota Press.

Sawyer, Jack (1966). Measurement and prediction, clinical and statistical. Psychological Bulletin, 66, 178-200.

Stein, Charles (1960). Multiple regression. In Ingram Olkin et al. (Eds.), Contributions to Probability and Statistics. Stanford, Calif.: Stanford Univ. Press.

Wiggins, Jerry S. (1973). Personality and Prediction: Principles of Personality Assessment. Reading, Mass.: Addison-Wesley.