How Many Covariates to Use in Randomized Experiments?

Richard B. Darlington
Copyright © Richard B. Darlington. All rights reserved.

Sometimes a great deal of background information is available about participants in experiments, so that a great many variables might possibly be used as covariates while testing the effect of some crucial independent variable. How many covariates should actually be used? This note suggests an answer for the case in which participants are randomly assigned to experimental treatments. First we briefly consider the case without random assignment.

A Brief Digression on Nonrandomized Experiments

One strategy that is sometimes suggested is that nonsignificant covariates can be dropped from an analysis. A systematic way of doing this is to first add all possible covariates (or at least a very large number), then drop the least significant one. Then recompute the significance of the remaining regressors and again drop the least significant one. Repeat this process until all remaining covariates are significant.
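For concreteness, the following sketch shows what this backward-deletion strategy might look like in Python, using numpy and statsmodels; the library choice, the function name, and the .05 cutoff are illustrative assumptions rather than part of this note.

    import numpy as np
    import statsmodels.api as sm

    def backward_delete(y, Z, alpha=0.05):
        """Drop the least significant column of Z until all remaining ones are significant."""
        cols = list(range(Z.shape[1]))
        while cols:
            fit = sm.OLS(y, sm.add_constant(Z[:, cols])).fit()
            pvals = fit.pvalues[1:]          # skip the intercept
            worst = int(np.argmax(pvals))    # index of the least significant covariate
            if pvals[worst] <= alpha:
                break                        # everything left is significant
            del cols[worst]
        return cols                          # column indices of the retained covariates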

Strategies like this have a very serious flaw in nonrandomized experiments. The very fact that a covariate correlates highly with the independent variable will tend to make both variables nonsignificant due to collinearity. But under the proposed strategy, the covariate will then be dropped. Thus the strategy drops precisely the covariates that are most important to keep--those that correlate highly with the independent variable. When an independent variable and covariate correlate highly and both are nonsignificant, the correct conclusion is that the collinearity has made it impossible to disentangle their effects, but the flawed strategy will resolve the uncertainty in favor of the independent variable. Thus if the substantive nature of a variable suggests that it should be used as a covariate, then it should not be dropped because of nonsignificance.
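The following small simulation (an illustration only, not part of this note; numpy and statsmodels assumed) shows the problem: the covariate Z is built to correlate about .99 with X, and even though Y truly depends on X, both regressors tend to come out with modest t values.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    Z = rng.normal(size=n)
    X = Z + 0.1 * rng.normal(size=n)     # X is nearly collinear with Z
    Y = 1.0 * X + rng.normal(size=n)     # the true effect runs entirely through X

    fit = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
    print(fit.tvalues[1:])               # both t values are typically unimpressive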

A Proposed Procedure for Randomized Experiments

The situation is very different under random assignment. Suppose a regression has a randomly-assigned independent variable X and many possible covariates. So long as the regression has some residual degrees of freedom, any number of covariates will lead to a valid test of the null hypothesis that the treatment has no effect. But using too many covariates may lower the precision of the estimate of the treatment's effect, and so may using too few. Is there some way to choose a subset of covariates that will lead to the most precise possible estimate of the treatment's effect? This note suggests such a procedure.

Specifically, the suggestion is to predict the dependent variable Y from the covariates alone, first omitting X from the regression. Use the backward deletion procedure mentioned above. That is, delete the least significant covariate from the regression, recompute the regression, again delete the least significant covariate, and keep repeating the process until all the t values for individual regressors are 1.42 or above, in absolute value. This corresponds closely to a familiar rule that the two-tailed significance levels should all be .15 or below. However, advocates of that rule sometimes suggest that a different rule should be used if high collinearity is observed, and I suggest using a cutting point of 1.42 regardless of collinearity.
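A minimal sketch of this procedure in Python (numpy and statsmodels assumed; the function names are illustrative, not part of this note): covariates are selected from regressions of Y on the covariates alone, and the randomly assigned treatment X is added only after selection is finished.

    import numpy as np
    import statsmodels.api as sm

    def select_covariates(y, Z, t_cut=1.42):
        """Backward deletion on Y ~ covariates only; the treatment X is deliberately excluded."""
        cols = list(range(Z.shape[1]))
        while cols:
            fit = sm.OLS(y, sm.add_constant(Z[:, cols])).fit()
            tvals = np.abs(fit.tvalues[1:])      # skip the intercept
            worst = int(np.argmin(tvals))
            if tvals[worst] >= t_cut:
                break                            # all remaining covariates pass the 1.42 cutoff
            del cols[worst]                      # delete the least significant covariate
        return cols

    def treatment_fit(y, x, Z, t_cut=1.42):
        """Final model: Y on X plus the covariates that survived the selection."""
        keep = select_covariates(y, Z, t_cut)
        design = sm.add_constant(np.column_stack([x] + [Z[:, j] for j in keep]))
        return sm.OLS(y, design).fit()           # the coefficient on x estimates the treatment effect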

Derivation of the Procedure

The squared standard error of the regression slope b(X) equals MSE/[N*Var(X)*Tol(X)], where the four entries in the expression are respectively the mean squared error, the sample size, the variance of X, and the tolerance of X. Tol(X) is defined as 1 - RX^2, where RX is the multiple correlation predicting X from the covariates. Changing the number of covariates in a regression does not change N or Var(X), so for present purposes we can think of the standard error of b(X) as being determined by the ratio MSE/Tol(X). The problem then is to minimize that ratio.
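As a quick numerical check of this formula (an illustration only; numpy and statsmodels assumed, with Var(X) read as the variance using divisor N), the standard error computed from MSE/[N*Var(X)*Tol(X)] can be compared with the one reported by the regression routine:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, p = 120, 4
    Z = rng.normal(size=(n, p))                  # covariates
    x = rng.normal(size=n)                       # the independent variable
    y = 0.5 * x + Z @ rng.normal(size=p) + rng.normal(size=n)

    full = sm.OLS(y, sm.add_constant(np.column_stack([x, Z]))).fit()
    tol = 1 - sm.OLS(x, sm.add_constant(Z)).fit().rsquared     # Tol(X) = 1 - RX^2
    se_by_formula = np.sqrt(full.mse_resid / (n * np.var(x) * tol))
    print(se_by_formula, full.bse[1])            # the two standard errors agree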

By Monte Carlo experiments, I have found that it is invalid to have the independent variable in a regression while you are trying to determine which covariates to delete--because then the covariates that happen, by chance, to correlate highly with X will tend to be deleted, while those are precisely the ones it is important to leave in. Therefore we must consider regressions predicting the dependent variable Y from the covariates alone. Once we have used those regressions to determine which covariates to include, then we can add X to the model.

Although we are working temporarily with regressions excluding X, we can nevertheless estimate what Tol(X) would be if X were in the model. That's because random assignment assures that RX is zero in the population. When a true R is zero, then E(R^2) = P/(N-1). Therefore E(Tol(X)) = E(1 - RX^2) = 1 - E(RX^2) = (N-P-1)/(N-1). Therefore we can say that the expected tolerance for a randomly-assigned treatment variable is (N-P-1)/(N-1), where P is the number of covariates. Since N-P-1 = df, where df denotes the residual degrees of freedom in the regression, we can write the expected tolerance as df/(N-1).
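This expectation is easy to check by simulation (an illustration only; numpy and statsmodels assumed): regress a randomly generated X on P unrelated covariates many times and average the resulting R^2 values.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n, p, reps = 50, 8, 2000
    r2 = [sm.OLS(rng.normal(size=n),
                 sm.add_constant(rng.normal(size=(n, p)))).fit().rsquared
          for _ in range(reps)]
    print(np.mean(r2), p / (n - 1))    # both values are close to 8/49, about .163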

If we then drop out k covariates, the numerator of this fraction increases by k, so the expected tolerance in the new regression is (df+k)/(N-1), where df is df for the original (larger) regression. Therefore the ratio of the two expected tolerance values (before and after the deletion) is df/(df+k).

We want to determine under what conditions the estimated standard error of b(X) would be exactly the same in these two regressions. Then on one side of that boundary, the deletion would increase the standard error, and on the other side it would decrease it. Therefore we will imagine for the moment that the two standard errors are exactly equal, and see what conditions are associated with that equality.

As described earlier, equality of the two standard errors means that MSE/Tol(X) is the same in the two regressions. But since we expect the two Tol(X) values to be in the ratio df/(df+k), that means the two MSE values must be in the same ratio. But in any regression, MSE = SSE/df, where SSE is the sum of squared residuals. Let SSE and SSE' refer to the larger and smaller regressions respectively. Then MSE for the larger regression is SSE/df while MSE for the smaller regression is SSE'/(df+k), where again df pertains to the larger regression. If we set the ratio of these two values to the aforementioned value df/(df+k), we have

(SSE/SSE')*(df+k)/df = df/(df+k), so

SSE/SSE' = [df/(df+k)]^2

so SSE'/SSE = [(df+k)/df]^2

If we use the ordinary F test to test the significance of the regressors deleted from the larger regression, we have

F = (SSE' - SSE)/SSE * df/k = (SSE'/SSE - 1)*df/k

Substituting the previous relation into this, we have

F = ([(df+k)/df]^2 - 1)*df/k = (k^2 + 2df*k)/(df*k) = 2 + k/df
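The algebra above can also be verified symbolically (an illustration only; sympy assumed):

    import sympy as sp

    df, k = sp.symbols('df k', positive=True)
    sse_ratio = ((df + k) / df) ** 2          # SSE'/SSE at the break-even point
    F = (sse_ratio - 1) * df / k              # F test for the k deleted covariates
    print(sp.simplify(F - (2 + k / df)))      # prints 0, confirming F = 2 + k/df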

Thus to decide whether to delete some set of k covariates, the worker can compute F and make the deletion if F < 2 + k/df.
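A hedged sketch of this decision rule in Python (statsmodels assumed, with Z a numpy array of covariates; the function name and argument layout are mine): fit the regressions of Y on the covariates with and without the k candidates, compute F as above, and compare it with 2 + k/df.

    import statsmodels.api as sm

    def should_delete(y, Z, drop_cols):
        """True if deleting the covariates in drop_cols is expected to shrink SE(b(X))."""
        keep_cols = [j for j in range(Z.shape[1]) if j not in drop_cols]
        large = sm.OLS(y, sm.add_constant(Z)).fit()
        small = sm.OLS(y, sm.add_constant(Z[:, keep_cols])).fit()
        k, df = len(drop_cols), large.df_resid
        F = (small.ssr - large.ssr) / large.ssr * df / k
        return F < 2 + k / df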

Several considerations make it reasonable to ignore the k/df term. First, k is most often 1 since we are deleting one covariate at a time--and even when not 1, k is usually fairly small so that k/df is trivial in comparison to 2. k/df would rarely exceed .1, so the proper F is nearly always between 2 and 2.1. Second, in my own experience, deleting variables one at a time usually produces fairly large jumps in the smallest remaining F. If F = 1.5 for one variable, then deleting it might well yield 2.5 as the smallest F in the recomputed regression. Thus working with an exact critical F is not very important. Third, when the difference between the two standard errors of b(X) is plotted as a function of F, that difference is near 0 for a broad range of F-values around the exact value of 2 + k/df, so again hitting it exactly is not very important. Thus for simplicity I suggest using a critical F of exactly 2, to avoid having to recompute k/df after each deletion. Or if k = 1 and t = sqrt(F) is being used, I suggest a t of 1.42, which is just above sqrt(2) = 1.414.