Using ALLSETS to identify collinear sets

Richard B. Darlington
Cornell University

The "collinearity problem" is construed in different ways, but one construal of the problem is this. It may be that R2 is statistically significant even though no regression slope is significant individually. Or even if some slopes are significant, it may be that removal of all the nonsignificant variables from the model produces a significant drop in R2, even though none of the variables removed was significant individually. In either of these cases, the problem is to identify the smallest sets of variables whose removal produces significant drops in R2. For instance, if there are 10 predictors and only predictors 3 and 5 are significant individually, it may be that in addition, predictors 7 and 10 are significant when tested as a set, while predictors 2, 4, and 8 also form a significant set. The problem is to identify such sets.

The significance of such sets, despite the nonsignificance of all their members taken individually, is usually caused primarily by collinearity among the members of the set. Such patterns of significance can also occur in the absence of collinearity, but are of interest even so -- so that fact does not detract from the usefulness of finding all sets with this property.

We can also define a variant of the problem which distinguishes between covariates and independent variables. Covariates are variables included in a regression model merely to control for their effects; the investigator cares little whether their effects are statistically significant. The collinearity problem is then defined the same as before, but with the additional proviso that even though the covariates are always in the model, all identified sets of variables must contain only independent variables. It turns out that the solution to this problem requires only a simple extension to the solution of the problem without covariates, so we consider first a problem with P independent variables and no covariates.

After using ALLSETS to compute all 2P values of R2 (and hold them in memory), it is simple to use an F test to test the difference between each of these values of R2 and the R2 for the full regression. Each of these F's tests the significance of the set of omitted variables. For instance, if P = 5 and you test the difference between the full R2 and the R2 computed for variables 1, 2, and 4, that F tests the independent contribution of the set containing variables 3 and 5. (The multiple tests problem is discussed later; here I'll just say I believe there is a simple and satisfactory solution to it.)

Identifying the most important significant sets

This series of F tests does not by itself provide a satisfactory solution to the collinearity problem, because the number of such tests may be unmanageably large. For instance, if P = 12, then the total number of possible regressions is 212 = 4096. Even if we don't count either the regression with no variables or the regression with all variables, or the 12 regressions missing only one variable, that still leaves 4096 - 14 = 4082 regressions. It turns out to be quite feasible to calculate 4082 such values of F and their associated p values. But it would still be an enormous job to examine all 4082 of these values of p or F to determine which sets are significant. Further, we're not interested in the significance of any set if some subset within it is also significant. For instance, if the set containing variables 2, 4, and 8 is significant, then we're not separately interested in the set containing variables 2, 4, 8, and 9. The significance of the latter set would tell us only that some of the variables in that set must be related to Y, and we already know that from the significance of the set containing variables 2, 4, and 8.

Thus another step is needed. We must pare the list of significant F's (which as just seen, may be thousands of items long) down to just those F's testing sets without significant subsets. To do this, the method requires the user to specify some significance level alpha which a set must meet to be considered significant; most users will probably set alpha to .05 or .01. After all values of F and p have been computed, the method ranks the tested sets from small to large, according to the number of variables in each. For instance, if P = 5, the regression containing variables 1, 2, 4 would be ranked ahead of the regression containing just variables 1 and 2, because the tested set for the latter (variables 3, 4, 5) is larger than the tested set for the former (variables 3 and 5). Then the program goes through the list of F's in order, and ignores any F testing a set which contains a subset previously noted to be significant. All significant F's passing this test are printed, together with the list of variables tested by that F. The list of printed sets is then usually quite short.

Despite the screening process just described, a single variable may be in two or more printed sets. For instance, suppose P = 8 and the only significant sets are (2, 4, 5) and (1, 5, 7, 8). Then both will be printed despite both containing variable 5, because neither one is a subset of the other.

Computing time is trivial; with GAUSS running on a DOS 486 machine, a problem with P = 10 takes about 15 seconds to do everything described above, yielding output like that shown next. Most computing languages are considerably slower than GAUSS, and problems with larger P will obviously take longer, but computing time still seems inconsequential.

An example

Below is some sample output with 10 predictors and alpha = .05. Q is the number of variables in a significant set. In the table section titled "Variables in tested set", 1's denote variables in the tested set. You see in the first two rows that the program found variables 8 and 3 significant individually, with p's of .0486 and .0007 respectively. Then it identified a set containing variables 5 and 9 with p = .0395, then a set containing variables 2 and 9 with p = .0457, then a set containing variables 1, 7, 9, and 10 with p = .0343, and finally a set containing variables 1, 2, 5, 7, and 10 with p = .0497. Only variables 4 and 6 fail to fall in any significant set; columns 4 and 6 contain all 0's.

Overall R^2 =  0.4900

    R^2      F        p       Q   Variables in tested set
                                  1 2 3 4 5 6 7 8 9 10
  0.4358   4.1429   0.0486    1   0 0 0 0 0 0 0 1 0 0
  0.3115  13.6471   0.0007    1   0 0 1 0 0 0 0 0 0 0
  0.3980   3.5158   0.0395    2   0 0 0 0 1 0 0 0 1 0
  0.4025   3.3443   0.0457    2   0 1 0 0 0 0 0 0 1 0
  0.3385   2.8949   0.0343    4   1 0 0 0 0 0 1 0 1 1
  0.3292   2.4594   0.0497    5   1 1 0 0 1 0 1 0 0 1

Handling covariates

We now return to the case in which the user wishes to distinguish between independent variables and covariates. This case is handled simply by applying the method just described, not to a simple correlation matrix but to a matrix of partial correlations in which the covariates have been partialed out. That is, the correlation matrix has (P+1) rows and columns as described above, but P is just the number of independent variables, not including any covariates. If this is done, N should be reduced by the number of covariates to make the F tests work correctly. For instance, if N was 100 but 5 covariates were partialed out, enter N as 95.

What about multiple tests?

We must certainly consider the problem of multiple tests, since the procedure just described may scan hundreds or even thousands of tests, and print just a few significant ones. I suggest using a multiple-test philosophy similar to that of the Fisher protected t method. That philosophy suggests the followingprocedure: in the full model, test the set of all variables not significant individually. If that set is not significant, stop, because then there is no evidence that collinearity has prevented some sets from being significant. That way, all later tests actually performed are tests of subsets from a set already found to be significant. The Fisher philosophy is not accepted universally, but is widely used and I believe it is often reasonable; for a fuller discussion see Sections 11.5.4 and 11.5.5 of my textbook Regression and Linear Models (McGraw-Hill, 1990).

This extra "Fisher protection" test is not incorporated into program FINDSETS described later, because I assume a worker will typically turn to FINDSETS only after considerable analysis of the data with an ordinary regression program. Therefore I believe the easiest thing to do is to perform the Fisher protection test with that other program. That may save the worker from having to use FINDSETS at all.

GAUSS Program FINDSETS: Detailed Description

Program FINDSETS operates as described above, and was used to generate the sample output shown above. FINDSETS requires just three inputs: a correlation matrix R, a sample size N, and a chosen significance level alpha -- typically .05 or .01. In R, the criterion or dependent variable Y must appear last, so the last row and column of R contain Y's correlations with the predictor or independent variables.

If the regression model contains some covariates which the investigator has included in the model just so they can be controlled, and the investigator wants to find sets of variables which include only independent variables (not covariates), then FINDSETS assumes R is a matrix of partial correlations among independent variables and Y, with the covariates partialed out. In that case, N should be reduced by the number of covariates. For instance, if N was 100 but 5 covariates were partialed out, enter N as 95.

Here are the major vectors (columns or rows) and matrices used in FINDSETS. GAUSS ignores a letter's case, so names are typically lower-case in the program itself, but upper-case in the discussion for easy identification.

VJ is a list of the 2P sweeps to be performed. It is created early in the program, using rules given earlier.

C is a copy of R. SWEEPing modifies C, but leaves R unchanged. After 2P sweeps dictated by the entries in VJ, C will differ from R only because of rounding error.

RSQ starts as a column vector of 2P zeros, but after each sweep one of those zeros is replaced by the newly-calculated value of R2, so RSQ is full of values of R2 by the end of the program. RSQ is filled sequentially, from top to bottom.

CH is a 1 x P row vector which starts as all zeros. When any variable k is swept, CH[k] is changed from 0 to 1, or from 1 to 0 if CH[k] = 1. Thus after each sweep, CH contains a record of the rows swept at that moment.

MEMB (for "members") starts as a 2P x P matrix of zeros. After the ith sweep, CH is copied into row i of MEMB. Thus after all the sweeps, MEMB shows the independent variables used to compute each value of R2. RSQ and MEMB are not printed out by default; they're typically too long. But they are not erased by the program. A little later we describe how the interested worker can access the unprinted information in them.

MAX is the largest value of R2 found by the program -- the value found when all P regressors are in the model. Its position in RSQ is identified by ALLVAR. Thus to say ALLVAR = 9 is to say that the 9th entry in RSQ is the one for the full model.

F is a vector that holds the 2P values of F calculated as described earlier. That is, each entry in F tests the difference between a particular R2 and MAX.

PV is a vector that contains the size of the set tested by each of those F tests. For instance, if P = 8 and the 11th value of RSQ is for a model with 5 variables, then PV[11] = 3 because the tested model differs from the full model by 3 variables.

FINDSETS uses vector commands to calculate all 2P values of F simultaneously. It so happens that values of PV appear in the denominators of F. That means that when FINDSETS attempts to test the full model against itself, the expression for F has a denominator of 0. To avoid that problem, that single value of PV is changed from 0 to 1 before F's are computed (we're not interested in that value of F anyway), and then changed back to 0 afterward.

After all F's are computed, the significance level p corresponding to each F is computed. These values of p are held in vector PP.

Then a matrix SRT is created with 2P rows and P+4 columns. The first four columns of SRT are RSQ, F, PP, and PV respectively, and the remaining columns are 1 - MEMB. The subtraction is done because MEMB itself contains 1's for variables in a model and 0's for other variables. But those other variables are the ones whose contribution to the overall R2 is tested by each F. Therefore each row of 1 - MEMB contains 1's for the variables tested by the corresponding F, and 0 for other variables.

After it is created, SRT is sorted by its fourth column (PV), so the F tests on the smallest sets are placed first. That facilitates the later step in which a significant set is not printed if it contains a significant subset. After the sorting, any such subset will precede the set under consideration, so only the preceding rows must be scanned.

The actual procedure is a little different from that just described, but equivalent. A vector SKIP is created, containing at first all 0's. Then changing any entry i in SKIP to 1 tells FINDSETS to skip that row when printing. First the program enters 1's into SKIP for all rows with nonsignificant values of F. Then, for each significant set I, the program sets SKIP to 1 for every later set J which includes I as a subset.

For readers who want to know exactly how this is done, it is done by dot-multiplying the I and J "membership lists" and setting skip[j]=1 if the product equals the "membership list" for I. For instance, consider the membership lists (1 0 1 0 1) and (1 0 1 1 1). The first is a subset of the second, since every 1 in the first list is matched by a 1 in the second. The dot-product of the two lists (containing products of corresponding entries) is (1 0 1 0 1), which matches the first list. The general rule is that the product of two lists will equal the first list only if the first is a subset of the second. That's the rule the program uses; it compares the first list to the dot-product of the two lists, and changes SKIP for the second list to 1 if the two are identical.

Finally SRT is saved to text file FINDSETS.OUT, except for the rows of SRT identified by SKIP.

The program is divided into 6 independent sections. By "independent" I mean there are no loops or "go to" commands that go between sections. Thus the program can be run section by section. Or a programmer trying to translate the program into another computer language can write and debug sections sequentially, debugging earlier sections before later sections have even been written. The six sections do the following:

  1. Define the SWEEP procedure
  2. Create VJ, the list of rows to sweep
  3. Perform 2^P sweeps, create RSQ and MEMB.
  4. Find or create MAX, ALLVAR, vectors F, PP, and PV, and matrix SRT
  5. Create vector SKIP
  6. Write an output file named FINDSETS.OUT.
To write to a text file all of SRT, and not just selected rows, you can take advantage of this independence among sections. After running the entire program, you can run section 6 alone after two modifications:

1. Most important, if you have already run the entire program and created file FINDSETS.OUT, change the first command in section 6 to, for instance, "output file = allsets.out reset". This names the about-to-be-created file ALLSETS.OUT instead of FINDSETS.OUT. Without this step you would overwrite the previously-created file FINDSETS.OUT.

2. Delete the commands "if skip[i]==0" and "endif" from section 6b. Those deletions make the section ignore the SKIP vector, and print all of SRT.

A note to programmers planning to rewrite: The SWEEP procedure is actually called from only one command in the entire program (in section 3). Therefore you can, if you wish, enter the commands for SWEEP at that one spot in the program, and thereby do away with all procedures or subroutines.

FINDSETS: Program Listing

FINDSETS assumes R, N, and alpha have already been defined.

GAUSS sometimes uses two "=" signs in a row; they will look like "==".

The program lines
"....R^2......F........p.......Q...Variables in tested set";
"..................................1 2 3 4 5 6 7 8 9 10";
print the heading lines for the output table. In those program lines, the dots should be replaced by spaces; they are shown as dots merely to make it easier to see how many there are.

After this line, program comments are enclosed between two "@" symbols.

@ 1. Define the SWEEP procedure @
proc sweep(k,c); local pv,b; pv=1/c[k,k]; b=c-c[.,k]*c[k,.]*pv;
b[k,.]=c[k,.]*pv; b[.,k]= -c[.,k]*pv; b[k,k]=pv; retp(b); endp;

@ 2. Create VJ, the list of rows to sweep @
vj=ones(2^p,1); incr=1; do until incr==2^(p-1); incr=2*incr;
i=0; do until i>2^p-incr; i=i+incr; vj[i]=vj[i]+1; endo; endo;

@ 3. Perform 2^P sweeps, create RSQ and MEMB. @
rsq=zeros(2^p,1); ch=zeros(1,p); memb=zeros(2^p,p);
i=0; do until i==2^p; i=i+1; jj=vj[i]; ch[jj]=1-ch[jj];
memb[i,.]=ch; c=sweep(jj,c); rsq[i]=1-c[p+1,p+1]; endo;

@ 4. Find or create MAX, ALLVAR, vectors F, PP, and PV, and matrix SRT @
max=maxc(rsq); pv= -sumc(memb')+p; allvar=minindc(pv); pv[allvar]=1;
f=(rsq-max)./pv*(n-p-1)/(max-1); pp=cdffc(f,pv,(n-p-1)*ones(rows(pv),1));
pv[allvar]=0; srt=rsq~f~pp~pv~(1-memb); srt=sortc(srt,4);

@ 5. Create vector SKIP @
skip=(srt[.,3] .> alpha); i=1; do until i==2^p; i=i+1; if skip[i]==0;
j=i; do until j==2^p; j=j+1; if skip[j]==0; pr=srt[i,5:p+4].*srt[j,5:p+4];
skip[j]=(pr==srt[i,5:p+4]); endif; endo; endif; endo;

@ 6a. Open text file FINDSETS.OUT, write opening material into it @
output file = findsets.out reset;
format /rd 8,4; "Overall R^2 =";; srt[1,1]; " ";
"....R^2......F........p.......Q...Variables in tested set";
"..................................1 2 3 4 5 6 7 8 9 10";

@ 6b. Write into the file the rows of SRT for which SKIP = 0 @
i=0; do until i==2^p; i=i+1; if skip[i]==0;
format /rd 8,4; srt[i,1:3];; format 4,0; srt[i,4];; " ";;
format 1,0; srt[i,5:p+4]; endif; endo;

@ 6c. Write some notes at the end of file, close the file @
" "; "F tests difference between R^2 and overall R^2.";
"p is the signif. level for F. Q is number of variables in tested set.";
"In the 'Variables' section, 1 denotes the variables out of the model and";
" therefore in the tested set, while 0 denotes variables in the model.";
output off;