An Experimental Comparison of Statistical and Case History Methods of Attitude Research

I: The Problem and Its Outcome

Samuel Stouffer


This is the report of an experiment to estimate the extent of agreement between findings obtained, as independently as possible, by a statistical technique and a case study technique in attitude research.

The subjects were 238 students of the University of Chicago. The inquiry dealt with attitudes toward prohibition. This subject was selected because it seemed sufficiently controversial to yield a rather wide range of attitudes and because there was already available a test of attitudes toward prohibition constructed by Dr. Hattie N. Smith, using the method of equal-appearing intervals first adapted for attitude research by Professor L. L. Thurstone.

The subjects were members of the following undergraduate classes, meeting in the autumn quarter, 1929: Sociology 110 (one campus class and one University College class), Mr. Paul F. Cressey, instructor; Sociology 220, Professor Ellsworth Faris, instructor; Sociology 270, Professor E. W. Burgess, instructor; and Political Science 236, Professor H. F. Gosnell, instructor. Since the purpose of the investigation was to test two methods of attitude research and not to learn the attitudes toward prohibition of the student body as a whole, no attempt was made to get a representative sample of the student body. Every student present took part in the study. Of the 249 present, excluding duplications, 238 completed

(2) their assignments in time to be used. The 238 included 111 males and 127 females. The composition of the classes as to nationality of parents and home and neighborhood environment may be judged somewhat from Tables 23 and 26 in Appendix B.

Each of the 238 students took Dr. Smith's test of attitudes toward prohibition and also wrote an autobiography of about one thousand words describing his experiences and feelings from childhood to the present day with respect to drinking and to prohibition laws. The instructor in each course made the writing of this document an assignment, the paper to be graded roughly on the basis of the cooperativeness and promptness of the student, wholly irrespective, of course, of whether the student was a wet or a dry. Previous experience had suggested that this academic stimulus was desirable in order to get careful work. It was equally desirable, however, to encourage frankness by guaranteeing anonymity. This was accomplished by assigning to each student a code number at random. He was asked to put the code number on his attitude test and also on his autobiographical document. Neither the instructors nor the present writer knew what code number a given student would receive. Some member of each class was invited to volunteer to keep a key list of code numbers and corresponding names. To this member the students turned in privately slips containing their code numbers and names. After the case histories had been read and graded, the list of code numbers and corresponding grades was given to the student with the key. He turned back to the instructor a

(3) list of the names and corresponding grades. Since only three grades were given (excellent, satisfactory, unsatisfactory), there was little possibility of identifying the writers of the case histories. The key lists were destroyed. The plan seemed to have the confidence of the students. Of the 238 papers, the writer knows the authorship of fewer than a dozen, and most of these only because the students revealed their identities purposely.[1]

The students also rated themselves on a graphic rating scale, as to their own attitudes toward prohibition laws and toward drinking liquor, and also filled out a background questionnaire.

The reliability of the test was estimated by correlating the average scores on two parallel forms filled out at the same sitting. The reliability coefficient was +.94. The 238 case histories were rated on a graphic rating scale by four judges independently. The reliability of the ratings as to attitudes toward prohibition laws was estimated by averaging the intercorrelations of the composite ratings, expressed as standard scores, of each pair of judges with each other pair. The reliability coefficient was +.96. [2]
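The procedure for the judges' ratings may be sketched in code. The sketch below is offered under stated assumptions and is not the writer's own computation: the phrase "each pair of judges with each other pair" is read here as the three ways of dividing four judges into two non-overlapping pairs, the function names are hypothetical, and the ratings shown are invented solely for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardize(x):
    """Express raw ratings as standard scores (mean 0, standard deviation 1)."""
    n = len(x)
    m = sum(x) / n
    sd = sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]


def pair_split_reliability(ratings):
    """Average correlation between composites of complementary judge pairs.

    ratings maps a judge's name to a list of ratings, one per case history.
    ASSUMPTION: exactly four judges, compared over the three disjoint splits.
    """
    judges = sorted(ratings)
    assert len(judges) == 4, "this sketch handles exactly four judges"
    z = {j: standardize(ratings[j]) for j in judges}
    correlations = []
    for pair in combinations(judges, 2):
        rest = tuple(j for j in judges if j not in pair)
        if pair < rest:  # count each split of the four judges only once
            a = [p + q for p, q in zip(z[pair[0]], z[pair[1]])]
            b = [p + q for p, q in zip(z[rest[0]], z[rest[1]])]
            correlations.append(pearson(a, b))
    return sum(correlations) / len(correlations)


# Invented ratings of six case histories by four hypothetical judges.
example = {
    "A": [1, 3, 5, 7, 9, 2],
    "B": [2, 3, 6, 7, 8, 1],
    "C": [1, 4, 5, 6, 9, 3],
    "D": [2, 2, 5, 8, 9, 2],
}
reliability = pair_split_reliability(example)
```

Whether the reported coefficient also involved a further step, such as a correction for the number of judges, is not stated in this section, and the sketch makes no such adjustment.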

The correlation table showing the relationship between the Smith test scores and the composite ratings of four judges

(4) on the case histories written by the 238 students who took the test is presented in Table 1. The correlation is +.81. Corrected for attenuation it is +.86.

A validity coefficient of +.86 is higher than any which the writer has seen between an attitude test and an outside criterion.[3] A careful study of the correlation table, however, suggests caution in interpreting the results. It will be noted that the distribution both of test scores and case history ratings has a wide range and a tendency toward bimodality. No effort was made, of course, to load the extremes in choosing subjects for this study. The effect of a distribution of the sort found is to raise the correlation higher than probably would be the case if variability were less. If this experiment were to be repeated under the same conditions, using a group with small variability, the correlation probably would be smaller than +.81.[4] On the other hand, one has some reason to feel that extremes of sentiment on prohibition are relatively no less frequent in the general population than in a rather liberal school like the University of Chicago. If anything, a considerable confusion of feelings

(5)

Table 1. Correlation Between Scores on Smith Test of Attitudes and Composite Case History Ratings of 4 Judges on Attitudes toward Prohibition Laws

           42-48   49-55   56-62   63-69   70-76   77-83   84-90   91-97  98-104 105-111 112-118 119-125 126-132 133-139 140-146 147-153   Total
9.0-9.4      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..       1      ..      ..       1
8.5-8.9      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..
8.0-8.4      ..      ..      ..      ..      ..      ..      ..      ..       1      ..       1       1       1       1      ..       2       7
7.5-7.9      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..       3       2      ..       4       3       2      14
7.0-7.4      ..      ..      ..      ..      ..      ..      ..      ..      ..       1       2       3       4       6       3       2      21
6.5-6.9      ..      ..      ..      ..      ..       1       1       3      ..       2       2       6       7       3       2      ..      27
6.0-6.4      ..      ..      ..      ..      ..       1       1       2      ..       5       2       4       6       4      ..       1      26
5.5-5.9      ..      ..      ..      ..       2       2      ..       3       2       2       2       3       3       1       1      ..      21
5.0-5.4      ..       1       1       2       2       2       2       4       4       1       4       4      ..      ..       1      ..      28
4.5-4.9      ..      ..      ..       3       2       4       1      ..       4       1      ..      ..       1      ..      ..      ..      16
4.0-4.4      ..      ..      ..       1       2       1       1       2       2      ..      ..      ..       1      ..      ..      ..      10
3.5-3.9      ..       2       3       3       9       3       1       1       1       1      ..      ..      ..      ..      ..      ..      24
3.0-3.4       1       1       5      12       5       2       1      ..      ..      ..       1      ..      ..      ..      ..      ..      28
2.5-2.9      ..       1       6       4       1       1       1      ..      ..      ..      ..      ..      ..      ..      ..      ..      14
2.0-2.4      ..      ..       1      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..      ..       1
Total         1       5      16      25      23      17       9      15      14      13      17      23      23      20      10       7     238

r = +.813; r corrected for attenuation = +.858

(6) which would lower reliability and validity, might be anticipated at such an institution. Dr. Smith found a tendency toward bimodality among 281 students of several colleges and in a random group of 200 business men. She found that the scores of a group of 206 members of the Methodist Church and 178 members of the Y.W.C.A. were unimodal and skewed toward the dry end of the attitude scale.[5] The nature of the distribution in Table 1 seems satisfactory enough to permit the use of correlational methods. The regressions are clearly linear. A normal correlation surface is not required, and it would seem that the distribution of standard cross-products is such as to justify the expression of their central tendency in the form of an average, which is, of course, the correlation coefficient.
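The statement that the correlation coefficient is the average of the standard cross-products can be checked against the cell frequencies of Table 1. The following is a sketch only. It assumes that each case may be placed at the midpoint of its class interval, and the names used are hypothetical. The second computation, which keeps only six middle rows of the table, is merely an illustration of the effect of reduced variability; it is not the formula from Kelley cited in note 4.

```python
from math import sqrt

# Cell frequencies transcribed from Table 1. Rows run from 9.0-9.4 down to
# 2.0-2.4; columns run from 42-48 up to 147-153. Empty (..) cells are 0.
TABLE_1 = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 2],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 4, 3, 2],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 3, 2],
    [0, 0, 0, 0, 0, 1, 1, 3, 0, 2, 2, 6, 7, 3, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2, 0, 5, 2, 4, 6, 4, 0, 1],
    [0, 0, 0, 0, 2, 2, 0, 3, 2, 2, 2, 3, 3, 1, 1, 0],
    [0, 1, 1, 2, 2, 2, 2, 4, 4, 1, 4, 4, 0, 0, 1, 0],
    [0, 0, 0, 3, 2, 4, 1, 0, 4, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1, 1, 2, 2, 0, 0, 0, 1, 0, 0, 0],
    [0, 2, 3, 3, 9, 3, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 5, 12, 5, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 6, 4, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
ROW_MIDPOINTS = [9.2 - 0.5 * i for i in range(15)]  # 9.2, 8.7, ..., 2.2
COL_MIDPOINTS = [45 + 7 * j for j in range(16)]     # 45, 52, ..., 150


def grouped_r(table, row_mid, col_mid):
    """Mean cross-product of standard scores, each cell weighted by its frequency."""
    cells = [(row_mid[i], col_mid[j], f)
             for i, row in enumerate(table)
             for j, f in enumerate(row) if f]
    n = sum(f for _, _, f in cells)
    mean_x = sum(x * f for x, _, f in cells) / n
    mean_y = sum(y * f for _, y, f in cells) / n
    sd_x = sqrt(sum(f * (x - mean_x) ** 2 for x, _, f in cells) / n)
    sd_y = sqrt(sum(f * (y - mean_y) ** 2 for _, y, f in cells) / n)
    cross = sum(f * ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y, f in cells)
    return cross / n


total_cases = sum(map(sum, TABLE_1))
r_full = grouped_r(TABLE_1, ROW_MIDPOINTS, COL_MIDPOINTS)
# Rows 6.5-6.9 down to 4.0-4.4 only: a group of deliberately reduced variability.
r_middle = grouped_r(TABLE_1[5:11], ROW_MIDPOINTS[5:11], COL_MIDPOINTS)
print(total_cases, round(r_full, 2), round(r_middle, 2))
```

On the full table the grouped computation comes out at about .81, in agreement with the coefficient given under Table 1. On the six middle rows alone it falls to roughly .53, which is in line with the caution expressed above about groups of small variability.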

The correlation of +.86 as corrected for attenuation is an estimate of what the relationship would be between the test scores and the composite ratings of the four judges if the chance errors in the test and in the ratings could be eliminated. This is determined by dividing the observed correlation coefficient by the geometric mean of the reliability coefficients of the test and the ratings. If the reliability coefficients are too high, the validity coefficient of +.86 may be too low. If the reliability coefficients are too low, the validity coefficient may be too high. As discussed later in the sections on reliability, the reliability coefficients

(7) do not take into account certain types of errors. If these could have been taken into account, it is possible that the reliability coefficients, themselves corrected for attenuation, would have been slightly higher. But the observed correlation between the test scores and the judges' ratings also probably would have been slightly higher, with the result that the validity coefficient as corrected for attenuation probably would have been raised little, if at all.
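The correction for attenuation described above, the observed correlation divided by the geometric mean of the two reliability coefficients, may be written out as a short computation. This is a minimal sketch with a hypothetical function name; it uses the coefficients as rounded in the text (+.813, +.94, and +.96).

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Observed correlation divided by the geometric mean of the reliabilities."""
    return r_xy / sqrt(r_xx * r_yy)


# Observed r from Table 1, with the rounded reliabilities of test and ratings.
corrected = correct_for_attenuation(0.813, 0.94, 0.96)
print(round(corrected, 3))  # 0.856

# Sensitivity: higher assumed reliabilities give a lower corrected value.
corrected_if_higher = correct_for_attenuation(0.813, 0.97, 0.98)
print(round(corrected_if_higher, 3))
```

With the rounded reliabilities the result is about .856, which agrees with the +.86 quoted in the text. The slightly higher +.858 printed under Table 1 presumably rests on reliability coefficients carried to more decimal places than are reported here. The second call illustrates the remark that reliability coefficients which are too high make the corrected coefficient too low.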

A constant error which would tend to give the validity coefficient a spuriously high value would be a strain for consistency on the part of the subject in trying to produce a document which would be consistent with his answers to the test. A discussion of the precautions taken to prevent this and an evaluation of the possible effect of such a strain for consistency will be found below (pages 31-33). On the other hand, a constant error which might tend to give the validity coefficient a spuriously low value is the fact that a certain amount of time elapsed between the taking of the test and the writing of the case history. In no case was this more than a week, but any changes in attitude during that interval would tend to make the observed correlation between the test and the judges' ratings lower than the true correlation. It is possible, of course, that the very act of writing the case history may have caused some slight shifts in attitudes, which would be reflected in the completed copy of the case history but not in the test. In the writer's opinion, neither of these constant errors was probably important enough, on

(8) the whole, to affect the true validity coefficient significantly, especially since they pulled, if at all, in opposite directions.

A word of caution should be introduced about the use of the technical word validity in this investigation. Validity as here used means nothing more than that what the test purports to measure seems to be about the same as what the judges infer from the case histories. It is conceivable that the concept of attitudes may be defined in such a way that neither the test nor the case histories may be thought to get at attitudes. Both the test and the case histories are dependent on the verbal reports of the subjects, the principal difference being that the case history is a narrative summarising a life sequence of acts, including feelings, in their cultural setting. This study makes no attempt to get outside data from other people as to the subjects' acts with reference to prohibition. Nor, on the other hand, does it probe as deeply into the "unconscious" as a psychoanalytic interview would seek to do. It is not the purpose of this investigation to engage in a terminological discussion of concepts of attitude or to evaluate to what extent that which the judges infer to be an attitude conforms to any particular logician's concept. As will be observed below (pages 35 to 37), the judgments as to attitudes made by four graduate students of the Department of Sociology in the University of Chicago, who were familiar with the theoretical literature on the subject, turned out to be not very different from the judgments made by the

(9) superintendent of the Illinois Anti-Saloon League and the secretary and director of the Illinois Association Opposed to Prohibition, who were not familiar with the theoretical literature on attitudes and who presumably had strong personal feelings on the subject of prohibition.[6]

Several factors which might make this correlation spuriously high already have been mentioned or are discussed in later sections of the investigation. The writer's conviction, based on quantitative checks wherever possible, is that these factors probably did not enter in sufficient importance to make the correlation spuriously high. Two other factors, upon which there is no check, may be suggested here, although they do not affect validity as defined above. One is that the only record which we have of the subject's acts is his own written record. The correlation between his verbal responses on a test

(10) and an index of behavior as reported by others might not be so high as that between the verbal responses on the test and an index of his behavior as reported by himself even in the most frank and honest autobiographical document. This is not to say that the former would be a better index than the latter. It simply is to point out that the two indices might be different and that if a meticulously high degree of accuracy were desired both somehow should be taken into consideration. The second factor is that the case history, as well as the test, contains expressions of opinion, and the judges may have given them excessive weight as compared with experiences less easily evaluated.[7] Neither of these two factors is controlled in the present investigation. To control the former would probably be impossible, requiring the subject to reveal his identity so that his acquaintances could be consulted, requiring long and arduous quest for interviews, requiring a lapse of time during which the attitude if at all volatile might change, and yielding at best an amorphous mass of data which would be difficult to evaluate qualitatively, let alone quantitatively. To control the latter might not be impossible, but would be difficult. It would require editing the case histories and eliminating every semblance of a verbal opinion. The case history then could be evaluated both in its mutilated and original form, the ratings compared,

(11) and appraisal made of the weight attached to the verbal reports. The dangers of editing are obvious. Even if differences were found, one could not be perfectly sure that the editor had not pruned away reports of important non-verbal acts inextricably attached to the expressions of opinion. Nor could one be sure, unless checked, that two editors working independently would not cut away different things. Finally, by what criterion would one decide what weight given by judges to the opinions in the case history would be "excessive"?

The writer's opinion is that neither of these two factors, if properly controlled, would materially affect the validity coefficient of +.86 found, even if it were chosen to extend the definition of validity given on page 8. But it must be emphasized -- as distinguished from certain other conclusions reached as to factors affecting the validity coefficient -- that this opinion is based more or less on a vague feeling for the data rather than upon objective evidence.

Further tests of the comparability of the results obtained independently by the statistical and case history methods are reported below, pages 38 to 47. A large part of the report is devoted to a critical consideration of various factors, other than those already mentioned, which might affect the reliability or validity coefficients found. Special attention is given to factors which might make the reported correlations spuriously high. These factors are considered, one by one, and, wherever possible, a quantitative test is applied. On the whole, the result of these analyses is, in

(12) the writer's judgment, a confirmation of confidence in the validity coefficient found.


  1. This method was devised by Dr. Herbert Blumer, University of Chicago, and used successfully by him in some previous research, as yet unreported.
  2. These reliability coefficients are discussed in detail in Sections II and III below.
  3. It is the convention to interpret the reliability coefficient of a test as the index of the extent to which it measures something consistently. The validity coefficient tells the extent to which it measures what it purports to measure. The validity coefficients even of highly reliable intelligence tests are seldom above +.50 or +.60.
  4. If a test can be assumed to be equally effective throughout the range (an assumption which is often doubtful) the correlation to be expected with different variabilities may be estimated by the formula given in Kelley, Statistical Method, p. 222.
  5. Hattie N. Smith, The Construction and Application of a Scale for Measuring Attitudes About Prohibition, Ph.D. thesis, University of Chicago, December, 1929, p. 46.
  6. This proved to be a rather interesting confirmation of the theory set forth by Professor Ellsworth Faris, writing in 1928: "The question of definition and inconsistency in the use of the word attitudes is ... more a matter of lexicography than science. A word means what men mean by it and most dictionaries patiently record all the uses of the words in the language. If one author is inconsistent, and most of them do slip, he should be held accountable for the fault, but scientific progress will not be made by mere voting about words. It is also a matter of common knowledge that other words are used instead of the word 'attitude' to denote the same thing, e.g., tendency, predisposition, disposition, and habit. To the tyro this is confusing; but if we think denotatively we cannot go wrong. Even the word attitude could be abandoned and a meaningless symbol substituted without loss. We could speak of the element X which is left as a residue of a former action and predisposes to a future act or type of acts." Faris, "Attitudes and Behavior," American Journal of Sociology, September, 1928, p. 271, note.
  7. On the precautions taken in the present investigation to avoid this, see below, page 23.

