# *An Experimental Comparison of Statistical and Case History Methods of Attitude Research*

## II. Reliability of the Attitude Test Scores

### Samuel Stouffer


The attitude test scale employed was that constructed by Dr. Hattie N. Smith,
using the method described in detail by Prof. L. L. Thurstone in 1928.**[1]**
The test comprised 44 opinions, each with a predetermined scale value. Dr. Smith
derived these scale values by the procedure described in detail in her
dissertation.**[2]**
A preliminary set of 135 opinions on prohibition was mimeographed on small
slips, one opinion to a slip. Each set of 135 opinions was sorted by 300 judges
into 11 piles, representing various degrees of attitudes from extremely
favorable toward prohibition to extremely unfavorable. Using the method of equal appearing intervals, the judges laid the slips in piles which appeared to them to be equal distances apart. For each opinion a cumulative frequency distribution was constructed, showing the number of judges who had placed the opinion on or to the left of each pile. The psychologically spaced piles, numbered arbitrarily from 1 (extremely favorable toward prohibition) to 11 (extremely unfavorable toward prohibition), formed the base line or attitude continuum. The scale value of each opinion was then allocated to the attitude continuum by dropping to the base line a perpendicular drawn graphically from the median of a smooth curve which had been passed free-hand through the ordinates of the cumulative frequency distribution.
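
The graphical procedure above can be approximated numerically. The following is a minimal sketch, not Dr. Smith's actual computation: it assumes each pile `k` covers the interval from `k - 0.5` to `k + 0.5` and that judges are spread evenly within a pile, so that linear interpolation stands in for the free-hand smooth curve. The function names and the pile counts are hypothetical.

```python
from typing import Sequence


def percentile_point(counts: Sequence[int], fraction: float) -> float:
    """Point on the pile continuum below which `fraction` of the judges fall.

    Assumption: pile k spans [k - 0.5, k + 0.5] and judgments are uniform
    within a pile (linear interpolation instead of a free-hand curve).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one judgment")
    target = fraction * total
    cumulative = 0
    for pile, count in enumerate(counts, start=1):
        if count > 0 and cumulative + count >= target:
            return (pile - 0.5) + (target - cumulative) / count
        cumulative += count
    return len(counts) + 0.5


def scale_value(counts: Sequence[int]) -> float:
    """Median of the judges' placements: the opinion's scale value."""
    return percentile_point(counts, 0.50)


def semi_interquartile_range(counts: Sequence[int]) -> float:
    """Half the distance between the first and third quartile points."""
    q1 = percentile_point(counts, 0.25)
    q3 = percentile_point(counts, 0.75)
    return (q3 - q1) / 2


# Hypothetical placements of one opinion by 300 judges into piles 1..11.
example_counts = [0, 0, 10, 40, 100, 100, 40, 10, 0, 0, 0]
print(scale_value(example_counts))               # 5.5
print(semi_interquartile_range(example_counts))  # 0.75
```

A small semi-interquartile range indicates that the judges agreed closely on where the opinion belongs, which is the agreement measure used as the first selection criterion.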

Out of the 135 trial opinions, Dr. Smith selected 44. There were two main
criteria of selection. First, only those opinions were selected about which the
300 judges rather closely agreed. The degree of agreement on an opinion was
measured graphically by dropping perpendiculars from the first and third
quartile points on the smoothed cumulative frequency curve and reading off on
the base line the semi-interquartile range. Second, the opinions were so chosen
as to form two parallel groups of 22 opinions with scale values spaced equally,
as nearly as possible, along the attitude continuum.**[3]**

Dr. Smith gave the test to 890 subjects. The subjects put a check mark opposite each of the 44 opinions with which they agreed. The reliability was determined by correlating the average scale value of opinions endorsed on one half of the test with the average scale value of the opinions endorsed on the presumably parallel half. The correlation between these two sets of scores was +.85, which became +.92 when adjusted for a test of double the length of each half, by use of the Spearman-Brown prediction formula.
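
For reference, the adjustment can be written out in the standard form of the Spearman-Brown prediction formula for a test of double length. The symbols $r_{\text{half}}$ and $r_{\text{full}}$ are introduced here for illustration and do not appear in the original text.

```latex
r_{\text{full}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}}
               = \frac{2(0.85)}{1 + 0.85}
               = \frac{1.70}{1.85}
               \approx 0.92
```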

In the present investigation, the same 44 opinions selected by Dr. Smith were used.**[4]** To determine reliability, it seemed worth while to take into account what Prof. Thurstone has called the "intrinsic popularity" of a given opinion. Experience has shown that although two opinions may have the same scale value and semi-interquartile range, one may be endorsed by several times as many people as another. As soon as the 238 subjects in the present study had taken the test, the approximate number endorsing each of the 44 opinions was counted. As an extreme example of the influence of the "intrinsic popularity" factor, two opinions may be cited.

The opinion "Prohibition is not desirable now because there is not a sufficiently large majority in favor of it to make
enforcement effective" (scale value, 5.6; semi-interquartile range, 1.1), was
endorsed by 162 people, while the opinion, "It is absolutely immaterial whether
we have prohibition or not" (scale value, 5.5; semi-interquartile range, 0.6)
was endorsed by only 9 people. In spite of its small semi-interquartile range,
the latter opinion obviously is of no value in the test. There were a few other
opinions which awakened relatively little response from anybody. It seemed best
to rearrange the opinions into two parallel scales somewhat different from the
parallel scales chosen by Dr. Smith. This was accomplished by requiring that the
number of people endorsing questions within a corresponding small range on each
scale should be the same, as nearly as possible. While this was done *a
posteriori,* it introduces no spurious reliability, because the mere total
number of people endorsing two parallel opinions gives no indication that the
same people who endorsed the one also endorsed the other (unless perhaps
each opinion should be endorsed by an extremely large proportion of the
subjects, as was not generally the case in the present study). The parallel
scales into which the 44 opinions were divided are shown in Appendix B, Table 6.

In the present investigation the average scores made by subjects on each parallel form yielded a correlation coefficient of +.88, slightly higher than that obtained by Dr. Smith in her group of 890 subjects. Adjusted by the Spearman-Brown formula for a test of double the length of either half, the reliability coefficient was +.94. This is not quite as high as the reliability coefficient of +.96 obtained by Thurstone and
Chave on their scale of attitudes toward the church.**[5]**
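
The split-half procedure described above can be sketched in code. This is an illustrative reconstruction under stated assumptions, not the original computation: the opinion identifiers, scale values, and subjects below are invented, and subjects who endorse nothing on one form are simply dropped, a rule the text does not specify.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence, Set, Tuple


def form_score(endorsed: Set[str], form: Dict[str, float]) -> Optional[float]:
    """Average scale value of the opinions on one form that a subject endorsed."""
    values = [form[op] for op in endorsed if op in form]
    return sum(values) / len(values) if values else None


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on one form do not vary")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Predicted reliability of a test twice as long as each half."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(
    subjects: List[Set[str]], form_a: Dict[str, float], form_b: Dict[str, float]
) -> Tuple[float, float]:
    """Return (correlation between halves, Spearman-Brown adjusted value)."""
    scores_a, scores_b = [], []
    for endorsed in subjects:
        a, b = form_score(endorsed, form_a), form_score(endorsed, form_b)
        if a is not None and b is not None:  # assumption: drop incomplete cases
            scores_a.append(a)
            scores_b.append(b)
    r_half = pearson(scores_a, scores_b)
    return r_half, spearman_brown(r_half)


# Hypothetical parallel forms (opinion id -> scale value) and five subjects.
form_a = {"a1": 1.5, "a2": 5.5, "a3": 9.5}
form_b = {"b1": 1.6, "b2": 5.4, "b3": 9.6}
subjects = [
    {"a1", "b1"},
    {"a1", "a2", "b1", "b2"},
    {"a2", "b2"},
    {"a2", "a3", "b2", "b3"},
    {"a3", "b3"},
]
r_half, r_full = split_half_reliability(subjects, form_a, form_b)
print(round(r_half, 3), round(r_full, 3))

# The adjustment reproduces the figures reported in the text.
print(round(spearman_brown(0.85), 2))  # 0.92
print(round(spearman_brown(0.88), 2))  # 0.94
```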

There is an interesting difference between the distribution of the test
scores in Table I and that of the composite ratings of the judges of the case
history documents. The ratings have a somewhat heavier loading at both extremes.
This difference may be due in part to the inability of the judges of the case
histories to make fine enough distinctions near the extremes. It is very likely
due in part, also, to the failure of the test to give people with extreme
attitudes an opportunity to register their true feelings. A person's score on
the test is the average scale value of the opinions which he endorses. If the
opinion which is theoretically the best index of his attitude happens to be the
most extreme opinion in the group of 44, his score is likely to be less extreme
than his true position would justify, because all of the other opinions which he
endorses will of necessity weight his score on the side of moderation. The
arithmetic average, therefore, is not a good measure of central tendency for
extreme cases, although it may be satisfactory for people whose true position is
not too near either end of the scale.**[6]** Moreover, the scale values of the opinions near the ends of the scale are not quite so
accurately determined as those near the middle of the scale, because of the "end
effect" inevitable in the method of equal appearing intervals and of the
arbitrary element introduced by free-hand extrapolations.**[7]**
The fact that there is a difference between the number of people with extreme
attitudes as measured by the test and by the judges' ratings of the case
histories, just as would be expected in advance from a knowledge of the way in
which the test scores are computed, provides apparently a useful warning against
overconfidence in the exactness of a frequency distribution of test scores as
indices of the true distribution of attitudes in the group. It is possible that
some more refined methods of test construction now in process of development by
Prof. Thurstone may eliminate some of the error at the extremes.

would possibly be higher than the reliability coefficient of +.94 found in the present study.