PROJECTIVE TESTS CAN BE MADE RELIABLE: Measuring Need for Achievement
JOHN J. RAY
University of New South Wales
Summary: With self-rating (or "Likert") scales, reliability is partly a function of number of items and it is therefore suggested that increasing the number of measurements (items) should also improve projective test reliabilities. Empirical examples in the field of n Ach measurement are given that support this contention. Evidence is also mentioned which suggests that reliability can further be improved by different scoring systems and stimulus materials.
One of the most serious criticisms of projective tests is the often repeated finding that such procedures suffer from lack of reliability (e.g. Murstein, 1963; Weinstein, 1969; Entwisle, 1972). Rather than refute the criticism, most users of projective tests seem simply to ignore it and instead stress the validity of such tests (e.g. De Charms, Morrison, Reitman, & McClelland, 1955). That such validity evidence can be found does of course pose something of a problem for test theorists, since to them test reliability is a precondition for validity (See Cronbach, 1964 and, for a dissenting view, Karon, 1966). Part of the answer to this problem may lie in the type of validation sought. Where the validation is performance in some experimental situation (as with much of the TAT n Ach research), a common but momentary (situational) level of arousal for some particular motive may underlie performance on both the task of the experiment and the task of answering the test. Because the arousal is momentary, however, scores obtained from the test on one occasion do not correlate with scores obtained from the test when it is readministered on a later occasion. This view of projective tests as sensitive to mood fluctuations or situational responses is in fact the normal retort made by users of such tests when they do choose to reply to criticisms based on low observed test-retest reliabilities (e.g. Erginel, 1972). Erginel in fact produces evidence to show that Rorschach scores do fluctuate in cycles, as would be expected if mood swings were involved. This line of argument is, however, a potentially limiting one: It is very often desirable to use projective test scores as indices of traits, i.e. as indices of chronic rather than of momentary dispositions. If a projective test is to be used to measure consistent traits in people, it does seem an indispensable requirement that the measurements it provides should also be consistent.
It is the purpose of this paper to show that high reliability is in fact obtainable with projective tests -- in particular with tests of need-for-achievement (n Ach).
Reliability and Test Length
Reliability maximization is in fact a problem not limited to projective tests. Self-rating (Likert) scales also have such a problem. The only difference would seem to be that users of Likert scales have in general tried to construct reliable scales. They have made reliability a criterion for scale adequacy. If they fail to do so, their scales are just as prone to unreliability as any other. For example, Lynn (1969) constructed a Likert scale to measure need for achievement and -- as is for some reason generally the case with research in the n Ach area -- he failed even to consider whether or not his scale was reliable. As a consequence, when the reliability of his scale was in fact tested (Ray, 1971), it was found to be only .17 -- commensurate with what is found with projective n Ach tests.
What then do Likert scale users do to ensure that their scales are reliable? The answer (See e.g. Guilford, 1954) is very simple. They repeat their measurements or estimates a large number of times. Each item of a Likert test represents an independent occasion of measurement and an independent estimate of what it is that the scale purports to measure. By pooling a large number of such estimates (items) and averaging the scores obtained from them, a total or overall score is obtained which is in fact consistent from occasion to occasion. Each item, however, must in fact be a measure of what all the other items are a measure of, i.e. the scores on items must be intercorrelated. What a Likert scale constructor in effect looks for is then the largest possible number of intercorrelated items that he can find which do relate in content to what he wants to measure. The formulas he uses as criteria for scale adequacy are, then, all formulas which weight number of items against the mean intercorrelation of those items (See Cronbach, 1951). This can perhaps most clearly be seen in the case of the Spearman-Brown correction formula for split half reliability but, as Cronbach shows, formulas such as "alpha" and the Kuder-Richardson formulas 20 and 21 are also reducible to this. When alpha is at its maximum value, then, an optimal weighting of test length against item intercorrelation has been found.
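The trade-off between number of items and mean intercorrelation can be written out explicitly. The following is a sketch using the standardized form of alpha and the general Spearman-Brown formula; the symbols k, \bar{r}, r_{11} and r_{kk} are notation introduced here for illustration, and the worked figures use a purely hypothetical mean intercorrelation of .20.

```latex
% k       = number of items (or cards)
% \bar{r} = mean intercorrelation of the items
% Standardized alpha as a function of test length and mean intercorrelation:
\alpha_{\mathrm{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% Spearman-Brown: reliability r_{kk} of a test lengthened k-fold with
% parallel material, given the reliability r_{11} of the original unit:
r_{kk} = \frac{k\,r_{11}}{1 + (k-1)\,r_{11}}

% Hypothetical illustration with \bar{r} = .20:
%   k = 4  gives  (4)(.20) / (1 + (3)(.20))   = .80 / 1.60  = .50
%   k = 20 gives  (20)(.20) / (1 + (19)(.20)) = 4.00 / 4.80 \approx .83
```

The two expressions have the same form, so that with the mean intercorrelation held constant the only way to raise the coefficient is to raise k.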
There is absolutely no reason why users of projective tests should not resort to the same technique for the purpose of reliability maximization. To date, however, few seem to have done so. Each story produced in response to a TAT or Rorschach card is, in effect, like a single Likert scale item. If one is, for instance, desirous of measuring n Ach by the TAT what one does is to count the number of achievement-related images in each story. The number of images is the score for that card. If, as is usual, four cards are used, the person's total score is the sum of only four numbers. By contrast, the score on a Likert scale may often be the sum of 20 to 50 numbers. If we want a reliable projective measure of n Ach, what we have to do is use 20 cards, not four. The only complication to this account is of course the possibility that a TAT protocol can in fact be scored for more than just a total achievement imagery score: Whatever it is scored for, however, one card still only produces one number, one score, or one estimate of what is to be measured. The person emits a large or a small number of images in a particular category and that is his score on that variable. For reliability one normally needs many such scores.
The logic in the above considerations has made itself apparent to at least one earlier writer, Murstein (1963). Notably, Murstein is a writer well versed in classical test theory -- something probably untrue of most projective test users. Murstein, however, says (1963):
"Before anyone rushes to get 10 more cards, it should be noted that it is doubtful whether this increment in reliability would actually be achieved. The user of the Spearman-Brown formula assumes that the new items are parallel to the previous ones. Accordingly, merely doubling the number of cards does not meet the 'parallel' requirements. [p. 137]"
It should be noted, however, that these reservations are little more than thoroughly non-empirical pessimism. For some tests, parallel forms exist already (e.g. the French "Test of Insight"; See Atkinson, 1958). In other cases, what is there to prevent one selecting from a large number of alternatives those cards that are in fact parallel to the previous ones? What is to prevent us from doing as Atkinson (1950) did and selecting new cards simply on the criterion of whether or not they tend to evoke high levels of (for example) achievement imagery? To suggest that the four existing TAT cards used to measure need for achievement are in some sense unique or inimitable is, when considered, a remarkable assertion indeed. At any event it is only empirical evidence that can settle the matter. Whether Murstein or the present author is correct only actual experience can decide.
There is in fact already in the literature ample indication that increases in length can increase projective test reliability. This evidence is, however, in general a matter of only incidental mention and only one study appears to have started out with the sole objective of increasing length to obtain greater reliability. This is the study by Johnston (1957) using the IPIT, a forced choice modification of the TAT. In this modification, four possible interpretative stories of the TAT card are provided already and all the subject has to do is rank the four explanations in order of appropriateness. If he ranks the achievement-oriented story first, he gets a high score on the achievement sub-scale. Obviously, this procedure is sufficiently different from the normal TAT procedure to make results obtained with the two methods not automatically generalizable one to the other. Nonetheless, Johnston's results are instructive. Using a set of 24 pictures (instead of the original 10), test-retest reliabilities rose in every case. For the four sub-scales the reliabilities were (ten-item reliabilities in brackets): .60 [.46], .58 [.47], .59 [.56] and .73 [.61]. Most impressive of all, however, was the fact that the 24-item form was more valid. The 24-item achievement scale predicted performance in an achievement task but the 10-item form did not. This tends to confirm the view of reliability as setting an upper bound to validity (But see Karon, 1966).
The most direct evidence for the present thesis, however, is to be found in the work of Atkinson (1950). Doubling the normal number of TAT cards for measuring n Ach from four to eight, he obtained a Spearman-Brown reliability of .65. Selecting the six of these which elicited most achievement imagery, an even higher value of .78 was obtained. These values are of course much higher than is normally obtained and do in fact fall in the range normally found with Likert scales. Murstein quotes this result but without comment. He does appear to regard the Spearman-Brown split-half formula as a dubious form of reliability measurement but its widespread use with Likert scales makes this seem a rather odd reservation. It is on the basis of precisely such a reliability estimate that projective tests are often condemned. The relationship between internal consistency and test-retest reliability is a well-established one (Cronbach, 1951) and, if empirical evidence is needed, one could again turn back to Johnston's (1957) study, where it is seen that internal consistency and test-retest reliability increased simultaneously. If there are any residual doubts at all about the superior reliability of a summed score as opposed to a single-item score, the results of Malatesha (1971) should put these at rest. He tabulates the test-retest reliability of an Indian version of the TAT for measuring n Ach. He gives the reliabilities of the five cards and of the total score separately. The average reliability of a single card is .49 but the reliability of the score obtained by totalling the scores on the five cards is .74. Increasing length does increase reliability.
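The same relationship can be turned around to ask how many cards would be needed to reach a given reliability. The sketch below inverts the Spearman-Brown formula; the function name `cards_needed` and the target values are illustrative assumptions, and the calculation presumes that any added cards are parallel to the existing ones, which is exactly the condition Murstein questions. The figure of .49 is the average single-card reliability reported by Malatesha (1971).

```python
import math


def cards_needed(single_card_reliability: float, target: float) -> int:
    """Smallest whole number of parallel cards whose summed score is
    projected (by the Spearman-Brown formula) to reach `target`.

    Assumes every card has the same reliability and that the cards are
    parallel in the classical test theory sense.
    """
    r = single_card_reliability
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Solve target = n*r / (1 + (n - 1)*r) for the lengthening factor n.
    factor = target * (1.0 - r) / (r * (1.0 - target))
    return max(1, math.ceil(factor))


if __name__ == "__main__":
    # Hypothetical targets, using the single-card figure of .49.
    for target in (0.80, 0.90):
        print(f"target {target:.2f}: {cards_needed(0.49, target)} cards")
```

The projection is only as good as the parallel-cards assumption; observed values, such as the .74 obtained with five cards, may fall short of it.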
Scoring System and Stimulus Materials as Contributors to Reliability
The results quoted so far are something like oases in a desert. That Atkinson should get a reliability of .78 with six cards and Malatesha a reliability of .74 with five cards is clearly exceptional as far as projective tests in general go (See the research summarized in Murstein, 1963, or Weinstein, 1969). Most amazing of all, however, are the reliabilities reported by Kureshi (1966). From two plates for each of three constructs he obtains reliabilities of .85, .81 and .98! Can something be learned from these exceptional cases which would help us all to do as well?
Regrettably, it is not easy to establish why these tests were so exceptionally reliable. The Atkinson and the Kureshi studies are both to be found only in unpublished dissertations that came to attention only because they were quoted by Murstein (1963) and Siddiqui and Akhtar (1969) respectively. The Kureshi article also gives little hint of what might have been atypical in that particular administration of the test. Fortunately, however, some recent Australian work by the present author on need for achievement contained a projective test with high reliability. Form I of the French Test of Insight (FTI) (See Atkinson, 1958) was administered by students to people they felt they could rate on need for achievement. With a sample of 75, a reliability of .748 was observed. This was reliability as assessed by Cronbach's (1951) coefficient "alpha," which is in fact the mean of all possible split half reliabilities. The FTI was scored for achievement imagery only. Other Likert scales for measuring achievement motivation included in the test battery showed reliabilities ("alphas") that were in some cases lower than that of the FTI. There is, however, one element of spuriousness in the FTI reliability just given. Murstein and many others (e.g. Smith, 1970) have recommended that achievement imagery score be corrected for verbal fluency. Obviously, the more one writes, the greater chance one has of emitting some achievement imagery. When the FTI was scored simply for number of words used in each answer, a reliability of .976 was observed, i.e. the number of words people use from answer to answer is highly consistent. This score correlated .580 with the original FTI raw score, indicating substantial contamination. When the score on each item was corrected for fluency (i.e. number of achievement images divided by number of words used) the reliability dropped to .630. Because verbal fluency scores tend to be highly reliable, a test score contaminated with fluency will also tend to have its reliability inflated. A failure to make this correction could possibly explain the remarkable results of Kureshi (1966).
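The fluency correction and the alpha calculation described above can be sketched as follows. This is a minimal illustration, not the scoring program actually used: the function names and the small data set (five respondents, four answers) are invented for the purpose, and alpha is computed from the usual variance form given by Cronbach (1951).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items table of scores.

    `scores` is a list of respondents, each a list of k item scores.
    """
    k = len(scores[0])
    item_vars = [variance([person[i] for person in scores]) for i in range(k)]
    total_var = variance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def correct_for_fluency(images, words):
    """Divide each item's achievement-image count by its word count.

    An answer with no words is given a corrected score of 0.0.
    """
    return [
        [img / n if n else 0.0 for img, n in zip(image_row, word_row)]
        for image_row, word_row in zip(images, words)
    ]


if __name__ == "__main__":
    # Hypothetical data: achievement images and words per answer.
    images = [[3, 2, 4, 3], [1, 0, 1, 1], [2, 2, 1, 3], [0, 1, 0, 0], [4, 3, 3, 5]]
    words = [[60, 55, 70, 65], [25, 20, 30, 22], [45, 50, 40, 48],
             [15, 18, 12, 20], [80, 75, 85, 90]]

    print("alpha, raw imagery scores:   ", round(cronbach_alpha(images), 3))
    print("alpha, word counts:          ", round(cronbach_alpha(words), 3))
    corrected = correct_for_fluency(images, words)
    print("alpha, fluency-corrected:    ", round(cronbach_alpha(corrected), 3))
```

Comparing the three coefficients on real protocols shows how much of the raw-score consistency is carried by sheer length of answer.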
The reliability of .630 is, however, still considerably higher than that usually obtained: Weinstein (1969) reported a split half reliability for this test of .48. The characteristics of this administration are then still of interest. The only deviation from normal practice that could have an explanatory role was that a simple global "total number of achievement images" score was calculated for each of the ten items without endeavouring to discriminate sub-categories (such as "Fear of Failure" versus "Success seeking"). It should be mentioned here, however, that the whole reason for choosing the French test for the study was that previous studies had found some reliability for it. The levels of reliability observed were less surprising than they would have been with, say, the TAT. This in fact is of no small relevance to the thesis presented above. The FTI is considerably longer than the TAT as used to measure n Ach (ten items versus four). It would seem then that, at least for the projective measurement of n Ach, a test (the FTI) is available for which we normally can expect satisfactory reliability. This is because the FTI already has two parallel forms. If both forms were combined, reliabilities should range from high to at least satisfactory. Had Weinstein done this we can (by the Spearman-Brown formula) estimate that he would have obtained a reliability of .64 instead of .48. A similar calculation for the present administration of the test gives an increase from .63 to .77.
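The two projections just quoted can be checked with a few lines of code. The sketch below applies the Spearman-Brown formula with a lengthening factor of two (both parallel forms combined); the function name `spearman_brown` is an illustrative choice. It gives roughly .65 and .77, in line with the .64 and .77 quoted above, the small difference in the first figure presumably being a matter of rounding.

```python
def spearman_brown(reliability: float, length_factor: float = 2.0) -> float:
    """Projected reliability of a test lengthened `length_factor` times
    with parallel material (Spearman-Brown formula)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    observed = {
        "Weinstein (1969), one form": 0.48,
        "present administration, one form": 0.63,
    }
    for label, r in observed.items():
        print(f"{label}: {r:.2f} -> both forms: {spearman_brown(r):.3f}")
```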
To return to the topic of why a satisfactory reliability was obtained on the present occasion with only the ten item form: The global scoring procedure may have contributed something but another possibility is that the reliability statistic used was simply more accurate. Weinstein and others have normally used an arbitrary splitting of their tests into two halves when estimating reliability by the internal consistency method. On the present occasion Cronbach's (1951) generalized "alpha" formula was used. This is equivalent to the mean of all possible split halves. With short projective tests, arbitrarily taking just one of the possible split halves may introduce more error than is usually the case: In future reliability measurements of projective tests, use of alpha is therefore strongly indicated. Use of alpha would also allow for possible "sawtooth" type effects that have often been invoked in explanation of low projective test reliability (See Murstein, 1963).
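How much a single arbitrary split can differ from alpha on a short test may be seen by enumerating every split. The sketch below is an illustration under stated assumptions: the data are invented (six respondents, four cards), the test is split into halves of equal size, and each split-half coefficient is computed by the Guttman (Rulon) variance formula, for which the mean over all such splits equals alpha. Function names are illustrative.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items table of scores."""
    k = len(scores[0])
    item_vars = [variance([person[i] for person in scores]) for i in range(k)]
    total_var = variance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def guttman_split_half(scores, half_a):
    """Split-half coefficient (Guttman/Rulon form) for one division of the items.

    `half_a` holds the item indices of the first half; the rest form the second.
    """
    k = len(scores[0])
    half_b = [i for i in range(k) if i not in half_a]
    a = [sum(person[i] for i in half_a) for person in scores]
    b = [sum(person[i] for i in half_b) for person in scores]
    total = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (variance(a) + variance(b)) / variance(total))


def all_split_halves(scores):
    """Coefficients for every split into two equal halves (even k only).

    Item 0 is always kept in the first half so each split is counted once.
    """
    k = len(scores[0])
    if k % 2:
        raise ValueError("an even number of items is required")
    return [
        guttman_split_half(scores, (0,) + rest)
        for rest in combinations(range(1, k), k // 2 - 1)
    ]


# Hypothetical imagery scores: six respondents, four cards.
scores = [[2, 1, 3, 2], [0, 1, 0, 1], [3, 2, 2, 4],
          [1, 0, 1, 0], [4, 3, 3, 3], [1, 2, 0, 1]]

if __name__ == "__main__":
    splits = all_split_halves(scores)
    print("individual splits:", [round(s, 3) for s in splits])
    print("mean of splits:   ", round(mean(splits), 3))
    print("alpha:            ", round(cronbach_alpha(scores), 3))
```

With only four cards there are just three possible splits, so the spread among them gives a direct picture of the error that one arbitrary choice may introduce.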
In summary, then, it may be said that prospects for the reliable projective measurement of at least n Ach are good as long as those working with such tests take all the means available to them for maximizing test reliability. Early publication of the promising work of Kureshi (1966) in this field would also be of great assistance in possibly extending our means of attaining this goal. The extremely high (almost high enough to make one suspect some artifact) validity coefficients (reliability figures are regrettably not given) reported by Honor and Vane (1972) for the TAT scored by the Arnold (1962) system also give promise that alternative scoring systems as well as increased length may contribute much towards the future attainment of high reliabilities.
Arnold, M. B. Story sequence analysis. New York: Columbia University Press, 1962.
Atkinson, J. W. Studies in projective measurement of achievement motivation. Unpublished doctoral dissertation, University of Michigan, 1950.
Atkinson, J. W. (Ed.) Motives in fantasy, action and society. Princeton, New Jersey: Van Nostrand, 1958.
Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.
Cronbach, L. J. Essentials of psychological testing. New York: Harper, 1964.
De Charms, R., Morrison, H. W., Reitman, W., & McClelland, D. C. Behavioural correlates of directly and indirectly measured achievement motivation. In McClelland, D. C. (Ed.), Studies in motivation. New York: Appleton Century, 1955.
Entwisle, D. R. To dispel fantasies about fantasy-based measures of achievement motivation. Psychological Bulletin, 1972, 77, 377-391.
Erginel, A. On the test-retest reliability of the Rorschach. Journal of Personality Assessment, 1972, 36, 203-212.
Guilford, J. P. Psychometric methods. New York: McGraw Hill, 1954.
Honor, S. H., & Vane, J. R. Comparison of Thematic apperception test and questionnaire methods to obtain achievement attitudes of High School boys. Journal of Clinical Psychology, 1972, 28, 81-83.
Johnston, R. A. A methodological analysis of several revised forms of the Iowa Picture Interpretation Test. Journal of Personality, 1957, 25, 283-293.
Karon, B. P. Reliability: Paradigm or paradox, with especial reference to personality tests. Journal of Projective Techniques & Personality Assessment, 1966, 30, 223-227.
Kureshi, M. A. A study of adolescent fantasy. Unpublished PhD thesis, Department of Psychology, AMU, Aligarh, 1966.
Lynn, R. An achievement motivation questionnaire. British Journal of Psychology, 1969, 60, 529-534.
Malatesha, R. N. The relationship between motivation and attitude of modernization. Journal of Psychology Researches, 1971, 15, 111-113.
Murstein, B. I. Theory and research in projective techniques. New York: Wiley, 1963.