Scandinavian Journal of Psychology, 1988, 29, 145-147.



University of New South Wales, Australia

Smedslund has rediscovered the fact that the items of many scales are to various degrees semantically related. He is critical of this. It is pointed out that there is already an ongoing debate about how much semantic overlap there should be between items and it is claimed that such overlap is important in maximizing reliability. Too much overlap does however limit validity so proposals for formalizing the degree of overlap are briefly explored.

Although I have long been an admirer of Smedslund's various attempts to inject more philosophical sophistication into psychology, I would like to submit that in his latest paper (Smedslund, 1987) Smedslund is essentially tilting at windmills. What he attacks is very much a straw man.

In his paper Smedslund quite persuasively shows that the correlations between the items of an additive scale will tend to reflect the degree to which those items share semantic content. He implies that this represents some sort of folly on the part of scale constructors and seems to think that scale constructors are unaware of this aspect of their scales. As I have constructed and had published more than my fair share of scales over the years I may be in a position to comment on these implications.

Basically, I have bad news for Smedslund: no scale constructor worth his salt would be unaware of what Smedslund has shown. Proof of this is the fact that there has long been a running debate among people concerned with scales over just that issue. The debate has, however, not been about whether scale items are semantically related but rather about how much semantic relatedness is desirable. The debate has, in other words, already gone a long way past the basic truth that Smedslund has rediscovered. For instance, Boyle (1985) argues that a scale can be too internally consistent. He says that internal consistency of a high order can only be obtained by having items which all more or less say the same thing and that this is undesirable. This is, of course, very much like what Smedslund says. Boyle's position is, however, a minority one. Although such eminent psychologists as R. B. Cattell support him (see Boyle, 1985), the usual psychometric textbook approach is that internal consistency (and hence reliability) should be maximized (see Nunnally, 1967).

What both Boyle and Smedslund appear to overlook is the reason why we use multi-item scales in the first place. We use them to increase both reliability and validity. The very concept of reliability (in psychometrics) however implies repetition. We want to find out if people will give the same answer twice. How can we do that without content overlap? By asking a person the same question in two or three different ways and finding out that the person answers consistently on the various occasions we assure ourselves that what we are getting is potentially informative and not mere random noise. This relationship between internal consistency and repeatability is of course the basis of the long-known Spearman-Brown prophecy formula (Nunnally, 1967). If a person gives consistent answers within the scale he will also tend to give consistent answers between administrations of the scale. This is because the (say) 10 items of a scale represent not only 10 different sets of content but also 10 occasions of measurement. The Spearman-Brown formula (or derivatives such as Cronbach's (1951) "alpha") thus uses a set of 10 occasions of measurement to estimate the outcomes of 20 occasions of measurement -- the second 10 occasions being, of course, when the scale is re-administered. Some degree of content overlap is therefore entirely proper in any scale. Boyle's contention that internal consistency cannot be used to predict test-retest reliability therefore serves only to illustrate that he has understood none of the basic texts or references on this topic (see Nunnally, 1967; Cronbach, 1951).
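The arithmetic behind this argument can be written out. What follows is only a minimal sketch: the function names are illustrative, the mean inter-item correlation of 0.30 is a hypothetical value rather than a figure from any of the scales discussed, and the alpha used is the standardized form, which assumes items of equal variance. Under those assumptions, standardized alpha is simply the Spearman-Brown formula applied to the mean inter-item correlation, with the number of items as the lengthening factor.

```python
def spearman_brown(r, n):
    """Predicted reliability when a measure of reliability r is lengthened n-fold."""
    return n * r / (1 + (n - 1) * r)


def standardized_alpha(mean_r, k):
    """Standardized alpha for k items with mean inter-item correlation mean_r.

    This is the Spearman-Brown formula with the single item as the unit
    and k as the lengthening factor (assumes equal item variances).
    """
    return spearman_brown(mean_r, k)


# Hypothetical mean inter-item correlation; not taken from the paper.
mean_r = 0.30

# Reliability of a 10-item scale built from such items.
alpha_10 = standardized_alpha(mean_r, 10)

# Doubling the 10 "occasions of measurement" to 20, starting from alpha_10 ...
alpha_20_from_scale = spearman_brown(alpha_10, 2)

# ... gives the same figure as going straight from one item to 20 items.
alpha_20_from_items = standardized_alpha(mean_r, 20)

print(round(alpha_10, 3), round(alpha_20_from_scale, 3), round(alpha_20_from_items, 3))
```

The two routes to the 20-item figure agree because the formula composes: lengthening a scale tenfold and then doubling it is the same as lengthening it twentyfold.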

Both Boyle and Smedslund are, however, right in detecting that there is something a bit unsatisfactory about the content overlap between items. The problem with such overlap is that it limits validity. It undermines your claim to be studying a general construct. For this reason most scales have some sort of more or less formalized sub-scale structure (e.g. Ray, 1971). The scale will be composed of clusters of semantically related items and a scale will not be considered satisfactory unless there are correlations between the clusters as well as correlations within the clusters. Factor analysis is the normal method of checking on this. The fact that the items of a scale do not correlate solely with semantically related items is thus the basis for a claim that the scale has at least the potential for a broad validity. There is even a conventional name for this aspect of a scale. It is called "construct validity".

Smedslund does of course claim that it is not only semantic overlap that produces inter-item correlations. He also claims that "commonsense" or implicit personality theories play a part. People know that although two things are not semantically related they are none the less commonly found together. If people answer "yes" to some item, they will feel that they should also answer "yes" to something commonly associated with it. Thus by being one of the group for whom the scale was written one can predict many of the associations between the scale items. There can be no doubt that Smedslund is perfectly correct in all this. What he seems to overlook, however, is that finding such associations is no easy task. As every scale constructor knows, there are many items which should go together which are in fact found not to go together. This is why the final form of a scale is almost invariably much shorter than the initial version of the scale. Many "obvious" relationships between items turn out not to be so obvious after all. They turn out in fact to be non-relationships. The discovery of such relationships is thus a worthwhile achievement in itself. Since it is unusual to report one's failures, we cannot know how much of an achievement it is, but I myself have certainly had complete failures with some scales on some occasions (e.g. Ray, 1972). What I thought was a well-conceived collection of items relating to the one general trait turned out to have nothing in common at all. I am sure I am not alone in that experience. To find any set of items that go together must therefore be rated as some advance in our knowledge. The EPQ is therefore a far less trivial achievement than Smedslund appears to think it is. The fact that the correlations between its items can be predicted is itself an important achievement. Such items or sets of items are all too rare. In fact one scale of the EPQ (the P scale) is still, despite many revisions, below the level of internal consistency generally considered acceptable (Ray, 198b). Eysenck no doubt wishes that he could find a few more of those "obvious" relationships that Smedslund decries.

Be that as it may, however, we should perhaps in conclusion revert to the point that Smedslund faintly raises and which Boyle quite explicitly raises: How much semantic overlap is in fact ideal? Is there any sense in Boyle's proposal that reliability (alpha) should be limited to 0.70? If we realize that a scale with an alpha as high as 0.90 can still have a mean inter-item correlation of less than 0.20, this proposal does seem extreme. Boyle is proposing that we accept as scales collections of items which may on many occasions be insignificantly correlated with one another. Under such circumstances Boyle's concern that a scale with an alpha of 0.90 may have "surplus" homogeneity or content overlap can only be seen as laughable. There may indeed be some sense in setting up formal criteria for degree of content overlap but Boyle's proposal for such a criterion is plainly unpersuasive. An alternative might be the wider adoption of Comrey's (1970) procedure of having around three versions of each item and treating the resultant clusters as mini-scales. Most subsequent analyses are then based on the relationships between the clusters rather than on the relationships between the items. This procedure would seem to go some way towards formalizing and standardizing degree of content overlap.
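The relationship between alpha and the mean inter-item correlation mentioned above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the standardized form of alpha (items of equal variance), the function names are hypothetical, and the scale lengths and the correlation of 0.19 are chosen purely as examples, not drawn from any scale discussed here.

```python
import math


def standardized_alpha(mean_r, k):
    """Standardized alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def items_needed(target_alpha, mean_r):
    """Smallest number of items whose standardized alpha reaches target_alpha."""
    k = target_alpha * (1 - mean_r) / (mean_r * (1 - target_alpha))
    # Small tolerance so floating-point error does not add a spurious item.
    return math.ceil(k - 1e-9)


def implied_mean_r(alpha, k):
    """Mean inter-item correlation implied by a given standardized alpha and length k."""
    return alpha / (k - alpha * (k - 1))


# Illustrative case: items correlating 0.19 on average (just under 0.20).
k_for_90 = items_needed(0.90, 0.19)
print(k_for_90, round(standardized_alpha(0.19, k_for_90), 3))

# Illustrative case: what an alpha of 0.70 implies for a 10-item scale.
print(round(implied_mean_r(0.70, 10), 3))
```

On these assumptions, a scale of 39 items correlating only 0.19 on average already reaches an alpha of 0.90, while an alpha of 0.70 on a 10-item scale implies a mean inter-item correlation of under 0.20.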


