The University of New South Wales, Continuing Education, 1982


LECTURER: Dr. John Ray

In this course an attempt has been made to deliver all the material orally as far as possible. The notes therefore will be slightly briefer than usual and, instead of going over exactly the same ground as the oral delivery, will represent a more independent approach to the same problems, one that takes advantage of the things that can best be done in a written medium. A synopsis of each individual lecture will therefore be given at the end of these notes, but first let us look at the problem afresh:

Method and rationale for constructing an additive scale

The technical terms first: A scale is any group of questions or items with a common theme. An additive scale (often called a "Likert" scale) is a set of items where each person gets a score according to how many questions he answers in a certain direction or in a certain way. The old school spelling test (9 out of 10 spellings right meant that you were a good speller) is one example. Intelligence, personality and most attitude scales are of this type. They are additive scales. In intelligence scales the only possible scores on an item are 1 or 0 (i.e. "right" or "wrong"). In attitude scales it is customary to offer a range of responses such that a person gets a higher score the more strongly he agrees. A common scheme is to give:

5 = Strongly Agree
4 = Agree
3 = Not Sure
2 = Disagree
1 = Strongly Disagree

A failure to respond is also generally scored as 3. This number system assumes that each possible answer is perceived as spaced along an equal-intervals continuum. This is probably seldom the case but the amount of information lost by the approximation is held to be slight.

An important feature of such a scale is that it should consist of equal proportions of oppositely scored items. Thus a "Strongly Agree" response on the following two items would get a 5 in the first case but a 1 in the second case:

1. All men are equal.
2. Inequality is nature's law.

If these two items correlated highly, they would form a rudimentary "scale" of "equalitarianism". In adding up scores on several items we are assuming that all the items measure the same basic thing. Proof of this is always needed in the form of high positive correlations between all the items. What we are doing in fact is to find the average response of any person to an item. We want to know if on the average he agrees with equalitarian statements or disagrees with them. We can always get such an average by dividing the total score by the number of items but we in fact seldom do this. For most uses the total score alone is just as satisfactory.
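As a rough illustration of the scoring just described, the following sketch adds up a person's responses, reversing the code on oppositely scored items and scoring a missing response as 3. The function names and the sample responses are invented for illustration; they are not part of any published scale.

```python
# Response codes from the scheme above: 5 = Strongly Agree ... 1 = Strongly Disagree.
MAX_CODE = 5
MIN_CODE = 1
NOT_SURE = 3  # a failure to respond is scored as the midpoint


def score_item(response, reversed_item):
    """Return the item score, flipping the code for an oppositely scored item."""
    if response is None:  # no answer given
        response = NOT_SURE
    if reversed_item:
        return MAX_CODE + MIN_CODE - response  # 5 -> 1, 4 -> 2, 3 -> 3, ...
    return response


def scale_score(responses, reversed_flags):
    """Total (additive) score across all items."""
    return sum(score_item(r, rev) for r, rev in zip(responses, reversed_flags))


# Item 1 ("All men are equal") is scored as given; item 2 is reversed.
reversed_flags = [False, True]
# Hypothetical respondent: strongly agrees with item 1, strongly disagrees with item 2.
total = scale_score([5, 1], reversed_flags)
mean_response = total / len(reversed_flags)
print(total, mean_response)  # 10 5.0
```

The mean is shown only to make the "average response" idea concrete; as noted above, the total alone is normally used.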

This averaging effect is useful to us because it enables us to randomize out variability due to the particular wording of the individual item and retain what is general to all the items. For instance the following items all express an attitude to innovation - but innovation in different contexts:

1. I enjoy modern music better than music written years ago.
2. Australia's ties with Great Britain should not be reduced.
3. The best buildings are those whose architecture is functional and without ornamentation.
4. The woman's place is in the home.

It is known that answers to these four items intercorrelate significantly and yet they all talk about different things. The one thing they have in common is a preference for old ways versus new ways. If we gave 1 for a "Strongly Agree" response on items 1 and 3 and 5 for such a response on items 2 and 4 we might say that we have a scale of conservatism. One item alone would not do as such a scale. If we took item 4, it might be affected as much by the subject's sex as by their attitude to innovation. Answers to item 3 might be affected by one's artistic taste; answers to 2 might be affected by one's ancestry or birthplace; answers to 1 might be affected by the subtlety of one's musical taste. On the other hand, answers to items 1, 2 and 3 are unlikely to be affected by one's sex; answers to 1, 2 and 4 are unlikely to be affected by one's taste in architecture; answers to 1, 3 and 4 are unlikely to be affected by one's birthplace and answers to 2, 3 and 4 are unlikely to be affected by the subtlety of one's musical taste. Thus by taking an average we minimize the influence of any one specific factor and are left with what is general to all items or held in common by all items. Note: We cannot assume that a group of items has something in common just because it appears to on inspection. We can only get to know this by giving the test to a large group of people and seeing if the items empirically do intercorrelate. This is what test analysis consists of.

There is a great deal of variability in how highly a group of items intercorrelate. In other words, their internal consistency is variable. The usual measure of a scale's internal consistency is the average inter-item correlation i.e. the average of all the correlations between all the items. We may call it "r" (mean 'r'). Naturally we like r to be as high as possible. In published tests it ranges between approximately +0.100 and +0.300. The more the items intercorrelate the more they have something in common, and the less they measure specific things only. Therefore we can feel sure that we are measuring something general with quite a short scale if the r is high. The higher the r the shorter we can make the scale. The lower the r the longer the scale has to be. In item analysis, therefore, we discard first those items which detract most from r, i.e. in the following 4 item scale we would discard item 3 first of all.


Item        1         2         3         4
2                   1.000     -.093      .398        r = .203

and are left with:

Item        1         2         4
2                   1.000      .398        r = .410
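The discarding step can be sketched in a few lines. The code below computes the mean inter-item correlation and finds the item whose removal raises it most. The response data are invented for illustration (five respondents, four items, with the third item deliberately unrelated to the rest), and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_inter_item_r(items):
    """Average of the correlations between every pair of items.

    `items` holds one list of scores per item (one score per respondent).
    """
    pairs = list(combinations(range(len(items)), 2))
    return sum(pearson(items[i], items[j]) for i, j in pairs) / len(pairs)


def weakest_item(items):
    """Return (index, new mean r) for the item whose removal raises mean r most."""
    best_index, best_r = None, None
    for k in range(len(items)):
        rest = items[:k] + items[k + 1:]
        r = mean_inter_item_r(rest)
        if best_r is None or r > best_r:
            best_index, best_r = k, r
    return best_index, best_r


# Invented scores: one list per item, five respondents each.
items = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [5, 2, 1, 2, 5],  # unrelated to the others
    [1, 3, 2, 5, 4],
]
print(round(mean_inter_item_r(items), 3))
index, new_r = weakest_item(items)
print(index + 1, round(new_r, 3))  # item number to discard, and the mean r without it
```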

So we see that in order to know how good a scale is we need to know both its r and how many items there are in it, i.e. we need to know its r weighted by its length. We can combine this information into one number by using a weighting formula. A very common such weighting formula is the Spearman-Brown equation. Our new single number is called (after Cronbach) "coefficient alpha". The formula is:

alpha = n.r/(1+r(n-1))

where n = the no. of items. (Vide Lord & Novick, 1968, pp. 90 & 118).

There are some rough conventions about what level of alpha is acceptable for a scale. If the scale is to be used in personnel selection, it must be above .90. A good research instrument is above .80. A preliminary version of a research instrument should be above .60. Note that if a scale has an r of .200 and 20 items, its alpha will be .83. If it has only 10 items its alpha will be .71.
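The two figures just quoted can be checked directly from the formula. This is only a sketch; the function name is an invention for the example.

```python
def alpha_from_mean_r(mean_r, n_items):
    """Spearman-Brown form of coefficient alpha: n.r / (1 + r(n - 1))."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))


print(round(alpha_from_mean_r(0.200, 20), 2))  # 0.83
print(round(alpha_from_mean_r(0.200, 10), 2))  # 0.71
```

Trying other values of n with the same r shows how quickly a low r forces the scale to become long before the conventional levels are reached.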

This coefficient is often called "the reliability coefficient 'alpha'". This is because it has the very useful property of being usually close to numerical identity with the test-retest reliability of a scale, i.e. if we give the same test to the same group of people on two occasions a month apart we might find that scores on the two occasions correlate .85. If this is so, we would also expect coefficient "alpha" of the test to be around .85.

Why is this so? If we give a 10 item test twice we have not two occasions of measurement but 20. Each time a person answers an item is a different time from any other. One does not answer two items at once. One answers them one after the other. If the 10 item test is a test of conservatism, we have 20 occasions being measured. Each item is a miniature (but weak) test in itself. Thus in finding the r of a group of items we are finding the r of several occasions of measurement. When we weight this r by the length of the test we also correct the r for the greater accuracy that is obtained by using the sum of all items at once instead of just one at a time. Thus if the r on the second occasion of measurement is similar to the r on the first occasion (and if the two groups of people are identical this is generally so) we can estimate the correlation between the second set of ten measurements and the first set of ten measurements by knowing just the intercorrelations among the first set of measurements. Another way of doing the same thing which is perhaps more intuitively understandable is to use the split-half method. In this method the reliability of the test across two administrations is estimated by splitting the responses obtained on one administration into two halves (e.g. odd numbered items vs. even numbered items) and adding up the two sets of items to get two total scores. These two scores are then intercorrelated and corrected for the attenuation due to their each being only half as long as the test proper by the same Spearman-Brown formula. In this use it is sometimes called the Spearman-Brown "prophecy" formula. The split half method divides the test into two sub-tests. Our method divides it into as many sub-tests as there are items. (For a recent reference on this topic see Cronbach (1951)).
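The split-half method can be sketched as follows. The responses are invented, all items are assumed to be already scored in the same direction, and the function names are hypothetical. The step-up is simply the Spearman-Brown formula with n = 2, since the full test is twice as long as each half.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def step_up(r_half):
    """Spearman-Brown formula with n = 2: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(rows):
    """`rows` holds one list of item scores per respondent."""
    odd_totals = [sum(row[0::2]) for row in rows]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in rows]  # items 2, 4, 6, ...
    return step_up(pearson(odd_totals, even_totals))


# Invented data: five respondents, four items.
rows = [
    [1, 2, 1, 1],
    [2, 1, 2, 3],
    [3, 4, 3, 3],
    [5, 4, 4, 5],
    [4, 5, 5, 4],
]
reliability = split_half_reliability(rows)
print(round(reliability, 3))
```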

Coefficient alpha is not usually calculated by the Spearman-Brown formula. There is a mathematically equivalent shortcut formula which enables us to sidestep calculating the actual intercorrelations. For interest's sake it is:

alpha = (n/(n-1)).(1.0 - the sum of the variances of the individual items/the variance of the total score)

In a form suitable for the 1 or 0 response possibilities of intelligence test items, this is known as formula K.R.20 (after Kuder and Richardson, who devised it).
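A minimal sketch of the shortcut formula follows, using invented 1 or 0 responses from four respondents to three items. The function names are hypothetical. Either divisor (n or n - 1) may be used for the variances provided the same one is used throughout, since only their ratio enters the formula.

```python
def variance(values):
    """Sample variance (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def coefficient_alpha(rows):
    """alpha = (n/(n-1)) * (1 - sum of item variances / variance of total score).

    `rows` holds one list of item scores per respondent.
    """
    n_items = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_variances) / total_variance)


# Invented right/wrong data: four respondents, three items.
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(coefficient_alpha(rows), 2))  # 0.75
```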

Because it is so closely associated with internal consistency, one does occasionally hear coefficient alpha referred to as a measure of internal consistency. Strictly, this is incorrect. r is the measure of internal consistency. Alpha is the internal consistency weighted by the length (no. of items) of the test. When one says: "The reliability is ......(so and so)" one is usually referring to the alpha of the test.

There is also a shortcut method of finding which items of a test are weakest (detract most from the internal consistency). This is to correlate each item with the total score on the test. The items with the lowest correlations are discarded. It is customary to correct the raw correlation for "spurious overlap" i.e. because the individual item forms part of the total there will always be some correlation between the two but this effect can be removed mathematically. Only the corrected coefficient can be tested for significance.
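One simple way of removing the overlap, assumed here for illustration (the notes do not say which correction is intended), is to correlate each item with the total of the remaining items only. The data and function names below are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_totals(rows):
    """Correlate each item with the sum of all the OTHER items.

    `rows` holds one list of item scores per respondent.
    """
    n_items = len(rows[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]  # total with this item left out
        correlations.append(pearson(item, rest))
    return correlations


# Invented data: five respondents, four items; item 3 is unrelated to the rest.
rows = [
    [1, 2, 5, 1],
    [2, 1, 2, 3],
    [3, 4, 1, 2],
    [4, 3, 2, 5],
    [5, 5, 5, 4],
]
correlations = corrected_item_totals(rows)
weakest = correlations.index(min(correlations)) + 1  # item number, counting from 1
print([round(r, 2) for r in correlations], weakest)
```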

Note that alpha and r behave differently as we drop low correlating items from a test. Alpha rises to a maximum and then drops, whereas r rises continuously. It is normal practice to drop weak items in groups (e.g. drop the four weakest items) and then re-calculate the reliability and item-total correlations. On the basis of the new correlations one then drops the next weakest four items and so on. As low correlating items are progressively dropped r must obviously always rise. The advantage of a higher r is however partially offset by the shorter length of the test so alpha does not rise as much. Eventually, the items become so few that, in spite of the higher r, alpha begins to drop. We normally take as the final form of our scale that length at which alpha is at a maximum. If the original number of items was small or if there was only a small proportion of high correlation coefficients between items, the final maximum alpha will be low. There is a significance test of alpha due to Hoyt (1941). It works by converting alpha to an 'F' ratio. The formula is:

alpha = 1 - 1/F

The appropriate 'F' is found from a table by taking the degrees of freedom as:

No. of respondents - 1
(no. of respondents - 1) x (no. of items - 1).

If we have 100 persons and 8 items the degrees of freedom would thus be 99 and 693. For these, an 'F' of 1.40 is significant at the <.01 level. Hence the minimum level of alpha for significance at the <.01 level becomes .29. This test is seldom used because a reliability of .29, although significant, is never likely to be useful. Instead one takes as a guide the rough conventions given earlier.
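The arithmetic of this example can be reproduced as below. The critical 'F' of 1.40 is simply taken from the text rather than computed, and the function names are invented.

```python
def hoyt_degrees_of_freedom(n_respondents, n_items):
    """Degrees of freedom for the 'F' ratio in Hoyt's test."""
    df1 = n_respondents - 1
    df2 = (n_respondents - 1) * (n_items - 1)
    return df1, df2


def alpha_from_f(f_ratio):
    """alpha = 1 - 1/F."""
    return 1.0 - 1.0 / f_ratio


def f_from_alpha(alpha):
    """The same relation turned around: F = 1 / (1 - alpha)."""
    return 1.0 / (1.0 - alpha)


print(hoyt_degrees_of_freedom(100, 8))  # (99, 693)
print(round(alpha_from_f(1.40), 2))     # 0.29
```

To test an obtained alpha, one would convert it with `f_from_alpha` and compare the result with the tabled 'F' for these degrees of freedom.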

For example

Say we wish to construct a scale to measure "task orientation" (vide Bass, 1967). The Bass Orientation Inventory (Ori) already provides a measure of this construct but is characterized by very low test-retest reliability. Some serious problems have also shown up in validity studies. A preliminary look at the nature of this inventory is required. A typical item is:

5. I like my friends to:

A. Want to help others wherever possible.
B. Be loyal at all times.
C. Be intelligent and interested in a number of things.

C is said to be a task oriented response
B is said to be a self oriented response
A is said to be an interaction oriented response

To answer such a question, one has to choose one out of A, B and C. This is what is known as "ipsative" scoring. It purports to tell you what "type" of person the subject is. He must be a task-oriented type, an interaction oriented type or a self-oriented type. The scoring system prevents him from being high on all three orientations or low on all three orientations. This means that if a person dislikes the self-orientation or interaction orientations he is forced to present himself as a task oriented person even though he may in fact be simply indifferent to it. Ipsative scores also have the defect that they cannot meaningfully be correlated with one another. That is, although Bass presents two different types of "opposites" to task-orientation ("self-" and "interaction-"), we have no way, using his scale, of telling how opposite the two actually are.

This throws some light on a puzzling feature of the Ori's validity: Bass (1967) in summarizing a great range of studies with Ori, concludes as follows: "With a few interesting exceptions, in small groups and large organizations, the task-oriented person is upgraded by observers, peers and superiors. He is more tolerant of deviant opinion, conflicting ideas and directive supervision, although he does better himself as a permissive supervisor ...... The interaction-oriented person is downgraded generally ...."

One would think that the task-oriented person would place less value on, and give less attention to, interpersonal relations and hence be more isolated and less popular. One would expect him to be a directive supervisor -- little tolerant of individual peculiarities. AND YET THE OPPOSITE IS FOUND TO BE TRUE.

Of all things, one would most expect the task-oriented person to have a high need for achievement -- and yet Bass reports that there is in general no relationship. Bass, too, is disturbed by this point and endeavours to explain it away by saying that the classical n-Ach measures only tap fantasized achievement, not devotion to real achievement. To say this is to ignore the range of validity studies that have supported the n-Ach measures as indices of actual behaviour (see Brown, 1964).

Being careless of the individual and concentrating only on getting things done, the task-oriented person might also be expected to be perceived as authoritarian -- and yet again Bass reports that this is not so. One would expect the interaction oriented person to be most characterized by humanitarianism-radicalism and be tolerant of deviance in others -- and yet Bass reports that these things are more true of the task-oriented person.

How has all this come about? It could possibly be attributed to the ipsative scoring. To present oneself as self-oriented is strongly discouraged by our culture. To say a person is selfish is very pejorative indeed. Interaction-orientation too appears flabby and weak, indicative of insecurity and uncertainty. Our culture glorifies achievement, competition and success. Even if one is interaction-oriented, one does not acknowledge this as a primary goal. Given this situation, the normal, well adjusted person would express the values of his society and his milieu by choosing the task oriented response and avoiding the socially undesirable self- and interaction-oriented responses. Thus by forcing a normal person to make a choice we prevent him from expressing the degree of preference that he may also have for interaction orientation. Only the really dependent and ineffectual people choose this alternative, while only the maladjusted and egotistical choose the self-oriented alternative.

If the foregoing is true a great deal of the validity defect could be cured by "de-ipsatizing" the scale i.e. each of the alternatives could be made into an independent item as follows:

                                                                                Agree    ?    Disagree

1. I like my friends to want to help others wherever possible                     3      2       1
2. I like my friends to be loyal at all times                                     3      2       1
3. I like my friends to be intelligent and interested in a number of things       3      2       1

The above is only an illustration. The change needs to be done with rather more subtlety than this. As they stand, the above three would not be very good items simply because nobody could be expected to disagree with them. Who does not like helpful, loyal and intelligent friends? A better rewording might be:

1. It is important to me that my friends should want to help others wherever possible.
2. It is important to me that my friends should be loyal at all times.
3. It is important to me that my friends should be intelligent and interested in a number of things.

We would, then, endeavour to do two things in this project: 1). De-ipsatize "Ori", 2). Write two sets of new items which we would expect the task-oriented and interaction-oriented person (as usually conceived) to agree with. We should probably ignore self-orientation as being most clearly socio-pathological and least likely to be of importance.

Our new task-orientation and interaction-orientation scales will, of course, have to be composed of equal numbers of positively and negatively worded items. This is to avoid problems with acquiescent response set. If the two scales turn out to be highly negatively correlated, we might in fact combine them into one. We cannot know in advance whether this will happen. There is no a priori reason why it should not. Even if the mean score on task items is high and the mean score on interaction items is low, the two may still intercorrelate highly! The important thing is that the Likert procedure does not put a person in the position of having to make a choice -- one alternative of which may be more socially desirable than the other.

After we have constructed our new scales we will have to devise a validity study to test them. The nature of this is a matter of convenience. The two major types of study are the experiment and the survey. For the former, the Bales (1958 & 1960) system of interaction process analysis may be useful. His schema specifically provides categories for task oriented and socio-emotional responses. For the second type, a criterion-groups study is often easiest. One attempts to find two groups which differ on no important characteristic other than that measured by the scale. Differences in mean scores by the two groups can then be tested for significance by a 't' test and the results indicate the validity of the scale. One might, for instance, compare music and poetry clubs with abortion-law reform and civil-liberties groups. Alternatively, W.E.A. classes versus Tech. College classes, or simply men versus women in any group.
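For a criterion-groups comparison, the 't' statistic can be sketched as below. This assumes the ordinary pooled-variance form of the test, which the notes do not specify. The scale scores for the two groups are invented, and the resulting 't' would still have to be looked up in a table for the given degrees of freedom.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pooled_t(group_a, group_b):
    """Return (t, degrees of freedom) for two independent groups,
    assuming equal population variances."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * sample_variance(group_a) + (n_b - 1) * sample_variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, n_a + n_b - 2


# Invented scale totals for two hypothetical criterion groups.
group_expected_high = [34, 38, 41, 36, 40, 37]
group_expected_low = [29, 33, 30, 35, 28, 31]
t, df = pooled_t(group_expected_high, group_expected_low)
print(round(t, 2), df)
```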

If the new scale appears too long for convenient administration in our validation study, we might have to give it to a preliminary group solely for the purpose of reducing it in length by item-analysis procedures. We might thus have a reliability study followed by a validity study.

Whatever groups we give the scale to, we should also include some other scales in the questionnaire. This enables us to assess concurrent validity. We might thus include n-Ach measures, authoritarianism measures and humanitarianism measures because, on theoretical grounds, we would expect these constructs to be related to our constructs.

Lecture synopses



The idea that something invisible going on inside people's heads can be measured is a fairly bold one. Yet, as Thurstone said, "if a thing exists, it exists in some quantity and hence can, in principle, be measured". All of us do in fact talk as if we can measure other people's attitudes. We say things like: "She is a lot shyer than he is" or "he is so conservative it is unbelievable". Such statements imply that we can detect degrees of shyness or conservatism. Perhaps in part we infer these mental attributes from behaviour but also we infer them from things people say. The present course of lectures is largely devoted to making the latter type of inference more systematic and more careful.

A very useful thing to do when we wish to make our statements of quantity more precise is to make use of numbers. This is what is usually meant by "measurement". Let us look at the types of measurement there are:

1). Nominal measurement is where we just use numbers as labels -- e.g. runners in a race may be allocated to "Lane 1" or "Lane 2" or "Lane 3". "1" or "2" is just a label. There is no implication that Lane 2 has more of something than Lane 1 does.
2). Ordinal measurement is where numbers indicate some sort of ranking -- e.g. the horses in a horse race are said to come home 1st, 2nd and 3rd. They are ranked 1, 2 and 3.
3). Equal interval measurement is where equal differences in the numbers represent equal differences in amount -- e.g. the thermometer is graded so that the difference in amount of heat is the same between 10 and 20 degrees as it is between 30 and 40 degrees. You would have to add the same amount of heat to get the same temperature rise. Note that this is not so in ordinal measurement. The fact that horses A, B and C came 1st, 2nd and 3rd completely ignores the fact that Horse A may have been 2 yards ahead of Horse B but Horse B in turn may have been only 1 yard ahead of Horse C.
4). Ratio measurement. In the thermometer, where you place the zero is arbitrary. The Fahrenheit zero, for instance, is 32 Fahrenheit degrees below the Centigrade zero. Where the zero is placed to indicate "no heat at all" (as in the absolute scale of temperature most used by physicists) however, we have a ratio scale. In attitude measurement our scales are probably somewhere between ordinal and interval in their properties. As far as we can usually tell, little error results from treating them as equal interval scales (but see later).

A very popular form of attitude measurement (particularly with novices) is content analysis. In this a person is asked to express his opinions on some topic in as open-ended a way as possible and this text is used to rate him in some way. This starts out seeming a very attractive way of gathering data about people's mental world but soon becomes less attractive when we have to start trying to compare one person with another. Is (for instance) text A more conservative than text B? The only way we can proceed to such judgments with any semblance of care and objectivity is to have some sort of pre-stated scoring system or scoring guide. We count instances of particular types of utterance and give the person a higher or lower score on the basis of the number of such utterances he makes. Note the following example of a scoring system for achievement motivation from French's chapter in J.W. Atkinson's book Motives in Fantasy, Action and Society. French wants to count up the number of achievement-related images each person uses in responding to certain questions that require them to tell a story of some kind:

"Although it was likely that there would be a fairly high correlation between the scores obtained by merely tabulating the number of items in which the relevant imagery occurred and scores based on a content breakdown of those items, a scoring system involving such a break down was devised for several reasons. One was that increasing the number of possible scores per item would increase the possible range of scores, thus permitting increased sensitivity of the test. The second was that further investigation might reveal some types of responses to be more diagnostic than others. The final consideration was ease, reliability, and objectivity of scoring. Breaking the imagery into categories would permit more precise definitions of what should be scored. Scoring by categories would also provide an objective (numerical index) measure of the amount of imagery in an item and eliminate the necessity of a subjective judgement by the rater. Using McClelland's early scoring method as a point of departure, we set up categories which were comprehensive enough to handle our data and which suited our theoretical formulations. The resulting system is similar in many respects to McClelland's recently published version which was not available when this research was being done. The categories and a sample response for each appear in Table 1.

TABLE 1. Scoring Categories for the Test of Insight (Category followed by Example)

1. Desire for goal (A+)
"He is determined he will succeed in everything he does".
2. Goal directed activity (I+)
"He does it to make the 'other fellow' like him".
3. Personal qualifications for goal attainment (Q+)
"He has leadership ability".
4. Expectation of goal attainment (Ga+)
"He will make a name for himself".
5. Goal attainment (G+)
"He has lots of friends".
6. Positive affect to goal attainment (P)
"He has a feeling of satisfaction about a job well done".
7. Desire to avoid failure (A-)
"He hates to do anything wrong".
8. Activity directed toward avoiding failure (I-)
"He lets the 'other fellow' win so he won't get mad".
9. Lack of qualifications for, or possession of qualifications preventing, goal attainment (Q-)
"He hasn't enough ambition" "He is disagreeable".
10. Expectation of failure (Ga-)
"He will never profit much".
11. Defensive statements or rationalization (D)
"He pretends he doesn't care because he knows he can't".
12. Failure to attain goal (G-)
"He is an outcast".
13. Negative affect to failure (N)
"He is upset because he didn't pass".

Once the items have been scored, a number of scores can be computed for a given individual: a total score, which is the sum of all the categories scored for all the items; a total positive score, the sum of all the positive categories for all items; a corresponding negative score; and a score for any given category."
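The summary scores French describes amount to simple counting once each story has been coded. The sketch below uses the category codes from Table 1. It assumes (this is not stated in the passage) that categories 1 to 6 count as positive and categories 7 to 13, including D, as negative. The coded responses are invented.

```python
from collections import Counter

# Category codes from Table 1; the positive/negative grouping is an assumption.
POSITIVE = {"A+", "I+", "Q+", "Ga+", "G+", "P"}
NEGATIVE = {"A-", "I-", "Q-", "Ga-", "D", "G-", "N"}


def summarize(coded_items):
    """`coded_items` holds, for each test item, the list of category codes
    the rater assigned to one person's response."""
    counts = Counter(code for item in coded_items for code in item)
    unknown = set(counts) - POSITIVE - NEGATIVE
    if unknown:
        raise ValueError(f"unrecognised category codes: {sorted(unknown)}")
    positive = sum(counts[c] for c in POSITIVE)
    negative = sum(counts[c] for c in NEGATIVE)
    return {
        "total": positive + negative,
        "positive": positive,
        "negative": negative,
        "by_category": dict(counts),
    }


# Invented codings for one person's responses to three items.
scores = summarize([["A+", "G+"], ["N"], ["A+", "Ga-"]])
print(scores["total"], scores["positive"], scores["negative"])  # 5 3 2
```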



As will by now be obvious, open ended questioning places a very heavy load of arbitrary judgement on the researcher. The researcher has to do the job of putting people or their responses into categories. Surely it would be better to let the person himself do any categorizing needed? Why not let people rate themselves? This is what the usual "tick a box" questions are designed to do. Because of the limited options for response that such questions can offer, they often seem rather arbitrary and unsatisfactory but at least the arbitrariness is out in the open and not concealed in a scoring guide used long after the interview has taken place. Furthermore, it is a part of the aim of this course to explore ways of reducing the arbitrary and unsatisfactory aspects of "tick a box" questions. We will call the use of such questions "closed ended surveys". A set of such questions having a common theme is called a "scale".

Some technical terms: When a person gets a score on such a scale (assessed, for instance, by the number of answers he gives in a "key" direction) we want that score to be both "reliable" and "valid". "Reliable" implies that if you gave him the same scale to answer a month later he would get much the same score. "Valid" implies that if his score indicates that he is (for instance) highly conservative, then he really is highly conservative. A scale should also be internally consistent -- for details see the earlier section of these notes.



What if the people we are surveying tell lies about their opinions? How do we deal with that? We do two very important things:

1). We design the survey to minimize the motive for lying. Telling the truth is always easier than inventing a lie so if the survey is anonymous and generally non-threatening we generally have the problem to a considerable extent beaten before it arises. Some people however just do not know how to be honest. They may be completely unaware that they are misrepresenting themselves. These people are caught up by the second part of our strategy:
2). We use a "lie scale". This is a scale of items that make implausibly good assertions -- e.g. "I never tell lies". People who make a lot of such claims are probably not being very accurate in their self descriptions and we could, for instance, discard such people from our survey results.

Another problem is the people who "will say 'Yes' to anything": the careless acquiescers. Unfortunately, we cannot in this instance use a scale of "acquiescent tendency" because one thing we have found about such tendencies is that they are inconsistent from occasion to occasion. The types of statement attracting careless agreement vary from person to person. To cope with this, therefore, what we have to be very sure to do is to design our questionnaires so that a person gets a high score by disagreeing with some statements and agreeing with others. If he has to give an equal number of "Noes" (or "Disagrees") as "Yeses" (or "Agrees") to get a maximum score on whatever the scale is measuring, we say that the scale is "balanced". Note that an acquiescer would on such a scale get a middling score -- indicating, as we would wish, that we cannot place him one way or the other on the attribute in question. With a one-way worded scale, by contrast, such a person would artificially be shown as an extreme high scorer. Let us have an example of a balanced and an unbalanced scale of attitude to Aborigines. This is an imaginary scale for the purpose of illustration only.


1. Aborigines are dirty                    3      2      1
2. Aborigines know little of hygiene       3      2      1
4. Aborigines seldom work                  3      2      1

Note that as well as being unbalanced (all items anti-Aborigine) this scale is also offensive. We seem to be pushing a consistent negative "line". Such scales were often used in the early days of attitude research and such "leading the witness" can still be found among some careless researchers. If I answered all questions "Yes" out of sheer indifference or amusement at the task, I would get a score of 12 (4 x 3) and be wrongly shown as an extreme racist. Let us now look at an alternative "balanced" scale:


1. Aborigines are dirty                                   3      2      1
2. Aborigines get drunk a lot                             3      2      1
3. Aborigines understand hygiene as well as anyone        1      2      3
4. Aborigines are good workers given the chance           1      2      3

To come out as an extreme racist on this scale I have to consciously change my answers from "Yes" at first to "No" later. An extreme acquiescer would get a middle score of 8 (out of a possible range from 4 to 12). We would, in other words, add 3+3+1+1 to get his score from 4 "Yeses". Note that the scale also has the attraction of appearing neutral. Both racists and anti-racists can find something in it to agree with. We are not "leading the witness" or suggesting what sort of line we like the person to take.
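The effect of balancing on the acquiescer's score can be checked with a short sketch. The answer labels and function name are invented for the example; the scoring follows the two imaginary four-item scales above.

```python
# Score given to each answer on a forward-keyed and on a reverse-keyed item.
FORWARD = {"yes": 3, "?": 2, "no": 1}
REVERSED = {"yes": 1, "?": 2, "no": 3}


def scale_score(answers, reverse_keyed):
    """Add up item scores; `reverse_keyed` flags the oppositely scored items."""
    return sum(
        (REVERSED if rev else FORWARD)[answer]
        for answer, rev in zip(answers, reverse_keyed)
    )


UNBALANCED = [False, False, False, False]  # every item scored in the same direction
BALANCED = [False, False, True, True]      # items 3 and 4 scored in reverse

acquiescer = ["yes", "yes", "yes", "yes"]
print(scale_score(acquiescer, UNBALANCED))  # 12
print(scale_score(acquiescer, BALANCED))    # 8

# Only a deliberate switch from "yes" to "no" reaches the maximum on the balanced scale.
print(scale_score(["yes", "yes", "no", "no"], BALANCED))  # 12
```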

Everything so far has concerned Likert scales -- scales where item scores are added to get a scale score. There are also two other well-known types of scale which we shall briefly look at here. They are Thurstone scales and Guttman scales. Thurstone scales look like this:

Check the two items that come nearest to representing your opinion:

1. Aborigines are kind and admirable people.
2. Aborigines are O.K.
3. I like Aborigines.
4. Aborigines are not worth worrying about.
5. Aborigines are the dregs.

Say I check items 2 and 3. What is my score? We don't do any adding. To find my score we look up the "scale value" of each item, and my score becomes the midpoint between the two "scale values" we find.

How do we know the "scale value" for each item? We have to do a preliminary study. We give a large set of such items to a sub-set of the people we later wish to survey (we call this sub-set the "judges") and ask them to group the statements so that there are equal-appearing intervals between the groups in terms of degree of favourableness/unfavourableness. A statement in Group 1 might (for instance) be an extreme anti-Aborigine statement whilst a statement in Group 9 might be an extreme pro-Aborigine statement. Statements on whose placement the judges tend not to agree very much we discard. The average group into which a statement falls becomes the scale value of that statement.

Note again that I am here giving simplified examples for the purposes of illustration rather than real-life examples. Say then that the above 5 items had been judged fairly consistently by the judges and had on average been assigned by them to categories 8.7, 5.0, 6.5, 3.1 and 1.2. My score would then be 5.75 (the midpoint of 5.0 and 6.5).

The big disadvantage with Thurstone scales is that they have to be recalibrated for each different population to which they are applied. Each survey really then becomes two surveys. Since the results of Thurstone scaling have been found to correlate highly with the results of Likert scaling, the easier Likert scaling is normally used. Note, however, that the task put to the respondent is somewhat more attractive for Thurstone scales and the Thurstone scale also has a better claim to giving equal-interval measurements.
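The Thurstone calculation can be sketched as follows. This is an illustration only. The judges' placements for the single statement are made-up numbers, the function name thurstone_score is invented, and the value for item 4 is taken here as 3.1, which is an assumption (a value above 9 would fall outside the nine groups used in the example).

```python
from statistics import mean

# Hypothetical placements of ONE statement by five judges (groups 1 to 9).
judge_placements = [5, 5, 4, 6, 5]
print(mean(judge_placements))  # the statement's scale value: 5

# Scale values for the five items in the example (item 4 assumed to be 3.1).
scale_values = {1: 8.7, 2: 5.0, 3: 6.5, 4: 3.1, 5: 1.2}


def thurstone_score(checked_items, values):
    """Midpoint of the scale values of the two items the respondent checked."""
    first, second = checked_items
    return (values[first] + values[second]) / 2


print(thurstone_score((2, 3), scale_values))  # 5.75
```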

Guttman scales arose from Guttman's concern that the same score on a Likert scale can be gained in a variety of ways. A Likert scale score does not tell us which items the respondent said "Yes" to. Guttman therefore devised scales which had properties similar to the following:


1. I am over 5' tall...........1.....0
2. I am over 5'6" tall........1.....0
3. I am over 5'10" tall.......1.....0
4. I am over 6' tall...........1.....0

As I personally am just over 5'10" tall, I would answer "Yes" to Qs 1, 2 and 3 but "No" to Q4. Note that my score of "3" would tell Guttman not only that I am fairly tall but it would also tell him exactly what questions I answered to get that score. I could not, for instance, reasonably have answered "Yes" to Qs 1, 3 and 4 only. Guttman scaling, then, is a system that requires statements to be strictly ordered in their degree of extremity. Unfortunately, on any one topic there are generally very few attitude items that can be so ordered and this means that Guttman scales have to concern very narrow issues. It is hard to construct them with any breadth of coverage. Broad concepts such as "conservatism" are not very amenable to Guttman scaling. Narrow concepts such as "Attitude to the Franklin River dam" are more amenable.
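The defining property of a Guttman scale (once a respondent says "No" to an item, he says "No" to every more extreme item) is easy to check by computer. The sketch below is illustrative only, and the function names are invented for the purpose. Answers are listed in order of increasing extremity, with 1 for "Yes" and 0 for "No".

```python
def is_guttman_pattern(answers):
    """True if no "Yes" (1) ever follows a "No" (0) in the ordered answers."""
    return all(earlier >= later for earlier, later in zip(answers, answers[1:]))


def answers_from_score(score, n_items):
    """On a perfect Guttman scale the total score tells us every answer."""
    return [1] * score + [0] * (n_items - score)


print(is_guttman_pattern([1, 1, 1, 0]))  # True: "Yes" to Qs 1, 2 and 3 only
print(is_guttman_pattern([1, 0, 1, 1]))  # False: "Yes" to Qs 1, 3 and 4 only
print(answers_from_score(3, 4))          # [1, 1, 1, 0]
```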


LECTURER: Dr. John Ray

This lecture is primarily concerned with "hints from an old hand". How do we actually put a questionnaire together? Some points: Always numerically pre-code your answers if you can. For example:

1. "Happiness comes out of a brown bottle" (Tick a box to indicate Yes, ? or No)

is only for novices. Do it this way:

Circle a number to indicate your answer: .................................Yes....?...... No

1. "Happiness comes out of a brown bottle"...............................3.......2.......1

The reason the second way is preferable is that the computer likes numbers best. It has a hard time reading ticks in boxes. So if, like everybody else these days, you are going to give your data to someone to analyse on the computer, the data should already be in number form. Another example of numeric coding might be as follows:

State your religion now by circling one number below:

1. Catholic
2. Orthodox
3. Anglican
4. Other Protestant
5. Non-Christian religion
6. Belief in God only
7. Atheist/agnostic
8. Other (say which)

The "Other" category is a catchall which will primarily be used by people confused about where they fall. You might get "C of E" (recoded as 3), "Jewish" (recoded as 5) or "No religion" (recoded as 7).
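Recoding the "Other" answers can be sketched as a simple lookup. This is an illustration only: the table OTHER_RECODES and the function recode are invented names, and the lookup holds just the three write-in answers mentioned above. Anything not in the table is left as code 8 for a human to inspect.

```python
OTHER = 8  # code for "Other (say which)" in the list above

# Write-in answers (lower-cased) and the category each is recoded into.
OTHER_RECODES = {
    "c of e": 3,        # Anglican
    "jewish": 5,        # Non-Christian religion
    "no religion": 7,   # Atheist/agnostic
}


def recode(code, written_in=""):
    """Return the final numeric code for one respondent's religion answer."""
    if code != OTHER:
        return code
    return OTHER_RECODES.get(written_in.strip().lower(), OTHER)


print(recode(8, "C of E"))       # 3
print(recode(8, "Jewish"))       # 5
print(recode(8, "No religion"))  # 7
print(recode(2))                 # 2
```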


LECTURER: Dr. John Ray

Sampling: Centuries ago, mathematicians found that if you wanted to describe the attributes of a given population, you could do so nearly as well by studying only a sample of the population rather than the whole population. The sample did, however, have to be representative and this was generally achieved by making it a random sample. One of the surprising findings they made was that the sample could often be very small relative to the whole population and that the accuracy of the estimate provided by the sample depended almost entirely on the sample size -- not on the population size. Thus, to a given degree of accuracy, a sample of 500 might be enough to estimate the characteristics of a population of 30 million.
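The point about population size can be checked with the usual textbook approximation for the margin of error of a percentage. The sketch below is an illustration only. It assumes a true simple random sample, a 50/50 split of opinion (the worst case) and a 95% confidence level (z = 1.96), none of which is specified in these notes, and the function name margin_of_error is invented for the purpose.

```python
import math


def margin_of_error(n, population, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Includes the finite population correction, which is the only place
    where the population size enters the calculation.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    correction = math.sqrt((population - n) / (population - 1))
    return z * standard_error * correction


# Same sample size, very different population sizes.
for population in (100_000, 15_000_000, 30_000_000):
    print(population, round(margin_of_error(500, population), 4))

# Same population, different sample sizes.
for n in (40, 100, 500, 2000):
    print(n, round(margin_of_error(n, 15_000_000), 4))
```

Under these assumptions a sample of 500 gives a margin of a little over four percentage points either way, whether the population is 15 million or 30 million. Quadrupling the sample to 2,000 only halves the margin.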

Public opinion pollsters, for instance, generally get very accurate estimates of the opinions of all Australians (15 million in all) by sampling only about 2,000 people. Even 2,000 is in fact more than needed for many purposes. Pollsters use 2,000 overall mainly because they often want to break their results down by State and study each State separately. Some people argue that the ideal sample size is 40! Fewer than 100 is certainly common in psychological research. It all depends on the degree of accuracy one thinks necessary. As a rule of thumb, surveys of attitudes using the sort of scales advocated in these lectures should have between 100 and 200 people.

If the sample is to be broken down into sub-samples which will be studied separately (e.g. males and females), no sub-sample should fall below 100. This is, however, very approximate. For more details you need a statistician. Most social science researchers don't calculate the degree of accuracy of estimate they want. They just get as many people as they can.

The big snag in sampling is that all the things statistics books tell you about it assume a perfect sample. They are talking about black and white marbles being drawn out of a barrel at random. In social surveys this seldom if ever happens. Some of the marbles (people) we draw won't come out! They say (in effect) "go and draw some other marble". This is the problem of non-cooperation. Easily a third of the people we approach door to door refuse to be surveyed.

In applying mathematical statistical procedures (probability estimates) to social science data we are therefore speaking "as if" we had true random samples. We know we really do not. What we are asking is then not "What is the accuracy of this estimate?" but rather "What would be the accuracy of this estimate if we had a random sample?". The latter may still be an interesting question.

How do we calculate probabilities (pretending our samples are true random samples)? This depends on what we want to estimate, so let us use just one example. Say we want to estimate the probability of a given correlation being due to small-sample error (i.e. our sample being so small that the results are not representative of the whole population). First we have to calculate the correlation coefficient itself. Here is one commonly used shortcut procedure:

Say we have three people and we have their scores both on an intelligence test and on a scale of ambition (achievement motivation). Following are their scores:



We decide to process the data using the Spearman Rank Difference method. In this method the only information we use is the person's rank on the attributes concerned. So we rank our sample separately on the two attributes:



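Person.........IQ rank.......Ambition rank.....Difference (d)
1...............2nd............2.5th..............0.5 (1/2)
2...............1st............1st................0
3...............3rd............2.5th..............0.5 (1/2)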

Note that two people "shared second place" on ambition. In reality they shared 2nd and 3rd place so we score each as "2.5". We then simply subtract one rank from the other to get the difference ("d") in the final column. We then square each difference to get a quantity called "d-squared" and sum the "d-squareds". Thus .25 + 0.0 + .25 gives .5. This quantity is called the sum of d-squared. We insert it in the formula:

Rho = 1 - (6 times the sum of the d-squareds) / (n(n-squared - 1)) or:
Rho = 1 - (6 times .5) / (3(9 - 1))

where "n" is the number of people. The correlation (Rho) is then 1 - (3/24) or .875. This is a high correlation and indicates that there is a strong tendency for high intelligence to go with high achievement motivation (or ambition). Note that if the answer had been -.875 instead of .875, the correlation would have meant that there was an equally strong tendency for high intelligence to go with low ambition.
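The rank-difference calculation can be sketched in Python as follows. This is an illustration only. The function names are invented, and the raw ambition scores used to demonstrate the ranking of ties are made-up numbers, not the scores from the lecture. Note also that the shortcut formula is strictly an approximation when there are tied ranks.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) down; ties share the mean of their places."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for score in scores:
        first_place = ordered.index(score) + 1   # best place held by this score
        sharers = ordered.count(score)           # how many people share it
        ranks.append(first_place + (sharers - 1) / 2)
    return ranks


def spearman_rho(ranks_a, ranks_b):
    """Rho = 1 - (6 x sum of d-squared) / (n(n-squared - 1))."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical ambition scores in which persons 1 and 3 tie.
print(average_ranks([10, 15, 10]))  # [2.5, 1.0, 2.5]

# The ranks used in the worked example (persons 1, 2 and 3).
iq_ranks = [2, 1, 3]
ambition_ranks = [2.5, 1, 2.5]
print(spearman_rho(iq_ranks, ambition_ranks))  # 0.875
```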

But could this result be an outcome merely of sampling error? Could it be a freak result due to the very small sample? We check this by looking up the probability of this result in a table to be found in the back of various statistics books. We might, for instance, use the table taken from Experimental Design in Psychological Research by A.L. Edwards.

To use the table we need to know what "df" means and what probability level to use. "df" (degrees of freedom) simply means the number of people minus two. We had data from 3 people so our "df" is 1. The probability level we use is more arbitrary but social scientists generally want only a 5% chance of their result being due to sampling error. Some are more strict and accept only a 1% chance. Additionally, we have to decide whether or not we could specify the direction of the relationship in advance (in our case a positive or a negative correlation). If we can specify this, we use Edwards's table as it stands and look up the ".050" (5%) column. With one degree of freedom (df) we find that a correlation of .988 would be needed for significance.

Our actual correlation (.875) falls short of this. It does, in other words, have a greater than 5% chance of being due to sampling error (small sample size). We therefore reject our first impression that we have evidence for a relationship between intelligence and ambition.

If, on the other hand, we decided that we had no idea in advance of the direction of any possible relationship between the two attributes, we could not use Edwards's table exactly as it stands. We would have to use the column headed ".025" to get the 5% level for probability of sampling error. This level, we find, is .997 so in this case our result is even further from significance. A test in which the direction of the relationship is specified in advance is called a "one-sided" or "one-tailed" test (see the bottom line of Edwards's table).
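For the special case of df = 1 the tabled values can be reproduced without the book. The sketch below is an illustration under stated assumptions, not Edwards's own procedure. It assumes the table is the usual one derived from the t distribution, uses the fact that with one degree of freedom the t distribution has a simple closed-form quantile, and converts t to a correlation with r = t / sqrt(t-squared + df). The function name critical_r_df1 is invented for the purpose.

```python
import math


def critical_r_df1(one_tailed_alpha):
    """Smallest correlation reaching significance when df = 1 (three people)."""
    # With df = 1 the t distribution is the Cauchy distribution,
    # whose upper-tail quantile is tan(pi * (0.5 - alpha)).
    t = math.tan(math.pi * (0.5 - one_tailed_alpha))
    # Convert the critical t into a critical correlation (df = 1).
    return t / math.sqrt(t ** 2 + 1)


rho = 0.875
direction_known = critical_r_df1(0.05)     # the ".050" column
direction_unknown = critical_r_df1(0.025)  # the ".025" column

print(round(direction_known, 3), round(direction_unknown, 3))  # 0.988 0.997
print(rho >= direction_known)  # False: not significant
```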

Note that, as this course was designed for the beginner, I have throughout used the simplest possible examples. Many refinements and points of dispute have been glossed over so that the main point is made. A full year's course in statistics would be needed to get you to the point of being able to understand all the statistics commonly used in survey research. The correlation coefficient is, however, probably the most commonly used statistic in such research and an understanding of what it is will at least enable you to interpret the main effects in whatever analysis the computer puts out for you. NO-ONE now does any analyses by hand. The important thing now is to understand what an analysis means rather than understanding how to carry out such an analysis yourself.


BALES, R.F. Task roles and social roles in problem-solving groups. In Maccoby, E.E., Newcomb, T.M. & Hartley, E.L. Readings in social psychology. 3rd ed. N.Y. : Holt, Rinehart & Winston, 1958.

BALES, R.F. & STRODTBECK, F.L. Phases in group problem solving. In Cartwright, D. & Zander, A. Group Dynamics. 2nd ed. N.Y.: Harper & Row, 1960.

BASS, B.M. Social behaviour and the orientation inventory : A Review. Psych. Bulletin, 1967, 68, 260-292.

BROWN, R. Social Psychology. N.Y., Free Press, 1964.

CRONBACH, L.J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.

HOYT, C.J. Note on a simplified method of computing test reliability. Educ. and Psych. Meas. 1941, 1, 93-95.

LIKERT, R. The method of constructing an attitude scale. In: Fishbein, M. Readings in Attitude Theory and measurement. N.Y. : Wiley, 1967.

LORD, F.M. & NOVICK, M.R. Statistical theories of mental test scores. Reading, Mass. : Addison-Wesley, 1968.
