
INTRODUCTION
Citizens are willing to support scientific inquiry because they expect that scientific findings can be used to improve human well being. To meet these expectations, scientists develop and test theories to explain observations, under the presumption that when we know how things work we are in a better position to improve human well being. The procedure scientists use to develop and test theories is the attempt to falsify hypotheses, either those derived from theories or ones being proposed as part of theory building. Through falsification of hypotheses that do not accurately explain observations, scientists build support for theories that do explain observations.
To test hypotheses, scientists must accurately measure their concepts of interest. Physicists, for example, must accurately measure wind velocity and surface temperature to test the hypothesis that the greater the wind velocity, the greater the temperature on the surface of an aircraft's wing. Similarly, sociologists must accurately measure self-esteem and marital satisfaction to test the hypothesis that the greater the self-esteem, the greater the marital satisfaction.
To test hypotheses, therefore, scientists must measure what they think they are measuring (i.e., validity) and do so with an instrument that records observations in a consistent manner (i.e., reliability). It would be invalid, for example, for physicists to attempt to measure temperature with a yardstick. And trying to measure temperature with a thermometer that provides widely varying results with each attempt, while valid in the sense of using the correct instrument, would be unreliable because the observations would be inconsistent (one would never know which observation, if any, was correct). Likewise, sociologists, when measuring self-esteem, must be able to measure it rather than some other concept (e.g., locus-of-control is a similar, but different concept) and do so with an instrument that records self-esteem in a consistent manner.
A key methodological question for scientists, therefore, is, "How does one obtain valid and reliable measures of concepts?"
Physical and life scientists can directly observe most of their subject matter, which gives them a distinct advantage for assessing validity and reliability. Social scientists, on the other hand, do not have this advantage because most of the concepts they examine are abstract rather than concrete. How does one observe, for example, abstract concepts such as self-esteem and marital satisfaction? Such measurements can be made, and with a great deal of validity and reliability; doing so, however, requires a thorough understanding of these requirements and how to assess them in practice.
This page describes validity and reliability and procedures for assessing them in sociological research.
VALIDITY
Validity is the extent to which an instrument measures what it is supposed to measure (Carmines and Zeller, 1979).
Scientists distinguish among different types of validity, and across disciplines they refer to the same type of validity using different names, which can create confusion about what type of validity is being assessed. Basically, validity can be classified as either non-empirical or empirical.
Non-Empirical Validity
By "empirical," we mean "related to observation," or "data-based." The first form of validity we will discuss is non-empirical, meaning not related to observations or data analysis. Content validity (sometimes called face or representational validity) is the consensus (i.e., intersubjective, negotiated) opinion of the community of scholars as to whether the items used to measure a construct refer to the domain of the construct and to no other construct. The community of scholars is all persons trained within a scientific discipline, typically persons with a PhD degree. In other words, the issue of content validity is, "Does the community of scholars agree that a particular set of observed variables is appropriate to measure a particular physical entity or abstract construct?"
It is important to note that content validity is assessed only by the opinions of the community of scholars. There is no empirical assessment of content validity. The community of scholars believes that a measure has intuitive appeal or not, regardless of what empirical assessments might be brought forth (see example below).
Evaluations of content validity are critically important to all sciences. These evaluations can become problematic in the social sciences because often we investigate phenomena that cannot be observed with our physical senses.
Because content validity is critical to all social research, scholars undertake much effort to develop valid measures. One of the primary purposes of qualitative methods, for example, is to learn enough about subjects and the way they view the world to be able to measure their beliefs, behavior, and so forth in a manner that accurately reflects these phenomena; that is, in a manner that has content validity. If one is fairly certain of content validity (e.g., the definition of self-esteem and how to measure it for a certain population), one is more confident in using survey research methods.
Empirical Validity
Empirical validity is assessed by evaluating the extent to which a measure relates to other measures consistent with theoretically derived hypotheses concerning the concepts being measured (Carmines and Zeller, 1979). Empirical validity is intrinsically linked to theory. Hence, we assume a measure of a concept is valid if a theoretically derived hypothesis relating the concept to another concept receives support through observation and data analysis. If we hypothesize, for example, that the greater the self-esteem the greater the marital satisfaction, and this hypothesis receives empirical support (i.e., the null hypothesis is rejected), then we assume we have measured each concept correctly and that each concept has empirical validity.
Empirical validity is assessed by an evaluation of three types of relationships:
- A proposed causal relationship between the construct and its indicator variables (the "items" that make up the scale used to measure a construct; e.g., the ten questions in Rosenberg's self-esteem scale). This type of validity is called "construct validity."
- A proposed causal relationship between the construct and another construct that is theoretically linked with it. If we hypothesize, for example, that the greater the self-esteem, the greater the marital satisfaction, and this hypothesis is supported, then we assume we have measured self-esteem (and marital satisfaction) correctly. This type of validity is called "predictive validity."
- A proposed reciprocal relationship (correlation) between the construct and another construct that is theoretically linked with it. If we hypothesize, for example, that self-esteem is associated with locus-of-control, and this correlation is found to be statistically significant, then we assume we have measured self-esteem (and locus-of-control) correctly. This type of validity is called "concurrent validity." It is preferred that the "second construct," the one being used to validate the construct under consideration, has been validated in other studies. For example, if one wished to assess the empirical validity of a newly proposed measure of locus-of-control, then one might want to assess its concurrent validity with the Rosenberg self-esteem scale, which has been validated in many studies for more than 40 years.
Consider this example (see diagram):
Our construct of interest is self-esteem. We are using ten items (statements in Likert format on a questionnaire) to measure self-esteem. (A well-accepted set of items used to measure self-esteem is the Rosenberg Self-Esteem Scale.) Each of the ten items implies a hypothesis; for example, that the greater the self-esteem, the greater the score on Item #1 (e.g., "On the whole, I am satisfied with myself"). Also suppose we are hypothesizing that self-esteem predicts marital satisfaction. And suppose we are hypothesizing that self-esteem exists concurrently with locus-of-control (defined as a sense of being in control over one's life).
The hypotheses described above represent three ways in which empirical validity is assessed:
- First, construct validity is assessed by examining the parameter estimates for the effects of the construct on the indicator items (i.e., is each item strongly correlated with the overall summated score, which we use as the measure of the construct). If the t-ratio for the parameter estimate for the effect of self-esteem on Item #1 is large enough (e.g., 1.96 or above for a Type-I error rate of 5%), then the null hypothesis (i.e., "Self-esteem has no effect on the response to Item #1") is rejected and we can state that the rejection of the null hypothesis lends support for the validity of measuring self-esteem with Item #1. This procedure is repeated for each of the ten items. If the null is not rejected for an item, then it is removed from the scale. It might be replaced by another item or we might use just the nine remaining items to measure self-esteem.
- Second, predictive validity is assessed by examining the parameter estimate for the effect of self-esteem on marital satisfaction.
- Third, concurrent validity is assessed by examining the parameter estimate for the correlation between self-esteem and our measure of locus-of-control.
Assuming that all variables are measured with continuous quantitative data (i.e., interval-level data with seven or more response categories, for example), then the assessment of the parameter estimates relies upon the size of the t-ratio with respect to some specified level of probability of a type I error (at the .05 level, for example).
Suppose the t-ratio indicates that the null form of each hypothesis outlined above (i.e., "no relationship") is rejected. Then, the community of scholars can claim that the results lend support for the validity of the construct and its measurement.
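The item-level check described above can be sketched in Python. This is only an illustration under stated assumptions: the responses are invented, the function names are hypothetical, only three items are shown instead of ten, and the t-ratio is the standard one for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)), applied to the correlation between each item and the summated score.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_ratio_for_r(r, n):
    """t-ratio for testing the null hypothesis that a correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Invented data: rows are respondents, columns are three 7-point Likert items.
responses = [
    [6, 7, 6],
    [2, 3, 1],
    [5, 5, 6],
    [3, 2, 3],
    [7, 6, 7],
    [4, 4, 3],
    [1, 2, 2],
    [5, 6, 5],
]

totals = [sum(row) for row in responses]  # summated scale score

for item in range(3):
    scores = [row[item] for row in responses]
    r = pearson_r(scores, totals)
    t = t_ratio_for_r(r, len(responses))
    # 1.96 is the large-sample cutoff used in the text; with only eight
    # respondents the exact critical value would be larger.
    print(f"Item {item + 1}: r = {r:.2f}, t = {t:.2f}, keep = {abs(t) >= 1.96}")
```

An item whose t-ratio fell below the cutoff would be a candidate for removal or replacement, as described above.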
But what if the t-ratio indicates that the null hypothesis is not rejected?
If the data fail to show support for empirical validity, then there are three interpretations of these negative findings:
- there is no empirical validity for the construct of interest.
- the theory from which the hypothesis is derived is incorrect.
- there is a problem with the measurement/conceptualization of the other construct.
Consider, for example, these possible explanations for not supporting the hypothesis that the greater the self-esteem, the greater the marital satisfaction.
- First, it could be that marital satisfaction is not measured well.
- Second, it could be that our hypothesis is not supported (self-esteem does not predict marital satisfaction).
- Third, it could be that self-esteem is not measured well.
Note: It can always be the case that sampling error might lead to not rejecting the null hypothesis, meaning that one would not receive an indication of empirical validity because of a large standard error related to sampling error. Generally, data analysis procedures assume no sampling error and therefore social scientists tend to focus upon the three possibilities outlined above as reasons for not obtaining an indication of empirical validity.
Comparison of Non-Empirical and Empirical Validity
Assessments of content (non-empirical) validity depend entirely upon the opinions of the community of scholars. They have no empirical elements to them. Sometimes, scholars will confuse issues related to content validity with those related to construct validity.
Consider this example:
Dr. X claims he has measured self-esteem, which he defines as "the overall attitude that a person maintains with regard to their self worth and importance," by summing the responses to three statements on a questionnaire. Assume each statement is associated with a Likert response scale with seven categories ranging from 1 = "strongly disagree" to 7 = "strongly agree." Further assume that Dr. X's definition of self-esteem is well accepted by the community of scholars.
These are the three statements used by Dr. X to measure self-esteem:
- Overall, I am happy with my marriage.
- I am satisfied with my marriage.
- My marriage is a source of happiness for me.
Dr. X reports that the reliability of the construct is very high (note: we will discuss reliability later on this page), statistical analysis indicates that the statements "go together" (note: we will discuss statistical procedures, such as factor analysis, used to make such judgments later in this course), and the construct is highly correlated with a second assessment of subjects' self-esteem conducted while each subject delivers a speech to a large audience (for example, we might hire trained observers to note subjects' body language and speech characteristics that indicate a sense of confidence and self-worth and record their observations on a numerical scale where 1 = low self-esteem and 10 = high self-esteem). Given these statistical results, Dr. X claims content validity for his measure of self-esteem.
Dr. Y, on the other hand, argues that Dr. X's construct has no content validity because the statements used to measure self-esteem "do not seem to be related to the definition of self-esteem set forth by Dr. X." Dr. Y argues that the statements seem to measure "something more like marital satisfaction." Dr. Y has no empirical support for her claim. Dr. X replies that his measure of self-esteem is very well supported by the empirical results.
Who's right?
Let's suppose we, as members of the community of scholars, agree with Dr. Y that the statements seem to measure marital satisfaction. Then the community of scholars rejects Dr. X's measure of self-esteem, even though it seems to have strong empirical support.
But what about the strong correlation of Dr. X's measure with the second measure of self-esteem, the one calculated from the observations of each subject's speech delivered to a large audience? Let's assume the speech score has been judged by the community of scholars to have content validity and is reliably measured. The strong correlation of Dr. X's construct with the measure of body language does not offer support for the content validity of Dr. X's measure because the correlation could be accidental, meaning not theoretically justified, or it could represent a case of empirical validity; that is, theoretically we would expect Dr. X's items to be significantly correlated with self-esteem because we expect marital satisfaction to be significantly correlated with self-esteem.
Summary
- Validity is the extent to which an instrument measures what it is supposed to measure.
- Non-empirical validity (we will use the term "content validity"; other terms include face validity and representational validity) is the assessment made by the community of scholars that the items or procedures used to measure the concept accurately represent the definition of the concept.
- Empirical validity, in its three forms described here (construct, predictive, concurrent), is assumed if theoretically derived hypotheses related to the measurement of the concept or its relationship with other concepts are supported by data analysis (i.e., the corresponding null hypotheses are rejected), whether this analysis is based upon quantitative or qualitative data. Note that other terms are used to refer to empirical validity, such as criterion, internal, and external; these are just a few of the different terms for the same types of validity we discuss here.
RELIABILITY
Reliability is the extent to which a measurement instrument or procedure yields the same results on repeated trials (Carmines and Zeller, 1979). Without reliable measures, scientists cannot build or test theory, and therefore cannot develop productive and efficient procedures for improving human well being.
To illustrate the importance of reliability, we will discuss the testing of this research hypothesis: The greater the formal education, the greater the income. The diagram below shows two possible relationships between education (measured quantitatively on a continuum of 0-16 years of formal schooling) and income (measured quantitatively on a continuum as total dollar income from salary and wages before taxes). The green-colored line, Y1, shows that income is constant, regardless of education. The blue-colored line, Y2, shows that as education increases, income increases.
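[Diagram: income plotted against years of education. Y1 is a flat green line; Y2 is an upward-sloping blue line with red bell-shaped curves around it representing its standard error.]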
Suppose we want to know if the observed relationship between education and income, expressed by the line Y2, can be trusted to represent a "real" relationship between these two variables. Could this relationship have occurred by chance? Or, said another way: Is the parameter estimate (usually expressed as beta, which equals .42 in this example) for the observed relationship expressed by Y2 significantly different from zero, given some specified level of error?
To answer this question we write the research hypothesis (Ha): The greater the education, the greater the income. Because we can never know "true" reality through observation, but only discover false statements about reality, we test the null form of this hypothesis (Ho): There is no relationship between education and income. If the null hypothesis is rejected (falsified), then we can claim support for the research hypothesis. Of course, because we can never know reality we can never know a falsification of reality either. So, we give ourselves some margin of error in stating that the null hypothesis is rejected. It is common in the social sciences to allow ourselves a degree of error equal to 5%.
Our question, then, is this: Is the slope of Y2 (amount of change in income over years of education) different from that of the line Y1 (where the amount of change equals zero) given a degree of error equal to 5%?
To answer this question we must account for some amount of flexibility for the line Y2. That is, because some margin of error exists around the line Y2 we must account for the fact that Y2 is not a perfect representation of the relationship between education and income. What factors can cause this error to occur? First, we do not expect Y2 to provide a perfect prediction of income because other variables besides education affect income. For example, type of education obtained (a 4-year college degree in engineering versus a 4-year college degree in sociology) and sex (males typically receive more income than females for the same education) are two variables that can affect the relationship between education and income. Variance around the line Y2 that occurs due to the effects of other variables is called specification error. Second, our ability to generalize our findings to persons not in our sample is affected by sampling error. Third, we must account for errors that inevitably occur in measuring education and income: measurement error.
In considering the topic of reliability, we shall discuss only the effects of measurement error.
What about specification error? Assessments of reliability refer only to the consistency of a measuring instrument with repeated measurements over time. Therefore, considerations of specification error do not enter into evaluations of reliability. If we believe that our theory does not adequately reflect reality, then we add or subtract concepts from it, or perhaps rearrange concepts within a causal sequence in the theory.
What about sampling error? Assessments of reliability refer only to the consistency of a measuring instrument with repeated measurements over time. Therefore, considerations of sampling error do not enter into evaluations of reliability. If we know the characteristics of the population, then we can 1) compare the characteristics of our sample to the population, 2) make corrections by weighting the sample, and then 3) evaluate the reliability of the weighted data (we will talk about weighting later in this class). If we do not know the characteristics of the population, then we must assume we have made no sampling errors.
Recall that statistics used for testing hypotheses (such as the t-ratio) generally assume that measurement error occurs at random, meaning that positive and negative errors are expected to cancel out, so that the average measurement error is zero. The effects of measurement error are summarized within the standard error of the slope of Y2 (called the standard error of the parameter estimate for the effect of education on income). We should note that the standard error also reflects the range of observations for a variable; but given a certain range of observations, the standard error is reduced by accurate measurement. We should note also that standard error reflects the amount of sampling error; but sampling error must be accounted for by weighting, if possible, prior to evaluating reliability. Therefore, we will focus our discussion solely upon how measurement error affects hypothesis testing.
In the figure shown above, the standard error of the slope of Y2 is represented by the red-colored, bell-shaped curves about Y2. This margin of error shows the range of where Y2 might be, given measurement errors. The standard error can be thought of as the "wiggle room" for Y2: the amount that the straight line Y2 can shift due to measurement error. In this diagram, the bell-shaped curves (the standard error of the slope) are relatively small. A visual examination of standard error in relation to the difference in Y2 compared with Y1 shows that, even accounting for measurement error, Y2 has a different slope than Y1. No matter how much one allows Y2 to wiggle as a straight line within the bell-shaped curves that represent measurement error, it will not wiggle all the way to the flat line represented by Y1.
The statistic used to provide a number to this difference is the t-ratio, which is the estimated slope of Y2 (the parameter estimate for the effect of education on income) divided by its standard error. If the standard error is small, then even a small difference in the slope between Y2 and Y1 will yield a statistically significant t-ratio, which will provide support for the research hypothesis.
But what if the standard error is relatively large? What if the relationship contains much measurement error? Then, the bell-shaped curves around Y2 might be large enough for Y2 to wiggle sufficiently as a straight line so that it lies flat, like Y1. In this case, even though the line Y2 seems to indicate a positive relationship between education and income, it can wiggle so much that it might also lie flat, indicating no relationship between education and income. In statistical terminology, because we have divided the estimated slope by a large standard error, the t-ratio will be small and the null hypothesis will not be rejected.
Thus, we might fail to reject the null hypothesis (Ho: There is no relationship between education and income) for two reasons.
- First, the relationship is in fact not a significant one (the null is correct). This finding implies that the research hypothesis is flawed. If this is the case, then we need to revise or perhaps reject our theory.
- Second, we might fail to reject the null hypothesis because we have too much measurement error. If this second explanation is correct, then we have failed to support a good research hypothesis and a good theory!
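The effect of measurement error on the t-ratio can be sketched with a small Python example. Everything here is an assumption made for illustration: the education and income figures are invented (income in thousands of dollars), the function name is hypothetical, and the error is added to income only, using a fixed pattern that sums to zero and is unrelated to education so that the slope itself stays the same.

```python
import math

def slope_t_ratio(x, y):
    """Return (slope, standard error of the slope, t-ratio) for a
    simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, std_error, slope / std_error

# Invented data: years of education for ten respondents.
education = [8, 10, 12, 14, 16, 8, 10, 12, 14, 16]
# Fixed error pattern (sums to zero, uncorrelated with education).
error = [1, -2, 2, -1, 0, -1, 2, -2, 1, 0]

# Same underlying relationship, measured with little or much error.
income_precise = [10 + 2 * x + 1 * e for x, e in zip(education, error)]
income_noisy = [10 + 2 * x + 10 * e for x, e in zip(education, error)]

precise = slope_t_ratio(education, income_precise)
noisy = slope_t_ratio(education, income_noisy)

for label, (slope, se, t) in (("precise", precise), ("noisy", noisy)):
    print(f"{label}: slope = {slope:.2f}, SE = {se:.2f}, t = {t:.2f}")
```

With these invented numbers the slope is 2 in both cases, but the t-ratio falls from about 11.3 to about 1.1, so the same underlying relationship would fail to reach the 1.96 cutoff once the measurement error is large.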
Therefore, the only way to test hypotheses with the precision needed to build and test theory--and thereby improve human well being--is to collect data with as little measurement error as possible.
Summary
- Presumably, efficient and productive science will help improve society.
- We cannot have efficient and productive science without good theory.
- We cannot know we have good theory without having confidence in our hypothesis testing (to determine construct validity).
- We cannot have confidence in hypothesis testing without having confidence in the reliability of our instruments.
- We cannot have confidence in the reliability of our instruments without good measurement.
This course addresses techniques for obtaining reliable measurements in sociological investigations. We will review different methods and discuss their strengths and weaknesses for collecting data that is as reliable as possible for a given situation. The requirements for reliable data collection extend to both qualitative and quantitative data collection. The example above refers to statistical analysis of data that is recorded quantitatively, most likely through survey research methods. Sociologists must have reliable measures also for data collected with qualitative procedures.
RELIABILITY ASSESSMENT
Reliability assessment is the evaluation we make of how much measurement error we have experienced in collecting our data. To collect data with as little measurement error as possible we must:
- develop measures of constructs that are as valid as possible, and
- follow methodological procedures that have been shown to reduce measurement error as much as possible.
Suppose we want to know the width of our classroom from wall to wall. If we use a yardstick, for example, and measure very carefully, it is likely we will record very nearly the same number for this width across repeated trials. We would then have a very reliable measure of width, which would result in a small standard error for a parameter estimate that included width as a variable (assuming we also measured the other variable in the hypothesis with high reliability).
Now, suppose we want to measure the self-esteem of the people in our classroom. Unfortunately, we cannot observe self-esteem with our senses; so we must devise some type of measuring instrument that can be used with high reliability. How do we build a measuring instrument for self-esteem that provides very similar measures on repeated trials? We must meet three conditions:
- We must define self-esteem as precisely as possible by describing its conceptual domain (its meaning) without confusing it with the domain of other constructs. Self-esteem must have a clear definition and it must not be confused with the definition of other constructs.
- We must have good indicators of self-esteem, ones with high content validity.
- We must collect our data with as much accuracy as is possible.
Procedures Used to Assess Reliability
How do we assess reliability? In principle, we simply measure more than once. If we want to know the width of our classroom, for example, we might measure it carefully twice; if we obtain a very similar width twice then we gain some confidence that we have measured with a reliable instrument. The principle for assessing the reliability of an abstract concept is identical; we measure more than once.
Described below are four procedures for assessing reliability. The literature on reliability typically uses the term "test" to refer to a scale or index used to measure a concept. By test, we mean a set of statements on a questionnaire, a quantitative or qualitative evaluation by an observer, or some combination of these ways of measuring.
The Test-Retest Procedure
Same test, given two (or more) times. Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. This set of ten statements is administered to a subject twice. If the subject provides very similar answers both times, then one can assume that one has measured self-esteem with a reliable test.
Advantages
- This procedure has strong intuitive appeal; one is measuring more than once with the identical test.
- Because the same test is used for all measurements, one avoids the problem of developing more than one test.
Disadvantages
- History: Events might occur between the administration of the test that change the subject's score on the concept being measured. For example, a person's self-esteem might change from Time 1 to Time 2 because of some event in their life. If the researcher is unaware of events that might change subjects, then the researcher does not know if the subject has changed or if the test is unreliable.
- Maturation: Subjects change between administrations of the test because people change over their life course, not related to any specific event.
- Cueing: Subjects might adjust to the test itself. After providing answers to the test on a questionnaire, for example, subjects might think more in-depth about the concept (e.g., their self-esteem), re-evaluate it, and alter their responses between administrations of the test.
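One common way to put a number on "very similar answers both times" is the correlation between the scores from the two administrations. The sketch below assumes that approach; the scores are invented summated scores on a ten-item, 7-point scale, and the variable and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented self-esteem scores for eight subjects (possible range 10-70).
time_1 = [52, 38, 61, 45, 29, 57, 48, 33]  # first administration
time_2 = [50, 40, 63, 44, 31, 55, 49, 30]  # same test, given again later

test_retest_r = pearson_r(time_1, time_2)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```

A correlation near 1 indicates consistent scores across administrations; a low correlation could reflect an unreliable test or any of the disadvantages listed above (history, maturation, cueing).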
The Alternative Forms Procedure
Two different tests of the same concept, administered at two (or more) different times. Example: One might develop two sets of ten statements for two different questionnaires to measure self-esteem. The first questionnaire contains the first test of self-esteem and the second questionnaire contains the second test of self-esteem. If the subject provides very similar answers on both tests, then one can assume that one has measured a concept with reliable tests.
Advantages
- Less cueing: Because the tests differ, one incurs less of a problem with cueing than if the same test is used. Nevertheless, some cueing might occur because the subject will have had time to think about the concept between administration of the tests.
Disadvantages
- Must develop two tests, where each one has equal content validity in the opinion of the community of scholars.
- History.
- Maturation.
The Split-Halves Procedure
Same test, administer once, grade each half separately, compare grades from each half. Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. Subjects complete the test. The items on the test are split into two sub-tests of five items each. If the score on the first half is very similar to the score on the second half, then one can assume that one has measured a concept with a reliable test.
Advantages
- Need to develop just one test.
- No history.
- No maturation.
- No cueing.
Disadvantages
- The assessment of reliability depends upon how the items are separated. This issue might seem trivial, but experience shows that the similarity of scores on halves can differ significantly depending upon how the items are split into two sub-tests.
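The dependence on how the items are split can be illustrated with a short Python sketch. The responses are invented and deliberately constructed so that the first five items and the last five items behave somewhat differently; the names are hypothetical, and the correlation between half scores is used here as the measure of similarity.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented data: six respondents, ten 7-point Likert items each.
responses = [
    [6, 6, 7, 6, 6, 4, 5, 4, 4, 5],
    [2, 3, 2, 2, 3, 3, 3, 4, 3, 3],
    [5, 5, 5, 6, 5, 6, 6, 5, 6, 6],
    [3, 3, 4, 3, 3, 5, 4, 5, 5, 4],
    [7, 6, 7, 7, 6, 6, 7, 6, 6, 7],
    [4, 4, 3, 4, 4, 2, 2, 3, 2, 2],
]

# Split 1: first five items versus last five items.
first_half = [sum(row[:5]) for row in responses]
second_half = [sum(row[5:]) for row in responses]

# Split 2: odd-numbered items versus even-numbered items.
odd_items = [sum(row[0::2]) for row in responses]
even_items = [sum(row[1::2]) for row in responses]

r_first_second = pearson_r(first_half, second_half)
r_odd_even = pearson_r(odd_items, even_items)

print(f"First half vs. second half: r = {r_first_second:.2f}")
print(f"Odd items vs. even items:   r = {r_odd_even:.2f}")
```

For this invented data set the two splits give roughly 0.68 and 0.98, so the same ten responses look either moderately or highly reliable depending only on how the halves are formed.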
The Internal Consistency Procedure
Same test, administered once, score is based upon average similarity of responses to the items (average inter-item correlation). Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. Each response to the ten statements is considered as a sub-test. Then, the similarity in responses to each of the ten statements is used to assess reliability. If the subject responds similarly to all ten statements, then one can assume that one has measured a concept with a reliable test.
Advantages
- Need to develop just one test.
- No history.
- No maturation.
- No cueing.
Disadvantages
- The number of items in the test can affect the assessment of reliability. Internal consistency typically is assessed using Cronbach's alpha. The formula for this statistic incorporates the number of items on the test, such that for a given level of similarity among items (i.e., typically assessed as the average interitem correlation of the items on the test), the greater the number of items, the greater the alpha. Some scholars argue for simply using the average interitem correlation rather than Cronbach's alpha. It might seem like this approach would eliminate the problem of having a reliability assessment affected by the number of items in the test. But it loses the intuitive appeal of "measuring more than once." That is, Cronbach's alpha takes into account the number of times one measured; the interitem correlation does not.
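The way the number of items enters Cronbach's alpha can be seen in a brief Python sketch. It assumes the standardized form of alpha, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average interitem correlation; this page does not print the formula, and the function name and the value r = 0.30 are chosen only for illustration.

```python
def standardized_alpha(n_items, mean_interitem_r):
    """Standardized Cronbach's alpha from the number of items and the
    average interitem correlation."""
    return (n_items * mean_interitem_r) / (1 + (n_items - 1) * mean_interitem_r)

# Hold the average interitem correlation fixed and vary the number of items.
for k in (5, 10, 20):
    alpha = standardized_alpha(k, 0.30)
    print(f"{k:2d} items, average r = 0.30 -> alpha = {alpha:.2f}")
```

With the average interitem correlation held at 0.30, alpha rises from about 0.68 with five items to about 0.81 with ten and about 0.90 with twenty, even though the items are no more similar to one another.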
Summary
All procedures for assessing the reliability of an abstract concept have some disadvantages. In practice, most sociologists assess reliability by using Cronbach's alpha when they have the quantitative data to do so. They use procedures for assessing "inter-rater" reliability when they have qualitative data. We will learn more about inter-rater reliability in later sections of this course.
Types of Errors That Affect Validity and Reliability
Validity and reliability reflect the quality of the research design and its administration. Human error enters at all steps of the measurement process. These are elements of the research process that affect validity and reliability.
Measurement Error
- Human failure (i.e., mistakes),
- Unavoidable selective perception (i.e., all observations are "filtered" through our perceptions),
- Intentional misrepresentation by respondents, and
- Poor research design, including poorly worded questions, lack of content validity, or logical flaws in data collection.
Sampling Error
- Inaccurate coverage of possible observations.
Random Error
- This is a generic term given to errors created by lack of knowledge, inability to observe all events, fluctuations in "reality," etc.
Data Analysis Errors
- Type I Error: Reject the null when it is true. This type of error results in a false indication of a significant relationship (e.g., you think you have observed a causal relationship, but you have not).
- Type II Error: Do not reject the null when it is false. This type of error results in a false indication of no significant relationship (e.g., you think no causal relationship exists, but it does).
References
Carmines, Edward G. and Richard A. Zeller. 1979. Reliability and Validity Assessment. Beverly Hills, CA: Sage.