INTRODUCTION

Citizens are willing to support scientific inquiry because they expect that scientific findings can be used to improve human well-being. To meet these expectations, scientists develop and test theories to explain observations, under the presumption that when we know how things work we are in a better position to improve human well-being. The procedure scientists use to develop and test theories is the attempt to falsify hypotheses, either those derived from theories or ones proposed as part of theory building. Through falsification of hypotheses that do not accurately explain observations, scientists build support for theories that do explain observations.

To test hypotheses, scientists must accurately measure their concepts of interest. Physicists, for example, must accurately measure wind velocity and surface temperature to test the hypothesis that the greater the wind velocity, the greater the temperature on the surface of an aircraft's wing. Similarly, sociologists must accurately measure self-esteem and marital satisfaction to test the hypothesis that the greater the self-esteem, the greater the marital satisfaction.

To test hypotheses, therefore, scientists must measure what they think they are measuring (i.e., validity) and do so with an instrument that records observations in a consistent manner (i.e., reliability). It would be invalid, for example, for physicists to attempt to measure temperature with a yardstick. And trying to measure temperature with a thermometer that provides widely varying results with each attempt, while valid in the sense of using the correct instrument, would be unreliable because the observations would be inconsistent (one would never know which observation, if any, was correct). Likewise, sociologists, when measuring self-esteem, must be able to measure it rather than some other concept (e.g., locus-of-control is a similar but distinct concept) and do so with an instrument that records self-esteem in a consistent manner.

A key methodological question for scientists, therefore, is, "How does one obtain valid and reliable measures of concepts?"

Physical and life scientists can directly observe most of their subject matter, which gives them a distinct advantage for assessing validity and reliability. Social scientists, on the other hand, do not have this advantage because most of the concepts they examine are abstract rather than concrete. How does one observe, for example, abstract concepts such as self-esteem and marital satisfaction? Such measurements can be made, and with a great deal of validity and reliability; doing so, however, requires a thorough understanding of these requirements and how to assess them in practice.

This page describes validity and reliability and procedures for assessing them in sociological research.

VALIDITY

Validity is the extent to which an instrument measures what it is supposed to measure (Carmines and Zeller, 1979).

Scientists distinguish among different types of validity, and different disciplines sometimes refer to the same type of validity by different names, which can create confusion about what type of validity is being assessed. Basically, validity can be classified as either non-empirical or empirical.

Non-Empirical Validity

By "empirical," we mean "related to observation," or "data-based." The first form of validity we will discuss is non-empirical, meaning not related to observations or data analysis. Content validity (sometimes called face or representational validity) is the consensus (i.e., intersubjective, negotiated) opinion of the community of scholars as to whether the items used to measure a construct refer to the domain of the construct and to no other construct (i.e., the community of scholars is all persons trained within a scientific discipline, typically persons with a PhD degree). In other words, the issue of content validity is, "Does the community of scholars agree that a particular set of observed variables is appropriate to measure a particular physical entity or abstract construct?"

It is important to note that content validity is assessed only by the opinions of the community of scholars. There is no empirical assessment of content validity. The community of scholars believes that a measure has intuitive appeal or not, regardless of what empirical assessments might be brought forth.

Evaluations of content validity are critically important to all sciences. These evaluations can become problematic in the social sciences because often we investigate phenomena that cannot be observed with our physical senses.

Because content validity is critical to all social research, scholars undertake much effort to develop valid measures. One of the primary purposes of qualitative methods, for example, is to learn enough about subjects and the way they view the world to be able to measure their beliefs, behavior, and so forth in a manner that accurately reflects these phenomena; that is, in a manner that has content validity. If one is fairly certain of content validity (e.g., the definition of self-esteem and how to measure it for a certain population), one is more confident in using survey research methods.

Empirical Validity

Empirical validity is assessed by evaluating the extent to which a measure relates to other measures in a manner consistent with theoretically derived hypotheses concerning the concepts being measured (Carmines and Zeller, 1979). Empirical validity is intrinsically linked to theory. Hence, we assume a measure of a concept is valid if a theoretically derived hypothesis relating the concept to another concept receives support through observation and data analysis. If we hypothesize, for example, that the greater the self-esteem, the greater the marital satisfaction, and this hypothesis receives empirical support (i.e., the null hypothesis is rejected), then we assume we have measured each concept correctly and that each concept has empirical validity.

Empirical validity is assessed by an evaluation of three types of relationships:
  1. A proposed causal relationship between the construct and its predictor variables ("items" that make up the scale used to measure a construct; e.g., the ten questions in Rosenberg's self-esteem scale). This type of validity is called "construct validity."
  2. A proposed causal relationship between the construct and another construct that is theoretically linked with it. If we hypothesize, for example, that the greater the self-esteem, the greater the marital satisfaction, and this hypothesis is supported, then we assume we have measured self-esteem (and marital satisfaction) correctly. This type of validity is called "predictive validity."
  3. A proposed reciprocal relationship (correlation) between the construct and another construct that is theoretically linked with it. If we hypothesize, for example, that self-esteem is associated with locus-of-control, and this correlation is found to be statistically significant, then we assume we have measured self-esteem (and locus-of-control) correctly. This type of validity is called "concurrent validity." It is preferred that the "second construct," the one being used to validate the construct under consideration, has been validated in other studies. For example, if one wished to assess the empirical validity of a newly proposed measure of locus-of-control, then one might want to assess its concurrent validity with the Rosenberg self-esteem scale, which has been validated in many studies for more than 40 years. A worked sketch of this computation follows the note below.
Note: Sampling error can also lead to a failure to reject the null hypothesis, in which case one would not obtain an indication of empirical validity simply because sampling error inflated the standard error. Because data analysis procedures generally assume no sampling error, however, social scientists tend to focus upon the three relationships outlined above when an indication of empirical validity is not obtained.
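
To make the concurrent validity check in item 3 concrete, here is a minimal sketch in Python. The data are simulated and every number is hypothetical (a true association is deliberately built into the simulation so the example has a signal to find); in practice each vector would hold total scale scores for the same respondents.

```python
import numpy as np
from scipy import stats

# Hypothetical data: total scale scores for 100 respondents.
rng = np.random.default_rng(42)
rosenberg_scores = rng.normal(30, 5, size=100)  # validated self-esteem scale
new_scale_scores = 0.6 * rosenberg_scores + rng.normal(0, 4, size=100)  # new measure

# Concurrent validity: is the new scale correlated with the validated one?
r, p_value = stats.pearsonr(new_scale_scores, rosenberg_scores)
print(f"r = {r:.2f}, p = {p_value:.4f}")

if p_value < 0.05:  # the conventional 5% error level
    print("Null hypothesis rejected: an indication of concurrent validity.")
else:
    print("Null hypothesis not rejected: no indication of concurrent validity.")
```

The same computation, interpreted against a theoretically derived causal hypothesis rather than a simple correlation, would serve as a check on predictive validity (item 2).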

Comparison of Non-Empirical and Empirical Validity

Assessments of content (non-empirical) validity depend entirely upon the opinions of the community of scholars. They have no empirical elements to them. Sometimes, scholars will confuse issues related to content validity with those related to construct validity.

RELIABILITY

Reliability is the extent to which a measurement instrument or procedure yields the same results on repeated trials (Carmines and Zeller, 1979). Without reliable measures, scientists cannot build or test theory, and therefore cannot develop productive and efficient procedures for improving human well-being.

To illustrate the importance of reliability, we will discuss the testing of this research hypothesis: The greater the formal education, the greater the income. The diagram below shows two possible relationships between education (measured quantitatively on a continuum of 0-16 years of formal schooling) and income (measured quantitatively on a continuum as total dollar income from salary and wages before taxes). The green-colored line, Y1, shows that income is constant, regardless of education. The blue-colored line, Y2, shows that as education increases, income increases.


[Figure: graph of the lines Y1 and Y2, with income on the vertical axis and years of education on the horizontal axis]


Suppose we want to know if the observed relationship between education and income, expressed by the line Y2, can be trusted to represent a "real" relationship between these two variables. Could this relationship have occurred by chance? Or, said another way: Is the parameter estimate (usually expressed as beta, which equals .42 in this example) for the observed relationship expressed by Y2 significantly different from zero, given some specified level of error?

To answer this question we write the research hypothesis (Ha): The greater the education, the greater the income. Because we can never know "true" reality through observation, but only discover false statements about reality, we test the null form of this hypothesis (Ho): There is no relationship between education and income. If the null hypothesis is rejected (falsified), then we can claim support for the research hypothesis. Of course, because we can never know reality, we can never know a falsification of reality either; so we give ourselves some margin of error in stating that the null hypothesis is rejected. It is common in the social sciences to allow a degree of error equal to 5%.

Our question, then, is this: Is the slope of Y2 (amount of change in income over years of education) different from that of the line Y1 (where the amount of change equals zero) given a degree of error equal to 5%?

To answer this question we must allow some amount of flexibility for the line Y2. That is, because some margin of error exists around the line Y2, we must account for the fact that Y2 is not a perfect representation of the relationship between education and income. What factors can cause this error to occur? First, we do not expect Y2 to provide a perfect prediction of income because other variables besides education affect income. For example, type of education obtained (a 4-year college degree in engineering versus a 4-year college degree in sociology) and sex (males typically receive more income than females for the same education) are two variables that can affect the relationship between education and income. Variance around the line Y2 that occurs due to the effects of other variables is called specification error. Second, our ability to generalize our findings to persons not in our sample is affected by sampling error. Third, we must account for errors that inevitably occur in measuring education and income: measurement error.

In considering the topic of reliability, we shall discuss only the effects of measurement error. Recall that statistics used for testing hypotheses (such as the t-ratio) generally assume that measurement error occurs at random, meaning that measurement errors are expected to cancel one another out across observations. The effects of measurement error are summarized within the standard error of the slope of Y2 (called the standard error of the parameter estimate for the effect of education on income). We should note that the standard error also reflects the range of observations for a variable; but given a certain range of observations, the standard error is reduced by accurate measurement. We should note also that the standard error reflects the amount of sampling error; but sampling error must be accounted for by weighting, if possible, prior to evaluating reliability. Therefore, we will focus our discussion solely upon how measurement error affects hypothesis testing.

In the figure shown above, the standard error of the slope of Y2 is represented by the red-colored, bell-shaped curves about Y2. This margin of error shows the range of where Y2 might be, given measurement errors. The standard error can be thought of as the "wiggle room" for Y2: the amount that the straight line Y2 can shift due to measurement error. In this diagram, the bell-shaped curves (the standard error of the slope) are relatively small. A visual examination of standard error in relation to the difference in Y2 compared with Y1 shows that, even accounting for measurement error, Y2 has a different slope than Y1. No matter how much one allows Y2 to wiggle as a straight line within the bell-shaped curves that represent measurement error, it will not wiggle all the way to the flat line represented by Y1.

The statistic used to provide a number to this difference is the t-ratio, which is the estimated slope of Y2 divided by its standard error. If the standard error is small, then even a small difference between the slopes of Y2 and Y1 will yield a statistically significant t-ratio, which will provide support for the research hypothesis.

But what if the standard error is relatively large? What if the relationship contains much measurement error? Then, the bell-shaped curves around Y2 might be large enough for Y2 to wiggle sufficiently as a straight line so that it lies flat, like Y1. In this case, even though the line Y2 seems to indicate a positive relationship between education and income, it can wiggle so much that it might also lie flat, indicating no relationship between education and income. In statistical terminology, because we have divided the estimated slope by a large standard error, the t-ratio will be small and the null hypothesis will not be rejected.
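
The following Python sketch illustrates this point on simulated data; the coefficients, sample size, and error variances are hypothetical, chosen only for illustration. The education-income line is fit twice, once with income recorded cleanly and once with heavy random measurement error added, and the t-ratio (slope divided by its standard error) shrinks in the second fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
education = rng.uniform(0, 16, size=n)  # years of formal schooling
income = 20_000 + 2_500 * education + rng.normal(0, 8_000, size=n)

# Fit the line Y2 with income measured cleanly.
clean = stats.linregress(education, income)

# Fit again after adding large random measurement error to income.
noisy = stats.linregress(education, income + rng.normal(0, 30_000, size=n))

# t-ratio = estimated slope divided by its standard error.
for label, fit in [("clean", clean), ("noisy", noisy)]:
    print(f"{label}: slope = {fit.slope:,.0f}, "
          f"SE = {fit.stderr:,.0f}, t = {fit.slope / fit.stderr:.2f}")
```

The underlying relationship is the same in both fits; only the inflated standard error makes the noisy version harder to distinguish from the flat line Y1.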

Thus, we might fail to reject the null hypothesis (Ho: There is no relationship between education and income) for two reasons.
  1. First, the relationship is in fact not a significant one (the null is correct). This finding implies that the research hypothesis is flawed. If this is the case, then we need to revise or perhaps reject our theory.
  2. Second, we might fail to reject the null hypothesis because we have too much measurement error. If this second explanation is correct, then we have wrongly discarded a good research hypothesis and failed to support a good theory!
Therefore, the only way to test hypotheses with the precision needed to build and test theory--and thereby improve human well-being--is to collect data with as little measurement error as possible.

Summary
  1. Presumably, efficient and productive science will help improve society.
  2. We cannot have efficient and productive science without good theory.
  3. We cannot know we have good theory without having confidence in our hypothesis testing (to determine construct validity).
  4. We cannot have confidence in hypothesis testing without having confidence in the reliability of our instruments.
  5. We cannot have confidence in the reliability of our instruments without good measurement.
This course addresses techniques for obtaining reliable measurements in sociological investigations. We will review different methods and discuss their strengths and weaknesses for collecting data that are as reliable as possible for a given situation. The requirements for reliable data collection extend to both qualitative and quantitative data collection. The example above refers to statistical analysis of data recorded quantitatively, most likely through survey research methods. Sociologists must also have reliable measures for data collected with qualitative procedures.

RELIABILITY ASSESSMENT

Reliability assessment is the evaluation we make of how much measurement error we have experienced in collecting our data. To collect data with as little measurement error as possible we must:
  1. develop measures of constructs that are as valid as possible, and
  2. follow methodological procedures that have been shown to reduce measurement error as much as possible.
Suppose we want to know the width of our classroom from wall to wall. If we use a yardstick, for example, and measure very carefully, it is likely we will record very nearly the same number for this width across repeated trials. We would then have a very reliable measure of width, which would result in a small standard error for a parameter estimate that included width as a variable (assuming we also measured the other variable in the hypothesis with high reliability).

Now, suppose we want to measure the self-esteem of the people in our classroom. Unfortunately, we cannot observe self-esteem with our senses; so we must devise some type of measuring instrument that can be used with high reliability. How do we build a measuring instrument for self-esteem that provides very similar measures on repeated trials? We must meet three conditions:
  1. We must define self-esteem as precisely as possible by describing its conceptual domain (its meaning) without confusing it with the domain of other constructs. Self-esteem must have a clear definition and it must not be confused with the definition of other constructs.
  2. We must have good indicators of self-esteem, ones with high content validity.
  3. We must collect our data with as much accuracy as is possible.
Procedures Used to Assess Reliability

How do we assess reliability? In principle, we simply measure more than once. If we want to know the width of our classroom, for example, we might measure it carefully twice; if we obtain a very similar width twice then we gain some confidence that we have measured with a reliable instrument. The principle for assessing the reliability of an abstract concept is identical; we measure more than once.

Described below are four procedures for assessing reliability. The literature on reliability typically uses the term "test" to refer to a scale or index used to measure a concept. By test, we mean a set of statements on a questionnaire, a quantitative or qualitative evaluation by an observer, or some combination of these ways of measuring.

The Test-Retest Procedure

Same test, given two (or more) times. Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. This set of ten statements is administered to a subject twice. If the subject provides very similar answers both times, then one can assume that one has measured self-esteem with a reliable test.
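
A minimal sketch in Python of the test-retest computation, using hypothetical total scores for eight subjects; reliability is assessed as the correlation between the two administrations.

```python
import numpy as np

# Hypothetical total scores on the ten-item self-esteem test, two administrations.
time1 = np.array([32, 28, 35, 30, 27, 33, 29, 31])
time2 = np.array([31, 29, 34, 30, 26, 34, 28, 32])

# Test-retest reliability: correlation between the two administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")  # values near 1.0 suggest a reliable test
```

The same computation serves the alternative forms procedure described next; one would simply correlate total scores on the first test with total scores on the second.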

The Alternative Forms Procedure

Two tests, given two (or more) times. Example: One might develop two sets of ten statements for two different questionnaires to measure self-esteem. The first questionnaire contains the first test of self-esteem and the second questionnaire contains the second test of self-esteem. If the subject provides very similar answers both times, then one can assume that one has measured a concept with reliable tests.

The Split-Halves Procedure

Same test, administered once; each half is scored separately and the two scores are compared. Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. Subjects complete the test. The items on the test are split into two sub-tests of five items each. If the score on the first half is very similar to the score on the second half, then one can assume that one has measured a concept with a reliable test.
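
A sketch of the split-halves computation on hypothetical item responses. The Spearman-Brown step is a standard correction for the fact that each half contains only five items; it projects the half-test correlation up to an estimate of full-test reliability.

```python
import numpy as np

# Hypothetical responses: 8 subjects x 10 items, each scored 1-5.
rng = np.random.default_rng(1)
true_esteem = rng.normal(3.0, 0.8, size=(8, 1))
items = np.clip(np.rint(true_esteem + rng.normal(0, 0.5, size=(8, 10))), 1, 5)

# Split the test and score each half separately (odd vs. even items).
half1 = items[:, 0::2].sum(axis=1)
half2 = items[:, 1::2].sum(axis=1)

# Compare the two halves: their correlation is the half-test reliability.
r_half = np.corrcoef(half1, half2)[0, 1]

# Spearman-Brown correction: estimated reliability of the full ten-item test.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, Spearman-Brown estimate = {r_full:.2f}")
```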

The Internal Consistency Procedure

Same test, administered once, score is based upon average similarity of responses to the items (average inter-item correlation). Example: One might develop one set of ten statements to include on a questionnaire to measure self-esteem. Each response to the ten statements is considered as a sub-test. Then, the similarity in responses to each of the ten statements is used to assess reliability. If the subject responds similarly to all ten statements, then one can assume that one has measured a concept with a reliable test.
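
In practice this procedure is usually summarized by Cronbach's alpha (discussed in the summary below). A minimal sketch, assuming a hypothetical subjects-by-items matrix of responses:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a subjects-by-items matrix of scores."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of subjects' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 8 subjects x 10 items, each scored 1-5.
rng = np.random.default_rng(2)
true_esteem = rng.normal(3.0, 0.8, size=(8, 1))
items = np.clip(np.rint(true_esteem + rng.normal(0, 0.5, size=(8, 10))), 1, 5)

print(f"alpha = {cronbach_alpha(items):.2f}")  # higher values indicate greater reliability
```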

Summary

All procedures for assessing the reliability of an abstract concept have some disadvantages. In practice, most sociologists assess reliability by using Cronbach's alpha when they have the quantitative data to do so. They use procedures for assessing "inter-rater" reliability when they have qualitative data. We will learn more about inter-rater reliability in later sections of this course.

Types of Errors That Affect Validity and Reliability

Validity and reliability reflect the quality of the research design and its administration. Human error enters at all steps of the measurement process. The following elements of the research process affect validity and reliability:

  1. Measurement Error
  2. Sampling Error
  3. Random Error
  4. Data Analysis Errors

REFERENCES

Carmines, Edward G. and Richard A. Zeller. 1979. Reliability and Validity Assessment. Beverly Hills, CA: Sage.