Reliability and Validity

Measuring the Reliability and Validity of CLEP Exams

CLEP uses rights-only scoring, which means that the exams are scored without a penalty for incorrect guessing. The test taker's raw score is simply the number of questions answered correctly. However, this raw score is not reported. Instead, it is converted into a scaled score by a process that adjusts for the level of question difficulty on the different forms of the test.

The scaled scores are reported on a scale of 20–80. Because the different forms of the test are not always equal in difficulty, raw-to-scale score conversions may differ from form to form. An easier form means a higher raw score is needed to attain a given scaled score.
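
The effect of form difficulty on the conversion can be illustrated with a small lookup sketch. The two conversion tables below are hypothetical values invented for illustration, not actual CLEP conversions, and the function names are assumptions as well.

```python
# Hypothetical raw-to-scaled conversion tables (raw score -> scaled score).
# These numbers are made up for illustration; real tables come from the
# testing program's own score-conversion process.
CONVERSIONS = {
    "easier_form": {40: 46, 45: 50, 50: 54},
    "harder_form": {40: 50, 45: 54, 50: 58},
}


def scaled_score(form, raw):
    """Look up the scaled score for a raw score on a given form."""
    return CONVERSIONS[form][raw]


def raw_needed(form, target_scaled):
    """Smallest listed raw score whose scaled score reaches the target."""
    table = CONVERSIONS[form]
    return min(raw for raw, scaled in table.items() if scaled >= target_scaled)


# On the easier form, more correct answers are needed for the same scaled score.
print(raw_needed("easier_form", 50))  # 45
print(raw_needed("harder_form", 50))  # 40
```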


The reliability of the test scores of a group of examinees is commonly described by two statistics: the reliability coefficient and the standard error of measurement (SEM).

The reliability coefficient is the correlation between the scores those examinees get (or would get) on two independent replications of the measurement process, for example on two forms of the test that have no questions in common. It is intended to indicate the stability of test takers' scores and is usually expressed as a number ranging from .00 to 1.00. A value of .00 indicates a total lack of stability, while a value of 1.00 indicates perfect stability. Statisticians use an internal-consistency measure to calculate the reliability coefficients for CLEP exams. This involves looking at the statistical relationships among responses to individual multiple-choice questions to estimate the reliability of the total test score. The formula used is known as Kuder-Richardson 20, or KR-20, which for questions scored as right or wrong is equivalent to a more general formula called coefficient alpha.
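
As a rough illustration of the KR-20 calculation, the sketch below applies the standard textbook formula, KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores), to a tiny made-up response matrix. The data and the function name are assumptions for illustration; this is not the program's actual scoring code.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    # Variance of the total (number-correct) scores across examinees.
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    # Sum over items of p * q, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Hypothetical responses: 5 examinees, 4 questions (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.8
```

The population variance is used for the total scores so that it matches the p * q item variances, which are also population variances.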

The SEM is an estimate of the amount by which a typical test taker's score differs from the average of the scores that the test taker would have gotten on all possible editions of the test. This hypothetical average over all editions of the test is referred to as the true score. The SEM is expressed in the score units of the test. Intervals extending one standard error above and below the true score for a test taker will include about 68% of that test taker's obtained scores. Similarly, intervals extending two standard errors above and below the true score will include about 95% of the test taker's obtained scores. The SEM is inversely related to the reliability coefficient: the higher the reliability, the smaller the SEM. If the reliability coefficient of the test were 1.00 (if it perfectly measured the candidate's knowledge), the SEM would be zero.
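
The inverse relationship can be made concrete with the classical test theory formula SEM = SD * sqrt(1 - reliability), where SD is the standard deviation of the scores. The sketch below assumes that formula; the standard deviation, reliability, and true score values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(true_score, sem, n_sems=1):
    """Interval extending n_sems standard errors around the true score."""
    return (true_score - n_sems * sem, true_score + n_sems * sem)


# Hypothetical values: score SD of 10 scaled-score points, reliability of .91.
sem = standard_error_of_measurement(10.0, 0.91)
print(round(sem, 2))               # 3.0
print(score_band(50, round(sem)))  # (47, 53): about 68% of obtained scores
print(score_band(50, round(sem), n_sems=2))  # (44, 56): about 95%

# A perfectly reliable test has no measurement error.
print(standard_error_of_measurement(10.0, 1.0))  # 0.0
```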

An additional index of reliability is the conditional standard error of measurement (CSEM). Tests can be more reliable at some score levels than at others. That is, the reliability estimate is conditional on the score level, so there are different estimates for different score levels; these are referred to as conditional standard errors of measurement, or CSEMs. For CLEP tests, the CSEM is reported for the score level that corresponds to the recommended C-level credit-granting score. Since different editions of an exam contain different questions, a test taker's score would not be exactly the same on all possible editions of the exam. The CSEM indicates how much those scores would vary. It is the typical distance of those scores (all for the same test taker) from their average. A single test taker's CSEM on a test cannot be computed directly, but it can be estimated by using the data from many test takers. The CSEM estimate reported here is for a test taker whose average score, over all possible forms of the exam, would be equal to the recommended C-level credit-granting score.


Validity is a characteristic of a particular use of the test scores from a group of test takers. If the scores are used to make inferences about the test taker's knowledge of a particular subject, the validity of the scores for that purpose is the extent to which those inferences can be trusted to be accurate.

One type of evidence for the validity of test scores is called content-related evidence of validity. It is usually based upon the judgments of a set of experts who evaluate the extent to which the content of the test is appropriate for the inferences to be made about the examinee's knowledge. The CLEP test development committees select the content of the tests to reflect the content of the corresponding courses at most colleges based on a curriculum survey.

Because colleges differ somewhat in the content of the courses they offer, faculty members are urged to review the content outline and the sample questions to ensure that the test covers core content that corresponds with the courses at their colleges.

Another type of evidence for test score validity is called criterion-related evidence of validity. It consists of statistical evidence that test takers who score high on the test also do well on other measures of the knowledge or skills the test is being used to measure. In the past, criterion-related evidence for the validity of CLEP scores has been provided by studies comparing students' CLEP scores to the grades they received in corresponding classes. Although CLEP no longer conducts these studies, individual colleges using the tests can undertake such studies in their own courses. ACES, a free College Board service, allows institutions to conduct these studies.
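
A college running its own study might start from something as simple as the correlation between exam scores and course grades. The sketch below computes a Pearson correlation on made-up data; the scores, the grades, and the function name are hypothetical, and a real study would involve many more students and further analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: scaled exam scores and grade points in the matching course.
exam_scores = [42, 48, 50, 55, 61, 67]
course_grades = [1.7, 2.3, 2.0, 3.0, 3.3, 3.7]

r = pearson_r(exam_scores, course_grades)
print(f"correlation between exam scores and course grades: {r:.2f}")
```

A correlation near 1.00 would mean that students who score high on the exam also tend to earn high grades in the course, which is the kind of evidence criterion-related validity studies look for.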