Measuring the Reliability and Validity of CLEP Exams
CLEP uses “rights only” scoring, which means that the exams are scored without a penalty for incorrect guessing. The test taker’s raw score is simply the number of questions answered correctly. However, this raw score is not reported. Instead, it is converted into a scaled score by a process that adjusts for the level of question difficulty on the different forms of the test.
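As a minimal sketch of what "rights only" scoring means in practice, the raw score can be computed by counting matches against an answer key. The function name, the answer key, and the responses below are hypothetical and serve only as an illustration; this is not CLEP's actual scoring software.

```python
def rights_only_raw_score(responses, answer_key):
    """Return the number of questions answered correctly.

    Wrong answers and omitted answers (None) both count as zero,
    so there is no penalty for guessing.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(1 for given, correct in zip(responses, answer_key) if given == correct)


# Hypothetical four-question example: two right, one wrong, one omitted.
answer_key = ["A", "C", "B", "D"]
responses = ["A", "B", None, "D"]
raw_score = rights_only_raw_score(responses, answer_key)
```

In this sketch a wrong answer and a blank answer contribute the same amount (nothing) to the raw score, which is why guessing cannot lower it.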
Scaled scores are reported on a scale of 20 to 80. Because the different forms of the test are not always equal in difficulty, raw-to-scale score conversions may differ from form to form. On an easier form, a higher raw score is needed to attain a given scaled score.
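The form-to-form difference can be illustrated with two lookup tables. Every number below is invented for a toy five-question test, and the table and function names are hypothetical. Real conversions come from a statistical process that this sketch does not attempt to reproduce.

```python
# Hypothetical raw-to-scaled tables for two forms of a 5-question test.
# Index = raw score (0..5); value = scaled score on the 20-80 scale.
# All values are invented for illustration; they are not CLEP data.
EASIER_FORM = [20, 30, 40, 50, 65, 80]
HARDER_FORM = [20, 35, 50, 60, 70, 80]


def to_scaled(raw_score, conversion_table):
    """Look up the scaled score for a raw score on one form."""
    if not 0 <= raw_score < len(conversion_table):
        raise ValueError("raw score is outside the range of this form")
    return conversion_table[raw_score]


def min_raw_for(scaled_target, conversion_table):
    """Return the smallest raw score that reaches scaled_target, or None."""
    for raw, scaled in enumerate(conversion_table):
        if scaled >= scaled_target:
            return raw
    return None


raw_needed_easier = min_raw_for(50, EASIER_FORM)  # 3 with these invented tables
raw_needed_harder = min_raw_for(50, HARDER_FORM)  # 2 with these invented tables
```

With these made-up tables, a scaled score of 50 takes three correct answers on the easier form but only two on the harder form, which is the pattern the paragraph above describes.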
The reliability of the test scores from a group of test takers is described by two statistics: the reliability coefficient and the standard error of measurement.
The reliability coefficient is the correlation between the scores those test takers receive (or would receive) on two independent replications of the measurement process. It indicates how stable the candidates' scores would be if those candidates took different forms of the same exam; that is, it can be interpreted as the correlation between the scores the test takers would earn on two forms of the test that had no questions in common. Statisticians use an internal-consistency measure to calculate the reliability coefficients for the CLEP exams. This involves looking at the statistical relationships among responses to individual multiple-choice questions to estimate the reliability of the total test score. The formula used is known as “Kuder-Richardson 20,” or “KR-20,” which is the form that a more general formula called “coefficient alpha” takes when each question is scored simply as right or wrong.
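For readers who want to see the calculation, the standard textbook form of KR-20 is k/(k-1) × (1 − Σ p·q / variance of total scores), where k is the number of questions, p is the proportion of test takers answering a question correctly, and q = 1 − p. The sketch below implements that textbook formula. It assumes the population variance of total scores (conventions differ on this point), and the function name and the tiny data set are hypothetical, not CLEP data.

```python
from statistics import pvariance


def kr20(item_scores):
    """Kuder-Richardson 20 for right/wrong (1/0) item scores.

    item_scores: one list per test taker, each holding a 0 or 1 per question.
    Assumption: the population variance of total scores is used.
    """
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    if n_items < 2 or n_people < 2:
        raise ValueError("need at least two questions and two test takers")

    totals = [sum(row) for row in item_scores]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")

    # Sum of p * q over questions, where p is the proportion answering correctly.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_scores) / n_people
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Invented responses from five test takers on four questions.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(sample)  # about 0.8 for this invented data
```

For this invented data the total scores are 4, 3, 2, 1 and 0 (variance 2.0) and the p·q terms sum to 0.8, so the result is (4/3) × (1 − 0.4) = 0.8.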
The standard error of measurement shows how much the test taker’s score might vary over repeated tests. Note that the standard error of measurement is inversely related to the reliability coefficient. If the reliability of the test were 1.00 (a perfect measure of the candidate's knowledge), the standard error of measurement would be zero.
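The inverse relationship can be made concrete with the usual classical-test-theory formula, SEM = SD × √(1 − reliability), where SD is the standard deviation of the scores. The text above does not say which formula CLEP uses, so treat this as a sketch under that assumption; the function name and the example values are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical-test-theory SEM: SD * sqrt(1 - reliability).

    Assumption: this textbook formula applies; the numbers passed in
    below are illustrative, not published CLEP statistics.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if score_sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return score_sd * math.sqrt(1.0 - reliability)


sem_perfect = standard_error_of_measurement(10.0, 1.00)  # 0.0
sem_typical = standard_error_of_measurement(10.0, 0.91)  # about 3.0
```

With a hypothetical standard deviation of 10 scaled-score points, a reliability of 1.00 gives a standard error of zero, and a reliability of 0.91 gives a standard error of about 3 points.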
Validity is a characteristic of a particular use of the test scores from a group of test takers. If the scores are used to make inferences about the test takers’ knowledge of a particular subject, the validity of the scores for that purpose is the extent to which those inferences can be trusted to be accurate.
One type of evidence for the validity of test scores is called content-related evidence of validity. It is usually based upon the judgments of a set of experts who evaluate the extent to which the content of the test is appropriate for the inferences to be made about the examinees' knowledge. Drawing on a curriculum survey, the CLEP test development committees select the content of the tests to reflect the content of the corresponding courses at most colleges.
Because colleges differ somewhat in the content of the courses they offer, faculty members are urged to review the content outline and the sample questions to ensure that the test covers core content that corresponds with the courses at their colleges.
Another type of evidence for test score validity is called criterion-related evidence of validity. It consists of statistical evidence that test takers who score high on the test also do well on other measures of the knowledge or skills the test is being used to measure. In the past, criterion-related evidence for the validity of CLEP scores has been provided by studies comparing students' CLEP scores to the grades they received in corresponding classes. Although CLEP no longer conducts these studies, individual colleges using the tests can undertake such studies in their own courses. ACES, a free College Board service, allows institutions to conduct these studies.
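A college running its own study of this kind might start by correlating CLEP scaled scores with course grades. The sketch below computes a Pearson correlation by hand. It is only one possible starting point, not the method ACES or CLEP uses, and the function name and every data value are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented data: scaled scores (20-80) and course grade points (0.0-4.0)
# for six hypothetical students.
clep_scores = [42, 50, 55, 61, 68, 74]
course_grades = [2.0, 2.3, 3.0, 2.7, 3.7, 4.0]
validity_coefficient = pearson_r(clep_scores, course_grades)
```

A clearly positive coefficient would be consistent with the claim that students who score high on the test also do well in the corresponding course. A real study would need a much larger sample and more careful analysis than this toy calculation.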