On some clinically useful measures of the accuracy of diagnostic tests
ACP J Club. 1998 Sep-Oct;129:A17. doi:10.7326/ACPJC-1998-129-2-A17
In the ACP Journal Club abstracts that describe diagnostic tests, we provide numbers that summarize the tests' accuracy. Several years ago, a summary of guides to help readers critically appraise articles about diagnostic tests was published in ACP Journal Club (1). In this editorial, we review the derivation and meaning of those numbers and introduce the latest terms added to our Glossary.
We focus here on a way of thinking about diagnosis that takes into account both components of evidence-based medicine: your individual clinical expertise and the best external evidence. The former frequently determines your assessment of diagnostic possibilities before you do the test (prior or pretest probabilities), and the latter concerns the ability of the test to distinguish patients with the target disorder from those without it (providing both the old-fashioned concepts of sensitivity and specificity and the newfangled and more powerful concept of likelihood ratios). These 2 elements of evidence-based medicine can be combined to refine your estimates of whether the patient has the target disorder (posterior or post-test probabilities) and to make the diagnosis. ACP Journal Club reports the diagnostic tests that produce the biggest changes from pretest to post-test probabilities and that are most likely to be useful in your practice. This discussion of the tests' properties is summarized from a recent book about practicing and teaching evidence-based medicine (2).
Where do the pretest probabilities come from? Ideally, they are derived from your own clinic's or hospital's database of accumulating clinical experience, specific for the setting in which you work and the sorts of patients who come or are referred to you. Other times, pretest probabilities may come from community surveys, longitudinal studies, or data banks established expressly for this purpose. Either way, pretest probabilities for the same target disorder can vary widely among and within countries and among primary, secondary, and tertiary care settings. You will have to apply your clinical expertise to modify any initial estimate of the pretest probability in light of each patient's unique biology, age, and presenting symptoms and signs (we describe yet another way to do this in the penultimate paragraph of this editorial).
Suppose that we're working up a patient with anemia. On the basis of her age, duration of symptoms, clinical signs, and initial smear results, we reckon that her probability for iron-deficiency anemia is 50%; that is, the odds are about 50-50 that the anemia is caused by iron deficiency. Although a bone marrow aspirate stained for iron stores is the diagnostic standard for confirming the presence or absence of this target disorder, neither we nor our patient want to be involved with an invasive, uncomfortable, and expensive procedure unless it is absolutely necessary. We are aware of several laboratory tests that can help in the diagnosis but are unaware of the relative value of each test or whether they are all necessary. We decide to search the current literature, and we find a review article by Guyatt and colleagues (3) that compared serum ferritin with bone marrow staining for iron; we judge it to be valid. By the time we've tracked down and studied the external evidence, summarized in Table 1 and Table 2, our patient's serum ferritin test results have come back at 60 mmol/L. How should we put all this information together?
From Guyatt and colleagues' report, we can construct Table 1. We note that 90% of patients with iron-deficiency anemia (731 of 809 patients) have serum ferritin levels < 65 mmol/L. This proportion of patients with the target disorder who have a positive test result is called the test's sensitivity. We also see that 85% of patients who do not have iron-deficiency anemia (1500 of 1770 patients) have serum ferritin levels ≥ 65 mmol/L. This proportion of patients who are free of the target disorder and have negative or normal test results is called the test's specificity.
Sometimes, sensitivity or specificity is so high that it can be used to rule in or rule out a target disorder. When a test has a very high sensitivity, a negative result rules out the diagnosis (SnNout; in Table 1, this corresponds to a very high negative predictive value). When a test has a very high specificity, a positive result rules in the diagnosis (SpPin; this corresponds to a very high positive predictive value). Calling a test result positive or negative may be useful when the test has a good SpPin or SnNout, but for most tests, much information is lost by forcing this dichotomy.
As clinicians, our interest isn't in the vertical columns of Table 1 (if we knew which column our patient was in, we wouldn't need the diagnostic test!). We want to know the “horizontal significance” of this test result: What proportion of patients with serum ferritin levels of 60 mmol/L have iron-deficiency anemia? In this study, 731 of 1001 patients, or 73%, did (a proportion called the positive predictive value). Does this mean that the probability of our patient having iron-deficiency anemia is about 73%? No, for 2 reasons.
First, although we reckoned that our patient's probability of having iron-deficiency anemia was 50%, the corresponding pretest probability, or prevalence, in the study patients was only 31% (809 of 2579 patients). To reckon the probability of our patient having the target disorder, we have to extrapolate from the study to patients such as ours by applying the test's sensitivity and specificity to a hypothetical group of patients with a pretest probability for iron deficiency of 50% (for those of you with the gumption to calculate it, the probability should be about 86%).
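For those without the gumption (or with a computer handy), that extrapolation can be sketched in a few lines of Python. This is an illustration under the stated assumptions, not part of the original analysis; the variable names are ours:

```python
# Apply the study's sensitivity and specificity to a hypothetical
# population with a 50% pretest probability of iron deficiency.
sens, spec = 0.90, 0.85   # from the Guyatt review (Table 1)
pretest = 0.50            # our clinical estimate for this patient

true_pos = sens * pretest                  # affected patients testing positive
false_pos = (1 - spec) * (1 - pretest)     # unaffected patients testing positive

post_test = true_pos / (true_pos + false_pos)
print(f"Post-test probability after a positive result: {post_test:.0%}")  # 86%
```

At a 50% pretest probability, 0.45 of all patients are true positives and 0.075 are false positives, so a positive result yields 0.45/0.525, or about 86%.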
Second, our patient's probability for iron-deficiency anemia isn't 73%, or even 86%. Although the serum ferritin determination looks impressive when viewed in terms of sensitivity (90%) and specificity (85%), likelihood ratios provide an even better way of expressing its accuracy. This example shows how we can be misled when the old sensitivity-specificity approach restricts test results to just 2 levels (positive and negative); most test results, like those of the serum ferritin test, can be divided into several levels.
Table 2 shows a particularly useful way of dividing test results into 5 levels. When this is done, one extreme level of the test result can rule in the diagnosis: here, the most positive level (because of SpPin) rules in iron deficiency for the 59% of affected patients whose results fall there, despite the unimpressive sensitivity (59%) that would have been achieved if the serum ferritin results had simply been split at this level. Likelihood ratios ≥ 10, applied to pretest probabilities ≥ 33% (0.33/0.67 = pretest odds of 0.5), generate post-test probabilities of at least 5/6, or 83%. Moreover, the other extreme level can rule out (because of SnNout) the diagnosis for the 75% of patients who do not have iron-deficiency anemia (again, despite an unimpressive specificity of 75%). Likelihood ratios ≤ 0.1, applied to pretest probabilities ≤ 33% (pretest odds of 0.5), generate post-test probabilities of at most 0.05/1.05, or about 5%. Two intermediate levels can move a 50% pretest probability (pretest odds of 1:1) to the useful but not usually diagnostic post-test probabilities of 4.8/5.8 = 83% and 0.39/1.39 = 28%. And one indeterminate level in the middle (containing about 10% of both sorts of patients) is uninformative, with a likelihood ratio of 1. To our surprise, our patient's test result falls into this indeterminate band, generating a likelihood ratio of only 1. A test that would have been judged very useful by the old sensitivity-specificity approach hasn't actually helped us move toward the diagnosis, and we'll have to consider other tests (perhaps including the reference standard of bone marrow aspiration) to sort out this problem.
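The level-specific likelihood ratios in Table 2 are simply the proportion of diseased patients at each level divided by the proportion of non-diseased patients at the same level. A short sketch (in Python; the code and names are ours, not from the study) recomputes them from the raw counts and shows the post-test probability each level produces from a 50% pretest probability:

```python
# Level-specific likelihood ratios from Table 2's raw counts, plus the
# post-test probability each level yields from a 50% pretest probability.
levels = {             # ferritin (mmol/L): (disorder present, disorder absent)
    "< 15":  (474, 20),
    "15-34": (175, 79),
    "35-64": (82, 171),
    "65-94": (30, 168),
    ">= 95": (48, 1332),
}
total_present, total_absent = 809, 1770
pretest_odds = 1.0     # a 50% pretest probability is odds of 1:1

results = {}
for level, (present, absent) in levels.items():
    lr = (present / total_present) / (absent / total_absent)
    post_odds = pretest_odds * lr
    results[level] = (lr, post_odds / (1 + post_odds))

for level, (lr, prob) in results.items():
    print(f"{level:>6}: LR = {lr:5.2f}, post-test probability = {prob:.0%}")
```

The recomputed ratios (about 52, 4.8, 1, 0.39, and 0.08) match Table 2, and the two intermediate levels reproduce the 83% and 28% post-test probabilities given in the text.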
At this and every other stage in the diagnostic process, we must decide whether to test further, treat, or abandon a diagnosis. Although it would be most comfortable to achieve post-test probabilities of 100% or 0%, in actual practice we seldom reach these idealized degrees of certainty. This uncertainty means that we must sometimes treat a disorder although its probability falls short of 100%, and stop pursuing a diagnosis although its probability remains above 0%. These decision points are called thresholds. When the post-test probability is at or below the point at which the value of withholding treatment without further testing equals the value of performing further tests, we neither treat nor test further for the target disorder; we are at the “no-treatment-no-test” threshold (4). When the post-test probability is at or above the point at which the value of administering treatment equals the value of performing further tests, we stop testing and start treating; we have crossed the “test-treatment” threshold. Between these 2 thresholds, we continue to carry out diagnostic tests in the hope that their results will send us across one of them. The thresholds are functions of such factors as the precision and accuracy of the diagnostic test and the risks and benefits of both the test and the treatment, and they vary among diseases and individual patients.
Finally, there's an easier way of handling all these probability ↔ odds calculations: the Figure is a nomogram for doing so. By anchoring a straight edge at the pretest probability in the left-hand column and rotating it until it intersects the likelihood ratio for the diagnostic test result in the center column, you can read the post-test probability in the right-hand column. This nomogram, coupled with test-treatment thresholds, also gives clinicians an additional approach to determining appropriate pretest probabilities. This time, the straight edge is anchored on the likelihood ratio for the diagnostic test result and rotated until it intersects the post-test probability that corresponds to the appropriate threshold. The straight edge is then followed back to the left-hand scale, where it intersects the pretest probability; clinicians can then ask themselves, “Is a pretest probability at this level (or higher if trying to make a diagnosis, lower if trying to rule it out) clinically reasonable for my patient?”
When reported in ACP Journal Club, these measures of the accuracy of diagnostic tests will be accompanied, whenever possible, by their 95% confidence intervals. The key definitions in this editorial have been added to the Glossary, and future editorials will present other elements of the science of the art of clinical diagnosis in an effort to render it more effective and efficient and to promote improvements in the care of patients.
David L. Sackett, MD
University of Oxford
Oxford, England, UK
Sharon Straus, MD
University of Oxford
Oxford, England, UK
1. Guyatt GH. Readers' guide for articles evaluating diagnostic tests: what ACP Journal Club does for you and what you must do yourself [Editorial]. ACP J Club. 1991 Sep-Oct;115:A16.
2. Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. London: Churchill Livingstone; 1997.
3. Guyatt GH, Oxman AD, Ali M, Willan A, McIlroy W, Patterson C. Laboratory diagnosis of iron-deficiency anemia: an overview. J Gen Intern Med. 1992;7:145-53.
4. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med. 1980;302:1109-17.
Table 1. Results of a systematic review of serum ferritin as a diagnostic test for iron-deficiency anemia*
| Serum ferritin result | Iron-deficiency anemia present | Iron-deficiency anemia absent | Total |
|---|---|---|---|
| Positive (< 65 mmol/L) | 731 (a) | 270 (b) | 1001 (a + b) |
| Negative (≥ 65 mmol/L) | 78 (c) | 1500 (d) | 1578 (c + d) |
| Total | 809 (a + c) | 1770 (b + d) | 2579 (a + b + c + d) |
*Sensitivity = a/(a + c) = 731/809 = 90%
Specificity = d/(b + d) = 1500/1770 = 85%
+LR = sensitivity/(1 − specificity) = 90%/15% = 6
-LR = (1 − sensitivity)/specificity = 10%/85% = 0.12
Positive predictive value = a/(a + b) = 731/1001 = 73%
Negative predictive value = d/(c + d) = 1500/1578 = 95%
Prevalence = pretest probability = (a + c)/(a + b + c + d) = 809/2579 = 31%
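The measures in these footnotes can be reproduced directly from the four cell counts; the following Python sketch (our labels follow the table's a, b, c, d) recomputes them:

```python
# Recomputing Table 1's summary measures from the 2x2 cell counts.
a, b = 731, 270    # ferritin < 65 mmol/L: with / without iron deficiency
c, d = 78, 1500    # ferritin >= 65 mmol/L: with / without iron deficiency

sensitivity = a / (a + c)                   # 731/809  ~ 0.90
specificity = d / (b + d)                   # 1500/1770 ~ 0.85
pos_lr = sensitivity / (1 - specificity)    # ~ 6
neg_lr = (1 - sensitivity) / specificity    # ~ 0.11 (0.12 if 10%/85% is used)
ppv = a / (a + b)                           # 731/1001 ~ 0.73
npv = d / (c + d)                           # 1500/1578 ~ 0.95
prevalence = (a + c) / (a + b + c + d)      # 809/2579 ~ 0.31
```

Note that the footnote's -LR of 0.12 comes from the rounded 10%/85%; the unrounded counts give about 0.11.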
Table 2. The usefulness of five levels of a diagnostic test result*
| Serum ferritin result (mmol/L) | Target disorder present, n (%) | Target disorder absent, n (%) | Likelihood ratio | Diagnostic impact |
|---|---|---|---|---|
| Very positive (< 15) | 474 (59) | 20 (1.1) | 52 | Rule in (SpPin) |
| Moderately positive (15 to 34) | 175 (22) | 79 (4.5) | 4.8 | Intermediate high |
| Neutral (35 to 64) | 82 (10) | 171 (10) | 1 | Indeterminate |
| Moderately negative (65 to 94) | 30 (3.7) | 168 (9.5) | 0.39 | Intermediate low |
| Extremely negative (≥ 95) | 48 (5.9) | 1332 (75) | 0.08 | Rule out (SnNout) |
| Total | 809 (100) | 1770 (100) | | |
*SnNout = when a test has a very high sensitivity, a negative result rules out the diagnosis; SpPin = when a test has a very high specificity, a positive result rules in the diagnosis.
Pretest odds = pretest probability/(1 - pretest probability)
Post-test odds = pretest odds × likelihood ratio
Post-test probability = post-test odds/(post-test odds + 1)
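These three formulas can be written as small helper functions, doing numerically what the Figure's nomogram does graphically (a sketch; the function names are ours):

```python
# The three conversion formulas above, expressed as helper functions.
def pretest_odds(pretest_prob: float) -> float:
    """Pretest odds = pretest probability / (1 - pretest probability)."""
    return pretest_prob / (1 - pretest_prob)

def post_test_prob(pretest_prob: float, likelihood_ratio: float) -> float:
    """Post-test odds = pretest odds x LR; then convert odds back to probability."""
    post_odds = pretest_odds(pretest_prob) * likelihood_ratio
    return post_odds / (post_odds + 1)

# Example from the text: a 50% pretest probability with the overall +LR of 6.
print(f"{post_test_prob(0.50, 6):.0%}")   # 86%
```

A likelihood ratio of 1 leaves the probability unchanged, which is why the indeterminate ferritin level in Table 2 contributes nothing to the diagnosis.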
Figure. Nomogram for interpreting diagnostic test results
Reproduced with permission from Fagan TJ. Nomogram for Bayes theorem [Letter]. N Engl J Med. 1975;293:257.