After reading this chapter, you will be able to:
1. Describe the range of available health indicators and be able to define:
• Incidence and prevalence
• Mortality rates:
− Crude and standardized
• Life expectancy
• Potential years of life lost
• Survival curves
3. Demonstrate an understanding of how to assess the accuracy of health measures, including:
• Interpreting test results:
− Positive and negative predictive values
− Setting cut-scores
− Ruling in, and ruling out, a diagnosis
− Likelihood ratios
− What is a normal value?
Linking these topics to the Medical Council exam objectives, especially section 78-2.
Dr. Rao reviews some health indicators
Dr. Rao, the Richards family’s physician, saw some statistics about Goosefoot in the regional public health department’s “Physician Update” pamphlet. This gives demographic information on the age and sex breakdown of the population and figures on average income, unemployment, and educational attainment. There is also information on health habits and hospital admission rates, death rates, and consultation rates. He is looking at the page showing the following information about the local area and for the whole Weenigo region:
Health indicators for Goosefoot community, compared to the region
|Major health indicators||Goosefoot||Weenigo region|
|Annual number of deaths, 3 year average||132||9,829|
|Annual mortality (per 100,000), 3 year average||884||808|
|Age-standardized mortality rate (per 100,000), 3 year average||690||786|
|Life expectancy at birth, men (years)||76.2||77.5|
|Remaining life expectancy at 65, men (years)||20.0||17.9|
|Life expectancy at birth, women (years)||80.7||82.4|
|Remaining life expectancy at 65, women (years)||20.1||21.7|
|Number of live births (2009)||140||13,981|
|Infant mortality rate (per 1,000 live births, 2009)||4.2||4.9|
|Perinatal mortality rate (per 1,000 total births, 2009)||5.5||7.0|
Dr. Rao scratches his head and wonders what these numbers mean and why there are so many different measures of mortality.
The medical model of health that was described in Chapter 1 defined and measured health in terms of low rates of adverse health events. Early measures of population health were based on rates of “the five Ds”: death, disease, disability, discomfort, or distress (see Chapter 1). Note that the five Ds form a hierarchy, from objective, numerical measures to more subjective, qualitative indicators, and also from those that are routinely collected (death certificates) to those that are available only from a research study (e.g., questions on feelings of distress).
The quality of health data varies. Indicators based on mortality are robust and almost complete because death certification is a legal requirement (although the accuracy of the recorded diagnosis may be questioned). Disease records can also be reasonably complete: diagnoses can be taken from hospital discharge summaries, for example, although such data can only be generalized to people treated in hospital. The quality of statistics based on disease records depends on the accuracy with which physicians completed the original forms, but they are relatively accurate. Because of their availability and comparability, mortality and morbidity statistics are used by national and international agencies to compare health status between countries. Some of the commonest statistics for this purpose include death rates per thousand, infant mortality rates, average life expectancy, and a range of morbidity indicators, such as rates of reportable disease. Data can be further analyzed within a region to compare the health of different groups of people, or to track particular health problems such as influenza or HIV/AIDS.
Interpreting morbidity figures
Evolving health indicators
Health measures can record information on individuals or on whole populations. The most familiar measures of population health, mortality or morbidity rates, are nonetheless based on counts of individuals, aggregated up to a population level (incidence rates and prevalence are examples). These may be termed aggregated measures of population health.2 A second class of health measures includes ecological indicators, used to record factors that directly affect human health. Many of these may be recorded either in the individual or in the environment: lead levels can be measured in the patient’s blood, or else in the air, water or soil. A third category of measures includes environmental indicators that act indirectly and have no obvious analogue at the individual level; for example, the existence of HEALTHY PUBLIC POLICY, which might be designed to enhance equity in access to care, or to limit smoking in public places. Such policies may be viewed as indicators of the healthiness of the entire population: is this a caring society that tries to protect the health of its citizens? The contrast between aggregated and indirect environmental measures corresponds to the distinction between health in the population and health of the population that was introduced in Chapter 1.
Incidence and prevalence are aggregate indicators of health that serve different, yet overlapping, purposes. Incidence, or the number of new cases that occur in a given time period, is useful for acute conditions, while prevalence (the total number of cases in the population) applies more to chronic diseases. Causal analyses study incident cases, while prevalence is useful when estimating need for health services. For example, we assess the incidence of road traffic injuries under different conditions when looking for ways to prevent them; we assess the prevalence of long-term disability due to road traffic injuries when planning rehabilitation services. Incidence is a measure of the speed at which new events (such as deaths or cases of disease) arise in a population during a fixed time. It may be measured as a frequency count, or as a proportion of the population at risk, or as a rate per unit of time. The distinction between incidence rate and proportion is illustrated in the additional materials link.
Incidence proportion and rate
Incidence can be measured in two main ways and Figure 6.1 illustrates the difference. It shows the results of following six people from the moment they entered a study (or a population) until the end of the observation year or the moment some of them experienced the event that we are counting, in this case death.
One approach is to measure the incidence proportion, which is the number of events that occurred during the time period, divided by the number of people at risk of having an event, counted at a specified time point during that period. The diagram has been deliberately drawn to highlight the challenge of setting the denominator for this calculation when people immigrate into or emigrate from the population.
- We could take the denominator as including only those present from the beginning of the year – this is the idea of a closed cohort study that does not add new people after the beginning. Persons A, B and C were present at the beginning, giving an incidence proportion of 2/3 or 0.67 per year. This calculation is also known as the cumulative incidence.
- Alternatively, we could think of this as an open cohort (allowing for both immigration and emigration) and define the population as those present at the mid-point of the study. This includes Persons A, B, D, E and F, of whom two died, giving an incidence proportion of 2/5 or 0.4 for the observation year.
Neither approach seems ideal, and both can lead to biased estimates, especially in small populations where migration is common. An alternative is to calculate the incidence rate, or incidence density. This adds up the time that each person was followed and uses the total as the denominator. So, in our example,
- Person A: 12 months
- Person B: 10 months
- Person C: 3 months
- Person D: 11 months
- Person E: 5 months
- Person F: 7 months.
This gives a total of 48 person-months (or 4 person-years) of observation. Using this as the denominator and counting all the events (or deaths) as the numerator, we get 3 per 4 person-years, or 0.75 per person-year. Conceptually, this represents a concentration or density of events over a composite time period, using time rather than individuals as the denominator. It is therefore a rate and can be conceived as the force of mortality (or of any other risk factor) acting on the population. It has no meaning at the individual level. By contrast, the incidence proportion focuses attention on the risk for individuals and is a probability with an upper limit of 100%, whereas the rate has no limit. The two approaches are linked mathematically, and more details on the relationship between them are given, for example, by Rothman et al.3 The relationship introduces the notion of survival functions that will be described below.
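The arithmetic of the two approaches can be checked with a short Python sketch. The follow-up times and event counts are those of the worked example; the function and variable names are illustrative assumptions, not part of any standard library.

```python
# Follow-up time in months for each person, as listed in the worked example.
follow_up_months = {"A": 12, "B": 10, "C": 3, "D": 11, "E": 5, "F": 7}

def incidence_proportion(events, persons_at_risk):
    """Events divided by the number of people at risk (a probability)."""
    return events / persons_at_risk

def incidence_rate(events, person_time):
    """Events divided by the total person-time at risk (a rate)."""
    return events / person_time

# Closed cohort: Persons A, B and C were present at the start; two of them died.
closed = incidence_proportion(events=2, persons_at_risk=3)

# Open cohort: five people were present at the mid-point; two of them died.
open_midpoint = incidence_proportion(events=2, persons_at_risk=5)

# Incidence density: all three deaths over the summed follow-up time.
person_years = sum(follow_up_months.values()) / 12
density = incidence_rate(events=3, person_time=person_years)

print(f"Closed-cohort proportion: {closed:.2f}")
print(f"Mid-point proportion:     {open_midpoint:.2f}")
print(f"Person-years observed:    {person_years:.1f}")
print(f"Incidence rate:           {density:.2f} per person-year")
```

The same six people yield three different answers (0.67, 0.40 and 0.75), which is why a report should always state which denominator was used.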
In research studies incidence is generally calculated as incidence density, being more precise. But with a large population it is not practical to record the follow-up for each person, so surveillance projects generally calculate the incidence proportion using the population size at the mid-point of the observation period as the denominator. This is an adequate approach with a large population because migration is generally not large enough to affect results. However, precise population counts are only available every 5 years from the Census, so the denominator is based on an estimate of the population size. This makes it impossible to individually link people in the denominator with recorded events, as was done in Figure 6.1. For either method of calculation, the number of events is multiplied by 1,000 or 100,000 depending on the number of events to arrive at an incidence of death (i.e. mortality) of, as in the case of Goosefoot, 884 per 100,000 per year.
While incidence measures events, prevalence is a measure of disease state; it is a proportion that counts all existing cases at a particular time, divided by the population size. It reflects both the disease incidence and its duration, which is linked to survival. The time period for calculating prevalence is commonly a single point in time: point prevalence. Alternatively, prevalence can be calculated for a period such as a year: period prevalence. Prevalence is generally the measure of choice in assessing the burden of a chronic disease because new cases might be quite rare and yet last a long time, requiring care and causing significant disability.
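The way prevalence reflects both incidence and duration can be sketched numerically. The example below relies on a standard approximation that is not stated in the text above: in a stable population, and for a condition that is not too common, prevalence is roughly the incidence rate multiplied by the average duration of the condition. All figures are hypothetical and the function names are assumptions made for illustration.

```python
def point_prevalence(existing_cases, population):
    """Proportion of the population with the condition at one point in time."""
    return existing_cases / population

def approximate_prevalence(incidence_rate, mean_duration):
    """Steady-state approximation: prevalence ~ incidence rate x mean duration.

    Assumes a stable population and a fairly rare condition.
    """
    return incidence_rate * mean_duration

# Hypothetical figures, chosen only for illustration.
# A rare but long-lasting condition: 50 new cases per 100,000 per year, lasting 10 years.
chronic = approximate_prevalence(incidence_rate=50 / 100_000, mean_duration=10)
# A common but brief condition: 5,000 new cases per 100,000 per year, lasting one week.
acute = approximate_prevalence(incidence_rate=5_000 / 100_000, mean_duration=1 / 52)

print(f"Chronic condition: {chronic * 100_000:.0f} per 100,000")
print(f"Acute condition:   {acute * 100_000:.0f} per 100,000")
```

In this sketch the chronic condition is far more prevalent at any moment even though its incidence is one hundred times lower, which illustrates why prevalence is preferred for describing the burden of chronic disease.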
Hospitals: where prevalence is high and incidence is low
Mortality is an event that can be presented as an incidence rate or proportion. But it is traditional to speak of “mortality rates” however the figures were actually calculated, and we will follow that tradition. There are various forms of mortality rate, beginning with those that refer to particular groups, such as child mortality rates, and proceeding to the overall, crude mortality (or crude death) rate. Because the death of a child represents the most significant loss of potential life, and also because children are vulnerable and child health is sensitive to variations in the social environment, there are several indicators of child mortality.
Infant mortality rate (IMR)
The infant mortality rate is the total number of deaths in a given year of children less than one year old, divided by the number of live births in the same year, multiplied by 1,000. Because infant mortality is strongly influenced by environmental factors and the quality of health care, the IMR is often quoted as a useful indicator of the level of community health, especially in poorer countries. However, because of the rarity of infant death in developed countries, it is useful only in large populations as chance variation can make rates unstable in small populations.
Perinatal mortality rate (PMR)
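In most industrially developed nations, this is defined (for a given year) as:
Perinatal mortality rate = (Late fetal deaths (stillbirths) + early neonatal deaths (within the first week of life)) / (Live births + late fetal deaths) * 1,000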
Neonatal mortality rate (NMR)
Dr. Rao reviews infant mortality figures
Dr. Rao observes with pride the lower infant mortality rate for Goosefoot (4.2 per thousand) than for the region as a whole (4.9): this is an indicator that is sensitive to the quality of medical care and to the home environment during infancy. His team has paid close attention to these factors and the practice nurse makes routine home visits to young mothers with newborns. Indeed, his nurse tells him, their figure is lower than the national average (4.5 in 2017). Cause for celebration! He thinks back to his home country of India where IMR still runs about ten times higher than in Goosefoot and he ponders the huge impact that good access to nutrition, safe environments and adequate primary and preventive care have on this contrast.
But then Dr. Rao notes that the perinatal mortality rate is higher than the infant mortality; as these two indicators overlap, he figures it must mean that there are a lot of fetal deaths in Goosefoot. He worries about a possible connection to environmental pollution, perhaps linked to the mining industry. He realizes he has some work to do to investigate…
The crude mortality rate gives an estimate of the rate at which members of a population die during a specified period (typically a year). The numerator is the number of people dying during the period; the denominator includes the size of the population, usually at the middle of the period (mid-year population), and the duration of observation.
Crude mortality rate = Number of deaths during the period / (Mid-period population * Duration of observation) * 10^n
(Notes: The “10^n” simply means that the rate may be multiplied by 1,000, or even 100,000 for rare diseases, to bring the rate to a convenient whole number. The time period is commonly a year, but the example above quoted three-year averages for Goosefoot. This was done because the population is rather small, and so rates might fluctuate somewhat from year to year; the 3-year average gives a more stable picture.)
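As a minimal sketch, the crude rate can be computed as follows. The number of deaths (132) comes from the Goosefoot table; the mid-year population of 14,932 is not given in the pamphlet and has been back-calculated from the published rate of 884 per 100,000, so treat it as an assumption. The function name is illustrative.

```python
def crude_mortality_rate(deaths, mid_period_population, years=1, per=100_000):
    """Deaths divided by (population x duration of observation), scaled by `per`."""
    return deaths / (mid_period_population * years) * per

# ASSUMPTION: this population is back-calculated from the table, not reported in it.
goosefoot_population = 14_932

# Annual deaths (3-year average) from the Goosefoot table.
annual = crude_mortality_rate(deaths=132, mid_period_population=goosefoot_population)

# The same rate expressed from three years of pooled deaths (3 x 132 = 396).
pooled = crude_mortality_rate(deaths=396, mid_period_population=goosefoot_population, years=3)

print(f"Crude mortality rate: {annual:.0f} per 100,000 per year")
print(f"Pooled over 3 years:  {pooled:.0f} per 100,000 per year")
```

Including the duration of observation in the denominator is what allows several years of deaths to be pooled while still reporting an annual rate.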
Why “crude”? The term warns us that comparing rates of disease or death between different populations may be misleading. Consider two towns, one with a young population and the other a retirement community. Because death rates vary by age, one would expect more deaths in the retirement community, so a direct comparison of mortality rates would reflect the demographic differences as much as their health. If the purpose is to compare the health chances of an individual in each community this would be misleading. In technical terms, the health comparison would be confounded (see Chapter 5) by the age difference. We can, however, adjust the crude rates to remove the effect of confounding factors, permitting a fairer comparison of health status. This adjustment calculates rates specific to strata of the population, here age groups, producing age-specific mortality rates. These can then be combined into an overall standardized rate that reduces the misleading effect of the confounding factor. Rates may also be adjusted for more than one characteristic of the population, for example calculating age-, sex-, and race-specific death rates.
Standardization is used when comparing mortality in two populations that differ in terms of characteristics known to influence mortality, and whose effect one wishes to temporarily remove. Standardization can also be used in comparing one population over time to adjust for demographic changes, and it can be used in comparing the performance of different clinicians, adjusting for differing case-mixes of their practices. “Adjustment” is the more general term covering standardization and other methods for removing the effects of factors that distort or confound a comparison. “Standardization” refers to an approach using weighted averages derived from a standard reference population.
Age-standardized mortality rates
Standardization uses either a direct or an indirect method (the calculations are shown in the Nerd’s Corner box below). The direct approach provides more information but requires more data. Direct standardization is expressed as an age-standardized rate: x number of deaths per y number of individuals. For instance, in 2010 the crude death rate from injuries in Alberta, calculated by simply dividing the number of deaths from injury by the total population, was 49 per 100,000. This figure is useful in estimating need for services. But to highlight the possible impact of injuries in the oil sector, a comparison to other provinces without an oil sector is necessary, and this must be standardized because they will have different age-structures. It is important to understand that this standardized figure is artificial and can only be used in making comparisons across time or place. Standardized rates may be compared in both absolute and relative terms: as a simple difference between populations or as a ratio of two standardized rates.
Indirect standardization is typically used when stratum sizes (e.g., age groups) in the study population are small, leading to unstable stratum-specific rates. Here, only the overall mortality figure for the study population is required; the stratum-specific death rates are taken from a much larger reference population. The result is expressed as a standardized mortality ratio (SMR), which is the ratio of the deaths observed in the study population to the number that would be expected if each stratum of this population had the same death rate as the corresponding stratum of the reference population. An SMR of 100 signifies that deaths are at the expected level; an SMR of 110 indicates a death rate 10% higher than expected (see Figure 6.3 for an illustration).
Dr. Rao meets the SMRs
Dr. Rao has been pondering the meaning of the Goosefoot standardized mortality rates for a while now, and wonders why the standardized rate is so much lower than the crude rate (690 versus 884). It is his wife who, in the end, suggests that maybe this is because there are mostly older people in Goosefoot: many of the young adults have moved to Weenigo to find work. She points out that of course this will mean more deaths per thousand than in a younger population.
Dr. Rao begins to feel somewhat relieved: standardizing the rates gives a more comparable result, and in fact, when the effect of age is removed Goosefoot is actually doing better than the region as a whole (690 versus 786 deaths per 100,000). He smiles contentedly.
Calculating age-standardized rates and ratios
Using data on mortality by age-group in Goosefoot and the broader Weenigo region as an example:
- Direct standardization
The age-standardized mortality rate (ASMR) is calculated in 4 steps:
- Select a reference population (usually the country as a whole) and find out from the census how many people there are in each age group (usually 5, 10 or 20-year age groups) and enter the data into a spreadsheet (see example below).
- Calculate age-specific death rates (deaths / population size * 100,000) for each age group in the study populations (Goosefoot and the broader Weenigo region).
- Calculate the number of deaths that would be expected in each age-group if the study populations had the same age structure as the reference population (Canada):
Study population age-specific death rate * reference population / 100,000.
For example, for children aged 0-14 in Goosefoot, 111 * 5,607,345 / 100,000 = 6,251 (the rate of 111 appears to be shown rounded; multiplying by exactly 111 gives about 6,224).
- The ASMR for each study population = Total expected deaths / size of the reference population * 100,000:
|Step 1:||Step 2:||Step 3:||Step 2:||Step 3:|
|Pop’n||# Deaths||Death rate / 100,000||Expected deaths||Pop’n||# Deaths||Death rate / 100,000||Expected deaths|
|Step 4: ASMR (Expected deaths / ref. population) * 100,000|
Here, we see that although the crude death rate in Goosefoot (884) was higher than that for the Weenigo region (808), the age-standardized rate is lower. This arises because there are more elderly people in Goosefoot (25% versus 15%); correcting for this indicates that Goosefoot is actually comparatively healthy.
Note that the ASMRs are artificial figures, and have no meaning in isolation: they have meaning only when compared to the crude death rate in the standard population or to the ASMRs from other groups, calculated using the same age-groups and reference population. Super nerds think about the ASMR as a weighted average of the age-specific rates for a place, with the weights being the proportion of the reference population that falls within each age group.
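The four steps of direct standardization can be sketched in Python. The figures below are entirely hypothetical (two invented towns and an invented reference population with three age groups); they are not the Goosefoot, Weenigo or Canadian data, and the function names are illustrative assumptions.

```python
def crude_rate(deaths, population, per=100_000):
    """Total deaths divided by total population, scaled by `per`."""
    return sum(deaths) / sum(population) * per

def direct_asmr(deaths, population, reference_population, per=100_000):
    """Age-standardized mortality rate by the direct method."""
    expected_deaths = 0.0
    for d, n, ref in zip(deaths, population, reference_population):
        age_specific_rate = d / n                   # Step 2: rate in the study population
        expected_deaths += age_specific_rate * ref  # Step 3: apply it to the reference stratum
    # Step 4: total expected deaths over the size of the reference population
    return expected_deaths / sum(reference_population) * per

# HYPOTHETICAL data for age groups 0-14, 15-64 and 65+.
reference_population = [200_000, 600_000, 200_000]   # Step 1

older_town = {"population": [1_000, 5_000, 4_000], "deaths": [1, 10, 120]}
younger_town = {"population": [3_000, 6_000, 1_000], "deaths": [3, 15, 40]}

for name, town in [("Older town", older_town), ("Younger town", younger_town)]:
    crude = crude_rate(town["deaths"], town["population"])
    asmr = direct_asmr(town["deaths"], town["population"], reference_population)
    print(f"{name}: crude {crude:.0f}, age-standardized {asmr:.0f} per 100,000")
```

In this invented example the older town has the higher crude rate (1,310 versus 580 per 100,000) but the lower age-standardized rate (740 versus 970), the same reversal described for Goosefoot and the Weenigo region.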
- Indirect standardization
Indirect standardization offers a short cut when we do not know the age-specific mortality rates for the study population, or when the population is too small to calculate stable stratum-specific rates, as with the deaths in ages 0-14 in Goosefoot. Indirect standardization takes the weights from the study population (i.e. the sizes of the age strata), but takes the death rates for each age group from the standard population. The study population provides the weights while the standard population provides the rates.
The standardized mortality ratio (SMR) is calculated in three steps:
- Obtain the age-specific death rates for the reference population (here, Canada 2011) = Deaths / population * 100,000 in each age group.
- Multiply these rates by the number of people in each study population in each age group to calculate the expected number of deaths. These show the number of deaths that would occur in the study population if each age stratum had the same death rate as in the reference population. I.e. Canadian age‑specific death rate * population of that age group in the study population / 100,000.
- Calculate the ratio between observed deaths and expected deaths in the study population: observed deaths / expected deaths * 100.
|Step 1:||Step 2:||Step 2:|
|# Deaths||Rate / 100,000|
|Step 3: SMR (Observed deaths / expected deaths) * 100||83||113|
Again, the figures indicate that Goosefoot is relatively healthy compared to the Weenigo region. Note: Each SMR can be compared to the reference population (Goosefoot has a lower mortality than Canada overall which has a value of 100), but it can be misleading to compare between SMRs. This is because each SMR reflects the age-structure of that region and if these are very different, comparisons across SMRs may be biased.4, 5
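The three steps of indirect standardization can be sketched in the same way. The reference rates, stratum sizes and observed deaths below are hypothetical and serve only to show the mechanics; the function name is an illustrative assumption.

```python
def standardized_mortality_ratio(observed_deaths, study_population,
                                 reference_rates, per=100_000):
    """Return (expected deaths, SMR) by the indirect method.

    reference_rates are age-specific death rates per `per` people in the
    reference population; study_population gives the stratum sizes.
    """
    expected = sum(rate / per * n
                   for rate, n in zip(reference_rates, study_population))
    return expected, observed_deaths / expected * 100

# HYPOTHETICAL data for age groups 0-14, 15-64 and 65+.
reference_rates = [100, 250, 3_500]       # Step 1: rates per 100,000 in the reference population
study_population = [1_000, 5_000, 4_000]  # stratum sizes in the study population
observed_deaths = 131                     # only the overall total is needed

expected, smr = standardized_mortality_ratio(
    observed_deaths, study_population, reference_rates)

print(f"Expected deaths: {expected:.1f}")
print(f"SMR: {smr:.0f}")
```

Note that the function never uses age-specific deaths from the study population, which is why the method remains usable when those counts are too small to be stable.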
Life expectancy at birth is an estimate of the expected number of years to be lived by a newborn based on current age-specific mortality rates. Life expectancy is a statistical abstraction: after all, we will have to wait a lifetime to find out how long babies born today will actually live (and some of us may not still be around!). Life expectancy is used as a summary indicator of health that can be compared across countries. In Canada in 2010, life expectancy was almost 83 years for females and 78 years for males, which placed us very close to Australia and just behind Japan. The impact of changing social conditions on life expectancy is illustrated by the case of Russia during the 1990s: life expectancy can fall surprisingly quickly if conditions deteriorate, as shown in the Illustration box.
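One way to see how current age-specific mortality translates into a life expectancy is a simplified period life table. The sketch below assumes single-year age groups, assumes that people who die in a given year live half of it on average, and uses invented death probabilities; it is not how any statistical agency computes its official figures, and the function name is an assumption.

```python
def life_expectancy(death_probabilities, from_age=0):
    """Expected remaining years of life for someone alive at `from_age`.

    death_probabilities[x] is the probability of dying between age x and
    x + 1; the final entry should be 1.0 so that the table is closed.
    """
    alive = 1.0          # proportion of the cohort still alive
    years_lived = 0.0
    for q in death_probabilities[from_age:]:
        dying = alive * q
        # Survivors contribute a full year; those dying contribute half a year.
        years_lived += alive - dying / 2
        alive -= dying
    return years_lived

# HYPOTHETICAL probabilities: infancy, ages 1-64, ages 65-99, then a closing value.
death_probabilities = [0.005] + [0.001] * 64 + [0.05] * 35 + [1.0]

print(f"Life expectancy at birth: {life_expectancy(death_probabilities):.1f} years")
print(f"Remaining at 65:          {life_expectancy(death_probabilities, from_age=65):.1f} years")
```

Because the calculation applies today's probabilities to every future age, the result describes current mortality conditions rather than predicting the lifespan of any real birth cohort.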
Life expectancy in post-Soviet Russia
The social disruption of post-Soviet Russia was reflected in rapidly rising death rates, so that between 1990 and 1994 the life expectancy for women declined from 74.4 years to 71.2, while that for men dropped by six years from 63.8 to 57.7. Cardiovascular disease and injuries together accounted for 65% of the decline.6
Source: http://en.wikipedia.org/wiki/File:Russian_male_and_female_life_expectancy.PNG accessed July 2010.
Dr. Rao ponders life expectancy
Dr. Rao ponders the curious figures for life expectancy in Goosefoot:
|Life expectancy at birth, men (years)||76.2||77.5|
|Remaining life expectancy at 65, men (years)||20.0||17.9|
|Life expectancy at birth, women (years)||80.7||82.4|
|Remaining life expectancy at 65, women (years)||20.1||21.7|
He knows, of course, that women live longer than men, and the Goosefoot men live less long than those in the region as a whole. But he is surprised that the men who survive to 65 then live longer than the women. “They must be tough old men!” he says to himself, and thinks of the miners he knows. Those who do not have the chronic lung disease are, indeed, active outdoors people. “Perhaps if they do survive to that age they may live longer than men in the city,” he muses. The tendency for longer survival among the hardy few is termed the healthy survivor effect.
Setting priorities for disease prevention is based in part on the impact of each disease on the population. An obvious measure of impact is the number of deaths a disease causes. From Figure 6.2, this implies that cancers, heart disease and stroke are the diseases with the greatest impact. However, these conditions tend to kill people who are reaching the end of their expected life span, so preventing them might have little effect on extending overall life expectancy. Preventing premature deaths would add more years of life (and perhaps also more productive years) to individuals and to society. Premature death can be defined in terms of deaths occurring before the average potential life expectancy for a person of that sex, or it could be based on an arbitrary value, such as 75 years. Thus, a person who dies from a myocardial infarction at age 55 would lose 20 years of potential life, and such results could be summed across the population to indicate the impact (in terms of potential years of life lost) due to each cause. These values can be used to indicate the social impact of diseases in terms of the total potential years of life lost (PYLL) due to each. (You will sometimes see the abbreviation YPLL for “years of potential life lost”: same thing.)
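The PYLL calculation is simple enough to sketch directly. The reference age of 75 follows the arbitrary value mentioned above; the ages at death are hypothetical and the function name is an illustrative assumption.

```python
def potential_years_of_life_lost(ages_at_death, reference_age=75):
    """Sum of years lost before the reference age; later deaths count as zero."""
    return sum(max(reference_age - age, 0) for age in ages_at_death)

# HYPOTHETICAL ages at death for one cause.
ages_at_death = [55, 80, 30, 74]

# 20 + 0 + 45 + 1 years
print(potential_years_of_life_lost(ages_at_death))
```

A single death at age 30 contributes more to the total than the other three deaths combined, which is why causes that kill young people rank higher on PYLL than on simple mortality counts.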
Prevention priorities based on the PYLL will differ from those based on simple mortality rates. Figure 6.5 shows that based on mortality rates, cancer, circulatory and respiratory diseases remain the familiar priorities as in Figure 6.2. However, they kill relatively late in life (although cancer less so) and using PYLL, injuries (unintentional and intentional) become more important than strokes or respiratory diseases. Indeed, taken together, suicides and unintentional injuries cause more years of potential life lost than circulatory disease.
In cohort studies and clinical trials, outcomes may be expressed as symptom-free survival (how long before symptoms return after a treatment) or survival (time between a diagnosis and death); hence the term SURVIVAL CURVE. Kaplan and Meier developed a method for estimating survival curves from follow-up data, including data from patients who are lost to follow-up, and statistical tests such as the log rank test are used to evaluate the difference between two survival curves. In an extension, Cox’s proportional hazards model allows for a comparison between two survival curves while adjusting for other variables that may differ between the groups, such as age or disease severity.
This kind of SURVIVAL ANALYSIS is common in the clinical literature, as it has a number of advantages. It gives a full picture of the clinical course of a disease in terms of survival rates at specified intervals after diagnosis and/or treatment. Figure 6.6 shows a hypothetical example. Although the outcomes of the two treatments after 48 weeks are similar, Treatment A enhances survival over the first few months after treatment.
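Survival curves of this kind are commonly estimated with the Kaplan-Meier (product-limit) method. The sketch below is a bare-bones version written for illustration, with hypothetical follow-up data; in practice an established statistical package would be used, and the function name here is an assumption.

```python
def kaplan_meier(times, died):
    """Return [(time, estimated survival)] at each time a death occurred.

    times: follow-up time for each patient (e.g., weeks).
    died:  1 if the patient died at that time, 0 if follow-up ended
           without death (censored).
    """
    data = sorted(zip(times, died))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for time, d in data if time == t and d == 1)
        leaving = sum(1 for time, d in data if time == t)
        if deaths:
            # Multiply by the proportion of those at risk who survive time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving   # deaths and censored patients both leave the risk set
        i += leaving
    return curve

# HYPOTHETICAL data: six patients followed for up to 48 weeks.
weeks = [6, 12, 12, 20, 30, 48]
died = [1, 1, 0, 1, 0, 0]

for t, s in kaplan_meier(weeks, died):
    print(f"Week {t}: estimated survival {s:.2f}")
```

Censored patients contribute to the number at risk until they leave the study, so their partial follow-up is used rather than discarded.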
Survival curves and incidence rates
There are evident limitations to morbidity and mortality as indicators of health. They only apply to serious conditions, so are largely irrelevant to most middle-aged people. Furthermore, a diagnosis that becomes a morbidity statistic says little about a person’s actual level of function, and morbidity indicators cannot cover positive aspects of health. This led to the development of a range of more subjective indicators of health, termed health measurement scales or patient-reported outcomes. Measuring health, however, is inherently challenging, because health is an abstract concept. Unlike morbidity, health is not defined in terms of specific indicators that can be used as measurement metrics, such as blood pressure in hypertension or blood sugar in diabetes.
Measurement scales have been developed for most common diagnoses, and these are termed “disease-specific scales”. Some rate the severity of symptoms in a particular organ system (e.g., vision loss, breathlessness, limb weakness); others focus on a diagnosis, such as anxiety or depression scales. Other measurements are broader in scope, covering syndromes (emotional well-being scales) or overall health, and the broadest category of all: health-related quality of life. These broad scales are termed “generic scales”, as they can apply to any type of disease and to anyone; a common example is the Short-Form-36 Health Survey, a 36-item summary of functional health.1, pp649-65 A simpler example is the single question: “In general, would you say your health today is Excellent? Very good? Good? Fair? Poor?” This question shows remarkable agreement with much longer scales.1, pp581-7 On an even broader level, some measures seek to capture the well-being of populations, such as the Canadian Index of Wellbeing, in which health forms a significant component.
Measuring health in the clinic
The applications of health measures fall into three broad categories. Diagnostic instruments collect information from self-reports and clinical ratings, and process these using algorithms to suggest a diagnosis. There are many in psychiatry, such as the Composite International Diagnostic Interview.7 Prognostic measures include screening tests and sometimes information on risk factors, and these may be combined into one of many health risk appraisal systems that are available online. Evaluative measures record change in health status over time and are used to record the outcomes of care. This category forms the largest group of instruments, and includes generic and disease-specific outcome measures.
Objective and subjective indicators
Health indicators can be recorded mechanically, as in a treadmill test, or they may derive from expert judgment, as in a physician’s assessment of a symptom. Alternatively, they may be recorded via self-report, as in a patient’s description of her pain. Mechanical measures collect data objectively in that they involve little or no judgment in the collection of information, although judgment may still be required in its subsequent interpretation. With subjective measures, human judgment (by clinician, patient, or both) is involved in the assessment and interpretation. Subjective health measurements hold several advantages: they describe the quality rather than merely the quantity of function; they cover topics such as pain, suffering, and depression which cannot readily be recorded by physical measurements or laboratory tests; and subjective measures do not require invasive procedures or expensive equipment. The great majority of subjective health measures collect information via questionnaires: many have been extensively tested and are commonly used as outcome measures in clinical trials.1 Drug trials are now required to include quality of life scales, in addition to symptom- or disease-specific scales, in order to record possible adverse side effects of treatment, such as nausea, sleeplessness, etc.
Because objective and subjective indicators each have advantages, they are sometimes combined. For example, in deciding whether or not to undergo chemotherapy or surgery, a cancer patient will wish to balance the expected gain in life expectancy against a judgment of the quality of the prolonged life (considering side effects of treatment, pain, residual disability). At a societal level, this helps to address the question of whether extending life expectancy (e.g., by life-saving therapies) may also increase the number of disabled people in society. This question led to the development of combined mortality and quality of life indicators such as quality-adjusted life years (QALYs).
Quality-Adjusted Life Years (QALYs) extend the idea of life expectancy by incorporating an indicator of the quality of life among survivors. Rather than count every year of life lived as though they were equivalent, this statistic downgrades the value of years lived in a state of ill-health: they are counted as being worth less than a year of healthy life. In evaluating a therapy, QALYs count the average number of additional years of life gained from an intervention, multiplied by a judgment of the quality of life in each of those years. For example, a person might be placed on hypertension therapy for 30 years, which prolongs his life by 10 years but at a slightly reduced quality level, owing to dietary restrictions. A subjective weight is given to indicate the quality or utility of a year of life with that reduced quality (say, a value of 0.9 compared to a healthy year valued at 1.0). In addition, the need for continued drug therapy over the 30 years slightly reduces his quality of life by, say, 0.03. Hence, the QALYs gained from the therapy would be 10 years x 0.9 – 30 years x 0.03 = 8.1 quality-adjusted life years.
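The hypertension example can be reproduced with a few lines of Python. The figures are those given in the paragraph above; the function and parameter names are illustrative assumptions.

```python
def qalys_gained(years_gained, utility_of_gained_years,
                 years_on_therapy, utility_loss_from_therapy):
    """Quality-weighted years gained, minus the quality lost to the therapy itself."""
    gain = years_gained * utility_of_gained_years
    cost = years_on_therapy * utility_loss_from_therapy
    return gain - cost

# 10 extra years valued at 0.9 each, minus 30 years of therapy at a loss of 0.03 per year.
result = qalys_gained(years_gained=10, utility_of_gained_years=0.9,
                      years_on_therapy=30, utility_loss_from_therapy=0.03)

print(f"{result:.1f} quality-adjusted life years")
```

Changing the utility weights shows how sensitive the result is to these subjective judgments: a therapy burden of 0.1 per year instead of 0.03 would reduce the gain from 8.1 to 6 QALYs.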
The numerical weights assigned to represent the severity of disabilities are known as utility scores. They range from 0 (death) to 1.0, which represents the best possible health state. Utilities are obtained via studies in which patients, professionals or members of the public use numerical rating methods to express their preferences for alternative outcomes, considering the severity of various levels of impairment. Common rating methods include the “standard gamble” and the “time trade-off”.
The standard gamble involves asking experimental subjects to choose between (i) living in the state being rated (which is less than ideal) for the rest of one’s life, versus (ii) taking a gamble on a treatment (such as surgery) that has the probability p of producing a cure, but also carries the risk 1-p of operative mortality. To record the perceived severity of the condition, the experimenter increases the risk of death until the person making the rating has no clear preference for option (i) or (ii). This shows how great a risk of operative mortality he or she would tolerate to avoid remaining in the condition described in the first option. In principle, the more severe the rater’s assessment of the condition, the greater the risk of dying in the operation (perhaps five, even ten percent) they would accept to escape the condition. This risk is used as an indicator of the perceived “utility” (i.e. severity) of living in that condition.
The time trade-off offers an alternative to the standard gamble. As before, it asks raters to imagine that they are suffering from the condition whose severity is to be rated. They are asked to choose between remaining in that state for the rest of their natural lifespan (e.g., 30 years for a 40-year-old person), or returning to perfect health for fewer years. The number of years of life expectancy they would sacrifice to regain full health indicates how severely they rate the condition. The utility for the person with 30 years of life expectancy would be given as Utility = (30 – Years traded)/30. For example, a rater willing to give up 6 of the 30 years would assign the condition a utility of (30 – 6)/30 = 0.8.
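Both rating methods reduce to simple formulas, sketched below. The time trade-off function follows the formula given above. For the standard gamble we assume the usual scoring convention, in which the utility equals p, the probability of cure at the point of indifference (that is, 1 minus the accepted risk of death). Function names and the example figures are hypothetical.

```python
def time_trade_off_utility(years_remaining, years_traded):
    """Utility = (years remaining - years traded) / years remaining."""
    if not 0 <= years_traded <= years_remaining:
        raise ValueError("years_traded must lie between 0 and years_remaining")
    return (years_remaining - years_traded) / years_remaining


def standard_gamble_utility(accepted_risk_of_death):
    """Assumed convention: utility = p = 1 - risk of death accepted at indifference."""
    if not 0 <= accepted_risk_of_death <= 1:
        raise ValueError("risk must be a probability between 0 and 1")
    return 1.0 - accepted_risk_of_death


# A rater with 30 years of life expectancy who would give up 6 of them:
tto = time_trade_off_utility(30, 6)
# A rater who would accept a 10% operative mortality to escape the condition:
sg = standard_gamble_utility(0.10)
print(f"time trade-off utility:  {tto:.2f}")
print(f"standard gamble utility: {sg:.2f}")
```

In both cases a more severe condition yields a lower utility: the rater is willing to trade more years, or to accept a higher risk of death.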
One application of utility scaling is the Canadian Health Utilities Index, in which being unable to see at all receives a utility score of 0.61 and being cognitively impaired, as with Alzheimer’s disease, receives a score of 0.42 [8]. Note that these utility judgments are subjective and may vary from population to population (offering an insight into cultural values).
Some measurement procedures allow patients themselves to supply the utility weights. This can help clinicians guide a patient in deciding whether or not to undergo a therapy that carries a risk of side-effects. An instrument of this type is the Q-TWiST, or Quality-Adjusted Time without Symptoms and Toxicity [1, pp 559-63].
Disability-Adjusted Life Years (DALYs) and Health-Adjusted Life Years (HALYs) work in a very similar manner to QALYs. DALYs focus on the negative impact of disabilities in forming a weighting for adjusting life years, and HALYs base their valuation on the positive impact of good health. The approach of QALYs, DALYs, and HALYs can also be used to adjust estimates of life expectancy, taking account of quality of life, disability, and health respectively—the last giving rise to the acronym HALE, for Health Adjusted Life Expectancy.
All health indicators, measurements, and clinical tests contain some element of error. There are three chief sources of measurement error: in the thing being measured (my weight tends to fluctuate, so it’s difficult to get an accurate picture of it); in the observer (if you ask me my weight on a Monday, I may knock a pound off if I binged on my mother-in-law’s cooking over the weekend―obviously the extra pound doesn’t reflect my true weight!); or in the recording device (the clinic’s weigh scale has been acting up—we really should get it fixed).
As with sampling, both random and systematic errors may occur (see Chapter 5 on sampling errors). Random errors are like noise in the system: they have an inconsistent effect. If large numbers of observations are made, random errors should average to zero, because (being random) some readings overestimate and some underestimate. They can occur for lots of reasons: a buzzing mosquito distracted Dr. Rao when he took Julie’s blood pressure; you can’t really recall how bad your pain was last Tuesday, and so on. Random errors are detected by testing the RELIABILITY of a measurement.
Systematic errors fall in a particular direction and are likely due to a specific cause. Errors that fall in one direction (I do tend to exaggerate my athletic prowess…) bias or distort a measurement and reduce its VALIDITY. These distinctions are illustrated in Figure 6.7, using the metaphor of target shooting that was introduced in Chapter 5: a wide dispersion of bullets indicates unreliability, whereas off-centre shooting indicates bias or poor validity.
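The difference between the two kinds of error can be made concrete with a small simulation. This is only a sketch: the true weight, the size of the noise, and the size of the bias are all invented for illustration, and a fixed random seed is used so the run is reproducible.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_WEIGHT_KG = 70.0  # hypothetical true value being measured


def simulate_readings(n, noise_sd, bias=0.0):
    """Return n readings = true value + systematic bias + random noise."""
    return [TRUE_WEIGHT_KG + bias + random.gauss(0.0, noise_sd) for _ in range(n)]


# Random error only: wide scatter, but centred on the truth (unreliable, unbiased).
noisy = simulate_readings(10_000, noise_sd=2.0)
# Mostly systematic error: tight scatter, but off target (reliable, not valid).
biased = simulate_readings(10_000, noise_sd=0.2, bias=-1.5)

for label, readings in (("random error", noisy), ("systematic error", biased)):
    mean_error = statistics.mean(readings) - TRUE_WEIGHT_KG
    spread = statistics.stdev(readings)
    print(f"{label}: mean error {mean_error:+.2f} kg, spread (SD) {spread:.2f} kg")
```

With many readings the random errors average out close to zero, whereas the systematic error persists however many readings are taken.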
Reliability refers to dependability or consistency. Your patient, Jim, is unpredictable: sometimes he comes to his appointment on time and sometimes he’s late, but once or twice he was actually early. Jim is not very reliable. Jack, on the other hand, arrives exactly 10 minutes early every time. Even though he comes at the wrong time, Jack is reliable, or predictable. A reliable measure can be very reproducible, but it may still be wrong. If so, it is reliable but not valid (bottom left cell in Figure 6.7).
An introductory definition of validity is: Does the test measure what we are intending to measure? A slightly more wordy definition is: How closely do the results of a measurement correspond to the true state of the phenomenon being measured? A more abstract definition is: What does a given score on this test mean? This last interpretation of validity fits under a more general conception in terms of “How can we interpret these test results?”
There is no single approach to estimating validity: the approach varies according to the purpose of the measurement and the sources of measurement error you wish to detect. In medicine, the commonest way to assess validity is to compare the measurement with a more extensive clinical or pathological examination of the patient. This is called criterion validation, because it compares the measurement to a full work-up that is considered a “gold standard” criterion. A validity study of fecal occult blood testing as a screen for colon cancer might compare test results to colonoscopy for a sample of people that includes some with and some without the disease. This is used where the measurement offers a brief and simple way to assess the patient’s condition, and our question is “How well does this simple method predict the results of a full and detailed (and also expensive, perhaps invasive) examination?”
Table 6.1 outlines a standard 2 x 2 table as the basis for calculating the criterion validity of a test. A population of N patients has been tested for a given disease with the new test (shown in the rows), and each person has also been given a full “gold standard” diagnostic work-up, shown in the columns. (This is the theoretical “gold standard” that we are assuming is correct; unfortunately, in reality gold standards may not be as golden as we would like.) In the notation used in the formulas that follow, a is the number of true positives (TP), b the false positives (FP), c the false negatives (FN), and d the true negatives (TN). Several statistics can be calculated to show the validity of the screening test.
Sensitivity summarizes how well the test detects disease. It is the probability that a person who has the disease will be identified by the test as having the disease. The term makes sense: if a test is sensitive to the disease, it can detect it. Using the notation in the table:
a / (a + c), or TP / (TP + FN)
The complement of sensitivity is the false negative rate, c / (a + c), which expresses the likelihood of missing cases of disease. A test with a low sensitivity will produce a large number of false negative results.
Some mnemonics may help you: SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity = few false negatives). And, on a topic to be discussed later, low seNsitivity leads to a low Negative predictive value.
Specificity measures how well the test identifies those who do not have this disease:
d / (b + d), or TN / (TN + FP)
Specificity is the complement of the false positive rate, b / (b + d), which is the likelihood of people without the disease being mistakenly labelled as having it. Again, the term is intuitive: a specific screening test is one that detects only the disease it is specifically designed to detect; hence, it will not give people with other conditions false positive scores. Specificity is clinically important because a false positive test result can cause worry and lead to the expense of unnecessary further investigation and perhaps unnecessary interventions.
Some mnemonics to help you: SPecificity is inversely associated with the rate of false Positives. And low sPecificity leads to a low Positive predictive value.
Most diagnostic and screening tests provide continuous scores, and a crucial point to recognize is that imperfect validity will cause scores on the test to overlap between those with and those without the condition. This is due to natural biological variability and to random test errors, and is illustrated in Figure 6.8. To interpret test scores, a cut-point is identified to distinguish positive test results from negative. Because the two distributions overlap, any single cut-point involves a trade-off, and it is rare for a test to have both very high sensitivity and very high specificity. If the cut-point in the diagram is moved to the right, specificity will increase, but sensitivity will fall, perhaps quite sharply, as more of the true cases (in red) are missed. The reverse is also true. The implications of this dilemma will appear in the paragraphs that follow.
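The trade-off can be demonstrated by sweeping a cut-point across two overlapping sets of simulated scores. Everything here is assumed for illustration: the normal distributions, their means and spreads, the group sizes, and the convention that higher scores indicate disease (so that moving the cut-point "to the right" means raising it).

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical test scores: people with the disease tend to score higher,
# but the two distributions overlap.
healthy = [random.gauss(100, 15) for _ in range(2000)]
diseased = [random.gauss(130, 15) for _ in range(500)]


def sens_spec(cut_point):
    """Scores at or above the cut-point count as positive test results."""
    sensitivity = sum(score >= cut_point for score in diseased) / len(diseased)
    specificity = sum(score < cut_point for score in healthy) / len(healthy)
    return sensitivity, specificity


results = {cut: sens_spec(cut) for cut in (100, 110, 120, 130, 140)}
for cut, (se, sp) in results.items():
    print(f"cut-point {cut}: sensitivity {se:.2f}, specificity {sp:.2f}")
```

As the cut-point rises, specificity climbs while sensitivity falls; no cut-point in the overlapping region achieves both.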
Sensitivity and specificity are inherent properties of a test, and are useful in describing its expected performance. But they can only be measured if the actual disease status of individuals undergoing the test is known. Naturally, when we apply a test in clinical practice we do not know who has the disease; we are using the test to help find out. In effect, we know which row of Table 6.1 the patient belongs in, but not which column. Therefore, we are more interested in what a negative or positive test result means for the patient: what is the likelihood that a positive score indicates disease? For this, we use the predictive values.
The positive predictive value (PPV) shows what fraction of patients who receive a positive test result actually have the disease:
a / (a + b), or TP / (TP + FP)
You can see from Table 6.1 that a test with low specificity (i.e. lots of false positives, so b is large) will have a low PPV.
Correspondingly, the negative predictive value (NPV) shows how many people who receive a negative score really do not have the condition:
d / (c + d), or TN / (TN + FN)
If the test has low sensitivity, FN will be large, so its NPV will be reduced.
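The four statistics defined so far can be computed directly from the cells of the 2 x 2 table. The sketch below uses the a, b, c, d notation of the formulas above; the class name and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2 x 2 validation table (test result by gold standard)."""
    a: int  # test positive, disease present (true positives, TP)
    b: int  # test positive, disease absent  (false positives, FP)
    c: int  # test negative, disease present (false negatives, FN)
    d: int  # test negative, disease absent  (true negatives, TN)

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    def ppv(self) -> float:
        return self.a / (self.a + self.b)

    def npv(self) -> float:
        return self.d / (self.c + self.d)


# Hypothetical validation study of 400 people, 100 of whom have the disease.
table = TwoByTwo(a=90, b=30, c=10, d=270)
print(f"sensitivity = {table.sensitivity():.2f}")  # 0.90
print(f"specificity = {table.specificity():.2f}")  # 0.90
print(f"PPV         = {table.ppv():.2f}")          # 0.75
print(f"NPV         = {table.npv():.2f}")          # 0.96
```

Note that sensitivity and specificity are calculated down the columns (by true disease status), whereas the predictive values are calculated along the rows (by test result).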
Predictive values and prevalence
In a further complication of interpreting test scores, predictive values vary according to the prevalence of the disease in the population of patients for whom the test is used. Clinicians must bear this in mind when interpreting a test result: you must treat the patient, not the test result! The reasons are illustrated in Figure 6.9, which contrasts the performance of the same test in high and low prevalence settings.
Tests are often validated in hospital settings, where the prevalence of the disease being tested for is high, similar to that shown in the left panel of the figure. However, the test may then be used in primary care settings, where the disease prevalence is lower, as shown in the right panel.
Note that the sensitivity and specificity of the test remain the same in both settings (they are properties of the test), but the predictive value of a positive test is very different. This is simply because there are many fewer cases to be identified and many more non-cases in the primary care setting. So, if specificity is not extremely high, the number of false positives can exceed the true positives. At the same time, lower prevalence means that a negative test result is more accurate: you can reassure your primary care patient with a negative score that he is very unlikely to have the disease (you will, of course, remind him to come back for re-evaluation if his symptoms continue: he may be one of the few with a false negative result).
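The effect of prevalence can be checked numerically. The sketch below takes the figures quoted later in the chapter for the Figure 6.9 example (sensitivity and specificity both 0.91; prevalence 33% in hospital and 3% in primary care); the function name is our own.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


for setting, prevalence in (("hospital", 0.33), ("primary care", 0.03)):
    ppv, npv = predictive_values(0.91, 0.91, prevalence)
    print(f"{setting} (prevalence {prevalence:.0%}): PPV {ppv:.2f}, NPV {npv:.3f}")
```

The same test gives a positive predictive value of roughly 83% in the hospital setting but only about 24% in primary care, where false positives outnumber true positives by about three to one; the negative predictive value moves in the opposite direction.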
In summary, interpreting test results requires insight into the population on which you are applying the test. Beware of applying screening or diagnostic tests in low-prevalence settings: you may find many false positive results. For instance, in general population breast cancer screening programmes, the positive predictive value of a positive mammogram is only around 10%, so for every 100 women who are recalled for further investigation after an abnormal screening mammogram, 90 will not have cancer [9]. One strategy to improve the positive predictive value of a test is to change from screening everyone (universal screening) to screening selectively. For example, test only people at high risk of the condition: those with risk factors, a family history, or positive symptoms, among whom the disease will be more prevalent.
Taking this one step further, the more tests you administer to a patient, for example in an annual physical exam, the more likely you are to get a false positive (and therefore misleading) score on one or more of the tests. Hence, tests should be chosen carefully and applied in a logical sequence to rule in, or rule out, a specific diagnosis that you have in mind. Each test should be chosen based on the conclusion you draw from the results of the previous test. Remember that unnecessary tests are not only costly to the health care system, but also ethically questionable if they expose a patient to unnecessary risks such as radiation from an X-ray or unnecessary further investigation after a false positive result.
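A rough calculation shows how quickly false positives accumulate. The sketch assumes a patient who has none of the conditions being tested for, and tests that are independent of one another and share the same specificity; real test panels are rarely that tidy, so treat the figures as an illustration only.

```python
def prob_any_false_positive(specificity, n_tests):
    """Chance that a disease-free patient gets at least one positive result
    across n_tests independent tests, each with the given specificity."""
    return 1 - specificity ** n_tests


for n in (1, 5, 10, 20):
    print(f"{n:2d} tests: {prob_any_false_positive(0.95, n):.0%}")
```

With an assumed specificity of 95% per test, a healthy patient given ten such tests has about a 40% chance of at least one false positive.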
Finally, we may link the themes of random versus systematic error to the sensitivity and specificity of a test. In some tests, imperfect sensitivity may be due to random errors; in others it may be due to a systematic error, such as the cut-point being set too high for a particular type of patient. If the error is random and the test is repeated, the diseased and non-diseased might be reclassified [10]. In such a situation it might be appropriate to repeat slightly elevated tests to confirm that the value is stable (see Nerd’s Corner “Regression to the mean”). The decision to label someone must be made taking the whole picture into account, not just a single test result. There is an inherent tension between wishing to intervene early (when treatment may prevent further deterioration) and falsely labelling a person as a patient. Such balances represent part of the art of medicine.
Regression to the mean
Clinicians frequently use test results to rule in, or to rule out, a possible diagnosis, but the logic of this is often misunderstood. To rule a diagnosis in, you need a test that is high in specificity; to rule a diagnosis out you need a test that has high sensitivity. This may sound counter-intuitive, so we will explore it further, beginning with ruling a diagnosis out. A perfectly sensitive test will identify all cases of a disease, so if you get a negative result on a sensitive test, you can be sure that the patient does not have this disease. The mnemonic is “SnNout”: sensitive test, Negative result rules out. Conversely, to rule a diagnosis in, you need a positive result on a specific test, because a specific test would only identify this type of disease. The mnemonic is “SpPin”: specific test, Positive result rules in. Note, unfortunately, that if the patient gets a negative score on this test, they may still have the disease (i.e., the test is specific and therefore maybe not very sensitive, and they had a false negative result).
Linking back to the sequence of testing, if the goal is to rule alternative diagnoses out, then several tests can be run together, to increase sensitivity for detecting rival diagnoses. If the goal is to rule a diagnosis in, the tests can be administered serially, stopping when a positive result is obtained. For example, HIV can be tested first using a sensitive (but not very specific) serological test. This will find the true positives, but also many false positives. Therefore the positives are re-tested using a specific test (Western blot) that will exclude the false positives. The use of tests to enhance the likelihood of a diagnosis introduces the idea of likelihood ratios.
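The effect of combining two tests can be sketched with the standard textbook formulas for parallel and serial testing. These assume that the two tests err independently of each other given the patient's true disease status, which real tests may not; the sensitivity and specificity values are invented and are not the characteristics of any actual HIV assay.

```python
def parallel(se1, sp1, se2, sp2):
    """Run both tests; call the result positive if EITHER test is positive."""
    sensitivity = 1 - (1 - se1) * (1 - se2)
    specificity = sp1 * sp2
    return sensitivity, specificity


def serial(se1, sp1, se2, sp2):
    """Re-test first-test positives; call the result positive only if BOTH are positive."""
    sensitivity = se1 * se2
    specificity = 1 - (1 - sp1) * (1 - sp2)
    return sensitivity, specificity


# Hypothetical sensitive screening test followed by a specific confirmatory test.
screen = (0.99, 0.90)   # (sensitivity, specificity)
confirm = (0.95, 0.99)

se_par, sp_par = parallel(*screen, *confirm)
se_ser, sp_ser = serial(*screen, *confirm)
print(f"parallel: sensitivity {se_par:.4f}, specificity {sp_par:.4f}")
print(f"serial:   sensitivity {se_ser:.4f}, specificity {sp_ser:.4f}")
```

Parallel testing buys sensitivity at the cost of specificity, which suits ruling out; confirming positives with a second test does the reverse, which suits ruling in.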
You may occasionally hear someone argue that you need a sensitive test to rule a diagnosis in, but this is false.
The reason is that a sensitive test will indeed identify most of the true cases of the disease, but a highly sensitive test will often have lower specificity (look back at Figure 6.8). This means that there may be a number of false positives mixed in, so the sensitive test cannot rule a disease in.
A likelihood ratio combines sensitivity and specificity into a single figure that indicates by how much having the test result will reduce the uncertainty of making a given diagnosis. The likelihood ratio is the probability that a given test result would occur in a person with the target disorder, divided by the probability that the same result would occur in a person without the disorder.
A positive likelihood ratio (or LR+) indicates how much more likely a person with the disease is to have a positive test result than a person without the disease.
LR+ = sensitivity / (1 – specificity),
or the ratio of the true positive rate to the false positive rate. Using the terms in Table 6.1, the formula is
[a / (a + c)] / [b / (b + d)].
A negative likelihood ratio (or LR–) indicates how likely a person with the disease is to have a negative test result, compared to a person without the disease. Because people with the disease are less likely to test negative, a useful test has an LR– below 1.
LR– = (1-sensitivity) / specificity,
the ratio of the false negative rate to the true negative rate. Referring to Table 6.1, the formula is
[c / (a + c)] / [d / (b + d)].
In other terms, the LR+ expresses how much a positive test result increases the odds that a patient has the disease; an LR– indicates how much a negative test decreases the odds of having it.
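Both ratios follow directly from sensitivity and specificity, as the short sketch below shows. The function name is ours; the input values (0.91 for both) are those of the chapter's Figure 6.9 example.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


lr_pos, lr_neg = likelihood_ratios(0.91, 0.91)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")  # LR+ = 10.1, LR- = 0.099
```

An LR+ well above 1 and an LR– well below 1 both indicate an informative test; a ratio of exactly 1 would mean the result leaves the odds of disease unchanged.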
In applying this to a clinical situation the use of a nomogram (Figure 6.10) removes the need for calculation. It is necessary to begin from an initial estimate of the patient’s likelihood of having the disease, called the pretest probability; this estimate will then be adjusted according to the test result. The pre-test probability can be difficult to estimate (see Nerd’s corner: Pretest probability), but a first guess is based on the prevalence of the condition in the setting where you are practising. This can then be modified upwards or downwards by your initial clinical impression and history-taking for this particular patient. The scale on the left of the nomogram shows the pre-test probability of disease; the central column shows the likelihood ratios, and the right-hand scale shows the post-test probability. To use the nomogram, you need to know (or calculate) the likelihood ratio for the test. If the test result is positive, draw a straight line on the nomogram from the pre-test probability through the likelihood ratio; where this line cuts the right-hand scale indicates the post-test probability that the patient has the disease. Post-test probability means the likelihood that the patient has this disease, taking into account the initial probability and the test score. Of course, if from your initial observations you feel virtually certain the patient has the disease, a positive score on the test will not add very much new information; but a negative score would.
Applying this to the example in Figure 6.9, test sensitivity and specificity are both 0.91 so the LR+ is 0.91/(1-0.91) = 10.1. In general, tests with positive LRs higher than about 5 are useful in ruling in a disease. Draw a line through the pre-test probability on the left of the diagram, through 10.1 in the central column, and then read off the post-test probability on the right-hand column. For the hospital setting, the prevalence was 33%, while it was 3% for the primary care setting; these values give a rough estimate of pre-test probability. So in the hospital setting, a positive test would mean that a patient’s post-test probability of having the disease is over 80% (blue line), whereas in the primary care setting it would be around 20% (red line). In both instances, the test result has substantially increased the clinical probability of the patient having the disease. In the hospital setting, you are now pretty certain of the diagnosis. In the primary care setting, as long as the situation is not urgent, you might want to increase your certainty by further investigation before launching into possibly harmful treatment.
The estimation of pretest probability reminds us of the distinction introduced in Chapter 2 between health determinants and risk factors. Determinants set the incidence rate in a population, and incidence offers a first approximation to the pretest probability of disease for an individual from that population.
We cannot, however, apply population data directly to the individual, who is unique and so unlikely to match the population average perfectly. Instead, consideration of individual risk factors can be used to modify the crude estimate of pretest probability upwards or downwards. For example, incidence for 35-year-old males may be x per cent, but you see a 35-year-old man who is overweight and smokes, so his risk might be estimated as 2x per cent. Furthermore, the pattern of signs and symptoms could raise the estimated probability even more, perhaps to 4x per cent.
In practice, risk is often described imprecisely as high or low clinical suspicion, and a diagnostic test is then applied to confirm or rule out the putative diagnosis. A positive confirmatory test, however, often indicates only a higher level of probability, albeit a level at which doubt must be suspended until evidence is found to the contrary. In a similar way, when a test has ruled out the disease the clinician can inform the patient that he is at very low risk, but nonetheless should report worrying symptoms. Clinicians must always be ready to review and revise their diagnoses.
When the test result is negative, the likelihood ratio (LR-) gives a result below 1: values smaller than 0.2 or so are useful in ruling out a disease. In our example, the LR- is 0.099. In the hospital setting, a patient who receives a negative test result would have a post-test probability of having the disease of just under 5% (down from 33% before the test was administered). The primary care patient with a negative test would have a post-test probability of having the disease of about 0.3%: he almost certainly does not have it (roughly a 1 in 330 chance).
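The nomogram is a graphical shortcut for a calculation on the odds scale: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. The sketch below applies this to the chapter's example (LR+ and LR– derived from sensitivity and specificity of 0.91, with prevalence standing in for pre-test probability); the function name is our own.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via the odds scale."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


LR_POS = 0.91 / (1 - 0.91)   # about 10.1
LR_NEG = (1 - 0.91) / 0.91   # about 0.099

for setting, pretest in (("hospital", 0.33), ("primary care", 0.03)):
    after_positive = post_test_probability(pretest, LR_POS)
    after_negative = post_test_probability(pretest, LR_NEG)
    print(f"{setting}: pre-test {pretest:.0%}, "
          f"after positive {after_positive:.1%}, after negative {after_negative:.1%}")
```

Exact arithmetic gives about 83% (hospital) and 24% (primary care) after a positive result, and about 4.6% and 0.3% after a negative one; values read off a nomogram are necessarily more approximate.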
Medicine requires binary decisions: whether to prescribe a treatment or not, to operate or not, or to reassure the patient that he does not have a disease. However, most biological measurements do not provide binary categories, but instead produce a continuous range of values, as with blood pressure, blood cholesterol, glucose, creatinine or bone density. Hence, a cut-point on each of these scales has to be chosen to separate the “normal” from the “abnormal” results. Even with qualitative assessments, such as X-rays or histology slides, decisions must be made among a range of findings, which vary from definitely abnormal to definitely normal.
Defining normal is not as simple as it might seem. Superficially, it is defined in terms of the average, or most common, presentation for a person of that type. Unfortunately, this does not necessarily imply that it is healthy: on average, Canadians are overweight. In addition, abnormality occurs at both ends of a continuum—being underweight and being overweight are both unhealthy. Therefore, in place of the average, normal could be defined in terms of a range, perhaps defined by percentiles or by standard deviations, on the continuum being measured, such as body weight. This seems to move the notion of normal towards healthy, but setting the margins of the distribution is challenging. We cannot justify defining the normal range in terms of, say, less than two standard deviations above or below the mean, as this would vary from measure to measure.
A more promising approach returns us to the theme of evidence-based medicine and defines normal in terms of a range of scores above or below which treatment would be beneficial: abnormal is the threshold beyond which a person would benefit from treatment. This idea links to an approach to defining need for care (see Chapter 7). An implication of this approach is that an evolution in treatments would modify the range of what is considered normal. For example, new therapies treat cognitive loss earlier than before, so new layers of cognitive impairment are being defined among people who would previously have been considered normal, or at least accepted as having “benign senescent forgetfulness”. In a similar way, cut-points for defining hypertension have changed. Pre-hypertension was redefined in 2003 as a systolic blood pressure of 120-139 mmHg or a diastolic pressure of 80-89 mmHg [12]. By altering cut-points, more people are classified as having the disease and, therefore, become eligible for treatment. The reason for altering the cut-points has usually been because a clinical trial has shown new treatments achieve better outcomes for this group of patients, although the improvement may be small. Not surprisingly, such change finds favour with the drug companies that make and sell the treatments.
1. What are the leading causes of death in Canada? How does the ranking of these causes change if you were to use Potential Years of Life Lost?
1. McDowell I. Measuring health: a guide to rating scales and questionnaires. New York (NY): Oxford University Press; 2006.
2. Morgenstern H. Ecologic studies in epidemiology: concepts, principles, and methods. Annu Rev Public Health. 1995;16:61-81.
3. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. New York: Lippincott, Williams and Wilkins; 2008.
4. Goldman DA, Brender JD. Are standardized mortality ratios valid for public health data analysis? Statistics in Medicine. 2000;19:1081-8.
5. Court B, Cheng K. Pros and cons of standardized mortality ratios. Lancet. 1995;346:1432.
6. Notzon FC, Komarov YM, Ermakov SP, Sempos DT, Marks JS, Sempos EV. Causes of declining life expectancy in Russia. JAMA. 1998;279(10):793-800.
7. World Health Organization. The Composite International Diagnostic Interview, version 1.1: researcher’s manual. Geneva, Switzerland: WHO; 1994.
8. McMaster University. Health utilities group health utilities index and quality of life 2000. Available from: http://www.fhs.mcmaster.ca/hug/index.htm.
9. Wright CJ, Mueller CB. Screening mammography and public health policy: the need for perspective. Lancet. 1995;346(8966):29-32.
10. Vickers AJ, Basch E, Kattan MW. Against diagnosis. Ann Intern Med. 2008;149(3):200-3.
11. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Philadelphia (PA): Lippincott, Williams & Wilkins; 1991.
12. Chobanian AV, Bakris GL, Black HR, et al. The seventh report of the joint national committee on prevention, detection, evaluation, and treatment of high blood pressure: the JNC 7 report. JAMA. 2003;289:2560-72.