After reading this chapter, you will be able to:
1. Describe the important sources of available health data and recognize the advantages and disadvantages of each (MCC 78-2)
3. Critically appraise measurement issues, including:
• Interpreting test results:
− Positive and negative predictive values
− Setting cut-scores
− Ruling in, and ruling out, a diagnosis
− Likelihood ratios
− Establishing a normal value
These topics relate to the Medical Council exam objectives, especially section 78-2.
Note: The colored boxes contain optional additional information; click on a box to open it and click again to close it.
Words in CAPITALS are defined in the Glossary.
Dr. Rao reviews some health indicators
Dr. Rao, the Richards family’s physician, saw some statistics about Goosefoot in the regional public health department’s Physician Update pamphlet. This gives demographic information on the age and sex breakdown of the population and data on average income, unemployment, and educational attainment. There is also information on health habits and hospital admission rates, death rates, and consultation rates. He is looking at the page showing the following information about the local area and for the whole Weenigo region:
Health indicators for Goosefoot community, compared to the region
Major health indicators:
- Annual number of deaths (3-year average)
- Annual mortality per 100,000 (3-year average)
- Age-standardized mortality rate per 100,000 (3-year average)
- Life expectancy at birth, men (years)
- Remaining life expectancy at 65, men (years)
- Life expectancy at birth, women (years)
- Remaining life expectancy at 65, women (years)
- Number of live births (2022)
- Infant mortality rate per 1,000 live births (2022)
- Perinatal mortality rate per 1,000 total births (2022)
[Table values for Goosefoot and the Weenigo region not reproduced here.]
Dr. Rao scratches his head and wonders what these numbers mean and why there are so many different measures of mortality.
The medical model of health described in Chapter 1 defined and measured health in terms of low rates of adverse health events. Early measures of population health were based on rates of “the five Ds” introduced in Chapter 1: death, disease, disability, discomfort, or distress. Note that the five Ds form a spectrum, from objective, numerical measures to more subjective, qualitative indicators, and also from those that are routinely collected (death certificates) to those that are available only from a research study (e.g., survey questions on feelings of distress).
The quality of health data varies. Indicators based on mortality are robust and almost complete because death certification is a legal requirement, although the accuracy of the recorded diagnosis may vary. Morbidity data (such as those published by the Canadian Institute for Health Information) may also be reasonably complete. Diagnoses can be taken from hospital discharge summaries, for example, although such data can only be generalized to people treated in hospital. The quality of statistics derived from medical records depends on the accuracy with which physicians completed the original forms, although in practice these records are reasonably accurate. Because of their availability and comparability, mortality and morbidity statistics are used by national and international agencies to compare health status between countries. Some of the commonest statistics used for this purpose include death rates per thousand, infant mortality rates, average life expectancy, and a range of morbidity indicators, such as rates of reportable disease. Data can be further analyzed within a region to compare the health of different groups of people, or to track particular health problems such as influenza or COVID-19.
Interpreting morbidity figures
Evolving health indicators
Health indicators can report information on individuals or on whole populations. The most familiar measures of population health, mortality or morbidity rates, are nonetheless based on counts of individuals, aggregated up to a population level (incidence rates and prevalence are examples). These may be termed aggregated measures of population health.2 A second class of health measures includes ecological indicators, used to record factors that affect human health directly. These may be recorded either in the individual or in the environment: lead levels can be measured in the patient’s blood, or else in the air, water or soil. A third category of measures include environmental indicators that act indirectly and have no obvious analogue at the individual level. Healthy public policies, for example, can be designed to enhance equity in access to care, or to limit smoking in public places. Such policies may be viewed as indicators of the healthiness of the entire population: is this a caring society that tries to protect the health of its citizens? The contrast between aggregated and indirect environmental measures corresponds to the distinction between health in the population and health of the population that was introduced in Chapter 1.
Incidence and prevalence are aggregate indicators of health that serve different, yet overlapping, purposes. Incidence, or the number of new cases that occur in a given time period, is useful for acute conditions, while prevalence (the total number of cases in the population) applies more to chronic diseases. Causal analyses study incident cases; prevalence is useful in estimating need for health services. For example, we assess the incidence of road traffic injuries under different conditions when looking for ways to prevent them; we assess the prevalence of long-term disability due to road traffic injuries when planning rehabilitation services. Incidence is a measure of the speed at which new events (such as deaths or cases of disease) arise in a population during a fixed time. It may be measured as a frequency count, or as a proportion of the population at risk, or as a rate per unit of time. The distinction between incidence rate and proportion is illustrated in the additional materials link.
Incidence proportion and rate
Incidence can be measured in two main ways and Figure 6.1 illustrates the difference. It shows the results of following six people from the moment they entered a study (or a population) and the moment some of them experienced the events that we are counting, in this case deaths.
One approach is to measure the incidence proportion, which is the number of events that occurred during the time period, divided by the number of people at risk of having the event, counted at a specified time point during that period. The diagram has been deliberately drawn to highlight the challenge posed in setting this denominator for the calculation when people move into or out of the population.
- We could take the denominator as including only those present from the beginning of the year – this is the idea of a closed cohort study that does not add new people after the beginning. Persons A, B and C were present at the beginning; B and C died, giving an incidence proportion of 2/3 or 0.67 per year. This calculation is also known as the cumulative incidence.
- Alternatively, we could think of this as an open cohort (allowing for both immigration and emigration) and define the population as those present at the mid-point of the study. This includes persons A, B, D, E and F, of whom two died, giving an incidence proportion of 2/5 or 0.4 for the observation year.
Oops: neither approach seems ideal, and both can lead to biased estimates, especially in small populations where migration is common. An alternative is to calculate the incidence rate, or incidence density, which adds up the time that each person was followed to form the denominator. So, in our example:
- Person A: 12 months
- Person B: 10 months
- Person C: 3 months
- Person D: 11 months
- Person E: 5 months
- Person F: 7 months.
This gives a total of 48 person-months (or 4 person-years) of observation. Using this as the denominator and counting all the events (here, deaths) as the numerator, we get 3 per 4 person-years, or 0.75 per person-year. Conceptually, this represents a concentration or density of events over a composite time period, using time rather than individuals as the denominator. It is therefore a rate and can be thought of as the force of mortality (or of any other event) acting on the population. It has no meaning at the individual level. By contrast, the incidence proportion focuses attention on the risk for individuals and is a probability with an upper limit of 100%, whereas the rate has no upper limit. The two approaches are linked mathematically, and more details on the relationship between them are given, for example, by Rothman et al.3 The relationship introduces the notion of survival functions that will be described below.
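The person-time bookkeeping in this example can be sketched in Python. The exact entry and exit months assumed for persons D, E and F are hypothetical, chosen only to be consistent with the follow-up times and death counts listed above:

```python
# Six people followed over a 12-month observation year (as in Figure 6.1).
# entry/exit are months; died=True marks a death at the exit time.
people = {
    "A": {"entry": 0, "exit": 12, "died": False},
    "B": {"entry": 0, "exit": 10, "died": True},
    "C": {"entry": 0, "exit": 3,  "died": True},
    "D": {"entry": 1, "exit": 12, "died": False},  # entry/exit assumed
    "E": {"entry": 4, "exit": 9,  "died": True},   # entry/exit assumed
    "F": {"entry": 5, "exit": 12, "died": False},  # entry/exit assumed
}

# Closed cohort: only those present from the start (month 0).
closed = [p for p in people.values() if p["entry"] == 0]
closed_proportion = sum(p["died"] for p in closed) / len(closed)  # 2/3

# Open cohort: denominator = those present at the mid-point (month 6).
mid_cohort = [p for p in people.values() if p["entry"] <= 6 < p["exit"]]
mid_proportion = sum(p["died"] for p in mid_cohort) / len(mid_cohort)  # 2/5

# Incidence rate (incidence density): all deaths per total person-time.
person_months = sum(p["exit"] - p["entry"] for p in people.values())  # 48
deaths = sum(p["died"] for p in people.values())                      # 3
rate_per_person_year = deaths / (person_months / 12)                  # 0.75
```

Note how the three answers differ (0.67, 0.4 and 0.75) even though they describe the same six people; the choice of denominator is what distinguishes them.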
In research studies that monitor individuals, incidence is generally calculated as incidence density, as this is the more precise measure. But with a large population it is not practical to record the duration of follow-up for each person, so surveillance projects generally calculate the incidence proportion using the population size at the mid-point of the observation period as the denominator. This is sufficiently accurate in a large population because migration is generally not large enough to affect the results. However, precise population counts are available only every 5 years from the Census, so the denominator is based on an estimate of the population size. This makes it impossible to individually link people in the denominator with recorded events, as was done in Figure 6.1. For either method of calculation, the result is multiplied by 1,000 or 100,000 (depending on the rarity of the events) to arrive at an incidence of death (i.e., a mortality rate) such as the figure of 884 per 100,000 per year for Goosefoot.
While incidence measures events, prevalence is a measure of disease state; it is a proportion that counts all existing cases at a particular time, divided by the population size. It reflects both the disease incidence and its duration, which is linked to survival. The time period for calculating prevalence is commonly a single point in time: point prevalence. Alternatively, prevalence can be calculated for a period such as a year: period prevalence. Prevalence is generally the measure of choice in assessing the burden of a chronic disease because new cases might be quite rare and yet last a long time, requiring care and causing significant disability.
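The link between prevalence, incidence and duration can be sketched numerically. Under steady-state conditions, and for a relatively rare disease, point prevalence is approximately the incidence rate multiplied by the mean duration of the disease; the figures below are hypothetical:

```python
# Steady-state approximation: prevalence ≈ incidence rate × mean duration.
# This holds when the disease is relatively rare and both incidence and
# duration are stable over time. All numbers are hypothetical.
incidence_rate = 0.002        # 2 new cases per 1,000 person-years
mean_duration_years = 10      # a chronic disease lasting 10 years on average

prevalence = incidence_rate * mean_duration_years  # ≈ 0.02, i.e. 2% of the population
```

This illustrates why a chronic disease with a low incidence can still impose a large burden: long duration inflates prevalence.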
Hospitals: where prevalence is high but incidence is low
Mortality is an event that can be presented as an incidence rate or proportion. But it is traditional to speak of “mortality rates” however the figures were actually calculated, and we will follow that tradition. There are various forms of mortality rate, beginning with those that refer to particular groups, such as child mortality rates, and proceeding to the overall, crude mortality (or crude death) rate. Because the death of a child represents such a major loss of potential life, and also because children are vulnerable and child health is sensitive to variations in the social environment, there are several indicators of child mortality.
Infant mortality rate (IMR)
The infant mortality rate is the total number of deaths in a given year of children less than one year old, divided by the number of live births in the same year, multiplied by 1,000. Because infant mortality is strongly influenced by environmental factors and the quality of health care, the IMR is often quoted as a useful indicator of the level of community health, especially in poorer countries. However, because of the rarity of infant death in developed countries, it is useful only in large populations as chance variation can make rates unstable in small populations.
Perinatal mortality rate (PMR)
In most industrially developed nations, this is defined (for a given year) as the number of stillbirths plus deaths in the first week of life, divided by the total number of births (live births plus stillbirths), multiplied by 1,000.
Neonatal mortality rate (NMR)
The neonatal mortality rate is the number of deaths of infants aged under 28 days in a given year, divided by the number of live births in the same year, multiplied by 1,000.
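These child mortality indicators can be sketched in Python using their standard definitions (the perinatal definition shown — stillbirths plus first-week deaths over total births — is the one used in most industrially developed nations); all counts below are hypothetical:

```python
# Hypothetical counts for one calendar year in a small population.
infant_deaths = 12         # deaths under 1 year of age
early_neonatal_deaths = 5  # deaths in the first 7 days of life
neonatal_deaths = 7        # deaths in the first 28 days of life
stillbirths = 6
live_births = 2500
total_births = live_births + stillbirths  # live births + stillbirths

# Infant mortality rate: infant deaths per 1,000 live births.
imr = infant_deaths / live_births * 1000                           # 4.8

# Perinatal mortality rate: stillbirths + first-week deaths per 1,000 total births.
pmr = (stillbirths + early_neonatal_deaths) / total_births * 1000  # ≈ 4.4

# Neonatal mortality rate: first-28-day deaths per 1,000 live births.
nmr = neonatal_deaths / live_births * 1000                         # 2.8
```

Note the differing denominators: the PMR uses total births (including stillbirths), while the IMR and NMR use live births only.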
Dr. Rao reviews infant mortality figures
Dr. Rao observes with some pride the lower infant mortality rate for Goosefoot (4.2 per thousand) compared to the region as a whole (4.9): this is an indicator that is sensitive to the quality of local medical care and to the home environment during infancy. His team has paid close attention to these factors and the practice nurse makes routine home visits to young mothers with newborns. Indeed, his nurse tells him, their figure is lower than the national average (4.5 in 2017). Cause for celebration! He thinks back to his home country of India, where the IMR still runs about ten times higher than in Goosefoot, and he ponders the huge impact that good access to nutrition, safe environments, and adequate primary care can have in producing this contrast.
But then Dr. Rao notes that the perinatal mortality rate is higher than the infant mortality; as these two indicators overlap, he figures it must mean that there are a lot of fetal deaths in Goosefoot. He worries about a possible connection to environmental pollution, perhaps linked to the mining industry. He realizes he has some work to do to investigate…
The crude mortality rate gives an estimate of the rate at which members of a population die during a specified period (typically a year). The numerator is the number of people dying during the period; the denominator combines the size of the population, usually taken at the middle of the period (the mid-year population), and the duration of observation.
(Notes: The rate may be multiplied by a number larger than 1,000 for rare diseases, to bring the value to a convenient whole number. The time period is commonly a year, but the example above quoted three-year averages for Goosefoot. This was done because in a small population the rates might fluctuate somewhat from year to year, so the 3-year average gives a more stable picture.)
Why “crude”? The term warns us that comparing rates of disease or death between different populations may be misleading. Consider two towns, one with a relatively young population and the other a retirement community. Because death rates increase with age, one would expect more deaths in the retirement community, so age is a CONFOUNDING factor (see Chapter 5) and a direct comparison of mortality rates would reflect the demographic differences as much as their health. If the purpose is to focus on the health of individuals in each community this would be misleading. We can, however, adjust the crude rates to remove the effect of confounding factors, permitting a more informative comparison of health status. This adjustment calculates rates specific to strata of the population, here age groups, producing age-specific mortality rates. These can then be combined into an overall age-standardized rate that reduces the effect of age, to show how health in the two communities would have compared if (hypothetically) their age-structure had been the same. Rates may also be adjusted for more than one characteristic of the population, for example calculating age-, sex-, and race-specific death rates.
Standardization is used when comparing mortality in populations that differ in terms of characteristics known to influence mortality, and whose effect one wishes to temporarily remove. Standardization can also be used in tracing health in a population over time to adjust for demographic changes, and it can be used in comparing the performance of different clinicians, adjusting for differing case-mixes of their practices. “Adjustment” is the more general term covering standardization and other methods for removing the effects of factors that distort or confound a comparison. “Standardization” refers to an approach using weighted averages derived from a standard reference population.
An advantage of using age-standardized mortality rates
Standardization uses either a direct or an indirect method (the calculations are shown in the Nerd’s Corner box below). The direct approach provides more information but requires more data. Direct standardization is expressed as an age-standardized rate: x number of deaths per y number of individuals. For instance, in 2010 the crude death rate from injuries in Alberta, calculated by simply dividing the number of deaths from injury by the total population, was 49 per 100,000. This figure is useful in estimating need for services. But to highlight the possible impact of injuries in the oil sector, a comparison to other provinces without an oil sector is necessary, and this must be standardized because they will have different age-structures. It is important to understand that this standardized figure is artificial and can only be used in making comparisons across time or place. Standardized rates may be compared in both absolute and relative terms: as a simple difference between populations or as a ratio of two standardized rates.
Indirect standardization is typically used when stratum sizes (e.g., age groups) in the study population are small, leading to unstable stratum-specific rates. Indirect standardization only requires the overall mortality figure for the study population; the stratum-specific death rates are taken from a much larger reference population. The result is expressed as a standardized mortality ratio (SMR), which is the ratio of the deaths observed in the study population to the number that would be expected if this population had the same structure as the larger reference population. An SMR of 100 signifies that deaths are at the expected level, an SMR of 110 indicates a death rate 10% higher than expected (see Figure 6.3 for an illustration).
Dr. Rao meets the SMRs
Dr. Rao has been pondering the meaning of the Goosefoot standardized mortality rates for a while; he wonders why the standardized rate is so much lower than the crude rate (690 versus 884). It is his wife who, in the end, suggests that the crude rate reflects the older population in Goosefoot: many of the young adults have moved away to find work. She points out that this will, of course, mean more deaths per thousand than in a younger population.
Dr. Rao begins to feel somewhat relieved: standardizing the rates gives a more comparable result, and in fact, when the effect of age is removed, Goosefoot is actually doing better than the region as a whole (690 versus 786 deaths per 100,000). He smiles contentedly.
Calculating age-standardized rates and ratios
Using data on mortality by age-group in Goosefoot and the broader Weenigo region as an example:
- Direct standardization
The age-standardized mortality rate (ASMR) is calculated in 4 steps:
- Select a reference population (usually the country as a whole) and find out from the census how many people there are in each age group (usually 5, 10 or 20-year age groups) and enter the data into a spreadsheet (see example below).
- Calculate age-specific death rates (deaths / population size * 100,000) for each age group in the study populations (Goosefoot and the broader Weenigo region).
- Calculate the number of deaths that would be expected in each age-group if the study populations had the same age structure as the reference population (Canada):
Study population age-specific death rate * reference population size in that age group / 100,000.
For example, for children aged 0-14 in Goosefoot, 111 * 5,607,345 / 100,000 = 6251.
- The ASMR for each study population = Total expected deaths / size of the reference population * 100,000:
[Spreadsheet table omitted: age-specific death rates per 100,000 and expected deaths for Goosefoot and the Weenigo region. Step 4: ASMR = (Total expected deaths / reference population) * 100,000.]
Here, we see that although the crude death rate in Goosefoot (884) was higher than that for the Weenigo region (808), the age-standardized rate is lower. This arises because there are more elderly people in Goosefoot (25% versus 15%); correcting for this indicates that Goosefoot is actually comparatively healthy.
Note that the ASMRs are artificial figures, and have no meaning in isolation: they have meaning only when compared to the crude death rate in the standard population or to the ASMRs from other groups, calculated using the same age-groups and reference population. Super nerds think about the ASMR as a weighted average of the age-specific rates for a place, with the weights being the proportion of the reference population that falls within each age group.
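The four steps of direct standardization can be sketched in Python. The age groups, rates and reference population below are hypothetical illustrations, not the Goosefoot figures:

```python
# Direct standardization in four steps (all data hypothetical).

# Step 1: reference population sizes by age group (e.g. from a national census).
reference_pop = {"0-14": 5_600_000, "15-64": 23_000_000, "65+": 6_400_000}

# Step 2: age-specific death rates per 100,000 in the study population.
study_rates = {"0-14": 111, "15-64": 400, "65+": 4500}

# Step 3: expected deaths if the study population had the reference age structure:
# rate * reference population in that age group / 100,000, summed over groups.
expected_deaths = sum(
    study_rates[age] * reference_pop[age] / 100_000 for age in reference_pop
)

# Step 4: ASMR = total expected deaths / reference population size * 100,000.
asmr = expected_deaths / sum(reference_pop.values()) * 100_000
```

Seen this way, the ASMR is exactly the weighted average described above: each age-specific rate weighted by the reference population's share of that age group.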
- Indirect standardization
Indirect standardization offers a short cut when we do not know the age-specific mortality rates for the study population, or when the population is too small to calculate stable stratum-specific rates, as with the deaths at ages 0-14 in Goosefoot. Indirect standardization takes the weights from the study population (i.e., the sizes of its age strata) but takes the death rates for each age group from the standard population.
The standardized mortality ratio (SMR) is calculated in three steps:
- Obtain the age-specific death rates for the reference population (here, Canada 2011) = Deaths / population *100,000, for each age group.
- Multiply these rates by the number of people in each age group of each study population to calculate the expected number of deaths: Canadian age-specific death rate * population size of that age group in the study population / 100,000. These show the number of deaths that would occur in the study population if each age stratum had the same death rate as in the reference population. Then sum the expected deaths across age groups.
- Calculate the ratio between observed deaths and expected deaths in the study population: observed deaths / expected deaths * 100.
[Table omitted: observed and expected deaths for each study population. Step 3: SMR = (Observed deaths / Expected deaths) * 100.]
Again, the figures indicate that Goosefoot is relatively healthy compared to the Weenigo region. Note: Each SMR can be compared to the reference population (Goosefoot has a lower mortality than Canada overall which has a value of 100), but it can be misleading to compare between SMRs. This is because each SMR reflects the age-structure of that region and if these are very different, comparisons across SMRs may be biased.4, 5
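The three steps of indirect standardization can be sketched the same way; again, all figures below are hypothetical rather than the Goosefoot data:

```python
# Indirect standardization in three steps (all data hypothetical).

# Step 1: age-specific death rates per 100,000 in the reference population.
ref_rates = {"0-14": 20, "15-64": 300, "65+": 4000}

# Step 2: expected deaths = reference rates applied to the study
# population's own age strata, summed across groups.
study_pop = {"0-14": 800, "15-64": 2600, "65+": 1600}
expected = sum(ref_rates[age] * study_pop[age] / 100_000 for age in study_pop)

# Step 3: SMR = observed deaths / expected deaths * 100.
# An SMR above 100 means more deaths than expected from the reference rates.
observed_deaths = 75
smr = observed_deaths / expected * 100
```

Here the study population supplies the weights (its stratum sizes) and the reference population supplies the rates, which is the mirror image of direct standardization.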
Life expectancy at birth is an estimate of the expected number of years to be lived by a newborn based on current age-specific mortality rates. Life expectancy is a statistical abstraction—after all, we will have to wait a lifetime to find out how long babies born today will actually live (and many of us will not be around to collect the data!). Life expectancy forms a summary indicator of health that can be compared across countries. In Canada in 2021, life expectancy was 84.7 years for females and 80.6 years for males, placing us close to Australia and just behind Japan. The remarkable impact of social conditions on life expectancy is illustrated by the case of Russia during the 1990s: life expectancy can fall surprisingly quickly if conditions deteriorate, as shown in the Illustration box.
Life expectancy in post-Soviet Russia
The social disruption of post-Soviet Russia was reflected in rapidly rising death rates, so that between 1990 and 1994 the life expectancy for women declined from 74.4 years to 71.2, while that for men dropped by six years from 63.8 to 57.7. Cardiovascular disease and injuries together accounted for 65% of the decline.6
Source: http://en.wikipedia.org/wiki/File:Russian_male_and_female_life_expectancy.PNG (accessed July, 2010).
Dr. Rao ponders life expectancy
Dr. Rao ponders the curious figures for life expectancy in Goosefoot:
|Life expectancy at birth, men (years)
|Remaining life expectancy at 65, men (years)
|Life expectancy at birth, women (years)
|Remaining life expectancy at 65, women (years)
He knows, of course, that women live longer than men, and that Goosefoot men do not live as long as those in the region as a whole. But he is surprised that the men who survive to 65 then live as long as the women. “They must be tough old men!” he says to himself, and thinks of the miners he knows. Those who do not have chronic lung disease are, indeed, active outdoors people. “Perhaps if they do survive to that age they may live longer than men in the city,” he muses. The tendency for longer survival among the hardy few is termed the healthy survivor effect.
Setting priorities for disease prevention is based in part on the impact of each disease on the population. An obvious measure of impact is the number of deaths a disease causes. From Figure 6.2, this implies that cancers, heart disease and stroke are the diseases with the greatest impact. However, these conditions tend to kill people who are reaching the end of their expected life span, so preventing them might have little effect on extending overall life expectancy. Preventing premature deaths would add more years of life (and perhaps also more productive years) to individuals and to society. Premature death can be defined in terms of deaths occurring before the average potential life expectancy for a person of that sex, or it could be based on an arbitrary value, such as 75 years. In this case, a person who dies from a myocardial infarction at age 55 would lose 20 years of potential life, and such results could be summed across the population to indicate the impact (in terms of potential years of life lost) due to each cause. These values can be used to indicate the social impact of diseases in terms of the total Potential Years of Life Lost (PYLL) due to each. (You will sometimes see the abbreviation YPLL for “years of potential life lost”: same thing.)
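A PYLL calculation using the arbitrary 75-year cut-off mentioned above can be sketched directly; the ages at death are hypothetical:

```python
# Potential Years of Life Lost (PYLL) with an arbitrary cut-off of 75 years.
CUTOFF = 75
ages_at_death = [55, 80, 30, 72, 90]  # hypothetical deaths in one year

# Each death before the cut-off contributes (cut-off - age) years;
# deaths at or after the cut-off contribute nothing.
pyll = sum(max(0, CUTOFF - age) for age in ages_at_death)
# 55 -> 20 years lost, 30 -> 45, 72 -> 3; the deaths at 80 and 90 add 0.
```

Summing such contributions by cause of death is what shifts priorities toward conditions that kill young people, such as injuries and suicide.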
Prevention priorities based on the PYLL will differ from those based on simple mortality rates. Figure 6.5 shows that based on mortality rates, cancer, circulatory and respiratory diseases remain the familiar priorities as in Figure 6.2. However, they kill relatively late in life (although cancer less so) and using PYLL, injuries (unintentional and intentional) become more important than strokes or respiratory diseases. Indeed, taken together, suicides and unintentional injuries cause more years of potential life lost than circulatory disease.
In cohort studies and clinical trials, outcomes may be expressed as symptom-free survival (how long before symptoms return after a treatment) or survival (the time between diagnosis and death); hence the term SURVIVAL CURVE. Kaplan and Meier developed a widely used method for estimating survival curves, and the difference between two survival curves can be evaluated with statistical tests such as the log-rank test. In an extension, Cox's proportional hazards model compares survival curves while adjusting for other variables that may differ between the groups, such as age or disease severity; patients lost to follow-up are handled by censoring their observation times.
This kind of SURVIVAL ANALYSIS is common in the clinical literature, as it has a number of advantages. It gives a full picture of the clinical course of a disease in terms of survival rates at specified intervals after diagnosis and/or treatment. Figure 6.6 shows a hypothetical example. Although the outcomes of the two treatments after 48 weeks are similar, Treatment A enhances survival over the first few months after treatment.
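A minimal Kaplan-Meier (product-limit) estimator can be sketched in Python; the follow-up times and censoring pattern below are hypothetical:

```python
# Each tuple is (follow-up time in weeks, event), where event=1 marks a
# death and event=0 marks censoring (e.g. loss to follow-up). Hypothetical data.
data = [(5, 1), (8, 0), (12, 1), (12, 1), (20, 0), (25, 1), (30, 0)]

def kaplan_meier(observations):
    """Return [(event time, survival probability)] — the product-limit estimate."""
    survival = 1.0
    curve = []
    for t in sorted({t for t, event in observations if event == 1}):
        at_risk = sum(1 for ti, _ in observations if ti >= t)          # still being followed
        deaths = sum(1 for ti, e in observations if ti == t and e == 1)
        survival *= (1 - deaths / at_risk)  # multiply conditional survival at each event
        curve.append((t, survival))
    return curve

curve = kaplan_meier(data)
```

Censored individuals contribute to the at-risk denominator up to the time they leave observation, which is how survival analysis uses incomplete follow-up without discarding it.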
Survival curves and incidence rates
There are evident limitations to morbidity and mortality as indicators of healthiness. They only apply to serious conditions and are largely irrelevant to most middle-aged people. Furthermore, a diagnosis that becomes a morbidity statistic says little about a person’s actual level of function, and morbidity indicators cannot cover positive aspects of health. This led to the development of a range of more subjective indicators of health, termed health measurement scales or patient-reported outcomes. Measuring health, however, is inherently challenging, because health is an abstract concept. Unlike morbidity, health is not defined in terms of specific indicators that can be used as measurement metrics, such as blood pressure in hypertension or blood sugar in diabetes.
Measurement scales have been developed for most common diagnoses, and these are termed “disease-specific scales”. Some rate the severity of symptoms in a particular organ system (e.g., vision loss, breathlessness, limb weakness); others focus on a diagnosis, such as anxiety or depression scales. Other measurements are broader in scope, covering syndromes (emotional well-being scales) or overall health, and the broadest category of all: health-related quality of life. These broad scales are termed “generic scales”, as they can apply to any type of disease and to anyone; a common example is the Short-Form-36 Health Survey, a 36-item summary of functional health.1, pp649-65 A simpler example is the single question: “In general, would you say your health today is Excellent? Very good? Good? Fair? Poor?” This question shows remarkable agreement with much longer scales.1, pp581-7 On an even broader level, some measures seek to capture the well-being of populations, such as the Canadian Index of Wellbeing, in which health forms a significant component.
Measuring health in the clinic
There are three main applications for health measurement scales. Diagnostic instruments collect information from self-reports and clinical ratings, and process these using algorithms to suggest a diagnosis. There are many in psychiatry, such as the Composite International Diagnostic Interview.7 Prognostic measures include screening tests and sometimes information on risk factors, and these may be combined into one of many health risk appraisal systems that are available online. Evaluative measures record change in health status over time and are used to record the outcomes of care. This category forms the largest group of instruments and includes the generic and disease-specific outcome measures mentioned above.
Objective and subjective indicators
Health indicators can be recorded mechanically, as in a treadmill test, or they may derive from expert judgment, as in a physician’s assessment of a symptom. Alternatively, they may be recorded via self-report, as in a patient’s description of her pain. Mechanical measures collect data objectively in that they involve little or no judgment in the collection of information, although judgment may still be required in its subsequent interpretation. With subjective measures, human judgment (by clinician, patient, or both) is involved in the assessment and its interpretation. Subjective health measurements hold several advantages: they describe the quality rather than merely the quantity of function; they cover topics such as pain, suffering, and depression which cannot readily be recorded by physical measurements or laboratory tests; and subjective measures do not require invasive procedures or expensive equipment. The great majority of subjective health measures collect information via questionnaires: many have been extensively tested and are commonly used as outcome measures in clinical trials.1 Drug trials must now include quality of life scales, in addition to symptom- or disease-specific scales, in order to record possible adverse side effects of treatment, such as nausea, sleeplessness, etc.
Because objective and subjective indicators each have advantages, they are sometimes combined. For example, in deciding whether or not to undergo chemotherapy or surgery, a cancer patient will wish to balance the expected gain in life expectancy against a judgment of the quality of the prolonged life (considering side effects of treatment, pain, residual disability). At a societal level, this helps to address the question of whether extending life expectancy (e.g., by life-saving therapies) may also increase the number of disabled people in society. This possibility led to the development of combined mortality and quality of life indicators such as quality-adjusted life years (QALYs).
Quality-Adjusted Life Years (QALYs) extend the idea of life expectancy by incorporating an indicator of the quality of life among survivors. Rather than count every year of life lived as though they were equivalent, this statistic downgrades the value of years lived in a state of ill-health: these are counted as being worth less than a year of healthy life. In evaluating a therapy, QALYs count the average number of additional years of life gained from an intervention, multiplied by a judgment of the quality of life in each of those years. For example, a person might be placed on hypertension therapy for 30 years, which prolongs his life by 10 years but at a slightly reduced quality level, owing to dietary restrictions. A subjective weight is given to indicate the quality or utility of a year of life with that reduced quality (say, a value of 0.9 compared to a healthy year valued at 1.0). In addition, the need for continued drug therapy over the 30 years slightly reduces his quality of life by, say, 0.03. Hence, the QALYs gained from the therapy would be 10 years x 0.9 – 30 years x 0.03 = 8.1 quality-adjusted life years.
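The QALY arithmetic from the hypertension example can be written out directly:

```python
# QALYs gained in the hypertension therapy example above.
years_gained = 10                 # extra years of life from the therapy
utility_of_gained_years = 0.9     # each extra year valued at 0.9 of a healthy year
years_on_therapy = 30
utility_loss_from_therapy = 0.03  # small annual quality decrement from the therapy itself

qalys = (years_gained * utility_of_gained_years
         - years_on_therapy * utility_loss_from_therapy)
# 10 * 0.9 - 30 * 0.03 = 8.1 quality-adjusted life years
```

The structure makes the trade-off explicit: gains are discounted by the quality of the added years, and the burden of long-term treatment is subtracted.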
The numerical weights assigned to represent the severity of disabilities are known as utility scores (0.9 and 0.03 in the example above). Utilities capture the preferences of people for alternative health states, reflecting a judgment of the quality of life lived in each state. Utility scores range from 0 (death) to 1.0, which represents the best possible health state. Utilities are obtained via studies in which patients, professionals or members of the public use numerical rating methods to express their preferences for alternative outcomes, considering the severity of various levels of impairment. Common rating methods include the “standard gamble” and the “time trade-off”.
The standard gamble involves asking experimental subjects to choose between (i) living in the state being rated (which is less than ideal) for the rest of one’s life, versus (ii) taking a gamble on a treatment (such as surgery) that has the probability p of producing a cure, but also carries the risk 1-p of operative mortality. To record the perceived severity of the condition, the experimenter increases the risk of death until the person making the rating has no clear preference for option (i) or (ii). This shows how great a risk of operative mortality he or she would tolerate to avoid remaining in the condition described in the first option. In principle, the more severe the rater’s assessment of the condition, the greater the risk of dying in the operation (perhaps five, even ten percent) they would accept to escape the condition. This risk is used as an indicator of the perceived “dysutility” (i.e. severity) of living in that condition.
The time trade-off offers an alternative way to present the standard gamble. As before, it asks raters to imagine that they are suffering from the condition whose severity is to be rated. They are asked to choose between remaining in that state for the rest of their natural lifespan (e.g., 30 years for a 40 year-old person), or returning to perfect health for fewer years. The number of years of life expectancy they would sacrifice to regain full health indicates how severely they rate the condition. The utility for the person with 30 years of life expectancy would be given as Utility = (30 – Years traded)/30.
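The time trade-off formula is simple to compute. A sketch (the six years traded are a hypothetical illustration, not from the text):

```python
def tto_utility(life_expectancy, years_traded):
    """Time trade-off: utility of a health state, given how many of the
    remaining years of life expectancy a rater would give up in exchange
    for returning to perfect health."""
    return (life_expectancy - years_traded) / life_expectancy

# A 40-year-old with 30 years' life expectancy who would sacrifice
# 6 of those years to regain full health rates the condition at:
print(tto_utility(30, 6))  # 0.8
```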
The Canadian Health Utilities Index illustrates these utility scaling methods in use: being unable to see at all receives a utility score of 0.61, while being cognitively impaired, as with Alzheimer’s disease, receives a score of 0.42.8 Note that these utility judgments are subjective and may vary from population to population, offering an insight into cultural values.
Some measurement procedures allow patients themselves to supply the utility weights. This may be helpful for clinicians in helping a patient to decide whether to undergo a therapy that carries a risk of side-effects. An instrument of this type is the QTwiST, or Quality-Adjusted Time without Symptoms and Toxicity.1,pp 559-63
Disability-Adjusted Life Years (DALYs) and Health-Adjusted Life Years (HALYs) work in a very similar manner to QALYs. DALYs focus on the negative impact of disabilities in forming a weighting for adjusting life years, and HALYs base their valuation on the positive impact of good health. The approach of QALYs, DALYs, and HALYs can also be used to adjust estimates of life expectancy, taking account of quality of life, disability, and health respectively—the last giving rise to the acronym HALE, for Health Adjusted Life Expectancy.
All health indicators, measurements, and clinical tests contain some element of error. There are three chief sources of measurement error: in the thing being measured (my weight tends to fluctuate, so it’s difficult to get an accurate picture of it); in the observer (if you ask me my weight on a Monday, I may knock a pound off if I binged on my mother-in-law’s cooking over the weekend―obviously the extra pound doesn’t reflect my true weight!); or in the recording device (the clinic’s weigh scale has been acting up—we really should get it fixed).
As with sampling, both random and systematic errors may occur (see Chapter 5 on sampling errors). Random errors are like noise in the system: they have an inconsistent effect. If large numbers of observations are made, random errors should average to zero, because (being random) some readings overestimate and some underestimate. They can occur for lots of reasons: a buzzing mosquito distracted Dr. Rao when he took Julie’s blood pressure; you can’t really recall how bad your pain was last Tuesday, and so on. Random errors are detected by testing the RELIABILITY of a measurement.
Systematic errors in a measurement consistently distort the scores in a particular direction and are likely due to a specific cause. Errors that fall in one direction (I do tend to exaggerate my athletic prowess…) bias a measurement and reduce its VALIDITY. These distinctions are illustrated in Figure 6.7, using the metaphor of target shooting that was introduced in Chapter 5: a wide dispersion of bullets indicates unreliability, whereas off-centre shooting indicates bias or poor validity.
Reliability refers to dependability or consistency. Your patient, Jim, is unpredictable: sometimes he comes to his appointment on time and sometimes he’s late, but once or twice he was actually early. Jim is not very reliable. Jack, on the other hand, arrives exactly 10 minutes early every time. Even though he comes at the wrong time, Jack is reliable, or predictable. A reliable measure will be very reproducible, but it may still be wrong. If so, it is reliable but not valid (bottom left cell in Figure 6.7).
An introductory definition of validity is: “Does the test measure what we are intending to measure?” A slightly more wordy definition is: “How closely do the results of a measurement correspond to the true state of the phenomenon being measured?” A yet more abstract definition is: “What does a given score on this test mean?” This last interpretation of validity fits under a more general conception in terms of “What conclusions can I draw from these test results?” This is exactly what the clinician wants to know.
There is no single approach to estimating the validity of a measurement: the approach varies according to the purpose of the measurement and the sources of measurement error you wish to detect. In medicine, the commonest way to assess validity is to compare the measurement with a more extensive clinical or pathological examination of the patient. This is called criterion validation, because it compares the measurement to a full work-up that is considered a “gold standard” criterion. Criterion validation is typically used when the measurement offers a brief and simple way to assess the patient’s condition, and our question is: “How well does this simple method predict the results of a full and detailed (and also expensive, perhaps invasive) examination?” For example, a validity study of fecal occult blood testing as a screen for colon cancer might compare the test results to colonoscopy findings for a sample of people that includes some with and some without the disease.
Table 6.1 outlines a standard 2 x 2 table as the basis for calculating the criterion validity of a test. A population of N patients has been tested for a given disease with the new test (shown in the rows), and each person has also been given a full “gold standard” diagnostic work-up, shown in the columns. (This is the theoretical “gold standard” that we are assuming is correct; unfortunately in reality gold standards may not be as golden as we would like). Several statistics can be calculated to show the validity of the screening test.
Sensitivity summarizes how well the test detects disease. It is the probability that a person who has the disease will be identified by the test as having the disease. The term makes sense: if a test is sensitive to the disease, it can detect it. Using the notation in the table:
a / (a + c), or TP / (TP + FN), or Hits / (Hits + Misses)
The complement of sensitivity is the false negative rate, c / (a + c), which expresses the likelihood of missing cases of disease. A test with low sensitivity will produce a large number of false negative results.
Some mnemonics may help you: SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity = few false negatives). And, on a topic to be discussed later, low seNsitivity leads to a low Negative predictive value.
Specificity measures how well the test identifies those who do not have this disease:
d / (b + d), or TN / (TN + FP)
Specificity is the complement of the false positive rate, b / (b + d): the likelihood of people without the disease being mistakenly labelled as having it. Again, the term is intuitive: a specific screening test detects only the disease it is specifically designed to detect; hence, it will not give people with other conditions false positive scores. Specificity is clinically important because a false positive result can cause worry, lead to the expense of unnecessary further investigation, and perhaps prompt unnecessary interventions.
Some mnemonics to help you: SPecificity is inversely associated with the rate of false Positives. And low sPecificity leads to a low Positive predictive value.
Most diagnostic and screening tests present results on a numerical scale and a crucial point to recognize is that imperfect validity will cause scores on the test to overlap between those with, and those without the condition. This is due to natural biological variability and to random test errors, and is illustrated in Figure 6.8. To interpret test scores a cut-point is chosen to distinguish positive test results from negative. The use of a single cut-point means that it is exceedingly rare for a test to have both high sensitivity and high specificity. If the cut point in the diagram is moved to the right, specificity will increase, but sensitivity will fall, perhaps quite sharply, as more of the true cases (shown in red) are missed. The reverse is also true. The implications of this unfortunate dilemma will appear in the paragraphs that follow.
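The trade-off from moving the cut-point can be illustrated numerically. A sketch using hypothetical overlapping normal score distributions (all values purely illustrative; the normal CDF is built from `math.erf`):

```python
import math

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def se_sp(cut, mu_disease=60, mu_healthy=40, sigma=10):
    """Sensitivity and specificity at a given cut-point, assuming
    diseased people score higher on average than healthy people."""
    sensitivity = 1 - norm_cdf(cut, mu_disease, sigma)  # P(score > cut | disease)
    specificity = norm_cdf(cut, mu_healthy, sigma)      # P(score <= cut | no disease)
    return sensitivity, specificity

# Moving the cut-point to the right raises specificity but lowers sensitivity:
for cut in (45, 50, 55):
    se, sp = se_sp(cut)
    print(f"cut={cut}: sensitivity={se:.2f}, specificity={sp:.2f}")
```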
Note that the laboratory will report the patient’s score, along with a range of “normal values”. The clinician can choose a cut-point or threshold score for that patient, either to increase sensitivity or specificity for their purpose: this is discussed in ruling in and ruling out diagnoses, below.
Sensitivity and specificity are inherent properties of a test, and are useful in describing its expected performance. But when you apply the test to your patient you will not know whether their positive score is a true positive or perhaps a false positive (and the same for a negative score). In effect, we know which row of Table 6.1 the patient belongs in, but not which column. Therefore, we are more interested in what a negative or positive test result tells us about the patient: what is the LIKELIHOOD that a positive score indicates disease? For this, we use the predictive values.
The positive predictive value (PPV) shows what fraction of patients who receive a positive test result actually have the disease:
a / (a + b), or TP / (TP + FP)
You can see from Table 6.1 that a test with low specificity (i.e. lots of false positives, so b is large) will have a low PPV.
Correspondingly, the negative predictive value (NPV) shows how many people who receive a negative score really do not have the condition:
d / (c + d), or TN / (TN + FN)
If the test has low sensitivity, FN will be large, so its NPV will be reduced. This makes sense, as low sensitivity means the test will miss a lot of cases, so their negative scores may be misleading.
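All four statistics can be read off the 2 x 2 table together. A minimal sketch in Python (the cell counts and function name are illustrative, not taken from Table 6.1):

```python
def criterion_validity(tp, fp, fn, tn):
    """Criterion validity statistics from a 2 x 2 table
    (a = TP, b = FP, c = FN, d = TN in the notation of Table 6.1)."""
    return {
        "sensitivity": tp / (tp + fn),  # a / (a + c)
        "specificity": tn / (tn + fp),  # d / (b + d)
        "PPV":         tp / (tp + fp),  # a / (a + b)
        "NPV":         tn / (tn + fn),  # d / (c + d)
    }

# Hypothetical screening results: 90 TP, 50 FP, 10 FN, 850 TN
stats = criterion_validity(tp=90, fp=50, fn=10, tn=850)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```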
Predictive values and prevalence
In a further complication of interpreting test scores, predictive values vary according to the prevalence of the disease among the patients to whom you are administering the test. Clinicians must bear this in mind when interpreting a test result: you must treat the patient, not the test result! The reasons are illustrated in Figure 6.9, which contrasts the performance of the same test in high and low prevalence settings.
Note that the sensitivity and specificity of the test remain the same in both settings (they are properties of the test), but the predictive value of a positive score is very different. This is simply because there are many fewer cases to be identified and many more non-cases in the primary care setting. But tests are often validated in hospital settings, where the prevalence of the disease being tested for is high, similar to that shown in the left panel of the figure. However, the test may then be used in primary care settings, where the disease prevalence is lower, as shown in the right panel. So, unless the specificity is extremely high, the number of false positives in a primary care setting can exceed the true positives, as in the example. At the same time, lower prevalence means that a negative test result is more accurate: you can reassure your primary care patient with a negative score that he is very unlikely to have the disease (you will, of course, remind him to come back for re-evaluation if his symptoms continue: he may be one of the few with a false negative result).
In summary, interpreting test results requires insight into the population on which you are applying the test. Beware of applying screening or diagnostic tests in low-prevalence settings: you may find many false positive results. For instance, in general population breast cancer screening programmes, the positive predictive value of a positive mammogram is only around 10%, so for every 100 women who are recalled for further investigation after an abnormal screening mammogram, 90 will not have cancer.9 One strategy to improve the positive predictive value of a test is to change from screening everyone (universal screening) to screening selectively. For example, only test people at high risk of the condition—those with risk factors, a family history, or positive symptoms, among whom the disease will be more prevalent.
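The effect of prevalence on the positive predictive value can be verified directly with Bayes’ theorem. A sketch, using a hypothetical test with sensitivity and specificity both 0.91 (the values of Figure 6.9):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem: the share of
    positive results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test, two settings: hospital (33% prevalence) vs primary care (3%).
print(round(ppv(0.91, 0.91, 0.33), 2))  # ≈ 0.83
print(round(ppv(0.91, 0.91, 0.03), 2))  # ≈ 0.24
```

The test itself is unchanged, yet most positive results in the low-prevalence setting are false positives.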
Taking this one step further, the more tests you administer to a patient, as in an annual physical exam, the more likely you are to get a false positive (and therefore misleading) score on one or more of the tests. This leads to a lot of unnecessary further work-up. Hence, tests should be chosen carefully and applied in a logical sequence to rule in, or rule out, a specific diagnosis that you have in mind. Each test should be chosen based on the conclusion you drew from the result of the previous test. Remember that unnecessary tests are not only costly to the health care system, but also ethically questionable if they expose a patient to unnecessary risks such as radiation from an X-ray or the need for further investigation after a false positive result.
Finally, we may link the themes of random versus systematic error to the sensitivity and specificity of a test. In some tests, imperfect sensitivity may be due to a systematic error, such as choosing the wrong threshold score for a particular type of patient. For example, desirable levels of cholesterol and body weight differ for people with diabetes compared to non-diabetics. In other cases, reduced sensitivity may be due to random errors. If the error is random and the test is repeated, the diseased and non-diseased may be reclassified.10 In this instance it might be appropriate to repeat slightly elevated tests to confirm that the value is stable (see Nerd’s Corner “Regression to the mean”). The decision to label someone must be made taking the whole picture into account, not just a single test result. There is an inherent tension between wishing to intervene early (when treatment may prevent further deterioration) and falsely labelling a person as a patient. Such balances represent part of the art of medicine.
Regression to the mean
Clinicians frequently use test results to rule in, or to rule out, a possible diagnosis, but the logic of this is often misunderstood. To rule a diagnosis in, you need a test with high specificity; to rule a diagnosis out, you need a test with high sensitivity. At first this sounds counter-intuitive, so let’s explore it further, beginning with ruling a diagnosis out. A perfectly sensitive test will identify all cases of a disease, so if you get a negative result on a sensitive test, you can be confident that the patient does not have this disease. The mnemonic is “SnNout”: Sensitive test, Negative result rules out. Conversely, to rule a diagnosis in, you need a positive result on a specific test, because a specific test identifies only this type of disease. The mnemonic is “SpPin”: Specific test, Positive result rules in. Note, unfortunately, that if the patient gets a negative score on this test, they may still have the disease: the test, being specific, may not be very sensitive, so the result may be a false negative.
Linking back to the strategy for applying multiple tests, if the goal is to rule alternative diagnoses out, then several tests with cut-points chosen to increase sensitivity can be run simultaneously, to increase the chances of detecting rival diagnoses. If the goal is to rule a diagnosis in, specific tests can be administered serially, stopping when a positive result is obtained. And permutations can be used: for example, HIV can be tested first using a sensitive (but not very specific) serological test. This will rule out those without HIV and find the true positives, but along with many false positives. Therefore the positives are re-tested using a specific test (e.g., Western blot) that will exclude the false positives. The use of tests to enhance the likelihood of a diagnosis introduces the idea of likelihood ratios.
Head SpPinny logic
You may occasionally hear someone argue that you need a sensitive test to rule a diagnosis in, but this is false.
The reason is that a sensitive test will indeed identify most of the true cases of the disease, but setting the cutting score to increase sensitivity will generally reduce specificity (look back at Figure 6.8). This means that there may be a number of false positives mixed in, so the sensitive test cannot rule the disease in.
A likelihood ratio combines sensitivity and specificity into a single figure that indicates by how much knowing the test result will reduce your uncertainty in making a particular diagnosis. This is an application of Bayesian logic, which here estimates how far to update one’s confidence in a diagnosis in the light of new information. The likelihood ratio is the probability that a given test result would occur in a person with the target disorder, divided by the probability that the same result would occur in a person without the disorder.
A positive likelihood ratio (or LR+) indicates how much more likely a person with the disease is to have a positive test result than a person without the disease.
LR+ = sensitivity ÷ (1 – specificity),
or the ratio of true positives to false positives. Using the terms in Table 6.1, the formula is
a/(a + c) ÷ b/(b + d).
The LR+ indicates the impact of a positive test result on your estimate of whether the person does have the disease:
Post-test odds = Pre-test odds x Likelihood ratio.
A negative likelihood ratio (or LR–) indicates how much less likely a person with the disease is to have a negative test result than a person without the disease: it is the probability of a negative result in a person with the disease, divided by the probability of a negative result in a person without the disease.
LR– = (1-sensitivity) ÷ specificity,
or the ratio of false negatives to true negatives. Referring to Table 6.1, the formula is
c/(a+c) ÷ d/(b+d).
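Both likelihood ratios follow directly from sensitivity and specificity. A sketch, using the values of the Figure 6.9 example (sensitivity and specificity both 0.91):

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a test."""
    lr_pos = sensitivity / (1 - specificity)  # TP rate / FP rate
    lr_neg = (1 - sensitivity) / specificity  # FN rate / TN rate
    return lr_pos, lr_neg

lr_pos, lr_neg = likelihood_ratios(0.91, 0.91)
print(round(lr_pos, 1), round(lr_neg, 3))  # 10.1 0.099
```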
In other terms, the LR+ expresses how much a positive test result increases the ODDS that a patient has the disease; an LR– indicates how much a negative test decreases the odds of having it. LR+ and LR- can be regarded as fixed attributes of the test.
In applying this to a clinical situation, the use of a nomogram (Figure 6.10) removes the need for calculation. It is necessary to begin from an initial estimate of the patient’s likelihood of having the disease, called the pretest probability; this estimate will later be adjusted according to the test result. The pre-test probability can be difficult to estimate (see Nerd’s corner: Pretest probability), but a first guess is based on the prevalence of the condition in the setting where you are practising. This can then be modified upwards or downwards by your initial clinical impression and history-taking for this patient. The scale on the left of the nomogram shows the pre-test probability of disease; the central column shows the likelihood ratios, and the right-hand scale shows the post-test probability. To use the nomogram, you need to know (or calculate, using the formulae above) the likelihood ratio for the test. If the test result is positive, draw a straight line on the nomogram from the pre-test probability through the likelihood ratio; where this line cuts the right-hand scale indicates the post-test probability that the patient has the disease. Post-test probability means the likelihood that the patient has this disease, taking into account the initial probability and the test score. Of course, if from your initial observations you feel virtually certain the patient has the disease, a positive score on the test will not add very much new information; but a negative score would.
Applying this to the example in Figure 6.9, test sensitivity and specificity are both 0.91, so the LR+ is 0.91/(1-0.91) = 10.1. In general, tests with positive LRs higher than about 5 are useful in ruling in a disease. Draw a line through the pre-test probability on the left of the diagram, through 10.1 in the central column, and then read off the post-test probability on the right-hand column. For the hospital setting the prevalence was 33%, while it was 3% for the primary care setting; these values give a rough estimate of pre-test probability. So, in the hospital setting, a patient with a positive test result has a post-test probability of having the disease of over 80% (green line), whereas in the primary care setting it would be around 20% (red line). In both instances, the test result has substantially increased the clinical probability of the patient having the disease. In the hospital setting, you are now pretty certain of the diagnosis. In the primary care setting, as long as the situation is not urgent, you might want to increase your certainty by further investigation before launching into possibly unnecessary treatment.
The estimation of pretest probability reminds us of the distinction introduced in Chapter 2 between health determinants and risk factors. Determinants set the incidence rate in a population, and incidence offers a first approximation to the pretest probability of disease for an individual from that population.
We cannot apply population data directly to the individual, who is unique and therefore unlikely to match the population average perfectly. However, taking account of individual risk factors modifies the crude estimate of pretest probability upwards or downwards. For example, incidence for 35 year old males may be x %, but you see a man of this age who is overweight and smokes, so his risk might be estimated at 2x %. Furthermore, the pattern of signs and symptoms could raise the risk or likelihood ratio even more, perhaps to 4x %.
In practice, assessment of risk is often imprecise, described simply as “high” or “low” clinical suspicion, and a diagnostic test is applied to confirm or rule out the putative diagnosis. However, a positive confirmatory test often indicates only a higher level of probability, albeit a level at which doubt must be suspended until evidence is found to the contrary. In a similar way, when a test has ruled out the disease the clinician can inform the patient that he is at very low risk, but should nonetheless report worrying symptoms. Clinicians must always be ready to review and revise their diagnoses.
Turning to negative results, the likelihood ratio (LR-) gives a result below 1: values smaller than 0.2 or so are useful in ruling out a disease. In our example, the LR- is 0.099. In the hospital setting, a patient who receives a negative test result would have a post-test probability of having the disease of around 4% (down from 33% before the test was administered). The primary care patient with a negative test would have a post-test probability of having the disease of about 0.2% — he almost certainly does not have it (a 1 in 500 chance).
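The nomogram is a graphical shortcut for the odds arithmetic; the same post-test probabilities can be computed directly. A sketch (exact calculation, so values may differ slightly from readings taken off the nomogram):

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio,
    and convert back: the calculation the nomogram performs."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Se = Sp = 0.91, so LR+ = 10.1 and LR- = 0.099 (from the text).
for prevalence in (0.33, 0.03):  # hospital vs primary care pre-test probability
    print(f"pre-test {prevalence:.2f}: "
          f"positive result -> {post_test_probability(prevalence, 10.1):.2f}, "
          f"negative result -> {post_test_probability(prevalence, 0.099):.3f}")
```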
Clinicians must make binary decisions: to prescribe a treatment or not, to operate or not, or to reassure the patient that he does not have a disease. However, most biological measurements do not provide binary categories, but instead produce a continuous range of values, as with blood pressure, blood cholesterol, glucose, creatinine or bone density. Hence, a cut-point on each of these scales has to be chosen to separate the “normal” from the “abnormal” results. Even with qualitative assessments, such as X-rays or histology slides, decisions must be made among a range of findings, which include grey areas between definitely abnormal and clearly normal.
Defining normal is not as simple as it might seem. Superficially, it is defined in terms of the average, or most common, presentation for a person of that demographic category. But this does not necessarily imply that it is healthy: on average, Canadians are overweight. In addition, abnormality occurs at both ends of a continuum—being underweight and being overweight are both unhealthy. Therefore, in place of the average, normal could be defined in terms of a range, perhaps defined by percentiles or by standard deviations, on the continuum being measured, such as body weight. This seems to move the notion of normal closer to healthy, but setting the margins of the distribution is challenging. We cannot justify defining the normal range in terms of, say, within two standard deviations above or below the mean, as this would vary from measure to measure.
A more promising approach returns us to the theme of evidence-based medicine and defines normal in terms of a range of scores above or below which treatment would be beneficial: abnormal is the threshold beyond which a person would benefit from treatment. This idea corresponds to an approach to defining need for care, described in Chapter 7. It implies that innovations in treatments would modify the range of what is considered normal. For example, new therapies treat cognitive loss earlier than before, so new layers of cognitive impairment are being defined among people who would previously have been considered normal for their age, described by that wonderful phrase “benign senescent forgetfulness”. In a similar way, cut-points for defining hypertension have changed. In 2003, pre-hypertension was defined as a systolic blood pressure of 120-139 mmHg or a diastolic pressure of 80-89 mmHg.12 By altering cut-points, more people are classified as having the disease and, therefore, become eligible for treatment. The reason for altering the cut-points has usually been because a clinical trial has shown new treatments to achieve better outcomes for this group of patients, although the improvement may be small. Not surprisingly, such advances find favour with the drug companies that make and sell the treatments.
1. What are the leading causes of death in Canada? How does the ranking of these causes change if you were to use Potential Years of Life Lost?
1. McDowell I. Measuring health: a guide to rating scales and questionnaires. New York (NY): Oxford University Press; 2006.
2. Morgenstern H. Ecologic studies in epidemiology: concepts, principles, and methods. Annu Rev Public Health. 1995;16:61-81.
3. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. New York (NY): Lippincott, Williams & Wilkins; 2008.
4. Goldman DA, Brender JD. Are standardized mortality ratios valid for public health data analysis? Stat Med. 2000;19:1081-8.
5. Court B, Cheng K. Pros and cons of standardized mortality ratios. Lancet. 1995;346:1432.
6. Notzon FC, Komarov YM, Ermakov SP, Sempos DT, Marks JS, Sempos EV. Causes of declining life expectancy in Russia. JAMA. 1998;279(10):793-800.
7. World Health Organization. The Composite International Diagnostic Interview, version 1.1: researcher’s manual. Geneva: WHO; 1994.
8. McMaster University. Health Utilities Group: Health Utilities Index and quality of life. 2000. Available from: http://www.fhs.mcmaster.ca/hug/index.htm.
9. Wright CJ, Mueller CB. Screening mammography and public health policy: the need for perspective. Lancet. 1995;346(8966):29-32.
10. Vickers AJ, Basch E, Kattan MW. Against diagnosis. Ann Intern Med. 2008;149(3):200-3.
11. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Philadelphia (PA): Lippincott, Williams & Wilkins; 1991.
12. Chobanian AV, Bakris GL, Black HR, et al. The seventh report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure: the JNC 7 report. JAMA. 2003;289:2560-72.