Chapter 5 Assessing Evidence and Information

After completing this chapter, the reader will be able to:

    1. Evaluate sources of data by applying methods of critical appraisal in order to practise evidence-based medicine;
    2. Describe the major categories of research data, comparing the strengths of qualitative and quantitative approaches;
    3. Describe the strengths and limitations of the major categories of study designs:
      experimental designs and observational designs;
    4. Demonstrate an ability to critically appraise research findings, with particular reference to
      1. characteristics of study designs (randomized controlled trial, cohort, case-control, cross-sectional), and
      2. potential study errors (bias, confounding) (MCC 78-2);
    5. Describe criteria for assessing causation (78-2);
    6. Discuss different measures of association, including relative risk, odds ratios, attributable risk, and number needed to treat (78-2);
    7. Discuss the logic of statistical analysis:
      •  Study sampling 
      •  Measures of central tendency
      •  Inferential statistics
      •  Significance of differences
    8. Describe possible sources of error in studies:
      •  Sampling errors
      •  Measurement errors
      •  Objectivity of the researcher 
    9. Explain the hierarchy of quality of research evidence for evidence-based medicine:
      •  Systematic reviews
      •  Meta-analyses
      •  Cochrane Collaboration
    10. Apply results to your patients;
    11. Review the limits to evidence-based medicine.

These topics link to the Medical Council exam objectives, chiefly section 78-2.

Note: The colored boxes contain optional additional information; click on a box to open it and to close it again.
Words in CAPITALS are defined in the Glossary

Magnets and menopause

Julie Richards worries about her menopause. She gets hot flashes and feels generally tired. She fears getting old, including the risk of osteoporosis and cancers. She mentioned this to her daughter Audrey, who found information about hormone therapy, calcium supplements and evening primrose oil on the internet. She also read about physical exercise as a way of improving well-being. Julie Richards shows Dr. Rao the information her daughter found. In particular, she wants to know if a magnet will help her symptoms of menopause. She read about it on the web and shows the printout to Dr. Rao. The website gives information about menopause and cites some peer-reviewed articles that suggest that static magnets are effective in the treatment of dysmenorrhea.

Later that day, Dr. Rao uses Medline and other sources to check this out. He finds that the author of Julie’s article runs a private clinic specializing in menopause problems. Dr. Rao also finds a number of articles on magnets in pain management. There is a systematic review of the evidence that concludes that magnets might be minimally effective in osteoarthritic pain, but are of no demonstrated value in other types of pain. Promoters of the magnets say that their mechanism of action is either direct interference with nerve conduction, or action on small vessels to increase blood flow.

Assessing Medical Information

People have claimed the ability to cure ills since the dawn of time. Some cures are based on science, in that their mode of action and the principles underlying them are known. Some are known empirically to improve health, but we do not fully understand how they work. Other treatments have been shown to have no benefit, while many have never been rigorously tested. Finally, some have been shown to have only a placebo effect, a benefit achieved by suggestion rather than by direct chemical action.

In 2022, MEDLINE indexed almost 1.4 million articles on medical research.1 To guide medical practice, various agencies now review these publications and propose clinical guidelines based on the assembled evidence. This supports the practice of evidence-based medicine. However, we don’t have guidelines for every condition, so clinicians need to understand the basics of how to review and evaluate medical research articles. This is complicated by the discovery that many of the research results do not agree.2 There are various reasons for this, ranging from differences in the study design, to the perspective of the investigator, to unique characteristics of the people in the study, or the way the results were analyzed. No study is perfect, yet the ideal of practising medicine based on evidence demands that clinicians base their decisions on the best available scientific evidence. Because some evidence is flawed, clinicians must be able to judge the validity of published information, and this forms the theme of critical appraisal of the literature. This chapter concerns learning how to think, rather than what to think.

Critical appraisal

Critical appraisal refers to judging the validity of the procedures used in a study to collect data, identifying possible biases that may have arisen, assessing the adequacy of the analysis and completeness of reporting, evaluating the conclusions drawn, and reviewing the study’s compliance with ethical standards of research. Checklists help guide the critical appraisal process, but ultimately, clinicians must use their judgement to assess the quality of a study and its relevance to their particular clinical question.

The first step in critical appraisal is the application of common sense. This was summarised in 1990 as FiLCHeRS, which stands for Falsifiability, Logic, Comprehensiveness, Honesty, Replicability, and Sufficiency.3

Table 5.1: Standards for evaluating information quality under the acronym FiLCHeRS

Falsifiability | For a conclusion to be based on evidence (rather than belief), it must be possible to conceive of evidence that would prove the claim false. For example, it would be possible to show that magnets do not reduce menopausal symptoms, but there is no logical way of proving that God does not exist.
Logic | Arguments in support of a claim must be logically coherent (one cannot claim a biological effect of the magnets based on the relief felt by people who use them).
Comprehensiveness | The evidence offered in support of any claim must be exhaustive: all of the available evidence must be considered; one cannot simply ignore evidence to the contrary.
Honesty | The evidence offered in support of any claim must be evaluated with an open mind and without self-deception.
Replicability | It must be possible for subsequent experiments or trials to obtain similar results.
Sufficiency | The evidence offered in support of any claim must be adequate to establish the truth of that claim, with these stipulations:
– the burden of proof for any claim rests on the claimant
– extraordinary claims demand extraordinary evidence, and
– evidence based on authority or testimony is inadequate for most claims, especially those which seem unlikely to be true.

Evidence-based medicine

Evidence-based medicine (EBM) refers to “the consistent use of current best evidence derived from published clinical and epidemiologic research in management of patients, with attention to the balance of risks and benefits of diagnostic tests and alternative treatment regimens, taking account of each patient’s unique circumstances, including baseline risk, co-morbid conditions and personal preferences”.4

In practice, EBM means integrating one’s clinical experience with the best available external clinical evidence from systematic research. The approach was developed in Canada by Dr. David Sackett and colleagues at McMaster University during the 1970s, and is now recognized as a key foundation of medical practice.5 Sackett described evidence-based medicine as (1) the process of finding relevant information in medical literature to address a specific clinical problem, (2) the application of simple rules of science and common sense to determine the validity of the information, and (3) the application of the information to answer a clearly formulated clinical question. The aim is to ensure that patient care is based on evidence derived from the best available studies. Sackett argued that the “art of medicine” lies in taking the results of several sources of evidence and interpreting them for the benefit of this particular patient: the opposite of what he called “cookbook medicine” which follows a fixed recipe. The approach has subsequently been applied beyond clinical medicine to propose, for example, evidence-based public health and evidence-based policy making.

Mnemonic: The “5 A”s of evidence-based medicine

Here is a sequence that a clinician may follow in applying evidence-based medicine in deciding how to handle a challenging clinical case:

  1. Assess: Recognize and prioritize this patient’s problems.
  2. Ask: Construct clinical questions that facilitate efficient searching for evidence on managing the condition in the literature. Questions usually follow the PICO format: describe the Patient; describe the Intervention being considered; with what will it be Compared, and what is the Outcome being sought? For example: “Will a 48-year-old peri-menopausal woman who wears a magnetic wrist bracelet experience fewer night sweats compared to a similar woman who does not wear such a magnet?”
  3. Acquire: Gather evidence from quality sources. Librarians are helpful at this stage.
  4. Appraise: Evaluate the evidence for its validity, importance, and usefulness (in particular, does it answer your PICO question?)
  5. Apply: Apply to the patient, taking account of their preferences and values and of the clinical circumstances.


Types of error in studies

Critical appraisal judges whether shortcomings in a study design or execution could have produced misleading results. Researchers obviously try to eliminate potential errors in their studies, but this can be exceedingly difficult and always increases the study cost. Potential errors are of two broad types: bias (or systematic distortions in the reported results) and random errors. In addition, studies of causation have to address confounding, a challenge in interpreting statistical associations. These types of error are explained in detail later in this chapter, but brief definitions may help the reader at this point: see the Definitions box.

Types of error

Error: “A false or mistaken result obtained in a study or experiment.”1 We may distinguish between random and systematic errors in the way data are collected:

Random error: deviations from the truth that may either inflate or reduce an estimate derived from a measurement (or from a study sample). These errors are generally assumed to be due to chance and, if the sample is large, to have little effect on distorting the overall results. Statistics, such as the confidence interval, estimate the magnitude of random errors (see Sampling and chance error below).
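As a minimal sketch of this idea, the simulation below draws samples of increasing size and reports an approximate 95% confidence interval for each. The true mean (37.0), the standard deviation (0.5) and the function name are assumptions invented for illustration, and the interval uses a simple normal approximation.

```python
import random
import statistics

def sample_mean_with_ci(n, true_mean=37.0, sd=0.5, seed=1):
    """Simulate n measurements subject to random error and return
    (mean, lower, upper) for an approximate 95% confidence interval."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = [rng.gauss(true_mean, sd) for _ in range(n)]
    mean = statistics.mean(values)
    # Standard error of the mean: sample SD divided by the square root of n
    standard_error = statistics.stdev(values) / n ** 0.5
    margin = 1.96 * standard_error  # normal approximation
    return mean, mean - margin, mean + margin

for n in (10, 100, 1000):
    mean, low, high = sample_mean_with_ci(n)
    print(f"n={n:5d}  mean={mean:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

The interval narrows as the sample grows, which is the sense in which random error has little effect on the overall result in a large sample.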

Systematic error, or bias: a deviation of results or inferences from the truth that produces a consistent exaggeration or an underestimate of effect. These errors may arise from defects in the study design, including the sampling (“selection bias”), or may arise from faulty measurement procedures (“information bias”).

Confounding: a challenge in interpreting the results of a study in which the effects of two processes are not distinguished from each other. Consider an (imaginary) study that discovers higher rates of cancer among conservative voters. Deeper analysis shows that this association arose because older people are both more likely to get cancer and more likely to vote conservative. Here age is a confounding factor that creates an apparent (or “spurious”) association. If the results are adjusted for age (for example by comparing voting and cancer diagnosis in narrow age-groups) the association is likely to disappear. More detail is given in the section on confounding, below.
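The imaginary voting study can be sketched with numbers. All counts below are invented for illustration, and the comparison uses the relative risk (the risk of cancer in one group divided by the risk in the other). The crude comparison pools all ages; the stratified comparison looks within each age group separately.

```python
# Hypothetical counts, invented for illustration: (people, cancer cases)
data = {
    "younger": {"conservative": (200, 4), "other": (800, 16)},
    "older": {"conservative": (800, 80), "other": (200, 20)},
}

def risk(people, cases):
    """Proportion of people who developed cancer."""
    return cases / people

def relative_risk(exposed, unexposed):
    """Risk in the 'exposed' group divided by risk in the comparison group."""
    return risk(*exposed) / risk(*unexposed)

def pooled(group):
    """Add up people and cases for one voting group across all age strata."""
    people = sum(data[age][group][0] for age in data)
    cases = sum(data[age][group][1] for age in data)
    return people, cases

crude_rr = relative_risk(pooled("conservative"), pooled("other"))
print(f"Crude relative risk: {crude_rr:.2f}")

for age, groups in data.items():
    rr = relative_risk(groups["conservative"], groups["other"])
    print(f"Relative risk among {age} voters: {rr:.2f}")
```

With these invented counts the crude relative risk is about 2.3, yet within each age group it is 1.0: the apparent association is produced entirely by the different age mix of the two voting groups.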

Errors of both types can arise in any of the main types of study design, which were created to address different types of research question.

Appraising Scientific Evidence: Qualitative and Quantitative Research

The practitioner of evidence-based medicine must balance scientific evidence derived from studying groups of people against the unique characteristics of this particular patient: blending the science and art of medicine. Similarly, scientific evidence can be drawn from a combination of quantitative and qualitative research. QUALITATIVE RESEARCH (see Glossary) uses non-numerical observations to offer detailed insight into individual cases, and can often address “Why?” questions such as “Why is this patient not adhering to his treatment?” (Optional further detail is in the Here Be Dragons box.) QUANTITATIVE RESEARCH methods use data that can be counted or converted into numerical form, and generally address “How?” questions (How effective is this treatment, compared to a placebo?). Table 5.2 summarizes the different purposes of each approach, which most researchers view as complementary, leading to a “mixed methods” approach.

Qualitative variables, or qualitative studies?

Quantitative studies often examine qualitative variables. For example, a study of patient satisfaction might include a question such as: “How satisfied were you with the care you received?” The researcher might use an answer scale that allows the response to be crudely quantified, perhaps through a series of statements: very satisfied, satisfied, unsatisfied, or very unsatisfied. These could be scored 1, 2, 3 or 4 and the researcher could report the mode or the median score. The study, although measuring the quality of something, expresses the results as numbers and is, therefore, generally considered a quantitative study.

Meanwhile, a qualitative study of satisfaction might involve a focus group of patients where a group facilitator asks them about satisfaction with care but allows the participants to talk about what they consider important to their satisfaction, and then asks follow-up questions to explore their responses in depth. The information produced is then assembled and analyzed to identify common themes and sub-themes arising from the focus group discussion.

Table 5.2: Comparison of qualitative and quantitative research methods

Qualitative research | Quantitative research
Studies particular instances | Identifies general principles underlying observed phenomena
Generates hypotheses | States and tests hypotheses
Is generally inductive (works from the particular instance to draw a general conclusion) | Is generally deductive (refers to a general theory to generate a particular explanation)
Captures detailed, contextual information from a small number of participants | Provides numeric summaries of frequency, severity, or correlations from a large number of participants
Focuses on studying the range of ideas; sampling aims to provide a representative coverage of ideas or concepts | Focuses on studying the range of people; sampling provides representative coverage of people in a population
Answers “Why?” and “What does it mean?” questions | Answers “What?”, “How much?” or “How many?” questions
Example of a study question: “What is the experience of being treated for breast cancer?” | Example of a study question: “Does this treatment for breast cancer reduce mortality and improve quality of life?”

When numbers do not measure up

In quantitative studies, numbers can be used to categorize responses to qualitative questions, such as “How satisfied were you?” [Answer 1 = very unsatisfied to 4 = very satisfied]. Beware: these numbers are arbitrary, and we cannot claim that they represent an even gradient of satisfaction. In technical jargon, these are “ordinal” numbers (equivalent to house numbers along a street); the change in satisfaction between each number is not necessarily equal (see SCALES OF MEASUREMENT in the Glossary). Accordingly, such data have to be analysed using nonparametric statistical methods – for example, using a median rather than a mean (see PARAMETRIC in the Glossary).

By contrast, measuring body temperature forms an “interval” measure, in which the amount of change in temperature is equal across successive numbers on the scale. Data from such measurements can be analysed using parametric statistics: mean values can legitimately be calculated.
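A short sketch may make the point about ordinal numbers concrete. The survey answers and the alternative coding below are invented for illustration. Because the codes are arbitrary, any other order-preserving set of codes is equally valid; the median category survives such a recoding, whereas the mean changes with it.

```python
import statistics

# Codes follow the example above: 1 = very unsatisfied ... 4 = very satisfied
LABELS = {1: "very unsatisfied", 2: "unsatisfied", 3: "satisfied", 4: "very satisfied"}
responses = [1, 2, 2, 3, 3, 3, 3, 4, 4]  # hypothetical survey answers

# A different, equally arbitrary coding that keeps the same order of categories
RECODE = {1: 1, 2: 2, 3: 3, 4: 10}
recoded = [RECODE[r] for r in responses]
RECODED_LABELS = {RECODE[code]: label for code, label in LABELS.items()}

# median_low always returns a value that occurs in the data,
# so it can be mapped back to a category label.
median_label = LABELS[statistics.median_low(responses)]
median_label_recoded = RECODED_LABELS[statistics.median_low(recoded)]

print("Median category, original codes:", median_label)
print("Median category, recoded:       ", median_label_recoded)
print(f"Mean, original codes: {statistics.mean(responses):.2f}")
print(f"Mean, recoded:        {statistics.mean(recoded):.2f}")
```

The median category is the same under both codings, while the mean depends on which arbitrary numbers were chosen. This is why a mean is hard to interpret for ordinal data.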

Qualitative research

Qualitative research “employs non-numeric information to explore individual or group characteristics, producing findings not arrived at by statistical procedures or other quantitative means. Examples of the types of qualitative research include clinical case studies, narrative studies of behaviour, ethnography, and organizational or social studies.”6 Applied to public or population health, qualitative methods are valuable in analyzing details of human behaviour. Beyond merely recording facts (did this person obtain an influenza immunization?), qualitative research delves into motivation and personal narratives that offer insights into why.

Qualitative researchers focus on subjective experiences and reject the positivist idea that there exists an objective reality waiting to be discovered. They argue that human experience can be interpreted in many ways, depending on the perspective of the observer, and that recording experience or motivation can only be partially objective. Qualitative methods are inductive and flexible, allowing interpretations to emerge from the data rather than from a pre-selected theoretical perspective. Just as successive historians may re-interpret historical events, our understanding of diseases and therapies changes with new discoveries.

Like quantitative research, qualitative studies can be pure or applied, but with greater emphasis on the applied – explaining a particular situation. Qualitative data collection methods may be grouped into in-depth interviewing, participant observation, and focus groups (see the Nerd’s Corner box). The data may take the form of words, pictures or sounds – once described as “any data that is not represented by ordinal values.”7

Types of qualitative study

Qualitative method | Type of question | Data source | Analytic technique
Phenomenology | Questions about meaning or questions about the essence of phenomena, or about experiences (e.g., What do Chinese families mean by “harmony”? What is the experience of a deaf child at school?) | Primary: audiotaped, in-depth conversation. Secondary: poetry, art, films. | Theming and phenomenological reflection; memoing (making interpretive notes), and reflective writing.
Ethnography | Observational questions (e.g., How do surgical team members work together in the OR?) and descriptive questions about values, beliefs and practices of a cultural group (e.g., How does this group view menopause?) | Primary: participant observation; field notes; structured or unstructured interviews. Secondary: documents, focus groups. | Thick description, re-reading notes and coding by topic; narrating the story; case analysis; diagramming to show patterns and processes.
Grounded theory | Process questions about how the experience has changed over time or about its stages and phases (e.g., How do medical residents cope with fatigue?) or understanding questions (e.g., How did they learn these techniques?) | Primary: audiotaped interviews; observations. Secondary: personal experience. | Theoretical sensitivity; developing concepts for theory generation. Focussed memoing; diagramming; emphasis on search for core concepts and processes.

Source: Adapted from Richards et al.8

Judging the quality of qualitative research

In judging qualitative research you should consider questions such as:

  1. Was the design phase of the project rigorous?

Consider the skill and knowledge of the researcher and the completeness of the literature review. Is the research question clear and suited to a qualitative analysis? The researcher should state the perspective from which the data were gathered and analyzed.

  2. Was the study rigorous?

The final sampling should represent all relevant groups (all relevant types of patients, both sexes, the full age range, etc.). In qualitative research, sample size is not necessarily fixed, but may continue until no new ideas or concepts emerge, a situation known as saturation. Hence the sampling methods focus on covering the content of responses, rather than numbers of persons. Nor is the interview script completely fixed. Questions need not be uniform, but should respond to the participants’ verbal and non-verbal cues so that the topic can be fully explored. As the project continues, the interview script may evolve in response to the findings of previous interviews.
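The idea of sampling until saturation can be sketched as a simple stopping rule. Everything in this example is an assumption made for illustration: the themes are invented, and the rule of stopping after two consecutive interviews that add nothing new is one possible operational definition, not a standard.

```python
def interviews_until_saturation(interviews, patience=2):
    """Return (number of interviews conducted, themes found), stopping once
    `patience` consecutive interviews contribute no new theme.

    `patience` is a hypothetical threshold chosen for this sketch."""
    seen = set()
    quiet_run = 0  # consecutive interviews that added nothing new
    for count, themes in enumerate(interviews, start=1):
        new_themes = set(themes) - seen
        seen |= new_themes
        quiet_run = 0 if new_themes else quiet_run + 1
        if quiet_run >= patience:
            return count, seen
    return len(interviews), seen

# Hypothetical themes raised in successive interviews about satisfaction with care
interviews = [
    {"cost", "waiting time"},
    {"waiting time", "staff attitude"},
    {"cost"},
    {"privacy"},
    {"staff attitude"},
    {"cost", "privacy"},
    {"parking"},
]

count, themes = interviews_until_saturation(interviews)
print(f"Stopped after {count} interviews with themes: {sorted(themes)}")
```

In this invented sequence the rule stops before the final interview, which would have raised a new theme. That illustrates why the judgement of saturation is itself open to appraisal: a reader may reasonably ask how the researchers decided that no new ideas were emerging.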

Qualitative biases

Qualitative data collection methods, while flexible, should be systematic and clearly documented. Having more than one researcher analyze the data can reveal possible biases in interpretation; study participants may even be asked to validate the interpretation. The reader should look for evidence that the research was conducted in an ethical manner and that confidentiality and anonymity have been preserved.

Bias is inherent in qualitative research. Collecting data by observing people, whether or not they can see the observer, can influence their behaviour. Similarly, the results of data analysis can depend on the knowledge and perspective of the person doing the analysis. Equivalent challenges exist in quantitative research but the methods to counteract them are not the same. Quantitative research aims for uniformity and standardization to reduce bias. Qualitative research, by its nature, responds to context and so is less standardized. Explaining the context and stating the researcher’s perspective allow the reader to assess the researcher’s influence on the findings.

  3. Can I transfer the results of this study to my own setting?

Clinicians reviewing an article must decide if the context and subjects of the study are sufficiently like their own context and patients for the results to be applicable. The results can also be compared to previous literature: How closely does this study corroborate other work? If it corroborates closely, it is likely to be generalizable and, therefore, transferable to a similar context.

Qualitative and quantitative complementarity

Cockburn examined patient satisfaction with breast screening services in Australia. She used qualitative methods, including literature reviews and interviews with patients and staff, to identify relevant aspects of satisfaction. From this she developed a standardized questionnaire to measure satisfaction with screening services; she then surveyed a sample of patients and analyzed data from this questionnaire in a quantitative manner.9

Quantitative research

Quantitative studies in medical research are of two broad types: those that count things, such as the numbers of people with different types of cancer, and studies that identify causal influences, such as whether a treatment reliably produces a cure. The following presentation begins with causal studies, and these generally compare different groups, such as people exposed to a risk factor or given a treatment, with others who were not. However, a fundamental challenge is that causality can never be definitively proven; the best a study can do is to show that the results match a series of criteria for inferring a causal relationship.

Criteria for inferring causation

In 1965, Austin Bradford Hill proposed a set of criteria for assessing the causal status of correlations observed in an epidemiological study; he based these in part on Koch’s postulates from the nineteenth century (see Koch’s postulates box, below). The criteria have been revised many times, so you may find a version with different numbers of criteria. Table 5.3 shows a typical example with a commentary on their limits.

Koch’s postulates

Robert Koch (1843–1910) was a Prussian physician who is considered a father of microbiology. He isolated Bacillus anthracis, Vibrio cholerae, and Mycobacterium tuberculosis, once known as Koch’s bacillus, for which he won the 1905 Nobel Prize in Physiology or Medicine. His causal criteria (or postulates) state that, to classify a microbe as the cause of a disease, the microbe must be

  • Found in all cases of the disease examined
  • Capable of being prepared and maintained in pure culture
  • Capable of producing the original infection, even after several generations in culture, and
  • Retrievable from an inoculated animal and cultured again.

These postulates built upon earlier criteria for causality formulated by the philosopher John Stuart Mill in 1843.

Microbiology nerds will be able to cite diseases caused by organisms that do not fully meet all of these criteria, but nonetheless, Koch’s postulates provided a rational basis for the study of medical microbiology.

Table 5.3: Criteria for inferring a causal relationship in epidemiology

Criteria | Comments
1. Chronological relationship: Exposure to the presumed cause must predate the onset of the disease. | This is widely accepted. But beware of the difficulty in knowing when some diseases with long latent periods actually began. Could your patient’s cancer have predated his occupational exposure?
2. Strength of the association: If everyone with a disease was exposed to the presumed causal agent, but very few in a healthy comparison group were exposed, the association is a strong one. In quantitative terms, the relative risk will be large, and a large relative risk could suggest a causal relationship. | This criterion can be disputed: the strength depends very much on whether other factors are considered, and how these are controlled in a study. A weak association may still be causal, particularly if the influence is moderated by other factors. Conversely, a relationship can appear strong but could result from an unacknowledged confounding factor. For example, the risk of Down syndrome rises with the child’s birth order. This is actually due to greater maternal age at the birth of higher-order children.
3. Intensity or duration of exposure (also called biological gradient or dose-response relationship): If those with the most intense, or longest, exposure to the agent have the greatest frequency or severity of illness, we judge it more likely that the association is causal. | A reasonable criterion if present, but the absence of a dose response does not disprove causality. For example, if a low threshold of exposure is sufficient for the agent to have an effect, increasing exposure may have no further impact.
4. Specificity of association: If an agent or risk factor is found that consistently relates only to this disease, then it appears more likely that it plays a causal role. | This was derived from causes of infectious diseases but is a weak criterion for other conditions. Smoking and obesity are causally associated with several diseases; the absence of specificity need not undermine a causal interpretation.
5. Consistency of findings: An association is consistent if it is confirmed by different studies; it is even more persuasive if these are in different populations. | A good criterion, although it may lead us to miss causal relationships that apply to only a minority of people. For instance, the drug-induced hemolysis associated with glucose-6-phosphate dehydrogenase (G6PD) deficiency could be difficult to demonstrate in populations in which G6PD deficiency is rare.
6. Coherent or plausible findings: Do we have a biological (or behavioral, etc.) theory to explain the observed association? Evidence from experimental animals, or analogous effects created by analogous agents, and information from other experimental systems and forms of observation should be considered. | A good criterion if we do have a theory. Yet the lack of a biological explanation should not lead us to dismiss a potential cause. Knowledge evolves and new theories may be generated to explain initially unexpected findings.
7. Cessation of exposure: If the causal factor is removed from a population, then the incidence of disease should decline. | This may work for disease rates in a population, but for an individual, pathology is not always reversible.
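Criteria 2 and 3 in Table 5.3 lend themselves to a small worked example. The cohort below is invented for illustration. The sketch computes the relative risk at each exposure level against the unexposed group (strength of association) and then checks whether risk rises steadily with exposure (dose-response).

```python
# Hypothetical cohort, invented for illustration: exposure level -> (people, cases)
cohort = {
    "none": (1000, 10),
    "low": (1000, 20),
    "medium": (1000, 40),
    "high": (1000, 80),
}

def risk(people, cases):
    """Proportion of people at this exposure level who developed the disease."""
    return cases / people

baseline = risk(*cohort["none"])  # risk among the unexposed

# Criterion 2: relative risk of each exposure level compared with no exposure
relative_risks = {level: risk(*counts) / baseline for level, counts in cohort.items()}

# Criterion 3: does the relative risk increase at every step up in exposure?
levels = list(cohort)  # insertion order runs from lowest to highest exposure
has_gradient = all(
    relative_risks[a] < relative_risks[b] for a, b in zip(levels, levels[1:])
)

for level in levels:
    print(f"{level:>6}: relative risk = {relative_risks[level]:.1f}")
print("Dose-response gradient present:", has_gradient)
```

As the comments in the table caution, a large relative risk or a clean gradient in such a calculation supports a causal interpretation but does not establish it, since a confounding factor could produce the same pattern.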

Does asbestos cause lung cancer?

The more causal criteria that are met in a study, the stronger is the presumption that an observed association is causal. For example, could exposure to asbestos fibres among construction workers have caused lung cancer in some of them?
1. Chronological relationship: Can we be sure that the exposure to asbestos predated the cancer (which may have taken years to develop)?
2. Strength of the association: How much more likely are people working with asbestos to develop cancer, compared to other occupations?
3. Intensity and duration of the exposure: Were those with the greatest exposure the most likely to get sick?
4. Specificity: Did they just get lung cancer?
5. Consistency: Have similar findings been reported from other countries?
6. Coherence and plausibility: Does it make biological sense that asbestos fibres could cause lung cancer?
7. Cessation of exposure: After laws were passed banning asbestos, did lung cancer rates decline among construction workers?

In the end, whether or not a factor is accepted as a cause of a disease always remains open to dispute, especially when it is not possible to obtain experimental proof. There are still defenders of tobacco who can use technical arguments to point out the flaws in the evidence that smoking causes cancer and heart disease. The following sections outline the major types of research design, commenting on their strengths and limitations.

Research designs

Quantitative research uses a variety of study designs, which fall into two main classes: experimental studies (or trials) and observational studies. Figure 5.1 maps the distinctions between these, starting from the exposure or disease being studied, at the top of the diagram.

Figure 5.1: What kind of study is it? (Diagram showing the logical structure of alternative study designs.)

Experimental (or interventional) studies

As the name implies, these are studies in which the participants undergo some kind of intervention in order to evaluate its impact. The experimental researcher has control over the intervention, its timing and dose or intensity. The study could test a medical or surgical intervention, a new drug, or an intervention to change lifestyle. As the most methodologically rigorous design, experiments are the default choice for providing evidence for best practice in patient management, so our discussion begins with them.

In its simplest form, an experimental study to test the effect of a treatment follows these steps:

  1. The researcher formally states the hypothesis to be tested
  2. The researcher selects people eligible for the treatment
  3. The sample is divided into two groups
  4. One group (the experimental, or intervention group) is given the intervention while the other (the control group) is not, and
  5. Relevant outcomes are recorded over time, and the results compared between the two groups.

Step 3 leads to a critical distinction, shown at the left of Figure 5.1: the distinction between a randomized controlled trial and non-randomized designs. In the former, people are allocated in step 3 to intervention and control groups by chance alone, while in the latter the choice of who receives the intervention is decided in some other way, such as according to where or in which order they enter the study. There are many types of non-randomized studies and, because the researcher often does not have complete control over the allocation to experimental or control group, they are regarded as inferior to true randomized designs. They are often called quasi-experimental designs (see the Nerd’s Corner box).

What’s a quasi-experiment?

For example, your study could treat hypertensive patients attending Hospital A with one treatment protocol and compare their outcomes to patients receiving a different protocol in Hospital B. This is convenient: there is no need to randomize patients in each hospital, and training research staff is simplified. However, many biases might arise in this approach: one hospital may treat more severe patients; patients might choose which hospital or clinician they attend (self-selection); other aspects of care in the two hospitals may be different, and so forth.

Another quasi-experimental design is the Time Series study. This follows a single group and makes serial measurements before and after some intervention, and trends are compared to detect the impact of the intervention. For example, to examine whether introducing a new textbook on public health has any impact on student learning, public health exam marks could be compared for successive cohorts of medical students, for several years before, then for several years after the introduction of the book. The hypothesis is that there will be a significant jump in the scores following the introduction of the new book. This design is called quasi-experimental because it lies mid-way between an observational study and a true experiment. It comes closer to an experiment if the investigator controls when the book is introduced. But other concurrent changes may have occurred in the educational system that could have influenced the results, rather than the book itself. The time-series design has the virtue of feasibility: it would be difficult to randomly allocate some students to read the book and others not, because the book might be shared between the two groups.

Quasi-experiments have sufficient sources of potential bias that they are regarded as substantially inferior to true randomized experiments, so their findings are rarely considered conclusive.

Randomization removes systematic bias in allocating patients to treatment or control groups, making them comparable; it also allows the valid use of statistical tests, which often assume a random allocation. But the key advantage is that any other factors that could affect the outcome (aka confounding factors) should be equally represented in each study group—including unknown factors such as genetic characteristics that could affect prognosis. An unbiased random allocation should ensure that the only difference between the two study groups is the intervention. The larger the study sample the more confident we can be that other factors really will be equivalent in the two groups, so any difference in outcomes between the groups can be attributed to the intervention. Nonetheless, this remains a matter of probabilities, which is why we need tests of statistical significance. These show the (hopefully very small) likelihood that observed differences between experimental and control groups could have arisen merely by chance.

Random sampling and random allocation

Distinguish between random selection of subjects from a sampling frame or list, and random allocation of subjects to experimental or control groups. Random selection of subjects is mainly relevant in descriptive research and helps to ensure that results can be generalized to the broader population, enhancing the EXTERNAL VALIDITY of the study (see the Glossary, and the section on errors in sampling).

Randomly allocating people to experimental and control groups helps to ensure they are equivalent in everything save for the experimental intervention, so the comparison is not confounded by inherent differences between groups. This enhances the INTERNAL VALIDITY of the study (see also the Nerd’s corner “Not always truly random”).

Not always truly random

For practical reasons, some trials use non-random patient allocation. For example, using patients’ health insurance numbers, those with an odd number could be assigned to the experimental group and even numbers to the control. This is superior to participants themselves choosing which group to join, and may approach the quality of a random allocation. However, the numbering system should be carefully scrutinized to ensure the numbers were assigned in a truly random manner. Check, for example, to ensure that males are not given odd numbers and females even.

Randomized controlled trials

The most common experimental design in medical research is the randomized controlled trial (RCT – see Figure 5.2). An RCT is a true experiment in that the investigator controls the exposure and, in its simplest form, assigns subjects randomly to the experimental or control group (which may receive no treatment, or the conventional treatment, or a placebo). Both groups are followed and assessed in a rigorous comparison of their rates of morbidity, mortality, adverse events, etc. RCTs are most commonly used in therapeutic trials but can also be used in trials of prevention. Most commonly, people are randomly allocated to the study groups individually, but groups of people can also be allocated, or even whole communities. RCTs are often conducted across many centres, as illustrated by clinical trials of cancer treatments.

Figure 5.2 Generic plan of a randomized controlled trial

The steps in an RCT are:

  1. State the hypothesis in quantitative and operational terms. For example, using the PICO format: “There will be a 10% reduction in self-recorded night sweats among peri-menopausal women who wear a magnetic wrist bracelet, compared to age-matched women who do not wear a wrist magnet.”
  2. Select the participants. This step includes calculating the required sample size, setting inclusion and exclusion criteria, and obtaining free and informed consent.
  3. Allocate participants randomly to either the treatment or control group; this is normally done using a computer-generated random allocation. Note that there may be more than one intervention group, for example receiving different strengths of magnet in the menopause study. Note also that the control group often receives the standard treatment to which the new one is being compared (here, perhaps, estrogen therapy), or a placebo (such as a wrist band with a fake magnet).
  4. Administer the intervention. This is preferably done in a blinded fashion, so that the patient does not know which group she is in. Ideally, the researcher (and certainly the person intervening and monitoring the patient’s response) should also not know which group a given patient is in (this is called a double-blind experiment). This helps to remove the influence of the patient’s and the clinician’s expectations of the treatments, which could bias their assessment of outcomes. Sometimes, a triple-blind approach is used in which neither patient, nor clinician, nor those who analyze and interpret the data (e.g., read x-rays) know which group received the treatment (the groups are merely labeled A or B). This reduces possible bias even further.
  5. At a pre-determined time, the outcomes are monitored (e.g., physiological or biochemical parameters, morbidity, mortality, adverse events, functional health status, or quality of life) and compared between the intervention and control groups using statistical analyses. This indicates whether any differences in event rates observed in the two groups are greater than might be expected by chance alone.
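The computer-generated random allocation in step 3 can be sketched as follows. This is a minimal illustration only, assuming a simple 1:1 allocation to two groups; the function name `allocate`, the participant numbers, and the fixed seed are hypothetical and are not taken from any particular trial.

```python
import random

def allocate(participant_ids, seed=None):
    """Randomly split participants into intervention and control groups (1:1)."""
    rng = random.Random(seed)          # a seed makes the allocation reproducible for audit
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)              # chance alone decides the ordering
    half = len(shuffled) // 2
    return {
        "intervention": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Hypothetical example: 20 consenting participants numbered 1 to 20
groups = allocate(range(1, 21), seed=42)
print(groups)
```

Real trials often use more elaborate schemes (for example, allocation within blocks or strata), but the principle is the same: neither the patient nor the clinician chooses the group.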

While RCTs are regarded as the best research design we have, they do have limitations. They intentionally study the efficacy of a treatment under carefully controlled experimental conditions, which may not show how well the treatment will work in normal clinical practice. “Efficacy” refers to the potential impact of a treatment under the optimal conditions typical of a controlled research setting. “Effectiveness” refers to its impact under the normal conditions of routine practice. For example, in trial conditions the medication may be efficacious because patients are being carefully supervised and know that they are participating in a research project. Unfortunately, in the real world the medication’s effectiveness may be lower because, without supervision, patients may not take all of their medication in the correct dose. A potentially efficacious intervention may also not be efficient enough to put into practice. Breast self-examination has been shown to detect early breast cancer, but only in trial conditions in which women received constant follow-up by trained nurses. This level of intervention was too costly to be put into routine practice.

Furthermore, clinical trials are often conducted on selected populations – for example, men aged 50 to 74 with unstable angina who are current smokers, have no co-morbidity and are willing to participate in a research study. This selectivity limits the extent to which results can be generalized to typical angina patients, i.e., the study’s EXTERNAL VALIDITY. Trial results may also be biased due to attrition if participants in one or other group drop out of the study. Finally, intervention trials, although designed to detect differences in the known and desired outcomes, may not be large enough to reliably detect previously unknown or rare effects.

An adaptation of the RCT is the “N of 1” trial, which can be especially valuable for testing a treatment in a particular patient in a way that avoids most sources of bias.

N of 1 trials

An N of 1 trial is a special form of clinical trial that studies a single patient. The patient receives either the active treatment or a control (e.g., a placebo), determined randomly and administered blindly. Outcomes are recorded after a suitable time lapse. This is followed by a washout period when the medication is withheld to eliminate remaining traces of it. The patient then receives the alternate treatment (placebo or active) and outcomes are evaluated. The cycle may be repeated to establish stable estimates of the outcomes. The main advantage is that the study result applies specifically to this patient and allows for calibration to optimize the therapeutic dose. The results cannot be generalized beyond this particular patient, and of course it requires that the treatment effect can be reversed.

The N of 1 approach can also be applied to a group of patients. This can produce highly valid results because each patient acts as his or her own control, which eliminates almost all sources of bias.

Ethics of RCTs

Particular ethical issues (see ETHICS in the Glossary) arise in the conduct of medical experiments. A tension may arise between two basic principles: patients have a right to receive an effective treatment (the principle of beneficence), but it is unethical to adopt a new treatment without rigorous testing to prove efficacy and ensure non-maleficence. Therefore, if there is partial evidence that a treatment appears superior, it may be unethical to prove this in a randomized trial as this would entail denying it to patients in the control group. Hence, an RCT can only ethically be applied when there is genuine uncertainty as to whether the experimental treatment is superior; this is termed equipoise. The problem is, genuine equipoise may make it irrelevant to undertake an expensive trial when there is little reason to believe a new medication to be superior. It is also unethical to conduct trials that offer only marginal social value (e.g., studies that benefit the publication record of the researcher more than the health of patients, or studies that double as marketing projects). Naturally, it is also unethical to continue a trial if the treatment is found to be obviously effective or obviously dangerous. Trials are therefore planned with pre-set stopping rules that specify conditions under which they should be prematurely concluded. Ethical considerations mean that many established treatments will probably never be evaluated by a controlled trial:

  • Appendectomy for appendicitis
  • Insulin for diabetes
  • Anaesthesia for surgical operations
  • Vaccination for smallpox
  • Immobilization for fractured bones, and
  • Parachutes for jumping out of airplanes, as the British Medical Journal humorously noted.10

Terminating trials early

The ethical principle of beneficence demands that patients should benefit from a new treatment as soon as it is proven effective, but the principle of non-maleficence implies that this proof must be definitive. Therefore studies are designed to include the minimum sample size necessary for definitive proof. The sample size is calculated before the study begins, based on an estimate of the likely relative benefits of intervention and control treatments. But this is an estimate only, and can be wrong.
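To illustrate how a sample size depends on the expected benefit, the sketch below uses a common textbook approximation for comparing two proportions. The formula, the function name `sample_size_per_group`, the default significance level and power, and the event rates are all assumptions chosen for illustration; an actual trial would rely on a statistician and on methods suited to its design.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate participants needed per group to compare two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment                # estimated difference in event rates
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical estimates: 30% of controls have the event, versus 20% or 25% if treated
n_expected = sample_size_per_group(0.30, 0.20)
n_if_smaller_effect = sample_size_per_group(0.30, 0.25)
print(n_expected, n_if_smaller_effect)
```

Under these assumptions, halving the estimated benefit roughly quadruples the required sample, which shows why an over-optimistic estimate of benefit can leave a trial too small to be definitive.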

Occasionally, early results may seem to show an advantage one way or the other, but being based on small numbers these preliminary results may be due to chance. Researchers may thus be faced with a choice between stopping a trial before the number of participants is large enough to definitively demonstrate the superiority of one course of action, and continuing the trial even though, as far as they know, one course of action appears superior to the other. This decision becomes especially challenging when the experimental treatment appears to be harmful compared to the conventional treatment. A further complication is that early analyses of the data imply un-blinding the investigators and this can bias their future conclusions. A data-monitoring committee commonly uses methods that allow continuous monitoring of outcomes but without sharing results with the investigators until clinically significant differences occur and the trial can be stopped.

Phases of intervention studies

Once a new pharmaceutical treatment has been developed, it undergoes testing in a sequence of phases before it can be approved by regulatory agencies for public use. Clinical trials form the third stage within this broader sequence, which begins with laboratory studies using animal models, thence to human testing:

Phase I: The new drug or treatment is tested in a small group of people for the first time to determine safe dosage and to identify possible side effects.

Phase II: The drug or treatment is given to a larger group at the recommended dosage to determine its efficacy under controlled circumstances and to evaluate safety. This is generally not a randomized study.

Phase III: The drug or treatment is tested on large groups to confirm effectiveness, monitor side effects, compare it to commonly used treatments, and develop guidelines on the safe use of the drug. Phase III testing normally involves a series of randomized trials. At the end of this phase, the drug may be approved for public use. The “on label” approval may restrict how the drug can be used, for instance for specific diseases or in certain age groups.

Phase IV: After the treatment enters the marketplace, information continues to be collected to describe its effectiveness on different populations, and especially to detect possible side effects or adverse outcomes. This does not involve a formal RCT and is called post-marketing surveillance; information comes from several sources, such as reports of side effects from physicians and patients, or data on outcomes such as hospital readmissions obtained from computerized information systems. Large numbers may be required to detect rare or slowly developing side effects.

Observational studies

In an observational study, the researcher observes and records what happens to people under exposure conditions either chosen by the person (as with exercise or diet) or that are outside of their control (most social determinants of health). There is often a comparison group of people who were not exposed. The key difference with experimental studies is that the researcher chooses which populations and exposures to study, but does not influence them. As there is no random allocation of exposures, the major problem in inferring causation is that the exposed and unexposed groups may differ on other key factors that may themselves be true causes of the outcome, rather than the characteristics under study (i.e., confounding).

The simplest form of observational study is a descriptive design.

Descriptive studies

Descriptive studies describe how things are: they count the numbers of people who have diabetes, or who smoke, or are satisfied with their hospital care. Such a study uses descriptive statistics to summarize the group results – percentages, mean or median values, and perhaps the variability or standard deviation. The data for a descriptive study may come from a survey questionnaire or from sources such as electronic medical records, or from surveillance systems, describing person, place, and time of disease occurrences (Who? Where? When?) (see Surveillance in Chapter 7). Descriptive studies are commonly used with small, local populations, such as the patients in your practice, and are often used to collect information for planning services. Descriptive studies generally refer to a single point in time – usually the present – and give a cross-sectional, representative picture of the population, although repeated cross-sectional studies can illustrate trends over time, such as the changing number of smokers in your practice. When a study collects information on several variables, it can describe the associations among the variables (for example, is diabetes more common in men or women and does it vary by smoking status?). This can be used to generate hypotheses, which may then be tested in an analytic study.

Analytic studies

The critical distinction between descriptive and analytic studies is that the latter are designed to test a hypothesis, generally concerned with identifying a causal relationship. When an outcome variable, such as heart disease, is studied in relation to an exposure variable such as body weight, the study does more than count: it tests a hypothesis predicting an association between the two. The aim is no longer purely local, as with a descriptive study, but rather to draw a more general conclusion that will apply to a broader population. Hence, the representativeness of the study sample is crucially important to ensure the external validity or generalizability of the sample results. To describe the level of confidence with which we can draw general conclusions from a sample, we use inferential statistics (see section on Chance errors in sampling, below).

Analytic observational studies vary in terms of the time sequence and sampling procedures used to collect data, and can be of three types: cross-sectional, cohort, or case-control studies, as shown at the bottom of Figure 5.1.

Cross-sectional analytic studies

Cross-sectional studies use a single time-reference for the data collected (e.g., referring to events during the past 2 weeks). A common cross-sectional design is the analytical survey that measures variables to test hypotheses concerning their relationships. For example, in a nationally representative sample a researcher might test the hypothesis that feelings of stress increase the use of medical services. The researcher might ask whether people were under stress in the past two weeks, then whether they had sought care over the same period.

Suppose that this study produced the following result:

Table 5.4: Stress and health care: calculating the association between two variables

                                   Doctor visit?
Felt stress in the past 2 weeks?   Yes      No       Total
Yes                                1,442    3,209    4,651
No                                 2,633    11,223   13,856
Total                              4,075    14,432   18,507

Note that the results can be reported in either of two ways:

  1. Of those who suffered stress in the past 2 weeks, 31% (1442/4651) visited their doctor over the same period, compared with only 19% (2633/13856) of those who did not suffer stress.
  2. Of those who visited their doctor, 35% (1442/4075) reported stress in the past 2 weeks, compared with 22% (3209/14432) of those who did not visit their doctor.

Either approach is correct. The researcher is free to decide which way to report the results; the study design allows both types of analysis. But all that can be concluded is that there is an association between the two variables. It might be supposed that stress predisposes people to visit their doctor, or could it be that the prospect of a visit to the doctor causes stress? Or maybe something else (a confounding factor such as worrying about an underlying illness?) caused both. This study provides little evidence in support of a causal relationship – merely the apparent association between stress and the doctor visit. The chief weakness of cross-sectional studies is that they cannot show temporal sequence: whether the factor (stress) pre-dated the outcome (doctor visit) (see the causal criteria in Table 5.3).
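The two ways of reporting Table 5.4 can be reproduced with a few lines of arithmetic. The sketch below uses the counts from the table; the variable names are illustrative only.

```python
# Counts from Table 5.4 (stress in the past 2 weeks, by doctor visit)
stress_visit, stress_no_visit = 1442, 3209
no_stress_visit, no_stress_no_visit = 2633, 11223

# Reporting by row: among people with and without stress, what proportion visited?
visited_given_stress = stress_visit / (stress_visit + stress_no_visit)
visited_given_no_stress = no_stress_visit / (no_stress_visit + no_stress_no_visit)

# Reporting by column: among visitors and non-visitors, what proportion reported stress?
stress_given_visit = stress_visit / (stress_visit + no_stress_visit)
stress_given_no_visit = stress_no_visit / (stress_no_visit + no_stress_no_visit)

print(f"Visited doctor: {visited_given_stress:.0%} of stressed, "
      f"{visited_given_no_stress:.0%} of non-stressed")
print(f"Reported stress: {stress_given_visit:.0%} of visitors, "
      f"{stress_given_no_visit:.0%} of non-visitors")
```

Both calculations describe the same association; neither says anything about which variable came first.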

Descriptive and analytic studies usually sample individual people, but they may alternatively study groups of people, such as city populations. These ecological studies often use aggregate data from government sources, making the study easy to undertake. See the Nerd’s Corner box on Ecological studies.

Ecological studies

Ecological studies measure variables at the level of populations (countries, provinces, etc.) rather than individuals. They are the appropriate design for studying the effect of a variable that acts at the population level, such as climate, an economic downturn, or the shortage of physicians. Like a survey, they can be descriptive or analytic. They have the advantage that they can often use readily available data such as government statistics. Ecological studies can be useful in generating hypotheses that can then be tested at the individual level. For example, the hypothesis that dietary fat may be a risk factor for breast cancer arose from a study that showed that countries with high per-capita fat consumption also had higher incidences of breast cancer.

However, there is a logical limitation in drawing conclusions from ecological studies for individual cases. Because the ecological finding was based on group averages, it does not necessarily show that the individuals who consumed a high fat diet were the same ones who got the cancer. The breast cancer cases living in countries with high average fat diets might have consumed little fat: we cannot tell from ecological data. The temptation to draw conclusions about individuals from ecological data forms “the ecological fallacy.” To draw firm conclusions about the link between dietary fat and breast cancer risk, the two factors must be studied in the same individuals. Nevertheless, ecological studies are often used as a first step, to suggest whether or not a more expensive study of individuals may be worthwhile.

Cohort studies

A cohort is a group of people who can be sampled and enumerated, who share a defining characteristic and who can be followed over time: members of a birth cohort share the same year of birth, for example (see the Nerd’s Corner: Roman cohorts). Cohort studies of health commonly study causal factors; the characteristic of interest is usually some sort of exposure hypothesized to increase the likelihood of a disease. A typical cohort study begins with a representative sample of people who do not (yet) have the disease of interest; it collects information on exposure to the factor being studied, then follows exposed and unexposed people over time (Figure 5.3). Because of this, cohort studies are also known as follow-up or longitudinal studies. The numbers of newly occurring (incident) cases of disease are recorded and compared between the exposure groups. The hypothesis to be tested is generally that more disease will arise in the exposed group (as indicated by the relative sizes of the rectangles on the right of the figure).

Figure 5.3: Schema of a cohort study

Roman cohorts

Cohort: from Latin cohors, meaning “an enclosure.” The meaning was extended to an infantry company in the Roman army through the notion of an enclosed group or retinue. Think of a Roman infantry cohort marching forward; some are wearing new metal body armour, while others have the old canvas-and-leather protection. Bandits shoot at them and General Evidentius gets a scribe to record the mortality outcomes and his trusty analyst, Epidemiologicus, compares the results using simple arithmetic, as shown in Table 5.5.

In simple cohort studies the results can be fitted into a “2 by 2” table (2 rows by 2 columns – we don’t count the Total column).

Table 5.5: Generic format for a 2 x 2 table linking an exposure to an outcome.

                                   Outcome (e.g., disease) present   Outcome (e.g., disease) absent   Total
Exposure (or risk factor) present  a                                 b                                a + b
Exposure (or risk factor) absent   c                                 d                                c + d

The incidence, or risk, of disease in the exposed group is calculated as a / (a + b). Correspondingly, the risk in the non-exposed people is c / (c + d). These risks can be compared to get a risk ratio (often called a relative risk, or RR): [a/(a + b) divided by c/(c + d)]. This gives an estimate of the strength of the association between the exposure and the outcome: how much more frequent is the disease among those exposed? A relative risk of 1.0 indicates that exposed people are neither more nor less likely to get the disease than unexposed people: there is no association between exposure and disease. A relative risk greater than 1.0 implies that people who have been exposed have a greater chance of becoming diseased, while a relative risk of less than 1.0 implies a protective effect (e.g., a reduced risk of COVID-19 among those who had been immunized).
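As a worked illustration, the relative risk can be computed directly from the four cells of the 2 x 2 table. The function name and the counts below are hypothetical; the cell labels a, b, c and d follow Table 5.5.

```python
def relative_risk(a, b, c, d):
    """Risk ratio from a cohort 2 x 2 table laid out as in Table 5.5."""
    risk_exposed = a / (a + b)        # incidence among the exposed
    risk_unexposed = c / (c + d)      # incidence among the unexposed
    return risk_exposed / risk_unexposed

# Hypothetical cohort: 20 of 100 exposed and 10 of 100 unexposed people develop the disease
rr = relative_risk(a=20, b=80, c=10, d=90)
print(f"Relative risk = {rr:.1f}")
```

Here the exposed group has twice the risk of the unexposed group; swapping the two rows would give a relative risk of 0.5, the pattern expected of a protective factor.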

The main advantage of cohort studies is that exposure was recorded before the outcomes, thus meeting the causal criterion of a temporal sequence between exposure and outcome – as long as study participants truly did not have the disease at the outset. Furthermore, because recording of exposures and outcomes is planned from the beginning of the study period, measurements can be standardized. Note that randomized trials are an experimental version of a cohort study.

Definition of exposure groups

Imagine a cohort study designed to test the hypothesis that exposure to welding fumes causes diseases of the respiratory tract. Whom should you sample? A crude but simple approach would be to base exposure on occupation (assume that metal workers have been exposed; carpenters would be presumed to be unexposed). This approach is frequently used in occupational and military epidemiology. A more precise alternative would be to quantify levels of exposure (e.g., from the person’s work history); this requires considerably more information but would permit dose-response to be estimated—one of the criteria for inferring causation listed in Table 5.3.

In an extension of this quantified approach, a cohort study might not select exposed and unexposed groups to follow, but instead select a representative sample of individuals with sufficient variability in their exposure to permit comparisons across levels of exposure. This allows for mathematical modelling of the effect of exposures. Cohort studies of diet, exercise, or smoking often use this approach, obtaining exposure information from a baseline questionnaire. This approach has been used in community cohort studies such as the Framingham Heart Study (see Illustration: the Framingham Study). Cohort studies offer a powerful way to evaluate causal influences, but they may take a long time to complete and hence be expensive. A cheaper alternative is the case-control design.

The Framingham Study

Starting in 1948, the town of Framingham, Massachusetts, participated in a cohort study to investigate the risk factors for coronary heart disease. The study has now collected data from two subsequent generations of the families initially involved. This has produced quantitative estimates of the impact of risk factors for cardiac disease, covering levels of exercise, cigarette smoking, blood pressure, and blood cholesterol. Details may be found at the Framingham Heart Study website.

Case-control studies

Case-control studies compare a group of patients with a particular outcome (e.g., cases of pathologist-confirmed pancreatic cancer) to an otherwise similar group of people without the disease (the controls). As shown in Figure 5.4, reports or records of exposure (e.g., alcohol consumption) before the onset of the disease are then compared between the groups. The name of the design reminds you that groups to be compared are defined in terms of the outcome of interest: outcome present (cases) or absent (controls). The hypothesis to be tested is that exposure will be more common among cases than controls, as suggested by the relative size of the circles on the left of the figure.

Figure 5.4: Schema of a case-control design

Note that in a case-control study you cannot calculate the incidence or risk of the disease, because the study begins with predetermined numbers of people with and without the disease. Therefore, a risk ratio cannot be calculated. But do not despair: you can calculate the odds of a case having been exposed—the ratio of a:c in Table 5.6. This can be compared to the odds that a control was exposed, the ratio of b:d. The result of the case-control study is then expressed as the ratio of these two odds, giving the odds ratio (OR): a/c divided by b/d. To make the calculation easier, this is usually simplified algebraically to ad/bc (i.e., evidence supporting the hypothesis divided by evidence against).

Table 5.6: Generic 2 x 2 table for calculating an odds ratio

                                   Outcome (or disease) present   Outcome (or disease) absent
Exposure (or risk factor) present  a                              b
Exposure (or risk factor) absent   c                              d

The OR calculated from a case-control study can approximate a relative risk, but only when the disease is rare (say, affecting up to around 5% of the population, as in many chronic conditions). The interpretation of the value of an OR is similar to an RR. Like a relative risk, an OR of 1.0 implies no association between exposure and disease. A value over 1.0 implies a greater chance of diseased people having been exposed compared to controls. A value below 1.0 implies that the factor is protective. This might occur, for example, if a case-control study showed that eating a low fat diet protected against heart disease.
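The odds ratio calculation can be sketched in the same way, using the cell labels of Table 5.6. The function name and the counts are hypothetical and serve only to show that the two forms of the formula agree.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a case-control 2 x 2 table laid out as in Table 5.6."""
    return (a * d) / (b * c)          # algebraically equal to (a / c) / (b / d)

# Hypothetical study: 40 of 100 cases and 20 of 100 controls were exposed
a, b, c, d = 40, 20, 60, 80
odds_case_exposed = a / c             # odds that a case was exposed
odds_control_exposed = b / d          # odds that a control was exposed
or_value = odds_ratio(a, b, c, d)
print(f"Odds ratio = {or_value:.2f}")
```

The odds of exposure are 40:60 among cases and 20:80 among controls, so in this made-up example exposure is more common among the cases and the odds ratio exceeds 1.0.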

Key contrast between cohort and case-control studies

In cohort studies, the participant groups are classified according to their exposure status (whether or not they have the risk factor).

In case-control studies, the different groups are chosen according to their health outcomes (whether or not they have the disease).

Prospective or retrospective?

These terms are frequently misunderstood, and for good reason.

Cohort studies define the comparison groups based on their exposure levels and follow them forward, over time. This forms a prospective cohort study, but this can take a long time to complete. It would be quicker to work from historical records, for example selecting people who worked as welders 30 years ago, contacting them and assessing their health status now, and linking this to their exposure history. This used to be called a retrospective cohort study but is better called an historical cohort study. The word retrospective causes confusion because it was formerly used to refer to case-control studies. Most authorities have now abandoned the term entirely.

Probabilities, odds, and likelihoods

Probabilities and odds express the same information in different ways. Probabilities look forward and express the proportion of people with a certain characteristic (e.g., exposed to a causal factor) who get a disease. Odds take this further, and express the ratio of two probabilities: the probability of a case having been exposed divided by the probability of not having been exposed. Odds are familiar to us when comparing separate groups – the ratio of men to women in your class, for example, or from sports: the odds of winning could be 4 to 1, which corresponds to a probability of 80%.
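The conversion between odds and probabilities is simple arithmetic, sketched below. The function names are illustrative; odds are written here as "for" to "against".

```python
def odds_to_probability(odds_for, odds_against):
    """Convert odds of 'for : against' into a probability."""
    return odds_for / (odds_for + odds_against)

def probability_to_odds(p):
    """Convert a probability into odds, expressed as a single ratio p : (1 - p)."""
    return p / (1 - p)

p_win = odds_to_probability(4, 1)     # odds of 4 to 1 in favour of winning
even_odds = probability_to_odds(0.5)  # a fair coin toss
print(f"Odds of 4 to 1 correspond to a probability of {p_win:.0%}")
print(f"A probability of 50% corresponds to odds of {even_odds:.0f} to 1")
```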

The relative risk, calculated in Table 5.5, required that the sample forms a single cohort and that all those who were exposed be classified as a case or a non-case (and similarly for the non-exposed). This is necessary so that proportions such as a/(a+b) are informative. In a case-control study, however, the proportion of cases and controls was pre-set and so a proportion such as a/(a+b) in Table 5.6 provides no new information. However, we can use odds and make the calculation vertically in Table 5.6, within the cases and within the controls, and compare the ratio a/c to b/d.

It was noted that the odds ratio approximates the relative risk only when disease is rare. This can be shown as follows. If the number of cases is small and the number of non-cases large, then a proportion such as a/(a+b) will be close to a/b.  The extent of error depends on the size of the relative risk, but as the disease becomes more frequent (e.g. above 5%) the OR tends to exaggerate the RR for risks > 1, and to under-estimate the RR when the risk is < 1.
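The rare-disease argument can be checked numerically. The sketch below assumes a true relative risk and a baseline risk in the unexposed (both hypothetical values chosen for illustration) and computes the odds ratio that would result.

```python
def odds_ratio_for(baseline_risk, relative_risk):
    """Odds ratio implied by a baseline risk (unexposed) and a relative risk."""
    risk_exposed = baseline_risk * relative_risk
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = baseline_risk / (1 - baseline_risk)
    return odds_exposed / odds_unexposed


# Hypothetical baseline risks: rare, borderline (5%), and common disease.
for baseline in (0.001, 0.05, 0.20):
    print(f"baseline risk {baseline:.3f}: "
          f"RR = 2.00 gives OR = {odds_ratio_for(baseline, 2.0):.2f}, "
          f"RR = 0.50 gives OR = {odds_ratio_for(baseline, 0.5):.2f}")
```

As the baseline risk rises, the OR moves further from 1.0 than the RR in both directions, which is the exaggeration described above.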

Whereas probabilities look forward and consider the range of outcomes that may arise, likelihoods look back and consider the plausibility of a conclusion, such as a diagnosis, given the evidence (the lab test results). In a coin toss, the odds are 50:50 for each outcome and show the ratio of possible outcomes. The likelihood is the probability of a given outcome given a fair coin (here, 50%).

Estimating absolute risk: attributable risk and number needed to treat

The RR and OR indicate how much an individual’s risk of disease is increased by having been exposed to a causal factor, in relative terms. Both statistics answer the question a patient who smokes might ask: “Compared to my brother who doesn’t smoke, how much more likely am I to get the disease?” The statistics give the answer as a ratio: “You are twice as likely”, or “10% more likely”. But the patient may equally ask about absolute risk, which relates to disease incidence, as in “What is my chance of getting the disease (in the next year, ten years, over my lifetime)?” The answer is given as an absolute proportion, such as 1 in 10, or 1 in 100. An important point to bear in mind when communicating with a patient is that if the disease is rare, quoting a relative risk of 2 or 3 can appear quite frightening even though the absolute risk is small. A risk factor that doubles an absolute risk of 1 in a million is still only 2 in a million.

Most diseases have multiple causes, so it is convenient to have a way to disentangle these and express the risk due to a particular cause. This introduces the concepts of risk difference and attributable risk, which indicate the number of cases of a disease among exposed individuals that can be attributed to that exposure (assuming it is a causal influence and not a confounder):

Attributable risk = Incidence in the exposed group − Incidence in the unexposed

This tells us how much extra disease has been caused by this exposure, in absolute terms: 1 case per million persons in the example above. In the case of a factor that protects against disease, such as a vaccination, it tells us how many cases can be avoided.

This idea of attributable risk can also be expressed in relative terms, as a proportion of the incidence in exposed persons, yielding the exposed attributable fraction, EAF (also called the attributable proportion):

EAF = [Incidence (exposed) – Incidence (unexposed) ] / Incidence (exposed)

This statistic can be useful in counselling a patient exposed to a risk factor: “Not only are you at high risk of developing lung cancer, but 90% of your risk is due to your smoking. Quitting could have a major benefit”.
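The two formulas can be written out as a short sketch. The incidence figures below are hypothetical, chosen only so that the exposed attributable fraction comes out at 90% as in the counselling example; the function names are our own.

```python
def attributable_risk(incidence_exposed, incidence_unexposed):
    """Risk difference: extra incidence among the exposed."""
    return incidence_exposed - incidence_unexposed


def exposed_attributable_fraction(incidence_exposed, incidence_unexposed):
    """Share of the incidence among the exposed that is due to the exposure."""
    return (incidence_exposed - incidence_unexposed) / incidence_exposed


# Hypothetical incidences per 100,000 persons per year.
smokers, non_smokers = 100, 10
print(attributable_risk(smokers, non_smokers))              # 90 (per 100,000)
print(exposed_attributable_fraction(smokers, non_smokers))  # 0.9
```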

In developing health policies, we can also apply the idea of attributable risk to describing the impact of risk factors on the population as a whole. This introduces measures of population attributable risk (PAR) and of population attributable fraction (PAF): statistics that evaluate the impact of a causal factor by substituting incidence in the whole population for incidence in the exposed (see Nerd’s corner).

Population attributable risk

In discussing the impact of primary preventive programmes, the population attributable risk (PAR) indicates the number of cases that would not occur if a risk factor were to be eliminated:

PAR = Incidence (population) – Incidence (unexposed)

Compared to the attributable risk formula shown above, the population incidence incorporates the proportion of the population that is exposed to the factor. A causal factor may be strongly associated with disease but, if rare, it will not cause many cases, so the attributable risk may be high but the PAR low. Sadly, this statistic is almost never used, despite its obvious utility in setting priorities for health policy.  Expressed as a proportion of the incidence in the whole population, it yields the population attributable fraction or PAF (which also goes by half a dozen other names):

PAF = [Incidence (population) – Incidence (unexposed)] / Incidence (population)

This statistic, relevant for public health, shows the proportion of all cases of a disease that is attributable to a given risk factor, and was used (for example) to estimate that 40,000 Canadians die from tobacco smoking each year.  A little algebra shows that it depends on the prevalence of the risk factor and the strength of its association (relative risk) with the disease. The formula is:

PAF = Pe (RRe-1)  /  [1 + Pe (RRe-1)],

where Pe is the prevalence of the exposure (e.g., the proportion who are overweight) and RRe is the relative risk of disease due to that exposure.

The population prevented fraction is the proportion of the hypothetical total load of disease that has been prevented by exposure to a protective factor, such as an immunization programme. The formula is:

Population prevented fraction = Pe (1 – RR).
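As a rough check on the algebra, the sketch below computes the PAF from the prevalence and relative risk and compares it with the definition based on incidences. All numbers are hypothetical and the function names are our own.

```python
def population_attributable_fraction(prevalence_exposed, relative_risk):
    """PAF = Pe (RR - 1) / [1 + Pe (RR - 1)]."""
    excess = prevalence_exposed * (relative_risk - 1)
    return excess / (1 + excess)


def population_prevented_fraction(prevalence_exposed, relative_risk):
    """Prevented fraction = Pe (1 - RR), for a protective factor (RR < 1)."""
    return prevalence_exposed * (1 - relative_risk)


# Hypothetical values: 20% exposed, incidence 0.10 if exposed, 0.01 if not.
pe, incidence_exposed, incidence_unexposed = 0.20, 0.10, 0.01
rr = incidence_exposed / incidence_unexposed

# Population incidence is the prevalence-weighted average of the two groups.
incidence_population = pe * incidence_exposed + (1 - pe) * incidence_unexposed
paf_from_incidence = (incidence_population - incidence_unexposed) / incidence_population

print(f"PAF from prevalence and RR: {population_attributable_fraction(pe, rr):.3f}")
print(f"PAF from incidences:        {paf_from_incidence:.3f}")
```

The two routes give the same answer, which is the point of the algebra mentioned above.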

Number Needed to Treat

A useful extension of the attributable risk is the concept of “Number Needed to Treat” (NNT). This metric summarizes the effectiveness of a therapy or a preventive measure in achieving a desired outcome. The number needed to treat captures the realities that no treatment works infallibly, and also that some patients recover spontaneously. The NNT estimates the number of patients with a condition who must follow a treatment regimen over a specified time to achieve the desired outcome for one person. The NNT is calculated as the reciprocal of the absolute improvement produced by the treatment. Imagine a medication that cures 35% of the people who take it, while 20% of those who do not take it recover spontaneously. Here, the absolute improvement due to the treatment is 15%. The reciprocal of this is 1 / 0.15, or about 7. So, on average, you would need to treat 7 people to achieve 1 cure from the medication (within the specified time). The NNT can also be applied in describing the value of a preventive measure in avoiding an undesirable outcome. It can likewise be used in calculating the hazard of treatments such as adverse drug reactions, when it is termed “number needed to harm”; this shows the average number of people treated with the drug that would generate one adverse event.
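The worked example above can be reproduced in a few lines. This is a minimal sketch with a function name of our own choosing; by convention the NNT is rounded up to the next whole patient.

```python
import math


def number_needed_to_treat(rate_treated, rate_untreated):
    """Reciprocal of the absolute improvement produced by the treatment."""
    absolute_improvement = rate_treated - rate_untreated
    if absolute_improvement <= 0:
        raise ValueError("the treatment shows no absolute improvement")
    return 1 / absolute_improvement


# 35% cured on the medication versus 20% recovering spontaneously.
nnt = number_needed_to_treat(0.35, 0.20)
print(round(nnt, 2))    # 6.67
print(math.ceil(nnt))   # 7
```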

Risky calculations

A cohort study of the effectiveness of an immunization examined whether or not immunized and non-immunized people became sick. The results were as follows:

                 Sick       Not sick
Immunized        20 (a)     100 (b)
Not immunized    50 (c)      30 (d)
Total = 200

How may we calculate risk? Many are the ways:

Relative risk (RR) = [a / (a+b)] / [c / (c+d)] = 0.167 / 0.625 = 0.267
(note that the immunization protects, so the result is < 1)
Odds ratio (OR) = ad / bc = 0.12
(but as this is a cohort study, one would generally not use the OR)
Attributable risk (AR) = [a / (a+b)] – [c / (c+d)] = 0.167 – 0.625 = -0.458
(the negative AR indicates protection)
Absolute risk reduction (ARR) = [c / (c+d)] – [a / (a+b)] = 0.625 – 0.167 = 0.458
(attributable risk with the sign changed)
Number needed to treat (NNT) = 1 / ARR = 1 / 0.458 = 2.18
(highly effective immunization!)
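Readers who want to verify these figures can recompute them from the four cells of the table. The variable names below are our own; the counts are those given in the box.

```python
# Cell counts from the immunization cohort table.
a, b = 20, 100   # immunized: sick, not sick
c, d = 50, 30    # not immunized: sick, not sick

risk_immunized = a / (a + b)
risk_not_immunized = c / (c + d)

relative_risk = risk_immunized / risk_not_immunized
odds_ratio = (a * d) / (b * c)
attributable_risk = risk_immunized - risk_not_immunized
absolute_risk_reduction = -attributable_risk
nnt = 1 / absolute_risk_reduction

print(f"RR  = {relative_risk:.3f}")             # 0.267
print(f"OR  = {odds_ratio:.2f}")                # 0.12
print(f"AR  = {attributable_risk:.3f}")         # -0.458
print(f"ARR = {absolute_risk_reduction:.3f}")   # 0.458
print(f"NNT = {nnt:.2f}")                       # 2.18
```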

Inferential Statistics

Medical research applies information gained from particular study samples to broader populations, for example to estimate the average birthweight of babies in Canada from records of birthweights in your hospital. This population value is called a “parameter”. The uncertainty of estimating a population parameter from a sample introduces inferential statistics.


To provide an accurate estimate of a parameter, a sample should evidently be representative of the population; a random sample offers a good approach. But because people vary, different samples drawn randomly from the same population are likely to give slightly different results purely due to chance variation in who was selected. Random sampling from a population only guarantees that, on average, the results from successive samples will reflect the true population parameter, but the results of any particular sample may differ from those of the parent population, and sometimes substantially. The unintended differences that arise between study findings simply because of their different samples are variously known as “sampling error”, “sampling variation” or “chance error”. But at least we can estimate the likelihood of error in extrapolating, or generalizing, results from a sample to the broader population using inferential statistics such as p-values and confidence intervals (see Additional Material: Statistics).
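Sampling variation is easy to see in a small simulation. The sketch below assumes a hypothetical population of birthweights (mean 3400 g, standard deviation 500 g, normally distributed); these figures are invented for illustration. It draws many random samples of a given size and reports how much the sample means vary.

```python
import random
from statistics import fmean, stdev


def spread_of_sample_means(sample_size, n_samples=200, seed=1):
    """Standard deviation of the means of repeated random samples."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    sample_means = []
    for _ in range(n_samples):
        # Hypothetical population: mean 3400 g, SD 500 g.
        sample = [rng.gauss(3400, 500) for _ in range(sample_size)]
        sample_means.append(fmean(sample))
    return stdev(sample_means)


for n in (25, 400):
    print(f"n = {n}: sample means vary with SD of about "
          f"{spread_of_sample_means(n):.0f} g")
```

Each sample is drawn from the same population, yet the means differ; the larger samples simply differ less.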


In statistical terminology, a parameter is the true value in the population (e.g. average birth weight); this is the value that a study sample is being used to estimate. From the population parameter, you can interpret values for your patient: Is this child’s birth weight reassuringly normal for this population?

Parameters in the population are usually designated by Greek letters, while sample estimates of these parameters are shown by Latin letters:

population mean = μ (pronounced “mu”)
sample estimate of the mean = x̄ (called “x-bar”)
population standard deviation = σ (“sigma”)
sample estimate of the standard deviation = s.


Statistics is the branch of mathematics based on probability theory that deals with analyses of numerical data drawn from samples. It is known as biostatistics when it is applied to biological research. Inferential statistics estimate the likely extent of errors that may arise in applying conclusions from a small study sample to the broader population from which the sample was drawn. In this Primer we provide only a very brief overview of the statistical methods most relevant to evidence-based medicine; you will have to turn to a statistics text for more information.

Estimating a parameter

The confidence interval (or CI) is a statistic used to indicate the probable extent of error in a parameter estimate derived from a sample. An expression such as “mean systolic blood pressure was 120 mmHg (95% CI: 114, 126 mmHg)” indicates that the mean value in the study sample was 120 mmHg, but owing to possible sampling error the actual value in the broader population may not be precisely 120. Based on the size of the sample used and the variability of BP readings within it, we can be 95% confident that the population mean lies somewhere between 114 and 126 mmHg (strictly, 95% of intervals calculated in this way from repeated samples would contain the true mean). The confidence interval can be represented graphically as a line or error-bar, as seen in Figure 5.8.

Like mean values, odds ratios and relative risks are also reported with confidence intervals. The confidence interval around an odds ratio indicates the likely range within which the true value lies, and also shows whether or not the association is statistically significant. For example, a relative risk of 1.4 (95% CI 0.8, 2.1) means that we can be 95% confident that the true relative risk lies somewhere in the (very wide) range of 0.8 to 2.1: not a lot of confidence! Furthermore, because 1.0 falls within this range, it is quite possible that there is no association in the population.
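A confidence interval for a mean can be sketched as follows. The standard deviation (15 mmHg) and sample size (24) are hypothetical values chosen to reproduce the interval quoted above. The sketch uses the normal approximation; for a sample this small, a statistics text would normally recommend the t distribution, which gives a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


def interval_includes_no_association(lower, upper):
    """For a ratio measure (RR or OR), 1.0 means no association."""
    return lower <= 1.0 <= upper


lower, upper = mean_confidence_interval(120, 15, 24)
print(f"95% CI: {lower:.0f} to {upper:.0f} mmHg")    # 114 to 126 mmHg

# The relative risk example from the text: 1.4 (95% CI 0.8, 2.1).
print(interval_includes_no_association(0.8, 2.1))    # True
```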

Significance of differences

Consider an RCT that compares mean blood pressures in a treatment (or intervention) group of hypertensives and a control group. The study hypothesis predicts a difference in BP between the groups, due to the intervention. As a physician, you are not so much interested in this particular study sample, but in how closely the results represent a broader population value, including patients in your practice. Evidently, if the study result was in error due to chance alone, you should not base your practice on those results! Biostatistics provide ways to estimate the probability (hence, p-value) that a study result, such as the difference in BP reported in this study, might have occurred merely by chance, due to random sampling variation.

To estimate the probability that a finding from a study sample might not represent reality in the broader population, we first have to set a threshold for deciding when to consider a finding “real” rather than a chance finding. Intuitively, the larger the sample size and the larger the difference found in mean blood pressures, the more confident we would be that the difference would hold true if the study were to be repeated on another sample, or on your patients. So, the researcher first indicates the probability threshold (the significance level) that will be used to distinguish between results (here the contrast in mean blood pressure) that may be a chance finding in this particular sample, versus a difference that will be considered real or “statistically significant”. The threshold chosen is usually p < 0.05, or 5% – arbitrary but commonly used. [Now, take a deep breath for the next bit …] A p < .05 here means that the probability of getting a contrast in BP as great as, or greater than, the observed result is less than 5%, if the “null hypothesis” is actually true and there would be no difference in BP if the treatment were tested on other samples of hypertensives. (This is the chance of a false positive result.) When a statistical analysis shows a p-value less than .05, the difference would be considered statistically significant. The actual formula used to calculate the p-value depends on various elements in the study design; guidance will be found in a textbook of biostatistics (or maybe from your colleague with an MSc in epidemiology). If the results of a statistical test suggest that the difference may plausibly have occurred by chance (for example, p = 0.06 or p = 0.10), the blood pressure researcher should conclude that there was insufficient evidence that the therapy lowered blood pressure.
But even if p < .05, bear in mind that a difference this large can still arise by chance when the null hypothesis is true, so the conclusion of a real effect may be wrong (see Here Be Dragons, and also the Significant statistical limitations box).
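One way to make the logic of a p-value concrete is a permutation test: if the null hypothesis is true, the group labels are interchangeable, so we can shuffle them many times and count how often chance alone produces a difference at least as large as the one observed. This is only one of several ways to obtain a p-value and is not a method prescribed in this chapter. The blood pressure readings below are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided p-value for a difference in means, by shuffling group labels."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical systolic BP readings (mmHg) at the end of a small trial.
treated = [138, 142, 135, 140, 137, 133, 139, 136]
control = [148, 151, 145, 150, 147, 152, 146, 149]

p = permutation_p_value(treated, control)
print("statistically significant" if p < 0.05 else "not significant")
```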

Statistical and clinical significance are different

To repeat: the fact that a difference (e.g., between patients treated with the anti-hypertensive and others on placebo) is statistically significant only tells you that the probability of observing a difference at least that large is lower than some cut-off (typically 5%) if the truth is that there is no difference between treatments in the population (i.e., if the null hypothesis is true). Statistical significance does not directly tell you about the magnitude of the difference, which is important in reaching your clinical decision. For example a drop of 2 mmHg in a large trial of antihypertensive therapy might be statistically significant, but may be too small to have clinical importance.

For a study finding to alter your clinical practice, the result must be both statistically significant and clinically important. This thinking resembles that behind the Number Needed to Treat statistic, which also offers a way to summarize the amount of improvement produced by a treatment, rather than just whether it was statistically significant.

To delve deeper, see the box “Significant statistical limitations”.

Significant statistical limitations

When a statistical test shows no significant difference between two groups, this means either that there really is no difference in the population or there may be a difference, but the sample did not reveal it. This could occur because the sample size was not large enough to demonstrate it with confidence (the sample lacked the “power” to detect the actual difference; the confidence interval would tend to be wide). It is intuitively clear that a larger sample will give more precision in any estimate; indeed, if you study the whole population there is no need for confidence intervals or statistical significance because you have measured the actual parameter.

The smaller the true difference (such as between patients treated with a new BP medication and those treated using the conventional therapy), the larger the sample size that will be needed to detect it with confidence. Turning this idea around, if a very large sample size is required to demonstrate a statistically significant difference, the difference must be very small, so you should ponder whether a difference that small is clinically important.
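The link between the size of the difference and the sample size needed can be sketched with the standard formula for comparing two means, n per group = 2 (z for alpha + z for power)² σ² / δ². The standard deviation of 15 mmHg, the 5% significance level and the 80% power are assumptions made for this illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate patients per group needed to detect a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / difference ** 2)


# Hypothetical true differences in mean BP (mmHg), with SD = 15 mmHg.
for difference in (10, 5, 2):
    n = sample_size_per_group(difference, sd=15)
    print(f"to detect {difference} mmHg: about {n} patients per group")
```

Halving the difference to be detected roughly quadruples the sample size required.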

The limits to inferential statistics

A crucial point to recognize is that inferential statistics suggest the level of confidence in generalizing from a random sample of a given size to the population from which it was drawn. But for evidence-based medicine, we frequently wish to generalize to other populations, sometimes in other countries, as illustrated in Figure 5.5. Assessing the validity of this more distant extrapolation requires additional insights into the comparability of the populations and the nature of the topic under study. This information cannot be provided by statistics, yet it forms a critical consideration in evidence-based medicine. Judging the applicability of research results to your patient(s) requires your own knowledge of the similarity or difference between the study sample and your own practice (more on this further on). This introduces the topics of EXTERNAL VALIDITY and sampling bias. The following sections introduce critical appraisal skills for detecting biases in information.

Figure 5.5: Extrapolation from a sample to the target population

Sources of Error in Studies


Bias, or the systematic deviation of results or inferences from the truth, is a danger in the design of any study.4 Special care is taken by researchers to avoid (or, failing that, to control for) numerous types of bias that have been identified.11 The many possible biases may be grouped into two broad categories: sampling biases (derived from the way that persons in the study were selected) and measurement biases (due to errors in the way that exposures or outcomes were measured).

Sampling (or selection) bias

Simple random sampling aims to select a truly representative sample of people from a broader population; a more formal definition of the idea is that everyone in the population has an equal (and non-zero) chance of being selected. This is especially important in descriptive studies such as those that estimate prevalence. It may be less important in analytic studies that seek to test a theoretical prediction.12 For instance, a researcher who wants to study the association between obesity and arthritis might be justified in drawing her sample from a population at high risk of obesity in order to get adequate numbers of obese and very obese people in her study.

For practical reasons, very few studies are able to sample randomly from the entire target population, so the researcher usually defines a narrower “sampling frame”, which is assumed to be similar to the entire population. The researcher then draws a sample from that frame. For example, Dr. Rao might sample patients attending Weenigo General Hospital in order to make inferences about patients attending similar hospitals. Sampling bias may then occur at two stages: first, in the choice of sampling frame, because patients attending Weenigo General may differ from patients attending other hospitals in the region and, second, in the method used to draw the sample of patients attending the hospital.

Sampling bias chiefly arises when samples are not drawn randomly. For example, a newspaper advertisement that reads “Wanted: Participants for a study of blood pressure” might attract retired or unemployed people who have the time to volunteer, especially those who have a personal interest in the topic (perhaps they have a family history of hypertension). If these characteristics are, in turn, associated with blood pressure level, an estimate of the population mean BP based on this sample will be biased. Much research is undertaken in teaching hospitals, but patients seen in these centres differ systematically from patients with the same diagnosis seen in rural hospitals: they tend to be sicker and to have more co-morbidities, and conventional therapy has often failed, leading to their referral to tertiary care centres. As a result, such studies may yield findings that differ from those that would be seen in the population of all people with the disease. This is a specific form of selection bias known as referral bias.

A magnetic study

Dr. Rao notes that the study on static magnets for menopausal symptoms assembled its sample by placing an advertisement offering women a free trial of the magnet. He worries that women who volunteered may have been predisposed to believing that the magnet works, and that their belief may have been created by the advertisement itself. They may not be representative of all women with menopausal symptoms so the results may not be generalizable.

A biased election poll

During the 1948 U.S. presidential election, a Gallup poll predicted that Dewey, a Republican, was going to win the election against Truman, a Democrat, by a margin of over 10 percentage points. As it turned out, Truman won by 4.4 percentage points. Among the reasons for this poor prediction was the fact that the poll was carried out by telephone. As telephone ownership at the time was limited, and as richer people were more likely to own a phone and also to vote Republican, the sample was probably biased in favour of Republican supporters. This is an example of a biased sampling frame that selected for a confounding variable (wealth) and that led to a false conclusion (see the section on Confounding).

Non-response bias

Even if the sampling method is unbiased, not everyone selected will actually participate. If particular types of people do not participate, this can bias the study results. One way to estimate the likelihood of a non-response bias is to compare characteristics of participants, such as their age, sex and where they live, with those of people who refused to participate. However, even if these characteristics match, it does not rule out bias on other characteristics that were not recorded. Those who chose not to respond likely differ in attitudes from those who did, and you will not know how they would have responded. In critically appraising an article you should review the efforts made by the investigator to reduce non-response and judge its likely impact on results.

Information bias: systematic measurement errors

Measurement error refers to deviations of recorded values on a measurement from the true values for individuals in the study. As with sampling error, measurement errors may be random or systematic. For example, social desirability bias refers to systematic response errors whereby people tend to answer questions in a way that will be viewed favourably by others (such as their doctor). Most people report that they are more physically active than the average person, which is illogical. Men tend to exaggerate their height and under-estimate their weight.13 Other measurement biases arise from flaws in the questionnaire design: for example, asking people about their physical activity in certain months only may give a biased estimate of their yearly activity level because of seasonal variations in physical activity. Recall bias commonly occurs in surveys and especially in case-control studies: people’s memories often err. For example, on questionnaires significantly more women report having had a mammogram within the past two years than is shown in mammography clinic records.

Bigger, but no less biased

Increasing the sample size or taking more measurements can minimize random sampling and measurement errors but will have no effect on systematic errors; a biased sample or measurement will remain biased no matter how many subjects participate. A large, biased study may be more misleading than a small one!

Figure 5.6: The impact of random and systematic study and measurement errors on an estimate.

In the figure, the + sign represents the unknown parameter we are trying to estimate; each red dot represents an estimate of the parameter obtained from a sample (it can equally represent a data point from a repeated measurement). The upper sections of the figure illustrate the presence of systematic error and the sample estimates are off-target or biased. In the presence of systematic error, increasing the sample size or making additional (biased) measurements will not bring the study results closer to the truth; it may simply mislead you into believing that they are more accurate. Increasing the sample sizes or the numbers of estimates in the lower section, where there is little systematic error, will reduce the uncertainty of the estimates.
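The same point can be shown with a simulation. The sketch below assumes a true population mean BP of 120 mmHg with a standard deviation of 15 mmHg, and a hypothetical measurement bias of +5 mmHg (say, a miscalibrated cuff). All of these values are invented for illustration.

```python
import random
from statistics import fmean

TRUE_MEAN_BP = 120.0   # hypothetical population parameter (mmHg)


def estimate_mean_bp(sample_size, systematic_error=0.0, seed=2):
    """Mean of simulated BP readings, optionally shifted by a constant bias."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    readings = [rng.gauss(TRUE_MEAN_BP, 15.0) + systematic_error
                for _ in range(sample_size)]
    return fmean(readings)


for n in (20, 20000):
    unbiased = estimate_mean_bp(n)
    biased = estimate_mean_bp(n, systematic_error=5.0)
    print(f"n = {n:>5}: unbiased estimate {unbiased:.1f}, "
          f"biased estimate {biased:.1f} (truth is {TRUE_MEAN_BP:.0f})")
```

The larger sample brings the unbiased estimate close to 120, but the biased estimate settles just as firmly on the wrong value.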

The patterns in the figure above can also be useful in understanding test and measurement validity and reliability, as discussed in Chapter 6. For this, substitute the word “validity” for “systematic error” and “reliability” for “random error.”

Information Bias: the Objectivity of the Researcher

Dr. Rao judges the evidence

When Dr. Rao reads a study report that suggests a relationship between an exposure and an outcome, he needs to be reasonably sure that the results are “true.” By looking at peer-reviewed journals, Dr. Rao may be reassured about the statistical analysis, but he should still consider other possible explanations for the findings before accepting them as true. This is why he searched for information on the author of the article on magnets and menopause. Was the author in the medical products business, perhaps selling magnets?

Whether looking at print or Internet information, you should try to find out as much as possible about the source, to check its credibility and to identify possible conflicts of interest. Trials published by people with a financial stake in the product under investigation are more likely to conclude in favour of the product than are trials published by people with no financial interest. The U.S. Food and Drug Administration and the Federal Trade Commission have proposed questions to ask when judging information sources:

  1. Who is behind it?
  2. Why is the information being published?
  3. Where does the information on the website come from?
  4. Are the claims well documented?
  5. Who is responsible for the information?
  6. How old is the information?

This section has mentioned only a few of the many potential types of study bias. Several authors have undertaken more extensive reviews, as shown in the Nerd’s corner box.

Many types of bias

Epidemiologists have long been fascinated by bias (perhaps trying to prove they are objective?) and in 1979, David Sackett catalogued over one hundred named biases that can occur at different stages in a research project. Here are the main headings in Sackett’s catalogue:11

Biases in the Literature Review
– Selective choice of articles to cite

Study Design biases
– Selection bias
– Sampling frame bias
– Non-random sampling bias
– Non-coverage bias
– Non-comparability bias

Study Execution:  Data Collection
– Instrument bias
– Data source bias
– Subject bias
– Recall bias
– Data handling bias

Study Execution:  Data Analysis
– Confounding bias
– Analysis strategy bias
– Post hoc analysis bias

Biased Interpretation of Results
– Discounting results that do not fit the researcher’s hypothesis
– Non-publication of negative findings.

Real nerds can look up the original article and review subsequent literature to complete the catalogue. Half-hearted nerds should remember that systematic error can creep in at any stage of a research project, so that research reports should be read critically; the reader should question what happened at every stage and judge how this might affect the results.


Confounding

As noted above, the possibility of confounding forms a major challenge in deciding whether a statistical association represents a true relationship – especially important in interpreting causal relationships. As an example, a study in the 1960s reported a significant tendency for Down syndrome to be more common in fourth-born or higher order children.14 There was no obvious sampling or measurement bias and the result was statistically significant, so would you believe it? Your answer may be “yes” in terms of the existence of an association, but “no” if the implication is a causal one. In other words, birth order may be a risk marker, but not a risk (or causal) factor.

Figure 5.7: An example of confounding

Confounding arises when a third variable (or fourth or fifth variable, etc.) in a causal web is associated with both an exposure and the outcome being studied, as shown in Figure 5.7. If this third variable is not taken into account in the study analysis, conclusions about the relationship between exposure and outcome may be misinterpreted. In the Down syndrome example, the mother’s age is a confounding factor in that fourth-born and higher order infants tend to be born to older mothers, and advancing maternal age is an independent risk factor for Down syndrome. In most scientific articles, the first table compares the study groups (e.g., mothers with a Down infant and others without) on a range of variables that could affect the outcome, such as mean maternal age at the child’s birth. This allows the reader to determine whether any of these variables differs between the study groups, and so act as a potential confounding factor that should be adjusted in the analysis.

Confounded hormones

Before 1990, several observational studies concluded that post-menopausal women who took hormone replacement therapy were less likely to develop cardiovascular problems than those who did not. It was therefore recommended that post-menopausal women should take hormone replacement therapy. However, a subsequent randomized trial, the Women’s Health Initiative study, showed quite the opposite: hormone replacement therapy was linked to an increase in cardiovascular disease. HRT recommendations were quickly changed.

It seems likely that the earlier observational studies were biased by self-selection into the hormone group, and the bias was linked to social status which was acting as a confounding factor: women of higher social status were more likely to take hormone replacement therapy and also less likely to have heart disease.15

Dealing with confounding

Confounding can be minimized at the design stage of a study, or at the analysis stage, or both.

In experimental designs, random allocation to intervention and control groups is the most attractive way to deal with confounding. This is because random allocation should ensure that all characteristics are equal between the study groups—the more so if the groups are large. Nonetheless, all factors that may confound results should be measured and compared in each group at the start of a study. This should be reported to enable the reader to judge whether, despite randomization, potential confounders were more prevalent in one group than the other.

To complement randomization, the study sample can be restricted, for instance, to one sex or to a narrow age range. This reduces the confounding effect of the factors used to restrict the sample, but it limits the study’s generalizability as the results apply only to that restricted population. Another design strategy is matching: that is, the deliberate selection of subjects so that the level of known confounders is equal in all groups to be compared. For example, if sex, age, and smoking status are suspected confounders in a cohort study, the researcher records these characteristics in the exposed group and then samples people in the unexposed group who are similar in terms of these factors.

At the analysis stage of a study, stratification can be used to examine confounding. In stratification, the association between exposure and outcome is examined within strata formed by the suspected confounder, such as age. Published reports often mention a Mantel-Haenszel analysis, which is a weighted average of the relative risks in the various strata. If differences arise between the stratum-specific estimates and the crude (unadjusted) overall estimate, this suggests confounding. Another analytic strategy uses multivariable modelling techniques, such as logistic regression, to adjust a point estimate to remove the effects of confounding variables. The underlying concept of multivariable modelling is similar to that of standardization (see Chapter 6), a technique used to adjust for differing demographic compositions of populations being compared.
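Stratification and the Mantel-Haenszel summary can be sketched in a few lines. The counts below are invented to mimic the hormone therapy example (social status as the confounder); the class and function names are hypothetical, and the formula is the standard Mantel-Haenszel relative risk for cumulative-incidence data.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def relative_risk(s):
    """Risk in the exposed divided by risk in the unexposed."""
    return (s.exposed_cases / s.exposed_total) / (s.unexposed_cases / s.unexposed_total)


def crude_relative_risk(strata):
    """Collapse all strata into one table, ignoring the confounder."""
    pooled = Stratum(
        "all",
        sum(s.exposed_cases for s in strata),
        sum(s.exposed_total for s in strata),
        sum(s.unexposed_cases for s in strata),
        sum(s.unexposed_total for s in strata),
    )
    return relative_risk(pooled)


def mantel_haenszel_rr(strata):
    """Weighted average of the stratum-specific relative risks."""
    num = sum(s.exposed_cases * s.unexposed_total / (s.exposed_total + s.unexposed_total)
              for s in strata)
    den = sum(s.unexposed_cases * s.exposed_total / (s.exposed_total + s.unexposed_total)
              for s in strata)
    return num / den


# Hypothetical cohort: exposure is more common in the low-risk (high status) stratum.
strata = [
    Stratum("high social status", 40, 800, 10, 200),
    Stratum("low social status", 30, 200, 120, 800),
]

for s in strata:
    print(s.name, round(relative_risk(s), 2))
print("crude RR:", round(crude_relative_risk(strata), 2))
print("Mantel-Haenszel RR:", round(mantel_haenszel_rr(strata), 2))
```

In this made-up dataset the crude estimate suggests a protective effect, while the stratum-specific and Mantel-Haenszel estimates show none. That gap between crude and adjusted estimates is the signature of confounding.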

Beware: selection and measurement biases cannot be corrected at the analysis stage. Here, only careful sample selection and the use of standardized measurement procedures can minimize these biases.

The Hierarchy of Evidence

Because some study designs provide more reliable evidence than others, the idea of a hierarchy of evidence was proposed in 1979. This arose as a by-product of the Task Force on the Periodic Health Examination that made recommendations on routine screening and preventive interventions. The hierarchy has been modified over the years16 and here is a generic version, starting with the highest quality of evidence:

I Evidence from systematic reviews or meta-analyses
II Evidence from a well-designed controlled trial
III Evidence from well-designed cohort studies, preferably from more than one centre or research group
IV Evidence from well-designed case-control studies, preferably from more than one centre or research group
V Evidence obtained from multiple time-series studies, with or without the intervention. Dramatic results in uncontrolled experiments (e.g., first use of penicillin in the 1940s) are also included in this category
VI Opinions of respected authorities, based on clinical experience, descriptive studies, reports of expert committees, consensus conferences, etc.

GRADEing studies

Broadening the basis for reviewing study quality, the GRADE system has been proposed for those who review evidence in preparing clinical guidelines. It considers four aspects: study design (including susceptibility to bias); study quality (including the precision of estimates); consistency of results across studies; and “directness”, meaning how comparable the samples studied are to the patients to whom the results will be applied. These judgments are combined into four categories of evidence quality: High (further research would be unlikely to change the estimate of effect); Moderate (further research is likely to change our confidence in the current estimate of effect, and may change the estimate itself); Low (further research is likely to change the estimate); Very Low (any estimate of effect is very uncertain).17

Systematic reviews

A common source of bias in summarizing the literature is the omission of some studies from consideration. These could be studies undertaken in countries outside one’s own, or studies published in less well-known journals. While such omissions may simplify the task of summarizing the literature, they can also lead to bias, often by excluding studies that provide discordant views. Systematic reviews aim to identify all relevant studies related to a given treatment or intervention, to evaluate their quality, and to summarize all of the findings. A key feature is the comprehensiveness of the literature review; conclusions should be based on the whole literature, often including the “grey” literature of reports published as working documents or internal reports.18 A systematic review must follow a rigorous and explicit search procedure that can be replicated by others. The author of a systematic review formulates a narrative summary of the combined study findings (as in the Cochrane Reviews, see below, or in reviews by UpToDate); where the articles reviewed are similar enough, their data may also be pooled in a meta-analysis.


A meta-analysis provides a statistical synthesis of data from separate yet comparable studies. It is generally accepted that a meta-analysis of several randomized controlled trials offers better evidence than a single trial. The analysis can either pool data from each person in the various studies and re-analyze the combined dataset, or else aggregate the published results from each study, producing an overall, combined numerical estimate of effect. This is normally weighted according to the relative sizes of the studies, and sometimes also by a judgment of their quality. Meta-analyses can be applied to any study design, but if design differences mean that data cannot be pooled, results of different studies may be summarized in a narrative review or presented in a forest plot, as shown in Figure 5.8.
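To make the idea of a weighted, combined estimate concrete, here is a minimal sketch of one common approach: fixed-effect, inverse-variance pooling of odds ratios on the log scale. It is not necessarily the method used in any particular review. The four sets of case-control counts are invented, and the function names are hypothetical.

```python
import math


def log_or_and_variance(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Log odds ratio of a 2x2 case-control table and its approximate variance."""
    log_or = math.log((exposed_cases * unexposed_controls) /
                      (unexposed_cases * exposed_controls))
    variance = (1 / exposed_cases + 1 / unexposed_cases +
                1 / exposed_controls + 1 / unexposed_controls)
    return log_or, variance


def pooled_odds_ratio(studies):
    """Fixed-effect pooled odds ratio with a 95% confidence interval.

    Each study is weighted by the inverse of its variance, so larger
    studies (smaller variance) contribute more to the pooled estimate.
    """
    estimates = [log_or_and_variance(*study) for study in studies]
    weights = [1 / variance for _, variance in estimates]
    pooled_log = sum(w * log_or for w, (log_or, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - 1.96 * se),
            math.exp(pooled_log + 1.96 * se))


# Hypothetical studies: (exposed cases, unexposed cases, exposed controls, unexposed controls)
STUDIES = [
    (30, 70, 20, 80),
    (12, 38, 10, 40),
    (90, 210, 60, 240),
    (8, 12, 6, 14),
]

pooled, lower, upper = pooled_odds_ratio(STUDIES)
print(f"pooled OR = {pooled:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The pooled estimate corresponds to the diamond in a forest plot, and the individual study estimates to the squares. A random-effects model would be a reasonable alternative when the studies differ more than chance alone would explain.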

Figure 5.8: An example forest plot. This compares odds ratio estimates for an exposure from four case-control studies (square symbols with confidence intervals shown by the horizontal lines), and the pooled meta-analysis result (diamond symbol) from the four studies. The size of the squares indicates the relative sample sizes in each study. The vertical line marks the odds ratio of 1.0, indicating no difference in risk between the study groups. Results to the left of the vertical line indicate a reduction in risk (OR < 1.0); results to the right indicate an increase (OR > 1.0).

The Cochrane Collaboration

Systematic reviews and meta-analyses are normally undertaken by specialized content and research experts; they often work in teams such as those assembled through the Cochrane Collaboration. This international organization helps scientists, physicians, and policy makers make well-informed decisions by coordinating systematic reviews of the effects of health care interventions. The reviews are published electronically in the Cochrane Database of Systematic Reviews. An early example of a systematic review was that on the use of corticosteroids for mothers in premature labour to accelerate fetal lung maturation and prevent neonatal respiratory distress syndrome. Babies born very prematurely are at high risk of respiratory distress owing to their immature lungs, a significant cause of morbidity and mortality. The results of 21 randomized trials gave no indication that corticosteroid treatment increased risk to the mother, and showed a 30% reduction in neonatal death and similar benefits on a range of other outcomes. Antenatal corticosteroid therapy was therefore widely adopted to accelerate fetal lung maturation in women at risk of preterm birth.

Meta-analyses are now considered to provide the highest level of evidence, so this has changed the original Canadian Task Force hierarchy of evidence. With the idea of literature review in mind, level I of the hierarchy of evidence shown above is now sub-divided into

1.1 Cochrane reviews
1.2 Systematic reviews
1.3 Evidence-based guidelines
1.4 Evidence summaries.

The Final Step: Applying Study Results to your Patients

The topic of systematic reviews brings us full circle, back to critical appraisal of the literature and evidence-based medicine. Judging whether a study produced results that should guide your care begins with formulating an overall judgment of the quality of the study or studies, and for this there are several checklists. The original ones were developed at McMaster University and published in a series of papers in the Journal of the American Medical Association in 1993 and 1994. These described critical appraisal in judging evidence for causation, prognosis, the accuracy of diagnosis, and effectiveness of therapy. To illustrate the general format, we list some questions used to appraise an article that evaluated the effectiveness of a therapy.

Checklist for study quality

Study objectives

  • Was the study question clearly stated? For example, did it follow the PICO format?
  • Was the exposure clearly defined? Was the outcome clearly defined? Or were the study objectives and outcomes vaguely worded, as in “To describe the health effects of fast-food consumption”?

Are the results valid?

  • Was a suitable study design used? Were the patients randomized?
  • Were there systematic errors (biases) in the study execution? For example, was randomization concealed? Were patients aware of their group allocation? (Were the patients “blinded”?)
  • If a case-control study, was there a possible misclassification of patient outcomes?
  • Were clinicians who treated patients aware of the group allocation?
  • Were outcome assessors aware of group allocation? (Were the assessors blinded?)
  • Were patients in the treatment and control groups similar with respect to known prognostic variables? (For instance, were there similar numbers of smokers in each group in a study of asthma therapy?)
  • Was follow-up complete?
  • Were patients analyzed in the groups to which they were allocated?

What are the results?

  • How large was the treatment effect?
  • How precise was the estimate of the treatment effect?
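The two questions above, size and precision, can be illustrated with a short calculation. This is a sketch under stated assumptions: the trial counts are invented, the function name `treatment_effect` is hypothetical, and the confidence interval uses a simple normal approximation.

```python
import math


def treatment_effect(events_treated, n_treated, events_control, n_control):
    """Relative risk, absolute risk reduction (ARR) with 95% CI, and NNT."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    relative_risk = risk_treated / risk_control
    arr = risk_control - risk_treated
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(risk_treated * (1 - risk_treated) / n_treated +
                   risk_control * (1 - risk_control) / n_control)
    arr_ci = (arr - 1.96 * se, arr + 1.96 * se)
    # Number needed to treat is only meaningful when the treatment reduces risk.
    nnt = 1 / arr if arr > 0 else None
    return relative_risk, arr, arr_ci, nnt


# Hypothetical trial: 70 events among 1000 treated, 100 among 1000 controls.
rr, arr, arr_ci, nnt = treatment_effect(70, 1000, 100, 1000)
print(f"relative risk = {rr:.2f}")
print(f"absolute risk reduction = {arr:.3f} "
      f"(95% CI {arr_ci[0]:.3f} to {arr_ci[1]:.3f})")
print(f"number needed to treat = {math.ceil(nnt)}")
```

The relative risk describes the size of the effect, the width of the confidence interval describes its precision, and the number needed to treat (rounded up) expresses the absolute benefit in terms of patients.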

How can I apply the results to patient care in general?

  • Were the study patients similar to patients in my practice?
  • Are the likely treatment benefits worth the potential harm and costs?

What do I do with this patient?

  • What are the likely outcomes in this case?
  • Is this treatment what this patient wants?
  • Is the treatment available here?
  • Is the patient willing and able to have the treatment?

Once you are satisfied that a study provides a valid answer to a relevant clinical question, check that the results are applicable to your patient population.

Target population

Is the study population similar to your practice population, so that you can apply the findings to your own practice (illustrated in Figure 5.5)? Consider whether the gender, age group, ethnic group, life circumstances, and resources used in the study are similar to those in your practice population. For example, a study of the management of heart disease might draw a sample from a specialist cardiovascular clinic. Your patient in a family medicine centre is likely to have less severe disease than those attending the specialist clinic and, therefore, may respond differently to treatment. Narrow inclusion and broad exclusion criteria in the study may mean that very few of your patients are comparable to the patients studied. Furthermore, the ancillary care and other resources available in the specialist centre are also likely to be very different from those in a primary care setting. If these are an important part of management, their absence may erase the benefits of the treatment under study. Other aspects of the environment may also be different: the results of a study set in a large city that found exercise counselling effective may not apply to your patients in a rural area where there are fewer opportunities for conveniently integrating exercise into everyday life.


Is the intervention feasible in your practice setting? Do you have the expertise, training, and resources to carry out the intervention yourself? Can you refer your patient somewhere else, where the expertise and resources are available? In many cases, practical problems of this type mean that an intervention that has good efficacy in a trial does not prove as effective when implemented in usual practice. The enthusiasm and expertise of the pioneers account for some of this difference; the extra funds and resources used in research projects may also have an effect.

How much does it cost?

The cost includes the money needed to pay for the intervention itself, the time of the personnel needed to carry it out, the suffering it causes the patient, the money the patient will have to pay to get the intervention, ancillary medical services, transport, and time off work. An intervention that costs a great deal of time, money or suffering may not be acceptable in practice.

Intervention in the control group

What, if anything, was done for people in the control group? If they received nothing, including no attention by the researchers, could a placebo effect have accounted for part of the effect observed in the intervention group? In general, new interventions should be compared to standard treatment in the control group so the added benefits and costs of the new can be compared to those of the old treatment.

Your patient’s preferences

Finally, all management plans should respect patients’ preferences and abilities. The clinician should explain the risks and benefits of all treatments—especially novel ones—in terms the patient understands, and support the patient in choosing courses of action to reduce harm and maximize benefit.

The Limits to Evidence-Based Medicine

The central purpose of undertaking research studies in medicine is to guide practice. A fundamental challenge lies in the variability of human populations, so that a study undertaken at a different time and in a different place may or may not provide information relevant to treating the patient in front of you. This is the problem of induction, of generalization when the invariance assumption does not hold. This dilemma demands that the physician should neither unquestioningly apply research results to patient care, nor ignore the evidence of research findings in the rapidly evolving field of medicine (see the Nerd’s Corner box).

EBM has its critics

Several cautions have been raised against the unthinking application of results from empirical studies. Various authors with a philosophical bent have discussed the notion of what should constitute “evidence” in medicine, and especially how different types of evidence should be integrated in delivering optimal patient care. EBM grants priority to empirical evidence derived from clinical research, but how should this be combined with the physician’s clinical experience, with underlying theories of disease and healing, with the patient’s values and preferences, and with the real-life constraints of resources? These sources of insight differ in kind and may lie in tension, so it is not clear how the clinician resolves this.19

Other authors have addressed the logical conundrum of generalizing from study findings to a particular patient: clinical trials do not generate universal knowledge, but tell us about average results from particular (and often highly selected) samples studied. EBM does not supply clear guidelines as to how the clinician should decide how closely the patient at hand matches the patients in the study and so how relevant the study may be. Faced with this uncertainty, the clinician may feel that his or her personal clinical experience is more compelling than the general research evidence. Translational medicine is increasingly developing ways to guide medical decision-making using cost/benefit calculations.20

Other critics of EBM have focused on the primacy of the randomized trial. Worrall, for example, has noted that randomization is just one way, and an imperfect one, of controlling for confounding factors that might bias results. In the presence of large numbers of confounding factors it is unlikely that randomization will balance all of these between the study groups. Worrall concluded that any particular clinical trial will have at least one kind of bias, making the experimental group different from the control group in relevant ways. He argued, in effect, that RCTs are not inherently more reliable than a well-designed observational study.21

Related to this concern is the frequent failure to replicate the findings of clinical trials. Ioannidis examined 49 highly cited original research studies, 45 of which claimed that the intervention was effective. Of these, fewer than half (44%) were replicated; 16% were contradicted by subsequent studies; for a further 16%, subsequent studies found a smaller effect than the original; and the rest had not been repeated or challenged.2 Studies funded by pharmaceutical companies, even those of high quality, have been reported to be three to four times more likely to show the effectiveness of an intervention than studies funded from other sources. A further problem is publication bias: studies showing null results are less likely to be published. There is active debate over the future development of evidence-based medicine, and ways to combine it with other forms of evidence seem likely to be proposed in coming years.2

Self-test questions

1. In your capacity as advisor to the Minister of Health, how would you design and implement a series of studies to determine the relationship between personal stereo use and noise-induced hearing loss?

First, discuss study designs. Assuming that a randomized trial on humans would be unethical, what is the feasibility of a cohort study? What would the timeline be? If you resort to a case-control study, how would you collect information on listening volume? Do you actually need to study this at the individual level? Could you get a crude but useful approximation at the population level by correlating the incidence of deafness with sales of devices, so that you would not need to interview anyone?

Second, consider data collection: How accurate is self-reporting likely to be? Might people whose hearing is impaired give a biased report, given that they are now suffering a medical problem? Could you instead modify some personal stereo devices to automatically record the duration and volume of their use?


  1. National Institutes of Health. Citations added to MEDLINE by fiscal year. U.S. National Library of Medicine; 2017 [cited 2017, November]. Available from:
  2. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005;294(2):218-28.
  3. Lett J. A field guide to critical thinking. 1990 [cited 2017, November]. Available from:
  4. Porta M, editor. A dictionary of epidemiology. New York (NY): Oxford University Press; 2008.
  5. Sackett DL, et al. Evidence-based medicine – how to practice and teach EBM. London: Churchill-Livingstone; 2000.
  6. Schwandt TA. Qualitative inquiry: a dictionary of terms. Thousand Oaks (CA): Sage Publications; 1997.
  7. Nkwi P, Nyamongo I, Ryan G. Field research into socio-cultural issues: methodological guidelines. Yaounde, Cameroon: International Center for Applied Social Sciences, Research and Training/UNFPA, 2001.
  8. Richards L, Morse JM. Read me first for a users’ guide to qualitative methods. Thousand Oaks (CA): Sage Publications; 2007.
  9. Cockburn J, Hill D, Irwig L, De Luise T, Turnbull D, Schofield P. Development and validation of an instrument to measure satisfaction of participants at breast cancer screening programmes. Eur J Cancer. 1991;27(7):827-31.
  10. Smith GCS, Pell JP. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ. 2003;327(7429):1459-61.
  11. Sackett DL. Bias in analytic research. J Chron Dis. 1979;32:51.
  12. Miettinen OS. Theoretical epidemiology: principles of occurrence research in medicine. New York (NY): John Wiley; 1985.
  13. Connor-Gorber S, Shields M, Tremblay MS, McDowell I. The feasibility of establishing correction factors to adjust self-reported estimates of obesity. Health Reports. 2008;19(3):71-82.
  14. Renkonen KO, Donner M. Mongoloids: their mothers and sibships. Ann Med Exp Biol Fenn. 1964;42:139-44.
  15. Anderson GL, Judd HL, Kaunitz AM, et al. Effects of estrogen plus progestin on gynecologic cancers and associated diagnostic procedures: The Women’s Health Initiative randomized trial. JAMA. 2003;290:1739-48.
  16. Evans D. Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions. J Clin Nursing. 2003;12(1):77-84.
  17. GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328(7454):1490.
  18. Liberati A, Altman DG, Tetzlaff J, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med. 2009;151:W65-W94.
  19. Tonelli MR. Integrating evidence into clinical practice: an alternative to evidence-based approaches. Journal of Evaluation in Clinical Practice. 2006;12(3):248-56.
  20. Solomon M. Just a paradigm: evidence-based medicine in epistemological context. European Journal for Philosophy of Science. 2011;1:451-66.
  21. Worrall J. Evidence in medicine and Evidence-Based Medicine. Philosophy Compass. 2007;2(6):981-1022.