Chapter 5 Assessing Evidence and Information

After completing this chapter, the reader will be able to:

    1. Evaluate sources of data by applying methods of critical appraisal in order to practise evidence-based medicine;
    2. Describe the major categories of research data, comparing the strengths of qualitative and quantitative approaches;
    3. Describe criteria for assessing causation;
    4. Describe the strengths and limitations of the major categories of study designs:
      Experimental designs:
      •  Randomized controlled trials
      Observational designs:
      •  Cross-sectional
    5. Discuss different measures of association used in studies:
      •  Relative risk 
      •  Odds ratios
      •  Attributable risk
    6. Discuss the logic of statistical analysis:
      •  Study sampling 
      •  Measures of central tendency
      •  Inferential statistics
      •  Significance of differences
    7. Describe possible sources of error in studies:
      •  Sampling errors
      •  Measurement errors
      •  Bias
      •  Objectivity of the researcher 
      •  Confounding
    8. The hierarchy of quality of research evidence for evidence-based medicine:
      •  Systematic reviews
      •  Meta analyses
      •  Cochrane Collaboration
    9. Applying results to your patients;
    10. The limits to evidence-based medicine.

Linking these topics to the Medical Council exam objectives, especially section 78-2.

Magnets and menopause

Julie Richards is worried about her menopause. She gets hot flashes and feels generally tired. She fears the changes of her stage in life, including the risk of osteoporosis and cancers. She mentioned this to her daughter Audrey, who searched the internet and found lots of information about hormone therapy, calcium supplements and products such as evening primrose oil. She also read about physical exercise as a way of improving well-being. Julie Richards shows Dr. Rao the information her daughter found. In particular, Julie wants to know if a magnet will help her symptoms of menopause. She read about it on the web and shows the printout to Dr. Rao. The website gives quite a bit of information about menopause and cites some peer-reviewed articles that suggest that static magnets are effective in the treatment of dysmenorrhea.

Dr. Rao uses Medline and other sources to check this out. He finds that the website’s author set up and runs a private clinic specializing in menopause problems. Through Medline, Dr. Rao finds a number of articles on magnets in pain management. There is a systematic review of the evidence that concludes that magnets might be minimally effective in osteoarthritic pain, but of no demonstrated value in other types of pain. Promoters of the magnets say that their mechanism of action is either direct interference with nerve conduction, or action on small vessels to increase blood flow.

Assessing Medical Information

People have claimed the power to cure ills since the dawn of time. Some cures are based on science, in that their mode of action and the principles underlying them are known. Some are known empirically to improve health, but we do not fully understand how they work. Other treatments have been shown to have no benefit. Many have never been rigorously tested. Finally, some have been shown to have only a placebo effect, a benefit achieved by suggestion rather than by direct chemical action.

In 2016, an overwhelming 869,000 articles on medical research were indexed on MEDLINE.1 To guide medical practice, various agencies now review these publications and propose clinical guidelines based on the assembled evidence. However, guidelines do not exist for every condition, so clinicians have to understand the basics of how to critically appraise medical research articles. This is complicated by the discovery that many of the research results do not agree.2 There are various reasons for this, ranging from differences in the study design, to the perspective of the investigator, to unique characteristics of the study subjects, or to methods used to analyse study results. No study is perfect, yet the ideal of practising medicine based on evidence demands that clinicians base their decisions on the best available scientific evidence. Because some evidence is flawed, the clinician must be able to judge the validity of published information, and this forms the theme of critical appraisal of the literature.

Critical appraisal

Critical appraisal refers to judging the validity of the procedures used to collect data, identifying possible biases that may have arisen, assessing the adequacy of the analysis and completeness of reporting, evaluating the conclusions drawn, and reviewing the study’s compliance with ethical standards of research. Checklists help guide the critical appraisal process, but ultimately, clinicians must use their judgement to assess the study quality and its relevance to their particular clinical question.

Research evidence judged to be of good quality is cumulated to form the basis for evidence-based medicine. The steps in cumulating evidence include systematic reviews and meta-analyses, which will be described later on. The first step in critical appraisal is the application of common sense. This was summarised in 1990 as FiLCHeRS, which stands for Falsifiability, Logic, Comprehensiveness, Honesty, Replicability, and Sufficiency.3

Table 5.1: Standards for evaluating information quality under the acronym FiLCHeRS

Falsifiability: For a conclusion to be based on evidence (rather than belief) it must be possible to conceive of evidence that would prove the claim false (for example, it would be possible to show that magnets do not reduce menopausal symptoms, but there is no logical way of proving that God does not exist).
Logic: Arguments in support of a claim must be logically coherent (one cannot claim a biological effect of the magnets based on the relief felt by people who use them).
Comprehensiveness: The evidence offered in support of any claim must be exhaustive: all of the available evidence must be considered; one cannot simply ignore evidence to the contrary.
Honesty: The evidence offered in support of any claim must be evaluated with an open mind and without self-deception.
Replicability: It must be possible for subsequent experiments or trials to obtain similar results.
Sufficiency: The evidence offered in support of any claim must be adequate to establish the truth of that claim, with these stipulations:
– the burden of proof for any claim rests on the claimant;
– extraordinary claims demand extraordinary evidence; and
– evidence based on authority or testimony is inadequate for most claims, especially those which seem unlikely to be true.

Evidence-based medicine

Evidence-based medicine (EBM) refers to “the consistent use of current best evidence derived from published clinical and epidemiologic research in management of patients, with attention to the balance of risks and benefits of diagnostic tests and alternative treatment regimens, taking account of each patient’s unique circumstances, including baseline risk, co-morbid conditions and personal preferences”.4

In practice, EBM means integrating one’s clinical experience with the best available external clinical evidence from systematic research. The approach was primarily developed in Canada by Dr. David Sackett and colleagues at McMaster University during the 1970s, and is now recognized as a key foundation of medical practice.5 Sackett described evidence-based medicine as (1) the process of finding relevant information in medical literature to address a specific clinical problem, (2) the application of simple rules of science and common sense to determine the validity of the information, and (3) the application of the information to a clearly formulated clinical question. The aim was to ensure that patient care is based on evidence derived from the best available studies. Sackett argued that the “art of medicine” lies in taking the results of several sources of evidence and interpreting them for the benefit of individual patients: the opposite of what he called “cookbook medicine”, which follows a fixed recipe. The approach has subsequently been applied beyond clinical medicine to propose, for example, evidence-based public health and evidence-based policy making.

Mnemonic: The 5 As of evidence-based medicine

Here is a sequence that a clinician may follow in applying evidence-based medicine in deciding how to handle a challenging clinical case:

  • Assess: Recognize and prioritize the patient’s problems.
  • Ask: Construct clinical questions that facilitate efficient searching for evidence in the literature. These usually follow the PICO format: describe the Patient; describe the Intervention; with what will it be Compared, and what is the Outcome being sought? For example: “Will a 48-year-old peri-menopausal woman who wears a magnetic wrist bracelet experience fewer night sweats compared to a similar woman who does not wear such a magnet?”
  • Acquire: Gather evidence from quality sources. Librarians are helpful at this stage.
  • Appraise: Evaluate the evidence for its validity, importance, and usefulness (in particular, does it answer your PICO question?)
  • Apply: Apply to the patient, taking account of their preferences and values and of the clinical circumstances.

Appraising Scientific Evidence: Qualitative and Quantitative Research

EBM balances scientific evidence derived from groups of people against the unique characteristics of the patient: blending the science and art of medicine. Similarly, scientific evidence can be drawn from a combination of quantitative and qualitative research. Qualitative research uses non-numerical observations to offer detailed insight into individual cases, and can often address “Why?” questions (Why is this patient not adhering to his treatment?). Quantitative methods use data that can be counted or converted into numerical form, and generally address “How?” questions (How effective is this treatment, compared to a placebo?). Table 5.2 summarizes the different purposes of each approach, which most researchers view as complementary, leading to a “mixed methods” approach.

Qualitative variables, or qualitative studies?

Quantitative studies often examine qualitative variables. For example, a researcher who wants to know about satisfaction with services might ask a question such as: “How satisfied were you with the care you received?” The researcher might use an answer scale that allows the response to be crudely quantified, perhaps through a series of statements: very satisfied, satisfied, unsatisfied, or very unsatisfied. These could be scored 1, 2, 3 or 4 and the researcher could report the mode or the median score. The study, although measuring the quality of something, expresses the results as numbers and is, therefore, generally considered a quantitative study.

Meanwhile, a qualitative study of satisfaction might involve a focus group of patients where a group facilitator asks them about satisfaction with care but allows the participants to talk about what they consider important to their satisfaction, and then asks follow-up questions to explore their ideas in depth. The information produced is then reviewed and analyzed to identify common themes and sub-themes arising from the focus group discussion.

Table 5.2: Comparison of qualitative and quantitative research methods

Qualitative research | Quantitative research
Generates hypotheses | States and tests hypotheses
Is generally inductive (works from the particular instance to a general conclusion) | Is generally deductive (refers to a general theory to generate a particular explanation)
Captures rich, contextual, and detailed information from a small number of participants | Provides numeric estimates of frequency, severity, and associations from a large number of participants
Focuses on studying the range of ideas; sampling aims to provide a representative coverage of ideas or concepts | Focuses on studying the range of people; sampling provides representative coverage of people in a population
Answers “why?” and “what does it mean?” questions | Answers “what?”, “how much?” or “how many?” questions
Example of a study question: What is the experience of being treated for breast cancer? | Example of a study question: Does this treatment for breast cancer reduce mortality and improve quality of life?

When numbers do not measure up

In quantitative studies, numbers can be used to categorize responses to qualitative questions, such as “How satisfied were you?” [Answer 1= very unsatisfied to 4 = very satisfied]. Beware: these numbers are arbitrary, and we cannot claim that they represent an even gradient of satisfaction. In technical jargon, these are “ordinal” numbers (equivalent to house numbers along a street); the change in satisfaction between each number is not necessarily equal (see SCALES OF MEASUREMENT in Glossary). Accordingly, such data have to be analysed using nonparametric statistical methods – for example, using a median rather than a mean (see PARAMETRIC in Glossary).

By contrast, measuring body temperature forms an “interval” measure, in which the amount of change in temperature is equal across successive numbers on the scale. Data from such measurements can be analysed using parametric statistics: mean values can legitimately be calculated.
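The contrast between ordinal and interval data can be illustrated with a short Python sketch. Everything in it is invented for illustration: the responses, the temperatures, and the alternative "stretched" coding are hypothetical, and the coding follows the 1 = very unsatisfied to 4 = very satisfied convention used in this box. The point is that the median category survives any order-preserving recoding of the answers, whereas the mean shifts with the arbitrary choice of codes.

```python
from statistics import mean, median, mode

# Hypothetical answers to "How satisfied were you?" (ordinal codes only).
LABELS = {1: "very unsatisfied", 2: "unsatisfied", 3: "satisfied", 4: "very satisfied"}
responses = [4, 3, 3, 2, 4, 3, 1, 3, 4, 3, 2]


def summarize_ordinal(scores):
    """Return (median, mode); both depend only on the ordering of the codes."""
    return median(scores), mode(scores)


def recode(scores, mapping):
    """Replace each code by another code, e.g. a different arbitrary numbering."""
    return [mapping[s] for s in scores]


# An equally defensible coding that keeps the order but changes the spacing.
STRETCHED = {1: 1, 2: 2, 3: 3, 4: 10}
INVERSE = {v: k for k, v in STRETCHED.items()}

med, mod = summarize_ordinal(responses)
med_stretched = median(recode(responses, STRETCHED))

print("median category:", LABELS[med], "| mode category:", LABELS[mod])
print("median category after recoding:", LABELS[INVERSE[med_stretched]])
print("mean with original codes:", round(mean(responses), 2))
print("mean with stretched codes:", round(mean(recode(responses, STRETCHED)), 2))

# Body temperature (degrees Celsius) is an interval measure, so a mean is meaningful.
temperatures = [36.6, 37.1, 36.9, 38.2, 37.0]
print("mean temperature:", round(mean(temperatures), 2))
```

With the original codes the mean falls below the code for "satisfied"; with the stretched codes it falls above it, although no respondent changed their answer. The median category is the same under both codings.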

Qualitative research

Qualitative research “employs non-numeric information to explore individual or group characteristics, producing findings not arrived at by statistical procedures or other quantitative means. Examples of the types of qualitative research include clinical case studies, narrative studies of behaviour, ethnography, and organizational or social studies.”6 Applied to public or population health, qualitative methods are valuable in analyzing details of human behaviour. Beyond merely recording facts (did this person obtain an influenza immunization?), qualitative research delves into motivation and personal narratives that offer insights into why. Hence its sampling methods focus on sampling content, rather than persons.

Qualitative researchers focus on the subjective experiences of individuals and reject the positivist idea that there exists an objective reality waiting to be discovered. They argue that human experience can be interpreted in many ways, reflecting the perspective of the observer, and that researchers can only be partially objective. Qualitative methods are inductive and flexible, allowing interpretations to emerge from the data rather than from a pre-selected theoretical perspective. Just as successive historians may re-interpret historical events, our understanding of diseases and therapies changes with new discoveries.

Like quantitative research, qualitative studies can be pure or applied, but with greater emphasis on the applied – explaining a particular situation. Qualitative data collection methods may be grouped into in-depth interviewing, participant observation, and focus groups (see the Nerd’s Corner box). The data may take the form of words, pictures or sounds – once described as “any data that is not represented by ordinal values.”7

Types of qualitative study

Qualitative method | Type of question | Data source | Analytic technique
Phenomenology | Questions about meaning or questions about the essence of phenomena, or about experiences (e.g., What do Chinese families mean by “harmony”? What is the experience of a deaf child at school?) | Primary: audiotaped, in-depth conversation. Secondary: poetry, art, films. | Theming and phenomenological reflection; memoing and reflective writing.
Ethnography | Observational questions (e.g., How do surgical team members work together in the OR?) and descriptive questions about values, beliefs and practices of a cultural group (e.g., What is going on here?) | Primary: participant observation; field notes; structured or unstructured interviews. Secondary: documents, focus groups. | Thick description, re-reading notes and coding by topic; narrating the story; case analysis; diagramming to show patterns and processes.
Grounded theory | Process questions about how the experience has changed over time or about its stages and phases (e.g., How do medical residents learn to cope with fatigue?) or understanding questions (e.g., What are the dimensions of these experiences?) | Primary: audiotaped interviews; observations. Secondary: personal experience. | Theoretical sensitivity; developing concepts for theory generation. Focussed memoing; diagramming; emphasis on search for core concepts and processes.

Source: Adapted from Richards et al.8

Judging the quality of qualitative research

In judging qualitative research you should consider questions such as:

  1. Was the design phase of the project rigorous?

Elements to consider are the skill and knowledge of the researcher and the completeness of the literature review. The research question should also be clear and suited to a qualitative analysis. The researcher should state the perspective from which the data were gathered and analyzed.

  2. Was the study rigorous?

The final sampling should represent all relevant groups. For instance, a study of patient satisfaction should cover all types of patients that attend the clinic and should sample from both sexes, the full age range, and the full range of complaints. In qualitative research, sample size is not necessarily fixed. Sampling may continue until no new ideas or concepts emerge, a situation known as saturation. Nor is the interview script completely fixed. Questions need not be uniform, but should respond to the participants’ verbal and non-verbal cues so that the topic can be fully explored. As the project continues, the interview script may evolve in response to the findings of previous interviews.
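The idea of sampling until saturation can be sketched as a simple stopping rule. This is only an illustration under stated assumptions: the coded themes are invented, and the `patience` threshold (how many consecutive interviews without a new theme count as saturation) is a hypothetical parameter, not a standard taken from the qualitative research literature. In practice the judgement is made by the researchers, not by a counter.

```python
def interviews_until_saturation(interview_themes, patience=3):
    """Return (interviews conducted, themes found).

    Interviewing stops once `patience` consecutive interviews add no new theme.
    """
    seen = set()
    quiet_run = 0  # consecutive interviews that produced nothing new
    for count, themes in enumerate(interview_themes, start=1):
        new_themes = set(themes) - seen
        seen |= new_themes
        quiet_run = 0 if new_themes else quiet_run + 1
        if quiet_run >= patience:
            return count, seen
    return len(interview_themes), seen


# Hypothetical themes coded from successive patient interviews about satisfaction.
interviews = [
    {"waiting time", "staff courtesy"},
    {"waiting time", "privacy"},
    {"staff courtesy", "clear explanations"},
    {"privacy"},
    {"waiting time", "clear explanations"},
    {"staff courtesy"},
    {"parking"},
]

conducted, themes = interviews_until_saturation(interviews)
print(f"stopped after {conducted} interviews with {len(themes)} themes")
print(sorted(themes))
```

In this invented example the rule stops before the seventh interview, so the "parking" theme is never heard. This shows why the final sample should still be checked for coverage of all relevant groups before saturation is declared.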

Qualitative biases

Qualitative data collection methods, while flexible, should be systematic and documented. Having more than one researcher analyze the same data is a way to reduce possible biases in interpretation; how differences in interpreting the results were reconciled should be noted. Study participants may be asked to validate the interpretation. The reader should look for evidence that the research was conducted in an ethical manner and that confidentiality and anonymity have been preserved.

Bias is inherent in qualitative research. Collecting data by observing people, whether or not they can see the observer, can influence the subject’s behaviour. The interaction between interviewers and interviewees can also influence responses. Similarly, the results of data analysis can depend on the knowledge and perspective of the person doing the analysis. Equivalent challenges exist in quantitative research (see the section on bias), but the methods to counteract them are not the same. Quantitative research aims for uniformity and standardization to reduce bias. Qualitative research, by its nature, responds to context and so is less standardized. Explanation of the context and the researcher’s perspective allows the reader to assess the researcher’s influence on the findings.

  3. Can I transfer the results of this study to my own setting?

Clinicians reviewing an article must decide if the context and subjects of the study are sufficiently like their own context and patients for the results to be applicable. The results can also be compared to previous literature: How closely does this study corroborate other work? If it corroborates closely, it is likely to be generalizable and, therefore, transferable to a similar context.

Qualitative and quantitative complementarity

Cockburn examined patient satisfaction with breast screening services in Australia. She used qualitative methods, including literature reviews and interviews with patients and staff, to identify relevant aspects of satisfaction. From this she developed a standardized questionnaire to measure satisfaction with screening services; she analyzed data from this questionnaire in a quantitative manner.9

Quantitative research

The Western scientific paradigm has been refined over the centuries to identify general principles underlying observable phenomena; qualitative research focuses on describing specific instances of phenomena. As the name suggests, quantitative research is founded on counting, mathematical analysis and standardized study designs, all of which seek to minimize the role of human judgment in collecting and interpreting evidence. Quantitative methods are generally used to compare different groups, such as those exposed to a causal factor or a treatment, with others who were not. Examples include studying patterns of risk factors between people with diabetes and those without, patterns of recovery in patients who get a particular treatment and those who don’t, or lifestyle patterns in different sectors of the population. Clinical trials test the efficacy of a new therapy in curing a disease, such as whether Julie’s magnets reduced her menopausal symptoms. The general goal is to identify causal relations, but quantitative methods alone cannot prove causation.

Criteria for inferring causation

Epidemiological studies can demonstrate associations between variables, but an association is not necessarily causal. Unfortunately, there is no way to prove for certain that an association observed between a factor and a disease is a causal relationship. In 1965, Austin Bradford Hill proposed a set of criteria for assessing the causal nature of epidemiological relationships; he based these in part on Koch’s postulates from the nineteenth century (see Koch’s postulates box, below). The criteria have been revised many times, so you may find different versions with different numbers of criteria. Table 5.3 shows a typical example with a commentary on their limits.

Koch’s postulates

Robert Koch (1843–1910) was a Prussian physician who won the 1905 Nobel Prize in Physiology or Medicine for his work on tuberculosis. Considered one of the fathers of microbiology, he isolated Bacillus anthracis, Mycobacterium tuberculosis (once known as Koch’s bacillus) and Vibrio cholerae. His criteria (or postulates) to classify a microbe as the cause of a disease were that the microbe must be

  • Found in all cases of the disease examined
  • Capable of being prepared and maintained in pure culture
  • Capable of producing the original infection, even after several generations in culture
  • Retrievable from an inoculated animal and cultured again.

These postulates built upon earlier criteria for causality formulated by the philosopher John Stuart Mill in 1843.

Microbiology nerds will be able to cite diseases caused by organisms that do not fully meet all the criteria, but nonetheless, Koch’s postulates provided a rational basis for the study of medical microbiology.

Table 5.3: Criteria for inferring a causal relationship

Criteria Comments
1. Chronological relationship: Exposure to the presumed cause must predate the onset of the disease. | This is widely accepted. But beware of the difficulty in knowing when some diseases actually began if they have long latent periods. Could the onset of this cancer have predated the occupational exposure?
2. Strength of association: If all those with the disease were exposed to the presumed causal agent, but very few in a healthy comparison group were exposed, the association is a strong one. In quantitative terms, the larger the relative risk, the more likely the association is causal. | This criterion can be disputed: the strength depends very much on how many other factors are considered, and how these are controlled in a study. A strong relationship may result from an unacknowledged confounding factor. An example is the strong link between birth order and risk of Down syndrome (see the section on Confounding). This is actually due to maternal age at the child’s birth. A weak association may still be causal, particularly if it is modified by other factors.
3. Intensity or duration of exposure (also called biological gradient or dose-response relationship): If those with the most intense, or longest, exposure to the agent have the greatest frequency or severity of illness, while those with less exposure are not as sick, then it is more likely that the association is causal. | A reasonable criterion if present, but may not apply if a threshold level must be reached for the agent to have an effect. Hence the absence of a dose response does not disprove causality.
4. Specificity of association: If an agent or risk factor is found that consistently relates only to this disease, then it appears more likely that it plays a causal role. | This is a weak criterion, and was derived from thinking about infectious diseases. Factors such as smoking or obesity are causally associated with several diseases; the absence of specificity does not undermine a causal interpretation.
5. Consistency of findings: An association is consistent if it is confirmed by different studies; it is even more persuasive if these are in different populations. | A good criterion, although it may lead us to miss causal relationships that apply to only a minority of people. For instance, the drug-induced haemolysis associated with glucose-6-phosphate dehydrogenase (G6PD) deficiency could be difficult to demonstrate in populations with low prevalence of G6PD deficiency.
6. Coherent or plausible findings: Do we have a biological (or behavioural, etc.) explanation for the observed association? Evidence from experimental animals, analogous effects created by analogous agents, and information from other experimental systems and forms of observation should be considered. | A good criterion if we do have a theory. But this can be subjective: one can often supply a post hoc explanation for an unexpected result. On the other hand, the lack of a biological explanation should not lead us to dismiss a potential cause. Knowledge changes over time, and new theories sometimes arise from unexpected findings. Truth in nature exists independently of our current ability to explain it.
7. Cessation of exposure: If the causal factor is removed from a population, then the incidence of disease should decline. | This may work for a population, but for an individual, the pathology is not always reversible.

Does asbestos cause lung cancer?

The more causal criteria that are met in a study, the stronger is the presumption that an association is causal. For example, could exposure to asbestos fibres among construction workers have caused lung cancer in some of them?
1. Chronological relationship: Can we be sure that the exposure to asbestos predated the cancer (which may have taken years to develop)?
2. Strength of the association: Did groups of workers with the highest exposure show the highest rates of cancer?
3. Intensity and duration of the exposure: Were those with the longest work history the most likely to get sick?
4. Specificity: Did they just get lung cancer?
5. Consistency: Have similar findings been reported from other countries?
6. Coherence and plausibility: Does it make biological sense that asbestos could cause lung cancer?
7. Cessation of exposure: After laws were passed banning asbestos, did lung cancer rates decline among construction workers?
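Questions 2 and 3 in the list above are the ones most easily expressed in numbers. The following Python sketch shows how they might be checked, using the usual definition of relative risk as the risk in the exposed divided by the risk in the unexposed. The cohort counts are entirely hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
def risk(cases, persons):
    """Proportion of a group that developed the disease."""
    return cases / persons


def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk among the exposed divided by risk among the unexposed."""
    return risk(cases_exposed, n_exposed) / risk(cases_unexposed, n_unexposed)


def shows_gradient(risks):
    """True if risk rises at every step from the lowest to the highest exposure."""
    return all(lower < higher for lower, higher in zip(risks, risks[1:]))


# Hypothetical cohort: (asbestos exposure level, lung cancer cases, workers followed)
COHORT = [
    ("none", 10, 5000),
    ("low", 12, 2000),
    ("medium", 18, 1500),
    ("high", 30, 1000),
]

_, reference_cases, reference_n = COHORT[0]
for level, cases, n in COHORT[1:]:
    rr = relative_risk(cases, n, reference_cases, reference_n)
    print(f"{level:>6}: risk = {risk(cases, n):.3f}, relative risk = {rr:.1f}")

print("dose-response gradient:", shows_gradient([risk(c, n) for _, c, n in COHORT]))
```

A large relative risk and a steady gradient would support, but not prove, a causal interpretation; the comments in Table 5.3 on confounding and thresholds still apply.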

In the end, whether or not a factor is accepted as a cause of a disease always remains open to dispute, especially when it is not possible to obtain experimental proof. There are still defenders of tobacco who can use technical arguments to point out the flaws in the evidence that smoking causes cancer and heart disease.

Types of Error in Studies

All studies have to minimize two types of error in collecting and interpreting data: bias (or systematic distortions) and random errors. In addition, studies of causation have to address confounding, a challenge in the interpretation of the results. These types of error are explained in detail later in the chapter, but brief definitions may help the reader at this point.

Types of error

Error: “A false or mistaken result obtained in a study or experiment.”1 We may distinguish between random and systematic errors in research studies:

Random error: deviations from the truth that can either inflate or reduce estimates derived from a measurement or a study. For convenience, these are generally assumed to be due to chance and, if the sample is large, to have little effect in distorting the overall results. Statistics, such as the confidence interval, estimate the magnitude of random errors (see Sampling and chance error below).

Systematic error, or bias: a consistent deviation of results or inferences from what is believed to be the truth: a systematic exaggeration of effect, or an underestimate. These may arise from defects in the study design, including the sampling (“selection bias”), or may arise from faulty measurement procedures (“information bias”).

Confounding: a challenge in interpreting the results of a study in which the effects of two processes are not distinguished from each other (see Confounding, below).
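The difference between random and systematic error can be seen in a small simulation. The sketch below is illustrative only: the "true" blood pressure of 120 mmHg, the noise level, the 5 mmHg instrument bias, and the function names are all assumptions made for the example. It uses the common large-sample approximation of a 95% confidence interval (mean plus or minus 1.96 standard errors).

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_VALUE = 120.0  # hypothetical true mean systolic blood pressure (mmHg)


def simulate_readings(n, bias=0.0, noise_sd=10.0, seed=1):
    """Simulate n readings: truth + systematic bias + random (chance) error."""
    rng = random.Random(seed)
    return [TRUE_VALUE + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using an approximate 95% confidence interval."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


for n in (25, 400, 10000):
    fair = mean_with_ci(simulate_readings(n))
    skewed = mean_with_ci(simulate_readings(n, bias=5.0))
    print(f"n={n:>5}  unbiased: {fair[0]:.1f} ({fair[1]:.1f} to {fair[2]:.1f})"
          f"  biased: {skewed[0]:.1f} ({skewed[1]:.1f} to {skewed[2]:.1f})")
```

As the sample grows, the confidence interval narrows around the estimate in both cases. With the miscalibrated instrument, however, it narrows around the wrong value: a larger sample reduces random error but does nothing to correct bias.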

Research Designs

Unlike qualitative methods, quantitative research is based on systematically sampling people, and uses standardized measurement, analysis, and interpretation of numeric data. Quantitative research uses a variety of study designs, which fall into two main classes: experimental studies (or trials) and observational studies. Figure 5.1 maps the distinctions between these.

[Figure 5.1: What kind of study is it? A diagram showing the logical structure of alternative study designs.]

Experimental (or interventional) studies

As the name implies, these are studies in which the participants undergo some kind of intervention in order to evaluate its impact. The experimental researcher has control over the intervention, its timing and dose or intensity. An intervention could include a medical or surgical intervention, a new drug, or an intervention to change lifestyle. As the most methodologically rigorous design, experiments are the default choice for providing evidence for best practice in patient management, so the discussion begins with them.

In its simplest form, an experimental study to test the effect of a treatment follows these steps:

  1. The researcher formally states the hypothesis to be tested.
  2. The researcher selects people eligible for the treatment.
  3. The sample is divided into two groups.
  4. One group (the experimental, or intervention group) is given the intervention while the other (the control group) is not.
  5. Relevant outcomes are recorded over time, and the results compared between the two groups.

Step 3 leads to a critical distinction, shown at the left of Figure 5.1: the distinction between a randomized controlled trial and non-randomized designs. In the former, people are allocated to intervention and control groups by chance alone, while in the latter the choice of who receives the intervention is decided in some other way, such as according to where or in which order they enter the study. There are many types of non-randomized studies and, because the researcher often does not have complete control over the allocation to experimental or control group, they are regarded as inferior to true randomized designs (see Nerd’s corner box). They are often called quasi-experimental designs.

What’s a quasi-experiment?

An example of a quasi-experiment would be to treat hypertensive patients attending one hospital with one protocol and compare their outcomes to patients receiving a different treatment protocol in another hospital. This has the advantage of being simple: there is no need to randomise patients in each hospital, and staff training is greatly simplified. However, many biases might arise in such a study: patients might choose which hospital or clinician they attend (self-selection); one hospital may treat more severe patients; other aspects of care in the two hospitals may be different, and so forth.

Another quasi-experimental study is the time series design. This studies a single group and makes serial measurements before and after some intervention, thereby allowing trends to be compared to detect the impact of the intervention. For example, to examine whether introducing a new textbook on public health has any impact on student learning, public health exam marks could be compared for successive cohorts of medical students, for several years before, then for several years after the introduction of the book. The hypothesis is that there will be a significant jump in the scores following the introduction of the new book. This design is called quasi-experimental because it lies mid-way between an observational study and a true experiment. It can be considered an experiment if the investigator has control over the timing of the introduction of the book. But in other instances (such as studying injuries from car collisions before and after the introduction of seat-belt laws) the researcher has no such control and this design could be considered an observational study or an experiment of opportunity. The time-series design has the virtue of feasibility: it would be difficult to randomly allocate some students to have the book and others not, because the book might be shared between the two groups.

Quasi-experiments have sufficient sources of potential bias that they are regarded as substantially inferior to true randomized experiments, so their findings are rarely considered conclusive.

Randomization removes allocation bias, increases the chances that any confounding factors will be distributed evenly between both groups, and allows the valid use of statistical tests. The key advantage of random assignment is that other factors that could affect the outcome (confounding factors) are likely to be equally represented in each study group, including unknown factors such as genetic characteristics that affect prognosis. An unbiased random allocation should ensure that the only systematic difference between the two study groups is the intervention, so that any differences in outcomes are probably attributable to the intervention. The larger the study sample the more confident we can be that this is true. Nonetheless, it is still a matter of probabilities, and this is why we need tests of statistical significance. These show the likelihood that observed differences between experimental and control groups could have arisen merely by chance.
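The effect of sample size on the balance of a confounder can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 30% smoking prevalence and the sample sizes are hypothetical, chosen purely for illustration.

```python
import random

def allocate(participants, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = participants[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def smoker_imbalance(n, seed=0):
    """Absolute difference in the proportion of smokers between the groups."""
    rng = random.Random(seed)
    # Hypothetical confounder: each participant smokes with probability 0.3
    participants = [rng.random() < 0.3 for _ in range(n)]
    intervention, control = allocate(participants, seed=seed + 1)
    return abs(sum(intervention) / len(intervention)
               - sum(control) / len(control))

if __name__ == "__main__":
    for n in (20, 200, 2000, 20000):
        print(n, round(smoker_imbalance(n), 3))
```

With larger samples the imbalance will typically shrink, although any single run is still subject to chance, which is the point made above.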

Random sampling and random allocation

Distinguish between random selection of subjects from a sampling frame or list, and random allocation of subjects to experimental or control groups. Random selection of subjects is mainly relevant in descriptive research and helps to ensure that results can be generalized to the broader population, so enhancing the external validity of the study (see the section on errors in sampling).

Random allocation to experimental and control groups helps to ensure they are equivalent in everything save for the experimental intervention, so the comparison is not confounded by inherent differences between groups; this enhances the internal validity of the study. (See Nerd’s corner “Not always truly random”)
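The two uses of randomness can be contrasted in a few lines of code. The sketch below assumes a hypothetical sampling frame of 500 patient identifiers and a study of 40 participants; none of these figures comes from a real study.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: a list of everyone eligible to be studied
sampling_frame = [f"patient_{i:03d}" for i in range(500)]

# Random selection: decides WHO enters the study (supports external validity)
study_sample = rng.sample(sampling_frame, k=40)

# Random allocation: decides WHICH GROUP each participant joins
# (supports internal validity)
shuffled = study_sample[:]
rng.shuffle(shuffled)
intervention_group = shuffled[:20]
control_group = shuffled[20:]

print(len(intervention_group), len(control_group))
```

A trial can use random allocation without random selection, for example when participants are volunteers; the groups remain comparable with each other even if they are not representative of the wider population.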

Not always truly random

For practical reasons, some trials use non-random patient allocation. For example, using patients’ health insurance numbers, those with an odd number could be assigned to the experimental group and those with an even number to the control group. This is superior to participants themselves choosing which group to join, and may approach the quality of a random allocation. However, the method of allocation should be carefully scrutinized to ensure the numbers were assigned in a truly random manner. Check, for example, to ensure that males are not given odd numbers and females even.

Randomized controlled trials

The most common experimental design in medical research is the randomized controlled trial (RCT – see Figure 5.2). An RCT is a true experiment in that the investigator controls the exposure and, in its simplest form, assigns subjects randomly to the experimental or control group (which may receive no treatment, or the conventional treatment, or a placebo). Both groups are followed and assessed in a rigorous comparison of their rates of morbidity, mortality, adverse events, functional health status, and quality of life. RCTs are most commonly used in therapeutic trials but can also be used in trials of prevention. Most commonly, people are randomly allocated to the study groups individually, but groups of people can also be allocated, or even whole communities. RCTs are often conducted across many centres, as illustrated by clinical trials of cancer treatments.

Figure 5.2 Generic plan of a randomized controlled trial

The steps in an RCT are:

  1. State the hypothesis in quantitative and operational terms. For example, using the PICO format: “There will be a 10% reduction in self-recorded night sweats among peri-menopausal women who wear a magnetic wrist bracelet, compared to age-matched women who do not wear a wrist magnet.”
  2. Select the participants. This step includes calculating the required sample size, setting inclusion and exclusion criteria, and obtaining free and informed consent.
  3. Allocate participants randomly to either the treatment or control group; this is normally done using a computer-generated random allocation. Note that there may be more than one intervention group, for example receiving different types of magnet. Note also that the control group often receives the standard treatment to which the new one is being compared.
  4. Administer the intervention. This is preferably done in a blinded fashion, so that the patient does not know which group she is in. Ideally, the researcher (and certainly the person intervening and monitoring the patient’s response) should also not know which group a given patient is in (this is called a double-blind experiment). This helps to remove the influence of the patient’s and the clinician’s expectations of the treatments, which could bias their assessment of outcomes. Sometimes, a triple-blind approach is used in which neither patient, nor clinician, nor those who analyze and interpret the data know which group received the treatment (the groups are merely labeled A or B). This reduces possible bias even further.
  5. At a pre-determined time, the outcomes are monitored (e.g., physiological or biochemical parameters, morbidity, mortality, adverse events, functional health status, or quality of life) and compared between the intervention and control groups using statistical analyses. This indicates whether any differences in event rates observed in the two groups are greater than might be expected by chance alone.
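The comparison in step 5 can be sketched with a two-proportion z-test, one of several tests that could be used for a difference in event rates. The function name and the trial counts are hypothetical; the test itself is a standard large-sample approximation and is not necessarily the analysis a given trial would pre-specify.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p-value) for a difference in event proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    # Hypothetical data: women reporting night sweats at follow-up
    z, p = two_proportion_z_test(30, 100, 45, 100)  # magnet vs. control
    print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests that a difference this large would rarely arise by chance alone; it says nothing about whether the difference is clinically important.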

While RCTs are regarded as the best research design we have, they do have limitations. By design, they study the efficacy of a treatment under carefully controlled experimental conditions, which may not mirror how well the treatment will work in normal clinical practice. Efficacy refers to the potential impact of a treatment under the optimal conditions typical of a controlled research setting. Effectiveness refers to its impact under the normal conditions of routine practice. For example, in trial conditions the medication may be efficacious because patients know that they are participating in a research project and are being supervised. And yet in the real world the medication may not be effective because, without supervision, patients may not take all of their medication in the correct dose. An efficacious intervention may also not be efficient enough to put into practice. Breast self-examination has been shown to detect early breast cancer, but only in trial conditions in which women received constant follow-up by trained nurses. This level of intervention was too costly to be put into routine practice.

Furthermore, clinical trials are often conducted on selected populations – e.g., men aged 50 to 74 who are current smokers with unstable angina, no co-morbidity and willing to participate in a research study. This limits the extent to which results can be generalized to typical angina patients. Trial results may also be biased due to attrition if participants in one or other group drop out of the study. Finally, intervention trials, although designed to detect differences in the known and desired outcomes, may not be large enough to reliably detect previously unknown or rare effects.

An adaptation of the RCT is the “N of 1” trial which can have particular value in testing a treatment for a particular patient in a way that avoids most sources of bias.

N of 1 trials

An N of 1 trial is a special form of clinical trial that studies a single patient. The patient receives either the active treatment or a control (e.g. a placebo), determined randomly and administered blindly. Outcomes are recorded after a suitable time lapse, followed by a washout period when the medication is withheld to eliminate remaining traces of it. The patient then receives the alternate treatment (placebo or active) and outcomes are evaluated. The cycle may be repeated to establish stable estimates of the outcomes. The main advantage is that the study result applies specifically to this patient and allows for calibration to optimize the therapeutic dose. The results cannot be generalized beyond this particular patient, and of course it requires that the effect of the treatment can be reversed.

Some studies use an N of 1 approach but with a group of patients. Because each patient acts as their own control, almost all sources of bias are eliminated and the results can be highly valid.
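The sequence of treatment periods in an N of 1 trial can be sketched as a randomized schedule. The function name and period labels below are hypothetical, and a real trial would also need to fix the length of each period and of the washout.

```python
import random

def n_of_1_schedule(cycles, seed=None):
    """Build a treatment schedule for one patient.

    Each cycle contains one active and one placebo period in random
    order, and every treatment period is followed by a washout.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(cycles):
        pair = ["active", "placebo"]
        rng.shuffle(pair)  # random order within the cycle
        schedule.extend([pair[0], "washout", pair[1], "washout"])
    return schedule

if __name__ == "__main__":
    print(n_of_1_schedule(cycles=3, seed=7))
```

In practice the schedule would be held by a third party, such as a pharmacist, so that neither the patient nor the clinician knows which period is which.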

Ethics of RCTs

Some special ethical issues (see ETHICS in Glossary) arise in the conduct of all medical experiments. A tension may arise between two basic principles: patients have a right to receive an effective treatment (the principle of beneficence), but it is unethical to adopt a new treatment without rigorous testing to prove efficacy and ensure non-maleficence. Therefore, if there is partial evidence that a treatment seems superior, it may be unethical to prove this in a randomized trial because this would entail denying it to patients in the control group. Hence, an RCT can only ethically be applied when there is genuine uncertainty as to whether the experimental treatment is superior; this is termed equipoise. However, it may be considered irrelevant to undertake an expensive trial when there is no reason to believe the new medication is superior. It is also unethical to continue a trial if the treatment is found to be obviously effective or obviously dangerous. Trials are therefore planned with pre-set stopping rules that specify conditions under which they should be prematurely concluded.  It is also unethical to conduct trials that offer only marginal benefits in terms of broader social value (e.g., studies that benefit the publication record of the researcher more than the health of patients, or studies that double as marketing projects). Ethical considerations mean that many established treatments will probably never be evaluated by a controlled trial:

  • Appendectomy for appendicitis
  • Insulin for diabetes
  • Anaesthesia for surgical operations
  • Vaccination for smallpox
  • Immobilization for fractured bones, and
  • Parachutes for jumping out of airplanes, as the British Medical Journal humorously noted.10

Terminating trials early

The ethical principle of beneficence demands that patients must benefit from a new treatment as soon as it is proven effective, but the principle of non-maleficence implies that this proof must be definitive. Therefore studies are designed to include the minimum sample size required for definitive proof. The sample size is calculated before the study begins from an estimate of the likely relative benefit of intervention and control treatments, but this is an estimate only, and can be wrong.

Occasionally, early results may seem to show an advantage one way or the other, but being based on small numbers these preliminary results may be due to chance. Researchers may thus be faced with a choice between stopping a trial before the number of participants is large enough to definitively demonstrate the superiority of one course of action, and continuing the trial even though, as far as they know, one course of action appears superior to the other. This decision becomes especially challenging when the experimental treatment appears to be harmful compared to the conventional treatment. A further complication is that undertaking early analyses of the data implies un-blinding the investigators, and this can bias their future conclusions. In general, a data-monitoring committee uses methods that allow continuous monitoring of outcomes but does not communicate with the investigators until clinically significant differences occur and the trial can be stopped.

Phases of intervention studies

Once a new pharmaceutical treatment has been developed, it undergoes testing in a sequence of phases before it can be approved by regulatory agencies for public use. Randomized trials form the third stage within this broader sequence, which begins with laboratory studies using animal models, thence to human testing:

Phase I:     The new drug or treatment is tested in a small group of people for the first time to determine safe dosage and to identify possible side effects.

Phase II:    The drug or treatment is given to a larger group at the recommended dosage to determine its efficacy under controlled circumstances and to evaluate safety. This is generally not a randomized study.

Phase III:  The drug or treatment is tested on large groups to confirm effectiveness, monitor side effects, to compare it to commonly used treatments, and to develop guidelines on the safe use of the drug. Phase III testing normally involves a series of randomized trials. At the end of this phase, the drug may be approved for public use. The “on label” approval may restrict how the drug can be used, for instance for specific diseases or in certain age groups.

Phase IV:  After the treatment enters the marketplace, information continues to be collected to describe its effectiveness on different populations, but particularly to detect possible side effects or adverse outcomes. This does not involve a formal RCT and is called post-marketing surveillance; information comes from several sources, such as reports of side effects from physicians and patients, or data on outcomes such as hospital readmissions obtained from computerized information systems. Large numbers may be required to detect rare or slowly developing side effects.

Observational studies

In an observational study, the researcher observes what happens to people under exposure conditions either chosen by the person (such as exercise or diet) or that are outside of their control (such as most social determinants of health). There is often a comparison group of people who were not exposed. The key is that the researcher chooses which populations and exposures to study, but does not influence them. As there is no random allocation of exposures, the major problem in inferring causation is that the exposed and unexposed groups may differ on other key factors that may themselves be true causes of the outcome, rather than the characteristics under study. Such factors are known as confounders.

The simplest form of observational study is a descriptive design.

Descriptive studies

Descriptive studies describe how things are: they count the numbers of people who have diabetes, or who smoke, or are satisfied with their hospital care. Such a study uses descriptive statistics to summarize the group results – percentages, a mean or median value, and perhaps the range of values or the standard deviation. The data for a descriptive study may come from a questionnaire, from sources such as electronic medical records, or from surveillance programmes describing person, place, and time of disease occurrences (Who? Where? When?) (see Surveillance in Chapter 7). Descriptive studies are commonly used with small, local populations, such as the patients in your practice, and are often used to collect information for planning services. Descriptive studies generally refer to a single point in time – usually the present – so give a cross-sectional picture of the population, although repeated cross-sectional studies can illustrate trends over time, such as the changing number of smokers in your practice. When a study collects information on several variables, it can describe the associations among the variables (for example, is diabetes more common in men or women, and does it vary by smoking status?). This can be used to generate hypotheses, which may then be tested in an analytic study.

Analytic studies

The critical distinction between a descriptive and an analytic study is that the latter is designed to test a hypothesis, generally concerned with identifying a causal relationship. When an outcome variable, such as heart disease, is studied in relation to an exposure variable such as body weight, the study does more than count: it tests a hypothesis predicting an association between the two. The interest is no longer purely local, as with a descriptive study; the aim is to draw a more general conclusion that will apply to a broader population. Hence, the representativeness of the study sample is crucially important, introducing the concept of external validity or generalizability of the sample results. To describe the level of confidence with which we can draw general conclusions from a sample, we use inferential statistics (see section on Chance errors in sampling, below).

Analytic observational studies vary in terms of the time sequence and sampling procedures used to collect data, and can be of three types: cross-sectional, cohort or case-control studies, as shown in Figure 5.1.

Cross-sectional analytic studies

Cross-sectional studies use a single time-reference for the data collected (e.g., consultations during the past 2 weeks). A common cross-sectional design is the analytical survey, an extension of the descriptive survey. The difference is that the analytic study measures variables in order to test hypotheses concerning their relationships, rather than merely to report frequencies of their occurrence. For example, in a nationally representative sample a researcher might test the hypothesis that feelings of stress increase the use of medical services. The researcher might ask whether people were under stress in the last year, then whether they had visited a doctor in the last 2 weeks.

Suppose that this national study produced the following result:

Table 5.4: Stress and physician visits: calculating the association between two variables

                            Doctor visit in the last 2 weeks?
Stress in the last year?    Yes       No        Total
Yes                         1,442     3,209     4,651
No                          2,633     11,223    13,856
Total                       4,075     14,432    18,507

Note that this result can be reported in either of two ways:

  1. Of those who suffered stress in the last year, 31% (1442/4651) visited their doctor in the last 2 weeks compared with only 19% (2633/13856) of those who did not suffer stress.
  2. Of those who visited their doctor, 35% (1442/4075) reported stress in the previous year, compared with 22% (3209/14432) of those who did not visit their doctor.

Either approach is correct. The researcher is free to decide which way to report the results; the study design allows both types of analysis. All that can be concluded is that there is an association between the two variables. It might be that stress predisposes people to visit their doctor, or that the prospect of a visit to the doctor causes stress, or that something else (a confounding factor such as worry about an underlying illness) caused both. This study provides little evidence in support of a causal relationship – merely the apparent association between stress and the doctor visit. The chief weakness of cross-sectional studies is that they cannot show temporal sequence: whether the factor (stress) pre-dated the outcome (doctor visit) (see causal criteria in Table 5.3).
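The two ways of reporting Table 5.4 correspond to row percentages and column percentages. The counts in this sketch come from the table; the variable and function names are illustrative only.

```python
# Counts from Table 5.4: rows are stress status, columns are doctor visit
counts = {
    "stress":    {"visit": 1442, "no_visit": 3209},
    "no_stress": {"visit": 2633, "no_visit": 11223},
}

def percent_visiting(stress_status):
    """Row percentage: share of a stress group who visited the doctor."""
    row = counts[stress_status]
    return 100 * row["visit"] / sum(row.values())

def percent_stressed(visit_status):
    """Column percentage: share of a visit group who reported stress."""
    column_total = sum(counts[row][visit_status] for row in counts)
    return 100 * counts["stress"][visit_status] / column_total

if __name__ == "__main__":
    print(f"Visited, among stressed:     {percent_visiting('stress'):.0f}%")
    print(f"Visited, among not stressed: {percent_visiting('no_stress'):.0f}%")
    print(f"Stressed, among visitors:     {percent_stressed('visit'):.0f}%")
    print(f"Stressed, among non-visitors: {percent_stressed('no_visit'):.0f}%")
```

Both sets of percentages come from the same four cells, which is why neither one says anything about which variable came first.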

Descriptive and analytic studies usually sample individual people, but they may alternatively study groups of people, such as city populations. These ecological studies often use  aggregate data from government sources, making the study easy to undertake. See the Nerd’s Corner box on Ecological studies.

Ecological studies

Ecological studies measure variables at the level of populations (countries, provinces) rather than individuals. They are the appropriate design for studying the effect of a variable that acts at the population level, such as climate, an economic downturn, or shortages of physicians. Like a survey, they can be descriptive or analytic. They have the advantage that they can often use readily available data such as government statistics. Ecological studies can be useful for generating hypotheses that can be tested at the individual level. For example, the hypothesis that dietary fat may be a risk factor for breast cancer arose from a study that showed that countries with high per-capita fat consumption also had higher incidences of breast cancer.

However, there is a logical limitation in drawing conclusions from ecological studies for individual cases. Because the ecological finding was based on group averages, it does not necessarily show that the individuals who consumed a high fat diet were the ones most likely to get cancer. Perhaps the breast cancer cases living in countries with high fat diets actually consumed little fat: we cannot tell from an ecological study. The temptation to draw conclusions about individuals from ecological data is called “the ecological fallacy.” To draw firm conclusions about the link between dietary fat and breast cancer risk, the two factors must be studied in the same individuals. Nevertheless, ecological studies are often used as a first step, to suggest whether or not a more expensive study of individuals may be worthwhile.

Cohort studies

A cohort is a group of people who can be sampled and enumerated, who share a defining characteristic and who can be followed over time: members of a birth cohort share the same year of birth, for example (see  the Nerd’s Corner: Cohort). Cohort studies of health commonly study causal factors; the characteristic of interest is usually some sort of exposure hypothesized to increase the likelihood of a disease. A typical cohort study begins with a representative sample of people who do not have the disease of interest; it collects information on exposure to the factor being studied, and follows exposed and unexposed people over time (Figure 5.3). Because of this, cohort studies are also known as follow-up or longitudinal  studies. The numbers of newly occurring (incident) cases of disease are recorded and compared between the exposure groups. The hypothesis to be tested is generally that more disease will arise in the exposed group (as indicated by the relative sizes of the rectangles on the right of the figure).

Figure 5.3: Schema of a cohort study

Latin cohorts

Cohort: from Latin cohors, meaning “an enclosure.” The meaning was extended to an infantry company in the Roman army through the notion of an enclosed group or retinue. Think of a Roman infantry cohort marching forward; some are wearing new metal body armour, while others have the old canvas-and-leather protection. Bandits shoot at them and General Evidentius gets a scribe to record the mortality outcomes and his trusty analyst, Epidemiologicus, compares the results using simple arithmetic, as shown in Table 5.5.

In simple cohort studies the results can be fitted into a “2 by 2” table (2 rows by 2 columns – don’t count the Total column).

Table 5.5: Generic format for a 2 x 2 table linking an exposure to an outcome.

                                     Outcome (e.g., disease) present    Outcome (e.g., disease) absent    Total
Exposure (or risk factor) present    a                                  b                                 a + b
Exposure (or risk factor) absent     c                                  d                                 c + d

The incidence, or risk, of disease in the exposed group is calculated as a / (a + b). Correspondingly, the risk in the non-exposed people is c / (c + d). These risks can be compared to get a risk ratio (often called a relative risk, or RR): [a/(a + b) divided by c/(c + d)]. This gives an estimate of the strength of the association between the exposure and the outcome: how much more likely is the disease among those exposed? A relative risk of 1.0 indicates that exposed people are neither more nor less likely to get the disease than unexposed people: there is no association between exposure and disease. A relative risk greater than 1.0 implies that people who have been exposed have a greater chance of becoming diseased, while a relative risk of less than 1.0 implies a protective effect, with exposed people having a lower chance of becoming diseased than unexposed people.
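The relative risk calculation can be written as a short function using the cell labels of Table 5.5. The counts in the example are hypothetical and are chosen only to show the arithmetic.

```python
def relative_risk(a, b, c, d):
    """Risk ratio from a 2 x 2 cohort table.

    a, b: exposed people with and without the outcome
    c, d: unexposed people with and without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

if __name__ == "__main__":
    # Hypothetical cohort: 20 of 100 exposed and 10 of 100 unexposed
    # people develop the disease
    rr = relative_risk(a=20, b=80, c=10, d=90)
    print(f"Relative risk = {rr:.1f}")
```

Here the risk is 20% in the exposed group and 10% in the unexposed group, so exposed people are twice as likely to develop the disease.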

The fact that exposure was recorded before the outcomes is the main advantage of cohort studies (recall that no-one had the disease at the start of the study). They can clearly establish the causal criterion of a temporal sequence between exposure and outcome as long as study participants truly did not have the disease at the outset. Furthermore, because recording of exposures and outcomes is planned from the beginning of the study period, measurements can be standardized. Note that randomized trials are an experimental version of a cohort study, in which the experimenter randomly assigns the exposure to experimental or control subjects.

Definition of exposure groups

Imagine a cohort study designed to test the hypothesis that exposure to welding fumes causes diseases of the respiratory tract. Who would you sample? The sample could be drawn on the basis of a crude indicator of exposure, such as using occupation as a proxy (assume that welders have been exposed; a non-welding occupation would be presumed to be unexposed). This approach is frequently used in occupational and military epidemiology. A more precise alternative would be to quantify levels of exposure (e.g., from the person’s work history); this requires considerably more information but would permit dose-response to be estimated—one of the criteria for inferring causation (see Table 5.3).

In an extension of this quantified approach, a cohort study might not select an unexposed group to follow, but rather select a representative sample of individuals with sufficient variability in their exposure to permit comparisons across all levels of exposure, or to permit mathematical modelling of the effect of exposures. Cohort studies of diet, exercise, or smoking often use this approach, obtaining exposure information from a baseline questionnaire. This approach has been used in community cohort studies such as the Framingham Heart Study (see Illustrative Material: The Framingham Study). Cohort studies offer a powerful way to evaluate causal influences, but they may take a long time to complete and hence be expensive. A cheaper alternative is the case-control design.

The Framingham Study

Since 1948, the town of Framingham, Massachusetts, has participated in a cohort study to investigate the risk factors for coronary heart disease. The study has now collected data from two subsequent generations of the families initially involved. This has produced quantitative estimates of the impact of risk factors for cardiac disease, covering levels of exercise, cigarette smoking, blood pressure, and blood cholesterol. Details of the Framingham studies may be found at the Framingham Heart Study website.

A cohort study proves…?

In drawing conclusions from a cohort study, it is tempting to assume that the exposure caused the outcome. After all, the study can demonstrate that there is a temporal association between the two and may also demonstrate a dose-response gradient. This meets two of the criteria for a causal relation, but confounding can remain an issue (see section on confounding).

Confounding factors may explain why epidemiological studies are famous for producing conclusions that conflict with one another; for this reason there is a strong emphasis on undertaking a randomized controlled trial, when it is feasible and ethical to do so. An example is that of hormone replacement therapy and cardiovascular disease in women (see Illustration box Confounded hormones).

Case-control studies

Case-control studies compare a group of patients with a particular outcome (e.g., cases of pathologist-confirmed pancreatic cancer) to an otherwise similar group of people without the disease (the controls). As shown in Figure 5.4, reports or records of exposure (e.g., alcohol consumption) before the onset of the disease are then compared between the groups. The name of the design reminds you that groups to be compared are defined in terms of the outcome of interest: present (cases) or absent (controls). The hypothesis to be tested is that exposure will be more common among cases than controls, as indicated by the relative size of the circles on the left of the figure.

Figure 5.4: Schema of a case-control design

Notice that a case-control study does not allow the calculation of the incidence or risk of the disease, because it begins with predetermined numbers of people who already have, or do not have it. Therefore, a risk ratio cannot be calculated. Instead, the study identifies the exposure status of a sample of cases and another of controls. This information allows the calculation of the odds of a case having been exposed—the ratio of a:c in the 2 x 2 table (Table 5.6). This can be compared to the odds of a control having been exposed, the ratio of b:d. The result of the case-control study is then expressed as the ratio of these two odds, giving the Odds Ratio (OR): a/c divided by b/d. To make the calculation easier, this is usually simplified algebraically to ad/bc.

Table 5.6: Generic 2 x 2 table for calculating an odds ratio

                                     Outcome (or disease) present    Outcome (or disease) absent
Exposure (or risk factor) present    a                               b
Exposure (or risk factor) absent     c                               d

The OR calculated from a case-control study can approximate a relative risk, but only when the disease is rare (say, affecting up to around 5% of the population, as is the case for many chronic conditions) (see the Nerd’s Corner box on Probabilities and odds). The interpretation of the value of an OR is similar to an RR. Like a relative risk, an OR of 1.0 implies no association between exposure and disease. A value over 1.0 implies a greater chance of diseased people having been exposed compared to controls. A value below 1.0 implies that the factor is protective. This might occur, for example, if a case-control study showed that eating a low fat diet protected against heart disease.
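The odds ratio can be computed directly from the four cells of Table 5.6. The counts below are hypothetical and are included only to illustrate that the two forms of the formula give the same answer.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2 x 2 case-control table.

    a, c: cases who were exposed and unexposed
    b, d: controls who were exposed and unexposed
    """
    odds_cases = a / c      # odds of exposure among cases
    odds_controls = b / d   # odds of exposure among controls
    return odds_cases / odds_controls

if __name__ == "__main__":
    # Hypothetical study: 100 cases (40 exposed), 100 controls (20 exposed)
    a, b, c, d = 40, 20, 60, 80
    print(f"OR = {odds_ratio(a, b, c, d):.2f}")
    print(f"ad/bc = {(a * d) / (b * c):.2f}")  # the simplified form
```

Note that the function would fail with a division by zero if c, b or d were zero; handling such sparse tables is outside the scope of this sketch.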

Key contrast between cohort and case-control studies

In cohort studies, the participant groups are classified according to their exposure status (whether or not they have the risk factor).

In case-control studies, the different groups are identified according to their health outcomes (whether or not they have the disease).

Prospective or retrospective?

These terms are frequently misunderstood, and for good reason.

Cohort studies define the comparison groups on the basis of exposure levels. A study beginning today might identify welders, begin to record their working history from here on and follow them forward over time to check outcomes several years hence. This would be a prospective cohort study. But it would be quicker to work from occupational records and select people who worked as welders 30 years ago, and assess their health status now, comparing this to their exposure history. This used to be called a retrospective cohort study, but is better called an historical cohort study. The word retrospective causes confusion because it was formerly used to refer to case-control studies. Most authorities have now abandoned the term entirely.

Probabilities and odds

Probabilities and odds express the same information in different ways. Probabilities (as used in prevalence and incidence figures) express the proportion of people who have a certain characteristic (for example being exposed to a causal factor). Odds take this further, and express the ratio of two probabilities: the probability of a case having been exposed divided by the probability of a case not having been exposed. From Table 5.6 this would be [a/(a+c)] ÷ [c/(a+c)], which simplifies to a/c. The odds ratio takes it one step further, comparing the odds in each column of the table, or (a/c) ÷ (b/d), which equals ad/bc. Odds are familiar to us when comparing separate groups – the ratio of men to women in your class, for example, or from sports: the odds of winning could be 4 to 1, which corresponds to a probability of 4/5, or 80%.
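The conversion between odds and probabilities can be sketched in a few lines of Python. This is only an illustration; the function names are invented here, and exact fractions are used so that the 4 to 1 example comes out without rounding.

```python
from fractions import Fraction

def probability_to_odds(p):
    """Odds in favour of an event: p / (1 - p)."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Probability of an event from its odds: odds / (1 + odds)."""
    return odds / (1 + odds)

# Odds of winning of 4 to 1 correspond to a probability of 4/5 (80%).
p_win = odds_to_probability(Fraction(4, 1))
odds_win = probability_to_odds(Fraction(4, 5))
print(p_win, odds_win)  # 4/5 4
```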

The relative risk, calculated in Table 5.5, required that the sample forms a single cohort and that all those who were exposed be classified as a case or a non-case (and similarly for the non-exposed). This is necessary so that proportions such as a/(a+b) can be calculated. In a case-control study, however, the proportion of cases and controls was pre-set and so a proportion such as a/(a+b) in Table 5.6 provides no new information. However, we can use odds and make the calculation vertically in Table 5.6, within the cases and within the controls, and compare the ratio a/c to b/d.

It was noted that the odds ratio approximates the relative risk only when disease is rare. This can be shown as follows. If the number of cases is small and the number of non-cases large, then a proportion such as a/(a+b) will be virtually equal to a/b, and c/(c+d) will be virtually equal to c/d. The relative risk, [a/(a+b)] ÷ [c/(c+d)], is then close to (a/b) ÷ (c/d) = ad/bc, which is the odds ratio. The extent of error depends on the size of the relative risk, but as the disease becomes more frequent (e.g. above 5%) the OR tends to exaggerate the RR for risks > 1, and to under-estimate the RR when the risk is < 1.
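A short numerical sketch can make the rare-disease point concrete. It assumes a hypothetical cohort in which the true relative risk is fixed and only the baseline risk in the unexposed group changes; the function name and the baseline risks are illustrative choices, not values from a real study.

```python
def rr_and_or(risk_unexposed, true_rr):
    """Return (RR, OR) for a given baseline risk and true relative risk."""
    risk_exposed = risk_unexposed * true_rr
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return true_rr, odds_exposed / odds_unexposed

# Hypothetical baseline risks, from very rare to common.
for baseline in (0.001, 0.01, 0.05, 0.20):
    rr, odds_ratio = rr_and_or(baseline, true_rr=2.0)
    print(f"baseline risk {baseline:.3f}: RR = {rr:.2f}, OR = {odds_ratio:.2f}")
```

With a rare disease the two measures are nearly identical; as the baseline risk rises the OR drifts further from 1.0 than the RR, in either direction.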

Measures of absolute risk: attributable risk and number needed to treat

The RR and OR indicate how much an individual’s risk of disease is increased by having been exposed to a causal factor, in relative terms. Both statistics answer the question “Compared to someone without the risk factor, how many times as likely am I to get the disease?”, giving the answer as a ratio:  “You are twice as likely”, or “10% more likely”. A patient’s question, however, often concerns absolute risk, which relates to disease incidence and answers the question “What is my chance of getting the disease (in the next year, ten years, over my lifetime)?” The answer is given as an absolute proportion, such as 1 in 10, or 1 in 100. An important point to bear in mind when communicating with a patient is that if the disease is rare, quoting a relative risk of 2 or 3 can appear quite frightening even though the absolute risk is small. A risk factor that doubles an absolute risk of 1 in a million is still only 2 in a million.

Most diseases have multiple causes, so it is convenient to have a way to express the risk due to a particular cause. This introduces the concept of attributable risk, which indicates the number of cases of a disease among exposed individuals that can be attributed to that exposure:

Attributable risk = Incidence in the exposed group − Incidence in the unexposed

This tells us how much extra disease has been caused by this exposure, in absolute terms: 1 case per million persons in the example above. In the case of a factor that protects against disease, such as a vaccination, it tells us how many cases can be avoided.

This idea of attributable risk can also be expressed in relative terms, as a proportion of the incidence in exposed persons, yielding the exposed attributable fraction, EAF:

EAF = [Incidence (exposed) – Incidence (unexposed) ] / Incidence (exposed)

This statistic can be useful in counselling a patient exposed to a risk factor: “not only are you at high risk of developing lung cancer, but 90% of your risk is attributable to your smoking. Quitting could have a major benefit”.
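As a rough illustration, the two formulas can be written as small Python functions. The incidence figures below are hypothetical, chosen only so that the attributable fraction comes out at the 90% used in the counselling example.

```python
def attributable_risk(incidence_exposed, incidence_unexposed):
    """Extra incidence among the exposed that is due to the exposure."""
    return incidence_exposed - incidence_unexposed

def exposed_attributable_fraction(incidence_exposed, incidence_unexposed):
    """Attributable risk as a proportion of the incidence in the exposed."""
    return (incidence_exposed - incidence_unexposed) / incidence_exposed

# Hypothetical incidences (not real data): 10 per 1,000 in smokers,
# 1 per 1,000 in non-smokers.
smokers, non_smokers = 0.010, 0.001
ar = attributable_risk(smokers, non_smokers)
eaf = exposed_attributable_fraction(smokers, non_smokers)
print(f"AR = {ar:.3f}, EAF = {eaf:.0%}")
```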

In developing health policies, we can also apply the idea of attributable risk to describing the impact of risk factors on the population as a whole. This introduces measures of population attributable risk (PAR) and of population attributable fraction (PAF): statistics that evaluate the impact of a causal factor by substituting incidence in the whole population for incidence in the exposed (see Nerd’s corner).

Population attributable risk

In discussing the impact of preventive programmes, the population attributable risk (PAR) indicates the number of cases that would not occur if a risk factor were to be eliminated:

PAR = Incidence (population) – Incidence (unexposed)

Sadly, this statistic is almost never used, despite its obvious utility in setting priorities for health policy.  Expressed as a proportion of the incidence in the whole population, it yields the population attributable fraction or PAF (which also goes by half a dozen other names):

PAF = [Incidence (population) – Incidence (unexposed)] / Incidence (population)

This statistic, relevant for public health, shows the proportion of all cases of a disease that is attributable to a given risk factor, and was used (for example) to estimate that 40,000 Canadians die from tobacco smoking each year.  A little algebra shows that it depends on the prevalence of the risk factor and the strength of its association (relative risk) with the disease. The formula is:

PAF = Pe (RRe-1)  /  [1 + Pe (RRe-1)],

where Pe is the prevalence of the exposure (e.g., the proportion who are overweight) and RRe is the relative risk of disease due to that exposure.

The population prevented fraction is the proportion of the hypothetical total load of disease that has been prevented by exposure to a protective factor, such as an immunization programme. The formula is:

Population prevented fraction = Pe (1-RR).
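The population formulas can be checked numerically. The sketch below uses hypothetical values (30% of the population exposed to a risk factor with a relative risk of 2.0; 80% covered by a protective factor with a relative risk of 0.25) and confirms that the prevalence-based formula for the PAF agrees with the definition based on incidences.

```python
def population_attributable_fraction(prevalence, relative_risk):
    """PAF = Pe (RRe - 1) / [1 + Pe (RRe - 1)]."""
    excess = prevalence * (relative_risk - 1)
    return excess / (1 + excess)

def population_prevented_fraction(prevalence, relative_risk):
    """Prevented fraction = Pe (1 - RR), for a protective factor (RR < 1)."""
    return prevalence * (1 - relative_risk)

# Hypothetical inputs, for illustration only.
prevalence, rr = 0.30, 2.0
paf = population_attributable_fraction(prevalence, rr)

# Cross-check against the definition, using an assumed incidence of 1%
# in the unexposed.
i_unexposed = 0.01
i_population = prevalence * rr * i_unexposed + (1 - prevalence) * i_unexposed
paf_from_incidence = (i_population - i_unexposed) / i_population

ppf = population_prevented_fraction(0.80, 0.25)
print(f"PAF = {paf:.3f} (from incidences: {paf_from_incidence:.3f}), PPF = {ppf:.2f}")
```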

A useful extension of the attributable risk is the concept of “Number Needed to Treat” (NNT). This is a metric that summarizes the effectiveness of a therapy or a preventive measure in achieving a desired outcome. The basic idea is that no treatment works infallibly, so the number needed to treat shows the number of patients with a condition who must follow a treatment regimen over a specified time in order to achieve the desired outcome for one person. The NNT is calculated as the reciprocal of the absolute improvement produced by the treatment. So, if a medication cures 35% of the people who take it, while 20% recover spontaneously, the absolute improvement is 15%. The reciprocal = 1 / 0.15 ≈ 6.7, which is conventionally rounded up to 7. So, on average, you would need to treat 7 people to achieve 1 cure using the treatment (within the specified time). The NNT can also be applied in describing the value of a preventive measure in avoiding an undesirable outcome, and it can likewise be used in calculating the hazard of treatments such as adverse drug reactions, when it is termed “number needed to harm”.
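The arithmetic for the medication example can be sketched as follows. The function name is illustrative, and rounding up to the next whole patient is a common convention rather than part of the definition.

```python
import math

def number_needed_to_treat(rate_treated, rate_untreated):
    """Reciprocal of the absolute difference in outcome rates."""
    absolute_difference = abs(rate_treated - rate_untreated)
    if absolute_difference == 0:
        raise ValueError("No difference between groups: NNT is undefined.")
    return 1 / absolute_difference

# 35% cured with the medication versus 20% recovering spontaneously.
nnt = number_needed_to_treat(0.35, 0.20)
print(round(nnt, 2), math.ceil(nnt))  # 6.67 7
```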

Risky calculations

A cohort study of the effectiveness of an immunization examined whether or not immunized and non-immunized people became sick. The results are as follows:

                 Sick      Not sick
Immunized        20 (a)    100 (b)
Not immunized    50 (c)    30 (d)
Total = 200

How may we calculate risk? Many are the ways:

Relative risk (RR): [a / (a+b)] / [c / (c+d)] = 0.167 / 0.625 = 0.267
(Note that the immunisation protects, so the result is < 1)

Odds ratio (OR): ad / bc = (20 × 30) / (100 × 50) = 0.12
(Note that, as this is a cohort study, one would generally not use the OR)

Attributable risk (AR): [a / (a+b)] – [c / (c+d)] = 0.167 – 0.625 = -0.458
(A negative AR indicates protection)

Absolute risk reduction (ARR): [c / (c+d)] – [a / (a+b)] = 0.625 – 0.167 = 0.458
(Attributable risk with the sign changed)

Number-needed-to-treat (NNT): 1 / ARR = 1 / 0.458 = 2.18
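For readers who like to check the arithmetic, the same calculations can be reproduced in a few lines of Python, using the counts from the immunization table (a = 20, b = 100, c = 50, d = 30). The variable names are illustrative.

```python
# Counts from the immunization cohort.
a, b = 20, 100   # immunized: sick, not sick
c, d = 50, 30    # not immunized: sick, not sick

risk_immunized = a / (a + b)
risk_not_immunized = c / (c + d)

relative_risk = risk_immunized / risk_not_immunized
odds_ratio = (a * d) / (b * c)
attributable_risk = risk_immunized - risk_not_immunized
absolute_risk_reduction = -attributable_risk
nnt = 1 / absolute_risk_reduction

print(f"RR  = {relative_risk:.3f}")             # 0.267
print(f"OR  = {odds_ratio:.2f}")                # 0.12
print(f"AR  = {attributable_risk:.3f}")         # -0.458
print(f"ARR = {absolute_risk_reduction:.3f}")   # 0.458
print(f"NNT = {nnt:.2f}")                       # 2.18
```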

Inferential Statistics

Medical research seeks to apply information gained from particular study samples to broader populations (of patients, but also of animals, cells, etc). Inferential statistics are used to quantify the uncertainties of such extrapolation.


Most commonly, a study sample is used to estimate a more general value such as the average birth weight of babies in the broader population; this population value is called a “parameter”. To provide an accurate estimate a sample should evidently be representative of the population; a random sample offers a good approach. But because people vary, different samples drawn randomly from the same population are likely to give slightly different results purely due to chance variation in who gets selected. Random sampling from a population only guarantees that, on average, the results from successive samples will reflect the true population parameter, but the results of any particular sample may differ from those of the parent population, sometimes substantially. The unintended differences that arise between study findings simply because of their different samples are known as “sampling error”, “random variation” or “chance error”. But at least we can estimate the accuracy of extrapolating, or generalizing, from a sample to the broader population using inferential statistics such as p-values and confidence intervals (see Additional Material: Statistics).
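Sampling error is easy to demonstrate by simulation. The sketch below builds a hypothetical population of birth weights (the mean of 3,400 g and standard deviation of 500 g are assumptions made for illustration) and then draws repeated random samples of 50 babies from it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 20,000 birth weights in grams.
population = [rng.gauss(3400, 500) for _ in range(20_000)]
mu = statistics.fmean(population)  # the parameter being estimated

# Draw 1,000 random samples of 50 babies and record each sample mean.
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(1_000)
]

print(f"population mean:           {mu:.0f} g")
print(f"average of sample means:   {statistics.fmean(sample_means):.0f} g")
print(f"lowest and highest sample: {min(sample_means):.0f} g, {max(sample_means):.0f} g")
```

On average the sample means sit close to the population mean, yet individual samples can miss it by a noticeable margin purely by chance.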


In statistical terminology, a parameter is the true value in the population (e.g. average birth weight); this is the value that a study sample is being used to estimate. By knowing the population parameter, you can interpret values for your patient: Is this child’s birth weight reassuringly normal for this population?

Parameters in the population are usually designated by Greek letters, while sample estimates of these parameters are indicated by Latin letters:

population mean = μ (pronounced “mu”)
sample estimate of the mean = x̄ (called “x-bar”)
population standard deviation = σ (“sigma”)
sample estimate of the standard deviation = s.


Statistics is the branch of mathematics based on probability theory that deals with analyses of numerical data drawn from samples. It is known as biostatistics when it is applied to biological research. Inferential statistics estimate the likely extent of errors that may arise in applying conclusions from a small study sample to the broader population from which the sample was drawn. In this Primer we provide only a very brief overview of the statistical methods most relevant to evidence-based medicine; you will have to turn to a statistics text for more information.

Significance of differences

Consider an RCT that compares mean blood pressures in an experimental and a control group. Biostatistics provide ways to measure the probability (hence, p-value) that a study result, such as the observed difference in BP, could have occurred merely by chance in this particular study, due to random sampling variation. As a physician, you are interested not so much in this particular study sample, but in whether the results would also apply elsewhere, such as to patients in your practice. If the trend shown in the study could have occurred by chance alone, you should not base your practice on those results!

To estimate the probability that a finding from a study sample might be due to chance, we have to set a threshold for deciding when to consider a finding “real” and when to reject it as a chance finding. Intuitively, the larger the sample size and the larger the difference in mean blood pressures, the more confident we would be that the difference would hold true if the study were to be repeated on another sample, or on your patients. Before analysing the data, the researcher chooses the probability level (the significance level) that will be used to distinguish between results (here the contrast in mean blood pressure) that may be a chance finding in this particular sample, versus differences that are considered “statistically significant”. The threshold chosen is usually 0.05, or 5%: arbitrary but commonly used. Statistical analysis then calculates a p-value for the study result. A p-value below 0.05 means that the probability of getting a result at least as extreme as the observed result is less than 5% if there is really no association between variables in the population (i.e., if the “null hypothesis” is true). Such a difference would then be termed statistically significant. The formula used to calculate the p-value depends on various elements in the design of the study; information will be found in a textbook of biostatistics. If the p-value is above the threshold (for example, p = 0.06 or p = 0.10), chance remains a plausible explanation and the blood pressure researcher should conclude that the study did not find sufficient evidence that the therapy reduced blood pressure. But even if p < 0.05, bear in mind that the conclusion may still be wrong: when there is truly no difference, about 5% of studies will nevertheless produce a statistically significant result by chance (see Here Be Dragons – Statistical and clinical significance are different).
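One way to see what a p-value measures is a permutation test, used here purely as an illustration (it is not necessarily the test a given trial would report). The blood pressure readings are invented. The idea is to shuffle the group labels many times and count how often chance alone produces a difference in means at least as large as the one observed.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for a difference in means, by random relabelling."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend group membership was assigned by chance
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical systolic pressures (mmHg) in a small trial.
treated = [128, 131, 125, 135, 129, 127, 133, 130]
control = [138, 142, 136, 145, 139, 141, 137, 144]

p = permutation_p_value(treated, control)
print(f"p = {p:.4f}", "(statistically significant)" if p < 0.05 else "(not significant)")
```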

Estimating a parameter

The confidence interval (or CI) is a statistic used to indicate the probable extent of error in an estimate of a parameter derived from a sample. An expression such as “mean systolic blood pressure was 120 mmHg (95% CI: 114, 126 mmHg)” indicates that the mean value in the study sample was 120 mmHg, but owing to possible sampling error the actual value in the broader population may not be precisely 120. Based on the size of the sample used and the variability of BP readings within it, we can be 95% confident that the population mean lies between 114 and 126 mmHg; that is, if the study were repeated many times, about 95% of the intervals calculated in this way would contain the true population mean. The confidence interval can be represented graphically as a line or error-bar, as seen in Figure 5.8.
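A minimal sketch of how such an interval might be calculated, assuming a normal approximation (mean ± 1.96 standard errors). The sample size of 24 and standard deviation of 15 mmHg are assumptions chosen so that the result matches the blood pressure example; with a sample this small, a t-distribution would give a slightly wider interval.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    margin = z * sd / math.sqrt(n)             # z times the standard error
    return mean - margin, mean + margin

# Assumed values: sample mean 120 mmHg, SD 15 mmHg, 24 participants.
low, high = mean_confidence_interval(120, 15, 24)
print(f"95% CI: {low:.0f}, {high:.0f} mmHg")  # 95% CI: 114, 126 mmHg
```

Quadrupling the sample size halves the width of the interval, because the standard error shrinks with the square root of n.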

Like mean values, odds ratios and relative risks are also reported with confidence intervals. The confidence interval around an odds ratio indicates whether or not the association is statistically significant, as well as its likely range. For odds ratios and relative risks, if the confidence interval includes 1.0, it is assumed that there is no statistically significant difference between the two groups since an odds ratio or relative risk of 1.0 means that there is no difference between the two groups. For example, a relative risk of 1.4 (95% CI 0.8, 2.1) means that we can be 95% confident that the true relative risk is somewhere in the (very wide) range of 0.8 to 2.1. Furthermore, because this range includes 1.0, it is quite possible that there is no association in the population.
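For ratio measures the interval is usually calculated on the logarithmic scale. The sketch below assumes the standard large-sample (log) method for a relative risk and uses hypothetical counts: 14 cases among 100 exposed people and 10 cases among 100 unexposed people.

```python
import math
from statistics import NormalDist

def relative_risk_ci(a, b, c, d, level=0.95):
    """RR and its confidence interval from a 2 x 2 table.

    a, b: exposed with and without the outcome
    c, d: unexposed with and without the outcome
    """
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts, for illustration only.
rr, lower, upper = relative_risk_ci(14, 86, 10, 90)
includes_one = lower <= 1.0 <= upper
print(f"RR = {rr:.2f} (95% CI {lower:.2f}, {upper:.2f}); includes 1.0: {includes_one}")
```

Here the interval runs from roughly 0.65 to 3.0, so it includes 1.0 and the association is not statistically significant.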

The limits to inferential statistics

A crucial point to recognize is that inferential statistics suggest the level of confidence in generalizing from a random sample to the population from which it was drawn. But for evidence-based medicine, we frequently wish to generalize to other populations, sometimes in other countries, as illustrated in Figure 5.5. Assessing the validity of this more distant extrapolation requires additional insights into the comparability of the populations and the nature of the topic under study. This information cannot be provided by statistics, yet forms a critical element in evidence-based medicine and introduces the problem of sampling bias. The following sections introduce critical appraisal skills for detecting biases in information.

Figure 5.5: Extrapolation from a sample to the target population

Statistical and clinical significance are different

The fact that a difference (e.g., between patients treated with the anti-hypertensive and others on placebo) is statistically significant only tells you that the probability of observing a difference at least that large is lower than some cut-off (typically 5%) if the truth is that there is no difference between treatments in the population (i.e., if the null hypothesis is true). Statistical significance does not directly tell you about the magnitude of the difference, which is important in reaching your clinical decision. For example a drop of 2 mmHg in a large trial of antihypertensive therapy might be statistically significant, but is too small to have clinical importance.

For a study finding to alter your clinical practice, the result must be both statistically significant and clinically important. This thinking resembles that behind the Number Needed to Treat statistic, which also offers a way to summarize the amount of improvement produced by a treatment, rather than just whether it was statistically significant.

To delve deeper, see the box “Significant statistical limitations”.

Significant statistical limitations

When a statistical test shows no significant difference between two groups, this means either that there really is no difference in the population or there may be a difference, but the sample did not reveal it. This may occur because the sample size was not large enough to demonstrate it with confidence (the sample lacked the “power” to detect the true difference; the confidence interval would tend to be wide). It is intuitively clear that a larger sample will give more precision in any estimate; indeed, if you study the whole population there is no need for confidence intervals or statistical significance because you have measured the actual parameter.

The smaller the true difference (such as between patients treated with a new BP medication and those treated using the conventional therapy), the larger the sample size that will be needed to detect it with confidence. Turning this idea around, if a very large sample size is needed to demonstrate a difference as being statistically significant, the difference must be very small, so you should ponder whether a difference that small is clinically important.
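The link between the size of the difference and the sample size needed can be sketched with a common approximation for comparing two means: n per group = 2 (z for significance + z for power)² × SD² / difference². The standard deviation of 15 mmHg, the 5% two-sided significance level and the 80% power are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a difference between two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd ** 2 / difference ** 2
    return math.ceil(n)

# Assumed SD of 15 mmHg; true differences of 10, 5 and 2 mmHg.
for diff in (10, 5, 2):
    print(f"difference of {diff} mmHg: about {sample_size_per_group(diff, 15)} per group")
```

Because the difference is squared in the denominator, detecting a difference one fifth the size requires about 25 times as many participants.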

Sources of Error in Studies


Bias, or the systematic deviation of results or inferences from the truth, is a danger in the design of any study.4 Special care is taken by researchers to avoid (or, failing that, to control for) numerous types of bias that have been identified.11 The many possible biases may be grouped into two general categories: sampling biases (derived from the way that persons in the study were selected) and measurement biases (due to errors in the way that exposures or outcomes were measured).

Sampling (or selection) bias

Simple random sampling seeks to select a truly representative sample of people from a broader population; a more formal definition of the idea is that everyone in the population has an equal (and non-zero) chance of being selected. This is especially important in descriptive studies such as those that estimate prevalence. It may be less important in analytic studies that seek to identify abstract scientific truths such as the association between two variables.12 For instance, a researcher who wants to study the association between arthritis and obesity might be justified in drawing her sample from a population at high risk of obesity in order to get adequate numbers of obese and very obese people in her study.

For practical reasons, very few studies are able to sample randomly from the entire target population, so the researcher usually defines a narrower “sampling frame”, which is assumed to be similar to the entire population. The researcher then draws a sample from that frame. For example, the researcher might sample patients attending Weenigo General Hospital in order to make inferences about patients attending similar hospitals. Referring back to Figure 5.5, sampling bias may then occur at two stages: first, in the choice of sampling frame, because patients attending Weenigo General may differ from patients attending other hospitals in the region and, second, in the method used to draw the sample of patients attending the hospital.

Sampling bias mainly arises when samples are not drawn randomly, so that people in the population do not have an equal chance of being selected. For example, a newspaper advertisement that reads “Wanted: Participants for a study of blood pressure” might attract retired or unemployed people who have the time to volunteer, especially those who have a personal interest in the topic (perhaps they have a family history of hypertension). If these characteristics are, in turn, associated with blood pressure level, an estimate of the population mean BP based on this sample will be biased. Much research is undertaken in teaching hospitals, but patients seen in these centres differ systematically from patients with the same diagnosis seen in rural hospitals—they tend to be sicker, to have more co-morbidities, and often conventional therapy has failed, leading to their referral to tertiary care centres. This can lead to these studies yielding different findings than would be seen in the population of all people with a disease. This is a specific form of selection bias known as referral bias.

A magnetic study

Dr. Rao notes that the study on static magnets for menopausal symptoms assembled its sample by placing an advertisement offering women a free trial of the magnet. He worries that women who answered such an advertisement may have been predisposed to believing that the magnet works, a belief that was possibly established by the advertisement itself. They may not be representative of all women with menopausal symptoms so the results may not be generalizable.

A biased election poll

During the 1948 U.S. presidential elections, a Gallup poll predicted that Dewey, a Republican, was going to win the election against Truman, a Democrat, by a margin of over 10 percentage points. As it turned out, Truman won by 4.4 percentage points. Among the reasons for this poor prediction was the fact that the poll was carried out by telephone. As telephone ownership at the time was limited, and as richer people were more likely to own a phone and also to vote Republican, the sample was probably biased in favour of Republican supporters. This is an example of a biased sampling frame that selected for a confounding variable (wealth) and that led to a false conclusion  (see section on Confounding).

Non-response bias

Even if the sampling method is unbiased, not all of those selected will actually participate. If this is not random, and particular types of people do not participate, this can bias the study results. One way to estimate the likelihood of a non-response bias is to compare characteristics of participants, such as their age, sex and where they live, with those of people who refused to participate. However, even if these characteristics match, it does not rule out bias on other characteristics that were not recorded. It is extremely difficult to adjust estimates for non-response, because even if you know which group is under-represented, those who chose not to respond are likely different from those who did, and you will not know in what way their responses would differ.

Information bias: systematic measurement errors

Measurement error refers to deviations of recorded values on a measurement from the true values for individuals in the study. As with sampling error, measurement errors may sometimes be random, or they can be systematic. For example, social desirability bias refers to systematic response errors whereby people tend to answer questions in a way that will be viewed favourably by others. Most people report that they are more physically active than the average person, which is illogical. Men tend to exaggerate their height and under-estimate their weight.13 Other measurement biases arise from flaws in the questionnaire design: for example, asking people about their physical activity in certain months only may give a biased estimate of their yearly activity level because of seasonal variations in physical activity.  Recall bias commonly occurs in surveys and especially in case-control studies: people’s memories often err. For example, when asked how long it has been since their last mammogram, significantly more women report having had one within the past two years than is shown by studies based on mammography billing records.

Bigger, but no less biased

Increasing sample size or taking more measurements can minimize random sampling and measurement errors but will have no effect on systematic errors; a biased sample or measurement will remain biased no matter how many subjects participate. A large, biased study may be more misleading than a small one!
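A small simulation illustrates the point. It assumes a hypothetical population with a true mean systolic pressure of 120 mmHg and a measurement procedure that systematically reads 5 mmHg too high (for example, a miscalibrated cuff); all of the numbers are invented.

```python
import random
import statistics

TRUE_MEAN = 120.0  # hypothetical population mean (mmHg)
SD = 15.0          # hypothetical spread between individuals
BIAS = 5.0         # systematic error added to every reading

def estimated_mean(n, bias, seed=1):
    """Mean of n simulated readings, each shifted by a constant bias."""
    rng = random.Random(seed)
    readings = [rng.gauss(TRUE_MEAN, SD) + bias for _ in range(n)]
    return statistics.fmean(readings)

for n in (25, 400, 10_000):
    unbiased = estimated_mean(n, bias=0.0)
    biased = estimated_mean(n, bias=BIAS)
    print(f"n = {n:>6}: unbiased estimate {unbiased:6.1f}, biased estimate {biased:6.1f}")
```

As n grows, the unbiased estimate settles near 120 while the biased estimate settles, just as confidently, near 125.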

Figure 5.6: The impact of random and systematic study and measurement errors on an estimate

In the figure, the + sign represents the unknown parameter we are trying to estimate; each dot represents an estimate of the parameter obtained from a sample (it can equally represent a data point from a repeated measurement). The upper sections of the figure illustrate the presence of systematic error and the sample estimates are off-target or biased. In the presence of systematic error, increasing the sample size or making additional (biased) measurements will not bring the study results closer to the truth; it may simply make you think that they are more accurate. Increasing the sample size in the lower section, where there is little systematic error, will reduce the uncertainty of the estimates.

The patterns in the figure above can also be useful in understanding test and measurement validity and reliability, as discussed in Chapter 6. For this, substitute the word “validity” for “systematic error” and “reliability” for “random error.”

Information Bias: the Objectivity of the Researcher

Dr. Rao judges the evidence

When Dr. Rao reads a study report that suggests a relationship between an exposure and an outcome, he needs to be reasonably sure that the results are “true.” By restricting his reading to peer-reviewed journals, Dr. Rao may be reassured about the statistical analysis, but he should still consider other possible explanations for the findings before accepting them as true. This is why he searched for information on the author of the article on magnets and menopause. Was the author in the medical products business, perhaps selling magnets?

Whether looking at print or Internet information, you should try to find out as much as possible about the source, to check its credibility and to identify possible conflicts of interest. Trials published by people with a financial stake in the product under investigation are more likely to conclude in favour of the product than are trials published by people with no financial interest. The U.S. Food and Drug Administration and the Federal Trade Commission have proposed questions to ask when judging the sources of information:

  1. Who is behind it?
  2. Why is the information being published?
  3. Where does the information on the website come from?
  4. Is the information well documented?
  5. Who is responsible for the information?
  6. How old is the information?

This section offers only a brief overview of the many potential types of study bias. Several authors have undertaken more extensive reviews, as shown in the Nerd’s corner box.

Many types of bias

Epidemiologists seem fascinated by bias (perhaps in an attempt to prove they are dispassionate?) and in 1979, David Sackett catalogued over one hundred named biases that can occur at different stages in a research project. Here are  the main headings of Sackett’s catalogue:11

Biases in the Literature Review
– Selective choice of articles to cite

Study Design biases
– Selection bias
– Sampling frame bias
– Non-random sampling bias
– Non-coverage bias
– Non-comparability bias

Study Execution:  Data Collection
– Instrument bias
– Data source bias
– Subject bias
– Recall bias
– Data handling bias

Study Execution:  Data Analysis
– Confounding bias
– Analysis strategy bias
– Post hoc analysis bias

Biased Interpretation of Results
– Discounting results that do not fit the researcher’s hypothesis

– Non-publication of negative findings.

Real nerds can look up the original article and review subsequent literature to complete the catalogue. Half-hearted nerds should remember that systematic error can creep in at any stage of a research project, so that research reports should be read critically; the reader should question what happened at every stage and judge how this might affect the results.


Confounding

A study in the 1960s reported a significant tendency for Down syndrome to be more common in fourth-born or higher order children.14 There was no obvious sampling or measurement bias and the result was statistically significant, so would you believe it? Your answer may be “yes” in terms of the existence of an association, but “no” if the implication is a causal one. In other words, birth order may be a risk marker, but not a risk (or causal) factor.

Figure 5.7: An example of confounding

Confounding arises when a third variable (or fourth or fifth variable, etc.) in a causal web is associated with both an exposure and the outcome being studied (see Figure 5.7). If this third variable is not taken into account in a study, the relationship between exposure and outcome may be misinterpreted. In the Down syndrome example, the mother’s age is a confounding factor in that fourth-born and higher order infants tend to be born to older mothers, and maternal age is an independent risk factor for Down syndrome. In most scientific articles, the first table compares the study groups (e.g., mothers with a Down infant and others without) on a number of variables that could affect the outcome, such as mean maternal age at the child’s birth. This allows the reader to determine whether any of these variables is associated with the outcome, and so may act as a potential confounding factor that should be adjusted for in the analysis.

Confounded hormones

Before 1990, a number of observational studies concluded that post-menopausal women who took hormone replacement therapy were less likely to develop cardiovascular problems than those who did not. It was therefore recommended that post-menopausal women should take hormone replacement therapy. Subsequently, however, the Women’s Health Initiative undertook a randomized trial that showed quite the opposite: in fact, hormone replacement therapy was associated with an increase in cardiovascular disease. HRT recommendations were quickly changed.

It seems likely that the observational studies were biased by self-selection into the hormone group, and the bias was linked to social status which was acting as a confounding factor: women of higher social status were more likely to take hormone replacement therapy and also less likely to have heart disease.15

Dealing with confounding

Confounding can be minimized at the design stage of a study, or at the analysis stage, or both.

In experimental designs, random allocation to intervention and control groups is the most attractive way to deal with confounding. This is because random allocation tends to balance all characteristics, whether measured or not, between the study groups, the more so if the groups are large. Nonetheless, all factors that may confound results should be measured and compared in each group at the start of a study. This should be reported to enable the reader to judge whether, despite randomization, potential confounders were more prevalent in one group than the other.

To complement randomization, the study sample can be restricted, for instance, to one sex or a narrow age range. This reduces the confounding effect of the factors used to restrict the study, but it limits the study’s generalizability as the results apply only to that restricted population. Another design strategy is matching: that is, the deliberate selection of subjects so that the level of known confounders is equal in all groups to be compared. For example, if sex, age, and smoking status are suspected confounders in a cohort study, the researcher records these characteristics in the exposed group and then samples people in the unexposed group who are similar in terms of these factors.

At the analysis stage of a study, stratification can be used to examine confounding. In stratification, the association between exposure and outcome is examined within strata formed by the suspected confounder, such as age. Published reports often mention a Mantel-Haenszel analysis, which is a weighted average of the relative risks in the various strata. If differences arise between the stratum-specific estimates and the crude (unadjusted) overall estimate, this suggests confounding. Another analytic strategy uses multivariable modelling techniques, such as logistic regression, to adjust a point estimate for the effects of confounding variables. The underlying concept of multivariable modelling is similar to that of standardization (see Chapter 6), a technique used to adjust for differing demographic compositions of populations being compared.
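Stratification can be illustrated with invented numbers loosely modelled on the birth order example. In the sketch below, the counts are hypothetical and are arranged so that, within each maternal age group, later-born and earlier-born children have the same risk. The Mantel-Haenszel relative risk is calculated with the usual formula: the sum over strata of a(c+d)/n, divided by the sum over strata of c(a+b)/n.

```python
def relative_risk(a, b, c, d):
    """RR from one 2 x 2 table: a, b exposed (cases, non-cases); c, d unexposed."""
    return (a / (a + b)) / (c / (c + d))

def mantel_haenszel_rr(strata):
    """Summary RR across strata, each given as a tuple (a, b, c, d)."""
    numerator = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical counts. Exposure = fourth-born or later; outcome = Down syndrome.
strata = {
    "younger mothers": (1, 999, 9, 8991),
    "older mothers": (40, 3960, 10, 990),
}

# Crude table: add the strata together, ignoring maternal age.
crude = [sum(cells) for cells in zip(*strata.values())]
print(f"crude RR = {relative_risk(*crude):.2f}")
for label, table in strata.items():
    print(f"{label}: RR = {relative_risk(*table):.2f}")
print(f"Mantel-Haenszel RR = {mantel_haenszel_rr(strata.values()):.2f}")
```

The crude relative risk is above 4, but the stratum-specific and Mantel-Haenszel estimates are 1.0: in these invented data the apparent effect of birth order is entirely due to maternal age.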

Beware: selection and measurement biases cannot be corrected at the analysis stage. Here, only careful sample selection and the use of standardized measurement procedures can minimize these biases.

The Hierarchy of Evidence

Study designs vary in the extent to which they can control various forms of error, so some designs provide more reliable evidence than others. In 1979 the Canadian Task Force on the Periodic Health Examination first proposed the idea of a hierarchy of evidence; this arose as a by-product of their work formulating recommendations about screening and preventive interventions. The hierarchy implies that clinicians should consider the study design when judging the credibility of evidence. The hierarchy has been modified over the years;16 here is a generic version:

I Evidence from systematic reviews or meta-analyses
II Evidence from a well-designed controlled trial
III Evidence from well-designed cohort studies, preferably from more than one centre or research group
IV Evidence from well-designed case-control studies, preferably from more than one centre or research group
V Evidence obtained from multiple time-series studies, with or without the intervention. Dramatic results in uncontrolled experiments (e.g., first use of penicillin in the 1940s) are also included in this category
VI Opinions of respected authorities, based on clinical experience, descriptive studies, reports of expert committees, consensus conferences, etc.

GRADEing studies

Broadening the basis for reviewing study quality, the GRADE system has been proposed for those who review evidence in preparing clinical guidelines. This considers four axes: the design of a study, the study quality, consistency across studies and “directness”, or the comparability of samples studied to the patients to whom they will be applied. These judgments are combined to form four categories: High (further research would be unlikely to change the estimate of effect); Moderate (further research is likely to change our confidence in the current estimate of effect, and may change the estimate itself); Low (further evidence is likely to change the estimate); Very Low (an estimate of effect is very uncertain).17

Systematic reviews

A common source of bias in summarizing the literature is the omission of some studies from consideration. These could be studies undertaken in countries outside one’s own, or studies published in less well-known journals. While such omissions may simplify the task of summarizing the literature, they can also lead to bias, often by omitting studies that provide discordant views. The aims of a systematic review are to identify all relevant studies related to a given treatment or intervention, to evaluate their quality, and to summarize all of the findings. A key feature is the comprehensiveness of the literature review; conclusions should be based on the whole literature, often including the “grey” literature of reports published as working documents or internal reports.18 A systematic review must follow a rigorous and explicit search procedure that can be replicated by others. The author of a systematic review formulates a narrative summary of the combined study findings (as in the Cochrane Reviews, see below); but where the articles reviewed are similar enough, their data may be pooled into a combined meta-analysis.


A meta-analysis provides a statistical synthesis of data from separate yet comparable studies. It is generally accepted that a meta-analysis of several randomized controlled trials offers better evidence than a single trial. The analysis can either pool data from each person in the various studies and re-analyze the combined data, or else aggregate the published results from each study, producing an overall, combined numerical estimate. This is normally weighted according to the relative sizes of the studies, and sometimes also by a judgment of their quality. Meta-analyses can be applied to randomized controlled trials and to other study designs including case-control or cohort studies. In those cases where differences in study design mean that data cannot be pooled, results of different studies may be summarized in a narrative review or presented in a forest plot, as shown in Figure 5.8.


Figure 5.8: An example forest plot, comparing odds ratio results from four case-control studies (square symbols with confidence intervals shown by the horizontal lines), and the result of the meta-analysis (diamond symbol) that pooled the results of the individual studies. The size of the squares indicates the relative sample sizes in each study. The vertical line marks the odds ratio of 1.0, indicating no difference in risk between the study groups. Results on the left of the vertical line indicate a reduction in risk (OR < 1.0); results to its right indicate an increase (OR > 1.0).
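To make the pooling step concrete, here is a minimal sketch of one common way to aggregate published odds ratios: a fixed-effect, inverse-variance weighted average on the log scale. This is an assumed method chosen for illustration, not necessarily the one behind Figure 5.8, and the four sets of counts are hypothetical rather than taken from real studies.

```python
import math


def log_odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Return the log odds ratio of a 2x2 table and its approximate variance."""
    log_or = math.log((exposed_cases * unexposed_controls)
                      / (unexposed_cases * exposed_controls))
    variance = (1 / exposed_cases + 1 / unexposed_cases
                + 1 / exposed_controls + 1 / unexposed_controls)
    return log_or, variance


def pooled_odds_ratio(tables, z=1.96):
    """Fixed-effect, inverse-variance pooled odds ratio with a 95% interval."""
    weighted_sum = 0.0
    total_weight = 0.0
    for table in tables:
        log_or, variance = log_odds_ratio(*table)
        weight = 1 / variance  # larger studies have smaller variance, so more weight
        weighted_sum += weight * log_or
        total_weight += weight
    pooled = weighted_sum / total_weight
    standard_error = math.sqrt(1 / total_weight)
    return (math.exp(pooled),
            math.exp(pooled - z * standard_error),
            math.exp(pooled + z * standard_error))


# Hypothetical counts for four case-control studies (not real data):
# (exposed cases, unexposed cases, exposed controls, unexposed controls)
studies = [(30, 70, 20, 80), (45, 155, 30, 170), (12, 38, 10, 40), (120, 380, 90, 410)]

for table in studies:
    log_or, _ = log_odds_ratio(*table)
    print("study OR:", round(math.exp(log_or), 2))
print("pooled OR and 95% CI:", [round(x, 2) for x in pooled_odds_ratio(studies)])
```

The pooled estimate always lies within the range of the individual estimates, and its confidence interval is narrower than that of any single study, which is why the diamond in a forest plot is typically more compact than the lines around the squares.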

The Cochrane Collaboration

Systematic reviews and meta-analyses are normally undertaken by specialized content and research experts; they often work in teams such as those assembled through the Cochrane Collaboration. This is an international organization whose goal is to help scientists, physicians, and decision makers make well-informed decisions about health care by coordinating systematic reviews of the effects of health care interventions. The reviews are published electronically in the Cochrane Database of Systematic Reviews. An early example of a systematic review was that on the use of corticosteroids for mothers in premature labour to accelerate fetal lung maturation and prevent neonatal respiratory distress syndrome. Babies born very prematurely are at high risk of respiratory distress owing to their immature lungs, a significant cause of morbidity and mortality. The results of 21 randomized trials gave no indication that the corticosteroid treatment increased risk to the mother, and showed that it produced a 30% reduction in neonatal death and similar benefits on a range of other outcomes. Therefore antenatal corticosteroid therapy was widely adopted to accelerate fetal lung maturation in women at risk of preterm birth.

Meta-analyses are now considered to provide the highest level of evidence, so this has changed the original Canadian Task Force hierarchy of evidence. With the idea of literature review in mind, level I of the hierarchy of evidence shown above is now sub-divided into

1.1 Cochrane reviews
1.2 Systematic reviews
1.3 Evidence-based guidelines
1.4 Evidence summaries.

The Final Step: Applying Study Results to your Patients

The topic of systematic reviews brings us full circle, back to critical appraisal of the literature and evidence-based medicine. This begins with formulating an overall judgment of the quality of the study or studies, and for this there are a number of checklists. The original ones were developed at McMaster and published in a series of papers in the Journal of the American Medical Association from 1993 to 1994. They described critical appraisal in judging evidence for causation, prognosis, the accuracy of diagnosis, and effectiveness of therapy. To illustrate the general format, we list some questions used to appraise an article about the effectiveness of a therapy.

Checklist for study quality

Are the results valid?

  • Were the patients randomized?
  • Was randomization concealed?
  • Were patients aware of their group allocation? (Were the patients “blinded”?)
  • Were clinicians who treated patients aware of the group allocation?
  • Were outcome assessors aware of group allocation? (Were the assessors blinded?)
  • Were patients in the treatment and control groups similar with respect to known prognostic variables? (For instance, were there similar numbers of smokers in each group in a study of asthma therapy?)
  • Was follow-up complete?
  • Were patients analyzed in the groups to which they were allocated?

What are the results?

  • How large was the treatment effect?
  • How precise was the estimate of the treatment effect?

How can I apply the results to patient care in general?

  • Were the study patients similar to patients in my practice?
  • Are the likely treatment benefits worth the potential harm and costs?

What do I do with this patient?

  • What are the likely outcomes in this case?
  • Is this treatment what this patient wants?
  • Is the treatment available here?
  • Is the patient willing and able to have the treatment?
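The two questions under “What are the results?” (the size of the treatment effect and its precision) can be made concrete with a small calculation. The sketch below uses hypothetical trial counts and an assumed helper, `treatment_effect`; the confidence interval uses a simple normal approximation for the difference between two proportions, which is only one of several available methods.

```python
import math


def treatment_effect(events_treated, n_treated, events_control, n_control, z=1.96):
    """Summarize a two-arm trial with a binary outcome (e.g., death)."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    relative_risk = risk_treated / risk_control
    absolute_risk_reduction = risk_control - risk_treated
    # Normal-approximation standard error of the risk difference.
    standard_error = math.sqrt(
        risk_treated * (1 - risk_treated) / n_treated
        + risk_control * (1 - risk_control) / n_control
    )
    return {
        "relative_risk": relative_risk,
        "relative_risk_reduction": 1 - relative_risk,
        "absolute_risk_reduction": absolute_risk_reduction,
        "arr_ci": (absolute_risk_reduction - z * standard_error,
                   absolute_risk_reduction + z * standard_error),
        # Number needed to treat is only meaningful when the treatment helps.
        "number_needed_to_treat": (math.ceil(1 / absolute_risk_reduction)
                                   if absolute_risk_reduction > 0 else None),
    }


# Hypothetical trial: 70 of 1000 treated patients and 100 of 1000 controls die.
result = treatment_effect(70, 1000, 100, 1000)
for name, value in result.items():
    print(name, value)
```

In this example a 30% relative reduction corresponds to an absolute reduction of only three percentage points, so the same relative effect would matter far less to patients whose baseline risk is low. The width of the interval around the absolute reduction answers the question about precision.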

Once you are satisfied that a study provides a valid answer to a relevant clinical question, check that the results are applicable to your patient population.

Target population

Is the study population similar to your practice population, so that you can apply the findings to your own practice (illustrated in Figure 5.5)? Consider whether the gender, age group, ethnic group, life circumstances, and resources used in the study are similar to those in your practice population. For example, a study of the management of heart disease might draw a sample from a specialist cardiovascular clinic. Your patient in a family medicine centre is likely to have less severe disease than those attending the specialist clinic and, therefore, may respond differently to treatment. Narrow inclusion and broad exclusion criteria in the study may mean that very few of your patients are comparable to the patients studied. Furthermore, the ancillary care and other resources available in the specialist centre are also likely to be very different from a primary care setting. If these are an important part of management, their absence may erase the benefits of the treatment under study. Other aspects of the environment may also be different: the results of a study set in a large city that found exercise counselling effective may not apply to your patients in a rural area where there are fewer opportunities for conveniently integrating exercise into everyday life.


Is the intervention feasible in your practice setting? Do you have the expertise, training, and resources to carry out the intervention yourself? Can you refer your patient somewhere else, where the expertise and resources are available? In many cases, practical problems of this type mean that an intervention that has good efficacy in a trial does not prove as effective when implemented in usual practice. The enthusiasm and expertise of the pioneers account for some of this difference; the extra funds and resources used in research projects may also have an effect.

How much does it cost?

The cost includes the money needed to pay for the intervention itself, the time of the personnel needed to carry it out, the suffering it causes the patient, the money the patient will have to pay to get the intervention, ancillary medical services, transport, and time off work. An intervention that costs a great deal of time, money or suffering may not be acceptable in practice.

Intervention in the control group

What, if anything, was done for people in the control group? If they received nothing, including no attention by the researchers, could a placebo effect have accounted for part of the effect observed in the intervention group? In general, new interventions should be compared to standard treatment in the control group so the added benefits and costs of the new can be compared to those of the old treatment.

Your patient’s preferences

Finally, all management plans should respect patients’ preferences and abilities. The clinician should explain the risks and benefits of all treatments—especially novel ones—in terms the patient understands, and support the patient in choosing courses of action to reduce harm and maximize benefit.


The Limits to Evidence-Based Medicine

The central purpose of undertaking research studies in medicine is to guide practice. A fundamental challenge lies in the variability of human populations, so that a study undertaken at a different time and in a different place may or may not provide information relevant to treating the patient in front of you. This dilemma demands that the physician should neither unquestioningly apply research results to patient care, nor ignore the evidence of research findings in the rapidly evolving field of medicine (see Nerd’s Corner box).

EBM has its critics

Several cautions have been raised against the unthinking application of results from empirical studies. Various authors with a philosophical bent have discussed the notion of what should constitute “evidence” in medicine, and especially how different types of evidence should be integrated in delivering optimal patient care. EBM grants priority to empirical evidence derived from clinical research, but how should this be combined with the physician’s clinical experience, with underlying theories of disease and healing, with the patient’s values and preferences, and with the real-life constraints of resources? These sources of insight differ in kind and may lie in tension, so it is not clear how the clinician resolves this.19

Other authors have addressed the logical conundrum of generalizing from study findings to a particular patient: clinical trials do not generate universal knowledge, but tell us about average results from particular (and often highly selected) samples studied. EBM does not supply clear guidelines as to how the clinician should decide how closely the patient at hand matches the patients in the study and so how relevant the study may be. Faced with this uncertainty, the clinician may feel that his or her personal clinical experience is more compelling than the general research evidence. Translational medicine is increasingly developing ways to guide medical decision-making using cost/benefit calculations.20

Other critics of EBM have focused on the primacy of the randomized trial. Worrall, for example, has noted that randomization is just one way, and an imperfect one, of controlling for confounding factors that might bias results. In the presence of large numbers of confounding factors it is unlikely that randomization will balance all of these between the study groups. Worrall concludes that any particular clinical trial will have at least one kind of bias, making the experimental group different from the control group in relevant ways. He argues, in effect, that RCTs are not inherently more reliable than a well-designed observational study.21

Related to this concern is the common failure to replicate the findings of many clinical trials. Ioannidis has reported that, of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these 45, fewer than half (44%) were replicated; 16% were contradicted by subsequent studies; for a further 16%, subsequent studies found a smaller effect than the original; and the rest had not been repeated or challenged.2 It is well known that even high-quality studies funded by pharmaceutical companies are three to four times more likely to show the effectiveness of an intervention than studies funded from other sources. A related problem is publication bias: studies showing null results are less likely to be published. There is active debate over the future development of evidence-based medicine, and ways to combine it with other forms of evidence seem likely to be proposed in coming years.2

Self-test questions

1. In your capacity as advisor to the Minister of Health, how would you design and implement a series of studies to determine the relationship between personal stereo use and noise-induced hearing loss?

First, discuss study designs. Assuming that a randomized trial on humans would be unethical, what is the feasibility of a cohort study? What would the timeline be? If you resort to a case-control study, how would you collect information on listening volume? Do you actually need to study this at the individual level? Can you get a crude but useful approximation at the population level by correlating the incidence of deafness with sales of devices (in which case you would not need to interview anyone)?

Second, consider data collection: How accurate may self-reporting be? Might people whose hearing is impaired give a biased report, given that they are now suffering a medical problem? Could you instead modify some personal stereo devices to automatically record the duration and volume of their use?


  1. National Institutes of Health. Citations added to MEDLINE by fiscal year. U.S. National Library of Medicine; 2017 [cited 2017, November]. Available from:
  2. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005;294(2):218-28.
  3. Lett J. A field guide to critical thinking. 1990 [cited 2017, November]. Available from:
  4. Porta M, editor. A dictionary of epidemiology. New York (NY): Oxford University Press; 2008.
  5. Sackett DL, et al. Evidence-based medicine – how to practice and teach EBM. London: Churchill-Livingstone; 2000.
  6. Schwandt TA. Qualitative inquiry: a dictionary of terms. Thousand Oaks (CA): Sage Publications; 1997.
  7. Nkwi P, Nyamongo I, Ryan G. Field research into socio-cultural issues: methodological guidelines. Yaounde, Cameroon: International Center for Applied Social Sciences, Research and Training/UNFPA, 2001.
  8. Richards L, Morse JM. Read me first for a users’ guide to qualitative methods. Thousand Oaks (CA): Sage Publications; 2007.
  9. Cockburn J, Hill D, Irwig L, De Luise T, Turnbull D, Schofield P. Development and validation of an instrument to measure satisfaction of participants at breast cancer screening programmes. Eur J Cancer. 1991;27(7):827-31.
  10. Smith GCS, Pell JP. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ. 2003;327(7429):1459-61.
  11. Sackett DL. Bias in analytic research. J Chron Dis. 1979;32:51.
  12. Miettinen OS. Theoretical epidemiology: principles of occurrence research in medicine. New York (NY): John Wiley; 1985.
  13. Connor-Gorber S, Shields M, Tremblay MS, McDowell I. The feasibility of establishing correction factors to adjust self-reported estimates of obesity. Health Reports. 2008;19(3):71-82.
  14. Renkonen KO, Donner M. Mongoloids: their mothers and sibships. Ann Med Exp Biol Fenn. 1964;42:139-44.
  15. Anderson GL, Judd HL, Kaunitz AM, et al. Effects of estrogen plus progestin on gynecologic cancers and associated diagnostic procedures: The Women’s Health Initiative randomized trial. JAMA. 2003;290:1739-48.
  16. Evans D. Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions. J Clin Nursing. 2003;12(1):77-84.
  17. GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328(7454):1490.
  18. Liberati A, Altman DG, Tetzlaff J, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med. 2009;151:W65-W94.
  19. Tonelli MR. Integrating evidence into clinical practice: an alternative to evidence-based approaches. Journal of Evaluation in Clinical Practice. 2006;12(3):248-56.
  20. Solomon M. Just a paradigm: evidence-based medicine in epistemological context. European Journal for Philosophy of Science. 2011;1:451-66.
  21. Worrall J. Evidence in medicine and Evidence-Based Medicine. Philosophy Compass. 2007;2(6):981-1022.