The use and abuse of Evidence Based Medicine

Towards Knowledge Based Medicine

James W Fairley BSc FRCS MS
Consultant ENT Surgeon
71 Mill Court
Ashford Kent TN24 8DN

Ian D B Hore MSc FRCS
Consultant Paediatric ENT Surgeon
St Thomas’ Hospital
London SE1 7EH

Last updated 27 May 2008
© 2007-2016 JW Fairley, IDB Hore

We thank Dr Mark Haggard, PhD, CBE of MRC Multi-Centre Otitis Media Study Group, Cambridge, UK for his detailed comments on earlier drafts of this manuscript and for introducing the concept of KBM. The opinions expressed and any errors remaining are those of the authors alone.



Sackett (1996) defined Evidence Based Medicine as
“The conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients”
The definition has been broadened to include other areas of practice. Several terms have emerged.

  • Evidence Based Medicine
  • Evidence Based Health Care
  • Evidence Based Practice
  • Evidence Based Approach

They all carry similar meaning. McMaster University (2000) defined an Evidence Based Approach as follows
“An Evidence Based Approach is one where the clinician is aware of the evidence that bears on their practice and the strength of that evidence”

What it is not

Advertisement for Lamplough’s Pyretic Saline from The Graphic, Summer Issue, June 1894. A barrister (Queen’s Counsel) gives a legal opinion on the strength of the evidence supporting the manufacturer’s claims to efficacy of the product.

“ EBM is not new ”
  1. EBM is not new. Our Victorian forebears practised evidence based medicine. Their evidence came from detailed clinical description, correlating the symptoms and signs in life with the pathological findings. The pathological findings were often post-mortem, since the diseases they dealt with were serious and the therapeutic options limited. Like today, the accumulation of personal experience and skill lay at the core of clinical practice. There was a lively exchange of views at national and international meetings. Regular series of published case reports, reviews and polemic were to be found in the medical literature of the day. Systematic reporting of clinical outcomes was, however, rare. The main differences between then and now are that
    • the sheer volume of information available to modern doctors is far greater
    • the rate of change of medical knowledge continues to increase exponentially, so the half life of knowledge is much shorter than in the past. Hence the importance of the word “current” in the definition above.
    • it is no longer possible for any individual to remain up to date in anything other than a small niche of practice
    “ Modern EBM is explicit about what kind of evidence we are using, and the strength of that evidence is rated ”
    • modern EBM is explicit about what kind of evidence we are using, and the strength of that evidence is rated.
  2. EBM is not just about randomised controlled clinical trials (RCTs). The best available evidence (other than the experience of the individual practitioner) could be from any clinically relevant research. Areas of clinically relevant research include
    • basic medical sciences
    • patient centred clinical research
    • accuracy and precision of diagnostic tests
    • power of prognostic markers
    • efficacy and safety of therapeutic interventions
    • efficacy and safety of preventive measures

    Randomised controlled clinical trials do give reliable information about the effectiveness of some treatments. Where the choice of treatment is

    • controversial
    • subject to professional disagreement
    • advocated out of personal conviction
    • lacking any obvious superiority of outcome

    there is no real alternative to RCTs. But they cannot, for example, tell you

    • if smoking causes cancer
    • what is really important to patients
    • if there are any new diseases which may go on to become a major world health problem
    • if a new medicine is really safe and free from rare long term serious side effects
    “ The best kind of evidence depends on the question being asked ”

    The best (richest, most reliable) kind of evidence depends on the question being asked. Expanding upon the examples above:

    • In the most famous study to establish a link between smoking and cancer, Richard Doll followed up thousands of UK doctors. Starting in 1951, he compared doctors who smoked with those who did not. He observed that many more smokers went on to get lung cancer (Doll 1952). For this prospective cohort study (amongst a whole body of work) he was knighted. A randomised trial would have been not only unethical but practically impossible to arrange.
    • To find out what is important to patients, we need to ask them. Open questions are essential. Patients must be allowed to express their own perspective, in their own words. The result is a rich form of data, often containing unexpected findings. Processing this unstructured information in a standardised way can give reproducible results. In a Grounded Theory Study (Glaser and Strauss, 1967), multiple interviews with patients / service users are recorded and transcribed. Investigators go through all the material, looking for themes. Interviews are continued until no new themes arise. The themes are then fed back to those interviewed, to check that this is really what they meant. These qualitative studies can uncover information that RCTs cannot. They help us know which outcomes are important to patients.
    • The first descriptions of HIV / AIDS were case reports, not RCTs. The format of medical case reports was already established by the Victorian era. Although the case report – and its big brother, the case series – are not rated highly in the hierarchy of academic publications, this old-fashioned method successfully identified the emergence of a new disease of global importance.
    • Long term vigilance and post-marketing surveillance are the only realistic way that rare side effects of drugs will emerge. In the UK, the centralised Yellow Card reporting scheme for suspected side effects of drugs was introduced in 1964, following the thalidomide debacle. It has proven to be the most valuable method for discovering Adverse Drug Reactions (ADRs). One example of a rare adverse effect involves Triludan (terfenadine), one of the early non-sedating antihistamines. A rare (one in a million) side effect when combined with macrolide antibiotics was sudden death due to cardiac arrhythmia. An RCT would be unlikely to discover this information, even if tens of thousands of patients took part. Since 2005, UK patients as well as professionals have been able to report suspected adverse drug reactions directly to the Medicines and Healthcare products Regulatory Agency on its website.
    “ EBM is not meant to reduce the art of medicine to painting by numbers ”
  3. EBM is not a blanket approach. It is not meant to reduce the art of medicine to painting by numbers. Evidence based treatment recommendations and guidelines are not to be applied indiscriminately to groups of patients, without regard to individual circumstances. The evidence base is only to be used where appropriate, in the context of the individual’s life as a whole, and must take account of what is important to the individual.
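
The point made above about rare side effects and trial size can be put into numbers. The sketch below is a minimal illustration, assuming every patient independently carries the same one in a million risk quoted for the terfenadine example. The function name and the numbers of patients are hypothetical choices for illustration, not figures from any real trial.

```python
def chance_of_seeing_event(risk_per_patient: float, patients: int) -> float:
    """Probability that at least one event occurs among `patients` people,
    assuming independent, identical risk (an illustrative simplification)."""
    return 1.0 - (1.0 - risk_per_patient) ** patients


if __name__ == "__main__":
    risk = 1e-6  # "one in a million", as quoted in the text
    for n in (1_000, 50_000, 1_000_000):  # hypothetical numbers exposed
        print(f"{n:>9,} patients: {chance_of_seeing_event(risk, n):.1%}")
```

Under these assumptions a trial of 50,000 patients has a little under a 5% chance of containing even one such event, and a single event would in any case be hard to attribute to the drug. Only when exposure runs into the millions, as it does after marketing, does the event become likely to be seen at all.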

What it is

Good EBM – what Haggard (2007) is now calling KBM (Knowledge Based Medicine) – uses all kinds of evidence. The approach varies. The most appropriate type of evidence depends on the question to be answered. An evidence-based approach is useful in studies of

  • Aetiology
  • Diagnosis
  • Differential diagnosis
  • Symptom prevalence
  • Prognosis
  • Prevention
  • Therapeutics
  • Harmful effects of treatment
  • Economic decision analysis

EBM is easy to access. The most up to date evidence is now widely available, mainly thanks to the World Wide Web. Online access is now standard in UK healthcare settings. This is a rapidly evolving area. Useful resources include:

  • The TRIP (Turning Research Into Practice) database. This website uses sensitive and specific strategies to find the best available evidence from a very wide net of resources. You can even log your searches for CPD.
  • The Cochrane collaboration is an international collaborative group which aims to systematically find, appraise and review available evidence on all aspects of healthcare. They focus particularly on randomised controlled trials (RCTs). The organization has evolved into the form of a not-for-profit publishing network. Volunteer contributors are supported by a small paid staff. The output of regularly updated evidence-based healthcare databases is freely available online. The Cochrane library is accessible to professionals and public alike. Cochrane searches include areas that are not indexed by automated databases such as PubMed. Hand searching of conference abstracts can discover unpublished studies. Anyone can contribute to this work by contacting a Cochrane Centre.
  • The Centre for Evidence Based Medicine in Oxford provides numerous resources on how to practise EBM.
  • The Oxford-based independent journal Bandolier have been producing digestible EBM monthly since 1994. Articles are short and punchy. Full content is free, without subscription. Their tagline is “Evidence based thinking about health care”. The name comes from their presentational style. Distillation to the essentials is the aim. A series of bullet points are like the bullets in a bandolier. Bandolier popularised Number Needed to Treat (NNT) analysis.
  • Electronic access to journals, for example via an Athens Account Password. Anyone working in an NHS or academic institution can apply for a free Athens account. Several major medical journals, including the BMJ, now provide free electronic access to their full text papers.
  • The NHS Electronic Library for Health is a continually expanding resource that helps with access to all the above.
  • CHAINs – Contact, Help, Advice and Information Networks. These are online networks for people working in health and social care. They are based around specific areas of interest, and give people a simple and informal way of contacting each other to exchange ideas and share knowledge. CHAINs are multi-professional and cross organisational. Researchers can find appropriate evidence and avoid replicating work for which answers are already available.

Why our clinical practice should be based on best available evidence

  1. Budgets. Resources are finite in all healthcare systems. How do we decide which new advances should be paid for? Priorities have to be agreed. Commissioners are accountable for their decisions. They must show they have considered the best evidence available. The UK government set up the National Institute for Clinical Excellence (NICE) to do this. NICE primarily study new interventions. They try to assess benefit versus cost. Ultimately, their task is to decide which new treatments should be available on the NHS.
  2. A move away from paternalistic medicine. “Doctor knows best” has been replaced by a cooperative approach, where the patient takes an active part in decisions. Areas of uncertainty are acknowledged. Drawing attention towards, rather than away from, areas of uncertainty is consistent with an evidence based approach to medicine. Uncertainties may not have been emphasised previously (Light, 1979).
  3. Increased direct access to evidence by patients themselves. Nearly all the UK population now have internet access. Health is one of the more popular subjects on search engines.
  4. Increased medico-legal activity.
“ Drawing attention towards, rather than away from, areas of uncertainty is consistent with an evidence based approach ”

Care needed when using EBM

EBM is a powerful tool. Like a chainsaw, it needs skilled and careful handling. Errors in interpretation and implementation are dangerous. One common error is to conclude that, because there is no evidence of effectiveness for a certain treatment, that treatment is ineffective. It is essential to understand that

No evidence of effectiveness does NOT mean evidence of no effectiveness

Systematic reviews of surgical treatments which use strict methodology suffer from a lack of admissible evidence. This is because very few high quality randomised controlled trials (RCTs) of surgical treatments are done. RCTs of surgical treatments are not done for various reasons:

  • The benefit of treatment is obvious
  • There are difficulties in defining outcome measures
  • There are difficulties controlling for surgical skill/preference
  • Techniques are continuously evolving
  • Equipoise is uncommon in surgical practice.

There are inherent limitations in attempting to apply the RCT methodology in surgery.


Definitions of equipoise vary (Gifford, 2001), but it can be regarded as the condition in which there is genuine uncertainty, with the scales of judgement balanced equally between two possible courses of action. It is only when clinical equipoise exists that it becomes ethical to advise patients that the choice of treatment may reasonably be based on chance alone. But whose equipoise is it? Does it have to exist in the mind of each individual doctor seeing each individual patient, or is it an attribute of the research community as a whole? And if the research community is uncertain about treatment options, but an individual doctor has a preference in a particular case, is it ethical for that doctor to pretend he doesn’t really know which is best, and recommend his patient enters a randomised controlled trial? Many doctors are unhappy about sublimating their finely honed instincts, their accumulated experience, their nous, all the unquantifiable nuances that contribute to clinical judgement, to the random results of the toss of a coin. Treatment at random under these circumstances has been described as a betrayal of Hippocrates (Retsas, 2004). Surgeons in particular make crucial decisions with immediate feedback of results, and they need to know what to do. Sitting on the fence, delaying decisions, is not what patients expect of their surgeon. Whether surgeons really do know what is best for their patients, or are merely opinionated and possessed of “indefensible certainty” (Burton, 2007) is of course open to debate.

Even where RCTs have been attempted, many fail to reach successful completion. Often this is because patients in the control group are not happy. They simply go off and get treatment somewhere else.

An example: RCTs of grommets in glue ear

Parents of children with middle ear effusions agreed to be part of a study where the control group were randomised not to have grommets. When it came to it, most of the control group were unhappy. They were not prepared to accept their child being disadvantaged by continued reduced hearing when others were successfully treated. As soon as the glue ear persisted beyond a time where it may reasonably have been expected to clear by itself, most of the control group elected for surgery. In the study by Maw (1999) 85% in the watchful waiting group had grommets by 18 months. For good methodological reasons, the results were analysed on the basis of the groups as originally randomised. This is known as an Intention to Treat (ITT) analysis. The other options would have been

  • to analyse only the results of those who stuck to the protocol
  • to analyse on the basis of treatment actually received

The first option greatly reduces the power of the study. Both options would introduce potentially large amounts of bias. Yet, by including the hearing results of children who had grommets in the control group, we automatically dilute the estimate of efficacy of grommets in the treatment group. This matters a great deal when the study results are summarised. Commentators who have neglected to consider the detail point out the small effect size and use it as a weapon to attack the “ineffective” operation.
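
The dilution described above can be illustrated with simple arithmetic. The sketch below is not a reanalysis of Maw (1999). The 12 dB hearing gain and the assumption that unoperated ears gain nothing are hypothetical figures chosen for illustration; only the 85% crossover rate comes from the text.

```python
def itt_difference(gain_with_surgery: float, gain_without: float,
                   crossover_fraction: float) -> float:
    """Difference in mean hearing gain between arms, analysed as randomised.

    Assumes everyone in the surgery arm is treated, and that a fraction of
    the watchful-waiting arm crosses over to surgery and gets the full gain.
    """
    surgery_arm_mean = gain_with_surgery
    waiting_arm_mean = (crossover_fraction * gain_with_surgery
                        + (1.0 - crossover_fraction) * gain_without)
    return surgery_arm_mean - waiting_arm_mean


if __name__ == "__main__":
    gain_db = 12.0      # hypothetical gain from grommets, in dB
    no_gain_db = 0.0    # hypothetical: unoperated ears do not improve
    crossover = 0.85    # proportion of controls who had grommets (from the text)

    print("Operated vs unoperated ears:", gain_db - no_gain_db, "dB")
    print("Intention to treat estimate:",
          round(itt_difference(gain_db, no_gain_db, crossover), 2), "dB")
```

Under these assumptions the intention to treat difference is only 15% of the difference between operated and unoperated ears, because 85% of the comparison group received the same operation.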

Efficacy vs effectiveness

“ Grommets are really very good at restoring hearing, but you might not need them. You might get better anyway ”

Efficacy is not the same thing as effectiveness. Efficacy is a generic property, applied to a treatment. A question of efficacy asks whether the treatment has the desired and expected effect on the organ in question. Effectiveness is more specific. It is grounded in the detailed clinical stage of the condition being studied. A question of clinical effectiveness asks whether, given this stage of the disease, the treatment being studied offers a worthwhile benefit compared to another option. In the case of the grommet study, the other option was to continue watchful waiting.

The basis of watchful waiting in glue ear is that 50% of newly diagnosed cases will resolve within 3 months, and 75% within 6 months. If they don’t, grommets can be fitted later. But Maw’s cases in Bristol were stringently selected, and already had persistent glue. During the study period, 85% of the control group had continuing problems. Paradoxically, a well designed and well executed study gave an artificially low estimate of treatment efficacy. But that was not the question the study was designed to answer. The study was designed to answer a question of considerable relevance to NHS practice, where we see patients who are likely to wait some months before treatment is given. Does it matter if your grommet operation is delayed a few months? You might get better anyway, and avoid the risks of surgery.

Most specialists in the field railed at the apparent absurdity of the ITT analysis. But almost everyone forgot the question the trial was designed to answer. The clinical question is whether, given these children with this stage of the disease, there is any advantage to giving the grommets now as opposed to watchful waiting with an option to fit them later. The results showed that there was an advantage, but it was relatively small. Simplistic commentators then pin the low estimate of benefit onto the treatment itself, rather than conveying the more complex and subtle message.

Grommets are really very good at restoring hearing, but you might not need them. You might get better anyway. The research question then changes into trying to predict the children who are not going to get better, and treat those. This is not a simple concept to put across.

Interpreting trial results: doctors and spin doctors

“ Nearly everyone prefers a clear and simple message… Unfortunately, real life medicine is rarely that simple ”

Nearly everyone prefers a clear and simple message. Are grommets any good? I heard on BBC Radio 4’s respected Today Programme that they were useless and a waste of taxpayers’ money. Unfortunately, real life medicine is rarely that simple. Not many doctors have time to pore over the detail of clinical trials. Even summaries of systematic reviews are beyond most. Can we rely on the medical news? It is very easy to fall prey to the wiles of the spin doctor.

The interpretation of clinical trials is rarely simple. Demand management is a priority for health funding bodies. Rubbishing the product is one method of reducing demand. Deferring to the expertise of specialists in the field is now known as Producer Capture. For politicians and civil servants, allowing your government department to become a victim of Producer Capture is a bad career move. Spin doctors charged with reducing demand can easily cast doubt over the ethics of surgeons, paid to operate. They also trumpet recent pay rises earned by General Practitioners, neglecting to mention that they are being paid extra for meeting government targets. Some of these targets are themselves evidence based, for example the control of blood pressure. Spin doctors are the modern incarnation of propagandists. To influence mass opinion, they reduce complex issues to simple ideas.

“ Considered, thoughtful opinion is useless to the spin doctor ”

Considered, thoughtful opinion is useless to the spin doctor. Catchy slogans are needed. Active news management relies on gaining the attention of editors. Archie Cochrane himself complained about the tendency for newspapers to pander to the public’s desire for simplistic answers (Cochrane, 1972). In the era of the sound bite, even university students have the attention span of a gnat. Channel-hopping viewers skip over 24-hour rolling news broadcasts. Unless hooked within 2 or 3 seconds, web users surf on. It becomes almost impossible to put across the balanced, guarded, nuanced messages of EBM. The open admission of uncertainty, which lies at the heart of EBM, is unpalatable and risks ridicule.

“ open admission of uncertainty, which lies at the heart of EBM, is unpalatable and risks ridicule ”

Public health policy has always been a political matter. Modern politics in Western democracies consists largely of trying to stay in tune with public opinion. While media headlines determine health policy, the spin doctor over-rules the medical doctor. In many ways, truly scientific evidence-based practice is the antithesis of modern politics. To quote Tallis (2004)

“The commitment to minimizing the role of chance, of bias, or of wishful thinking, is what scientific medicine requires. Avoiding beliefs guided by delusive hope, unfounded authority, superstition and plain stupidity, it cultivates an attitude of healthy skepticism towards itself to prevent its practitioners from misleading themselves or their patients. Its permanent strategy of active uncertainty, and the humility this implies, is the distinctive virtue of scientific medicine. In the world outside of scientific medicine, however, humanity has had little time to adjust to this almost inhuman scrupulousness.”

Ethics of surgical trials

“ we can’t have surgeons making the mark of Zorro then not doing anything ”

Some surgical trials cannot be done for ethical reasons. The sham cardiac surgery procedures of the past (Cobb et al, 1959) could not be done nowadays. It is very difficult to design control groups for surgical interventions that are both ethical and scientifically valid. An obvious factor is scarring. Patients know whether they have a scar or not. But we can’t have surgeons making the mark of Zorro then not doing anything.

Ethics of the commissioner

Health commissioners have a difficult task. Resources are finite in all healthcare systems. Faced with inexorably rising demand, and fixed budgets, they look to EBM to provide some degree of objective guidance. In deciding between competing demands for funding, it is perfectly reasonable to insist on reliable evidence that

  • the burden of disease is significant
  • the treatment is effective.

But this is not always easy to prove. And failure to commission treatments, simply on the basis that high level evidence is not available, may well deny patients beneficial treatment.

An Evidence Based Approach – step by step

“ ask a clearer question ”

An evidence based approach starts with a question. You then look for evidence to help answer the question. Critical appraisal means you don’t just accept what you are told at face value. You have to use your critical faculties to appraise the evidence. Clearly, you are not going to be able to do this personally, every single time. You need to know where to find evidence that has already been critically appraised, and how to check that it really does apply to the patient in front of you. The steps are as follows:

  1. Define the Clinical Question being asked. Note that defining knowledge gaps and formulating them into questions is one of the hardest parts of an evidence based approach. In Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, the supercomputer Deep Thought was constructed to answer the question of “Life, The Universe and Everything”. The answer, after millennia of computations, was 42. When the populace complained that this answer didn’t make sense, Deep Thought responded that they would have to ask a clearer question. If your question is regarding the effectiveness of a treatment, you should consider the following headings.
    • What is the population?
    • What is the intervention?
    • What is the comparison?
    • What is the outcome?
  2. Search for the Evidence – use the key terms from the defined question to search databases like
  3. Critically appraise the evidence
  4. Most importantly, decide whether the evidence really applies to the patient in front of you.

That final step requires your individual clinical expertise and judgement. Clinical expertise and judgement develop with training and experience. Clinical judgement is the part which is often neglected in discussions of EBM (this article included). Please do not mistake the lack of discussion of clinical expertise and judgement as indicating a neglect of its importance. It is the crucial final link. Without it, the whole exercise fails its purpose. We are in practice to help individual patients. EBM is only a means to an end, not an objective in itself.
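
The four headings listed under step 1 above can be captured in a small data structure, which also makes the search in step 2 easier to repeat. This is a minimal sketch: the class name, the field names and the way the search string is built are illustrative assumptions, not the query syntax of any particular database. The example values are taken from the grommet trial discussed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalQuestion:
    """A treatment question broken down under the four headings."""
    population: str
    intervention: str
    comparison: str
    outcome: str

    def search_terms(self) -> str:
        """Join the key terms with AND (a generic, hypothetical query form)."""
        parts = (self.population, self.intervention,
                 self.comparison, self.outcome)
        return " AND ".join(f'"{p}"' for p in parts)


if __name__ == "__main__":
    question = ClinicalQuestion(
        population="children with persistent glue ear",
        intervention="grommets",
        comparison="watchful waiting",
        outcome="hearing",
    )
    print(question.search_terms())
```

Writing the question down in this form tends to expose a missing comparison or a vague outcome before any searching begins.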

Critical Appraisal

“ You shouldn’t believe everything that is written in the papers. That includes medical research papers ”

You shouldn’t believe everything that is written in the papers. That includes medical research papers. The combined vigilance of medical journal editors and the peer review process gives some quality assurance. The WAME (2008) World Association of Medical Editors ethical policy is all well and good. But if all editors were as selective as those of the major journals, they would quickly run out of stuff to publish.

Limitations of peer review

“ Most papers, however flawed, will get published somewhere ”

Traditional academic journals use peer review to quality-control content. The limitations of the process were highlighted on the alternative Web-based academic publishing blog, Scholarship 2.0 (Arms, 2007).

  • A reviewer can only comment on whether the research appears to be well done
  • In assessing this, s/he relies entirely on what the authors have written
  • The reviewer is rarely in a position to check up upon, or validate, the statements of the authors
  • The honesty of the authors has to be assumed
  • A reviewer cannot normally repeat the trial to check the results are correct

Not all rejected papers are bad. A reviewer may reject a paper for reasons of prejudice, academic jealousy, or findings which cast the reviewer’s own work in a poor light. Most papers, however flawed, will get published somewhere if the authors are sufficiently persistent in submitting. Journal editors have space to fill.

Current clinical practice is often based on older research, when publication standards were less rigorous.

What does critical appraisal mean in practice?

Critical appraisal means reading the research critically, and sorting the wheat from the chaff. If you just accept the conclusions of the first couple of abstracts that came top of your PubMed search, you might be OK. But would you bet your life on it being right? Would you bet your patient’s life on it? If you can’t critically appraise evidence, you will forever rely on what other people tell you to think. That might not worry you too much. You might be happy to rely on others for knowledgeable advice. But what if two respected sources give you opposite advice? You still have to make a decision. There is plenty of help available on how to make decisions on the value of published papers.

Critical Appraisal Resources

Trisha Greenhalgh (1997) wrote an excellent series of articles on how to read a paper, subsequently published as a book. The first paper in the series is available online from the British Medical Journal.

For those interested in taking the process further, tools are now available online to help interpretation of different types of study.

The NHS Public Health Resources Unit (PHRU) Critical Appraisal Skills Programme (CASP) provides a range of resources. These tools are provided free for download. They are designed to help critical appraisal of papers. The following types of research are covered

  • Systematic Reviews
  • Randomised Controlled Trials (RCTs)
  • Qualitative Research
  • Economic Evaluation Studies
  • Cohort Studies
  • Case Control Studies
  • Diagnostic Test Studies

Short courses in critical appraisal are available, such as those run by Superego Cafe.

Quantitative studies

Quantitative studies are done where we have something we can measure, for example blood pressure or rate of stroke in a population. The distinction between quantitative and qualitative studies is not absolute. Some data from qualitative studies – for example a questionnaire survey on patients’ attitudes and beliefs – can be processed in a highly quantitative way. The following types of studies are generally considered to be quantitative.

  • RCT
  • Controlled trial
  • Prospective Cohort
  • Retrospective Cohort
  • Cross Sectional Study

In quantitative studies, key questions are

  • What is the exposure (e.g. treatment)?
  • What is the outcome?
  • Is there an association between the two?
  • Is that association due to chance?
  • Is the association due to the way the study was done?
  • Is it relevant to my individual patient?

Deciding whether the association is due to chance invariably raises questions of probability and statistics.

Statistical probability theory: Are there really lies, damned lies and statistics?

“ statistics don’t lie, people lie ”

Before getting bogged down in maths, and details of which tests to use and when, one important fact should be understood above all else:

All conventional statistical tests of probability take the form of an if – then statement.

An if – then statement takes the form

If x is true, then y should happen

The if is one of the assumptions underlying the test. All statistical tests of probability rely on assumptions. A common assumption is that the different factors that may influence the result act independently of one another. These assumptions are not always explicitly acknowledged. They tend to be ignored by those who use statistics

“like a drunken man leaning on a lamppost, for support rather than illumination”

The underlying assumptions must be valid, otherwise the statistical test is not. So,

  • You should never believe the results of a statistical test without knowing what assumptions were made in applying it, and checking whether they really apply to the data.
  • It is technically incorrect to say that “you can prove anything with statistics”.
  • Statistics don’t lie, people lie. Mostly, it comes down to a question of trust.

In fact, you can never prove anything to be true with statistics. You just reach a known level of probability, and even that known level of probability is based on assumptions that always have to be made in designing the hypothesis.

What does a p-value mean, and why has it been replaced by effect size and confidence intervals?

A p-value is the probability of seeing results at least as extreme as those observed, if chance alone were at work and there were no real effect (the null hypothesis). p is no longer used a great deal in clinical studies. Nevertheless, the p-value is still used in basic sciences and remains an important concept to understand. The design principles of the RCT are more easily appreciated when simplified to hypothesis testing using a p-value. Significance testing using p-values was, until the 1980s, the commonest way of establishing that results were likely to be real, and not due to chance.
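
As a worked illustration of a p-value, the sketch below computes an exact two-sided p-value for a coin toss experiment. The scenario (60 heads in 100 tosses) and the function name are hypothetical. The "if" in this case is the assumption that the coin is fair and the tosses are independent.

```python
from math import comb


def two_sided_p_value(heads: int, tosses: int) -> float:
    """Exact p-value under the assumption that the coin is fair.

    Sums the probability of every outcome that is at least as far from
    the expected count (tosses / 2) as the observed number of heads.
    """
    expected = tosses / 2
    observed_distance = abs(heads - expected)
    total = 0
    for k in range(tosses + 1):
        if abs(k - expected) >= observed_distance:
            total += comb(tosses, k)  # number of sequences with k heads
    return total / 2 ** tosses


if __name__ == "__main__":
    print(f"p = {two_sided_p_value(60, 100):.3f}")
```

Under these assumptions the result falls just above the conventional 0.05 cut-off, so 60 heads in 100 tosses would not quite count as statistically significant evidence of a biased coin.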

Type I statistical error (false positive finding)

A conventional standard is to accept a p-value of 0.05 as statistically significant. p = 0.05 means that, if there were really no association, results like those observed would arise by chance 5% of the time. Using such a cut-off means we accept that, in 5% of studies where there is no association, we will conclude that there is one. Put more simply, of every twenty studies of treatments that do nothing, one can be expected to report a positive association at the 0.05 level. Being wrong in this way is known as a Type I error, or a false positive finding.
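
The false positive rate can be checked by simulation. The sketch below runs many imaginary trials in which both groups are drawn from the same population, so any "significant" difference is a Type I error. Everything here is an illustrative assumption: the function names, the group size of 30, the number of trials, and the use of a simple z-test with a known standard deviation of 1.

```python
import math
import random


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means, known standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    std_error = sigma * math.sqrt(1 / n_a + 1 / n_b)
    z = abs(diff) / std_error
    return math.erfc(z / math.sqrt(2))  # P(|Z| > z) for a standard normal


def false_positive_rate(trials=4000, group_size=30, alpha=0.05, seed=1):
    """Fraction of trials with no real difference that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        b = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        if z_test_p_value(a, b) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"False positive rate: {false_positive_rate():.3f}")
```

The printed rate should sit close to 0.05, whatever the group size, because the cut-off itself fixes how often chance alone crosses it.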

“ The p-value of a lottery jackpot winner in the UK is around 0.00000007 (one in fourteen million), yet it happens by chance most weeks ”

There may be instances when we wish to be more certain. Choosing a smaller p-value of 0.01 as statistically significant means we will be wrong only one time in a hundred. With a p-value of 0.001 we will be wrong only one time in a thousand, and 0.0001 one in ten thousand. This can go on indefinitely, but we can never be entirely sure that the results have not arisen by chance. The p-value of a lottery jackpot winner in the UK is around 0.00000007 (one in fourteen million), yet it happens by chance most weeks. The medical literature is full of Type I errors. For example, if a busy medical journal’s output for the year contained 100 papers reporting positive findings at the p=0.05 level, we would expect five of them to be wrong due to Type I error. And we wouldn’t know which were the wrong ‘uns. We might have an idea that something doesn’t sound very plausible. But the only way we could really find out would be when the study was replicated, and the positive finding was not repeated. Even with a pair of positive studies at the p=0.05 level, there is still a chance that we are observing a chance effect. The probability is the product of the two p-values. 0.05 times 0.05 gives 0.0025, a one in four hundred chance. One in four hundred events do happen quite often.
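The arithmetic in this passage can be checked with a few lines of Python. This is only a sketch: it assumes the two studies are independent and that the treatment truly has no effect, and it assumes a 6-from-49 lottery draw (an assumption made here to reproduce the "one in fourteen million" figure). The function name is made up for illustration.

```python
from math import comb

def chance_all_significant(n_studies: int, alpha: float = 0.05) -> float:
    """Chance that n independent studies of an ineffective treatment
    all reach significance at the given level purely by chance."""
    return alpha ** n_studies

# A pair of positive studies at the p = 0.05 level.
pair = chance_all_significant(2)
print(pair, "is about one in", round(1 / pair))

# Assumed 6-from-49 lottery: one winning combination among all possible tickets.
tickets = comb(49, 6)
print("one in", tickets, "=", 1 / tickets)
```

Adding a third independent positive study would multiply by 0.05 again, which is why replication is so much more persuasive than a single small p-value.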

How to make a Type I error: Data dredging and the Texas sharpshooter

“ The best defence against the Texas sharpshooter is to make him draw his targets in advance ”

Since computerised records have become routine, vast quantities of clinical data are held on various systems throughout the world. Some researchers make it their business to search through that data, looking for patterns and associations. This is an ideal way to produce the Type I error, to discover associations that are there due to chance. The search is known as data dredging. When the results are presented, the presenter can be described as a Texas sharpshooter. The Texas sharpshooter blasts away with his pistol on a barn door. After the bullets have hit, he walks up and chalks a target circle around each one. He then boasts how good a shot he is. The best defence against the Texas sharpshooter is to make him draw his targets in advance. This usually means a prospective study, with a predefined null hypothesis.
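Why data dredging is such a reliable way to produce a Type I error can be shown with a short calculation. The sketch below assumes that each association examined is an independent test at the p = 0.05 level and that none of the associations is real; both are simplifying assumptions, and the function name is hypothetical.

```python
def chance_of_spurious_hit(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent
    tests when no real association exists."""
    return 1 - (1 - alpha) ** n_tests

# The more targets are chalked after the shooting, the surer the "hit".
for n_tests in (1, 10, 20, 100):
    print(n_tests, "tests:", round(chance_of_spurious_hit(n_tests), 3))
```

With a single predefined hypothesis the chance of a spurious hit stays at 5%. With a hundred looks at the data, at least one "significant" association is close to certain.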

Effect size and confidence interval

When using the results of trials to help guide clinical practice, it is generally much more useful to know how much of an effect the treatment had. This is known as the Effect Size (ES). The measured effect size comes directly from the trial data and is known. But that measured effect size would not be exactly the same if we repeated the study again. Nor would it have been exactly the same if we happened to have recruited some slightly different patients to those we actually had. We therefore also need an estimate of how accurate the study was in measuring effect size. The spread of possible values of effect size, up to a certain level of probability, is known as the confidence interval (CI). The 95% CI is often used, by convention, and is analogous to the p-value of 0.05. It means that there is a one in twenty chance that the real effect size is greater than the upper limit, or less than the lower limit, of the confidence interval. A 99% confidence interval means that we would expect only one in a hundred repetitions of the study to give an effect size beyond the limits. Naturally, a 99% confidence interval is wider than a 95% confidence interval for the same data. When we have two sets of observations, we can calculate an effect size and confidence interval at our chosen level. If the selected confidence interval for the effect size includes zero then we cannot say that we have demonstrated a positive effect of treatment.
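As a rough illustration of these ideas, the sketch below computes an effect size (here, a simple difference in group means) and its confidence interval. It uses a normal approximation, which is an assumption; for samples this small a t-distribution would usually be preferred. The data are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_difference_ci(treated, control, level=0.95):
    """Effect size (difference in means) with a normal-approximation CI."""
    effect = mean(treated) - mean(control)
    # Standard error of the difference between two independent group means.
    se = sqrt(stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return effect, effect - z * se, effect + z * se

# Invented symptom-score improvements, for illustration only.
treated = [4.1, 5.0, 6.2, 5.5, 4.8, 6.0, 5.3, 4.9]
control = [3.9, 4.2, 5.1, 4.0, 4.6, 3.8, 4.4, 4.7]

for level in (0.95, 0.99):
    effect, low, high = mean_difference_ci(treated, control, level)
    verdict = "excludes zero" if low > 0 or high < 0 else "includes zero"
    print(f"{level:.0%} CI: {effect:.2f} ({low:.2f} to {high:.2f}), {verdict}")
```

Running it for both levels shows the 99% interval coming out wider than the 95% interval for the same data.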

Type II statistical error (false negative finding)

A Type II error means the study failed to find a difference that really exists. The usual cause of a Type II error is that there weren’t enough patients in the trial.

Power calculation and the signal to noise ratio

“ If the intervention has a huge effect, it will become pretty obvious pretty quick, and we won’t need many patients to see it ”

How do we know how many patients are needed in a trial? Well, it depends on the size of the effect. If the intervention has a huge effect, it will become pretty obvious pretty quick, and we won’t need many patients to see it. If the effect size is small, we may need hundreds or even thousands of patients to show it. This is sometimes called the signal to noise ratio. The signal is the effect, the noise is all the background random variation.

Estimating effect size in advance and the minimum important difference

Mostly, we don’t know what the true effect size is, that’s what we are trying to find out. But, we can take an educated guess.

  • We can look at previous studies.
  • We can also decide, in advance, how big an effect size we would regard as worthwhile.

We can then design the study to be able to find an effect of that size. When we say we, we do mean we. Doctors and patients should decide together on what would be the minimum important difference in outcome that would lead to a change in practice. For example, the trialists might interview rhinitis patients to find out how much better they would have to feel to make it worth them taking a twice daily nasal spray.

“ if there is a lot of variability (noise) in the data, you will need a bigger sample ”

As well as having an idea of the size of the effect, you need some idea of its variability. How big are the differences between individuals due to natural variation? If there is a lot of variability (noise) in the data, you will need a bigger sample. The conventional parametric statistical method (which assumes a normal distribution – not always the case) is to estimate the standard deviation. Based on these considerations, most study protocols estimate the sample sizes needed to give an 80% chance of picking up the effect, should it exist. If we want to be more sure of finding an effect, we could go for a higher percentage chance, and that would mean more patients.
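A minimal sketch of such a power calculation follows. It uses the standard normal-approximation formula for comparing two group means, n per group = 2 × ((z for alpha + z for power) × SD / difference)², and assumes equal group sizes, a two-sided test and normally distributed outcomes. The function and parameter names are hypothetical; dedicated software will usually apply further corrections.

```python
from math import ceil
from statistics import NormalDist

def patients_per_group(min_important_difference, standard_deviation,
                       alpha=0.05, power=0.80):
    """Approximate patients needed in each arm to detect the given
    difference in means with the requested power (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # threshold for significance
    z_power = NormalDist().inv_cdf(power)          # allowance for the chosen power
    noise_to_signal = standard_deviation / min_important_difference
    return ceil(2 * ((z_alpha + z_power) * noise_to_signal) ** 2)

# Same noise (SD = 1), progressively smaller effects to detect.
for difference in (1.0, 0.5, 0.2):
    print(difference, patients_per_group(difference, 1.0))

# Asking for a 90% rather than 80% chance of finding the effect needs more patients.
print(patients_per_group(0.5, 1.0, power=0.90))
```

Halving the signal to noise ratio roughly quadruples the number of patients required, which is why small effects call for very large trials.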

Changing sample size during a study – ethics of early study termination

Power calculations are done before the study begins, at the design stage. Once the study is under way, if it becomes obvious that we have a bigger beneficial effect than anticipated, it may be ethical to stop the trial before reaching the number of participants estimated by the power calculation. This cannot always be done. If the study is double blind, it will be necessary to break the code to discover in which group the larger than expected effects have occurred.

Power calculation resources

Free software to help with power calculations is available from the Simple Interactive Statistical Analysis (SISA) website.

Meta-Analysis: pooling study results to increase power and avoid the Type II error

In the recent past, it was common to see studies reported which had inadequate numbers of patients. A Type II error was commonly present, but not always recognised. Often, different studies would give contradictory results. A reviewer would, typically, choose the results of a selection of favoured studies and come up with an overall conclusion. He may well give undue weight to some studies – for example those conducted by people he knew and trusted, studies he had read recently – while ignoring others. He may not have known about some relevant studies, for example in foreign language journals. A better method of reviewing was called for. In a systematic review, the reviewers

  1. Search systematically for all the relevant studies.
  2. Rate those studies according to pre-defined quality criteria.
  3. Where possible, combine the results, so as to maximize the numbers of patients considered.

“ by pooling the results of lots of studies we may be able to see the hidden effect more clearly ”

These are the principles underlying systematic reviews and meta-analysis. As outlined above, the commonest cause of a Type II error – failure to show a difference when there is one – is inadequate sample size. If the natural variation in the outcome measure is high, and the effect size is small, it will be hidden in the background noise. Background noise, being random, should cancel out with large numbers. By pooling the results of lots of studies we may be able to see the hidden effect more clearly. The commonest method for doing this is to produce a forest plot. The forest plot forms the centre of the Cochrane Collaboration logo.
[Figure: Cochrane Collaboration logo]

The diagram shows the results of a systematic review of seven RCTs of a short, inexpensive course of a corticosteroid given to women about to give birth too early. The first of these RCTs was reported in 1972. The diagram summarises the evidence that would have been revealed had the available RCTs been reviewed systematically a decade later.

  • The vertical line is the line of no effect – the treatment does neither good nor harm.
  • The diamond mark to the left is the pooled result of all the studies taken together.
  • The diamond to the left of the line means that overall the treatment is beneficial.
  • If the diamond were to the right of the vertical line, that would indicate the treatment does more harm than good.
  • The result of each individual trial is represented by one horizontal line.
  • Each line represents the 95% confidence interval (CI) for the effect recorded in a study. The CI is an estimate of the range within which there would be a 95% chance of finding the effect size, if the study was repeated on the whole population of eligible patients instead of just the sample taken.
  • The narrower the confidence interval, the more precise the study’s result.
  • A diamond that does not touch the vertical line therefore represents an effect of the treatment beyond a placebo.

The forest plot indicates strongly that corticosteroids reduce the risk of babies dying from the complications of prematurity. By 1991, seven more trials had been reported, and the picture had become still stronger. This treatment reduces the odds of the babies dying from the complications of prematurity by 30 to 50 per cent.
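The pooling step behind a forest plot can be sketched in a few lines. This is a minimal illustration using fixed-effect, inverse-variance weighting, which is one common method and not necessarily the one used in the review described above. The log odds ratios and standard errors are invented for illustration (they are not the corticosteroid trial data), and the function name is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

def pool_fixed_effect(estimates, standard_errors, level=0.95):
    """Inverse-variance weighted average of study estimates with its CI."""
    weights = [1 / se ** 2 for se in standard_errors]  # precise studies count more
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se

# Invented log odds ratios (negative = fewer deaths on treatment) and standard errors.
log_odds_ratios = [-0.7, -0.3, -0.8, -0.2, -0.5]
standard_errors = [0.45, 0.40, 0.60, 0.35, 0.50]

# Each horizontal line of the plot: one study's 95% confidence interval.
z95 = NormalDist().inv_cdf(0.975)
single_study_intervals = [(e - z95 * se, e + z95 * se)
                          for e, se in zip(log_odds_ratios, standard_errors)]

# The diamond: the pooled estimate and its narrower interval.
pooled, pooled_low, pooled_high = pool_fixed_effect(log_odds_ratios, standard_errors)

for low, high in single_study_intervals:
    print(f"study:  odds ratio interval {exp(low):.2f} to {exp(high):.2f}")
print(f"pooled: odds ratio {exp(pooled):.2f} ({exp(pooled_low):.2f} to {exp(pooled_high):.2f})")
```

In this invented example every individual study's interval crosses the line of no effect (odds ratio 1), yet the pooled interval does not. That is the situation in which a systematic review reveals an effect that no single underpowered trial could show.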

“ The purpose of the Cochrane review is to discover what is already known ”

Because no systematic review of these trials had been published until 1989, most obstetricians had not realised that the treatment was so effective. As a result, tens of thousands of premature babies have probably suffered and died unnecessarily (and needed more expensive treatment than was necessary). This is just one of many examples of the human costs resulting from failure to have a structured programme to evaluate new health technologies. Performing systematic, up-to-date reviews of RCTs of health care does, of course, rely on the research studies being done in the first place. The purpose of the Cochrane review is to discover what is already known. It is especially valuable where the effect size is too small, amongst the other variability, to be highly obvious within one caseload.

Unethical to start new research without systematic review of previous work

The main reason for doing systematic reviews is to make maximum use of research that has already been done. The Cochrane Collaboration was not founded to encourage more RCT’s. It was founded to realise the value of work already done. In business terms, it is sweating the asset. The asset is that gigantic and ever-expanding repository of information contained in the medical literature. Checking what is already known avoids wasteful and pointless duplication of research effort. It minimises delay in getting the benefits of research based knowledge into practice. Identifying gaps in the knowledge base is another important function, to help direct future research efforts.

In 2005, The Lancet announced a policy to tackle unnecessary and badly presented research (Young and Horton, 2005). They stated that unnecessary clinical trials

  • harm patients and volunteers
  • waste resources
  • abuse the trust of participants

Conducting trials without prior systematic reviews and meta-analyses is improper, scientifically and ethically.

Authors of clinical trials are now required to

  • include a clear summary of previous research findings, and
  • explain how the results of their trial affect this summary.

The position should be illustrated by reference to a systematic review and meta-analysis (Clarke et al, 2007). Where this does not exist, authors are strongly encouraged to produce their own, before starting any clinical trial.

When can we do without trials and statistics? A randomised controlled trial of parachutes

“ Statistics is an aid to common sense, not a substitute ”

To judge the results of a trial, we usually (but not always) need statistics. If, let us say, we were conducting a randomised controlled trial of the effectiveness of parachutes on survival when jumping out of an aeroplane at 10,000 feet, we would have the following null hypothesis:

“if parachutes are ineffective, then the mortality rate will be the same, whether or not the parachute is worn”

When the first randomly assigned participant without a parachute hit the ground at terminal velocity, we might decide that we didn’t need any statistics, perhaps not even a trial, to decide this question. The first rule of statistics is – or should be – that you don’t always need statistics. Statistics is an aid to common sense, not a substitute.

RCT’s rare in surgical practice

The parachute trial is an extreme example, but surgeons are increasingly told that there is “no evidence base” for the majority of their work. One reason is they don’t need a trial to tell them that controlling that bleeding artery is the right thing to do.

Effect size and time interval

“ where the signal to noise ratio is high, an RCT is not necessary ”

In controlling a bleeding artery, the effect size is large, and the time interval between intervention and observable result is very short. The signal to noise ratio is very high. Skill, training and judgement are needed to achieve the result, and none of these are amenable to double blind randomised controlled trial. Although this should be obvious, it has only recently been emphasized in EBM circles that, where the signal to noise ratio is high, an RCT is not necessary (Glasziou, 2007).

The RCT of bleeding arteries, like the RCT of parachutes, will never, ever, be done. If someone was foolish enough to look for, fail to find, then publish the fact that there is no RCT evidence for the benefit of controlling a bleeding artery, Archie Cochrane would turn in his grave. He was a practising doctor, who served his time burying his tuberculous patients as a prisoner in the Second World War.

When we do need trials and statistics

The sort of cases where statistics are needed are where the effect size is small, and the time interval between intervention and result is long – like most drug trials. That is what RCT’s were designed for, and that is what they are good at.

The RCT model can and should be applied to some surgical interventions. Such interventions are typically for conditions where

  • the time course is chronic relapsing and remitting
  • the underlying pathophysiology is unclear
  • hard outcome measures are lacking
  • the benefit of intervention is not immediately obvious
  • the most appropriate outcome measure is not apparent
  • there are other treatment options (including no treatment)

“ the Shamanistic rituals of surgery induce a sizeable placebo effect ”

These are good grounds for questioning the value of any intervention. Most surgeons would agree that such procedures should be subject to randomised controlled trials. Designing RCT’s for surgical interventions is, however, considerably more difficult than for medical interventions. Consequently, few surgical trials are done. Of those that are published, most fail the strict criteria for inclusion in Cochrane systematic reviews. Surgical trials are not as easy to organise as drug trials. In a drug trial, it makes little difference who writes the prescription. In a surgical trial, the skill of the individual operator is a major factor. Randomisation is possible, but concealment of treatment allocation is not. No-one wants a blindfolded surgeon operating on them. It is likely that the Shamanistic rituals of surgery induce a sizeable placebo effect (Green, 2006), yet sham surgery is not ethical as a control group. These factors combine to make the surgical literature a barren landscape when searching for high quality RCT’s. That is no reason not to look, and no reason not to try, but the absence of strong RCT-based evidence is to be expected in much of surgical practice.

What is strong evidence?

“ Strong evidence does not mean good medicine. Neither does absence of strong evidence mean bad medicine ”

When we talk about strong evidence, what we mean (in conventional statistical terms) is that

  • the null hypothesis has been falsified – to a known degree of probability
  • we have an estimate of the effect size.

Strong evidence is not the same as a big important effect. You can get strong evidence by having lots of patients in your trial, even though the size of the effect is small. Strong evidence does not mean good medicine. Neither does absence of strong evidence mean bad medicine.

The problem of bias

If there is an association which is unlikely to be due to chance, it does not automatically mean that the observed effect is due to the intervention being studied. It could be due to the way the study was done. There could be bias.

A study looking at the effectiveness of a treatment may suffer from bias in many ways.

Selection bias

“ Randomization protects against unknown as well as known sources of selection bias ”

The doctors carrying out the study could, subconsciously or otherwise, pick patients they thought would do better for the treatment group. Even if some apparently reasonable non-random method of producing a control group were used, there could be selection bias. For example, if patients attending a Monday clinic were allocated to the treatment group, and those attending a Wednesday clinic as controls, it could be that patients attending on Mondays were generally sicker than Wednesday’s patients. Randomization protects against unknown as well as known sources of selection bias.

Observer bias

Those charged with observing and recording the results of the study could, subconsciously or otherwise, give an inflated opinion of the results in the favoured treatment group, while minimizing or ignoring any adverse effects. They might, at the same time, be more assiduous in looking for poor results in the control group. This source of bias can be removed by blinding the observer to the treatment group. Blinding can’t always be achieved. If patients in the treatment group have a surgical scar and the control group haven’t, that could be a bit of a give-away.

Loss to follow-up bias

The study organizers might assume that patients never came back because all was well, when in fact they didn’t come back because they were dissatisfied or had even died.

Participant bias

Patients in the treatment group may feel special and act differently than the control group. For example, they might

  • eat a more healthy diet
  • stop smoking
  • feel happier because they perceive they are getting the best treatment
  • feel supported and benefit psychologically from the extra attention
  • sleep better

Any of these mechanisms could cause effects additional to and separate from the treatment being studied.

“ It is very difficult to blind the participant to a surgical treatment ”

The way to avoid participant bias is to blind the patient as to which treatment they are getting. This can’t always be done. For example, it may become known that the real medicine has a certain taste, while the placebo doesn’t. It is very difficult to blind the participant to a surgical treatment. A scar is a scar. Sham surgery has been conducted in the past, but is now considered unethical.

Choice of Outcome Measures

“ the bulk of modern medical and surgical interventions are not to avoid death, they are to improve the quality of life ”

The choice of outcome measures plays a crucial part in determining the results of clinical trials. The development of the RCT model in medicine was based largely on drug trials in otherwise fatal conditions – especially respiratory infections such as pneumonia and tuberculosis. The outcome measure was simple – the patient was either alive or dead. But the bulk of modern medical and surgical interventions are not to avoid death, they are to improve the quality of life. Until recently, this was thought too difficult to measure, but the application of psychometric techniques to patients’ symptoms in the 1980s began to allow a more quantitative approach to soft outcome measurement (Powell, 1989).

Before the 1980’s, medical practitioners, and perhaps particularly surgeons, did not like to dwell too much on the subjective symptoms of their patients, particularly any that persisted after operation.

Lavelle and Harrison (1971) reported their results of middle meatal antrostomy purely on a technical measure of success, the continued patency of the surgical opening into the sinus. They deliberately excluded patients’ views from their analysis, stating that

“little is achieved by quoting figures and statistics, as the results depend to a great extent on subjective response of a patient”

In the 1970’s, the established medical view was that symptoms were important, but mainly as clues in a jigsaw puzzle. The aim was to establish a diagnosis and thereby institute appropriate treatment. In scientific studies of the effectiveness of treatment, most doctors would prefer objective to subjective measures of outcome. The surgeon prefers to know that the patient is cured of the disease, rather than whether he merely feels better. That requirement for objective measures of successful outcome is difficult where

  • the correlation between subjective symptoms and objective findings is poor
  • the definition of the disease is primarily or partially based on symptoms

For example, in chronic rhinosinusitis, the correlation between subjective symptom severity and a variety of objective findings is around 7% (Fairley, 1993). The disease itself is now formally defined purely on the basis of persistent symptoms (Fokkens et al, 2007). The pure symptom-based definition is restricted to epidemiologists and general practitioners. Specialists making the diagnosis are required to undertake at least one form of objective diagnostic examination.

All studies which attempt to correlate symptoms with disease severity suffer from a similar philosophical and methodological difficulty. That difficulty lies in the definition of disease, which is often tenuous. If our gold standard for diagnosing and rating disease severity rests on some objective tests, and symptoms are compared against it, we are making an implicit value judgement. Symptoms are somehow less important than signs, and need to be accounted for by physical findings. Radiology, endoscopy, microbiology, surgical exploration and histology are all examples of physical findings. Although it is of course necessary to look carefully for physical findings, especially where these may reveal serious disease or will change management, it is intellectual arrogance to conclude from the absence of detectable pathology that there is nothing wrong with the patient. That is why psychiatry was the first area of medicine to develop a methodology for making reliable and valid outcome measures based on symptom questionnaires. Since they didn’t have too many physical findings to distract them, they set about measuring what they could. It now turns out that many of our objective measures are at best loosely correlated with what the patients have come to us about in the first place – symptoms. So, if we want to find out if we are doing any good, just measure the symptoms before and after, and regard what happens in between – the medical intervention – as a black box.

“ It is essential that the outcome measures used in any particular trial are relevant to the clinical question being asked ”

An explosion of research interest in the 1990s and early 21st Century has resulted in hundreds of disease-specific outcome measures, as well as numerous validated general health outcome measures. We are now spoilt for choice. In orthopaedics, the number of symptom / questionnaire based outcome measures available now exceeds the number of joints in the human body. Comparative evaluations are needed to decide which measure to use (Beaton et al, 2001; Roach, 2006). Despite their widespread application, there remain significant difficulties in defining suitable outcome measures. Although a great deal of time and effort has gone into developing reliable, valid and responsive outcome measures, the choice of which to use is, in the end, subjective. It is invariably influenced by the sponsors of the trial. It is essential that the outcome measures used in any particular trial are relevant to the clinical question being asked.

How to choose an outcome measure to get the answer you want: Crutches for broken legs

Health insurers and governments funding health expenditure worldwide are looking to EBM to cut expenditure on self limiting conditions. They might save money by not paying for crutches for patients with broken legs. How about an RCT of crutches? Of course, the patients denied crutches would not be able to walk for a while, but, once the leg had healed, and certainly by one year, they should be walking again. By choosing an outcome measure

“ability to walk one year following the injury”

and comparing patients randomly allocated either to receive or not receive crutches, the trial would probably conclude “no evidence of benefit” from crutches in the treatment of broken leg, a self limiting condition. But surely no one would take such a trial seriously. Or would they? Look at the outcome measures chosen in trials of grommet insertion, sponsored by the UK Government, for children with hearing loss due to glue ear. Following grommet insertion, most children get a dramatic improvement in hearing. The average grommet lasts nine months, during which hearing remains good. Once the grommets come out, a minority will get further glue ear. Meanwhile, a large proportion of the children who did not receive grommets will slowly clear the fluid and their hearing will improve. Those who don’t are often given grommets anyway, but the results are reported on the basis of “intention to treat” – so the benefit accrues to the non-treatment group. The trials report hearing results at one and two years, when most of the grommets have fallen out. Dramatic and consistent short term improvements are ignored in the conclusions.

“ choice of outcome measures for trials remains subjective ”

Conventions have evolved in recent years to improve the reliability and validity of outcome measures. Yet choice of outcome measures for trials remains subjective. Interested parties may well seek to introduce bias at the design stage. A drug company would naturally like to choose an outcome measure which shows a positive effect for their product, even if that measure is only indirectly related to patient-perceived benefit. Increasingly, there is a formal regulatory apparatus with semi-public consultation and justification. An a priori declaration of interests is one way to avoid the bias of only reporting the one outcome measure that shows what you wish to prove.

How to choose an outcome measure to get the answer your patients need: Reliability, Validity and Responsiveness

A good outcome measure will be reliable, valid, and responsive to change in the group to which it is to be applied. The three properties of reliability, validity and responsiveness are related, with some overlap, but distinct.

A reliable but invalid measure

A 30 cm ruler will reliably measure the head diameter but is not a valid measure of what he’s thinking

A 30 cm ruler gives a reliable measurement of the diameter of a thinker’s head. We can show that

  • it gives a reproducible result
  • test-retest variation is reasonable
  • it is sensitive to change (responsive) when differing diameter heads are measured

But it is not a valid measure of what the head is thinking. It is measuring a different domain altogether.

That example may appear trite. A child would spot the error. It is obviously wrong to try and measure what someone is thinking with a ruler. But this type of mistake is rife in clinical research. It is not always so obvious that the chosen outcome measure is invalid. Indeed, the error may be embedded in our collective medical culture, in our limited understanding and flawed concept of the disease.

An example of a reliable but invalid outcome measure in clinical research

Berg and Carenfelt (1988) studied 155 patients with suspected acute sinusitis. An algorithm based on symptoms and signs was compared with the “gold standard” of maxillary sinus empyema versus not empyema, as established by antral aspiration (sinus washout). 68 patients were found to have an empyema and 87 not. Purulent flow from the middle meatus seen on rhinoscopy was pathognomonic for empyema when seen, but only occurred in 6 cases. Severe cacosmia was also of high positive predictive value, but only occurred in 12 cases. Of symptoms that occurred frequently, unilateral predominance of pain or purulent rhinorrhoea were found to be strongly predictive of empyema. By combining the analysis of symptoms with a high ESR, they found that “diagnostic reliability” of their algorithm could reach 80%. However, this begs the question as to whether the gold standard they used – namely antral aspiration – was valid. From a perspective of 20 years later, it almost certainly was not.

The misapplication of reliable, yet invalid, outcome measures is increasingly likely as outcome measures proliferate.

Assessing reliability of an outcome measure

“ judges at a dance competition ”

A reliable outcome measure will give the same answer when the same thing is measured repeatedly. This is known as test-retest reliability. If the outcome measure is an observer rating, inter-rater reliability is important. Different observers, ostensibly applying the same rules, may well give different ratings – like the judges at a dance competition. Intra-rater reliability (consistency) can also be assessed. For example, in grading the severity of facial palsy using the House-Brackmann scale, the same observer is asked to rate the same series of clinical photographs a few weeks apart.

Cronbach’s alpha

Another measurement of reliability for summed questionnaire scales is Cronbach’s alpha. Cronbach’s standardized item alpha coefficient is a generalised measure of reliability, based on the internal consistency of the scale. It is calculated from the average inter-item correlation and the number of items in the scale. Alpha behaves as a squared correlation coefficient and ranges from 0 (none) to 1 (perfect). If the number of items in the scale is large, the inter-item correlations do not have to be so high to obtain high reliability scores.

Reliability in this context means the extent to which the total symptom score is likely to give the same result as another similar measurement of symptom severity. If each item on the questionnaire is measuring some part of a related concept (overall severity of the condition) then individual items should be correlated with one another to the extent that they are measuring the common entity. The result can be interpreted as the extent to which the scale tested would be expected to correlate with all other possible k-item scales, constructed from a hypothetical universe of questions on the subject of interest. Another interpretation is that alpha times 100% of the variability in a hypothetical test, composed of all possible questions on the subject of the questionnaire, would be accounted for by the results of the k-item test used.

In the senior author’s study of the reliability and validity of a symptom scoring scale for rhinosinusitis (Fairley, 1993), Cronbach’s alpha was calculated on a series of 411 patients attending ENT out-patient clinics with a variety of conditions and found to be 0.78. This is reasonably good. Alpha should certainly be over 0.5, and ideally over 0.8.
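The way standardized alpha depends on the number of items and the average inter-item correlation can be sketched with the usual formula, alpha = k × r / (1 + (k − 1) × r), where k is the number of items and r the mean inter-item correlation. The function name and the example values below are chosen purely for illustration and are not taken from the study cited above.

```python
def standardized_alpha(n_items: int, mean_inter_item_correlation: float) -> float:
    """Cronbach's standardized item alpha from the number of items (k)
    and the average correlation between items (r)."""
    k, r = n_items, mean_inter_item_correlation
    return k * r / (1 + (k - 1) * r)

# Same modest average correlation, increasing numbers of questionnaire items.
for n_items in (5, 10, 20):
    print(n_items, "items:", round(standardized_alpha(n_items, 0.3), 2))
```

With the average correlation held at a modest 0.3, alpha rises steadily as items are added, which is why a long scale can score well on reliability even when its individual questions are only loosely related.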

Assessing validity of an outcome measure: face, content, criterion-related and construct validity

Face validity

“ read the questions ”

The first test of validity is simply to look at the questions and consider them at face value, to see whether they make sense. In most clinical studies involving questionnaires, up until the 1990s, this “face validity” was the only kind of validation that took place. Face validity is usually enough to spot a gross error. If you are planning on using information based on questionnaire outcome measures, it is a good idea to read the questions.

Content validity

The next test of validity is to consider whether questions cover all aspects of the concept being measured. Various points may or may not be important depending on the use to which the scale is to be put. It must be borne in mind that a more complex and time consuming questionnaire is less likely to be of general use. Formal methods of establishing content validity start out with a very large number of questions. These are culled from other studies of the problem, expert opinion, and unstructured interviews with patients of the Grounded Theory type (Glaser and Strauss, 1967). Questionnaires based on these are tested on groups of patients, and by techniques such as cluster analysis (Norusis, 1988) independent dimensions are discerned and redundant questions can be eliminated progressively.

Criterion-related validity

This means testing your proposed outcome measure against another, already known to be valid (Powell, 1989). Unfortunately, the commonest reason to introduce a new measure is precisely because no such “gold standard” is available.

Construct validity

“ Does it do what it says on the tin? ”

The final and most difficult test of validity is “Construct validity”. Simply put, construct validity is whether the measure really measures what it is supposed to measure. Does it do what it says on the tin? That question may be easy to answer for a wood preservative, but not so for a questionnaire. In using an outcome measure that tries to quantify something from the patient’s point of view, we are trying to measure an abstract concept or construct. As an example, here is some of the discussion that went into evaluating the validity of an early questionnaire for nasal symptom severity, the subject of a Master’s thesis by the senior author (Fairley, 1993)

  • Nasal symptom severity is not a simple physical property.
  • It is an artificial “construct”, made up of numerous individual symptoms and their interactions with the subjective opinion of the patient.
  • Does the questionnaire result provide a genuine representation of this construct?
  • Is this scale really measuring nasal symptom severity?

There is no simple test to establish construct validity. To some extent it only becomes established over time, with repeated use of the scale in many different studies. Available evidence for construct validity could include:

  • Content validity (the content of the questionnaire pertains to the construct)
  • Demonstrated internal consistency of the scale (if it could be broken down into two or more unrelated groups of items, it could not really be measuring a single construct)
  • Diagnostic group differences.

If the nasal symptom scores really are measuring nasal symptoms, it would be reasonable to expect higher scores in patients suffering from nasal conditions.

Responsiveness of an outcome measure

Responsiveness is how accurately an outcome measuring instrument detects change when it happens. Responsiveness can be checked by applying the scale before and after a treatment which is known to work. Responsiveness will vary according to the patient group. Floor and ceiling effects restrict the range of applicability. For good responsiveness, an outcome measure needs to be reliable. If there is a lot of measurement error (test-retest reliability is poor) it cannot detect small changes, because they will be swamped by the background noise. Responsiveness is also part of the validity for an outcome measure.

General vs. condition specific outcome measures

“ you can’t really compare apples with oranges, let alone with a fillet steak ”

A great deal of research effort has gone into developing general measures of Quality of Life (QoL) such as the 36-Item Short Form Health Survey (SF-36® Ware, 2003). Such measures are inherently flawed when applied to individuals, each of whom will have their own specific health problems. If you are deaf, the fact that you can tick a box on a QoL questionnaire that you can climb a set of stairs without getting out of breath doesn’t really make your deafness any less of a problem to you. The main use for general QoL measures is in deciding where to place healthcare resources. Insurance-based systems, whether run by the state or private companies, remove the need for individuals to pay for treatment at the time of need. They spread the risk of having some horribly expensive disease. But they also remove the incentive for the individual to seek value for money. The price a scheme member pays for peace of mind, for not having to worry about medical bills, is more than just the premium. The price includes the fact that, when it comes to your claim, somebody else decides what will and will not be covered. And when you have paid into the scheme, but develop an expensive health problem that does not score highly on the general quality of life scale, you might feel somewhat aggrieved to discover you aren’t covered. A health commissioning organization, whose job it is to decide priorities for resource allocation, will be biased in favour of general outcome measures. A general measure of benefit helps them compare the value of one treatment with another. But you can’t really compare apples with oranges, let alone with a fillet steak. A general measure of food benefit is clearly not very sensible. Each foodstuff has its own contribution to make. Yet we have the SF-36 and similar general outcome measures being used (misused) to show how much benefit we get from a cochlear implant compared to a hip replacement. Dr John E Ware, writing on the website, states that

“clinical trials to date demonstrate that the SF-36 is very useful for descriptive purposes such as documenting differences between sick and well patients and for estimating the relative burden of different medical conditions” (our emphasis)

But everything depends on the questions asked. A cochlear implant will not help you get up the stairs without getting out of breath. Neither will an apple give you your recommended daily allowance of protein.

Responsiveness and general vs. condition specific outcome measures

The biggest difference between condition-specific and general outcome measures is likely to be in their responsiveness. General QoL measures all take the form of a weighted shopping basket, just like the official estimates for economic inflation. If you happen to suffer from / need an item that ain’t in the basket, it won’t show up in your score. You need a condition specific outcome measure. And don’t let them fob you off with the wrong shopping basket.

Putting numbers on the effectiveness of treatments: Absolute Risk Reduction and Number Needed to Treat

Most countries have laws restricting extravagant and unfounded claims for medical treatments. But there are always loopholes. Marketing departments of drug companies are very good at finding them. Some pharmaceutical companies spend more money marketing their products than they do on research and development (Gagnon and Lexchin, 2008). There are lots of ways to make your product look good. Glossy ads in medical magazines are just a small part of it. But surely the presentation of the underlying dry figures and statistics can’t mislead. Well, yes it can, it does, and it’s all legal.

“ don’t get taken in by risk reduction figures for uncommon events ”

If a drug rep told you their new molecular engineered prostacyclin analogue gave a 25% reduction in the risk of stroke compared to plain old aspirin 75mg, you’d probably be impressed. But you need to know something else, that s/he didn’t tell you. The 25% risk reduction in favour of the new (expensive, potential late side effects unknown) drug is true. But that is only part of the story. To decide whether to prescribe, you need to know the risk of stroke in the patients for whom you plan to prescribe – presumably those whom you already had on 75mg aspirin. If their risk of stroke is still high, a 25% difference is impressive. But if their risk is already low, it is less so. A 25% reduction in something that is already very small is something very small indeed.

When you ask to see the data from the trial, you see that 3 percent of the patients on the new drug had a stroke over a five year period, compared with 4 percent of the patients on aspirin.

The Relative Risk (RR) of new vs old is therefore 3/4 = 0.75 = 75%. Therefore the reduction in risk by 25% appears correct.


  • The Absolute Risk of stroke in the control group is 4/100 = 0.04
  • The Absolute Risk of stroke in the treatment group is 3/100 = 0.03
  • The Absolute Risk Reduction is the Absolute Risk in the Control Group, minus the Absolute Risk in the Treatment Group, which is 0.04 minus 0.03 = 0.01 = 1 percent.

Somehow, that doesn’t seem quite so impressive as 25%.

The Number Needed to Treat is the reciprocal of the ARR = 1/0.01 = 100.

That means, in order to see a difference in outcome, you would have to treat one hundred patients with the new medication in order to prevent one stroke. Well, stroke is a very serious disease, and you may well take the view that it is worth it. But don’t get taken in by risk reduction figures for uncommon events.
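
The arithmetic above can be reproduced from the raw trial counts in a few lines. The following is a minimal sketch only: the function name and the dictionary keys are our own labels, and exact fractions are used so that rounding errors do not creep into the reciprocal.

```python
from fractions import Fraction
import math


def risk_summary(events_control, n_control, events_treated, n_treated):
    """Return RR, relative risk reduction, ARR and NNT from trial counts."""
    risk_control = Fraction(events_control, n_control)
    risk_treated = Fraction(events_treated, n_treated)
    arr = risk_control - risk_treated
    if arr <= 0:
        raise ValueError("no absolute risk reduction, so NNT is undefined here")
    rr = risk_treated / risk_control
    return {
        "relative_risk": rr,
        "relative_risk_reduction": 1 - rr,
        "absolute_risk_reduction": arr,
        # NNT is conventionally rounded up to a whole patient.
        "number_needed_to_treat": math.ceil(1 / arr),
    }


# Stroke example from the text: 4 in 100 on aspirin, 3 in 100 on the new drug.
stroke = risk_summary(4, 100, 3, 100)
for name, value in stroke.items():
    print(name, float(value))
```

For the stroke figures this gives a relative risk of 0.75, a relative risk reduction of 0.25, an absolute risk reduction of 0.01 and an NNT of 100, matching the working above. The headline 25% and the modest 1% come from exactly the same two numbers.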


  • Absolute Risk
  • Relative Risk
  • Absolute Risk Reduction
  • Number Needed to Treat

Absolute Risk (AR) = the event rate for a given individual

Relative Risk (RR) = the event rate in the exposed group / event rate in the control group

Absolute Risk Reduction (ARR) = event rate in control group, minus event rate in treatment group

Number Needed to Treat (NNT) = the number needed to treat to avoid 1 bad outcome = 1 / ARR.

The Absolute Risk and Relative Risk terms are particularly relevant to the epidemiology of a disease, and can also be applied to treatment and control groups in studies. The Number Needed to Treat is the reciprocal of the Absolute Risk Reduction. The ARR and NNT are important to know when making evidence based treatment recommendations.

“ If you know the NNT, you can offer your patient a better informed choice ”

Example 1: Absolute risk of a smoking related disease

A disease has an Absolute Risk for an individual of 8 in 100. That means 8 out of every 100 get it.

If smoking increases the risk of this disease by 50% compared to someone who does not smoke (a Relative Risk of 1.5), the 50% applies to the 8 percent.

Therefore Absolute Risk for that individual goes up to 8 + 4 = 12 in 100.
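
The same step can be written as a one-line calculation, which also makes it easy to see how the identical relative risk plays out against a different baseline. This is an illustrative sketch: the function name is ours, and the rarer disease (8 in 10,000) is a hypothetical figure that does not come from any study.

```python
from fractions import Fraction


def risk_after_exposure(baseline_risk: Fraction, relative_risk: Fraction) -> Fraction:
    """Absolute risk in the exposed group, given baseline risk and relative risk."""
    return baseline_risk * relative_risk


baseline = Fraction(8, 100)  # Example 1: 8 in 100
smoker = risk_after_exposure(baseline, Fraction(3, 2))  # 50% increase = RR of 1.5
print(f"{float(smoker) * 100:.0f} in 100")

# Hypothetical rarer disease: the same relative risk adds far less absolute risk.
rare_baseline = Fraction(8, 10000)
rare = risk_after_exposure(rare_baseline, Fraction(3, 2))
print(f"extra risk: {float(rare - rare_baseline) * 100:.2f} in 100")
```

For the disease in Example 1 the extra risk is 4 in 100. For the hypothetical rare disease, the same 50% increase adds only 0.04 in 100.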

Example 2: A treatment with a small effect size

A trial found that 12/100 suffering from a certain disease had a bad outcome in a control group, compared with 10/100 in the treatment group.

The ARR from treatment of that disease would be (12/100) – (10/100) = 0.02.

Therefore the number of patients you would need to treat to stop 1 patient getting a bad outcome would be 1/0.02 = 50

You would need to treat fifty patients in order to prevent just one bad outcome. If the treatment was expensive or had a lot of side effects, that may not be a very good deal for the forty nine patients who paid for treatment, and ran the risk of side effects, without gaining any benefit. Deciding on whether a treatment does represent a good deal is a value judgement. It depends on the seriousness of the adverse outcome we are trying to prevent. If the adverse outcome is death or major disability, then a NNT of 50 might be acceptable. If the adverse outcome is a minor nuisance, and the treatment onerous or expensive, you – or your patient – might decide it’s not worth the trouble. If you know the NNT, you can offer your patient a better informed choice than if you just know the treatment has some beneficial effect.

NNT Calculator

Recently published papers in major medical journals usually include the NNT. However, not all do so. Bandolier have an online NNT calculator. Their downloadable worksheet provides a template to calculate NNTs from papers and systematic reviews. It is a useful educational exercise to try filling this in yourself, from the information provided in a published paper. Even if the NNT is already given, you can check the workings for yourself, and thereby have a better understanding of how the figure is arrived at.

Is the evidence relevant?

If a randomised controlled trial shows there does appear to be a positive effect that is not due to chance or to the way the study was done, can it be applied to your population? This is a crucial question. Even within the same geographical area, different population groups exist. Individuals each have their own specific situation. It is the responsibility of the individual doctor to explore, with the individual patient, whether the evidence really applies in each case.

Qualitative studies

“ some of the things that matter most to patients can’t be reduced to figures and statistics ”

We cannot measure everything. Some of the things that matter most to patients can’t be reduced to figures and statistics. Often, a combination of common sense, empathy and gut feeling provide enough practical guidance. But we cannot always rely on our own perspective. Professionals can, all too easily, fall into a rut of blinkered complacency. If we want to find and apply best evidence in certain areas, we need to look at qualitative studies.

Qualitative studies to guide research

Qualitative studies are valuable in

  • Preliminary definition and scoping of a research problem
  • Refocusing earlier research that has not produced any clear answers

Preliminary definition and scoping of a research problem

“ in clinical research, the choice of outcome measures is crucial ”

We have to ask a clear question. Only then do we have any chance of discovering the answer. The research question in clinical medicine is not always obvious. In clinical research, the choice of outcome measures is crucial. Formal qualitative studies, looking at what really matters to patients, are helpful in deciding which outcome measures to use in subsequent quantitative studies of treatment effectiveness.

Refocusing earlier research

Qualitative studies can also be valuable in finding out why previous studies have produced contradictory or uninterpretable findings. They can help refocus research.

Specific areas of application for qualitative studies

Areas where qualitative studies are likely to provide the best evidence include

  • patient satisfaction surveys – especially their free text / blank canvas elements, where patients can mention things the researchers may not have realised were important
  • giving patients a voice in how services are provided
  • involving users in service development

Critical appraisal of qualitative study methodology

As with any form of evidence, we need to know which qualitative studies give us results we can rely upon.

The methodology for qualitative studies is less well established than for quantitative. Some principles have emerged and become generally accepted in recent years. Various authors (Pope 2000, Cutliffe 1999, Boulton 1996, Popay 1998, Beck 1993) have identified indicators to help distinguish good qualitative studies. To assess the methodological quality of a qualitative study, consider the following

  1. Did the study use open questions? If yes, did the study give privileged place to subjective meaning? Were subjects in the study given a true opportunity to describe how things appear from their perspective?
  2. Did the sample reflect a diverse range of individuals relevant to the topic? This would increase generalisability.
  3. Is there evidence of triangulation i.e. using more than one method to get the information used?
  4. Were the findings of the study fed back to the participants to check that the investigators’ interpretation of the patients’ views is correct?
  5. Is there evidence of saturation of the data? i.e. did the investigators continue to interview patients until no new themes were detected?
  6. Were deviant cases followed up? The most important information may be gained from a few patients whose views were not in agreement with the others (a minority report). Were these cases further investigated or ignored?
  7. How similar to your own patients are those in the study?

Further resources for those interested in qualitative studies are available at

Levels of Evidence and Systematic Reviews

“ levels of evidence should not be regarded as a sole and universal indicator of quality ”

It has become almost de rigueur to rank evidence in Levels. The aim – laudable enough – is to identify the strength of the evidence underlying any given management recommendation. At present, the hierarchy of evidence places the well conducted systematic review of RCT’s at the top. However, we have seen that the most appropriate evidence varies, depending on the question being asked. Hierarchies placing systematic reviews of RCT’s at the top are specifically about effectiveness of interventions. These levels of evidence should not be regarded as a sole and universal indicator of quality. Other schemes, similar in spirit, but differing in the detailed assignment of numbers, are also based primarily on the study design. Studies which reduce bias the most are placed at the top of the hierarchy. Reduction in bias is an important aspect of the quality of a study, but it is by no means the only thing that matters. In many important areas, we are unlikely ever to have high level evidence. As mentioned earlier, the absence of high level evidence of effectiveness is not evidence of absence of effectiveness. The hierarchy of evidence, and the recommendation gradings, are primarily based on the trial design rather than the clinical importance of the results. Careful thought, with skilled and judicious application to individual questions in individual cases, is essential. Otherwise, bad advice and bad decisions will result.

Levels of evidence for effectiveness of treatments

Table 1. Hierarchy of EBM (Evidence Based Medicine)
simplified from CEBM Oxford 2001
Level   Type of Evidence
I       Systematic reviews of well controlled Randomized Controlled Trials (meta-analysis), or a single RCT with a narrow confidence interval (CI)
II      Systematic reviews of cohort studies, or lesser quality RCTs
III     Case-control studies (non randomized)
IV      Case series (no control group)
(V)     Expert opinion (GOBSAT – Good Old Boys Sat Around Table)

Grading of Recommendations

Recommendations can also be graded depending on the level of evidence on which they are based. Grades of recommendation more or less follow the hierarchy

Evidence Level Recommendation Grade

Influence of EBM on Education and Problem Based Learning

Using EBM is closely allied with Problem Based Learning (PBL). Problem based learning is now incorporated into many medical student and continuing professional development (CPD) courses. Problem based learning (Miller 1966)

  • helps individuals identify their own learning needs
  • gives them the tools to meet those needs

Compared to didactic teaching, PBL takes longer and requires more effort. But knowledge obtained this way may be retained for longer. There is some evidence (Shin 1993) it leads to more lifelong use of core EBM skills such as

  • defining questions
  • searching for evidence
  • appraising evidence

Modern medicine does not stand still. Continuous advances mean that all healthcare staff need lifelong learning skills.


Avoiding abuses of EBM – the potential for tyranny in evidence based guidelines and protocols

“ cookbook medicine ”

It is up to individual doctors, in consultation with individual patients, to decide which evidence applies. Sackett (1996) identified the risk that, without clinical expertise, practice could become tyrannized by evidence. Even the best external evidence will not apply to all cases. Worldwide, healthcare commissioners, whether government or insurance based, look to evidence based clinical protocols and pathways to ensure value for money. Deviation from such guidelines can result in financial and even legal penalties for doctors. But strict adherence means a dumbed down, monodimensional, mechanistic and unthinking approach, which has been dubbed “cookbook medicine”. In 1996, Sackett stated that EBM could not result in a slavish, cookbook approach. But in 2007, the EBM cookbook has become very popular with health commissioners. Its recipes are chosen by committee. Health economists decide what is best for the population. The chosen dishes are then served up franchised, quality controlled to ISO 9000, McDonald’s fashion. Newly created regiments of specialist practitioners, unburdened by the skepticism that follows a broad medical education, are taught to follow the rules, follow the guidelines, and all will be well. The public like it because they are getting their treatment quicker. But is this fast medicine good for your health in the long run? The best defence against this misapplication of EBM is to be skilled and confident enough to know when the guidelines apply, and when they do not.

Guidelines are for guidance of the wise, and the blind obedience of fools

Hampton (2003) wrote that

“Guidelines for medical management are now part of medical life. A fool – loosely defined as someone who does not know much about a particular area of medicine – will do well to follow guidelines when treating patients, but a wise man (again, loosely defined as someone who does know about the disease in question) might do better not to follow them slavishly. The problem is that the evidence on which guidelines are based is seldom very good. Clinical trials have a variety of problems which often make their relevance to ‘real world’ medicine dubious. The interpretation of trial results depends heavily on opinion, and a guideline that purports to be evidence based is actually often opinion based. A guideline will depend on the opinions of those who wrote it, and the wise man will use his judgement and give due weight to his own opinions and expertise.”

Guidelines and the Pareto principle (the 80:20 rule)

“ A guideline is like a simple outline drawing, a flat cartoon character, a pixellated simplification of reality ”

A guideline has to be reasonably simple otherwise it is impractical. Cut-off points and categories of patients have to be specified. Management algorithms have to be drawn. A good guideline will cover around 80% of the cases in its area of application. If it tries to cover much more than that, it will become unwieldy to the point of being useless. It will become a textbook. By the time you’ve learned the textbook, you won’t need to look up a guideline. A guideline is like a simple outline drawing, a flat cartoon character, a pixellated simplification of reality. If it has been well designed, it will fit reasonably well over most real life complex patients. If it has not, or if attempts are made to apply it to a population other than that for which it was designed, it will not fit. It will chafe. It will pinch, like a badly fitting shoe, and it will either be discarded or cripple the unfortunate wearer. Statistically, it is inevitable that doctors will not always follow guidelines. But they should be aware of them, and be able to justify deviations. In statistical modelling, we start with as complete and accurate a representation of reality as we can get. All factors and variables are taken into account, and the model is as near to perfect as can be. It is a fully saturated model. But it is far too complex to understand. It hasn’t helped simplify anything. We then start removing elements from the model, one at a time, starting with those that least affect the representation of reality. When we reach as simple a model as we can, that doesn’t differ too wildly from the fully saturated model, we call it quits. That is how modern evidence based guidelines can be made. Of course, we are deliberately removing complexity. When we come to use the model in practice, we have to accept that we may have to put some complexity back in.

Differing perspectives on guidelines and the Pareto principle

A good doctor is often more concerned about the 20% of difficult cases than the 80% of routine. A good manager will tend to have the opposite priority, especially when that 80% is the main source of income. Very few managers have any real depth of knowledge of the subject of the guideline. In the UK, NHS managers are simply given targets to implement. Central bureaucrats produce a five year plan, much like the former Soviet Union. This explains many of the reservations we have over guidelines, especially when they begin to congeal into enforceable rules. When they become rigid, like the Procrustean bed, the wise will stay away from that institution. Procrustes (Greek mythology) lived in the hills. He tempted passing travellers to lie in his iron bed. But they had to fit exactly. Taller guests would have their protruding extremities cut off. Shorter folk were stretched on the rack to fit the size of the Procrustean bed. As medicine becomes more protocol-driven, we must beware doing the same to patients who don’t quite fit the mould.

Avoiding further abuses of EBM – Denying treatments for lack of high level evidence

“ It is just too tempting to misuse EBM to justify rationing ”

Some conditions and treatments are difficult to study by RCT. If the level of evidence is used as a criterion of clinical effectiveness, we risk denying patients effective treatments. Governments, health insurers and health maintenance organizations genuinely need to prioritize and ration limited resources. It is just too tempting to misuse EBM to justify rationing.

UK Department of Health misuses EBM to rubbish tonsil and grommet operations

“ propaganda ”

In the UK, the Department of Health and the Chief Medical Officer (CMO) have rubbished grommet and tonsil operations for years. They point out the lack of high level RCT evidence, and unexplained variations in the numbers of operations done. On 21 July 2006, the CMO’s annual report (covering the year 2005) highlighted the clinical waste of unnecessary tonsil operations (Dept of Health, 2006). The point was rather crudely illustrated. A yellow clinical waste bin was shown stuffed with cash. Banknotes were falling over the edges, onto the operating theatre floor. The public were spared an X-rated version of blood-spattered money, soaking in a gory puddle. Spin doctors seem to have been involved in the production of this propaganda. The British Association of Otolaryngologists were not. The results of their national audit of tonsil operations (RCS 2005) – covering over 33,000 operations and the biggest cooperative study of the subject in the world to date – were quoted with extreme selectivity. The CMO’s message, that taxpayers’ money is wasted on unnecessary tonsil operations, was immediately picked up by the press and widely reported as fact. A letter of protest from the President of the BAOL to the CMO was fruitless. In 2007 the BMJ published the results of one of the first RCT’s of tonsillectomy from Finland (Alho et al 2007). The results were clearly in favour of the operated group. The study has been criticised for its short follow up and use of throat swabs as an objective outcome measure. Further randomised controlled trials using longer follow-up and more patient-centred outcome measures will be needed to convince the paymasters. It will be difficult to persuade patients who have suffered severe and frequent tonsillitis that they should not have surgery. By the time UK patients reach an ENT surgeon, they seldom need persuasion of the merits of operation, though some will decline when informed of the risks.

Knowing your own patients

External evidence can only be properly applied if you understand how it relates to your own local population, and to the individual patient.

The final step in critical appraisal of any external evidence is to check whether it is really relevant to your patient.

Bridging the gap between evidence and practice

“ obstacles ”

Where relevant evidence is available there are many obstacles to getting it into practice (Haynes and Haines, 1998).

  • Doctors are busy people with little time to look up the latest evidence (Donald 1998)
  • Funding – medical innovations tend to be expensive
  • Skill mix and habits

“ a comfortable rut ”

Doctors, nurses and allied health care workers train for years to establish their practice. They develop skills and expertise. But they can also get into a comfortable rut. When advances occur, and evidence suggests there is a better way of doing things, it is wasteful to delay. Introducing change takes time. People have to learn new skills. Procedures must be in place to protect patients during the learning curve. There are no magic bullets to bring about change in doctors’ behaviour. A mixed approach is often best. Methods that have been shown to work in combination are

  • money
  • peer leaders (Key Opinion Leaders – KOL)
  • work place based education

There is strong evidence that a didactic lecture on its own does not work. There is only weak evidence that money on its own works (Scott, Sims). A combination of educational initiatives is usually needed.

Future developments in EBM

Making access to evidence easier

CONSORT: Consolidated Standards of Reporting Trials

“ get the message quickly ”

The CONSORT statement is an evidence-based set of recommendations for reporting RCTs. Standardising the format of reports

  • helps authors produce a complete and transparent account
  • helps readers get the message quickly by headlining the findings in the titles of papers
  • helps readers with critical appraisal and interpretation

The CONSORT statement, which is an evolving document, currently comprises a 22-item checklist and flow diagram, with brief text description. The checklist items cover

  • trial design
  • analysis
  • interpretation

The flow diagram displays the progress of all participants through the trial. The Statement has been translated into several languages.

Making sure all the evidence is available

Prospective registers of studies to prevent publication bias

“ register studies before they are done ”

Researchers are less likely to submit studies for publication if the results are not positive and exciting (Olson 2002). PubMed therefore has an overall positive bias, when in fact negative results may be very important.

Publication bias can result in serious harm to patients, even deaths. Chalmers (2006) points out persistent biased under-reporting of research undertaken by the pharmaceutical industry. The worry is that unwelcome side-effects, picked up during pre-marketing drug trials, can be quietly swept under the carpet, and only the positive findings published.

The most effective way to avoid publication bias is to register studies before they are done. Several databases of protocols for studies, and studies under way, are available. These help researchers to find all the evidence.

The metaRegister of Controlled Trials (mRCT) aims to identify ongoing and unpublished RCTs. It is open to trials of all types of intervention in all healthcare specialties. The mRCT is built by pooling databases of ongoing (and some completed) trials from public, charitable and commercial organisations.

The International Standard Randomised Controlled Trial Number (ISRCTN)

“ a unique number for each RCT ”

This scheme helps identify which trial is which. Confusion arises because

  • different trials may have the same name
  • different names may be given to the same trial
  • titles of publications often differ from the original trial name

Multiple reports based on the same patients can cause further confusion.

The solution is a unique number for each RCT. The number is issued when the researchers register the trial. This should be done as early as possible, preferably before patients are recruited. The ISRCTN then stays with a trial throughout its life cycle. A searchable database of ISRCTN registered trials is freely available online to clinicians, researchers, patients and the public.

Registering trials helps avoid

  • ignorance about ongoing and unpublished trials
  • confusion about which trial is which
  • over-reporting (undeclared duplicate reporting)
  • patients being unnecessarily recruited to new trials when evidence is imminent

The register provides opportunities for

  • collaboration between different research groups
  • reducing duplication of research effort

Failure to report RCTs is increasingly seen as scientific and ethical misconduct. There is growing pressure to register all trials. Some countries have legislation requiring registration of trials. Registration is recommended by many funding agencies and official bodies.

Once these registers have been in place for some time, protocol driven systematic reviews will be able to group together all RCTs on the subject in question, published or unpublished. Unpublished data currently has to be found by painstaking hand-searching, for example of abstracts of meetings, and by making enquiries of researchers known to be active in the field. Identification of such data should, in future, be much easier from prospective trial registers.

These studies would then ideally have their original patient data pooled, effectively making each smaller study into one huge study.

Such a gold standard systematic review of prospectively identified studies using original patient data has been achieved for Tamoxifen and breast cancer (Early Breast Cancer Trialists’ Collaborative Group 2001) but few other subjects.

Developing the tools

The methodology of the randomised controlled trial was invented in the 1940s out of necessity, when devising a study for TB with limited amounts of streptomycin (Medical Research Council 1948). The technology for pooling such trials (in protocol driven systematic reviews) has been developed since then to a high standard. The Cochrane Collaboration pools evidence about research methodologies. The Cochrane Methodology Register forms part of the Cochrane Library. It provides resources on

  • how to pool prospective cohort studies
  • developing the methodology for systematic reviews of qualitative studies
  • developing validated patient outcome measures
  • standardising the way studies are written up. This includes rules on writing the title of papers to make it easier to get the message quickly.

A drift toward the Dark Side? Bayesian Statistical techniques in EBM

“ a way of calculating the odds of something happening, if you already know something else ”

Bayesian statistical methods are increasingly being used to make decisions about healthcare, particularly at an economic level (Luce and O’Hagan, 2003). They are also coming into drug trials because they can reach conclusions more quickly, with fewer trial participants. Although Bayesian statistics is regarded as something new, the theorem of the Reverend Thomas Bayes was first published in 1763. It is quite simple. It is a way of calculating the odds of something happening, if you already know something else. We use the principle, naturally and often subconsciously, all the time in making decisions about what to do.

“ we already know, or at least believe, something before we start a trial ”

Bayesian statistics is based on an acknowledgement that we already know, or at least believe, something before we start a trial. This is known as the “prior”, expressed as a probability distribution. Probability is defined as a degree of individual belief, ranging from 0 (complete disbelief) to 1 (certainty). Any uncertainty is located philosophically in the human mind, not in the data itself. The trial then provides some new information. Bayes theorem – a simple mathematical formula – is then used to synthesize the new information with the prior, to form the “posterior”, also expressed as a probability distribution. Bayes theorem uses simple addition and multiplication to operate probability laws familiar to any gambler. Bayesian statistical inferences are based on the posterior, which combines information from both the prior and the trial data.
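The step from prior to posterior can be sketched in a few lines. This is a minimal illustration, not a method taken from any of the cited sources. The three candidate response rates, the prior weights and the trial result (7 responders among 10 patients) are all invented for the example. The only arithmetic involved is multiplication, addition and one division to make the posterior probabilities sum to 1.

```python
def bayes_update(prior, likelihood):
    """Combine a discrete prior with likelihoods using Bayes theorem.

    prior:      dict mapping each hypothesis to its prior probability
    likelihood: dict mapping each hypothesis to P(data | hypothesis)
    Returns the posterior as a dict whose probabilities sum to 1.
    """
    # Multiply: prior belief times how well each hypothesis predicts the data
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    # Add: total probability of the data across all hypotheses
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Hypothetical prior over the true response rate (sceptical: low rates favoured)
prior = {0.2: 0.5, 0.5: 0.3, 0.8: 0.2}

# Hypothetical trial data: 7 responders among 10 patients
responders, patients = 7, 10

# Binomial likelihood for each candidate rate. The binomial coefficient is
# the same for every hypothesis, so it cancels and is left out.
likelihood = {
    rate: rate ** responders * (1 - rate) ** (patients - responders)
    for rate in prior
}

posterior = bayes_update(prior, likelihood)
for rate in sorted(posterior):
    print(f"response rate {rate}: prior {prior[rate]:.2f} -> posterior {posterior[rate]:.3f}")
```

In this made-up case the trial shifts belief away from the low response rate and toward the higher ones, although the sceptical prior still holds the posterior back from where the trial data alone would point.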

“ evidence can be included or discounted from the prior belief ”

Bayesian statistics is focussed on answering the question “How does this new evidence change what we believe?” It explicitly allows for all available evidence to be taken into account. Because it acknowledges that prior belief can vary, various forms of evidence can be included or discounted from the prior belief, and given different weightings. Using a variety of different priors means the same data can be interpreted in many ways. This allows numerous models to be run, testing the effects of different assumptions and viewpoints on the observed data.
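The effect of running the same data through different priors can be shown with a standard textbook model, the Beta prior with binomial data. This is a sketch under stated assumptions: the trial result (12 of 20 patients improved) and the three priors are hypothetical, and the labels attached to them are for illustration only.

```python
def beta_posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a response rate under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is Beta(prior_a + successes,
    prior_b + failures), and the mean of Beta(a, b) is a / (a + b).
    """
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)

# Hypothetical trial: 12 of 20 patients improved
successes, failures = 12, 8

# Hypothetical priors, expressed as Beta(a, b) parameters
priors = {
    "sceptical (expects about 20%)": (2, 8),
    "uninformative (flat)": (1, 1),
    "enthusiastic (expects about 80%)": (8, 2),
}

results = {
    name: beta_posterior_mean(a, b, successes, failures)
    for name, (a, b) in priors.items()
}
for name, mean in results.items():
    print(f"{name}: posterior mean response rate {mean:.2f}")
```

The same 20 patients lead to three different conclusions, ordered exactly as the priors were. With more trial data the three answers would move closer together, because the data would outweigh the prior.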

“ mathematical models .. Markov Chain Monte Carlo simulation .. so complex as to be impenetrable ”

Although Bayes theorem itself is simple, the mathematical models built up to incorporate all these different factors quickly become so complex as to be impenetrable. The computations involved in the Markov Chain Monte Carlo simulation are such that Bayesian statistics in practice becomes a “black box” exercise. Various inputs result in various outputs, and it is almost impossible to work out what is going on in the middle. This induces a mistrust of the process in those accustomed to working with the relatively simple, transparent and fixed formulae of conventional statistical tests.
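The core idea of Markov Chain Monte Carlo is less mysterious than its reputation suggests. The sketch below is a random-walk Metropolis sampler, one of the simplest MCMC methods, applied to a problem small enough to have an exact answer for comparison. The data (7 responders, 3 non-responders), the flat prior, the step size and the number of samples are all assumptions made for the example. Real models have many parameters, and it is at that scale that the process becomes hard to inspect.

```python
import math
import random

def log_posterior(rate, successes, failures, prior_a=1.0, prior_b=1.0):
    """Log of the unnormalised posterior for a response rate (Beta prior, binomial data)."""
    if not 0.0 < rate < 1.0:
        return -math.inf
    return ((prior_a - 1 + successes) * math.log(rate)
            + (prior_b - 1 + failures) * math.log(1.0 - rate))

def metropolis(successes, failures, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis: propose a nearby value, then accept or reject it."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, successes, failures)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.uniform(-step, step)
        proposal_lp = log_posterior(proposal, successes, failures)
        # Accept with probability min(1, posterior ratio)
        accept_probability = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_probability:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[n_samples // 10:]  # discard the first 10% as burn-in

samples = metropolis(successes=7, failures=3)
mcmc_mean = sum(samples) / len(samples)
exact_mean = (1 + 7) / (1 + 7 + 1 + 3)  # mean of Beta(8, 4)
print(f"MCMC estimate {mcmc_mean:.3f}, exact answer {exact_mean:.3f}")
```

Here the simulated answer can be checked against the exact one. In the complex models used in practice there is no exact answer to check against, which is one source of the mistrust described above.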

There are two major differences between Bayesian and conventional statistics.

First, probability itself. Both Bayesian and conventional statistics consider probability as varying from 0 to 1. In Bayesian statistics, probability is defined as the degree of personal belief, where 1 equals certainty or truth. Conventional probability is defined in terms of a converging frequency, based on an infinite number of repetitions of the event that could have occurred. For example, tossing a coin gives a probability of 0.5 for either heads or tails, but you would have to toss the coin many times to see the proportion of heads and tails converge toward that figure. The uncertainty is considered to reside in the data, as an imperfect sample, not in the mind of the observer. The observer is considered neutral and impartial at all times. He does not need to have any belief, he simply observes, objectively, and his place could be taken by any other neutral observer. The difference between the two approaches could be illustrated by a two headed coin. The Bayesian statistician would just look at the coin and say “it’s a two headed coin – I’m certain”. The conventional statistician would refuse to cheat by using the prior information. He would just keep tossing it, getting head after head after head until he got bored with the repetition. He would then say that, if this was a normal coin, there was only a 0.00000-whatever chance that he would have got the number of heads he got. Of course, no real statistician, conventional or otherwise, would, in reality, be that daft, and both would simply state that this problem was not suitable for any form of statistical analysis.
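The arithmetic behind the coin illustration can be written out. This is a sketch only: the function names are invented for the example, and the prior of 0.01 (a 1 in 100 suspicion that the coin is two headed) is an arbitrary assumption.

```python
def p_value_all_heads(n_tosses):
    """Conventional view: chance of n heads in a row if the coin were fair."""
    return 0.5 ** n_tosses

def posterior_two_headed(n_tosses, prior_two_headed):
    """Bayesian view: belief that the coin is two headed after n heads in a row.

    A two headed coin gives heads with probability 1, a fair coin with 0.5.
    """
    evidence_two_headed = prior_two_headed * 1.0
    evidence_fair = (1.0 - prior_two_headed) * 0.5 ** n_tosses
    return evidence_two_headed / (evidence_two_headed + evidence_fair)

for n in (5, 10, 20):
    print(f"{n} heads in a row: p = {p_value_all_heads(n):.7f}, "
          f"posterior (prior 0.01) = {posterior_two_headed(n, 0.01):.4f}")

# Looking at the coin first corresponds to a prior of 1: no tossing is needed
print(posterior_two_headed(0, 1.0))
```

The conventional figure answers the question "how surprising is this run of heads if the coin is fair?". The Bayesian figure answers "how strongly should I now believe the coin is two headed?", and its answer depends on the prior that was fed in.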

“ fudge factor ”

Second, the explicit incorporation of the prior belief in the Bayesian method. Conventional statistics tries to stick to objective data, and does not attempt to incorporate any prior belief. Some elements of prior belief are, however, implied, for example when deciding to use a one-tailed or two-tailed test. The Bayesians point out that conventional statisticians, while deluding themselves that they are being objective, do in fact admit prior knowledge, but in a patchy, inconsistent way that is not formally acknowledged and quantified. It is generally agreed that conventional statistical tests have less power to provide inferences, but are more repeatable, and more objective. Scientists value objectivity and repeatability very highly. They are always suspicious, and often disparaging, of anything that smacks of the fudge factor.

The most fundamental criticism of the Bayesian method is that it is subjective. In particular, the prior belief is subjective. One person’s prior belief is different from that of another. Therefore the posterior distribution will also vary from one person to another, making any inference derived from it subjective. Proponents of the Bayesian method retort that scientists are not as objective as they think: they only pretend they can be, and so they may as well use the more powerful method. In 2007, the tide is flowing in favour of the Bayesians. The appeal of the dark side is increasing.

Evolution of EBM toward KBM (Knowledge Based Medicine)

“ a brand makeover ”

EBM has, unfortunately, been misinterpreted, misrepresented, and acquired some negative connotations since the early 1990s. Many experienced practising doctors sigh at the thought of these narrow-minded train-spotting anoraks, with their intolerant, blinkered insistence on RCT evidence for everything we do. Real EBM is not and never was like that. But, in common with many medical terms, EBM has begun to accrete pejorative layers. Perhaps it needs a brand makeover.

“ better ways of interpreting, evaluating, and incorporating lower levels of evidence ”

Haggard (2007) proposes the term KBM (Knowledge Based Medicine) to indicate a broader acceptance of evidence, and its more intelligent application to clinical practice. If systematic reviews are too strict over what evidence can be accepted, we risk doing without evidence altogether. Even where high level evidence is available, it may not be the most relevant piece of information to a particular problem. Evidence based guidelines may be enforced simply because they carry an imprimatur of scientific authority, when in fact they don’t really apply to the patients in question. We need better ways of interpreting, evaluating, and incorporating lower levels of evidence, to make better use of what we have. We also need to guard against simplistic misuse of EBM by those who seek black and white answers to questions that have many shades of grey.

Haggard points out that, by automatically rejecting studies which do not comply with the RCT methodology, we may deny ourselves potentially valuable information.

“The most rigorous methodology and the most sophisticated measurement both contribute value. But so does the wisdom to know when an admirable approach should be suspended because it is not appropriate. We need to know

  • what the particular strengths and weaknesses of a study are
  • which of these are germane to a conclusion on a particular issue
  • in what specific direction they would have biased the result”

“ traditional beliefs rarely arise by accident ”

KBM could be regarded as a correction of the excesses of radical EBM. Early advocates of EBM had to drag the medical profession away from the comfort zone of unbridled clinical freedom. Before EBM (and indeed today in many parts of the world), professional ignorance was cloaked rather than acknowledged. Reviewers and textbook authors typically marshalled evidence to support an institutional viewpoint. Controversial areas were rarely approached with an open mind. The deconstruction of traditionally held beliefs was a necessary step, but traditional beliefs rarely arise by accident. There is usually some element of validity. If the past really is another country, zealots misrepresenting EBM could be accused of cultural elitism, if not outright racism, towards its inhabitants.

“ Extremist interpretations ”

Early EBM defined what was reliable evidence. Extremist interpretations then simply excluded all that which did not comply from the knowledge base. This radical approach carried an implicit criticism of, and thereby alienated, many practising doctors. Those who daily make important decisions often have no choice but to rely on evidence from the lower levels of the hierarchy.

KBM aims to rehabilitate and make best use of evidence from a wide variety of sources. Evidence from

  • basic medical sciences
  • individual experience
  • traditional teaching

can all be legitimately incorporated into the clinical decision. KBM also takes into account the cultural and economic context of different healthcare environments. It recognises that there is a lot more to medical decision making than a narrow consideration of the reliability of estimates of effectiveness.

The tools for assimilating those other sources of knowledge are still being developed. In bridging the gap between evidence and practice, KBM will put guidelines in their proper place as guidance for the wise. There is little discernible difference between KBM and EBM as promoted by those with a broad understanding of the term. A broad, informed, nuanced use of evidence, as reflected in the text above, would fit with either term. Whatever we call it, we need to move on and implement its principles more widely.

EBM and the UK NHS

UK NHS doctors are in a privileged position to practise EBM. Unlike many colleagues working in other health economies, they are paid the same, whichever treatment option their patients are given. This puts them in a position of relative equipoise in their decision making.

Universal web access in the clinical workplace to databases like TRIP (Turning Research Into Practice) gives everyone the opportunity to access evidence.

“ help individual patients come to rational decisions ”

Combining critical appraisal of the evidence with clinical expertise, we can help individual patients come to rational decisions. The evidence based approach takes us toward Archie Cochrane’s vision of a Rational Health Service (Cochrane 1972).


Alho O-P, Koivunen P, Penna T, Teppo H, Koskela M, Luotonen J. Tonsillectomy versus watchful waiting in recurrent streptococcal pharyngitis in adults: randomised controlled trial BMJ 2007; 334: 939

Arms, WY. What are the alternatives to peer review? Quality Control in Scholarly Publishing on the Web. Thursday, December 13, 2007. Downloaded 08 Jan 2008 from

Beaton DE, Katz JN, Fossel AH, Wright JG, Tarasuk V, Bombardier C. Measuring the whole or the parts? Validity, reliability, and responsiveness of the Disabilities of the Arm, Shoulder and Hand outcome measure in different regions of the upper extremity. J Hand Ther. 2001 14: 128-46

Beck C. Qualitative Research, The evaluation of its credibility, fittingness and auditability Western J of Nursing Research. 1993, 15(2), 263-266.

Berg O, Carenfelt C. Analysis of symptoms and clinical signs in the maxillary sinus empyema. Acta Otolaryngol (Stockh) 1988 105: 343-349

Boulton M, Fitzpatrick R. Quality in qualitative research. Critical Public Health. 1994, 5(3),19-26

Burton MJ. Systematic reviews, Cochrane and EBM – Why I prize prudent scepticism above indefensible certainty. Clin Otolaryngol 2007; 30(1):60-3

Burton MH. Evidence-based medicine and Otolaryngology-HNS: Passing fashion or permanent solution? Otolaryngology-Head and Neck Surgery 2007: 137; S47-S51

Chalmers I. From optimism to disillusion about commitment to transparency in the medico-industrial complex. J Royal Soc Med 2006; 99:337-341

Clarke M, Hopewell S, Chalmers I. Reports of clinical trials should begin and end with up-to-date systematic reviews of other relevant evidence: a status report. J R Soc Med. 2007 Apr;100(4): 187-90

Cochrane AL. Effectiveness and Efficiency. Random Reflections on Health Services. London: Nuffield Provincial Hospitals Trust, 1972. (Reprinted in 1989 in association with the BMJ. Reprinted in 1999 for Nuffield Trust by the Royal Society of Medicine Press, London. See book review by Trisha Greenhalgh, BMJ 2004)

Cutliffe JR, McKenna HP. Establishing the credibility of qualitative research findings: the plot thickens. J of Adv Nursing 1999, 30(2), 374-380.

Cobb LA, Thomas GI, Dillard DH, Merendino KA, Bruce RA. An evaluation of internal mammary artery ligation by a double-blind technic. N Engl J Med 1959; 260:1115-8

Department of Health. On the state of the public health: Annual report of the Chief Medical Officer 2005. Published July 2006 UK Crown copyright. Download from

Doll R, Hill AB. Study of aetiology of carcinoma of the lung. BMJ. 1952;2:1271-1286.

Donald A. The front line evidence based medicine project final report. NHS Executive North Thames Regional Office 1998

Early Breast Cancer Trialists’ Collaborative Group. Tamoxifen for early breast cancer. Cochrane Database of Systematic Reviews 2001, Issue 1.

Fairley JW. Reliability and validity of a nasal symptom questionnaire for use as an outcome measure in clinical research and audit. Chapter 4 in: Correlation of nasal symptoms with objective findings and surgical outcome measurement – Thesis submitted for the degree of Master of Surgery, University of London, 1993. Published (excluding Chapter 9) 1996. downloaded from 04 January 2008.

Fitzpatrick R, Boulton M. Qualitative research in health care 1. The scope and validity of methods. J of Eval in Clin Practice 1996, 2(2), 123-30.

Fokkens W, Lund V, Mullol J. European Position Paper on rhinosinusitis and nasal polyps 2007. Rhinology Suppl. 20: 1-136

Gagnon MA, Lexchin J. 2008. The Cost of Pushing Pills: A New Estimate of Pharmaceutical Promotion Expenditures in the United States. PLoS Med 5(1) downloaded 9 January 2008 from

Gifford F. Uncertainty about clinical equipoise. Clinical equipoise and the uncertainty principles both require further scrutiny. BMJ. 2001 March 31; 322(7289): 795

Glaser, BG, Strauss, AL. The Discovery of Grounded Theory: Strategies for Qualitative Research. 1967 Chicago, Aldine Publishing Company. ISBN 0202302601

Glasziou P, Chalmers I, Rawlins M, McCulloch P. When are randomised trials unnecessary? Picking signal from noise. BMJ. 2007 Feb 17; 334(7589): 349-51

Green, SA. Surgeons and Shamans: The Placebo Value of Ritual. Clin Orthop Relat Res 2006, 450: 249-54

Greenhalgh T. How to read a paper: The Medline database BMJ 1997; 315: 180-183

Haggard, M. The relationship between evidence and guidelines. Otolaryngology – Head and Neck Surgery 2007; 137 (Supplement 4S) : 72-77

Hampton JR. Guidelines – for the obedience of fools and the guidance of wise men? Clin Med 2003 3:279-84

Haynes B, Haines A. Getting research findings into practice. Barriers and bridges to evidence based clinical practice. BMJ 1998; 317:273-6

Light D. Uncertainty and Control in Professional Training. J of Health and Social Behaviour 1979 20 310-322

Luce BR, O’Hagan A. A primer on Bayesian Statistics in Health Economics and Outcomes Research. 2003 MEDTAP® International, Inc.

Maw R, Wilks J, Harvey I, Peters TJ, Golding J. Early surgery compared with watchful waiting for glue ear and effect on language development in preschool children: a randomized controlled trial. Lancet 353(9157):960-3, 1999 Mar 20.

Miller, G. Continuing Education for what. Journal of Medical Education 1966 42 320-326

Medical Research Council 1948. A Medical Research Council Investigation. Streptomycin treatment of pulmonary tuberculosis. London Saturday October 30 1948 769-782

National Collaborating Center for Chronic Conditions 2004. Hierarchy of evidence and grading of recommendations Thorax 2004;59;13-14

Norusis MJ. (1988) SPSS/PC+ Advanced Statistics V2.0 for the IBM PC/XT/AT and PS/2. SPSS Inc, Chicago

Olson CM, Rennie D, Cook D, Dickersin K, Flanagin A, Hogan JW, Zhu Q, Reiling J, Pace B. Publication bias in editorial decision making JAMA 2002 287 21 2825-2828

Oxman A et al. No magic bullets, a systematic review of 102 trials of interventions to improve professional practice. CMAJ 1995 153 1423-31

Popay J et al. Rationale and standards for the systematic review of qualitative literature in health services research. Qualitative Research. 1998, 8(3), 341-351.

Pope C, Mays N. Qualitative Research in Health Care. BMJ Books 2000

Popper KR. The logic of scientific discovery. Hutchinson & Co, London. 1959 8th Impression 1975 ISBN 0 09 111721 6

Powell GE. Selecting and developing measures. In: Parry G and Watts FN (eds) Behavioural and mental health research: A handbook of skills and methods. Lawrence Erlbaum Associates, Hove & London (UK) 1989. pp 27-53

Retsas S. Treatment at Random: The Ultimate Science or the Betrayal of Hippocrates? Journal of Clinical Oncology 2004: 22; 24-20

Roach KE. Measurement of Health Outcomes: Reliability, Validity and Responsiveness. J Prosthetics Orthotics 2006. 18(1S): 8-12. full text downloaded 04 Jan 2008 from

Royal College of Surgeons of England. National Prospective Tonsillectomy Audit. Final report of an audit carried out in England and Northern Ireland between July 2003 and September 2004. Published May 2005 ISBN 1-904096-02-6 downloadable from

Sackett DL et al. Evidence based medicine: what it is and what it isn’t. BMJ 1996 312 71-72

Sackett DL. Evidence based medicine: how to practice and teach EBM. Edinburgh: Churchill Livingstone, 1997

Scott A. Designing incentives for GPs. A review of the literature on their preferences for pecuniary and non pecuniary job characteristics. University of Aberdeen, Economics Research Unit, Department of Public Health and Economics 1997

Shin et al. Effect of problem based self directed undergraduate education on lifelong learning. Can Med Ass J 1993 148 6 969-976

Sims P et al. The Incentive Plan: An approach for modification of physician behaviour. American Journal of Public Health 1984 74 2 150-151

Tallis R. Hippocratic Oaths. Medicine and its Discontents. 2004. Atlantic Books, London. ISBN 1843541262

WAME Publication Ethics Policies for Medical Journals. World Association of Medical Editors. Downloaded 4 Jan 2008 from

Ware JE. Conceptualization and measurement of health-related quality of life: comments on an evolving field. Arch Phys Med Rehabil 2003;84 (S2):S43-S51

Ware JE. SF-36® Health Survey Update. Downloaded 04 January 2008 from

Young C, Horton R. Putting clinical trials into context. Lancet July 9 2005; 366: 107-8


All information and advice on this website is of a general nature and may
not apply to you. There is no substitute for an individual consultation. We
recommend that you see your General Practitioner if you would like to be