Reading the evidence: how research is graded and why it matters
Spend any time reading about supplements and you will encounter the word "studies" used as though it settles the question. A product is said to be "clinically proven" or "backed by research" and the implication is that no further scrutiny is needed. But studies are not all the same. A cell culture experiment and a large randomised controlled trial are both studies. So is a survey of what people report feeling after taking a supplement. The word covers an enormous range of evidence quality, and treating it as a single category is one of the most reliable ways to be misled.
This is the first in a series of articles about how to read health and nutrition evidence critically. It is not aimed at researchers. It is aimed at anyone who wants to make more informed decisions about what they take and why, and who would rather understand the basis of a claim than simply trust or distrust it.
Why evidence needs to be graded
The central problem in nutrition and supplement research is that the question we want to answer — does this compound improve this outcome in people like me — is genuinely difficult to answer well. Human biology is complex, people differ from one another in ways that matter, and the outcomes we care about most take years or decades to measure. Research designs exist on a spectrum, from those that control for these difficulties to those that make no attempt to.
Grading evidence is not about dismissing research that sits lower in the hierarchy. It is about being honest regarding what each type of study can and cannot tell us. A study that shows a mechanism by which a compound might reduce inflammation does not tell us that taking that compound reduces inflammation in people. A study that shows people who eat more of a food tend to have better health outcomes does not tell us that eating more of that food causes the better outcome. These distinctions matter enormously in practice, and they are routinely ignored in supplement marketing.
A practical way to think about this is to arrange study designs in order of their general ability to answer questions about cause and effect in humans. This is the basis of the evidence hierarchy that Evidentia uses, and that evidence-based medicine has developed as a teaching framework over the past three decades. It is worth being clear that this hierarchy is a useful starting point, not a complete theory of evidence. The appropriate study design depends on the question being asked: for long-latency harms, rare outcomes, or questions that cannot ethically be randomised, observational evidence may be the best available human evidence rather than a lesser rung awaiting trial confirmation. The sections that follow walk through the five main tiers of the pyramid as they apply to the intervention questions most relevant to supplements: what each tier means and how it is used.
The base: expert opinion and anecdote
The base of the pyramid is not without value, and it is worth distinguishing two things that often get grouped together here. Anecdote is uncontrolled individual observation: a person takes a supplement and feels better, a clinician notices that several patients improve. These observations are where signals often first emerge and they matter as hypothesis-generators. The problem is that human memory is selective, placebo effects are real and substantial, and without a control group there is no way to know whether the improvement would have happened anyway.
Expert interpretation is a different function. It does not constitute independent evidence, but experienced clinicians and researchers play a real role in synthesising ambiguous evidence, identifying when mechanistic and clinical lines of evidence cohere, and recognising practical limitations that trial data alone may not capture. The issue is not that expert judgement is worthless, but that it should be anchored in the evidence base rather than substituting for it.
Anecdote is also the primary currency of supplement marketing. Testimonials, before-and-after accounts, and influencer endorsements operate at this level. They are compelling because individual stories are vivid and emotionally resonant in ways that population-level data are not. But a compelling story is not evidence of a general effect, and the stories that get shared are not a representative sample of everyone who tried the same thing.
Mechanistic and preclinical evidence
One tier up sits mechanistic and preclinical evidence. Mechanistic research examines how a compound interacts with biological systems at the molecular or cellular level. Preclinical research tests effects in cell cultures or animal models. This tier answers the question of how something might work, and it is genuinely important for generating hypotheses and understanding biology.
The gap between this tier and clinical evidence is larger than most supplement content acknowledges. A compound that activates a pathway in isolated cells may not reach the relevant tissue at supplemental doses in humans. An effect seen in mice may not translate to people. The history of medicine is full of interventions that looked compelling in preclinical work and failed in human trials. This does not make mechanistic research unimportant, but it does mean that a mechanistic rationale is the beginning of an evidence case, not its conclusion.
Evidentia labels all mechanism-based claims explicitly. When a supplement is described as proposed to support a biological process, that framing reflects the evidence available. When the language shifts to established or demonstrated, it reflects direct human clinical evidence.
Observational studies
Cohort studies and case-control studies sit in the middle of the hierarchy. These are observational designs that examine associations between exposures and outcomes in real populations without intervening. A cohort study might follow thousands of people over decades, recording what they eat and what health outcomes they experience. A case-control study might compare people who developed a condition with people who did not, looking back at what differed between the groups.
These studies are valuable for studying outcomes that take a long time to develop, for examining questions that cannot ethically be randomised, and for identifying associations that then warrant further investigation. The Nurses' Health Study and similar large cohorts have generated enormous amounts of useful nutritional epidemiology.
The limitation is confounding. People who take a particular supplement, eat a particular diet, or have a particular lifestyle tend to differ from those who do not in many other ways simultaneously. These differences, not the intervention of interest, may explain the observed association. Researchers use statistical methods to adjust for known confounders, but unmeasured confounders remain a persistent problem. Observational associations generate hypotheses. For intervention questions, they usually cannot establish causality on their own, though they can substantially strengthen or weaken a case when combined with other lines of evidence.
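Confounding can be made concrete with a small simulation. This is an illustrative sketch on invented data, not a real analysis: a hidden trait drives both supplement use and the health outcome, so users look healthier even though the supplement does nothing, and the apparent benefit largely vanishes once the comparison is restricted to people with similar levels of that trait.

```python
import random
import statistics

random.seed(0)

# Simulate a population where a lurking trait ("health consciousness")
# drives both supplement use and the health outcome. The supplement
# itself has zero true effect on the outcome.
n = 20_000
people = []
for _ in range(n):
    trait = random.random()                      # 0 = low, 1 = high
    takes_supplement = random.random() < trait   # the health-conscious take it more
    outcome = trait + random.gauss(0, 0.1)       # outcome depends only on the trait
    people.append((trait, takes_supplement, outcome))

users = [o for t, s, o in people if s]
nonusers = [o for t, s, o in people if not s]

# Naive comparison: users look healthier, despite a null true effect.
naive_gap = statistics.mean(users) - statistics.mean(nonusers)

# Crude adjustment: compare users and non-users only within a narrow
# stratum of the confounder. The apparent benefit largely disappears.
stratum = [(s, o) for t, s, o in people if 0.45 < t < 0.55]
s_users = [o for s, o in stratum if s]
s_nonusers = [o for s, o in stratum if not s]
adjusted_gap = statistics.mean(s_users) - statistics.mean(s_nonusers)

print(f"naive difference:    {naive_gap:.3f}")
print(f"within-stratum diff: {adjusted_gap:.3f}")
```

The catch in real research is the final step: stratification or regression can only adjust for confounders that were measured, and the ones that were not remain baked into the naive comparison.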
Randomised controlled trials
For most intervention questions, a well-conducted randomised controlled trial provides the most reliable available evidence of causal effect in humans. Participants are randomly assigned to receive the intervention or a control condition, which distributes both known and unknown confounding factors between the groups. If randomisation is done well and the trial is large enough, any difference in outcomes between groups can reasonably be attributed to the intervention itself.
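The balancing effect of randomisation can likewise be sketched in a few lines of simulated, purely illustrative data: with coin-flip assignment, every background trait, measured or not, ends up with nearly identical averages in both arms.

```python
import random
import statistics

random.seed(1)

# Each participant carries several background traits (standing in for
# known and unknown confounders). Random assignment balances them
# across arms without anyone having to measure them.
n = 10_000
traits = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
assignment = [random.random() < 0.5 for _ in range(n)]  # coin-flip allocation

treated = [t for t, arm in zip(traits, assignment) if arm]
control = [t for t, arm in zip(traits, assignment) if not arm]

# For every trait, the between-group gap in means is small, so the
# groups are comparable at baseline and a difference in outcomes can
# reasonably be attributed to the intervention.
gaps = []
for i in range(5):
    gap = (statistics.mean(t[i] for t in treated)
           - statistics.mean(t[i] for t in control))
    gaps.append(gap)
    print(f"trait {i}: between-group gap = {gap:+.3f}")
```

Note that the balance is only approximate and improves with sample size, which is one reason very small trials sit closer to observational evidence in practice than their design label suggests.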
Not all RCTs are equal. A well-designed trial is pre-registered, adequately powered, blinded where possible, uses an appropriate control, and is conducted independently of commercial interests. A poorly designed trial may be too small to detect a real effect, may use a dose or formulation that does not reflect real-world use, or may be funded and conducted by parties with a financial interest in a particular result.
Outcome selection matters as much as design. A trial measuring a surrogate marker (a blood test result, a biomarker, or some other physiological variable) rather than a patient-relevant outcome such as reduced symptom burden, improved function, or avoided disease may be formally well-conducted but still provide weak evidence for real-world benefit. The surrogate may not reliably predict the outcome that matters. This distinction is central to reading supplement research critically, because many trials in this space measure biomarker changes and present them as evidence of clinical benefit. Evidentia distinguishes these explicitly throughout the library.
Evidentia assesses each RCT for these quality dimensions rather than simply counting the number of trials as though each one carries equal weight. A single large, well-designed, independent trial measuring patient-relevant outcomes often provides stronger evidence than several small, industry-funded trials showing surrogate marker changes.
Why hierarchy alone is not enough
Understanding where a study sits in the hierarchy is necessary but not sufficient. Two systematic reviews can exist at the same tier of the pyramid while supporting very different levels of confidence in their conclusions. The reason is that certainty of evidence depends on more than study design.
The GRADE framework, developed by an international working group and now used by the World Health Organisation, Cochrane, and most major clinical guideline bodies, formalises this distinction. It rates certainty of evidence across four levels: high, moderate, low, and very low. A body of evidence can start at high certainty if it consists of well-conducted RCTs, but that rating can be downgraded based on serious risk of bias in the included studies, inconsistency across results, indirectness of the evidence to the population or outcome of interest, imprecision due to small sample sizes or wide confidence intervals, or suspected publication bias.
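As a rough illustration only, the downgrading logic can be caricatured in a few lines. Real GRADE assessments involve structured judgement across domains rather than arithmetic, and a single domain can warrant a drop of one or two levels, but the shape of the process is this:

```python
# A deliberately simplified caricature of GRADE's downgrading logic.
# Real assessments involve structured judgement, not arithmetic, and
# observational evidence can also be upgraded in some circumstances.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomised, serious_concerns):
    """Start at 'high' for a body of RCTs ('low' for observational
    evidence), drop one level per serious concern, floor at 'very low'."""
    level = 3 if randomised else 1
    level = max(0, level - len(serious_concerns))
    return LEVELS[level]

# RCT evidence with small samples and inconsistent results supports far
# less certainty than "randomised trials exist" would suggest.
print(grade_certainty(True, ["imprecision", "inconsistency"]))  # low
print(grade_certainty(True, []))                                # high
```

The point of the sketch is the asymmetry it encodes: study design sets only the starting level, and execution problems can walk a body of randomised evidence all the way down.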
In practice this means that a systematic review of RCTs on a supplement question can yield a very low certainty conclusion. This happens regularly in nutrition research, where trials are short, samples are small, outcomes are heterogeneous, and publication bias is well-documented. The existence of a meta-analysis therefore does not end the evaluative question; the question becomes what certainty that meta-analysis actually supports.
Evidentia reflects this distinction throughout its evidence ratings. A rating of Moderate or Emerging does not simply mean fewer trials exist. It reflects the totality of confidence that can reasonably be placed in the evidence base, accounting for design, execution, consistency, and outcome relevance.
Systematic reviews and meta-analyses
At the top of the hierarchy sit systematic reviews and meta-analyses. A systematic review uses a pre-specified, reproducible method to identify and synthesise all available research on a question. A meta-analysis goes further by pooling the numerical data from multiple studies to produce a single overall effect estimate with greater statistical power than any individual trial.
These methods exist because individual trials, even well-designed ones, are subject to chance variation. A single trial finding a positive effect might have been lucky. A systematic review of ten trials on the same question gives a clearer picture of whether the effect is real and how large it is.
The important caveat is that a meta-analysis is only as good as the trials it includes. If the underlying trials are small, short, heterogeneous in their populations and doses, or at high risk of bias, pooling them mathematically does not resolve those problems. Evidentia notes the quality of the underlying evidence base when reporting meta-analytic findings, rather than treating a meta-analysis as automatically definitive.
Why this hierarchy is relevant to supplements specifically
Supplement marketing operates almost entirely in the lower tiers of the pyramid while presenting claims as though they reflect the upper tiers. A compound is shown to activate an enzyme in cell culture, and the marketing copy says it supports cellular energy production. A study in elderly people with a deficiency shows improvement in a biomarker, and the product is marketed for general health enhancement in everyone. An observational association between dietary intake and an outcome is presented as justification for supplementation.
None of these are necessarily dishonest in their individual facts. The mechanistic study may be real. The biomarker trial may be genuine. The observational association may be robust. The problem is the inferential leap from what the evidence actually shows to what the product claims to do.
Understanding where a piece of evidence sits in the hierarchy is the first step to evaluating whether that leap is justified. The second step is assessing the certainty that can reasonably be placed in the total body of evidence: how large and well-conducted the trials are, how consistent the findings are across studies, how directly the evidence applies to the population and outcome of interest, and how wide the gap is between what was measured and what we actually care about. Those questions are the subject of the articles that follow in this series.
Where evidence is limited or outcomes are uncertain, conclusions should be treated as provisional and subject to revision as the evidence base develops.
Key references
Sackett DL et al. (1996). Evidence based medicine: what it is and what it isn't. BMJ, 312(7023), 71-72.
Ioannidis JP. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.
Higgins JPT et al. (2011). The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 343, d5928.
Guyatt GH et al. (2008). GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ, 336(7650), 924-926.
For individual supplement evidence reviews, see the Evidence library on Evidentia Nutrition.