Reading the Evidence — Part 5 of 5
Evidence Review · 13 April 2026

Reading the evidence: how to draw calibrated conclusions from a mixed evidence base

Professor Jatin Joshi · BDS MBBS MSc(Oxon) MFDS FRCS(Plast), Hon. Professor of Surgery (Translational Research), University College London

The previous articles in this series have built a framework for reading individual studies: how they are graded, how to assess their quality, how surrogate markers can mislead, and why trials on the same supplement often produce different results. These are necessary foundations. But reading individual trials is not the same as reading a body of evidence. At some point, a judgement is required: given everything available, what does the evidence actually support?

That question is harder than it appears. It is not answered by counting studies on each side and awarding the verdict to the majority. It is not answered by finding one large positive trial and treating that as definitive. And it is not answered by finding contradictions and concluding that nothing can be known. Drawing a calibrated conclusion from a mixed evidence base requires understanding how evidence is synthesised, what that synthesis can and cannot tell you, and how to name uncertainty honestly when it is the most defensible position available.

What a forest plot shows and how to read it

When a systematic review pools the results of multiple trials into a meta-analysis, the standard way to present those results is a forest plot. Reading a forest plot is a practical skill that anyone engaging seriously with the supplement literature will benefit from developing.

Each row in a forest plot represents one included trial. A square marks the point estimate for that trial — the effect size found in that study. A horizontal line through the square shows the 95 percent confidence interval for that estimate. The size of the square reflects the weight given to the trial in the overall analysis: larger squares represent trials that contributed more to the pooled estimate, typically because they enrolled more participants or had lower variance.

A vertical line runs through the plot at the point of no effect — zero for continuous outcomes, one for relative risk or odds ratio measures. A trial whose confidence interval crosses this line could not rule out no difference between the intervention and control group. A trial whose interval sits entirely on one side of the line showed a statistically significant effect in that direction.

At the bottom of the plot, a diamond represents the overall pooled estimate from all included trials. The centre of the diamond is the pooled point estimate. The width of the diamond is the confidence interval around it. A narrow diamond centred well away from the line of no effect represents a precise, statistically significant pooled result. A wide diamond that touches or crosses the line of no effect represents an uncertain or non-significant overall finding.
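For readers who want to see the arithmetic behind the diamond, the sketch below pools a handful of trial results using inverse-variance weighting, the standard fixed-effect calculation. The effect sizes and standard errors are invented for illustration, not drawn from any real trial.

```python
# Inverse-variance pooling: the computation behind the squares' weights
# and the diamond. All numbers are invented for illustration.
import numpy as np

effects = np.array([0.30, 0.12, 0.25, 0.05, 0.18])  # per-trial point estimates
ses = np.array([0.15, 0.05, 0.12, 0.04, 0.10])      # per-trial standard errors

weights = 1.0 / ses**2  # more precise trials carry more weight
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# The diamond: pooled point estimate and its 95 percent confidence interval
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled estimate {pooled:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
for e, w in zip(effects, weights):
    print(f"  trial effect {e:+.2f} carries {100 * w / weights.sum():.1f}% of the weight")
```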

What the plot also reveals, immediately and visually, is heterogeneity. If the individual trial squares and their intervals are clustered tightly together, the trials are telling a consistent story. If they are spread widely apart, with some showing large effects and others showing none, the trials are not consistent — and the pooled diamond may be hiding more than it reveals. The I-squared statistic, usually reported alongside the plot, quantifies this: it estimates the proportion of variability not explained by chance alone, but does not distinguish between true biological differences between populations and other sources of variability such as bias or measurement differences. Some degree of heterogeneity is expected even in well-conducted evidence bases — the key question is whether it materially alters interpretation. A high I-squared value, conventionally above 50 percent, signals substantial unexplained variability that warrants investigation and cautious interpretation of the pooled estimate, rather than simply accepting it as a reliable summary. In some cases no clear explanation emerges, and the appropriate conclusion is that the evidence is genuinely inconsistent.
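The I-squared statistic itself is straightforward to compute from the same quantities. The sketch below derives it from Cochran's Q, again using invented numbers; the two calls contrast a tightly clustered evidence base with a scattered one.

```python
# I-squared from Cochran's Q: the share of variability across trials
# beyond what sampling error alone would produce. Numbers are invented.
import numpy as np

def i_squared(effects, ses):
    effects, ses = np.asarray(effects), np.asarray(ses)
    weights = 1.0 / ses**2
    pooled = np.sum(weights * effects) / np.sum(weights)
    q = np.sum(weights * (effects - pooled) ** 2)  # Cochran's Q
    df = len(effects) - 1
    return 100 * max(0.0, (q - df) / q) if q > 0 else 0.0

# Tightly clustered trials: spread consistent with chance alone
print(i_squared([0.20, 0.22, 0.18, 0.21], [0.05, 0.06, 0.05, 0.07]))
# Widely scattered trials: substantial variability beyond chance
print(i_squared([0.60, 0.05, 0.45, -0.10], [0.05, 0.06, 0.05, 0.07]))
```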

The interactive diagram below provides an annotated forest plot you can explore, with examples showing what low and high heterogeneity look like and how the diamond should be interpreted in each case.

What publication bias does to the picture

Even a well-constructed meta-analysis can present a distorted picture if the body of literature it draws from is itself distorted. Publication bias — the tendency for trials with positive or statistically significant results to be published at higher rates than trials with null or negative results — is one of the most consequential and well-documented problems in clinical research.

The mechanism is straightforward. Journals have historically been more likely to accept papers reporting positive findings. Researchers are more likely to pursue publication when their results are significant. Sponsors of industry-funded trials have commercial incentives to ensure favourable results reach publication and to allow unfavourable ones to remain unpublished. The cumulative effect is a published literature that systematically over-represents positive findings.

For meta-analysis, this is a fundamental problem. If null results are systematically absent from the pool of published trials, the pooled estimate will be biased upward. A meta-analysis of six published trials showing positive effects may be missing four unpublished trials showing nothing, which would substantially change the conclusion if included. Bias also occurs within published trials themselves, where outcomes may be selectively reported or redefined based on results — an issue covered in Part 2 of this series — meaning that even the data within published papers may not represent everything that was measured.

A precise pooled estimate does not necessarily imply high certainty. If the underlying evidence is biased, heterogeneous, or indirect — drawn from surrogate outcomes, unrepresentative populations, or trials at high risk of bias — the pooled number may be precise in a mathematical sense while remaining highly uncertain as a guide to clinical reality. The pooled estimate itself also depends on the statistical model used: random-effects models give more weight to smaller studies and typically produce wider confidence intervals when heterogeneity is present, while fixed-effect models assume all trials estimate the same underlying effect, which is rarely realistic in supplement research. In some cases — particularly when heterogeneity is extreme, bias is likely, or study designs are not comparable — pooling may produce a mathematically correct but clinically misleading result.
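The difference between the two models can be made concrete. The sketch below applies both to the same invented, heterogeneous trial set, using the DerSimonian-Laird estimate of between-trial variance for the random-effects calculation; note how the weights flatten and the interval widens.

```python
# Fixed-effect vs random-effects pooling under heterogeneity, using the
# DerSimonian-Laird between-trial variance (tau^2). Numbers are invented.
import numpy as np

effects = np.array([0.60, 0.05, 0.45, -0.10, 0.30])
ses = np.array([0.20, 0.06, 0.15, 0.05, 0.12])

w_fixed = 1.0 / ses**2
pooled_fixed = np.sum(w_fixed * effects) / np.sum(w_fixed)

# DerSimonian-Laird: estimate tau^2 from Cochran's Q
q = np.sum(w_fixed * (effects - pooled_fixed) ** 2)
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

# Adding tau^2 to each trial's variance flattens the weights, so small
# trials count for relatively more and the confidence interval widens.
w_random = 1.0 / (ses**2 + tau2)
pooled_random = np.sum(w_random * effects) / np.sum(w_random)

for label, w, p in (("fixed ", w_fixed, pooled_fixed),
                    ("random", w_random, pooled_random)):
    se = np.sqrt(1.0 / np.sum(w))
    print(f"{label}: {p:.3f}, 95% CI [{p - 1.96 * se:.3f}, {p + 1.96 * se:.3f}]")
```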

The funnel plot is the standard visual tool for detecting this. In a symmetric, unbiased body of literature, plotting each trial's effect size against a measure of its precision produces an inverted funnel shape: large, precise trials cluster near the true effect, and smaller, less precise trials scatter more widely around it but symmetrically. When small trials with null results are missing — the scenario that publication bias predicts — the funnel develops a gap in one corner. Asymmetry in a funnel plot raises the suspicion of publication bias, though it can also reflect genuine heterogeneity or small-study effects, and interpreting it requires caution, particularly when the number of studies is small.
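The shape is easy to reproduce. The simulation below generates trials around a true effect of 0.2, then "publishes" small trials only when they reach significance, which is one simple way to mimic the mechanism; the resulting plot shows the characteristic missing corner.

```python
# A simulated funnel plot: effect estimates against standard error, with
# small null trials removed to mimic publication bias. Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ses = rng.uniform(0.02, 0.30, 200)                 # varying trial precision
effects = rng.normal(0.2, ses)                     # estimates scatter around 0.2
published = (ses < 0.10) | (effects / ses > 1.96)  # small trials kept only if significant

plt.scatter(effects[published], ses[published], s=12)
plt.axvline(0.2, linestyle="--")  # the true effect
plt.gca().invert_yaxis()          # precise trials sit at the top of the funnel
plt.xlabel("Effect estimate")
plt.ylabel("Standard error")
plt.title("Funnel plot with small null trials missing")
plt.show()
```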

In supplement research, the conditions that generate publication bias are consistently present. Industry funding is common, independent replication is limited, and the commercial pressure to publish positive findings is substantial. A meta-analysis in this space should be read with an awareness that the underlying literature is likely skewed, and that pooled estimates probably overstate the true effect to some degree.

The problem of small-study effects

Closely related to publication bias is the small-study effect: the tendency for small trials to report larger effect sizes than large trials on the same question. This arises partly from publication bias, since small null trials go unpublished while small positive trials reach print, and partly from the statistical reality that small trials can only reach significance when the observed effect is large, whether because the true effect is large or because chance produced an unusually extreme result.

The consequence is that in a field dominated by small trials, as supplement research is, effect estimates from individual studies and early meta-analyses tend to be inflated. As larger, better-powered, independently funded trials are conducted, effect sizes typically shrink. Reading a meta-analysis that is built primarily on small, industry-funded trials should prompt the question: would this finding hold if larger independent trials were included?
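A simple simulation makes the inflation visible, as sketched below: with a modest true effect, small trials reach significance only when chance hands them an extreme estimate, so the significant small trials overstate the effect, and the overstatement shrinks as trials grow.

```python
# The small-study effect in simulation: among trials that reach
# significance, smaller trials report inflated effects. Parameters invented.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.15  # a modest true standardised effect

for n in (20, 80, 320, 1280):            # participants per arm
    se = np.sqrt(2.0 / n)                # SE of a standardised mean difference
    estimates = rng.normal(true_effect, se, 10_000)
    significant = estimates / se > 1.96  # one-sided test, for simplicity
    print(f"n per arm {n:4d}: {significant.mean():5.1%} significant, "
          f"mean significant estimate {estimates[significant].mean():.2f}")
```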

How to hold uncertainty without collapsing into either confidence or nihilism

The framework built across this series leads to a practical question: given what you now understand about evidence quality, surrogate markers, heterogeneity, and publication bias, what is the right epistemic stance toward a supplement with a mixed or limited evidence base?

Two unhelpful responses are common. The first is false confidence: selecting the positive evidence, ignoring the negative, and concluding that the supplement works. The second is epistemic nihilism: concluding that because the evidence is imperfect or conflicting, nothing can be known and no judgement is possible. Neither is intellectually honest, and neither is useful.

The calibrated response is to characterise what the evidence actually supports, at what level of confidence, for which population, at which dose, for which outcome. This means being specific about scope rather than retreating to vague generalities in either direction. A statement like "evidence suggests modest effects on a specific outcome in a specific deficient population at a specific dose, with moderate confidence, from trials with some methodological limitations" is more honest and more useful than either "this supplement is proven" or "the evidence is too mixed to say anything."

This is not a counsel of paralysis. Calibrated conclusions still direct action. A compound with consistent biomarker evidence but no clinical outcome trials warrants different behaviour from one with replication across large independent trials in clinically relevant populations. Naming that difference clearly is the point.

What Evidentia's ratings reflect

The two-axis evidence model used throughout the Evidentia library — rating both the strength and the type of evidence separately — is an attempt to apply exactly this framework. A rating of Moderate (Clinical) means something different from Moderate (Biomarker), and both mean something different from Emerging (Mechanistic). The intention is to prevent the kind of inferential collapse that supplement marketing depends on: the slide from "there is some evidence" to "this works."

The evidence ratings in Evidentia are conservative by design. Strong ratings are reserved for evidence that has been replicated across large, independent, well-conducted trials measuring patient-relevant outcomes. Moderate ratings reflect a meaningful but imperfect evidence base — enough to support a considered position, not enough to warrant high confidence. Emerging ratings reflect early signals worth monitoring but not yet sufficient to justify confident claims. Insufficient ratings reflect the honest position that no meaningful conclusion can currently be drawn.
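One illustrative way to see why the two axes must stay separate is to treat a rating as a pair rather than a single value. The sketch below is hypothetical, not Evidentia's actual schema; the names and categories are assumptions made for the example.

```python
# A hypothetical two-axis evidence rating: strength and evidence type are
# separate dimensions, so "Moderate" alone is not a complete rating.
from dataclasses import dataclass
from enum import Enum

class Strength(Enum):
    STRONG = "strong"
    MODERATE = "moderate"
    EMERGING = "emerging"
    INSUFFICIENT = "insufficient"

class EvidenceType(Enum):
    CLINICAL = "clinical"        # patient-relevant outcomes
    BIOMARKER = "biomarker"      # surrogate markers
    MECHANISTIC = "mechanistic"  # lab or animal findings

@dataclass(frozen=True)
class EvidenceRating:
    strength: Strength
    evidence_type: EvidenceType

# Moderate (Clinical) and Moderate (Biomarker) are distinct ratings:
a = EvidenceRating(Strength.MODERATE, EvidenceType.CLINICAL)
b = EvidenceRating(Strength.MODERATE, EvidenceType.BIOMARKER)
print(a == b)  # False: the type axis carries real information
```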

These distinctions do not tell you what to take. They tell you what the evidence supports and at what level of confidence. What to do with that information involves individual circumstances, baseline status, dietary context, and tolerance for uncertainty that no evidence review can resolve on someone's behalf.

A practical reading framework for supplements

The five articles in this series have built a set of questions that, taken together, constitute a practical framework for reading any supplement claim:

1. What type of study is this, and where does it sit in the evidence hierarchy?
2. How well was the study conducted: was it adequately powered, properly blinded, pre-registered, and independently funded?
3. What was actually measured: a patient-relevant outcome or a surrogate marker, and if a surrogate, has it been validated?
4. Does the evidence apply to me: was the population, dose, and formulation comparable to my situation?
5. Why might this study conflict with others: is the disagreement explained by baseline status, dose, formulation, duration, or outcome differences, or does genuine uncertainty remain?
6. What does the body of evidence as a whole support, at what level of confidence, for which outcome, in which population?

None of these questions requires statistical training to apply. They require the habit of asking them. Supplement marketing depends on the reader not asking them — on the weight of a claim being assessed by its confidence and repetition rather than by its evidential basis. Developing the habit of asking changes that dynamic.

The evidence base for any supplement will continue to develop. Trials that have not been conducted yet may strengthen or undermine what is currently believed. Conclusions that feel solid today may be revised by better evidence tomorrow. That is not a reason for paralysis — it is the normal condition of evidence-based reasoning. The appropriate response is to hold conclusions proportionately, update them when better evidence arrives, and be honest about the uncertainty that remains.

Where evidence is limited or outcomes are uncertain, conclusions should be treated as provisional and subject to revision as the evidence base develops.

Key references

Higgins JPT, Thomas J, Chandler J et al. (eds). (2023). Cochrane Handbook for Systematic Reviews of Interventions, version 6.4. Cochrane.

Sterne JAC et al. (2011). Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ, 343, d4002.

Ioannidis JPA. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.

Guyatt GH et al. (2008). GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ, 336(7650), 924–926.

Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252–260.

For individual supplement evidence reviews, see the Evidence library on Evidentia Nutrition.
