Forest plots, publication bias, and calibrated conclusions
Click any row to learn what each element of a forest plot means. The diamond at the bottom represents the pooled estimate across all trials.
Example meta-analysis — five trials, low heterogeneity (I² = 18%)
Study                     Weight   ES [95% CI]
Smith 2019 (n=120, RCT)   18.4%    0.42 [0.18, 0.66]
Chen 2020 (n=84, RCT)     14.2%    0.38 [0.12, 0.64]
Patel 2021 (n=210, RCT)   32.1%    0.35 [0.18, 0.52]
Rossi 2022 (n=66, RCT)    11.8%    0.12 [−0.26, 0.50]
Kim 2023 (n=180, RCT)     23.5%    0.40 [0.18, 0.62]
Pooled effect             100%     0.37 [0.24, 0.50]
(In the plot, effects to the right of zero favour the intervention.)
I² (heterogeneity): 18% · Pooled p = 0.001 · No. of trials: 5 · Total N: 660
Box size = weight
Larger boxes represent trials that contributed more to the pooled estimate — usually because they enrolled more participants or had lower variance.
Diamond = pooled result
The diamond's centre is the pooled point estimate, and its width spans the pooled confidence interval. A diamond that crosses the line of no effect (zero here) means the pooled result is not statistically significant.
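To make the weighting concrete, here is a minimal Python sketch (hypothetical code, not anything from this page) that recovers inverse-variance weights and a pooled estimate from the effect sizes and 95% CIs in the example table. The table's displayed weights and pooled interval presumably come from a random-effects model, so the numbers will be close but not identical.

```python
import math

# Effect sizes and 95% CIs from the example table above: (ES, lower, upper).
trials = {
    "Smith 2019": (0.42, 0.18, 0.66),
    "Chen 2020":  (0.38, 0.12, 0.64),
    "Patel 2021": (0.35, 0.18, 0.52),
    "Rossi 2022": (0.12, -0.26, 0.50),
    "Kim 2023":   (0.40, 0.18, 0.62),
}

def pool_fixed_effect(trials):
    """Inverse-variance (fixed-effect) pooling.

    Each trial's SE is recovered from its 95% CI width:
    SE = (upper - lower) / (2 * 1.96). The weight is 1 / SE^2, which is
    why bigger, more precise trials get bigger boxes on the forest plot.
    """
    weights = {}
    for name, (es, lo, hi) in trials.items():
        se = (hi - lo) / (2 * 1.96)
        weights[name] = 1.0 / se**2
    total_w = sum(weights.values())
    pooled = sum(w * trials[n][0] for n, w in weights.items()) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    rel = {n: w / total_w for n, w in weights.items()}
    return pooled, ci, rel

pooled, ci, rel = pool_fixed_effect(trials)
print(f"Pooled ES: {pooled:.2f} [{ci[0]:.2f}, {ci[1]:.2f}]")
for name, w in rel.items():
    print(f"  {name}: {100 * w:.1f}%")
```

Running this gives a pooled estimate of roughly 0.36 [0.26, 0.46], close to the displayed 0.37 [0.24, 0.50]; a random-effects model would widen the interval by adding a between-study variance term to each weight.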
The I² statistic quantifies the proportion of variability that reflects genuine between-study differences rather than chance. Compare these two scenarios to see what low and high heterogeneity look like in practice.
Consistent findings — trials tell a similar story
Study     Weight   ES [95% CI]
Trial A   28%      0.38 [0.20, 0.56]
Trial B   22%      0.42 [0.22, 0.62]
Trial C   35%      0.33 [0.18, 0.48]
Trial D   15%      0.36 [0.14, 0.58]
Pooled    100%     0.37 [0.28, 0.46]
I²: 12% · Interpretation: low unexplained variability
The trials are telling a consistent story. The squares cluster in the same region, their intervals largely overlap, and the pooled diamond is narrow. An I² of 12% means most of the variability is attributable to chance rather than genuine between-study differences. The pooled estimate is a reasonable summary.
Inconsistent findings: the trials are not telling a consistent story. The squares are spread across a wide range, some showing benefit and others showing harm. An I² of 84% means most of the variability is not explained by chance, though whether that reflects true biological differences across populations, bias, or measurement differences requires investigation. The pooled diamond is wide and crosses zero. Accepting it as a reliable summary would be misleading; the appropriate response is to investigate what is driving the spread.
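The I² calculation behind both interpretations can be sketched in a few lines. The data below are hypothetical, chosen to make one clearly heterogeneous and one clearly homogeneous set; the I² values in the scenarios above are illustrative and not exactly recoverable from the rounded CIs.

```python
def heterogeneity(effects, ses):
    """Cochran's Q and the I² statistic.

    Q sums the squared, inverse-variance-weighted deviations of each
    study's effect from the pooled (fixed-effect) estimate. Under pure
    sampling error Q is expected to be about k - 1 for k studies, so the
    excess, (Q - (k - 1)) / Q, estimates the fraction of variability that
    reflects genuine between-study differences. I² is floored at 0.
    """
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Hypothetical data: three studies that disagree strongly...
q_hi, i2_hi = heterogeneity([0.1, 0.5, 0.9], [0.1, 0.1, 0.1])
# ...and three that agree almost exactly.
q_lo, i2_lo = heterogeneity([0.39, 0.40, 0.41], [0.1, 0.1, 0.1])
print(f"Disagreeing trials: Q = {q_hi:.1f}, I² = {100 * i2_hi:.0f}%")
print(f"Agreeing trials:    Q = {q_lo:.2f}, I² = {100 * i2_lo:.0f}%")
```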
A funnel plot shows each study's effect size against its precision. In a symmetric, unbiased literature, studies form an inverted funnel. When small null studies are missing — the pattern publication bias predicts — the funnel develops a gap.
Symmetric funnel: studies are distributed symmetrically around the centre line. Large, precise studies cluster at the top; smaller studies scatter evenly on both sides at the bottom. There is no systematic absence of studies in any region.
Asymmetric funnel: small studies with null or negative results are absent from the lower-left region; only positive small studies appear. This asymmetry suggests that unpublished null results may exist and that the pooled estimate from the published literature is likely inflated. Asymmetry should be investigated, not ignored.
Important caveat. Funnel plot asymmetry does not prove publication bias — it can also arise from genuine heterogeneity, small-study effects unrelated to bias, or chance when the number of studies is small (generally fewer than 10). It is a signal to investigate further, not a definitive finding. Formal tests such as Egger's test provide a quantitative complement to visual inspection but are not definitive either.
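Egger's test can be sketched as a regression of each study's standardized effect (effect divided by SE) on its precision (1 / SE): a symmetric funnel yields an intercept near zero, while a nonzero intercept signals small-study effects. This is a simplified illustration with hypothetical numbers; the full test also weights the regression and reports a significance test on the intercept.

```python
import numpy as np

def egger_intercept(effects, ses):
    """Simplified Egger regression: standardized effect vs precision.

    Returns (intercept, slope). A symmetric funnel gives an intercept
    near zero; a positive intercept means small (high-SE) studies report
    systematically larger standardized effects.
    """
    effects, ses = np.asarray(effects), np.asarray(ses)
    z = effects / ses       # standardized effects
    precision = 1.0 / ses   # large, precise studies -> large values
    slope, intercept = np.polyfit(precision, z, 1)
    return intercept, slope

ses = np.array([0.05, 0.08, 0.12, 0.20, 0.30])

# Unbiased literature: every study estimates the same true effect, 0.3.
sym_int, _ = egger_intercept(0.3 + 0.0 * ses, ses)

# Biased literature: small studies overestimate (effect = 0.3 + 0.5 * SE),
# as if their null counterparts were never published.
asym_int, _ = egger_intercept(0.3 + 0.5 * ses, ses)

print(f"Symmetric funnel intercept:  {sym_int:.3f}")
print(f"Asymmetric funnel intercept: {asym_int:.3f}")
```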
Drawing a conclusion from a body of evidence requires specifying what it supports, at what level of confidence, for which outcome, in which population. These examples show how the same evidence can support very different statements depending on scope.
Strong (Clinical)
Replicated across multiple large, independent, pre-registered trials measuring patient-relevant outcomes. Results are consistent, effect sizes are clinically meaningful, confidence intervals are narrow. Example: creatine for strength and lean mass in resistance training contexts.
Moderate (Clinical)
Meaningful evidence base with some limitations — smaller trials, less independent replication, or modest effect sizes. Supports a considered position but not high confidence. Example: magnesium for sleep quality in people with low dietary intake.
Moderate (Biomarker)
Consistent evidence for a biomarker change, but the relationship to clinical outcomes has not been established. The evidence supports the biomarker effect, not the clinical implication marketing may attach to it. Example: berberine reducing fasting glucose — consistent, but clinical outcome translation is less certain.
Emerging
Early signals from limited or preliminary evidence. Insufficient for confident claims but worth monitoring. Mechanistic rationale may be strong; human trial evidence is thin or inconsistent. Example: many newer longevity-focused compounds with small early trials.
Insufficient
The honest position when no meaningful conclusion can be drawn. This is not a failure; it is an accurate representation of where the evidence currently sits, and it protects readers from overconfident claims in either direction.
Calibrated uncertainty is not the same as saying nothing. A statement that distinguishes what is supported from what is not — that names the population, the outcome, the dose, and the confidence level — is far more useful than either "this works" or "we can't know." The goal is honesty about the current state of the evidence, not false resolution in either direction.