Why studies conflict and what it means

Understanding heterogeneity of treatment effects in supplement research

When trials on the same supplement produce different results, the disagreement usually has a source. Some sources reflect genuine biological differences in who benefits; others reflect distortions introduced by study design and measurement.

Biological differences — real variation in who benefits

Baseline status
Deficient vs replete populations
One of the most consistent predictors of whether a nutritional supplement produces benefit is whether the person taking it actually needs it. Trials enrolling people with a genuine deficiency are measuring a fundamentally different situation from trials enrolling people with adequate status. Pooling these two groups, or comparing one trial in each type of population, will produce different results — neither trial is wrong.
Vitamin D in deficient adults vs vitamin D in replete adults. Iron in deficient women vs iron in replete adults. Folate in low-intake populations vs folate in well-nourished ones.
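The dilution effect described above can be sketched numerically. A minimal Python illustration, with made-up effect sizes — every number below is hypothetical and not drawn from any real trial:

```python
# Hypothetical numbers for illustration only: suppose supplementation
# improves the outcome by 0.5 units in deficient people and by 0 units
# in replete people.
effect_deficient = 0.5
effect_replete = 0.0

def observed_trial_effect(frac_deficient):
    """Average treatment effect a trial observes, given the fraction of
    enrolled participants who are genuinely deficient."""
    return (frac_deficient * effect_deficient
            + (1 - frac_deficient) * effect_replete)

# A deficiency-targeted trial vs a general-population trial of the
# same supplement at the same dose:
print(observed_trial_effect(0.90))  # ~0.45 — looks clearly positive
print(observed_trial_effect(0.10))  # ~0.05 — looks near-null
```

Both "trials" apply the identical intervention; the apparent conflict is entirely a function of who was enrolled.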
Population characteristics
Age, sex, genetics, comorbidities
Beyond baseline nutritional status, genuine biological differences between populations modify how interventions work. Age affects absorption and metabolism. Sex affects hormonal context. Genetic variants in metabolic pathways create real variation in response — for example, MTHFR variants affect folate metabolism, and CYP2R1 variants affect vitamin D hydroxylation. Comorbidities change the context in which an intervention operates.
Creatine supplementation producing larger lean mass gains in older adults with sarcopenia than in young athletes. Magnesium supplementation showing stronger sleep effects in people with low dietary intake.
Duration and timing
Too short to see the outcome
Many supplement trials run for weeks or months while the outcomes they purport to address develop over years. For outcomes that develop over long timeframes, a short trial may simply be too brief to detect a real effect in either direction. Short trials may also not allow sufficient time for tissue saturation or physiological adaptation.
Vitamin D and bone density outcomes requiring years of follow-up to detect meaningful changes. Omega-3 and cognitive outcomes in ageing populations requiring decade-scale observation.

Study design and measurement — distortions that produce apparent conflict

Dose and formulation
Different amounts, different forms
Trials using different doses may be testing fundamentally different interventions. A dose that produces a measurable physiological effect may differ substantially from what a standard supplement provides. Formulation differences — different salt forms, different bioavailability, different delivery vehicles — can produce different results even at nominally similar doses.
Omega-3 cardiovascular trials at 4 g EPA daily vs 1 g EPA+DHA. Magnesium bioavailability varying across glycinate, citrate, and oxide forms.
Outcome selection
Different endpoints, different answers
Trials measuring different outcomes from the same intervention are not directly comparable. A trial measuring a biomarker and one measuring a clinical endpoint may produce different conclusions even if both are well-conducted. A trial measuring self-reported fatigue and one measuring performance on a cognitive task are measuring different things. Treating them as replications of the same question produces apparent contradictions that are not real conflicts.
Omega-3 trials measuring triglyceride levels vs measuring cardiovascular events. Magnesium trials measuring serum magnesium vs measuring sleep quality.
Trial quality differences
Not all trials carry equal weight
Trials differ in randomisation quality, blinding, sample size, pre-registration, and funding independence. When a small, industry-funded, unblinded trial and a large, independent, double-blind trial reach different conclusions, this is not symmetric evidence on both sides. The quality difference matters and should be reflected in how much weight each trial receives.
A 30-person industry-funded trial showing a positive result vs a 500-person independently funded trial showing null — these are not equivalent evidence in both directions.
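One standard way to make "weight" concrete is inverse-variance weighting, as used in fixed-effect meta-analysis: each trial's weight is 1/SE², so large, precise trials dominate the pooled estimate. A Python sketch with invented numbers — the effects and standard errors below are hypothetical:

```python
import math

# Hypothetical effect estimates and standard errors — illustrative only.
small_trial = {"effect": 0.60, "se": 0.35}  # e.g. n=30, industry-funded
large_trial = {"effect": 0.02, "se": 0.09}  # e.g. n=500, independent

trials = [small_trial, large_trial]
weights = [1 / t["se"] ** 2 for t in trials]  # inverse-variance weights

pooled = sum(w * t["effect"] for w, t in zip(weights, trials)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

for t, w in zip(trials, weights):
    print(f"effect {t['effect']:+.2f}: weight share {w / sum(weights):.0%}")
print(f"pooled effect {pooled:.3f} (SE {pooled_se:.3f})")
```

With these numbers the large trial receives roughly 94% of the weight, so the pooled estimate sits close to its near-null result — the opposite of treating the two trials as one vote each.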

Vitamin D is the most extensively studied example of how baseline status drives apparent contradictions between trials. The same supplement, the same dose range, consistently different results — because the trials enrolled fundamentally different populations.

VITAL (Manson et al., 2019)
Mostly replete adults
No skeletal benefit
Population
25,871 healthy US adults, mean age 67. Not selected for vitamin D deficiency. Mean baseline 25(OH)D adequate.
Dose
2,000 IU vitamin D3 daily for median 5.3 years.
Primary outcome
Cancer and cardiovascular events. No significant reduction in fractures, falls, or cardiovascular outcomes.
No benefit seen because participants were largely replete at baseline. Supplementation in people who are not deficient cannot correct a deficiency that does not exist.
ViDA (Scragg et al., 2017)
Mostly replete New Zealand adults
No fracture benefit
Population
5,110 New Zealand adults aged 50–84. Mean baseline 25(OH)D 63 nmol/L — within adequate range.
Dose
100,000 IU monthly (bolus dosing). Follow-up 3.3 years.
Primary outcome
Cardiovascular disease, cancer, falls, fractures. No significant benefit on any primary outcome. Subgroup analysis in deficient participants suggested possible fracture benefit.
Bolus monthly dosing may produce different tissue effects than daily dosing. Population again largely replete at baseline.
D-Health (Neale et al., 2022)
Australian adults, largely replete
Mixed findings
Population
21,315 Australian adults aged 60–84. Largely replete given high sun exposure across much of Australia.
Dose
60,000 IU monthly bolus over 5 years.
Primary outcome
All-cause mortality — no significant difference. Some signals in cancer mortality subgroup analyses, not replicated definitively.
High baseline status limits the potential for supplementation to add further benefit. The geographical context (sun exposure) is a meaningful modifier rarely discussed in marketing of vitamin D products.
Deficiency-targeted meta-analyses
Trials in genuinely deficient populations
Consistent benefit
Evidence
Meta-analyses stratifying by baseline 25(OH)D consistently find larger and more significant effects in participants who were deficient at baseline, particularly for falls in older adults, bone density, and respiratory infection risk.
Key finding
A 2021 Nature Reviews Endocrinology consensus concluded that supplementation in vitamin D-replete adults does not provide measurable benefit, while correction of genuine deficiency is clinically warranted.
The evidence is not contradictory — it is coherent once baseline status is accounted for. Trials show benefit where deficiency exists; they show little benefit where it does not.
The vitamin D literature looks contradictory only if you treat all trials as measuring the same thing. Stratified by baseline status, the picture becomes largely coherent: supplementation addresses deficiency where it exists, and adds little where it does not. Some heterogeneity remains even within strata, reflecting dosing strategy, adherence, and outcome definition — but the main driver of apparent conflict is population mismatch, not fundamental disagreement about the compound's effects.
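The stratified reading can be sketched as a toy meta-analysis: pool all trials together and the effect looks muddled; pool within baseline-status strata and each stratum is internally consistent. All numbers below are invented for illustration and are not taken from VITAL, ViDA, or D-Health.

```python
import math
from collections import defaultdict

# Hypothetical trial-level results: (name, effect, standard error, stratum).
trials = [
    ("trial A", 0.02, 0.05, "replete"),
    ("trial B", -0.01, 0.06, "replete"),
    ("trial C", 0.30, 0.08, "deficient"),
    ("trial D", 0.24, 0.07, "deficient"),
]

def pool(results):
    """Fixed-effect inverse-variance pooled effect and its standard error."""
    weights = [1 / se ** 2 for _, _, se, _ in results]
    effect = sum(w * eff for w, (_, eff, _, _) in zip(weights, results))
    return effect / sum(weights), math.sqrt(1 / sum(weights))

by_stratum = defaultdict(list)
for trial in trials:
    by_stratum[trial[3]].append(trial)

print("all trials pooled:  effect %.3f (SE %.3f)" % pool(trials))
for stratum, group in sorted(by_stratum.items()):
    print(f"{stratum} stratum only: effect %.3f (SE %.3f)" % pool(group))
```

Pooled over everything, the effect looks small and ambiguous; within strata, the replete trials agree on roughly nothing and the deficient trials agree on a clear benefit.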

Subgroup analysis — examining whether an intervention works differently in different groups of participants — is a legitimate and important tool when done correctly. It is also one of the most misused tools in supplement research. The difference is in how the analysis was planned and what it requires to be credible.

Lower credibility

Analysis conducted after data was collected (post-hoc)
Subgroup not pre-specified in trial registration
Many subgroups tested without correction
No formal test of interaction — only within-group p-values compared
No biological rationale for why this subgroup should differ
Not replicated in any subsequent trial

Higher credibility

Pre-specified in trial registration before recruitment
Clear biological rationale for the subgroup hypothesis
Formal interaction test conducted and significant
Subgroup analysis was primary, not exploratory
Sample size sufficient within the subgroup
Finding replicated in independent trials
The interaction test matters. Finding a significant effect in one subgroup and a non-significant effect in another does not establish that the treatment works differently across those groups. The correct test is whether the effect sizes are significantly different from each other — the test of interaction. This is a harder statistical hurdle and is far less often reported. A review published in JAMA Internal Medicine in 2017 found that 61% of trials claiming subgroup heterogeneity in their abstracts were not supported by formal interaction testing.
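The distinction can be made concrete with a small calculation. Assuming normally distributed effect estimates (all numbers below are hypothetical), the within-group p-values land on opposite sides of 0.05 while the formal interaction test does not reach significance:

```python
import math

def p_two_sided(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical subgroup estimates (effect, standard error) — illustrative only.
eff_a, se_a = 0.30, 0.12  # subgroup A
eff_b, se_b = 0.10, 0.12  # subgroup B

print(f"subgroup A: p = {p_two_sided(eff_a / se_a):.3f}")  # significant (~0.01)
print(f"subgroup B: p = {p_two_sided(eff_b / se_b):.3f}")  # not significant (~0.40)

# Formal interaction test: are the two effects different from each other?
diff = eff_a - eff_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
print(f"interaction: p = {p_two_sided(diff / se_diff):.3f}")  # ~0.24 — not significant
```

"Significant in A, not in B" and "A differs from B" are different claims; only the interaction test addresses the second.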

When you encounter a supplement with a mixed evidence base, these questions help determine whether the conflict is apparent or genuine.

1. Are the conflicting trials testing the same population?
Trials in deficient populations and replete populations are not measuring the same thing. Neither is a trial in elderly clinical patients equivalent to one in healthy middle-aged adults. Population mismatch is the most common source of apparent contradiction.
2. Are they using the same dose and formulation?
A null result at 1 g EPA+DHA does not contradict a positive result at 4 g pure EPA. A negative result with magnesium oxide does not contradict a positive result with magnesium glycinate. Dose and formulation differences are real differences, not noise to be averaged away.
3. Are they measuring the same outcome?
A biomarker trial and a clinical outcome trial are not replications of the same question. A self-reported questionnaire and an objective performance measure are different things. Ensure the conflicting trials are actually in conflict, not just addressing adjacent questions.
4. Are they of comparable quality?
A large, independent, pre-registered, double-blind trial does not carry the same weight as a small, industry-funded, open-label one. When trials of unequal quality conflict, the higher-quality trial should generally be weighted more heavily, not treated as one vote against one vote.
5. Were they long enough to measure the outcome?
A short trial failing to detect a long-latency effect is not evidence that the effect does not exist. Duration mismatch is particularly common in nutrition research, where meaningful outcomes may take years to develop and most trials run for weeks or months.
6. After accounting for these factors, does genuine uncertainty remain?
If, after working through these questions, genuine uncertainty remains — well-matched trials in comparable populations reaching different conclusions — that is the honest epistemic position. It means the evidence is uncertain, and that should be stated rather than resolved by picking a side.
Working through these questions resolves more apparent contradictions than it leaves unresolved. Trials that look contradictory at the surface often turn out to be measuring related but distinct questions. Genuine uncertainty — where this process still leaves conflicting signals — warrants explicit acknowledgement rather than false resolution in either direction.