Reading the evidence: why studies conflict and what it means
You read two randomised controlled trials on the same supplement. Both are properly randomised, adequately powered, and independently funded. One finds a significant benefit. The other finds nothing. A third finds a benefit in one subgroup and not another. The natural response is to ask which trial is right. The more useful question is why they disagree — because often neither is wrong, and understanding the source of the disagreement is more informative than trying to pick a winner.
This is the problem of heterogeneity: genuine variability in treatment effects across different populations, doses, formulations, baseline states, and contexts. It is one of the most consistently mishandled issues in both the scientific literature and the supplement industry. Conflicting studies are either used to dismiss an entire evidence base or cherry-picked: the positive ones cited as proof of efficacy, the negative ones quietly ignored. Neither approach is intellectually honest, and both lead to worse decisions.
What heterogeneity actually means
In clinical research, heterogeneity refers to variability in results that exceeds what would be expected from chance alone. When a meta-analysis pools multiple trials and finds that their results differ more than the play of chance would explain, that excess variability is called statistical heterogeneity. It signals that something other than random variation is driving the differences — that the trials are not, in an important sense, measuring the same thing.
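The "more variability than chance would explain" idea can be made concrete. The sketch below, using hypothetical numbers rather than data from any trial discussed here, computes Cochran's Q, the standard test statistic for statistical heterogeneity, and the derived I² statistic, which expresses the share of between-trial variability in excess of chance:

```python
def heterogeneity(effects, std_errors):
    """Cochran's Q and I-squared for a set of trial effect estimates.

    effects: per-trial effect estimates (e.g. mean differences)
    std_errors: their standard errors
    """
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums each trial's weighted squared deviation from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1  # expected value of Q if trials differ only by chance
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared

# Three hypothetical trials with similar precision but very different results:
q, i2 = heterogeneity([0.5, 0.1, -0.2], [0.1, 0.1, 0.1])
```

When I² is high, the trials are telling you they differ for a reason worth identifying, not merely that measurement is noisy.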
The causes of heterogeneity fall into a few broad categories. Clinical heterogeneity arises from genuine differences between the populations studied: different ages, baseline health status, nutritional status, comorbidities, genetic backgrounds, or dietary contexts. Methodological heterogeneity arises from differences in how trials were designed and conducted: different doses, formulations, treatment durations, outcome measures, and risk of bias. These two types often compound each other, making it difficult to isolate any single source of variability.
The important implication is that a statistically heterogeneous body of evidence is not simply a noisy collection of measurements of the same underlying truth. It may be a collection of measurements of genuinely different things — different effects in different people under different conditions. Pooling them into a single overall estimate may obscure more than it reveals. It is also worth noting that not all variability reflects true biological differences — some arises from measurement error, chance, or bias — and distinguishing these sources is part of interpreting a heterogeneous evidence base.
Why the same supplement can produce different effects in different people
Effect modification — the technical term for what happens when the size or direction of a treatment effect varies across subgroups — is the core reason why supplement trials conflict. It is not a methodological failure. It is a biological reality.
Consider what determines whether a nutritional supplement produces a benefit. One of the most consistent predictors across many micronutrients is baseline status. A compound that corrects a genuine deficiency reliably produces benefit in deficient individuals. The same compound administered to people whose status is already adequate may produce little or nothing, because there is no deficiency to correct. This is the deficiency-repletion distinction that runs through Evidentia's evidence library, and it explains a large proportion of the apparent contradictions in supplement research.
Vitamin D is perhaps the most extensively documented example of how this plays out in practice. Large trials in mostly replete populations — including VITAL, ViDA, and D-Health — consistently find little to no skeletal or cardiovascular benefit from supplementation. Smaller trials and subgroup analyses focusing on participants with genuinely low baseline 25-hydroxyvitamin D concentrations tend to find more consistent signals, particularly for bone density, falls in older adults, and respiratory infections. A review published in Nature Reviews Endocrinology in 2021 concluded that supplementation alone in vitamin D-replete adults does not provide measurable health benefits, while correction of genuine deficiency is warranted. While baseline status explains much of the variation in trial outcomes, some heterogeneity remains even within stratified analyses, reflecting additional factors such as dosing strategy, adherence, and outcome definition. The picture is substantially clearer than the surface literature suggests, but not entirely resolved. The trials are not contradicting each other in any fundamental sense. They are largely measuring different things in different people.
This pattern appears across many other nutrients, with varying strength of evidence. Iron supplementation is highly effective in iron-deficient individuals and has no rationale in replete adults. Folate is critical in people with inadequate intake and in pregnancy, where deficiency has well-documented consequences, but its effects in well-nourished, replete populations are smaller and less consistent. Magnesium supplementation consistently produces effects in people with low dietary intake or documented insufficiency, and produces more modest effects in those whose intake is already adequate, though the evidence here is less definitive than for iron. The pattern is consistent across many nutrients: the magnitude of benefit often correlates with the degree to which supplementation addresses an actual gap, though the strength of this relationship varies across compounds and contexts.
Beyond baseline status, other characteristics genuinely modify how an intervention works. Age matters because absorption, metabolism, and physiological context change across the lifespan. Sex matters for several supplements, with different hormonal environments producing different responses. Body composition matters for fat-soluble vitamins, where bioavailability and distribution are affected by adiposity. Genetic variants affecting metabolism of folate, vitamin D, and omega-3 fatty acids are well-documented and create real variation in individual response. Diet and concurrent supplementation create context effects that are rarely fully controlled for.
The subgroup analysis problem
Identifying the sources of heterogeneity would be straightforward if every trial were designed to test them. In practice, identifying what modifies a treatment effect requires examining subgroups — and subgroup analysis is one of the most abused tools in clinical research.
The core problem is statistical. A trial powered to detect an overall treatment effect typically does not have enough participants within any individual subgroup to reliably detect whether the treatment effect differs across that subgroup. Testing many subgroups inflates the probability of finding an apparently significant difference by chance alone. A trial with twenty subgroup analyses has a substantial probability of finding at least one significant result even if no true effect modification exists.
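The multiplicity arithmetic is easy to check. Assuming independent tests at a 5% significance threshold (an idealisation; real subgroup tests are correlated), the chance of at least one false positive across k subgroups is 1 − 0.95^k:

```python
def prob_false_positive(k, alpha=0.05):
    """Probability of at least one spurious 'significant' result
    across k independent tests when no true effect exists anywhere."""
    return 1 - (1 - alpha) ** k

# With twenty subgroup analyses, a false positive is more likely than not:
p20 = prob_false_positive(20)  # roughly 0.64
```

Correlation between subgroups softens this somewhat in practice, but the qualitative conclusion survives: unconstrained subgroup testing manufactures "significant" findings.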
The correct approach is to pre-specify the subgroups of interest before the trial begins, based on a genuine biological rationale, and to test them as formal interaction analyses rather than simply comparing results within each subgroup separately. Interaction testing asks whether the treatment effect is significantly different across groups, which is a considerably harder statistical hurdle than finding a significant effect within one group and a non-significant effect in another. A review published in the European Journal of Clinical Investigation in 2019 found that in the majority of trials claiming subgroup heterogeneity in their abstracts, the results were not supported by formal interaction testing.
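The gap between the two standards can be illustrated numerically. In the made-up example below, one subgroup crosses the 5% threshold and the other does not, yet a formal interaction test finds no significant difference between them:

```python
from math import sqrt, erf

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def within_subgroup_p(effect, se):
    """Naive test: is the effect in this one subgroup different from zero?"""
    return two_sided_p(effect / se)

def interaction_p(effect_a, se_a, effect_b, se_b):
    """Formal interaction test: do the two subgroup effects differ
    from each other by more than chance would explain?"""
    z = (effect_a - effect_b) / sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(z)

# Hypothetical subgroup results (effect estimate, standard error):
p_a = within_subgroup_p(0.30, 0.14)            # "significant" (p < 0.05)
p_b = within_subgroup_p(0.10, 0.14)            # not significant
p_int = interaction_p(0.30, 0.14, 0.10, 0.14)  # no evidence the effects differ
```

The asymmetry is the point: a significant effect in one group alongside a non-significant one in the other is entirely compatible with there being no real difference between the groups at all.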
Post-hoc subgroup analyses — those not pre-specified, conducted after the data are collected, and selected because they produced interesting results — are hypothesis-generating at best. In supplement marketing, they are routinely presented as evidence. A trial that found no overall effect but found a positive result in one subgroup of participants is frequently cited as evidence that the supplement works in that group. Without replication in a trial specifically designed to test that subgroup, it is not.
What dose and formulation differences explain
Even in well-matched populations, trials can produce different results because they tested different versions of the intervention. Dose is the most obvious dimension: a compound that produces a meaningful effect at a therapeutic dose may produce no detectable effect at the lower doses that standard supplements provide. Formulation matters because bioavailability varies substantially between forms of the same nutrient.
The practical implication is that a trial conducted at a high dose does not validate the effectiveness of a product sold at a fraction of that dose. A clinical finding in a pharmaceutical-grade formulation does not establish that an over-the-counter supplement with different manufacturing standards, different excipients, and potentially different bioavailability will produce the same result. These distinctions are almost never acknowledged in supplement marketing, where the evidence from optimal conditions is presented as justification for products that operate in quite different ones.
Duration matters too. Many trials in supplement research are short — weeks rather than months, months rather than years. For outcomes that develop over long timeframes, short trials may simply be too brief to detect a real effect in either direction. A twelve-week trial showing no effect on a cardiovascular outcome does not establish that there is no long-term effect. It establishes that no effect was detectable in twelve weeks in that population at that dose.
How to think about a conflicting literature
When the evidence base for a supplement contains trials pointing in different directions, the useful response is not to pick the trials that support a preferred conclusion and dismiss the others. The more productive approach is to ask a series of questions about the sources of variability.
Are the conflicting trials testing the same population? Trials enrolling deficient individuals are not measuring the same thing as trials enrolling replete ones. Are they using the same dose and formulation? A trial using a therapeutic dose is not directly comparable to one using a standard supplement dose. Are they measuring the same outcome? A trial measuring a biomarker is not directly comparable to one measuring a clinical outcome. Are they of comparable quality? A small, industry-funded trial with a high risk of bias does not carry the same evidential weight as a large, independent, pre-registered trial.
Working through these questions usually resolves more apparent contradictions than it leaves unresolved. Trials that look contradictory at the surface often turn out to be measuring related but distinct questions. It is worth noting, however, that not all conflicting evidence resolves into a coherent underlying structure. Sometimes, after accounting for population, dose, formulation, outcome, and quality differences, genuine uncertainty remains — not because there is hidden heterogeneity yet to be explained, but because the evidence is genuinely insufficient to reach a confident conclusion. That residual uncertainty is not a failure of analysis. It is the honest epistemic position, and it should be stated as such rather than papered over with a confident claim in either direction.
The interactive diagram below illustrates the main sources of between-trial variability and provides a worked example of how the vitamin D literature looks when trials are stratified by baseline status.
Why this matters for supplement decisions
Understanding heterogeneity changes how you read a product claim. When a supplement company says "clinically studied," the relevant questions are not just whether trials exist but who they were conducted in, at what dose, with what formulation, and whether the population resembles you. A compound that produced measurable effects in deficient elderly adults in a clinical setting, at doses several times higher than the product provides, is not straightforwardly "clinically proven" for a replete younger adult buying it off a shelf.
This is not scepticism for its own sake. It is the application of a consistent interpretive standard: that evidence applies within its scope conditions, and that those scope conditions matter enormously in a field where marketing routinely treats the best possible evidence as general evidence.
The final article in this series addresses how to draw all of this together: reading a forest plot, accounting for publication bias, and arriving at calibrated conclusions from a mixed evidence base.
Where evidence is limited or outcomes are uncertain, conclusions should be treated as provisional and subject to revision as the evidence base develops.
Key references
Yusuf S, Wittes J, Probstfield J, Tyroler HA. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA, 266(1), 93–98.
Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JPA. (2017). Evaluation of evidence of statistical support and corroboration of subgroup claims in randomized clinical trials. JAMA Internal Medicine, 177(4), 554–560.
Bouillon R et al. (2022). Consensus statement on vitamin D status assessment and supplementation. Endocrine Reviews, 43(5), 528–628.
Bolland MJ, Grey A, Gamble GD, Reid IR. (2018). Assessment of research waste: wrong study populations — an exemplar of baseline vitamin D status of participants in trials of vitamin D supplementation. BMC Medical Research Methodology, 18(1), 101.
Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. (2004). Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment, 8(16), 1–56.
For individual supplement evidence reviews, see the Evidence library on Evidentia Nutrition.