Reading the Evidence — Part 2 of 5
Evidence Review · 13 April 2026

Reading the evidence: how to read a study critically

Professor Jatin Joshi · BDS MBBS MSc(Oxon) MFDS FRCS(Plast), Hon. Professor of Surgery (Translational Research), University College London

The previous article in this series explained why evidence needs to be graded and how study designs sit in a hierarchy from anecdote to systematic review. Understanding the hierarchy is a useful foundation, but it answers only the first question about any body of evidence: what type of research is this? The second question is harder and more important: how well was it done?

Two randomised controlled trials can sit at the same tier of the evidence pyramid while differing enormously in the confidence they warrant. One might be a large, independent, pre-registered trial measuring patient-relevant outcomes in a well-defined population. Another might be a small, industry-funded study measuring a blood test result in a group of participants that bears little resemblance to the people the product is marketed to. Both are randomised controlled trials, but they are far from equivalent in what they actually tell us.

This article explains the factors that determine trial quality. It is not an exhaustive methodological guide. It is a practical framework for anyone who wants to read a study with more discernment than simply noting whether it reached statistical significance.

What the trial was designed to measure

The most fundamental question to ask about any trial is what outcome it was actually measuring and whether that outcome is one that matters.

In clinical research, a distinction is drawn between patient-relevant outcomes and surrogate markers. Patient-relevant outcomes are things that matter directly to the person taking the intervention: reduced symptoms, improved function, avoided hospitalisation, better quality of life, longer survival. Surrogate markers, also called biomarkers or intermediate endpoints, are measurable variables that are thought to be related to those outcomes: a blood test result, a physiological measurement, an imaging finding.

Surrogates are used in research because the outcomes we really care about often take years to develop and are difficult and expensive to measure. Tracking changes in a blood marker over twelve weeks is far easier than following a cohort for a decade to see whether they develop disease. The problem is that surrogate changes do not always predict clinical outcomes. Some surrogate markers are well-validated predictors of clinical outcomes, but many used in nutrition and supplement research are not, and their relationship to meaningful outcomes remains uncertain. A compound might reduce a biomarker reliably without reducing the incidence of the disease that biomarker is thought to reflect. This has happened repeatedly in clinical medicine, including cases where interventions that looked compelling based on surrogate data turned out to cause harm when clinical outcomes were eventually measured.

In supplement research, this distinction is routinely ignored. A trial shows that a compound changes a blood marker and the marketing copy states that it supports the relevant body system. The logical gap between "this changes a number on a blood test" and "this prevents disease or improves health" is rarely acknowledged. Reading this distinction actively, in every study, is one of the most valuable habits in evidence appraisal.

How participants were assigned to groups

Randomisation is the defining feature of an RCT, but it is easy to describe randomisation in a paper without actually having done it properly. Inadequate randomisation is one of the most common sources of bias in clinical trials.

When randomisation works as intended, participants with better and worse prognoses are distributed evenly between the intervention and control groups. Differences in outcome between the groups can then be attributed to the intervention rather than to pre-existing differences between participants. When randomisation is inadequate, these pre-existing differences remain, and the results may reflect them rather than the intervention.

A related concept is allocation concealment: whether the person enrolling participants into the trial knows which group each participant will be assigned to at the time of enrolment. If allocation is not concealed, it is possible, whether consciously or not, to enrol participants who are likely to respond well into the intervention group and those who are likely to do less well into the control group. This is distinct from blinding and is important even in trials where blinding is not possible. Trials with inadequate allocation concealment consistently show larger apparent treatment effects than trials where concealment was properly implemented.
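To make the mechanics concrete, here is a minimal sketch of one common approach: generating a randomisation schedule from random permuted blocks. The block size, group labels, and function name are arbitrary choices for the illustration, not a description of any particular trial's procedure.

```python
import random

def permuted_block_schedule(n_participants, block_size=4, seed=None):
    """Generate a randomisation schedule using random permuted blocks.

    Each block contains equal numbers of intervention ('I') and
    control ('C') assignments, shuffled, so that group sizes stay
    balanced throughout recruitment.
    """
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_participants:
        block = ['I'] * (block_size // 2) + ['C'] * (block_size // 2)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_participants]

# In practice the schedule is generated and held centrally, so the
# person enrolling participants never sees upcoming assignments.
# That separation is what allocation concealment means in practice.
print(permuted_block_schedule(12, seed=42))
```

A fixed block size is itself a known weakness: staff who work out the block length can predict the final assignments in each block, which is why real systems often vary the block size randomly.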

The interactive diagram below walks through the anatomy of a well-designed RCT. It shows where randomisation and allocation concealment sit in the trial process, and where the main sources of bias enter.

Whether participants and researchers were blinded

Blinding refers to whether participants, researchers, or outcome assessors know which group a participant has been assigned to. Its importance is often underestimated.

Participant blinding matters because of the placebo effect, which is well-documented, real, and substantial. People who know they are receiving an active intervention tend to report improvements regardless of whether the intervention has any physiological effect. In supplement research, where many primary outcomes are self-reported measures of energy, mood, or cognitive performance, this effect can easily produce an apparent benefit that disappears when a properly blinded trial is conducted.

Researcher blinding matters because investigators who know which participants received the active intervention may, again whether consciously or not, interact with those participants differently, record ambiguous observations in ways that favour the intervention, or make differential decisions about follow-up and data handling.

Outcome assessor blinding matters when the primary outcome requires human judgement. If the person measuring the outcome knows which group the participant is in, that knowledge can influence the measurement.

The standard approach in high-quality trials is double-blinding, where neither participants nor researchers know group assignment until after the analysis. Where blinding is not feasible, for example in trials of behavioural interventions or surgical procedures, this limitation should be acknowledged and its potential impact considered when interpreting the results.

In supplement research, blinding is often reported but not always achieved in practice. Compounds with distinctive sensory properties, notable side effects, or visible physical changes can break blinding even when it was intended. A well-designed trial addresses this by testing whether blinding was maintained.
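One simple version of such a check, sketched below, asks participants at the end of the trial to guess their group and tests the guesses against chance. The counts are hypothetical, and this is a simplification: formal blinding indices, such as those proposed by Bang and by James, handle "don't know" responses and guessing patterns more carefully.

```python
from scipy.stats import binomtest

# Hypothetical end-of-trial survey: 70 of 100 participants in the
# active arm correctly guessed their assignment. If blinding held,
# guesses should sit near chance (50%).
result = binomtest(k=70, n=100, p=0.5)
print(f"correct-guess rate: 70/100, p = {result.pvalue:.4f}")
# A rate well above chance suggests blinding may have been broken,
# for example by a distinctive taste or a noticeable side effect.
```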

Sample size and statistical power

A trial can be designed, randomised, and blinded correctly and still fail to detect a real effect simply because it enrolled too few participants. This is the problem of statistical power.

Power is the probability that a trial will detect a true effect if one exists. It is determined primarily by sample size, the size of the effect being looked for, and the variability of the outcome measure. A trial designed to detect a large effect in a relatively homogeneous population can be adequately powered with a few hundred participants. A trial looking for a modest effect in a heterogeneous population may require several thousand.

In supplement research, underpowered trials are common. Many studies enrol fewer than a hundred participants, run for weeks rather than months, and use outcome measures with high variability. The consequence is that such trials have limited ability to detect real effects of modest magnitude. When they do produce a positive result, that result carries an elevated risk of being a false positive, and any real effect it reflects is likely to be overestimated. This is sometimes called the winner's curse in research methodology: the effect sizes reported by small, underpowered trials that happen to reach significance tend to be substantially larger than the true effect, because only the largest chance fluctuations carry a small trial over the significance threshold.
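The winner's curse is easy to demonstrate by simulation. Every number below is an assumption chosen for illustration: a modest true effect, small groups, and many repeated trials of which only the "significant" ones are examined.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2      # assumed true standardised effect (modest)
n_per_group = 30       # a small trial
significant_effects = []

for _ in range(20_000):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05 and t > 0:
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        significant_effects.append((treated.mean() - control.mean()) / pooled_sd)

print(f"true effect: {true_effect}")
print(f"mean observed effect among significant trials: "
      f"{np.mean(significant_effects):.2f}")
# Typically around 0.6 -- roughly three times the true effect --
# because only large chance fluctuations cross the threshold.
```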

A trial should specify its sample size calculation before participants are enrolled. This calculation states the assumed effect size, the acceptable probability of a false positive, and the acceptable probability of a false negative, and derives from them the required number of participants. Absence of a sample size justification in a published trial is a warning sign.
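For a sense of what that calculation looks like, the sketch below solves for the required group size in a simple two-group comparison using the statsmodels library. The effect size, alpha, and power values are conventional illustrative choices, not figures from any specific trial.

```python
from statsmodels.stats.power import TTestIndPower

# Participants per group needed to detect a standardised effect of
# 0.3 with a 5% false-positive rate (alpha) and an 80% chance of
# detecting the effect if it exists (power).
n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"required per group: {n_per_group:.0f}")   # roughly 175

# Halving the assumed effect roughly quadruples the requirement.
print(TTestIndPower().solve_power(effect_size=0.15, alpha=0.05, power=0.8))
```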

The difference between statistical significance and effect size

Statistical significance and effect size are different things, and conflating them is one of the most reliable ways to misread a trial.

Statistical significance, expressed as a p-value, is a statement about probability. A p-value of 0.05 means that if the intervention had no effect at all, you would expect to see a result at least as extreme as the one observed in approximately five percent of trials by chance alone. Crossing the conventional threshold of p less than 0.05 does not mean an effect is real, nor that chance has been ruled out; it means only that a result this extreme would be uncommon if there were no true effect. It says nothing about whether the effect is large, clinically meaningful, or relevant to the person deciding whether to take a supplement.

Effect size is a description of how large the difference between groups actually is. A trial might find a statistically significant reduction in a fatigue score of 1.2 points on a scale that runs from 0 to 100, with a p-value of 0.02. The result is statistically significant. Whether a 1.2-point difference represents a change that a person would notice or that would influence their quality of life is a separate question entirely. This is the distinction between statistical significance and clinical significance.

The problem runs in both directions. A small trial may fail to detect a real effect of meaningful size simply because it lacks power, producing a non-significant result that is misleading in the other direction. A very large trial may detect a tiny difference that has no practical importance at all, and report it as a significant finding.
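The fatigue-score scenario above is straightforward to reproduce. The group sizes and score distributions below are invented to mirror it, and illustrate how a trivially small difference becomes "significant" once a trial is large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical fatigue scores on a 0-100 scale with an SD of 15
control = rng.normal(50.0, 15.0, 5000)
treated = rng.normal(48.8, 15.0, 5000)   # true difference: 1.2 points

t, p = stats.ttest_ind(control, treated)
print(f"difference: {control.mean() - treated.mean():.1f} points, p = {p:.4f}")
# With 5,000 per group a 1.2-point difference will almost always
# reach p < 0.05, yet a shift this small on a 100-point scale may
# be imperceptible to the people taking the supplement.
```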

Confidence intervals make this relationship clearer than p-values alone. A 95 percent confidence interval describes the range of values within which the true effect plausibly lies, given the trial data. A narrow confidence interval means the estimate is precise; a wide one means there is substantial uncertainty about the true effect size. An interval that crosses zero, meaning it includes both possible benefit and possible harm, indicates that the trial cannot rule out no difference between the intervention and control. The diagram above illustrates how to read a confidence interval alongside an effect estimate.
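Continuing the same hypothetical fatigue-score example at a small-trial scale, the sketch below computes a 95 percent confidence interval for the difference in means directly from the definitions; the data are simulated, not real.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(50.0, 15.0, 30)     # hypothetical small trial
treated = rng.normal(48.8, 15.0, 30)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / 30 + control.var(ddof=1) / 30)
t_crit = stats.t.ppf(0.975, df=58)       # 95% two-sided, df = n1 + n2 - 2
print(f"estimated difference: {diff:.1f} points")
print(f"95% CI: {diff - t_crit * se:.1f} to {diff + t_crit * se:.1f}")
# With 30 per group the interval spans several points either side
# of zero: the data are compatible with benefit, harm, or no effect.
```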

In a 2019 editorial in The American Statistician, Wasserstein, Schirm and Lazar challenged the scientific community to move beyond the binary of significant versus non-significant, recommending that researchers report effect sizes with confidence intervals and interpret them in context rather than treating a p-value threshold as a decision rule.

Whether the trial was pre-registered

Pre-registration is the practice of publicly recording a trial's design, primary outcome, and analysis plan before data collection begins. It is now a requirement for publication in most major medical journals and is a basic condition of methodological credibility.

Its importance lies in what it prevents. A trial that measures twenty outcomes and then reports only the three that reached significance is not providing twenty independent pieces of evidence for its conclusions. It is, in effect, asking twenty questions and then presenting the answers to only the ones it liked. This practice, sometimes called outcome switching or p-hacking when it happens at the analysis stage, produces results that look more compelling than they are. Pre-registration creates a public record against which the published results can be compared.
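The arithmetic behind this is stark: with twenty independent outcomes and no true effects at all, the chance that at least one crosses p < 0.05 is 1 - 0.95^20, roughly 64 percent. The simulation below, with invented group sizes and purely null outcomes, makes the same point empirically.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n_outcomes = 2_000, 20
hits = 0

for _ in range(n_sims):
    # Twenty outcomes, none genuinely affected, in a 50-vs-50 trial
    pvals = [stats.ttest_ind(rng.normal(size=50),
                             rng.normal(size=50)).pvalue
             for _ in range(n_outcomes)]
    if min(pvals) < 0.05:
        hits += 1

print(f"trials with at least one 'significant' outcome: {hits / n_sims:.0%}")
# Expect roughly 64% -- most null trials can report a 'positive'
# result if the analyst is free to pick among twenty outcomes.
```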

The evidence that this problem is widespread is substantial. A 2004 study by Chan and colleagues, comparing trial protocols submitted to ethics committees with the eventual published papers, found that 88 percent of trials failed to report at least one pre-specified outcome. A more recent audit of trials published in five major journals between 2015 and 2016 found that only 13 percent had consistent primary and secondary outcomes across their registry entry, protocol, and published paper.

In supplement research, pre-registration rates are lower than in pharmaceutical research, and the practice of registering a trial only after data has been collected, sometimes called retrospective registration, offers little of the protection that genuine pre-registration provides. When reading a trial, checking whether it was registered before recruitment began, and whether the published primary outcomes match those in the registration, is a meaningful quality check.

Whether the evidence applies to the person asking

A trial can be well-designed, properly randomised, adequately powered, and independently funded and still provide limited guidance for a specific individual if the population studied does not reflect that person's situation. This is the question of external validity, sometimes called generalisability or applicability.

In supplement research, population mismatch is common. Many trials are conducted in elderly adults with documented deficiency, in clinical populations with an established condition, or in specific demographic groups. The results of those trials may not apply to younger healthy adults with normal status who represent the primary market for the product. A compound that corrects a deficiency state does not necessarily produce benefit in people who are not deficient. A finding in a particular disease population does not generalise to healthy people. These are not failures of the individual trials; they are limitations on how far the evidence can travel.

Dose and formulation specificity compound this problem. A trial conducted at a specific dose, with a specific form of a compound, does not establish that a different dose or formulation will produce the same effect. In supplement marketing, evidence from clinical doses is frequently used to justify products with doses that bear little resemblance to what was tested. Evidence from one formulation is transferred to another without acknowledgement that bioavailability or efficacy may differ substantially.

When reading a trial, it is worth asking not only how well it was conducted but how closely the population, dose, and formulation match the situation in question.

Who funded the trial and who conducted it

Funding source does not determine a trial's conclusions, but it is a documented predictor of which direction those conclusions tend to fall in.

Trials funded by the manufacturer of the product being studied tend to produce more favourable results on average than independently funded trials on the same interventions. This effect has been demonstrated across pharmaceutical research, nutrition research, and supplement research specifically. It operates through multiple mechanisms that are difficult to disentangle: study design choices that favour positive outcomes, selective reporting of outcomes, choice of comparators, early termination when results look favourable, and publication practices that bury negative findings.

The relationship between industry funding and positive findings does not mean every industry-funded trial is biased or that independently funded trials are necessarily reliable. It means that funding source is relevant information when assessing the overall confidence to place in a body of evidence, and that a field dominated by manufacturer-funded trials requires particular scepticism. It is also worth noting that the researchers who conduct trials are sometimes the same individuals who are paid to consult for the companies whose products they are studying. Declared conflicts of interest are worth reading carefully rather than treating as a formality.

In supplement research, the proportion of trials with industry involvement is high, and the volume of independent replication is often low. A finding that rests primarily on industry-funded trials, particularly small ones, warrants considerably more caution than one that has been replicated by independent groups in different populations.

Putting it together: what a critical reading actually looks like

Reading a trial critically does not require statistical expertise. It requires asking a consistent set of questions.

What outcome was measured, and is it one that matters to the person who might take this supplement? Was randomisation done properly, and was allocation concealed? Were participants and researchers blinded, and was blinding maintained? Was the trial large enough to detect the effect it was looking for? Is the effect size clinically meaningful, and does the confidence interval exclude zero? Was the trial pre-registered before data collection, and do the published outcomes match the pre-specified ones? Do the study population, dose, and formulation actually reflect the situation of the person asking? Who funded the research, and were conflicts of interest declared?

No single answer to any of these questions is conclusive. A positive answer to all of them provides substantially more confidence than a positive answer to one or two. Most trials in supplement research will have weaknesses in several of these dimensions simultaneously, which is precisely why effect size inflation is so common in this area and why individual positive trials should rarely be taken at face value.

The distinction between what a trial shows and what it is claimed to show in marketing materials is almost always located somewhere in this list of questions. A compound that reduces a biomarker in a small, industry-funded trial with no pre-registration and no blinding check has not been shown to provide the clinical benefit the product label implies. Understanding why requires exactly this kind of reading.

The next article in this series examines the specific problem of surrogate markers in more depth: what they can and cannot tell us, and how the gap between biomarker change and clinical outcome is routinely exploited in supplement marketing.

Where evidence is limited or outcomes are uncertain, conclusions should be treated as provisional and subject to revision as the evidence base develops.

Key references

Schulz KF, Altman DG, Moher D for the CONSORT Group. (2010). CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ, 340, c332.

Higgins JPT et al. (2011). The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ, 343, d5928.

Wasserstein RL, Schirm AL, Lazar NA. (2019). Moving to a world beyond "p < 0.05". The American Statistician, 73(sup1), 1–19.

Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. (2004). Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA, 291(20), 2457–2465.

Lexchin J, Bero LA, Djulbegovic B, Clark O. (2003). Pharmaceutical industry sponsorship and research outcome and quality: systematic review. BMJ, 326(7400), 1167–1170.

For individual supplement evidence reviews, see the Evidence library on Evidentia Nutrition.
