How Simpson's Paradox Can Reverse the Story in Data

Averages can feel reassuring because they turn messy information into one clean number. A school posts a graduation rate, a survey reports one percentage, a dashboard ranks two programs, and the result looks easy to compare. Simpson’s paradox is the warning label on that kind of simplicity. It happens when a pattern inside separate groups disappears or even reverses after the groups are combined, leaving readers with two answers that both seem mathematically correct but point in different directions.

The paradox is not a trick in the arithmetic. The totals really do say what they say. The problem is that a combined rate can quietly mix groups of very different sizes, difficulty levels, risks, or starting points. Once that happens, the larger groups can pull the average so strongly that the overall comparison hides what happened inside each subgroup. For students learning statistics, Simpson’s paradox is one of the clearest reminders that a data question is rarely settled by a single overall number.

The Basic Pattern Behind the Paradox

Simpson’s paradox is usually described as a reversal. One option looks better in every meaningful subgroup, but the other option looks better after all the data is pooled. A simple example can make the surprise easier to see.

Imagine two online review programs, Program A and Program B. Students are split into beginners and advanced learners because those groups start from different places. Among beginners, Program A helps 45 out of 100 students pass, a 45% pass rate. Program B helps 8 out of 20 beginners pass, a 40% pass rate. Among advanced learners, Program A helps 91 out of 100 students pass, a 91% pass rate. Program B helps 144 out of 160 advanced learners pass, a 90% pass rate.

Inside each group, Program A is slightly better. It beats Program B for beginners and for advanced learners. But now combine the totals. Program A helped 136 out of 200 students pass, or 68%. Program B helped 152 out of 180 pass, or about 84%. The overall comparison says Program B looks much better.
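The arithmetic above can be checked with a short Python sketch. The counts are the ones given in the example; the names `counts`, `rate`, and `pooled_rate` are illustrative choices, not part of any library.

```python
# (passed, enrolled) for each program and starting level, taken from the example.
counts = {
    "A": {"beginner": (45, 100), "advanced": (91, 100)},
    "B": {"beginner": (8, 20), "advanced": (144, 160)},
}

def rate(passed, enrolled):
    """Pass rate for a single group."""
    return passed / enrolled

def pooled_rate(groups):
    """Pass rate after adding up numerators and denominators across groups."""
    passed = sum(p for p, _ in groups.values())
    enrolled = sum(n for _, n in groups.values())
    return passed / enrolled

for program, groups in counts.items():
    for level, (passed, enrolled) in groups.items():
        print(f"Program {program}, {level}: {rate(passed, enrolled):.1%}")
    print(f"Program {program}, pooled: {pooled_rate(groups):.1%}")
```

Program A comes out ahead in both subgroup lines, while the pooled lines show 68.0% for Program A and 84.4% for Program B.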

Nothing has been calculated incorrectly. The reversal happens because the two programs served different mixes of students: half of Program A’s 200 students were beginners, while only 20 of Program B’s 180 were. The combined rate is not a simple fairness test. It is a weighted average, and the weights come from how many people are in each group.
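One way to see the weighting is to write each program's combined rate as a sum of subgroup rates multiplied by subgroup shares. The sketch below does that with the example's counts; `programs` and `weighted_rate` are hypothetical names used only for illustration.

```python
# (passed, enrolled) per starting level, from the Program A / Program B example.
programs = {
    "A": {"beginner": (45, 100), "advanced": (91, 100)},
    "B": {"beginner": (8, 20), "advanced": (144, 160)},
}

def weighted_rate(groups):
    """Combined rate written as sum(share of group * pass rate of group)."""
    total = sum(n for _, n in groups.values())
    return sum((n / total) * (passed / n) for passed, n in groups.values())

for name, groups in programs.items():
    total = sum(n for _, n in groups.values())
    shares = ", ".join(
        f"{level} share {n / total:.1%}" for level, (_, n) in groups.items()
    )
    print(f"Program {name}: {shares}, weighted rate {weighted_rate(groups):.1%}")
```

Program B's beginners make up only about 11% of its students, so its 40% beginner rate barely moves the total. A plain average of Program B's two rates would be 65%, which is not what pooling produces.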

Why Combining Groups Can Change the Answer

The key is that rates have denominators. A percentage is not just a score floating by itself. It is a fraction: successes divided by attempts, admitted students divided by applicants, recovered patients divided by treated patients, or correct answers divided by total questions. When groups are combined, their numerators and denominators combine too.

That means a small subgroup and a large subgroup do not have equal pull. If one program has most of its students in a harder group and another has most of its students in an easier group, the total can become more about group mix than program quality. The same logic can affect comparisons by school, department, hospital, neighborhood, age group, income level, prior preparation, or time period.

One way to spot the issue is to ask what the groups represent. If the groups differ in a way that strongly affects the outcome, combining them too quickly can blur the comparison. A tutoring program serving mostly students who are far behind may have a lower overall pass rate than a program serving mostly students who are nearly ready, even if it does better within each readiness level. The combined rate answers a different question: not simply which program worked better for similar students, but what happened across the particular mix of students each program served.
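A common way to separate group mix from program quality is direct standardization: apply the same reference mix of students to each program's subgroup rates. The article does not prescribe this method; the sketch below is one possible approach, using the Program A and Program B counts from the earlier example and assuming, purely for illustration, that the reference mix is the two programs' combined enrollment.

```python
# (passed, enrolled) per starting level, from the Program A / Program B example.
counts = {
    "A": {"beginner": (45, 100), "advanced": (91, 100)},
    "B": {"beginner": (8, 20), "advanced": (144, 160)},
}

# Assumed reference mix: combined enrollment across both programs.
reference = {}
for groups in counts.values():
    for level, (_, enrolled) in groups.items():
        reference[level] = reference.get(level, 0) + enrolled
reference_total = sum(reference.values())

def standardized_rate(groups):
    """Pass rate the program would show if it had served the reference mix."""
    return sum(
        (reference[level] / reference_total) * (passed / enrolled)
        for level, (passed, enrolled) in groups.items()
    )

for program, groups in counts.items():
    print(f"Program {program}, standardized: {standardized_rate(groups):.1%}")
```

With both programs held to the same mix, Program A comes out ahead, which matches the subgroup comparisons. A different reference mix would give different standardized numbers, so the choice of reference should be stated whenever this kind of adjustment is reported.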

This is why Simpson’s paradox often appears in arguments about fairness, performance, and policy. The overall number is attractive because it is simple. The grouped numbers are often more informative because they reveal the structure underneath the average.

A Famous Real-World Case

One of the best-known examples comes from graduate admissions data at the University of California, Berkeley in 1973. The overall numbers appeared to show that men were admitted at a higher rate than women. That raised a serious question about possible bias. But when researchers P. J. Bickel, E. A. Hammel, and J. W. O’Connell examined admissions by department in a 1975 Science paper, the picture changed.

The department-level data showed that applicants were not spread evenly across departments. Women had applied in larger numbers to departments with lower admission rates, while men had applied more often to departments with higher admission rates. The overall gap was therefore shaped heavily by department choice and department selectivity, not just by admission decisions within each department. In several departments, the within-department comparison did not match the simple overall story.

That does not mean every apparent reversal proves that nothing unfair happened. It means the right comparison depends on the question. If the question is whether similar applicants were treated differently inside departments, department-level data matters. If the question is why the overall admitted class looked unequal, application patterns and department selectivity also matter. Simpson’s paradox does not make judgment unnecessary; it makes careful judgment more necessary.

The Role of Confounding Variables

A confounding variable is a third factor that is related both to the groups being compared and to the outcome. In the review-program example, student starting level is a confounder because it affects pass rates and is unevenly distributed between programs. In the Berkeley example, department choice and department selectivity were central to understanding the overall admissions pattern.

Confounding variables are powerful because they can hide in plain sight. A chart may compare two groups while leaving out the reason those groups are not truly comparable. If one hospital treats many more severe cases, its overall recovery rate may look worse even when its outcomes are strong within each severity level. If one school enrolls more students who are learning English, a single test average may miss the progress those students make compared with similar peers. If one product is used mostly by beginners and another mostly by experienced users, a satisfaction score may reflect the user mix as much as the product itself.

The difficult part is that there is no automatic rule saying the grouped result is always better than the combined result. Sometimes the combined number is exactly what matters. Sometimes the subgroup numbers answer the fairer question. Often readers need both. The point is not to distrust every average, but to ask what has been averaged together.

Edward H. Simpson’s 1951 paper on contingency tables helped give the paradox its modern name, though related ideas had appeared earlier in the work of statisticians Karl Pearson and Udny Yule. The topic has lasted because it captures a common human mistake: assuming that one clean summary must be the whole story.

How to Read Data More Carefully

Simpson’s paradox gives readers a practical habit: pause before treating an overall percentage as the final answer. If the outcome could depend on starting level, age, location, department, income, health status, course difficulty, or any other major difference, subgroup data may be essential.

Check the denominator. Ask how many people, cases, or observations are behind each percentage.
Look for uneven group sizes. A large subgroup can dominate the combined result.
Ask whether the groups were comparable. If one group started with easier cases, the overall rate may be misleading.
Separate description from explanation. A total can describe what happened without explaining why it happened.
Compare like with like when possible. Subgroup comparisons often show whether a pattern holds under similar conditions.
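The checks above can be sketched as a small helper that compares two options within each subgroup and again after pooling. This is a minimal illustration, not a standard library function: `has_reversal` and `pooled` are hypothetical names, and the sketch assumes both options report the same subgroups with nonzero denominators.

```python
def pooled(groups):
    """Combined rate: total successes divided by total cases."""
    return sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())

def has_reversal(a, b):
    """True if one option wins every subgroup but loses the pooled comparison.

    a and b map subgroup name -> (successes, cases).
    """
    a_wins_all = all(a[g][0] / a[g][1] > b[g][0] / b[g][1] for g in a)
    b_wins_all = all(a[g][0] / a[g][1] < b[g][0] / b[g][1] for g in a)
    if a_wins_all:
        return pooled(a) < pooled(b)
    if b_wins_all:
        return pooled(a) > pooled(b)
    return False  # no consistent subgroup winner, so nothing to reverse

# Counts from the Program A / Program B example earlier in the article.
program_a = {"beginner": (45, 100), "advanced": (91, 100)}
program_b = {"beginner": (8, 20), "advanced": (144, 160)}

for name, groups in [("A", program_a), ("B", program_b)]:
    sizes = {level: n for level, (_, n) in groups.items()}
    print(f"Program {name} group sizes: {sizes}")
print("Reversal detected:", has_reversal(program_a, program_b))
```

A result of `True` only flags that the pooled and subgroup comparisons disagree. Deciding which comparison answers the question at hand still takes judgment about what the groups represent.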

These habits matter far beyond math class. News reports, school dashboards, research summaries, sports statistics, public health updates, and business charts all use averages and rates. A reader who understands Simpson’s paradox is less likely to be fooled by a dramatic total and more likely to ask what the data is grouping together.

What the Paradox Teaches About Evidence

Simpson’s paradox is memorable because it feels almost impossible at first. How can one option be better in every subgroup but worse overall? The answer is that percentages carry the shape of the data behind them. Change the mixture of groups, and the combined rate can change its meaning.

The deeper lesson is about evidence. Data does not interpret itself. A table can be accurate and still incomplete. A graph can be clear and still leave out an important grouping factor. A percentage can be useful and still answer a narrower question than readers assume.

Good statistical thinking does not reject summaries. It uses them carefully. Overall numbers help reveal broad patterns, while subgroup numbers show whether those patterns hold under different conditions. When the two disagree, the disagreement is not a nuisance to ignore. It is a clue that the structure of the data matters.

That is why Simpson’s paradox remains one of the most useful ideas in statistics. It turns a strange reversal into a practical reading skill: before believing the big average, look for the groups inside it.

Akshay Dinesh
