Big Computing: Simpson paradox - My first post will always be a paradox

Tuesday, March 15, 2011

Simpson paradox - My first post will always be a paradox

Simpson's Paradox - When Big Data Sets Go Bad
It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn from it. Simpson's paradox, however, slams a hammer down on that rule, and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions drawn from the large data set are exactly the opposite of those drawn from the smaller sets. Unfortunately, the conclusions from the large set are also usually wrong.
To understand this effect we'll use a set of simulated data. Table 1 shows the average physics grades for students in an engineering program. This is a difficult class used for weeding out weaker students. Most of these students prepared for college by taking high school (HS) physics. The data show a ten-point advantage for those with HS physics. Table 2 shows the average physics grades for students in a liberal arts program. This class is designed as an elective course for the enrichment of students who would otherwise avoid physics. Few of these students prepared for the class by taking HS physics. However, those few who did have a ten-point grade advantage. In both classes, taking HS physics clearly produced an advantage.
We now combine the data sets. The combined results for students who took HS physics are shown in table 3. The average college physics grade has been determined by adding all the grade points (4475) and then dividing by the total number of students (55). Table 4 shows the corresponding results for the students without HS physics. Together, tables 3 and 4 indicate that students who took HS physics perform worse than those who didn't by 2.3 points. This is the opposite of the conclusion drawn from tables 1 and 2.
Obviously, combining the data sets gives a misleading picture but why? The answer lies in two parts. First, the data sets for the two major groups (engineering and liberal art students) were influenced by a lurking variable, course difficulty. The engineering students received a rigorous course. The liberal arts students a less demanding enrichment course. Second, the groups in the data sets were not the same size.  This caused the average of college physics grades to be weighted toward engineering student grades for those who had taken HS physics. Since the engineering students' course was more rigorous it lowered the average. The opposite was true for the combined results of those who didn't take HS physics. 
            HS Physics   None   Improvement
Ave Grade       80        70        10

Table 1. Average college physics grades for students in an engineering program.
            HS Physics   None   Improvement
Ave Grade       95        85        10

Table 2. Average college physics grades for students in a liberal arts program.

              # Students   Ave Grade   Grade Pts
Engineering       50          80         4000
Lib Arts           5          95          475
Combined          55          81.4       4475

Table 3. Average college physics grades for students who took high school physics.
              # Students   Ave Grade   Grade Pts
Engineering        5          70          350
Lib Arts          50          85         4250
Combined          55          83.6       4600

Table 4. Average college physics grades for students who didn't take high school physics.
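The pooled arithmetic above can be checked with a short script. The group sizes and averages come from tables 1 through 4; the variable and key names are mine:

```python
# Per-group average physics grades and group sizes (from tables 1-4).
groups = {
    ("engineering", "hs_physics"):    {"n": 50, "avg": 80},
    ("engineering", "no_hs_physics"): {"n": 5,  "avg": 70},
    ("lib_arts",    "hs_physics"):    {"n": 5,  "avg": 95},
    ("lib_arts",    "no_hs_physics"): {"n": 50, "avg": 85},
}

def pooled_average(prep):
    """Combine both programs for one prep level, weighting by group size."""
    rows = [g for (prog, p), g in groups.items() if p == prep]
    return sum(g["n"] * g["avg"] for g in rows) / sum(g["n"] for g in rows)

with_hs = pooled_average("hs_physics")        # 4475 / 55 = ~81.4
without_hs = pooled_average("no_hs_physics")  # 4600 / 55 = ~83.6

# Within each program HS physics helps by 10 points, but the pooled
# averages reverse the effect: the HS-physics group looks ~2.3 points worse.
print(with_hs, without_hs, with_hs - without_hs)
```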
There were four separate groups in the study as follows:
  1. Engineering students with HS physics
  2. Engineering students without HS physics
  3. Liberal arts students with HS physics
  4. Liberal arts students without HS physics
If all four groups had been the same size, the results would have indicated that students with HS physics had a 10-point advantage in their college physics grades regardless of the type of college physics course they took. Likewise, if an average had been calculated that was not weighted by group size, the results would also have indicated the same 10-point advantage.
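The unweighted alternative just described can be sketched in a few lines. Each program counts once, regardless of enrollment, so the group-size imbalance drops out (numbers are from tables 1 and 2; the names are mine):

```python
# Average grades by program and HS-physics preparation (tables 1 and 2).
engineering = {"hs": 80, "none": 70}
lib_arts    = {"hs": 95, "none": 85}

# Unweighted means: each program contributes equally, not each student.
unweighted_hs   = (engineering["hs"] + lib_arts["hs"]) / 2      # 87.5
unweighted_none = (engineering["none"] + lib_arts["none"]) / 2  # 77.5

print(unweighted_hs - unweighted_none)  # 10.0 -- the advantage survives
```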
Simpson's Paradox is caused by a combination of a lurking variable and data from unequal-sized groups being combined into a single data set. The unequal group sizes, in the presence of a lurking variable, can weight the results incorrectly. This can lead to seriously flawed conclusions. The obvious way to prevent it is not to combine data sets of different sizes from diverse sources.
Simpson's Paradox will generally not be a problem in a well-designed experiment or survey if possible lurking variables are identified ahead of time and properly controlled. This includes eliminating them, holding them constant for all groups, or making them part of the study.
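Making the lurking variable part of the study amounts to stratifying on it: report the effect within each program separately rather than over the pooled data. A minimal sketch of that idea, with the data layout being my own assumption:

```python
# Average grades by program (the stratum) and HS-physics preparation.
data = {
    "engineering": {"hs_physics": 80, "none": 70},
    "lib_arts":    {"hs_physics": 95, "none": 85},
}

# Compute the HS-physics advantage within each stratum separately,
# so course difficulty (the lurking variable) cannot distort the comparison.
advantages = {program: grades["hs_physics"] - grades["none"]
              for program, grades in data.items()}

for program, adv in advantages.items():
    print(f"{program}: HS physics advantage = {adv} points")
```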


  1. Interesting take on the subject - Eric Falkenstein has another take here:

    In the financial markets, finding the confounding variable is usually the tricky part - we know they're there, we just don't always know what they are!

  2. Also interestingly, you can increase your data size enough to find effects that don't exist.

  3. I found another nice article on this paradox. I will add the link here: