Sunday, March 05, 2006

Statistics 101: Uncertainties in Surveys

In my previous post, I showed the results of an MIT survey on Democrat and Republican average opinions of when the U.S. should use force. One thing bothers me when I see survey data like this: most of the time, survey results are shown without stating either the sample size that was used or the standard deviation associated with the data. In my mind, these are more important than the results that are shown, because the sample size and standard deviation (which is calculated from the sample size) tell us whether the results are significant or not.

Suppose someone publishes a survey before an election. Suppose all that is given is that 51% of Americans, on average, will vote for Bush, and 49% will vote for Kerry. Readers, the vast majority of whom likely know nothing about statistics, see this and conclude that Bush has the election won. But here is the part many polls do not present, or if they do, it is usually in the fine print that many people miss. What if the sample size used for the poll is 100 likely voters? Statistically, there is an uncertainty in the results that must exist when one is using a subset of the population to draw conclusions about the entire population. The range of likely results is determined by the standard deviation, which in a counting experiment such as a survey is estimated by the square root of the sample size, N. If N = 100, then the standard deviation is sqrt(100) = 10. The percent uncertainty is then 10/100 = 0.10, or 10%. All of a sudden the poll results look much less significant: 49% for Kerry versus 51% for Bush with a 10% uncertainty means one can draw no conclusion at all about who will win the election, because the difference is not statistically significant.
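The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sqrt(N)/N rule of thumb used in this post, not a full statistical treatment; the function name percent_uncertainty is made up for the example.

```python
import math

def percent_uncertainty(n):
    """Rule-of-thumb uncertainty sqrt(N)/N from the post, as a percentage."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 100.0 * math.sqrt(n) / n

# The hypothetical election poll from the post: 51% vs 49% with N = 100.
n = 100
u = percent_uncertainty(n)       # sqrt(100)/100 = 0.10, i.e. 10%
bush_low, bush_high = 51 - u, 51 + u
kerry_low, kerry_high = 49 - u, 49 + u

print(f"uncertainty: {u:.1f}%")
print(f"Bush:  {bush_low:.0f}% to {bush_high:.0f}%")
print(f"Kerry: {kerry_low:.0f}% to {kerry_high:.0f}%")
# The two ranges overlap, so this rule gives no basis for calling a winner.
print("ranges overlap:", kerry_high >= bush_low)
```

With N = 100 the two ranges overlap almost completely, which is the point of the example.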

For the MIT survey, I would want to know what the sample size is in order to get a feel for how much of a difference really exists in some of the categories the survey addresses. For example, the results for coming to the defense of an ally if the ally is attacked are 91% for Republicans and 75% for Democrats. But if the sample size was, say, 100, a 10% uncertainty suddenly means there is no significant difference between the two parties, since the ranges overlap. Now, if 1000 people were surveyed, sqrt(1000) = 31.6 and sqrt(1000)/1000 = 0.0316, or 3.16% for the percent uncertainty (or margin of error). The Democrats could be as high as about 78% and the Republicans as low as about 88%, so the fact that there is no overlap means that there is a statistically significant difference between the two parties. But we will not be able to determine this unless the sample size is reported.
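The same overlap check can be written as a short Python sketch. It assumes, as the paragraph above does, that the same sample size N applies to both parties' results, which a real survey split by party would not quite satisfy; the helper names are hypothetical.

```python
import math

def percent_uncertainty(n):
    """Rule-of-thumb uncertainty sqrt(N)/N from the post, as a percentage."""
    return 100.0 * math.sqrt(n) / n

def ranges_overlap(pct_a, pct_b, n):
    """True if pct_a +/- u and pct_b +/- u overlap, with u = sqrt(N)/N in percent.

    Assumes the same N for both groups (a simplification).
    """
    u = percent_uncertainty(n)
    return abs(pct_a - pct_b) <= 2 * u

# Defense-of-an-ally question: 91% of Republicans, 75% of Democrats.
for n in (100, 1000):
    u = percent_uncertainty(n)
    print(f"N = {n}: uncertainty {u:.2f}%, overlap: {ranges_overlap(91, 75, n)}")
```

For N = 100 the ranges overlap, and for N = 1000 they do not, matching the numbers worked out above.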

I am not a statistician, nor do I have experience with polling experiments, but this is a basic statistical method for determining the range of uncertainty in polling data. The point is to not take survey data too seriously unless it is possible to make statistically significant conclusions from the data.

2 comments:

Curtis Gale Weeks said...

Vonny,

The Boston Review pdf lists the sample size as 1170 people — 45% Democrat, 42% Republicans.

One thing to take into consideration, also, is that a "partisan divide" is some part fantasy, since even 61% (for instance) means that 39% in any party had a different opinion. Trends in thought/opinion might be discerned but fuzzy, but finding absolutes in the results would be inappropriate.

vonny said...

Thank you, Curtis. It is amusing when, too often, the spin doctors on both sides try to speak in terms of absolutes based on polling data. As you say, this is inappropriate. What can be disheartening is when a good portion of the public buy into the spin. Thank you for the data.