Suppose you’re interested in answering a simple question: how effective is aspirin at relieving headaches? If you want to have conviction in the answer, you’ll need to think surprisingly carefully about how you approach this question. Your first idea might simply be to take aspirin the next time you get a headache, and see if it goes away. But upon reflection, that won’t be enough.
First of all, since all headaches go away eventually, whether it goes away isn’t the relevant question. It would be better to ask how quickly the headache goes away. But even this question may not pin down what you care about, because even if aspirin doesn’t relieve your headache completely, a significant reduction is still worthwhile.
To deal with this issue, you might decide that when you next have a headache you’ll make a record of how you feel every half hour for two hours. You’ll write things like “dull, throbbing pain of low intensity” or “sharp, searing pain over one eye”.
Unfortunately, just examining how you feel after taking aspirin a single time won’t be adequate, since the aspirin may be more helpful some times and less helpful others. For example, it could be the case that it works on moderate headaches but not on severe ones, so if your next headache happened to be really severe, it would look like aspirin was useless. To solve this problem, and give yourself more data, you might resolve to make these records of how you’re feeling for each of the next 20 headaches you get.
There is still a problem though. These subjective descriptions of headaches are difficult to compare to each other. If you take aspirin and your headache goes from a sharp pain over one eye to an intense ache over the entire head, have you made things better or worse? It would be difficult to aggregate the information from these varied descriptions over 20 different headaches to make a final assessment of how well aspirin is working.
Your analysis would be a lot simpler if you scored how unpleasant each headache was on a simple scale from 1 to 5 (1 meaning slight unpleasantness, 3 moderate unpleasantness, and 5 extreme unpleasantness). That way, you can simply look at all the scores you recorded just before taking aspirin and average them together. You can then compare this to the average of the scores 30 minutes after taking the aspirin and 60 minutes after taking it, and see whether the amount of unpleasantness you feel really does drop substantially after taking aspirin.
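To make the averaging concrete, here is a minimal sketch. The pain scores below are invented for illustration; each record holds the score just before taking aspirin, then 30 and 60 minutes after.

```python
# Made-up headache records: (score before, 30 min after, 60 min after),
# each on the 1-to-5 unpleasantness scale.
headaches = [
    (4, 3, 2),
    (3, 2, 1),
    (5, 4, 3),
    (2, 1, 1),
]

def average(scores):
    return sum(scores) / len(scores)

before = average([h[0] for h in headaches])
after_30 = average([h[1] for h in headaches])
after_60 = average([h[2] for h in headaches])

# A falling sequence of averages suggests the pain drops after the pill.
print(before, after_30, after_60)  # 3.5 2.5 1.75
```

Reducing each headache to a single number loses nuance, but it is what makes the scores from many different headaches comparable and easy to combine.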
Recall that our goal here is to determine how effective aspirin is at relieving headaches, but all you’ve done so far is measure how good it is at relieving your own headaches. Perhaps you are more or less sensitive to aspirin than other people, or perhaps your headaches are more severe and harder to treat than most other people’s. To solve this problem, you enlist 40 people who are frequent headache sufferers. You get them to agree that, over the next 6 months, any time they begin to notice that they have a headache they will record how they feel on your 1 to 5 scale. They will then take aspirin and record how they feel 30 minutes later.
But what if people take different doses? You might think that the aspirin isn’t working for some of them, but it’s only because they haven’t taken enough. To fix this problem, you hand them each an identical bottle of pills and tell them to take two whenever they get a headache (the maximum recommended dose). This also has the added benefit that everyone will be taking the same exact brand. That way if you find out that the aspirin really does work, other people can try to replicate your results by using the same brand that you did. To make your experiment even better, you also provide everyone with a timer, to help them be more accurate about recording their pain 30 minutes later.
There is still a problem though. You know that headaches often become less severe within an hour or so even when you don’t take aspirin. That means that even if someone’s pain score tends to have fallen 30 minutes after taking the aspirin, you don’t really know whether it is the aspirin that caused the reduction in pain or if the reduction would have occurred regardless. To remedy this, you come up with the idea of having only half of the 40 participants take aspirin when they have a headache, though everyone will still keep a record of their headache’s progression. Then, to see how well the aspirin worked, you can compare the average pain scores of the 20 people who take the aspirin with the scores of the 20 people who don’t. If the aspirin group’s pain fell a lot more than the non-aspirin group, then the aspirin probably was the cause.
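The group comparison above can be sketched in a few lines. The pain-score drops below (score before minus score 30 minutes later) are invented for illustration, not data from any real study.

```python
# Hypothetical drop in pain score (before minus 30 minutes later) for
# each of the 20 people in each group. These numbers are made up.
aspirin_drops = [2, 1, 2, 3, 1, 2, 2, 1, 3, 2, 2, 1, 2, 2, 1, 3, 2, 2, 1, 2]
control_drops = [1, 0, 1, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2, 1, 1]

mean_aspirin = sum(aspirin_drops) / len(aspirin_drops)
mean_control = sum(control_drops) / len(control_drops)

# If the aspirin group's average drop is much larger than the control
# group's, aspirin is the plausible cause of the difference.
print(mean_aspirin, mean_control)
```

The key point is that the control group tells you how much pain would have fallen anyway, so only the difference between the two averages is credited to the aspirin.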
But is it possible that the pain levels people record could be influenced by the act of taking a pill, independent of the chemical effect of the active ingredients? Perhaps since people expect aspirin to work, the people in the group taking the pills are more aware of signs of improvement. Or perhaps people’s expectations of improving can even influence how much pain they experience. If either of these is true, the aspirin would seem to work better than it really does, because some of the reported improvement would be coming from expectations people have about the chemical, rather than just the chemical itself.
Fortunately, these problems are easily remedied. Instead of giving half the group nothing, you instead give them pills in an aspirin bottle that look just like aspirin, but which have no effect on headaches. Sugar pills are a reasonable choice, because a small amount of sugar is unlikely to affect a person’s headaches.
This raises ethical considerations, however. People might agree to take aspirin, but that doesn’t mean they are willing to take sugar pills. So before you start your study, you’ll need to inform everyone that there is a 50% chance they won’t be getting medication. You can’t let them know which kind of pill they got until after the analysis is over. You’ll also want to get them to agree not to take any other headache medication during the experiment, and to record any medication that they do happen to take.
So half of your group will be taking aspirin, and the other half will get sugar pills. But who should get which? If, for example, the 20 people getting aspirin have headaches that typically last much longer than those of the 20 people that aren’t getting anything, it could make aspirin seem less effective than it really is. So, if possible, you don’t want there to be any substantial differences between the two groups of people. A simple way to make sure this sort of thing is unlikely is to use a computer to assign individuals to the two groups at random.
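A random assignment like this takes only a few lines. The participant names below are placeholders, and the fixed seed is just an assumption to make the example reproducible; in a real study the assignment would be generated once and recorded secretly.

```python
import random

# Placeholder names standing in for your 40 recruits.
participants = [f"person_{i}" for i in range(1, 41)]

rng = random.Random(0)  # fixed seed so this sketch is reproducible
shuffled = participants[:]
rng.shuffle(shuffled)   # random order, so neither group is systematically different

aspirin_group = shuffled[:20]
placebo_group = shuffled[20:]
```

Because the split is random, any pre-existing differences between people (headache severity, duration, sensitivity to medication) are unlikely to pile up in one group.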
It is even better if someone else is in charge of the randomization (secretly recording which person is assigned to which type of pill). That way, when you talk to the subjects about the experiment, there is no chance that you accidentally tip them off (with body language, or subtle cues) to which type of pill they are getting. Furthermore, when you analyze the final results, you won’t have any temptation to make the data come out a particular way (since you won’t know until you are done which subjects were taking the aspirin and which were taking the sugar pill).
Unfortunately, even with all these precautions, if you carried out this experiment on multiple occasions, you’d get somewhat different results every time. The people you would be able to recruit would probably be different, and might respond differently to the medicine. What’s more, even if you used the same people each time, the intensity of their headaches might vary from one 6 month period to the next, which could also influence how the results turn out.
But, if the results fluctuate randomly, that implies that sometimes, just by luck alone, the aspirin might seem to be effective even if it isn’t. Likewise, it might, through bad luck, seem to be ineffective, even though it does work. So whatever your experiment shows, how can you be sure you’re getting the right answer?
Unfortunately, since chance is involved, certainty is not possible. But a statistician can easily calculate for you the probability that you would get results (in favor of aspirin working) at least as strong as the ones you got, if in fact aspirin were no more effective than the sugar pill. If this probability is large, then your experiment does not provide sufficient evidence to conclude that aspirin helps headaches. If this probability is small (say, less than 5%), then results as strong as yours would rarely arise by chance alone, which is good evidence that aspirin really is effective. To increase the likelihood that your test yields a conclusive result, you need only increase the number of participants.
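One simple way to approximate the statistician's calculation is a permutation test: repeatedly reshuffle the pooled results into two arbitrary groups and see how often the group difference is at least as large as the one you observed. The sketch below uses small, made-up groups for brevity; it is one way to estimate this probability, not the only one.

```python
import random

# Invented pain-score drops for two small groups (illustration only).
aspirin_drops = [2, 1, 2, 3, 1, 2, 2, 1, 3, 2]
placebo_drops = [1, 0, 1, 1, 0, 1, 2, 0, 1, 1]

n = len(aspirin_drops)
observed = sum(aspirin_drops) / n - sum(placebo_drops) / n

# If aspirin were no better than sugar pills, the group labels would be
# interchangeable. Shuffle the labels many times and count how often a
# difference at least as large as the observed one appears by chance.
rng = random.Random(0)
pooled = aspirin_drops + placebo_drops
trials = 10_000
count = 0
for _ in range(trials):
    rng.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if diff >= observed:
        count += 1

p_value = count / trials
print(p_value)
```

A small estimated probability means a difference this large almost never shows up when the labels are meaningless, which is exactly the kind of evidence you were after.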
We see that in order to answer questions with a high degree of certainty, a well thought out methodology is necessary. Most elements of good study design become obvious when we reflect logically on the ways that data may mislead us. When possible, experiments should be double-blind (neither the participants nor the researchers know who is receiving which treatment) with a control (either a placebo, or another treatment whose effectiveness is already known). Studies should have large sample sizes (20 participants in each treatment group, at the bare minimum), should use standardized dosages, should apply a standardized (and predetermined) method for measuring results, and should employ careful statistical analysis. Without these things in place, studies are hard to trust, for reasons that become apparent as we think our experiment through. Even when we’re just answering a simple question about aspirin.