The Lament of the One-Sample T-Test

Below are the contents of my first blog entry over at PrintedScholar, a very interesting, new, multidisciplinary site. The entry is a bit long and rather specific to some issues I’ve been having with data analysis and interpretation in response to reviewers of my work. It feels really good to get it off my chest!

It has recently come to my attention that many of my colleagues in Psychology (and Biology and Neuroscience) are rather attached to a few, sometimes limiting, types of inferential statistical tests. Okay, that’s not entirely accurate: I’ve known about the bias for a while, but now that I (try to) publish independently of my (terrific) mentors, I suddenly find myself the target of intense criticism with even the slightest modification to the “preferred” statistical plan. The bias I refer to has to do with an over-reliance on parametric tests that, most often, are used to compare data from different groups (between subjects) or compare data from the same group but sampled at different times (within subjects). My issues are with the former and less with the latter.

An excellent example of a parametric test that psychologists love is the analysis of variance or ANOVA. I totally get the adoration—I love this test too! Used correctly, it is elegant and awesome and useful and powerful and reliable. Sadly it is often used incorrectly and sometimes there are just better options that suit the data and the hypotheses more appropriately. Of late, I find myself confronted with those kinds of data and hypotheses. And I have yet to present my statistical approach, on which I will elaborate momentarily, without being told, in no uncertain terms, that it is wrong, weird, and must be changed (I’m paraphrasing).

I am a behavioral neuroscientist and I am on the faculty of a Psychology department. All my training and all my mentors were also behavioral neuroscientists reared in the tradition of psychology. However, now that neuroscience has emerged as a discipline in and of itself, the target readers of my papers are quite varied in background and expertise. Perhaps this is part of the problem I am facing? In any case, I work with rats and due to a ‘language’ barrier I have with them I am forced to make inferences about the content of their knowledge based on the structure of my behavioral assays and, from there, my statistical tests. Accordingly, a bunch of the behavioral assays that I rely on in my research rest on the assumption that if a rat is doing something “more” because of some experience I have given it, then it was affected by that experience in a meaningful, and thus measurable way.

I can gather evidence of the existence of the “more doing of something” by comparing the rat’s behavior to that of other rats that have not had the experience. That would be a between-group comparison and, while it can be useful, it is probably not the best place to focus in the examples I give below. Much more important, in many cases, is the comparison of each rat’s behavior to what I would expect to see in the test, hypothetically, if the rat were operating in a predominantly “random” way. “Doing more of something” would mean, in essence, that my rat was not being random. Statistically, then, I take great care in ensuring that the behavior of my rat is actually lining up with my expectations for the task. This is particularly important in my control rats because they are my gold standard. They should respond in the non-random way that I am aiming for. I am asking: “Dear rat, did you have experience X?” And the rat is saying back to me: “Why yes, experimenter, I did.” Excellent! Equipped with this knowledge from my control rats, I can now move on to the rats that make up my experimental condition(s). These are the rats I did something to, and I am predicting that because of what I did they might have a different answer to my question. And so, I ask: “Dear rat, did you have experience X?” And, if my experiment panned out as I expected, it may say back to me: “Why no, experimenter, I did not.” Now that is useful information. How then do I deal with issues of degree? Let me give you some examples first and then we can discuss degree.

I have 3 tasks that I use regularly in my lab that are relevant here. One is a test of spatial memory in a water maze, another is a test of object recognition, and the third is a test of conditioned place preference. All 3 have the question and answer situation I described above in common.

Let’s take a look at our water maze task. If I put a rat in a large pool of water and hide an escape platform under the water, that rat will learn about the spatial configuration of the room that the pool is in and will be able to find the hidden platform quicker and quicker with each trial, even if it is placed in the pool from different start points on every trial. Thus, it is using spatial learning and memory to navigate as quickly and efficiently as possible (usually) to the hidden platform to escape from the water, which is a little cool and not that pleasant for the rat. Now rats are very clever animals and even if they have been subjected to an experimental manipulation that renders them unable to use spatial memory (say a lesion to their hippocampus) they will still potentially improve on this test as indexed by how quickly they can locate the hidden platform. What, then, is an experimenter to do to be sure that a rat actually knows where the platform is and isn’t relying on some non-spatial strategy to find it? Well, thanks to Richard Morris, we have ways of making them “talk”. We can give the rats a trial in the pool without the platform. If the rat knows where the platform is, in space, it will preferentially search that area of the pool: its behavior is non-random. Rats without hippocampuses are mostly random. Sure, they may try to swim a certain distance from the edge or make loops, but they do not show a bias for a place in the pool: their search pattern is random.

How to glean random from non-random? Well, the pool is a circle. We can slice it from top to bottom and side to side to form 4 equally-sized quadrants. If the platform is always located in one of those quadrants then non-random behavior would be disproportionate amounts of time spent swimming there. Random would be spending approximately equivalent times in all 4 quadrants, or 25% in each. Based on this, I could say to the rat: “Do you know where the platform is usually located?” And the rat could say: “Why yes, I do.” The answer actually looks more like the rat spending 40% or 50% or perhaps even more of its time in the appropriate quadrant: whichever, it will be more than you would expect based on random searching, or >25%. To know what a meaningful amount more than 25% is, I need a statistical test that is designed to assess the rat’s “response” to my question. The one-sample t-test is just that kind of test. With this nifty little treasure I can take the time spent in that quadrant for each group of rats and compare it to the value that would be generated by a random-searching rat and voilà! I have statistical confirmation that my rat knows where the platform is in the pool of water. I would obviously expect my control rats, to which I’ve done nothing other than love and care for them, to offer me a resounding yes. Other groups that may have been subjected to an experience designed to intrude on their learning or memory would then answer no.
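For readers who want to see the arithmetic, here is a minimal sketch of that comparison using only the Python standard library. The percentages are made up for illustration (8 hypothetical rats per group), the function name `one_sample_t` is my own, and the critical value is the standard two-tailed table value for 7 degrees of freedom at alpha = .05. It is one way to set up the test, not a record of any real analysis.

```python
import math
import statistics


def one_sample_t(values, chance):
    """Return (t, df) for a one-sample t-test of the mean of `values` against `chance`."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample SD (n - 1 in the denominator)
    se = sd / math.sqrt(n)             # standard error of the mean
    return (mean - chance) / se, n - 1


# Made-up percent time in the target quadrant on the no-platform trial
control = [42.0, 51.0, 38.0, 47.0, 55.0, 40.0, 49.0, 44.0]
lesion = [27.0, 22.0, 30.0, 24.0, 26.0, 21.0, 29.0, 23.0]

CHANCE = 25.0        # 4 equal quadrants, so random search gives 25% in each
T_CRIT_DF7 = 2.365   # two-tailed critical t, df = 7, alpha = .05 (table value)

for name, group in [("control", control), ("lesion", lesion)]:
    t, df = one_sample_t(group, CHANCE)
    verdict = "above chance" if t > T_CRIT_DF7 else "not distinguishable from chance"
    print(f"{name}: mean = {statistics.mean(group):.1f}%, t({df}) = {t:.2f} -> {verdict}")
```

With these invented numbers the control group answers “yes” and the lesion group answers “no”, which is the whole point of the test.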

The same principle applies to my two other examples. In the object recognition test the rat studies an object and then later is presented with the studied object and a new one. Rats like novel things and like to investigate them. This tendency is quite reliable and not unlike the behavior of humans, particularly human infants, for whom there is a similar task designed this way because, as with rats, there is a communication barrier. The inference here is that rats that spend more time exploring the novel object do so because it is new to them, and thus the other object gets less of their attention because it is old news: they remember it! Once again percentages are at work. Comparing the time spent exploring the novel object amongst different groups of rats is not going to be very useful in this case unless we also take into consideration their attention to the studied object. We therefore calculate a percentage or ratio that reflects the preference or bias of the rat toward the novel object. Here again is our question: “Dear rat, do you recognize this object?” “Why yes, I do.” The answer, really: the rat spends more time exploring the novel object than the familiar one. Differences in rates of exploration among different rats (they are individuals too, after all) can be held in check by comparing novel time to total exploration time. A rat might thus spend, say, 30 seconds exploring the novel object and 10 seconds with the old object, yielding a value of 75% with the novel. If I exercise my assumption about the rat’s memory by setting the value of 50% as characteristic of a rat that has little memory for the old object, and thus checks out both objects about the same, I can once again use my nifty little one-sample t-test and compare each percentage yielded with that set point.
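The percentage itself is simple enough to sketch in a few lines of Python. The function name `novelty_preference` and the second rat’s times are hypothetical, chosen only to show how dividing by total exploration time puts a busy rat and a lazy rat on the same scale.

```python
def novelty_preference(novel_s, familiar_s):
    """Percent of total exploration time (in seconds) spent on the novel object."""
    total = novel_s + familiar_s
    if total <= 0:
        raise ValueError("rat did not explore either object")
    return 100.0 * novel_s / total


# The example from the text: 30 s with the novel object, 10 s with the old one
busy_rat = novelty_preference(30, 10)
# A hypothetical rat that explores half as much overall, with the same bias
lazy_rat = novelty_preference(15, 5)
# A hypothetical rat that treats both objects alike
no_memory_rat = novelty_preference(20, 20)

print(busy_rat, lazy_rat, no_memory_rat)
```

The first two rats land on the same value despite very different totals, and the third sits at the 50% set point that the one-sample t-test uses as “chance”.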

Same is true for conditioned place preference. Present the rat with an awesome experience while it is confined to one side of a large box. Also give it a ho-hum experience while confined to the other side of the box. What you’ll find, not surprisingly, is that when given free access to the entire box the rat will spend more of its time on the side that was previously paired with the awesome experience. “Dear rat, was that awesome?” “Yes it was!” And so of the total time with free access the rat spent more than 50% on the awesome side. “Dear rat, was that awesome?” “Huh? What? Something awesome happened?” Or maybe: “The sides of the box are different?” Either way, that rat will split its time about 50-50.

Okay, I’ve belabored my premise long enough, yes? What about the stats bias, ANOVAs, the intense desire from every reviewer and critic I meet that I do a between-subjects comparison on these data, and the oh so important issue of degree?? Let’s start with degree. The fundamental question, and problem, is whether degrees of rats’ behavior (because let’s face it, they are not going to actually divide their time evenly) are meaningful or not. Now for me, that is precisely why I prefer to rely on my beloved one-sample t-test. It is the statistical tool I need to tell me that indeed this is statistically more than the 25% or the 50% from my examples above. Okay, now let’s add in between-subjects comparisons. I am continuously slammed for not including them. Yes, they could be useful. A group of rats that on average spends 52% of their time on one side of the box is probably going to be statistically different from a group of rats that spends 75% of their time there. And lucky you for having such strong and tight results! Congratulations, take your medal, and don’t let the door hit you on the way out. In my lab, and this does depend on what we are doing of course, it usually looks more like 52% and 65%, or 60%. That might be statistically different. More often than not, it isn’t. Here’s the thing though: I actually don’t care! My reviewers care enormously. Me, I do not. Why? Because we haven’t gotten the very important answer from each group: do you remember? Yep. Nope. That’s what I want to know. What if one group is at 60% and the other group is at 75%? Now what if I also tell you that I used a one-sample t-test and have evidence that both groups are statistically more than ‘chance’: both groups are behaving non-randomly. And then I did a between-group comparison and one group has, statistically, a higher percentage than the other. Does that group remember more? Better? What does that difference mean? Does a rat that spends 80% of its swim time searching where the platform once was remember the location better than a rat that spends 40% of its time there? Or is there something else different about these rats that led to that outcome? To me, that’s important. And sometimes, it’s distracting.
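To make the 60% versus 75% scenario concrete, here is a rough sketch in standard-library Python. The data are invented, the helper names are mine, and for the between-group test I have assumed a plain pooled-variance (Student’s) t-test; a reviewer might well ask for an ANOVA or a Welch test instead. The critical values are the usual two-tailed table values at alpha = .05.

```python
import math
import statistics


def one_sample_t(values, chance):
    """(t, df) for the mean of `values` against a fixed `chance` value."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - chance) / se, len(values) - 1


def two_sample_t(a, b):
    """(t, df) for a pooled-variance independent-samples t-test of a versus b."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


# Made-up percent time on the paired side of the box, 8 hypothetical rats per group
group_a = [58.0, 63.0, 55.0, 66.0, 61.0, 57.0, 64.0, 56.0]   # mean 60%
group_b = [72.0, 78.0, 70.0, 81.0, 76.0, 73.0, 79.0, 71.0]   # mean 75%

CHANCE = 50.0
T_CRIT_DF7 = 2.365    # two-tailed, df = 7
T_CRIT_DF14 = 2.145   # two-tailed, df = 14

# Question 1, asked of each group separately: "do you remember?"
for name, group in [("A", group_a), ("B", group_b)]:
    t, df = one_sample_t(group, CHANCE)
    print(f"group {name} vs chance: t({df}) = {t:.2f}, above chance: {t > T_CRIT_DF7}")

# Question 2, a different question: "do the groups differ from each other?"
t, df = two_sample_t(group_b, group_a)
print(f"B vs A: t({df}) = {t:.2f}, groups differ: {abs(t) > T_CRIT_DF14}")
```

Here both groups say “yes, I remember”, and the groups also differ from each other. The code can report that second result, but it cannot say what the difference means.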

Here’s another scenario. On the test of object recognition one group of rats spends just 20% of its exploration time with the novel object and a second group spends 50% with the novel object. If I relied strictly on the between subjects comparison there would be a big problem here. Sure they are statistically different, but what if one of those groups is my control group? They are not behaving as I expect and thus something is up with my test. How can I make any sense of the other data now?

As a final mention, I will say that one sort of ridiculous way around this, it turns out, is to just do a within-group comparison. I’ve discovered that I can do a paired t-test that addresses my question by comparing the same group of rats’ time with the old object against their time with the new object. I can also do a mixed factorial ANOVA, combining a between-subjects comparison (control versus the experimental group or groups) with a within-subjects comparison (time in the target quadrant versus times in the other quadrants; times with novel and sample objects; times in both sides of the box). At least with the paired t-test the basic premise of the behavioral tests is assessed in the control group, but why there is such a hullabaloo over it versus the one-sample t-test (the tests are unlikely to arrive at different conclusions) is a mystery to me. And the within-group approach, for the record, starts to be very challenging to show visually in a figure. The mixed ANOVA is still plagued with how to interpret the between-subjects results.
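Here is a small sketch of why the two tests tend to agree. A paired t-test on novel versus old time is, arithmetically, a one-sample t-test on each rat’s difference score against zero. The percentage version tests a slightly different quantity (the ratio rather than the raw difference), so the two t values will not match exactly. The exploration times below are invented and the helper name is my own.

```python
import math
import statistics


def one_sample_t(values, reference):
    """(t, df) for the mean of `values` against a fixed `reference` value."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - reference) / se, len(values) - 1


# Made-up exploration times in seconds for 8 hypothetical control rats
novel = [30.0, 24.0, 36.0, 28.0, 40.0, 22.0, 33.0, 27.0]
old = [10.0, 14.0, 18.0, 12.0, 20.0, 15.0, 13.0, 17.0]

# Paired t-test: one-sample t on the within-rat differences, tested against 0
differences = [n - o for n, o in zip(novel, old)]
t_paired, df = one_sample_t(differences, 0.0)

# One-sample t-test on percent time with the novel object, tested against 50%
percents = [100.0 * n / (n + o) for n, o in zip(novel, old)]
t_percent, _ = one_sample_t(percents, 50.0)

T_CRIT_DF7 = 2.365  # two-tailed critical t, df = 7, alpha = .05 (table value)
print(f"paired:  t({df}) = {t_paired:.2f}, significant: {t_paired > T_CRIT_DF7}")
print(f"percent: t({df}) = {t_percent:.2f}, significant: {t_percent > T_CRIT_DF7}")
```

With these numbers both routes reach the same conclusion, which is why the fuss over one versus the other puzzles me.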

And so concludes my first (really long) blog posting for PrintedScholar. I hope that sharing my frustrations of trying to use statistics flexibly and fittingly might resonate with others out there and I welcome any and all feedback.
