Do Student Evaluations of Teaching Really Get an "F"?

This post was originally posted to the Blog of Rice's Center for Teaching Excellence. The original post can be found here.

It's been a bad year for student evaluations.

In the space of two weeks in October, both NPR and the Harvard Business Review published pieces summarizing studies that were critical of their use. With provocative titles like "Student Course Evaluations Get an F" and "Better Teachers Receive Worse Student Evaluations," these pieces were (and continue to be) widely shared and much discussed among academics.

Slate then joined the conversation in December with a piece entitled, "Best Way for Professors to Get Good Student Evaluations? Be Male." Here we learned of a study that suggested these instruments were irredeemably biased against women. And when Ben Schmidt created his fascinating Rate My Professor Gender Visualization in February, the presence of this bias seemed to be confirmed (and was further chronicled in the periodicals academics like to read).

When the Iowa state legislature attempted to institute something of an academic Hunger Games at their public institutions in April, it was only natural that debate about student evaluations would be revived yet again. NPR republished its October piece under the new title, "What if Students Could Fire Their Professors?," and another series of blog posts sprang to life, culminating in a May "defense" of student evaluations, which--in a somewhat odd approach for a defense--began by arguing that they are "biased, misleading, and bigoted."

In June, the hate for student evaluations had grown so large that it began to bleed into reporting on other, seemingly unrelated, issues in higher education. In a piece for the Washington Post on the "Least Powerful People in the Academy," we learn that there is not only "a mountain of evidence that these evaluations do not properly measure teaching quality," but that their use is part of the reason we should be worried about Title IX investigations and the current climate of political correctness on campus. 

Finally, late last month, Inside Higher Ed reported on a long-awaited AAUP survey on student evaluations in a piece simply titled "Flawed Evaluations." And less than a week later, a Nobel Laureate proposed an alternative measure of teaching effectiveness that would free us from the "tyranny" of student evaluations that, apparently, "everyone complains about."

With press like this, it's not surprising that some faculty have argued, and many faculty believe, that when it comes to student evaluations, literally "everyone recognizes the data is worthless" [emphasis mine].[1]

Sure, the AAUP report cited above suggests this assessment of faculty attitudes is off by around 47%. But don't the above reports make it clear that student evaluations are, as a point of fact, invalid measures of teaching effectiveness? Aren't we all justified in believing that everyone who has actually studied student evaluations will recognize that the data is worthless?

This is certainly what the dominant narrative in higher education reporting would have us believe. And, up until this year, I was inclined to believe it.

This all changed when, for the first time in my academic career, I finally had the time to explore the literature for myself.

My deep dive into this unexpectedly expansive area of research began in September, when I was appointed co-chair of a university committee charged with re-assessing Rice's current system for evaluating teaching effectiveness. As a faculty member who has always been interested in student evaluations and a staff member of a teaching center committed to sharing the latest scholarship on teaching and learning, I was eager to immerse myself in the literature to help make the important work of our committee as systematic and scholarly as it could be.

I learned a great deal as I read, and many of my views have changed rather dramatically as a result. This post is not the place to work through the details of what I learned and how I have changed. But I do want to highlight, as briefly as possible, the six most surprising insights I took away from the formal research literature on student evaluations (if you're interested in more details, including discussions of the way gender, grades, and workload interact with evaluation results, feel free to check out a screencast I created for Rice faculty earlier this year).

  1. Yes, there are studies that have shown no correlation (or even inverse correlations) between the results of student evaluations and student learning. Yet, there are just as many, and in fact many more, that show just the opposite.[2]

  2. As with all social science, this research question is incredibly complex. And insofar as the research literature reflects this complexity, there are few straightforward answers to any questions. If you read anything that suggests otherwise (in either direction), be suspicious.

  3. Despite this complexity, there is wide agreement that a number of independent factors, easily but rarely controlled for, will bias the numerical results of an evaluation. These include, but are not limited to, student motivation, student effort, class size, and discipline (note that gender, grades, and workload are NOT included in this list).

  4. Even when we control for these known biases, the relationship between scores and student learning is not 1 to 1. Most studies have found correlations of around .5. This is a relatively strong positive correlation in the social sciences, but it is important to understand that it means there are still many factors influencing the outcome that we don't yet understand. Put differently, student evaluations of teaching effectiveness are a useful, but ultimately imperfect, measure of teaching effectiveness.

  5. Despite this recognition, we have not yet been able to find an alternative measure of teaching effectiveness that correlates as strongly with student learning. In other words, they may be imperfect measures, but they are also our best measures.

  6. Finally, if scholars of evaluations agree on anything, they agree that however useful student evaluations might be, they will be made more useful when used in conjunction with other measures of teaching effectiveness.
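
As a rough illustration of the fourth point: under the simplifying assumption of a simple linear relationship between evaluation scores and student learning, squaring a correlation coefficient gives the share of variance in one measure that is accounted for by the other. The short Python sketch below applies that rule of thumb to a correlation of .5. The function name is hypothetical and chosen only for illustration; nothing here comes from the studies themselves.

```python
def variance_explained(r: float) -> float:
    """Share of variance accounted for by a simple linear fit, given correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r ** 2


# Assumption: r = 0.5 stands in for the "around .5" figure cited in point 4.
r = 0.5
explained = variance_explained(r)
unexplained = 1 - explained

print(f"explained: {explained:.0%}, unexplained: {unexplained:.0%}")
# prints "explained: 25%, unexplained: 75%"
```

On those assumptions, a correlation of .5 leaves roughly three quarters of the variation in student learning unaccounted for by evaluation scores, which is one way to read "useful, but ultimately imperfect."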

These six insights are some of the least controversial ideas in a 100-year-old literature on student evaluations. Yet, they are nevertheless surprising to those (myself included) who have only encountered this literature secondhand.

And this brings me to what I am most interested in writing about today: the troubling and severe disconnect between what I have been reading in the research literature and what is currently being reported in the popular academic press.

I fully recognize that journalism is not the same genre as peer-reviewed academic scholarship, and that the former must often simplify claims from the latter to reach a wider audience. But my concern is that, when it comes to student evaluations, the gap between the press and the research literature is not simply a matter of oversimplification.

The problem is that many pieces, including those highlighted at the beginning of this post, make grand, sweeping, and blatantly false claims about research findings, the nature of the literature, or scholarly opinion. And many provide zero support for their most controversial claims (on the assumption that "everyone" recognizes their truth?). On the rare occasion we are referred elsewhere, we are far more likely to be sent to the most recent piece the author has read on the internet than to a systematic summary of the research literature. This then leads to a situation, as we saw this year, where almost every published report refers back to a single new study, completely disconnected from the context of the larger literature.[3]

As such, faculty who rely on higher ed reporting to summarize this issue end up completely unaware of the thousands of studies produced over the last 100 years. And, more damningly, they have often come to believe that "everyone" believes truths that have been plainly rejected or, at the very least, seriously and systematically challenged, within that literature.

I am the first to admit that the complexity of the literature is at times overwhelming, and that we are setting the bar too high if we expect everyone to become an expert before they report on a subject. I also think one could argue that, given how fascinating and well-designed some of the most recent studies are, they deserve to be reported in their own right, whether or not they're placed within the context of the wider literature. But the fact that so many of these pieces make sweeping, categorical claims on the basis of a single study suggests that their authors want to make larger claims about the literature, but simply don't want to do the requisite work to get there responsibly. Likewise, it is remarkable that only one of the above-cited authors took care to include the dissenting voice of a scholar willing to suggest that student evaluations might not be as invalid as some would have us believe. Is it any wonder that faculty reading these reports assume there is no disagreement in the literature?

My worry is that the primary reason higher education reporting has failed on this issue has less to do with expertise than with a desire to publish pieces that will be read and shared widely. Because faculty have a personal (and often deeply personal) interest in student evaluations, almost anything written about them is bound to be clicked and shared across social media platforms as quickly as it is produced. And we also know that pieces with provocative titles and claims are more likely to be shared than measured descriptions of complicated statistical meta-analyses. So pieces on student evaluations that make provocative claims (and have provocative titles) are likely to be the coin of the realm in the world of higher education reporting.[4]

I get these pressures, and am generally a fan of sharing things on the internet that people actually enjoy. The problem is that, unlike adorable cat videos, reports on student evaluations can have real and lasting implications for the state of student learning on our campuses. Research that assesses the tools we use to improve our teaching, and by extension student learning, is simply too important to be distorted into click-bait. We should work hard to better understand these instruments, their biases, and how to use them most effectively. But to do so, we must first abandon or ignore hyperbolic reporting on the issue.

Luckily for us, there is a 100-year history of careful, measured analysis ready for us to explore. It is a literature largely unknown to those working outside the narrow specialty of higher education research, but we can all do a better job engaging this literature with as much scholarly effort as possible. And, for the sake of improving student learning on our campuses, we should.

[1] There is even an entire blog devoted to the theme:

[2] To keep things simple, I'm going to gloss over considerable complexity and assume that all researchers are studying identical "student evaluations" and that they mean the same thing by "results" and "student learning." As you might imagine, the literature is not this consistent, so one could argue (and many have) that these sorts of broad comparisons are not appropriate. Yet, in even the most rigorous meta-analyses that take account of these differences, an overwhelming majority of the studies suggest a modest positive relationship.

[3] Because almost every piece linked at the beginning of this post cites a single working paper by Stark and Freishtat, it's worth directing readers to Part I and Part II of Steve Benton's point-by-point response to their argument.

[4] One great example of the artificial sensationalization of claims occurs in the first few sentences of the Inside Higher Ed report on the AAUP survey. In a piece where we learn that 47% of faculty consider teaching evaluations to be effective, the author still chooses to begin with the following words: "They’re almost universally loathed by professors." How else can we make sense of interpreting 53% of faculty as "almost universal" than by assuming that the author was simply hoping to get her reader's attention?