Research on Student Ratings Continues to Evolve. We Should, Too.

This post originally appeared on the blog of Rice's Center for Teaching Excellence.


Three years ago I wrote a short post on student ratings. That post, a follow-up, and a screencast I created have had an impressively long shelf life. Faculty and administrators continue to share them, and each semester I'm invited to help improve the way these instruments are used on a variety of campuses.

In reporting my own surprise that the literature was more complicated than I assumed, I hoped to convince everyone to read all oversimplified reports about student ratings with a healthy dose of skepticism. The central thesis was one I hoped all academics could get behind: evidence matters--even (and especially) when we have strong incentives to ignore it or explain it away.

Given my commitment to that conclusion, I've continued to read new research on student ratings over the last three years. Like many of our readers, I've had to squeeze this reading into an already packed schedule. That means my reading has not been as systematic as it was when I was completing my initial literature review. But I've seen enough new evidence to know the 2015 resources are in need of an update.

Although a variety of studies made headlines in the intervening years, they were not all worth taking seriously. But at least two have made significant new contributions to our understanding of the validity of student ratings. I will do my best to quickly summarize their main conclusions below, but I encourage you to download each and give them a careful read. I'll then end with some brief reflections on what these results mean for our use of student ratings in higher education, noting the ways they have (or have not) altered my views.


Uttl, Bob, Carmela A. White, and Daniela Wong Gonzalez. “Meta-Analysis of Faculty’s Teaching Effectiveness: Student Evaluation of Teaching Ratings and Student Learning Are Not Related.” Studies in Educational Evaluation 54 (2017): 22–42.


Those who watched my 2014 screencast know that I identified Peter Cohen's 1981 meta-analysis of multi-section validity studies as the "gold standard" for research on the relationship between student ratings and student achievement. In that meta-analysis, Cohen reviewed 68 different studies that compared student ratings of instructors to student performance on common exams. Though a few of these well-designed studies showed an inverse correlation, the vast majority were positive, and he found the overall correlation to be around .43.

Those who read my blog post will also know that my biggest complaint about most recent work on student ratings is that it invariably ignores Cohen's work and the larger literature on student ratings. Newer studies are important, but they should only change our minds if the authors can also explain why their single study (often without a common exam as the measure of achievement) is better evidence than the 60+ studies included in Cohen's meta-analysis. Few attempt this. Fewer still have done it well.

Enter Bob Uttl, Carmela White, and Daniela Wong Gonzalez. Instead of presenting the results of a single new study, they performed a new meta-analysis that included the results of many recent studies. And instead of ignoring Cohen's results, they began with them, proposed alternative explanations, tested them, and re-ran his entire analysis, taking those explanations into account.

And what did they conclude? Their results are more complicated than a summary will be able to capture, but here are the highlights:

  1. The relationship between student ratings and student achievement in the studies Cohen reviewed was also related to sample size. Studies that compared only a few sections were likely to find large correlations, while studies that compared many sections were likely to find small correlations (a simulation sketch after this list illustrates why small samples behave this way).
     
  2. If you re-run Cohen's meta-analysis, leaving out studies that compare fewer than 30 sections (as well as studies that don't meet Cohen's own inclusion criteria), the estimated correlation between student ratings and student achievement drops from .43 to .27.
     
  3. A new meta-analysis that includes newer studies with different measures of student learning, that controls for prior knowledge, and that accounts for small sample size, finds that the correlation between student ratings and student achievement drops to 0.
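To make the sample-size point concrete, here is a minimal simulation sketch in Python. The numbers are made up (nothing below comes from Cohen's or Uttl et al.'s data); the only assumption is a fixed "true" section-level correlation of .27, borrowed from the re-analysis above purely for illustration. The point is how wildly the estimated correlation swings when a study includes only a handful of sections.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_correlations(n_sections, true_r=0.27, n_studies=2_000):
    """Estimate the ratings/achievement correlation in many hypothetical
    multi-section studies, each comparing n_sections sections.
    The underlying correlation is fixed at true_r; the rest is sampling noise."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    estimates = []
    for _ in range(n_studies):
        ratings, achievement = rng.multivariate_normal([0, 0], cov, size=n_sections).T
        estimates.append(np.corrcoef(ratings, achievement)[0, 1])
    return np.array(estimates)

for n in (5, 10, 30, 100):
    est = simulated_correlations(n)
    low, high = np.percentile(est, [2.5, 97.5])
    print(f"{n:>3} sections: middle 95% of estimated correlations = [{low:+.2f}, {high:+.2f}]")
```

With 5 or 10 sections the estimates range from strongly negative to nearly perfect even though the underlying correlation never changes; with 100 sections they cluster tightly around .27. If small studies are over-represented, or if only the estimates large enough to reach significance get reported, the average published correlation will overstate the true relationship--which is, in essence, the small-study concern behind points 1 and 2 above.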

These results are not immune from legitimate criticisms. It is worth noting, for example, that an analysis of only the statistically significant results in Cohen's studies doesn't reveal straightforward sample-size effects (indeed, only one of those 20 studies showed results below .25). And more significantly, the new meta-analysis gives a great deal of weight to a few recent studies with sample sizes as large as 190 sections. Yet a moment's reflection will make clear that those studies are not measuring student learning via a common exam (how many universities have a single course with 190 different sections in a single semester?). Instead, they're measuring learning across multiple types of courses via grades in future classes.

But given that all studies have flaws, it's hard to imagine a better response to Cohen's work than this. So those of us who claim to value evidence must take it seriously. Taking it seriously can mean a variety of things, but it's clear that we shouldn't ignore the results. Especially if one result suggests there is no relationship between student ratings and student achievement.


Mengel, Friederike, Jan Sauermann, and Ulf Zolitz. “Gender Bias in Teaching Evaluations.” Journal of the European Economic Association 16 (2018).


Over the last three years, one specific issue has dominated the conversation about student ratings of instruction. Although questions about validity never left the scene, concerns about potential biases against marginalized faculty, and particularly women, took center stage.

As I noted in my initial post, there has been a great deal of research on variables that might create systematic biases in student ratings, and even the greatest defenders of student ratings admit that many such biases exist. Yet, surprisingly, most of the research on gender has produced mixed results, with many studies suggesting no significant bias. Recent studies have tried to challenge this result, but, as with most recent validity studies, they were often poorly designed and rarely addressed the existing literature.

Yet the soon-to-be-published work of Friederike Mengel, Jan Sauermann, and Ulf Zolitz is a notable exception. Unlike most studies of gender bias, which simply compare the results of male and female instructors, Mengel and her co-authors were able to control for the quality of the teacher in their analysis. Their sample was large (20,000 ratings), the students were randomly assigned to the instructors, and they measured numerous other variables (student gender, content of the course, hours spent studying, etc.) to improve their analysis. Without going into further detail, I will simply note that this study is one of the most thorough analyses of student ratings I have ever read.

To get a full picture of its complexity, I urge you to read the complete manuscript. But even after controlling for teacher quality, they found the following:

  1. On average, female instructors received lower ratings than male instructors.
     
  2. Female instructors received lower ratings from male students than from female students, but both genders gave female instructors lower ratings than male instructors.
     
  3. Female graduate student instructors received lower ratings than female lecturers and professors.
     
  4. Female instructors teaching math-related content received lower ratings than female instructors in other disciplines.

If we stopped here, as many of those who covered this work in the media did, we'd conclude this was unequivocally bad news for women and for student ratings more generally. Yet this study also found the following:

  1. Female lecturers and professors (in contrast to graduate student instructors) do NOT receive lower ratings from male students, and actually receive higher ratings from female students. This gives them an advantage over male faculty in courses with female students.
     
  2. For the subgroups that do receive biased results, the effect is modest compared to other known biases. The difference was at most around .25 points on a 5-point scale. 

As with Uttl et al., one can raise legitimate questions about some features of this study. The "instructors" rated were leaders of discussion sections, not the actual instructors of record; the "instructor rating" was a composite score of atypical questions about the instructor (rather than the typical "overall, I rate this instructor ..."); there were fewer than 15 students in each section; and the response rate was only 36%. Taken together, this means that there were at most 5 students determining the rating of instructors in their individual sections (36% of 14 or fewer students is roughly 5 responses)--a result that student ratings researchers would consider highly unreliable. Nevertheless, I know of no study of gender and student ratings that comes close to matching the rigor of this one.


The results presented in these two studies are both new and, in my mind, significant contributions to the literature. So what does this mean for how we should think about the use and interpretation of student ratings on our campuses?

For the most part, these results simply reinforce many of the recommendations scholars of student ratings have been sharing for quite some time. For an especially careful outline of these guidelines, as well as references to the literature that supports them, I recommend Angela Linse's 2017 piece that appeared alongside Uttl's meta-analysis in Studies in Educational Evaluation. But for those who are new to this conversation, key guidelines include:

  • Recognize that student ratings are not themselves evaluations, but rather data to be used by evaluators who have (hopefully) been trained in responsible interpretation (first step: changing your language from "student evaluations" to "student ratings"!)
     
  • Never use student ratings as the only measure of teaching effectiveness. Other important measures include: direct evidence of student learning, observations, reviews of teaching materials, and the teacher's self-reflection.
     
  • Take account of variables that systematically and significantly bias the results in certain courses (e.g., student motivation, student effort, class size, and discipline)
     
  • Avoid drawing conclusions from statistically meaningless differences (e.g., using decimal-point differences to rank faculty against one another)

The results summarized above also suggest the following new recommendations. First, we should expand the list of known biases to include biases against younger female instructors, and especially those teaching math-related content to a disproportionately large number of male students. This is particularly important for those of us on hiring committees making decisions on the basis of student ratings received in graduate school. And while we're at it, we should also recognize that senior female faculty teaching courses with lots of female students are likely to get an unfair boost in their scores.

But the most important recommendation I would now make is the following: we should put a moratorium on using student ratings results to rank and compare individual faculty to one another. If we accept Uttl et al.'s best-case scenario and assume there is a .27 correlation between ratings and learning, there is simply too much room for error with individual comparisons (for a review of why this is the case, see the hypothetical data visualizations in my original follow-up).
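To see why a .27 correlation leaves so much room for error, here is a minimal simulation sketch under deliberately simple assumptions of my own: ratings and learning are treated as normally distributed, and .27 is taken as the entire relationship between them (neither assumption is a claim from the paper). It asks a single question: if you compare two instructors and pick the one with the higher rating, how often have you actually picked the one whose students learned more?

```python
import numpy as np

rng = np.random.default_rng(0)
r = 0.27                 # best-case correlation from Uttl et al.'s re-analysis
n_pairs = 200_000        # number of hypothetical head-to-head comparisons

# Draw (rating, learning) for two instructors at a time, with correlation r
# between an instructor's rating and their students' learning.
cov = [[1.0, r], [r, 1.0]]
a = rng.multivariate_normal([0, 0], cov, size=n_pairs)
b = rng.multivariate_normal([0, 0], cov, size=n_pairs)

# How often does the higher-rated instructor also have the higher learning?
agree = np.mean((a[:, 0] > b[:, 0]) == (a[:, 1] > b[:, 1]))
print(f"Higher rating identifies the higher-learning instructor {agree:.0%} of the time")
print(f"Closed-form check (1/2 + arcsin(r)/pi): {0.5 + np.arcsin(r) / np.pi:.0%}")
```

Under these assumptions the higher-rated instructor is the higher-learning instructor only about 59% of the time--barely better than a coin flip--and if the correlation is closer to zero, it is a coin flip. That, in brief, is why ranking individual faculty by their ratings is so risky.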

Of course, many who have been writing about student ratings will think this recommendation does not go far enough. For these folks, the best solution to this problem is to discontinue the use of student ratings altogether. 

I think this view is mistaken for three reasons.

First, I still stand by the observation I made in my original post: it's not clear that many of the other measures we might use are any better. One exception is the direct measure of student learning via standardized tests, but this method of evaluating faculty has very few advocates. So the best we can do with imperfect measures is use as many of them as we can--recognizing that each has its own unique flaws.

Second, while comparing faculty to one another is dangerous, the quantitative scores can still be valuable if used to chart the growth of a single instructor over time. Presuming that most of the noise in the measure is the result of variables unique to each instructor and the courses they teach--and therefore relatively stable from semester to semester--large changes over time are more likely to reflect genuine improvement. It will be important not to over-interpret small differences in this case as well (dropping from a 4.3 to a 4.2 average is not a cause for concern!), but if an instructor moves from a 2.5 average to a 4.5 average over the course of their career, we can be fairly confident that there was real and significant growth in their teaching performance.
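Here is one last minimal sketch of that contrast, again with made-up numbers of my own choosing: a "stable" instructor whose true average is 3.5, rated by 25 students per semester whose individual ratings scatter with a standard deviation of 1.0. It simply shows how large the semester-to-semester wobble is for an instructor who never changes, so a 0.1-point dip and a 2-point climb can each be judged against that wobble.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for illustration only: a stable instructor with a
# true average of 3.5 on a 5-point scale, rated each semester by 25 students
# whose individual ratings vary with a standard deviation of 1.0.
true_mean, n_students, student_sd, n_semesters = 3.5, 25, 1.0, 20

semester_means = np.array([
    np.clip(rng.normal(true_mean, student_sd, n_students), 1, 5).mean()
    for _ in range(n_semesters)
])

print("Twenty semester averages for an instructor who never changes:")
print(np.round(semester_means, 2))
print(f"Typical semester-to-semester swing (SD): {semester_means.std():.2f}")
# A 0.1-point change (4.3 -> 4.2) sits well inside this noise;
# a 2-point climb (2.5 -> 4.5) is roughly ten times larger than it.
```

The typical swing here is about two tenths of a point, which is why a 4.3-to-4.2 drop tells us nothing, while a sustained move from 2.5 to 4.5 is very unlikely to be noise alone.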

Finally, and most significantly, even if I believed student ratings were entirely worthless for the purposes of assessing the performance of instructors (which I do not), it does not follow that they are worthless for all purposes. As any educational researcher will tell you, one of the most significant variables contributing to student growth in a classroom is the quality and quantity of the feedback their instructors receive about their impact on students. The more we know about how our students are experiencing our course, the more we can adapt our teaching to achieve our goals. And while there are multiple ways of getting that information, there are few methods more efficient and effective than asking the students directly.