
Monday, April 18, 2022

Time To Change the Endless Teacher Evaluations That Make No Impact on Student Outcomes


Dear readers,

I think that Peter Greene (see below) has a point that needs to be addressed in New York City by the Department of Education VIPs - the attorneys, mayor, Commissioner, and all others - who have put into place a fraudulent teacher discipline procedure that punishes teachers who receive "developing" or "ineffective" ratings on their formal and/or informal observations.

I have posted my opinion on this blog many times about the way principals, assistant principals, and evaluators (remember PIP+?) rate teachers on observation reports, then bring 3020-a charges against them and pursue their termination. I have collected most of the decisions made at 3020-a since 2007, and I have found that most arbitrators agree with the assessment/rating given by the school administrators. The ratings are taken as evidence and fact, not opinions. I think this needs to be changed: arbitrators must stop finding ratings credible without looking behind the numbers and at all circumstances in the storyline.

At 3020-a, we at Advocatz always go into the backstory of evaluators - who paid them, whether there is any animosity between the administrators at the school and the teacher, etc. The evaluator is there to create a paper trail of incompetence for the administrator and must agree with whatever the principal wants the evaluator to write. Teachers/staff often tell me that the evaluator may say "Well done!" at the end of a lesson but never put that in writing, or put something completely different in the actual observation report. Often, the evaluator knows that if he/she puts his/her honest opinion in the report (i.e., that the teacher is terrific), then he/she may be fired - it's the evaluator's job at stake or yours. You, the teacher, won't win.

And then along came Charlotte Danielson and her teaching rubric, which was erroneously used as a benchmark for rating teachers and failed miserably to point to any standard of good teaching, pointing only to the whim of an evaluator who wanted a teacher fired. Evidently, Ms. Danielson herself was not happy with the way her rubric is used in New York City.

The problem is, the NYC Department of Education will not admit to any wrongdoing, and never allows anyone to suggest changes in the way their business is done.

Nonetheless, what needs to be done 

New research shows that endlessly evaluating teachers has made no impact on student outcomes.

BY PETER GREENE


For several decades, everyone from President George H.W. Bush to then-Arkansas Governor Bill Clinton believed it was possible to measure outcomes (“deliverables,” some called them) to separate the educational wheat from the chaff. Teachers would be held accountable for real data—numbers generated by tests.

No Child Left Behind (NCLB) had its bipartisan birth in 2001. Central to the law was the requirement not only to collect test scores but also to break them down by subgroup. This meant that, at least on paper, Black students or low-income students would not have their struggles hidden in a school average score.

“Effective” schools would be those that pulled high test scores from all students. NCLB had data and a deadline: By 2014, all schools would have all students scoring above average on the Big Standardized Test. This was, of course, a statistically impossible goal.

The Obama Administration moved past the unachievable goals of NCLB, but held onto the belief in data. Tests would be used to collect data, and that data would be used not only to judge schools but also to evaluate individual teachers. Students across the nation would take Common Core-aligned tests, and then teachers would be judged on the results.

Educators like me were subjected to complicated professional development programs, where we learned that models could predict what students would have scored in some parallel universe with an imaginary “neutral value” teacher.

The plan was, as one presenter told me, like predicting the weather. And if the student scored higher or lower than their hypothetical counterpart, that difference could be credited to, or blamed on, the teacher.

Education reformers were convinced they had unlocked the secret of identifying good and bad teachers and, as a result, revitalizing schools. Once identified, “good” teachers could be moved to struggling schools that needed them, incentive programs could be created to pay the better teachers, and “bad” teachers could be fired.

Position papers like “The Widget Effect” called for a future where teachers could be differentiated by their ability to improve student achievement. Critics of public education complained that teacher evaluations produced too many satisfactory ratings, and a new system of high-stakes testing would “toughen” the evaluation systems.

This new system would mimic the stack ranking of industry giants like Microsoft, where employees were ranked according to desired metrics. The bottom rung of the employee ladder would be dropped, and the company could fire its way to excellence.  

But, as a recently released working paper from the Annenberg Institute at Brown University underscores, the “massive effort to institute high-stakes teacher evaluation systems” had essentially no effect on “student achievement.” Though high-stakes teacher evaluation was supposed to raise test scores all across the nation, student achievement—measured by test scores—barely budged.

“Following the data,” it would seem, has led us far astray. The experiment has failed.

Why? I can think of several reasons.

Reducing the job of teaching to “get students ready to take a single math and reading test” was insulting and demoralizing. It was not what teachers had signed up for. Instead of using professional expertise and autonomy to create instruction to meet the many varied needs of students, teachers were increasingly required to deliver test prep.

The high-stakes testing also had the effect of warping education. Another new working paper shows that by focusing on math and reading, schools actually reduced education in other subject areas.

High-stakes testing also led to upside-down schools, which, instead of helping to meet all students’ needs, now existed to serve the need to produce test scores. More fundamentally, however, the very premise of the test-driven system was flawed: The notion that a great teacher is always a great teacher in all schools for all students on every day of their career doesn’t reflect reality.

Teaching is more akin to any other human relations-based role. I might be a great partner for my spouse but not a great partner for every other person on the planet. My “spousal effectiveness” is not a static quality that can be measured like height or hair color. It is the same for teacher effectiveness.

Beyond that flawed premise, the single biggest problem with this data-based system is that it runs on bad data. The tests themselves are a black box. Teachers in many states are forbidden to see the content of the test and, therefore, can never know where their students came up short. Worse yet, since not all educators teach reading or math, many teachers were evaluated based on the scores of students they didn’t teach.

The data that was intended to drive the system came solely from the student tests, and the tests are not good. Questions are often poorly designed; one famous bad set required students to answer questions about a talking pineapple. A poet found herself unable to answer test questions about her own poem; how could students be expected to do any better?

Standardized test results were supposed to provide useful data about student learning, but they are closely tied to the wealth and whiteness of families. To try to turn test scores into more usable data, value-added scores (based on a formula used for farming) were introduced, but these turn out to be random and unpredictable—the data is so bad it has been thrown out by courts.

Still, supporters have argued that higher test scores mean greater success in life. That’s a dubious finding: correlation is not causation, and socioeconomic background is a good predictor of both test scores and life outcomes. Nor does evidence support the claim that raising a test score raises a student’s life outcomes, which would be a far more important insight. It is no wonder that a system running on junk data has produced undesirable results.

In the meantime, private industry giants like Microsoft dropped stack ranking, realizing that it bred a bad workplace atmosphere, stopped innovation, and could not always be trusted.

And yet, there are still education reformers who cling to this high-stakes testing teacher evaluation model.

Current discussions about learning loss due to the pandemic often assume that high-stakes tests must serve as the best measure of what students are missing. Amid our current pandemic disruption, some are still suggesting that high-stakes testing remains our best bet.

But we have tried to guide education with high-stakes standardized testing data for two decades now, and yet we still have little evidence that this approach actually works. It’s the great irony of the modern school reform era—we must follow the data, except when the data shows that our systems don’t work.

English teacher for a few years. Blogger at Curmudgucation.