A close-up look at NYC education policy, politics, and the people who have been, are now, or will be affected by acts of corruption and fraud. ATR CONNECT assists individuals who suddenly find themselves in the ATR ("Absent Teacher Reserve") pool and are the "new" rubber roomers, and re-assigned. The terms "rubber room" and "ATR" mean that you or any person has been targeted for removal from your job. A "Rubber Room" is not a place, but a process.
The New York Times, yesterday, released the value-added data on 18,000 New York City teachers collected between 2007 and 2010. Though teachers are irate and various newspapers, The New York Post, in particular, are gleeful, I have mixed feelings.
For sure the ‘reformers’ have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added. Now I have definitive and indisputable proof which I plan to write about for at least my next five blog posts.
The tricky part about determining the accuracy of these value-added calculations is that there is nothing to compare them to. So a teacher gets an 80 out of 100 on her value-added. What does this mean? Does it mean that the teacher would rank 80 out of 100 on some metric that took into account everything that teacher did? As there is no way, at present, to do this, we can’t really determine if the 80 was the ‘right’ score. All we can say is that according to this formula, this teacher got an 80 out of 100. So what we need, in order to ‘check’ how good a measure these statistics are, is some ‘objective’ truths about teachers. I will describe three, and we will see whether the value-added measures support them.
On The New York Times website they chose to post a limited amount of data. They have the 2010 rating for the teacher and also the career rating for the teacher. These two pieces of data fail to demonstrate the year-to-year variability of these value-added ratings.
I analyzed the data to see if they would agree with three things I think every person would agree upon:
1) A teacher’s quality does not change by a huge amount in one year. Maybe they get better or maybe they get worse, but they don’t change by that much each year.
2) Teachers generally improve each year. As we tweak our lessons and learn from our mistakes, we improve. Perhaps we slow down when we are very close to retirement, but, in general, we should get better each year.
3) A teacher in her second year is way better than that teacher was in her first year. Anyone who taught will admit that they managed to teach way more in their second year. Without expending so much time and energy on classroom management, and also by not having to make all lesson plans from scratch, second year teachers are significantly better than they were in their first year.
Maybe you disagree with my #2. You may even disagree with #1, but you would have to be crazy to disagree with my #3.
Though the Times only showed the data from the 2009-2010 school year, there were actually three files released, 2009-2010, 2008-2009, and 2007-2008. So what I did was ‘merge’ the 2010 and 2009 files. Of the 18,000 teachers in the 2009-2010 data I found that about 13,000 of them also had ratings from 2008-2009.
Looking over the data, I found that 50% of the teachers had a 21 point ‘swing’ one way or the other. There were even teachers who had gone up or down as much as 80 points. The average change was 25 points. I also noticed that 49% of the teachers got lower value-added in 2010 than they did in 2009, contrary to my experience that most teachers improve from year to year.
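For readers who want to reproduce this kind of check, here is a minimal sketch of the merge-and-compare step. It is not the code I actually ran: the teacher ids and scores are invented toy values, and the names `merge_years` and `swing_summary` are hypothetical helpers. It only assumes that each year's file can be reduced to one percentile score per teacher.

```python
from statistics import mean, median

# Hypothetical toy records: teacher id -> percentile score (0-100).
# These numbers are invented for illustration; they are NOT from the released files.
scores_2009 = {"t1": 80, "t2": 35, "t3": 60, "t4": 10, "t5": 95}
scores_2010 = {"t1": 55, "t2": 70, "t3": 62, "t4": 50, "t6": 40}


def merge_years(earlier, later):
    """Keep only teachers rated in both years, as (earlier, later) pairs."""
    return {t: (earlier[t], later[t]) for t in earlier if t in later}


def swing_summary(pairs):
    """Summarize the year-to-year change for merged (earlier, later) pairs."""
    changes = [later - earlier for earlier, later in pairs.values()]
    swings = [abs(c) for c in changes]  # size of the change, either direction
    return {
        "teachers": len(changes),
        "median_swing": median(swings),
        "mean_swing": mean(swings),
        "share_lower": sum(c < 0 for c in changes) / len(changes),
    }


merged = merge_years(scores_2009, scores_2010)
summary = swing_summary(merged)
print(summary)
```

Teachers who appear in only one year (here `t5` and `t6`) drop out of the merge, which is why the 18,000 teachers in one file shrink to about 13,000 matched teachers.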
I made a scatter plot with each of these 13,000 teachers’ 2008-2009 score on the x-axis and their 2009-2010 score on the y-axis. If the data were consistent, one would expect some kind of correlation with points clustered on an upward sloping line. Instead, I got:
With a correlation coefficient of .35 (and even that is inflated, for reasons I won’t get into right now), the scatter plot shows that teachers are not consistent from year to year, contrary to my #1, nor do a good number of them go up, contrary to my #2. (You might argue that 51% go up, which is technically ‘most,’ but I’d say you’d get about 50% with a random number generator — which is basically what this is.)
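To make the ‘random number generator’ comparison concrete, here is a rough sketch, assuming the standard Pearson formula is the correlation coefficient in question. The function name `pearson_r` is my own, and the two ‘years’ of scores are simulated independent random numbers, not real ratings. With truly unrelated scores, the correlation should land near zero and roughly half of the simulated teachers should ‘improve’.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Simulated baseline: two unrelated random 'scores' per teacher.
# This is an assumption-laden toy, not the released data.
rng = random.Random(0)
year1 = [rng.randint(0, 99) for _ in range(10_000)]
year2 = [rng.randint(0, 99) for _ in range(10_000)]

r_random = pearson_r(year1, year2)
share_up = sum(b > a for a, b in zip(year1, year2)) / len(year1)
print(round(r_random, 3), round(share_up, 3))
```

A coefficient of 1 would mean every teacher kept the same relative standing from year to year. The closer the real figure sits to this random baseline, the less the second year's score tells you about the first.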
But this may not sway you if you think that a teacher’s ability can change drastically in one year and that teachers get stale with age, in which case you are not surprised that about half went down.
Then I ran the data again. This time, though, I used only the 707 teachers who were first year teachers in 2008-2009 and who stayed for a second year in 2009-2010. Just looking at the numbers, I saw that they were similar to the numbers for the whole group. The median amount of change (one way or the other) was still 21 points. The average change was still 25 points. But the amazing thing, which definitely proves how inaccurate these measures are, is that the percent of first year teachers who ‘improved’ on this metric in their second year was just 52%, contrary to what every teacher in the world knows: that nearly every second year teacher is better than she was in her first year. The scatter plot for teachers who were new teachers in 2008-2009 has the same characteristics as the scatter plot for all 13,000 teachers. Just like the graph above, the x-axis is the value-added score for the first year teacher in 2008-2009 while the y-axis is the value-added score for the same teacher in her second year during 2009-2010.
In part 1 [see below - Editor] I demonstrated there was little correlation between how a teacher was rated in 2009 and how that same teacher was rated in 2010. So what can be more crazy than a teacher being rated highly effective one year and then highly ineffective the next? How about a teacher being rated highly effective and highly ineffective IN THE SAME YEAR?
I will show in this post how exactly that happened for hundreds of teachers in 2010. By looking at the data I noticed that of the 18,000 entries in 2010, about 6,000 were repeated names. This is because there are two ways that one teacher can get multiple value-added ratings for the same year.
The most common way this happens is when the teacher is teaching self-contained elementary in 3rd, 4th, or 5th grade. The students take the state test in math and in language arts and that teacher gets two different effectiveness ratings. So a teacher might, according to the formula, ‘add’ a lot of ‘value’ when it comes to math, but ‘add’ little ‘value’ (or even ‘subtract’ value) when it comes to language arts.
To those who don’t know a lot about education (yes, I’m talking to you ‘reformers’), it might seem reasonable that a teacher can do an excellent job in math and a poor job in language arts, and that it should not be surprising if the two scores for that teacher do not correlate. But those who do know about teaching would expect the amount the students learn in each subject to correlate, since someone who is doing an excellent job teaching math is likely to be doing an excellent job teaching language arts. Both jobs rest on some common groundwork that benefits all learning in the class. The teacher has good classroom management. The teacher has helped her students to be self-motivated. The teacher has a relationship with the families. All these things increase the amount of learning of every subject taught. So even if an elementary teacher is a little stronger in one subject than another, it is more about the learning environment that the teacher created than anything else.
Looking through the data I noticed teachers like a 5th grade teacher at P.S. 196, who scored 97 out of 100 in language arts and 2 out of 100 in math. This is with the same students in the same year! How can a teacher be so good and so bad at the same time? Any evaluation system in which this can happen is extremely flawed, of course, but I wanted to explore if this was a major outlier or if it was something quite common. I ran the numbers and the results shocked me (which is pretty hard to do). Here’s what I learned:
Out of 5,675 elementary school teachers, the average difference between the two scores was a whopping 22 points. One out of six teachers, or approximately 17%, had a difference of 40 or more points. One out of 25 teachers, which was 250 teachers altogether, had a difference of 60 or more points, and, believe it or not, 110 teachers, or about 2% (that’s one out of fifty!) had differences of 70 or more points. At the risk of seeming repetitive, let me repeat that this was the same teacher, the same year, with the same kids. Value-added was more inaccurate than I ever imagined.
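The tallies above come down to simple counting, which can be sketched as follows. This is an illustrative sketch, not my original script: the `(language_arts, math)` pairs are made up apart from the 97-and-2 pair quoted earlier, and `gap_report` is a hypothetical helper name.

```python
def gap_report(pairs, thresholds=(40, 60, 70)):
    """Average gap between two same-year scores, plus the share of
    teachers whose gap is at least each threshold."""
    gaps = [abs(first - second) for first, second in pairs]
    shares = {t: sum(g >= t for g in gaps) / len(gaps) for t in thresholds}
    return sum(gaps) / len(gaps), shares


# Hypothetical (language arts, math) percentile pairs for five teachers.
# Only the first pair echoes an example from the text; the rest are invented.
toy_pairs = [(97, 2), (50, 45), (80, 30), (20, 65), (60, 58)]

average_gap, shares = gap_report(toy_pairs)
print(average_gap, shares)
```

If the two scores measured the same underlying teacher, most gaps would be small and the shares at 40, 60, and 70 points would be close to zero.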
I made a scatter plot of the 5,675 teachers. On the x-axis is that teacher’s language arts score for 2010. On the y-axis is that same teacher’s math score for 2010. There is almost no correlation.
For people who know education, this is shocking, but there are people who probably are not convinced by my explanation that these should be more correlated if the formulas truly measured learning. Some might think that this really just means that just like there are people who are better at math than language arts and vice versa, there are teachers who are better at teaching math than language arts and vice versa.
So I ran a different experiment for those who still aren’t convinced. There is another scenario where a teacher got multiple ratings in the same year. This is when a middle school math or language arts teacher teaches multiple grades in the same year. So, for example, there is a teacher at M.S. 35 who taught 6th grade and 7th grade math. As these scores are supposed to measure how well you advanced the kids that were in your class, regardless of their starting point, one would certainly expect a teacher to get approximately the same score on how well they taught 6th grade math and 7th grade math. Maybe you could argue that some teachers are much better at teaching language arts than math, but it would take a lot to try to convince someone that some teachers are much better at teaching 6th grade math than 7th grade math. But when I went to the data report for M.S. 35 I found that while this teacher scored 97 out of 100 for 6th grade math, she only scored a 6 out of 100 for 7th grade math.
Again, I investigated to see if this was just a bizarre outlier. It wasn’t. In fact, the spreads were even worse for teachers teaching one subject to multiple grades than they were for teaching different subjects to the same grade.
Out of 665 teachers who taught two different grade levels of the same subject in 2010, the average difference between the two scores was nearly 30 points. One out of four teachers, or approximately 28%, had a difference of 40 or more points. Ten percent of the teachers had differences of 60 points or more, and a full five percent had differences of 70 points or more. When I made my scatter plot with one grade on the x-axis and the other grade on the y-axis I found that the correlation coefficient was a minuscule .24.
Rather than report about these obvious ways to check how invalid these metrics are and how shameful it is that these scores have already been used in tenure decisions, or about how a similarly flawed formula will be used in the future to determine who to fire or who to give a bonus to, newspapers are treating these scores like they are meaningful. The New York Post searched for the teacher with the lowest score and wrote an article about ‘the worst teacher in the city’ with her picture attached. The New York Times must have felt they were taking the high road when they did a similar thing but, instead, found the ‘best’ teachers based on these ratings.
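Finding these teachers means picking out the repeated names and keeping only those with two grades of the same subject. Here is one way that step might look, as a sketch under assumptions: the `(teacher, subject, grade, score)` record layout is my guess at a convenient shape rather than the format of the released files, the rows are invented (teacher "A" simply mirrors the 97-and-6 example above), and `same_subject_grade_pairs` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical rows: (teacher, subject, grade, percentile score).
# Invented for illustration; not taken from the released data.
records = [
    ("A", "math", 6, 97),
    ("A", "math", 7, 6),
    ("B", "math", 7, 40),
    ("B", "math", 8, 55),
    ("C", "ela", 6, 70),   # only one rating, so no pair
    ("D", "math", 6, 30),  # two subjects, same grade: not this experiment
    ("D", "ela", 6, 60),
]


def same_subject_grade_pairs(rows):
    """Map (teacher, subject) -> (lower-grade score, higher-grade score)
    for teachers rated on exactly two grades of one subject."""
    by_key = defaultdict(dict)
    for teacher, subject, grade, score in rows:
        by_key[(teacher, subject)][grade] = score
    pairs = {}
    for key, by_grade in by_key.items():
        if len(by_grade) == 2:
            low, high = sorted(by_grade)
            pairs[key] = (by_grade[low], by_grade[high])
    return pairs


pairs = same_subject_grade_pairs(records)
gaps = {key: abs(a - b) for key, (a, b) in pairs.items()}
print(gaps)
```

Grouping on teacher and subject together is what separates this experiment from the elementary one: teacher "D" has two ratings, but in different subjects, so she is left out here.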
I hope that these two experiments I ran, particularly the second one where many teachers got drastically different results teaching different grades of the same subject, will bring to life the realities of these horrible formulas. Though error rates have been reported, the absurdity of these results should help everyone understand that we need to spread the word since calculations like these will soon be used in nearly every state.
I’ve never asked the people who read my blog to do this before since I prefer that it happen spontaneously, but I’d ask for you to spread the word about this post. Tweet it, email it, post it on Facebook. Whatever needs to happen for this to go ‘viral,’ I’d appreciate it. I don’t do this for money or for personal glory. I do it because I can’t stand when people lie and teachers, and yes those teachers’ students, get hurt because of it. I write these posts because I can’t stand by and watch it happen anymore. All you have to do is share it with your friends.