
Wednesday, February 29, 2012

Gary Rubinstein's Analysis of NYC Value-Added Data, Part 1 and 2

Analyzing Released NYC Value-Added Data Part 1

by Gary Rubinstein
The New York Times, yesterday, released the value-added data on 18,000 New York City teachers collected between 2007 and 2010. Though teachers are irate and various newspapers, The New York Post in particular, are gleeful, I have mixed feelings.
For sure the ‘reformers’ have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added. Now I have definitive and indisputable proof which I plan to write about for at least my next five blog posts.
The tricky part about determining the accuracy of these value-added calculations is that there is nothing to compare them to. So a teacher gets an 80 out of 100 on her value-added: what does this mean? Does it mean that the teacher would rank 80 out of 100 on some metric that took into account everything that teacher did? As there is no way, at present, to do this, we can’t really determine if the 80 was the ‘right’ score. All we can say is that according to this formula, this teacher got an 80 out of 100. So what we need, to ‘check’ how good a measure these statistics are, is some ‘objective’ truths about teachers. I will describe three, and we will see if the value-added measures support them.
On The New York Times website they chose to post a limited amount of data.  They have the 2010 rating for the teacher and also the career rating for the teacher.  These two pieces of data fail to demonstrate the year-to-year variability of these value-added ratings.
I analyzed the data to see if they would agree with three things I think every person would agree upon:
1)  A teacher’s quality does not change by a huge amount in one year.  Maybe they get better or maybe they get worse, but they don’t change by that much each year.
2)  Teachers generally improve each year.  As we tweak our lessons and learn from our mistakes, we improve.  Perhaps we slow down when we are very close to retirement, but, in general, we should get better each year.
3)  A teacher in her second year is way better than that teacher was in her first year.  Anyone who taught will admit that they managed to teach way more in their second year.  Without expending so much time and energy on classroom management, and also by not having to make all lesson plans from scratch, second year teachers are significantly better than they were in their first year.
Maybe you disagree with my #2.  You may even disagree with #1, but you would have to be crazy to disagree with my #3.
Though the Times only showed the data from the 2009-2010 school year, there were actually three files released, 2009-2010, 2008-2009, and 2007-2008.  So what I did was ‘merge’ the 2010 and 2009 files.  Of the 18,000 teachers in the 2009-2010 data I found that about 13,000 of them also had ratings from 2008-2009.
Looking over the data, I found that half of the teachers had a ‘swing’ of 21 points or more, one way or the other. There were even teachers who had gone up or down as much as 80 points. The average change was 25 points. I also noticed that 49% of the teachers got lower value-added in 2010 than they did in 2009, contrary to my experience that most teachers improve from year to year.
I made a scatter plot with each of these 13,000 teachers’ 2008-2009 scores on the x-axis and their 2009-2010 scores on the y-axis. If the data were consistent, one would expect some kind of correlation with points clustered on an upward sloping line. Instead, I got:
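For readers who want to try this kind of merge themselves, here is a minimal sketch in Python. It is an illustration, not the code behind the numbers above: the teacher IDs and scores are made up, and the function names `merge_years` and `swing_summary` are hypothetical. It keeps only the teachers rated in both years, then reports the median and average swing and the share whose score dropped.

```python
from statistics import mean, median

# Made-up percentile scores keyed by a hypothetical teacher ID.
# These are NOT values from the released NYC files.
scores_2009 = {"t1": 40, "t2": 85, "t3": 10, "t4": 60, "t5": 55}
scores_2010 = {"t1": 65, "t2": 30, "t3": 15, "t4": 60, "t6": 90}


def merge_years(earlier, later):
    """Keep only teachers rated in both years: {id: (earlier, later)}."""
    return {tid: (earlier[tid], later[tid]) for tid in earlier if tid in later}


def swing_summary(pairs):
    """Summarize year-to-year changes for the merged teachers."""
    changes = [later - earlier for earlier, later in pairs.values()]
    swings = [abs(change) for change in changes]
    return {
        "n": len(pairs),
        "median_swing": median(swings),
        "mean_swing": mean(swings),
        "max_swing": max(swings),
        # Fraction of teachers whose score went down.
        "share_lower": sum(change < 0 for change in changes) / len(changes),
    }


merged = merge_years(scores_2009, scores_2010)
summary = swing_summary(merged)
print(summary)
```

Teachers "t5" and "t6" appear in only one year, so they drop out of the merge, just as about 5,000 of the 18,000 teachers did in the real files.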

With a correlation coefficient of .35 (and even that is inflated, for reasons I won’t get into right now), the scatter plot shows that teachers are not consistent from year to year, contrary to my #1, nor do a good number of them go up, contrary to my #2.  (You might argue that 51% go up, which is technically ‘most,’ but I’d say you’d get about 50% with a random number generator — which is basically what this is.)
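To make the correlation coefficient concrete: a value of .35 corresponds to an r² of about .12, so under a simple linear reading one year's score accounts for roughly 12% of the variation in the next year's. The sketch below is an illustration in Python using simulated scores, not the released data, and `pearson_r` is a hypothetical helper written out by hand. As a baseline for comparison, it shows what two completely unrelated sets of percentile scores look like: a correlation near zero and about half of the simulated teachers going up.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Simulated baseline: two unrelated percentile scores per "teacher".
random.seed(0)  # fixed seed so the run is repeatable
n_teachers = 10_000
year1 = [random.randint(0, 99) for _ in range(n_teachers)]
year2 = [random.randint(0, 99) for _ in range(n_teachers)]

r_random = pearson_r(year1, year2)
share_up = sum(b > a for a, b in zip(year1, year2)) / n_teachers

# Share of variance explained under a linear fit when r = 0.35.
r_squared_reported = 0.35 ** 2

print(f"random baseline: r = {r_random:.3f}, share going up = {share_up:.3f}")
print(f"r = 0.35 gives r^2 = {r_squared_reported:.4f}")
```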
But this may not sway you since you do think a teacher’s ability can change drastically in one year and also think that teachers get stale with age so you are not surprised that about half went down.
Then I ran the data again. This time, though, I used only the 707 teachers who were first year teachers in 2008-2009 and who stayed for a second year in 2009-2010. Just looking at the numbers, I saw that they were similar to the numbers for the whole group. The median amount of change (one way or the other) was still 21 points. The average change was still 25 points. But the amazing thing, which definitely proves how inaccurate these measures are, is that the percent of first year teachers who ‘improved’ on this metric in their second year was just 52%, contrary to what every teacher in the world knows: that nearly every second year teacher is better than she was in her first year. The scatter plot for teachers who were new teachers in 2008-2009 has the same characteristics as the scatter plot for all 13,000 teachers. Just like the graph above, the x-axis is the value-added score for the first year teacher in 2008-2009 while the y-axis is the value-added score for the same teacher in her second year during 2009-2010.
Reformers beware.  I’m just getting started.
Continued in part 2 …

9 Responses

  1. Thanks for the post. Where did you access the raw data? Or did you have to request?
  2. Sean
    GR, I’m with you. Publishing the scores was a puff-your-chest move. Rushing to include VAM in formal evaluations will turn the tide against a potentially promising tool. Demoralizing teachers is popular for reasons no rational person can understand. Economists and reformers can apply high levels of abstraction and little nuance to what is a complex profession.
    But you can’t put this type of research standard on VAM and then completely ignore it for current measures of quality: experience and master’s degrees.
    Take your second assumption: “Teachers generally improve each year.” For the first five years, there’s suggestive evidence. From years 6-10, a bit less. From years 10-30, close to none.
    Dan Goldhaber found nearly identical distributions of teacher quality comparing two groups: those with and those without master’s degrees.
    VAM, by comparison, is considerably more reliable. As a teacher for 15 years or whatever it is, you surely know that there’s (significant) variability in the quality of teachers. I think a better path is to let VAM breathe for a few years, let the modeling improve some, and then we’ll see.
    I ask you: in the tradeoff between type 1 (dismissing an effective teacher) and type 2 (keeping an ineffective one), which do you choose? The current system runs rampant with type 2. VAM obviously has serious potential for type 1 (and type 2).
    • Sean, what’s this with letting VAM breathe for a few years until the models improve?
      It’s not like they’re piloting this system. Starting in 2012-2013, teachers all across New York State will have 40% of their evaluations come from VAM – 20% from state tests, 20% from local tests, third party tests or the state tests measured a different way than the state measured them.
      If VAM is as unreliable as what Gary shows above, we’re going to see thousands of teachers unfairly tarred with the “ineffective” label who wind up in the NY Post with a glossy DOE-provided photo under the headline NY’S WORST TEACHERS.
      Maybe if I thought the Regents and the NYSED and Cuomo and Bloomberg and Gates and Murdoch and the rest of the so-called reformers weren’t trying to rid the system of thousands of teachers, I might trust them to implement this system fairly.
      But since I know that’s exactly what they want to do, I do not trust them or the system they want to implement.
      Given that Bloomberg is on record about wanting to fire 50% of NYC teachers, Gates thinks most teachers suck, and Merryl Tisch believes teachers are THE problem in public education (funny, she’s been a Regent for 15 years, but somehow the problems are never her fault), I think I would be a fool to trust them to fairly implement so complex and easy to manipulate a system.
      Therein lies the problem with VAM for me. I do not trust the people implementing it and it is so complex and non-transparent as it now stands that I would be a fool if I did.
      Perhaps as you say the model will improve later on.
      When that happens, we can then argue the wisdom of basing evaluations on high stakes tests.
      Until then, what we have is a poisoned and toxic environment that suggests teachers be wary of any “reforms” the powers that be want to implement, especially ones as complex as VAM. The publication of the TDRs after the DOE offered promises that they would never do that is an exclamation point on the need for wariness.
  3. jandh
    crucial typo in conclusion
    should read “is better THAN her first year.”
  4. Rafi Nolan
    An important note that I don’t believe has been mentioned by publishing organizations (and the reason you should not expect a large jump from first to second year): Teachers with one and two years of experience are graded separately, with their percentile rankings representing their performance within the “peer group”. For first year teachers, the peer group consists only of other first year teachers, and likewise for second year teachers; as a result, the expected net improvement in percentile rankings from the first to second year would be close to zero.
    Of course this means that it’s highly inappropriate to compare the percentile rankings of first/second year teachers to those of more experienced teachers. It only makes sense to compare within the same level of experience (>2 years was considered as all the same level of experience for peer grouping purposes). No online databases that I have seen have noted this effect.
    Disclosure: I am one of those teachers in your sample group who saw large improvements in value-added scores from the first to second year. I appear to benefit from comparisons to other teachers in my school (for 09-10), when we were in fact not in the same comparison group.
  5. Great work, Gary. I hope others will take the same approach and expose the problems that are apparent here. And frankly, even if the models improve and there’s stronger correlations, I wouldn’t accept those correlations as proof of overall teaching efficacy. There are still too many assumptions built into the models, and too little of our work in the classroom and school accounted for in the tests.

Analyzing Released NYC Value-Added Data Part 2

by Gary Rubinstein
In part 1 [see above - Editor] I demonstrated there was little correlation between how a teacher was rated in 2009 and how that same teacher was rated in 2010. So what can be more crazy than a teacher being rated highly effective one year and then highly ineffective the next? How about a teacher being rated highly effective and highly ineffective IN THE SAME YEAR?
I will show in this post how exactly that happened for hundreds of teachers in 2010.  By looking at the data I noticed that of the 18,000 entries in 2010, about 6,000 were repeated names.  This is because there are two ways that one teacher can get multiple value-added ratings for the same year.
The most common way this happens is when the teacher is teaching self-contained elementary in 3rd, 4th, or 5th grade.  The students take the state test in math and in language arts and that teacher gets two different effectiveness ratings.  So a teacher might, according to the formula, ‘add’ a lot of ‘value’ when it comes to math, but ‘add’ little ‘value’ (or even ‘subtract’ value) when it comes to language arts.
To those who don’t know a lot about education (yes, I’m talking to you ‘reformers’), it might seem reasonable that a teacher can do an excellent job in math and a poor job in language arts, and that it should not be surprising if the two scores for that teacher do not correlate. But those who do know about teaching would expect the amounts the students learn in the two subjects to correlate, since someone who is doing an excellent job teaching math is likely to be doing an excellent job teaching language arts. Both jobs are set up by some common groundwork that benefits all learning in the class. The teacher has good classroom management. The teacher has helped her students to be self-motivated. The teacher has a relationship with the families. All these things increase the amount of learning of every subject taught. So even if an elementary teacher is a little stronger in one subject than another, it is more about the learning environment that the teacher created than anything else.
Looking through the data I noticed teachers like a 5th grade teacher at P.S. 196 who scored 97 out of 100 in language arts and 2 out of 100 in math. This is with the same students in the same year! How can a teacher be so good and so bad at the same time? Any evaluation system in which this can happen is extremely flawed, of course, but I wanted to explore if this was a major outlier or if it was something quite common. I ran the numbers and the results shocked me (which is pretty hard to do). Here’s what I learned:
Out of 5,675 elementary school teachers, the average difference between the two scores was a whopping 22 points.  One out of six teachers, or approximately 17%, had a difference of 40 or more points.  One out of 25 teachers, which was 250 teachers altogether, had a difference of 60 or more points, and, believe it or not, 110 teachers, or about 2% (that’s one out of fifty!) had differences of 70 or more points.  At the risk of seeming repetitive, let me repeat that this was the same teacher, the same year, with the same kids.  Value-added was more inaccurate than I ever imagined.
I made a scatter plot of the 5,675 teachers.  On the x-axis is that teacher’s language arts score for 2010.  On the y-axis is that same teacher’s math score for 2010.  There is almost no correlation.
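A tally like the one above can be sketched in a few lines of Python. This is an illustration with made-up records, not the released data or the original analysis; the record layout and the helpers `paired_gaps` and `share_at_least` are assumptions. It pairs each teacher's two same-year scores, takes the absolute gap, and counts how many gaps clear each threshold.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (hypothetical teacher ID, subject, percentile score).
records = [
    ("a", "ela", 97), ("a", "math", 2),
    ("b", "ela", 50), ("b", "math", 45),
    ("c", "ela", 30), ("c", "math", 75),
    ("d", "ela", 60), ("d", "math", 62),
    ("e", "math", 80),  # only one rating this year, so excluded below
]


def paired_gaps(rows):
    """Absolute gap between ELA and math scores for teachers who have both."""
    by_teacher = defaultdict(dict)
    for teacher_id, subject, score in rows:
        by_teacher[teacher_id][subject] = score
    return {
        teacher_id: abs(scores["ela"] - scores["math"])
        for teacher_id, scores in by_teacher.items()
        if "ela" in scores and "math" in scores
    }


def share_at_least(gaps, threshold):
    """Fraction of paired teachers whose gap is at least `threshold` points."""
    return sum(gap >= threshold for gap in gaps.values()) / len(gaps)


gaps = paired_gaps(records)
print("average gap:", mean(gaps.values()))
for threshold in (40, 60, 70):
    print(f"gap >= {threshold}: {share_at_least(gaps, threshold):.0%}")
```

Keying the inner dictionary by grade instead of subject would handle a teacher who has ratings for two grades of the same subject.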
For people who know education, this is shocking, but there are people who probably are not convinced by my explanation that these should be more correlated if the formulas truly measured learning.  Some might think that this really just means that just like there are people who are better at math than language arts and vice versa, there are teachers who are better at teaching math than language arts and vice versa.
So I ran a different experiment for those who still aren’t convinced.  There is another scenario where a teacher got multiple ratings in the same year.  This is when a middle school math or language arts teacher teaches multiple grades in the same year.  So, for example, there is a teacher at M.S. 35 who taught 6th grade and 7th grade math.  As these scores are supposed to measure how well you advanced the kids that were in your class, regardless of their starting point, one would certainly expect a teacher to get approximately the same score on how well they taught 6th grade math and 7th grade math.  Maybe you could argue that some teachers are much better at teaching language arts than math, but it would take a lot to try to convince someone that some teachers are much better at teaching 6th grade math than 7th grade math.  But when I went to the data report for M.S. 35 I found that while this teacher scored 97 out of 100 for 6th grade math, she only scored a 6 out of 100 for 7th grade math.
Again, I investigated to see if this was just a bizarre outlier.  It wasn’t.  In fact, the spreads were even worse for teachers teaching one subject to multiple grades than they were for teaching different subjects to the same grade.
Out of 665 teachers who taught two different grade levels of the same subject in 2010, the average difference between the two scores was nearly 30 points. One out of four teachers, or approximately 28%, had a difference of 40 or more points. Ten percent of the teachers had differences of 60 points or more, and a full five percent had differences of 70 points or more. When I made my scatter plot with one grade on the x-axis and the other grade on the y-axis I found that the correlation coefficient was a minuscule .24.
Rather than report about these obvious ways to check how invalid these metrics are and how shameful it is that these scores have already been used in tenure decisions, or about how a similarly flawed formula will be used in the future to determine who to fire or who to give a bonus to, newspapers are treating these scores like they are meaningful. The New York Post searched for the teacher with the lowest score and wrote an article about ‘the worst teacher in the city’ with her picture attached. The New York Times must have felt they were taking the high road when they did a similar thing but, instead, found the ‘best’ teachers based on these ratings.
I hope that these two experiments I ran, particularly the second one where many teachers got drastically different results teaching different grades of the same subject, will bring to life the realities of these horrible formulas.  Though error rates have been reported, the absurdity of these results should help everyone understand that we need to spread the word since calculations like these will soon be used in nearly every state.
I’ve never asked the people who read my blog to do this before since I prefer that it happen spontaneously, but I’d ask for you to spread the word about this post.  Tweet it, email it, post it on Facebook.  Whatever needs to happen for this to go ‘viral,’ I’d appreciate it.  I don’t do this for money or for personal glory.  I do it because I can’t stand when people lie and teachers, and yes those teachers’ students, get hurt because of it.  I write these posts because I can’t stand by and watch it happen anymore.  All you have to do is share it with your friends.

14 Responses

  1. amazing; I have tweeted emailed & Facebooked it. thanks!
  2. KSK
    I found the link from your first post, and put it up on Facebook. This is an outrage — thank you for doing the statistics. (I am one of the elementary teachers in your plot.)
  3. maxine turner
    Yes, the many aspects of classroom environment affect teacher performance, but I wish you or someone would discuss the lack of supports we receive from administrators and the DOE themselves.
    Disruptive behavior, usually by only a couple of students, is never dealt with. And some teachers are set up to have less progress with their students because they get the students with the most emotional and scholastic needs. Add to this a cut in services to these kinds of kids and a cut of supplies and learning materials — it’s a wonder we ever teach anything at all.
  4. Great work!  Thank you for your hard work and dedication!  
    I am a passionate elementary special education teacher.  As a teacher of special education, it is obvious that I am extremely concerned about VAM.  Here are a few thoughts from a special education teacher’s point of view.
    Developmentally, do we expect our children to grow equally in both reading and math at the same time?  It is well documented that when babies/toddlers begin to walk, their speech might decline.  When babies/toddlers begin to speak in sentences, other milestones might maintain status quo.  How can we expect our children to win both the reading AND the math “Races” in the same year?  Olympic sprinters are not expected to win the marathon, also.  
    Success in reading and/or math in school is a team approach.  In my school, I would not, could not (we are celebrating the birthday of Dr. Seuss this week) take credit for a student’s growth without acknowledging the hard work and dedication of the child, family, reading specialist, math specialist, regular education teacher, speech pathologist, OT, lunch server, recess supervisor, secretary, principal, parent volunteers, school custodian, etc.  How can 1 teacher be measured for 1 child’s success?  
    When standardized tests are given in the fall of the school year, how can the current teacher who has worked with the child for approximately 6 weeks take credit for the hard work and dedication of the team who worked with the child the previous year?
  5. syoung
    I’m a retired teacher living on Vancouver Island, British Columbia. You reached me – so I’m hoping your incredible work will spread far and wide. Thanks for taking the time.
  6. Hi Gary, maybe I’m just looking in the wrong places, but I can’t seem to find a way to download the entire dataset. Could you give the link that you used? (I’m sure others would appreciate the same.) Thanks.
  7. This reminds me of a scatter plot my sister used in her masters thesis defense that showed no correlation – she connected the dots to make a picture of a donkey! It got a big laugh. Shared, and will disseminate. Thanks for the hard work on this, it’s so valuable.
  8. Ditto on the link. It’s really frustrating trying to find the data.
    And those R^2 values….damn.
  9. Tom
    You failed to run what I think would be the most obvious relationship, that between one year’s scores and the next.
    As a test run, I looked at just 4th grade math teachers in the 08-09 and 09-10 years. The correlation coefficient for this relationship turned out to be .44, which, for one variable in the social sciences, would be considered quite high.
    What this suggests is that, while one year of results should be taken with a grain of salt, after 3 or 4 years of data these numbers will become quite significant.
    While I do agree that no good comes from publishing this data (in particular single year scores), I think you too easily dismiss their usefulness in evaluating teachers over multi-year periods.
  10. Tom
    oops! I see you did that now in pt. 1. I would still argue that the .35 coefficient you found is high enough to draw conclusions from over a multi-year period.
    One other point that I think is getting missed here is the desire of the NY DOE to release these scores. As far as I can tell, they were forced to via the Freedom of Information act.
