Computer grading is here for STAAR essays. Should Fort Worth school leaders worry?

Having just adapted to a newly reformatted state test, school leaders across Texas are now facing another change in how their students are assessed: computer-based scoring.

The Texas Education Agency rolled out the new “automated scoring engine,” a computer-based grading system, in December, the Dallas Morning News reported. Following the change, about three-quarters of all essay questions will be scored by a computer program rather than human scorers.

School district leaders in the Fort Worth area say it’s too soon for them to tell whether the new grading system is a cause for concern. But some say they need more information about the new system.

“I think anytime a computer program is going to take on grading of something of this magnitude, I think it is concerning,” said Jennifer Price, chief academic officer for the Keller Independent School District.

Automated scoring comes amid STAAR reformat

The new scoring engine comes amid broader changes to the state test. Last year, the Texas Education Agency rolled out a newly revamped STAAR exam that includes more writing prompts and fewer multiple choice questions than previous versions. State education officials say the new test is designed to more closely mirror instruction students get in the classroom.

But open-ended responses like essays also take longer to score than multiple choice questions. TEA officials said using computer-based scoring in combination with human scorers allows the agency to score tests and get results back to districts more quickly and cheaply.

Chris Rozunik, director of the agency’s student assessment division, said the computer program scores exams based on the same rubric that human graders use. The agency is also using human-scored sample papers to train the engine on what to look for in students’ responses, she said.

Rozunik said the new engine isn’t an AI system with broad capabilities like ChatGPT, but rather a computer-based scoring system with narrow parameters. She noted the agency has used machine scoring for closed-ended questions like multiple choice prompts for years.

The agency is committed to having human scorers evaluate 25% of all essays, she said. The essays graded by humans include those the computer program can’t make sense of, as well as a certain number the agency randomly assigns for human review, she said.

The reasons the computer program might kick an essay to human graders are varied, Rozunik said. If a student enters a series of random letters instead of an answer, the computer won’t know how to evaluate it. But real answers, even good ones, can also baffle a computer program. If a student answers a question in a language other than English, the essay will be referred to a human, she said. Likewise, if a student gives an answer that is thoughtful and creative but doesn’t come in a form the computer recognizes, the answer will go to a human, who will be better able to score it appropriately, she said.

“We do not penalize kids for unique thinking,” she said.

The agency is already facing a lawsuit brought by several school districts, including the Fort Worth and Crowley independent school districts, over the state’s A-F accountability system, which is primarily based on STAAR scores. Last October, a state district judge temporarily blocked the agency from releasing that year’s A-F scores.

Fort Worth school officials want more clarity on scoring change

Price, the Keller ISD administrator, said she’s worried about what guardrails are in place for the new automated system. State education officials say the exam is no longer a high-stakes test for students, since their performance has no bearing on whether they move on to the next grade. But STAAR scores are still a high-stakes matter for school districts, since they’re the main factor in accountability ratings. Those ratings can affect how parents perceive their districts or campuses, ultimately influencing parents’ decisions about where to enroll their kids.

Given those stakes, Price doesn’t think state education officials have given districts enough information about how the new system works. The district has known the change was coming for about a year, she said, but TEA has given districts only limited details about what it would look like.

Melissa DeSimone, executive director of research, assessment and accountability for the Northwest Independent School District, said she doesn’t have enough data yet to know whether the new scoring system is a cause for concern. So far, TEA has used the automated engine only to score last December’s end-of-course exams. The district has gotten raw scores from that round of testing, she said, but hasn’t yet received students’ responses to test questions. Districts should get those responses sometime in late March, she said. At that point, the district can go through students’ answers and see whether they were scored appropriately, she said.

If the district does find discrepancies between the scores students received and the quality of their responses, officials can request that those tests be reevaluated by a human scorer, DeSimone said. The drawback is that those requests cost the district about $50 each if the scores come back the same, she said. The agency waives that fee if human scorers rate the response differently than the computer did.

District leaders have known that automated scoring was coming since the early part of last year, DeSimone said. The district didn’t adjust any of its test preparation because the automated scoring system is supposed to be based on the same rubric as human scoring, she said.

Fort Worth ISD officials weren’t available for an interview for this story. In an email, Melissa Kelly, the district’s associate superintendent of learning and leading, said there’s “a significant level of uncertainty” around how the new system will work.

So far, the district isn’t planning any major changes in response to the new scoring system, Kelly said. District leaders will stay focused on teaching Texas’ state-mandated standards and wait to see what results come out of the scoring change, she said.

Testing expert says automated scoring is growing

Kurt Geisinger, director of the Buros Center for Testing at the University of Nebraska–Lincoln, said the shift to automated grading shouldn’t be a big cause for concern for local school districts. Automated grading of essays is becoming more common across the country, he said, and for the most part, it’s been implemented without major problems.

A few years ago, Geisinger served as board chairman for the Graduate Record Examinations, an admissions test used by graduate schools across the country. At the time, the testing organization shifted to a hybrid AI-human grading model, in which each test would be scored by both a computer and a human, he said. The organization found that the AI program did about as well as the human grader, he said.

Geisinger said one of the admissions exams in use across the country — he wouldn’t say which test — is graded at least in part using AI. The grading program analyzes essays based on about 40 different criteria, he said. But the three factors that end up being most critical to the final score are the length of the essay, the number of paragraphs and the average word length, he said. That means those tests aren’t so much measuring the quality of writing as a few factors that often correlate with good writing, he said.

Using those factors as a proxy for judging the quality of writing has some drawbacks, Geisinger said. If a test-taker uses longer words, it can be a sign of a larger vocabulary, he said. But big words used awkwardly make for bad writing. If an AI system can’t tell whether a test-taker is using those words correctly, it may struggle to tell good writing from bad writing, he said.
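To make that point concrete, here is a minimal Python sketch, purely for illustration, of what the surface features Geisinger describes look like when a program computes them. It is not TEA’s scoring engine or any real vendor’s system; the function name and the toy example are hypothetical.

```python
# Illustrative sketch only -- not TEA's engine or any real vendor's system.
# It computes the three surface features Geisinger describes: essay length,
# number of paragraphs and average word length.

def surface_features(essay: str) -> dict:
    """Compute three surface-level features of an essay."""
    paragraphs = [p for p in essay.split("\n") if p.strip()]
    words = essay.split()  # raw whitespace tokens, punctuation included
    avg_word_length = (sum(len(w) for w in words) / len(words)) if words else 0.0
    return {
        "word_count": len(words),            # overall essay length
        "paragraph_count": len(paragraphs),  # number of paragraphs
        "avg_word_length": round(avg_word_length, 2),
    }

print(surface_features("Short essay.\nBig vocabulary words everywhere."))
# {'word_count': 6, 'paragraph_count': 2, 'avg_word_length': 6.67}
```

A sketch like this shows why such features are only a proxy: the numbers say nothing about whether the long words are used well, which is the gap Geisinger points to.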

Geisinger said some professors are also concerned about whether creativity in writing gets lost in the shift to AI grading, although he said he hasn’t seen any research to validate those concerns.

“I’ve heard English scholars say they wonder how someone like James Joyce would do on an AI-scored (test),” he said.