09 / 09, 2024
Autograder
Grading 80,000+ exam answers with BERT and BM25, at 98% accuracy.
Role
Engineer & TA, NLP pipeline, evaluation (SRM)
Stack
BERT · BM25 · FastAPI · Python
The Problem
Manually grading subjective exam answers at university scale is slow and exhausting, and consistency suffers most exactly when volume is highest.
This autograder scores 80,000+ answers across 8,000+ student submissions by combining semantic understanding from BERT with lexical relevance from BM25, reaching 98.1% agreement with human graders.
The Architecture
01 Hybrid semantic + lexical scoring
BERT captures whether an answer means the right thing; BM25 captures whether it contains the right terms. Combined, the system grades on understanding and coverage rather than either alone.
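A minimal sketch of how the two signals could be blended. The source does not say how the scores were combined, so the weighted average, the `alpha` weight, the batch-wise BM25 scaling, and the toy vectors standing in for BERT embeddings are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized doc for a tokenized query."""
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        )
    return score


def hybrid_score(semantic, lexical, alpha=0.5):
    """Weighted blend of a semantic and a lexical score, both in [0, 1]."""
    return alpha * semantic + (1 - alpha) * lexical


reference = "osmosis moves water across a semipermeable membrane".split()
answers = [
    "water moves across a semipermeable membrane by osmosis".split(),
    "plants need sunlight to grow".split(),
]
# Toy vectors standing in for BERT sentence embeddings (assumption).
ref_vec = [0.9, 0.1, 0.3]
answer_vecs = [[0.8, 0.2, 0.3], [0.1, 0.9, 0.2]]

lexical_raw = [bm25(reference, a, answers) for a in answers]
top = max(lexical_raw) or 1.0  # scale BM25 into [0, 1] within this batch
scores = [
    hybrid_score(cosine(ref_vec, v), raw / top)
    for v, raw in zip(answer_vecs, lexical_raw)
]
```

Here the on-topic answer outscores the off-topic one on both signals. In a real pipeline the vectors would come from a BERT encoder and `alpha` would be tuned against human marks.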
02 Reference-based evaluation
Each answer is scored against model answers, so the system measures closeness to a rubric rather than guessing in a vacuum.
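One way to picture reference-based scoring when a question has several acceptable model answers is to keep the closest match. This is a sketch only: the function names are hypothetical, and token-set Jaccard overlap is a simple stand-in for whatever similarity the real system used.

```python
def jaccard(a, b):
    """Token-set overlap; a stand-in for the real similarity function."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def score_against_references(answer, references, similarity=jaccard):
    """Return (best similarity, index of the closest model answer)."""
    sims = [similarity(answer, ref) for ref in references]
    best_index = max(range(len(sims)), key=sims.__getitem__)
    return sims[best_index], best_index


model_answers = [
    "the mitochondria produce energy for the cell",
    "mitochondria make atp",
]
closeness, matched = score_against_references(
    "mitochondria produce energy", model_answers
)
```

Taking the maximum means a student is not penalized for matching one valid phrasing rather than another.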
03 FastAPI grading service
The pipeline is exposed as a FastAPI service that batches submissions, so 80,000+ answers can be processed without manual handling.
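The batching step might look roughly like the following. It is framework-independent logic that a service endpoint could call, not the project's actual handler; `grade_batch`, `batch_size`, and the placeholder grader are assumptions.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def grade_in_batches(answers, grade_batch, batch_size=64):
    """Grade all answers batch by batch, preserving input order."""
    results = []
    for batch in batched(answers, batch_size):
        results.extend(grade_batch(batch))
    return results


# Placeholder grader: answer length stands in for a real model score.
marks = grade_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"],
    lambda batch: [len(text) for text in batch],
    batch_size=2,
)
```

Grouping answers this way lets the model score many inputs per forward pass instead of one request per answer.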
Decisions that mattered
Combine semantic and lexical signals
BERT alone over-credited fluent but wrong answers; BM25 alone over-credited keyword stuffing. The hybrid score corrected both failure modes.
Position it as a supplement, not a replacement
The tool was introduced as a parallel evaluation alongside human grading. High accuracy earns trust, but the human stays in the loop for the edge cases.
Validate against human graders relentlessly
98.1% accuracy is only meaningful when measured against real graders on real answers. The benchmark was agreement with humans, not an internal proxy metric.
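Agreement with human graders can be computed in several ways, and the source does not say which definition produced the 98.1% figure. The sketch below assumes the simplest one: the fraction of answers where the model's mark falls within a tolerance of the human's mark. The marks shown are made-up illustration data.

```python
def agreement_rate(model_marks, human_marks, tolerance=0.0):
    """Fraction of answers where model and human marks differ by <= tolerance."""
    if len(model_marks) != len(human_marks):
        raise ValueError("mark lists must have the same length")
    if not model_marks:
        raise ValueError("need at least one graded answer")
    matches = sum(
        abs(m - h) <= tolerance for m, h in zip(model_marks, human_marks)
    )
    return matches / len(model_marks)


# Illustrative marks only, not data from the project.
model = [4, 3, 5, 2]
human = [4, 3, 4, 2]
rate = agreement_rate(model, human)
```

Exact-match agreement is strict; a nonzero `tolerance` or a chance-corrected statistic such as Cohen's kappa would be reasonable alternatives depending on the marking scheme.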
The Numbers
98.1% grading accuracy
80K+ answers evaluated
8K+ student submissions
2 models combined
What it taught me
Hybrid models win when single models fail in opposite directions: semantic and lexical errors offset each other.
Automated grading is a trust problem as much as an accuracy problem: shipping it as a supplement, validated against humans, is what made it usable.