09 / 09, 2024

Autograder

Grading 80,000+ exam answers with BERT and BM25, at 98% accuracy.

Role

Engineer & TA, NLP pipeline, evaluation (SRM)

Stack

BERT · BM25 · FastAPI · Python

Links

Private project

The Problem

Manually grading subjective exam answers at university scale is slow and exhausting, and consistency suffers most exactly when volume is highest.

This autograder scores 8,000+ student submissions (80,000+ answers in total) by combining semantic understanding from BERT with lexical relevance from BM25, reaching 98.1% agreement with human graders.

The Architecture

01 Hybrid semantic + lexical scoring

BERT captures whether an answer means the right thing; BM25 captures whether it contains the right terms. Combined, the system grades on understanding and coverage rather than either alone.
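A minimal sketch of how the two signals could be blended, using only the standard library. The source does not say how the scores were normalised or weighted, so the following are all assumptions made for illustration: the `weight` parameter, the self-score normalisation in `lexical_scores`, and the idea that the semantic score is a cosine similarity between embedding vectors produced by a BERT encoder (the encoder itself is not shown).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace split; a real pipeline would use a proper tokenizer."""
    return text.lower().split()


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every document in docs_tokens against the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def lexical_scores(reference, answers):
    """BM25 of each answer against the reference, scaled into [0, 1].

    Assumption: scores are divided by the reference's score against itself
    and clamped at 1.0 so they can be blended with a cosine similarity.
    """
    corpus = [tokenize(reference)] + [tokenize(a) for a in answers]
    scores = bm25_scores(corpus[0], corpus)
    self_score = scores[0]
    if self_score == 0:
        return [0.0 for _ in answers]
    return [min(s / self_score, 1.0) for s in scores[1:]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (e.g. sentence embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(semantic, lexical, weight=0.5):
    """Weighted blend of two scores that are both already in [0, 1]."""
    return weight * semantic + (1 - weight) * lexical
```

With this shape, a fluent answer that misses the expected terms is pulled down by the lexical half, and a keyword-stuffed answer with little meaning is pulled down by the semantic half.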

02 Reference-based evaluation

Each answer is scored against model answers, so the system measures closeness to a rubric rather than guessing in a vacuum.
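One way to picture scoring against model answers is sketched here. It assumes an answer is compared with every model answer for the question and credited for the closest one, and that similarity maps linearly onto marks; neither detail is stated in the source. The `jaccard` function is a toy stand-in for the real similarity, and all names are hypothetical.

```python
def jaccard(a, b):
    """Toy stand-in similarity: token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def score_against_references(answer, references, max_marks, similarity=jaccard):
    """Return (marks, index of the closest model answer) for one student answer."""
    if not references:
        raise ValueError("at least one model answer is required")
    sims = [similarity(answer, ref) for ref in references]
    # Assumption: credit the answer for its closest model answer.
    best = max(range(len(sims)), key=sims.__getitem__)
    return round(sims[best] * max_marks, 1), best
```

Returning the index of the matched model answer keeps the score traceable, so a human reviewer can see which reference the mark was based on.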

03 FastAPI grading service

The pipeline is exposed as a FastAPI service that batches submissions, so 80,000+ answers can be processed without manual handling.
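The batching step could look roughly like the sketch below. It is framework-agnostic on purpose: in a FastAPI service, a route handler would call something like `grade_all` with the submitted answers. The function names, the `grade_batch` callback, and the default batch size of 64 are assumptions for illustration, not details from the project.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def grade_all(answers, grade_batch, batch_size=64):
    """Grade answers batch by batch, preserving input order.

    grade_batch is a hypothetical callable that takes a list of answers
    and returns one score per answer, e.g. a batched model forward pass.
    """
    results = []
    for batch in batched(answers, batch_size):
        scores = grade_batch(batch)
        if len(scores) != len(batch):
            raise ValueError("grade_batch must return one score per answer")
        results.extend(scores)
    return results
```

Grouping answers this way lets the model score many answers per call instead of one at a time, which is what makes tens of thousands of answers practical to process.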

Decisions that mattered

1.

Combine semantic and lexical signals

BERT alone over-credited fluent but wrong answers; BM25 alone over-credited keyword stuffing. The hybrid score corrected both failure modes.

2.

Position it as a supplement, not a replacement

The tool was introduced as a parallel evaluation alongside human grading. High accuracy earns trust, but the human stays in the loop for the edge cases.

3.

Validate against human graders relentlessly

98.1% accuracy is only meaningful when measured against real graders on real answers. The benchmark was agreement with humans, not an internal proxy metric.
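As a sketch of what agreement with human graders can mean in code: the source does not define how agreement was computed, so this assumes it is the share of answers where the automatic mark is within a chosen tolerance of the human mark. The function names and the `tolerance` parameter are hypothetical.

```python
def agreement_rate(auto_marks, human_marks, tolerance=0.0):
    """Fraction of answers where the two marks differ by at most tolerance."""
    if len(auto_marks) != len(human_marks):
        raise ValueError("both mark lists must have the same length")
    if not auto_marks:
        raise ValueError("no answers to compare")
    hits = sum(abs(a - h) <= tolerance for a, h in zip(auto_marks, human_marks))
    return hits / len(auto_marks)


def disagreements(auto_marks, human_marks, tolerance=0.0):
    """Indices of answers to send back for human review."""
    return [
        i
        for i, (a, h) in enumerate(zip(auto_marks, human_marks))
        if abs(a - h) > tolerance
    ]
```

The second function reflects the supplement-not-replacement stance: answers where the two marks diverge are exactly the edge cases a human should look at.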

The Numbers

98.1%

grading accuracy

80K+

answers evaluated

8K+

student submissions

2

models combined

What it taught me

Hybrid models win when single models fail in opposite directions: semantic and lexical errors cancel out.

Automated grading is a trust problem as much as an accuracy problem: shipping it as a supplement, validated against humans, is what made it usable.