08 / 09, 2022

Bayesian

A Gaussian Naive Bayes income classifier evaluated with confusion matrices, ROC analysis, and 10-fold cross validation.

Role

ML researcher, classifier evaluation and validation

Stack

Python · NumPy · Pandas · Matplotlib · Seaborn · Gaussian Naive Bayes · 10-fold CV

The Problem

The README describes a Gaussian Naive Bayes classifier that predicts whether a person earns more than 50K per year, using Bayes' theorem and maximum a posteriori (MAP) class selection.

The core concern is not only the model's accuracy but also whether the classifier overfits, how it compares with null accuracy, what errors it makes, and how stable it is under cross validation.

The Architecture

01 Bayesian classifier baseline

A Gaussian Naive Bayes model estimates class membership probabilities and predicts the most likely income class using Bayes' theorem.
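As a rough illustration of the MAP rule described above, here is a minimal from-scratch sketch in NumPy. It is not the project's code, which is not shown in this write-up. The function names, the `var_smoothing` default, and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors plus per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small additive term keeps every variance strictly positive
    # (hypothetical default chosen for this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_smoothing
    return classes, priors, means, variances

def predict_map(X, classes, priors, means, variances):
    """Pick the class with the highest log posterior for each row of X."""
    diff = X[:, None, :] - means[None, :, :]            # shape (n, classes, features)
    # Sum of per-feature Gaussian log densities (the "naive" independence step).
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :],
        axis=2,
    )
    log_post = np.log(priors)[None, :] + log_lik        # unnormalised log posterior
    return classes[np.argmax(log_post, axis=1)]

# Toy data: two well-separated clusters standing in for the two income classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X_toy, y_toy)
preds = predict_map(np.array([[1.0, 2.0], [5.0, 6.0]]), *params)
```

Working in log space avoids underflow when many per-feature densities are multiplied together.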

02 Evaluation notebook

The workflow uses NumPy, Pandas, Matplotlib, and Seaborn for data handling and visualization, then checks model accuracy, training accuracy, test accuracy, and null accuracy.

03 Confusion, ROC, and cross validation

The README reports confusion matrix counts, predicted probability histograms, an ROC curve, and 10-fold cross validation to evaluate both error types and fold stability.
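To make the ROC step concrete, the sketch below shows one way to turn predicted probabilities into ROC points and an area under the curve using only NumPy. The helper names `roc_points` and `auc_trapezoid` are hypothetical, the four labels and scores are made up, and the code assumes both classes appear in `y_true`.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    y_sorted = y_true[order]
    s_sorted = scores[order]
    # Keep one point per distinct score so tied scores share a threshold.
    last_of_run = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tps = np.cumsum(y_sorted == 1)[last_of_run]  # true positives at each threshold
    fps = np.cumsum(y_sorted == 0)[last_of_run]  # false positives at each threshold
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Illustrative inputs only, not the project's predictions.
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc = auc_trapezoid(fpr, tpr)
```

The resulting `fpr` and `tpr` arrays can be passed straight to a Matplotlib line plot to draw the curve.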

Decisions that mattered

1. Compare against null accuracy

The model accuracy of about 0.8083 is interpreted against a null accuracy of 0.7582, a margin of roughly 5 percentage points over always predicting the majority class.
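Null accuracy is simply the share of the most frequent class, so it can be computed without any model. A minimal sketch follows. The `null_accuracy` helper and the four toy labels are assumptions for illustration.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels: three of four examples belong to the majority class.
toy_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
baseline = null_accuracy(toy_labels)
```

A model is only adding value to the extent that its test accuracy exceeds this baseline.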

2. Inspect error types

The confusion matrix separates true positives, true negatives, false positives, and false negatives, so the evaluation covers what kind of mistakes the classifier makes.
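The four cells can be counted directly from true and predicted labels. The sketch below uses a hypothetical `confusion_counts` helper and made-up labels, with `1` standing for the over-50K class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Illustrative labels only.
cells = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
accuracy = (cells["tp"] + cells["tn"]) / sum(cells.values())
recall = cells["tp"] / (cells["tp"] + cells["fn"])
```

Two classifiers with the same accuracy can have very different recall, which is why the individual cells are worth reading.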

3. Validate fold stability

The README uses 10-fold cross validation and notes a small accuracy range across folds, suggesting the result is not dependent on one lucky split.
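A sketch of how the fold split and the stability summary might look in NumPy is given below. The helpers `k_fold_indices` and `summarize_scores`, the shuffling seed, and the sample count are assumptions; the project may well have used a library routine instead.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def summarize_scores(scores):
    """Mean and spread of per-fold accuracies."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": float(scores.mean()),
        "min": float(scores.min()),
        "max": float(scores.max()),
        "range": float(scores.max() - scores.min()),
    }

# Each sample lands in exactly one test fold across the 10 splits.
splits = list(k_fold_indices(25, k=10))
```

In use, a model is fitted on `train_idx` and scored on `test_idx` for each split, and the ten scores are passed to `summarize_scores`. A small `range` is what signals stability across folds.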

The Numbers

80.83%

model accuracy

80.63%

mean CV accuracy

75.82%

null accuracy

10

CV folds

What it taught me

A classifier result becomes more credible when it is compared with a null baseline, tested for overfitting, and validated across folds.

Confusion matrices and probability histograms explain model behavior in ways a single accuracy number cannot.