08 / 09, 2022
Bayesian
A Gaussian Naive Bayes income classifier evaluated with confusion matrices, ROC analysis, and 10-fold cross-validation.
Role
ML researcher, classifier evaluation and validation
Stack
Python · NumPy · Pandas · Matplotlib · Seaborn · Gaussian Naive Bayes · 10-fold CV
The Problem
The project builds a Gaussian Naive Bayes classifier to predict whether a person earns more than 50K per year, using Bayes' theorem and maximum a posteriori class selection.
The core concern is not only the model's accuracy but also whether the classifier overfits, how it compares with null accuracy, which errors it makes, and how stable it is under cross-validation.
The Architecture
01 Bayesian classifier baseline
A Gaussian Naive Bayes model estimates class membership probabilities and predicts the most likely income class using Bayes' theorem.
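The idea above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's actual code: the class name `GaussianNBSketch`, its method names, and the `var_smoothing` floor are assumptions introduced here. It fits one mean and variance per feature and class, adds the log prior, and picks the class with the highest posterior (the MAP rule).

```python
import numpy as np


class GaussianNBSketch:
    """Hypothetical minimal Gaussian Naive Bayes (illustration only)."""

    def fit(self, X, y, var_smoothing=1e-9):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Class priors P(c), stored as logs.
        self.log_prior_ = np.log(
            np.array([np.mean(y == c) for c in self.classes_])
        )
        # Per-class, per-feature Gaussian parameters.
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # The small floor (an assumption) avoids zero variance on constant features.
        self.var_ = (
            np.array([X[y == c].var(axis=0) for c in self.classes_])
            + var_smoothing
        )
        return self

    def joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        # diff has shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * self.var_)[None, :, :]
                          + diff ** 2 / self.var_[None, :, :])
        # "Naive" independence: sum the per-feature log densities.
        return self.log_prior_[None, :] + log_pdf.sum(axis=2)

    def predict_proba(self, X):
        jll = self.joint_log_likelihood(X)
        # Subtract the row maximum before exponentiating for numerical stability.
        jll = jll - jll.max(axis=1, keepdims=True)
        p = np.exp(jll)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        # MAP decision: the class with the largest log posterior.
        return self.classes_[np.argmax(self.joint_log_likelihood(X), axis=1)]
```

Working in log space keeps the product of many small densities from underflowing, which matters once the feature count grows.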
02 Evaluation notebook
The workflow uses NumPy, Pandas, Matplotlib, and Seaborn for data handling and visualization, then checks training accuracy against test accuracy for signs of overfitting and compares model accuracy with null accuracy.
03 Confusion, ROC, and cross-validation
The README reports confusion matrix counts, predicted probability histograms, an ROC curve, and 10-fold cross-validation to evaluate both error types and fold stability.
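For the ROC part, the curve comes from sweeping a decision threshold over the predicted probabilities and recording the true and false positive rates at each step. The sketch below is an assumption-laden illustration: `roc_points` and `auc` are hypothetical helper names, positives are assumed to be labeled 1 and negatives 0, and both classes are assumed to be present.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays traced by lowering the threshold over scores.

    Assumes binary labels in {0, 1} with at least one of each.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Sort by descending score so each prefix is "everything predicted positive".
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]
    tps = np.cumsum(y_sorted == 1)
    fps = np.cumsum(y_sorted == 0)
    # Keep one point per distinct score so tied scores move together.
    last_of_run = np.r_[np.diff(s_sorted) != 0, True]
    tpr = np.r_[0.0, tps[last_of_run] / tps[-1]]
    fpr = np.r_[0.0, fps[last_of_run] / fps[-1]]
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

The resulting `fpr` and `tpr` arrays can be passed straight to Matplotlib's `plot`, with the diagonal from (0, 0) to (1, 1) as the chance reference.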
Decisions that mattered
Compare against null accuracy
The model accuracy of about 0.8083 is read against a null accuracy of 0.7582, the score obtained by always predicting the majority class. The margin of roughly 5 percentage points is what makes the result meaningful.
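Null accuracy is simple to compute: it is the share of the most frequent class in the evaluation labels. A minimal sketch follows; the helper name `null_accuracy` is an assumption, and the two figures are the ones reported on this page.

```python
import numpy as np


def null_accuracy(y):
    """Accuracy of always predicting the most frequent class in y."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()


# Figures reported on this page.
reported_model_accuracy = 0.8083
reported_null_accuracy = 0.7582

# Margin over the majority-class baseline, in accuracy units.
lift = reported_model_accuracy - reported_null_accuracy
```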
Inspect error types
The confusion matrix separates true positives, true negatives, false positives, and false negatives, so the evaluation covers what kind of mistakes the classifier makes.
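As a sketch of how those four counts can be derived from label arrays with NumPy alone: the function name `confusion_counts` and the convention that the positive class (income over 50K) is encoded as 1 are assumptions, not details taken from the README.

```python
import numpy as np


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    actual_pos = y_true == positive
    predicted_pos = y_pred == positive
    return {
        "TP": int(np.sum(actual_pos & predicted_pos)),
        "TN": int(np.sum(~actual_pos & ~predicted_pos)),
        "FP": int(np.sum(~actual_pos & predicted_pos)),   # Type I error
        "FN": int(np.sum(actual_pos & ~predicted_pos)),   # Type II error
    }


def accuracy_from_counts(c):
    """Overall accuracy recovered from the four counts."""
    return (c["TP"] + c["TN"]) / (c["TP"] + c["TN"] + c["FP"] + c["FN"])
```

Keeping false positives and false negatives as separate counts is what lets sensitivity and specificity be reported alongside the single accuracy figure.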
Validate fold stability
The project uses 10-fold cross-validation, and the README notes a small accuracy range across folds, suggesting the result does not depend on one lucky split.
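The fold-stability check can be sketched as below. This is a generic illustration under stated assumptions: `k_fold_indices` and `cross_val_accuracy` are hypothetical helpers, the rows are shuffled with a fixed seed (the README does not say whether its folds were shuffled), and `fit_predict` stands in for any classifier wrapped as a function of training data and test features.

```python
import numpy as np


def k_fold_indices(n_samples, k=10, seed=0):
    """Split shuffled row indices into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_val_accuracy(fit_predict, X, y, k=10, seed=0):
    """Per-fold accuracy for fit_predict(X_train, y_train, X_test) -> y_pred."""
    X = np.asarray(X)
    y = np.asarray(y)
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate(
            [fold for j, fold in enumerate(folds) if j != i]
        )
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(y_pred == y[test_idx]))
    return np.array(scores)
```

Reporting `scores.mean()` together with `scores.min()` and `scores.max()` gives both the headline figure and the spread across folds.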
The Numbers
80.83%
model accuracy
80.63%
mean CV accuracy
75.82%
null accuracy
10
CV folds
What it taught me
A classifier result becomes more credible when it is compared with a null baseline, tested for overfitting, and validated across folds.
Confusion matrices and probability histograms explain model behavior in ways a single accuracy number cannot.