
Imbalanced Dataset: Be cautious, don’t fall into the Accuracy pit!

Recently, I was trying to build a predictive model for an organization that was carrying out Employee verification and background checks.

Like a typical data scientist (with a deadline looming), I rushed into building classification models and found that most of them had an accuracy of over 99.5% on the test dataset. Wow, amazing… right?

Nooo… In spite of such great accuracy, I wasn’t feeling happy. There was an uneasiness running through my mind, a feeling that I was doing something incorrectly, so I paused. I looked into the notes and slides of my (genius) professor, Dr. Vineeth Balasubramanian (Asst. Prof. at Indian Institute of Technology, Hyderabad), and found the point that Sir had made in class: “Don’t fall into the accuracy pit – The data might be imbalanced”. Oh gosh, what a relief :)

I went back and carried out the data exploration phase, and found that the data was extremely skewed towards “Good to hire”. Man, that means only 1 in 4500 potential employees had provided incorrect (I was tempted to use the word ‘fake’) experience details and/or had some ‘criminal’ background.

With such an imbalanced dataset, even a blind guess that labels every potential employee as genuine would be right 4499 times out of 4500, which is an accuracy of about 99.98%, without ever catching a single fake profile!
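To make the pit concrete, here is a minimal sketch of that blind guess. The labels are made up purely to match the 1 in 4500 ratio mentioned above (they are not the actual project data), and the variable names are my own. It compares accuracy with recall on the minority class:

```python
# Hypothetical labels matching the 1-in-4500 ratio from the post:
# 1 = incorrect/fake details, 0 = "Good to hire". Not the real dataset.
n_total = 4500
y_true = [1] + [0] * (n_total - 1)

# A "model" that blindly predicts the majority class for everyone.
y_pred = [0] * n_total

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall on the minority class: fraction of fake profiles actually caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.4%}, minority recall = {recall:.0%}")
```

The accuracy looks superb while the recall is zero, which is why metrics such as recall, precision or a confusion matrix tell you far more than accuracy on data like this.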

Later, I tried several techniques, such as over-sampling the minority class, under-sampling the majority class, and SMOTE. Things went very well, and I was relieved that I was able to build a better data product.
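For anyone who has not seen these techniques, here is a rough sketch of random over-sampling and random under-sampling using only the Python standard library. The toy data and the function names (`random_oversample`, `random_undersample`) are hypothetical and only for illustration; this is not the code I used on the project, and SMOTE (which creates synthetic minority samples by interpolating between neighbours) is not shown here.

```python
import random

def _split(rows, labels, minority_label):
    """Separate (row, label) pairs into minority and majority groups."""
    pairs = list(zip(rows, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    return minority, majority

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split(rows, labels, minority_label)
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [l for _, l in combined]

def random_undersample(rows, labels, minority_label, seed=0):
    """Keep a random subset of majority rows equal in size to the minority."""
    rng = random.Random(seed)
    minority, majority = _split(rows, labels, minority_label)
    kept = rng.sample(majority, min(len(minority), len(majority)))
    combined = kept + minority
    rng.shuffle(combined)
    return [r for r, _ in combined], [l for _, l in combined]

# Toy training data: 18 "Good to hire" rows (0) and 2 problematic rows (1).
rows = [[i] for i in range(20)]
labels = [0] * 18 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels, minority_label=1)
under_rows, under_labels = random_undersample(rows, labels, minority_label=1)

print("over-sampled counts:", over_labels.count(0), over_labels.count(1))
print("under-sampled counts:", under_labels.count(0), under_labels.count(1))
```

One caution: resample only the training split. If the test set is resampled too, the evaluation no longer reflects the real class ratio the model will face.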

Thank you, Sir (Dr. Vineeth), for your golden words: “Don’t fall into the accuracy pit – The data might be imbalanced”.

Good luck, “Scientist”!

https://www.linkedin.com/pulse/imbalanced-dataset-cautious-dont-fall-accuracy-pit-safdar-hussain/
