One of the problems with machine learning is that it relies heavily on datasets labeled by people, and people can sometimes have a biased view of the world. Beyond that, a dataset can also be biased in the sense of over-representing one part of a population while excluding others.
Explore your data
We can use the Kaggle dataset 120 years of Olympic history: athletes and results to investigate this topic.
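A first step is to look at the weight distribution per sex. A minimal sketch: in practice you would load the CSV file downloaded from Kaggle; here a tiny hand-made frame stands in for it so the snippet runs on its own.

```python
import pandas as pd

# In practice you would load the file from the Kaggle dataset, e.g.:
# df = pd.read_csv("athlete_events.csv")
# Here, a tiny hand-made frame stands in for it.
df = pd.DataFrame({
    "Sex":    ["M", "M", "F", "F", "M", "F"],
    "Weight": [82.0, 75.0, 60.0, 58.0, 90.0, 63.0],
})

# Mean and standard deviation of weight, per sex
stats = df.groupby("Sex")["Weight"].agg(["mean", "std"])
print(stats)
```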
Looking at the Distributions
You suddenly realize that the average weight of the male population is more than one standard deviation away from the mean of the female distribution. You think to yourself: could I classify the gender of an athlete with a cutoff weight, maybe one standard deviation above the female average? You then write a quick snippet of code that does exactly that.
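That snippet might look something like this (a sketch using toy arrays in place of the real Weight and Sex columns; the cutoff is one standard deviation above the female mean, as described above):

```python
import numpy as np

# Toy stand-ins for the real Weight / Sex columns
weights = np.array([82.0, 75.0, 60.0, 58.0, 90.0, 63.0])
sexes   = np.array(["M", "M", "F", "F", "M", "F"])

# Cutoff: one standard deviation above the female mean weight
female_weights = weights[sexes == "F"]
cutoff = female_weights.mean() + female_weights.std()

# Classify: above the cutoff -> "M", otherwise -> "F"
predicted = np.where(weights > cutoff, "M", "F")
accuracy = (predicted == sexes).mean()
print(f"cutoff={cutoff:.1f}kg  accuracy={accuracy:.0%}")
```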
You may think that the cutoff value could be optimized to maximize the overall accuracy, for example by trying a list of cutoffs from, let's say, 60 to 80 kg.
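A sketch of that sweep, again on toy data (the real run would use the full Weight and Sex columns):

```python
import numpy as np

# Toy stand-ins for the real Weight / Sex columns
weights = np.array([82.0, 75.0, 60.0, 58.0, 90.0, 63.0])
sexes   = np.array(["M", "M", "F", "F", "M", "F"])

def accuracy_for(cutoff):
    # Accuracy of the rule "above the cutoff -> male"
    predicted = np.where(weights > cutoff, "M", "F")
    return (predicted == sexes).mean()

# Try every integer cutoff from 60 to 80 kg and keep the best one
cutoffs = np.arange(60, 81)
accuracies = [accuracy_for(c) for c in cutoffs]
best = int(cutoffs[int(np.argmax(accuracies))])
print(f"best cutoff={best}kg  accuracy={max(accuracies):.0%}")
```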
You get a new accuracy of around 80%. Pretty impressive just from looking at the weights, no? Would you expect a friend of yours, looking at hundreds of thousands of weights, to get the gender right 8 times out of 10?
Don't rely just on accuracy
That's where you go wrong: you can't be satisfied with a model just by looking at its accuracy; you should also look at other metrics. The problem with this dataset is what machine learning calls prevalence. If you look at the proportion of males and females in your dataset, you will find this.
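Computing the class proportions is one line with pandas (a sketch on a toy series; the real dataset is far larger but shows the same kind of skew):

```python
import pandas as pd

# Toy stand-in for the real Sex column
sexes = pd.Series(["M", "M", "F", "F", "M", "F", "M", "M", "M", "M"])

# Prevalence: the proportion of each class in the data
prevalence = sexes.value_counts(normalize=True)
print(prevalence)
```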
You have way more males than females in your dataset. You can learn more about your model from what's called the confusion matrix, which tells you how many false positives and false negatives you had.
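One way to build the confusion matrix is a cross-tabulation of actual versus predicted labels (a sketch with hand-made labels; the real ones would come from the cutoff model):

```python
import numpy as np
import pandas as pd

# Toy actual vs predicted labels standing in for the model's output
actual    = np.array(["M", "M", "F", "F", "M", "F", "M", "M", "M", "M"])
predicted = np.array(["M", "M", "F", "M", "M", "M", "M", "M", "F", "M"])

# Rows: true class, columns: predicted class
confusion = pd.crosstab(pd.Series(actual, name="actual"),
                        pd.Series(predicted, name="predicted"))
print(confusion)
```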
You can see where you're going wrong: about 50% of the female population is being classified as male! But how can the model still have around 80% accuracy? Because the dataset has a low proportion of females, the times you get the female gender wrong are more than balanced by the times you get the male gender right.
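You can make this concrete by computing the per-class recall next to the overall accuracy (a sketch; the toy labels below are hand-made to mimic the numbers above, with few females and a model that gets most males right):

```python
import numpy as np

# Hand-made labels: 8 males, 2 females; one of each class misclassified
actual    = np.array(["M", "M", "M", "M", "M", "M", "M", "M", "F", "F"])
predicted = np.array(["M", "M", "M", "M", "M", "M", "M", "F", "F", "M"])

accuracy = (predicted == actual).mean()
for cls in ("F", "M"):
    mask = actual == cls
    # Recall: of the athletes that truly are this class, how many we caught
    recall = (predicted[mask] == cls).mean()
    print(f"recall for {cls}: {recall:.0%}")
print(f"overall accuracy: {accuracy:.0%}")
```

Half the females are misclassified, yet the overall accuracy still lands at 80%, because the abundant male class dominates the average.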
This is why, with machine learning models, we should always be careful about which models we put in production. If you want more content like this, I have a YouTube Channel.