Bias in Data Science? 3 Most Common Types and Ways to Deal with Them
June 24th, 2021
Book titles and newspaper headlines about biased algorithms are the new normal in a world regulated by algorithms and data science.
The more the world relies on data-based algorithms, the more of these stories we'll see. Yet machine learning bias is not a new thing.
One of the most cited IT articles on this theme, "Reducing Bias and Inefficiency in the Selection Algorithm" by James Edward Baker, dates from 1987 and has over 2,000 citations. We have been dealing with this problem for decades.
This is an old issue with current consequences. No data person (scientist, engineer, analyst, and so on) is free from making this kind of mistake. The best way to deal with biases is to talk about them. So, in this article, I'll guide you through the 3 most common types of bias and give you some tools and ideas on how to avoid them.
But first, let's define algorithmic bias. In its simplest definition, algorithmic bias "describes systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others".
So, let's see the 3 most typical scenarios in which this happens, and the techniques to deal with them.
1 - Confirmation Bias
The most common of the biases; we've all been victims of it.
It happens when we approach the data with a prior expectation of what we hope to see. It can be conscious or unconscious, so it's crucial for a data scientist to stay attentive to it. Our expectations cloud our judgement, and we can end up drawing conclusions from the data that don't match reality.
A classic example is Anscombe's Quartet: 4 different datasets, each with 2 variables, x and y. If you compute simple summary statistics, you'll get nearly identical results for all of them.
The average x value is 9 for each dataset
The average y value is 7.50 for each dataset
The variance for x is 11 and the variance for y is 4.12
The correlation between x and y is 0.816 for each dataset
A linear regression (line of best fit) for each dataset follows the equation y = 0.5x + 3
At first, you might think the datasets are equivalent, since their descriptive statistics match. But then you plot them and, to your surprise, they are all different. Reality does not match your expectation.
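You can verify this yourself in a few lines. The sketch below hard-codes the quartet's standard values and computes the summary statistics from scratch with Python's standard library (no pandas or numpy assumed); all four datasets come out essentially identical, matching the numbers above up to rounding:

```python
from statistics import mean, variance

# Anscombe's Quartet: four datasets with near-identical summary statistics.
quartet = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

for name, (x, y) in quartet.items():
    print(f"{name}: mean_x={mean(x):.2f}  mean_y={mean(y):.2f}  "
          f"var_x={variance(x):.2f}  r={pearson_r(x, y):.3f}")
```

A quick scatter plot of each pair (for example with matplotlib) is what reveals how different the four datasets actually are: only the plot breaks the illusion that the statistics create.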
There is no easy way out of this kind of bias. Two major ways to prevent it are to keep a standardized data-analysis protocol that you use consistently across projects, and to keep domain experts close to evaluate your work. Make sure those domain experts reflect the diversity of your expected clients, so you don't simply reproduce the same bias through them.
2 - Sampling Bias (also known as selection bias)
Sampling bias happens when a dataset does not reflect the population in which the model will be used. Take Amazon's failed AI recruitment tool as an example. The model penalized strong applications from women because its training dataset was overwhelmingly filled with male applicants. Its accuracy was also diminished for BIPOC candidates.
There are methods to mitigate this bias, such as synthetic data (generating artificial data for underrepresented groups) and resampling techniques (creating a subset of the original data that balances groups). But first, we must be aware that the sampling bias exists at all. For that, tools such as Aequitas, a free tool for running bias and fairness audits on your projects, are fantastic.
Using tools such as this one should become a standard in your work.
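To make the resampling idea concrete, here is a minimal sketch of one common variant, downsampling the majority group to the size of the smallest one. The group labels and record counts are invented for illustration; real pipelines typically use library helpers such as `sklearn.utils.resample` instead:

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical imbalanced training pool: 900 records from group "A",
# only 100 from group "B" (sizes and labels are made up for illustration).
records = [{"group": "A", "id": i} for i in range(900)] + \
          [{"group": "B", "id": i} for i in range(100)]

def downsample_to_balance(records, key="group"):
    """Return a subset where every group appears as often as the smallest one."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    n = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(random.sample(members, n))
    return balanced

balanced = downsample_to_balance(records)
print(Counter(r["group"] for r in balanced))  # each group now has 100 records
```

Downsampling throws away data from the majority group, so on small datasets you may prefer upsampling the minority group or generating synthetic records instead.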
3 - Association Bias
This is the most prevalent bias in the news today. It usually appears when a model's output reinforces a cultural bias present in the data.
It can be as simple as a word-association model that pairs men with programmers and women with nurses, reinforcing a stereotype. It can also be as serious and problematic as the AI applied to court decisions in the USA, which consistently treated African Americans more harshly than White Americans.
The algorithm, called COMPAS, was created to predict each defendant's risk of recidivism and, based on that risk, calculate the bail value. While it worked reasonably well when comparing bail values within the same race, it was clearly biased against African Americans compared to White Americans. The calculations reflect the data, exposing the bias that already exists in the judicial system.
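One simple way such a disparity shows up in an audit is by comparing false positive rates across groups: how often people who did not reoffend were still flagged as high risk. The sketch below uses tiny, entirely made-up toy data (groups "X" and "Y", invented outcomes) purely to show the shape of the check:

```python
# Hypothetical audit rows: (group, predicted_high_risk, reoffended).
# All values are invented for illustration; a real audit would use model outputs.
outcomes = [
    ("X", True,  False), ("X", True,  False), ("X", False, False), ("X", False, False),
    ("X", True,  True),  ("X", False, True),
    ("Y", True,  False), ("Y", False, False), ("Y", False, False), ("Y", False, False),
    ("Y", True,  True),  ("Y", False, True),
]

def false_positive_rate(rows, group):
    """Share of non-reoffenders in `group` who were still flagged high risk."""
    negatives = [r for r in rows if r[0] == group and not r[2]]
    flagged = [r for r in negatives if r[1]]
    return len(flagged) / len(negatives)

for g in ("X", "Y"):
    print(f"group {g}: FPR = {false_positive_rate(outcomes, g):.2f}")
# Group X is flagged at twice group Y's rate: a red flag worth investigating.
```

A gap like this between groups is exactly the kind of signal tools like Aequitas surface automatically across many fairness metrics at once.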
Solutions for this type of bias are neither easy nor uncontroversial. Current best practice is to have diverse teams run consistent bias reviews. A diverse team, one that reflects the population the model will serve, is the best path forward.
Reflecting the world in our teams, having the work reviewed by more than one group, and being alert to the fact that bias is a part of data science will help us move forward towards a more fair algorithmic world.
Remember, fellow data people, stay cautious, stay alert and do your best to avoid bias.
It's all that could be asked of a good data pro. 😉