Human beings define the problem, collect the data, and train, test and deploy the machine learning model. How can you expect the model to be fair when humans are biased? This question has received a lot of attention recently. Bias can enter at every stage of the model-building process and cause the model to behave unfairly. I have listed ten known but often overlooked areas that can cause harm.
1. Task definition can cause unfairness. For instance, suppose you define a task to classify gender but intend to classify only male and female. This definition forces everyone in the population into one of two classes, whereas in reality there are more than two gender identities.
2. The technology or system used for data collection can also introduce bias and make the model unfair.
For instance, you collect data from smartphones and train a model for the entire population. Because this approach omits people who don't use smartphones, you have already introduced bias.
Another well-known example is an online application form that offers only male and female in the drop-down for the Gender field. The form forces applicants to choose one of the two, which either misrepresents them in the data or keeps applicants of other genders from being onboarded at all.
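3. The labelling process can cause bias for two reasons:
First, labelling is originally a manual process, so the labels already carry human preferences and biases.
Second, incorrect automated labelling will mislabel the data and introduce bias. For example, is a tomato a fruit or a vegetable? The answer depends on whether you follow the botanical or the culinary convention, and an automated labeller applies one convention to every record.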
4. Preprocessing data is another leading cause of models behaving unfairly. One well-known practice when treating missing values or outliers is to remove the entire row if the number of affected rows is relatively small. Unfortunately, if those rows belong mostly to a minority population, you end up training the model with little or no data from that group. If this happens to be a loan approval model, it has almost nothing to learn from about that minority group, and its members may be denied loans unfairly.
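To make the preprocessing point concrete, here is a minimal sketch in plain Python. The loan records, the group names and the counts are all made up for illustration; the only assumption is that missing values are more common in the minority group.

```python
from collections import Counter

# Hypothetical loan applications as (group, income) pairs.
# None marks a missing income. All numbers are invented for illustration.
rows = (
    [("majority", 50_000)] * 90
    + [("majority", None)] * 5
    + [("minority", 30_000)] * 1
    + [("minority", None)] * 4
)


def group_share(data):
    """Return each group's share of the rows in `data`."""
    counts = Counter(group for group, _ in data)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# The common preprocessing step: drop every row that has a missing value.
complete_rows = [row for row in rows if row[1] is not None]

before = group_share(rows)
after = group_share(complete_rows)

print(f"minority share before dropping rows: {before['minority']:.3f}")
print(f"minority share after dropping rows:  {after['minority']:.3f}")
```

Only 9 of 100 rows are dropped, which looks harmless, yet the minority group shrinks from five records to one. Comparing group shares before and after a cleaning step is a cheap check worth running before you train.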
5. The machine learning model itself can also bias the outcome. Consider a population with two groups, a majority group and a minority group, that look fundamentally different. A single linear classifier with a single threshold for the whole population will then produce different error rates for the two subpopulations.
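The effect of a single threshold can be sketched with a few lines of Python. The scores below are invented and stand in for the output of one shared classifier; the assumption is that true positives in the minority group tend to receive lower scores than true positives in the majority group.

```python
# Hypothetical (score, true_label) pairs produced by one shared model.
# All values are invented for illustration.
majority = [
    (0.9, 1), (0.8, 1), (0.7, 1), (0.6, 1),
    (0.3, 0), (0.2, 0), (0.2, 0), (0.1, 0),
]
minority = [
    (0.6, 1), (0.45, 1), (0.4, 1), (0.35, 1),
    (0.3, 0), (0.2, 0), (0.15, 0), (0.1, 0),
]

# One decision threshold applied to the whole population.
THRESHOLD = 0.5


def error_rate(examples, threshold):
    """Fraction of examples where the thresholded score disagrees with the label."""
    errors = sum(
        1 for score, label in examples if int(score >= threshold) != label
    )
    return errors / len(examples)


print("majority error rate:", error_rate(majority, THRESHOLD))
print("minority error rate:", error_rate(minority, THRESHOLD))
```

With these made-up numbers the threshold is perfect for the majority group and wrong for three of the eight minority examples. Reporting error rates per group, rather than one overall number, is what makes this kind of gap visible.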
6. People: If the people involved in data collection do not come from different cultural backgrounds, they will tend to collect data from their own culture, which skews the data.
7. Demography: Leaving a demographic segment out of the training data makes the data unrepresentative. Including data from all demographic segments during training helps ensure that the data reflects the population and minimises unfairness in the model outcome.
8. Testing and Deployment: The test data should reflect the proportions of all demographic groups in the population. Otherwise testing will be ineffective, and unfair behaviour will go undetected. You may wonder how deployment can cause fairness issues. It does when you deploy the model to a population that includes a subpopulation you didn't have in the training phase.
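Both checks from this point can be sketched in plain Python. This is one possible approach, not the only one: `stratified_split` is a hypothetical helper that splits each group separately so the test set keeps the population's proportions, and the groups "A", "B" and "C" are invented.

```python
import random
from collections import defaultdict


def stratified_split(records, group_of, test_fraction, seed=0):
    """Split records into train and test, sampling each group separately."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for record in records:
        by_group[group_of(record)].append(record)

    train, test = [], []
    for members in by_group.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical population: 80% group A, 20% group B.
records = [{"id": i, "group": "A"} for i in range(80)]
records += [{"id": i, "group": "B"} for i in range(80, 100)]

train, test = stratified_split(records, lambda r: r["group"], test_fraction=0.2)
test_share_b = sum(r["group"] == "B" for r in test) / len(test)
print("share of group B in test set:", test_share_b)

# Deployment check: which groups in the target population were never trained on?
deployment_groups = {"A", "B", "C"}  # assumed groups in the deployment population
unseen_groups = deployment_groups - {r["group"] for r in train}
print("groups unseen during training:", unseen_groups)
```

The test set keeps the 80/20 ratio of the population, and the set difference flags group "C" as a subpopulation the model has never seen.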
9. Feedback mechanism: The feedback mechanism itself doesn't cause harm, but a biased ML model will produce biased feedback, thereby widening the gap that already exists.
For instance, if your model doesn't treat women fairly, dissatisfied women will stop using your AI system. The feedback then comes increasingly, and eventually only, from men, which skews the learning process further towards men.
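A toy simulation shows how quickly this gap can widen. The retention rates are assumptions chosen for illustration, not measurements, and the sketch models only who stays to give feedback, not the retraining itself.

```python
def feedback_share(rounds, retention_women=0.7, retention_men=0.95):
    """Return the share of feedback coming from women in each round.

    The retention rates are hypothetical: dissatisfied users are assumed
    to leave faster than satisfied ones.
    """
    women, men = 1000.0, 1000.0  # start from a balanced user base
    shares = []
    for _ in range(rounds):
        shares.append(women / (women + men))
        women *= retention_women
        men *= retention_men
    return shares


shares = feedback_share(rounds=10)
for round_number, share in enumerate(shares):
    print(f"round {round_number}: {share:.1%} of feedback comes from women")
```

Even though the user base starts out balanced, the share of feedback from women falls in every round, so each retraining cycle sees less of the group the model already serves badly.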
10. Data collection:
10.1. Availability: Consider all available methods during the data collection process. It is normal to be satisfied with data that is readily available, but by relying on it alone you may omit a subpopulation.
10.2. Recency: Collect both present data and a reasonable amount of past data to ensure that you don't introduce recency bias.
10.3. Confirmation: A tendency to focus on data that confirms your beliefs may lead you to ignore a subpopulation and thereby miss data that is part of the population.
I hope this article has given you a fair understanding of why AI systems can be unfair. Thanks for reading. If you find this article interesting, please like, share and comment.
Views are personal.
Banner created using canva.com
References and additional reading: