Document Type
Thesis
Abstract
The goal of our research is to determine whether gender bias exists in Wikipedia. Wikipedia is a very large dataset that has been used to train artificial intelligence models. If a dataset used for this purpose is biased, a model trained on it will be biased as well and will therefore make biased decisions. For this reason, it is important to examine large datasets for potential biases before they are used in machine learning. Since Wikipedia is ontologically structured, we used graph theory to build a network of all of the website’s categories and examined the relationships between men-related and women-related categories using three measures: shortest paths, successor intersections, and average betweenness centrality. We found an overexposure of men-related categories: they are far more central in Wikipedia and easier to reach than women-related categories. However, although women-related categories are less central, about six times as many categories mention women in the title as mention men, which we consider to be overrepresentation. This is most likely because women are treated as an exception in many fields while men are treated as the norm. Our methods can be used to study gender bias in Wikipedia periodically, since its data changes relatively frequently, or to study other biases in Wikipedia or in other network-like datasets.
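To make the three measures named in the abstract concrete, the following is a minimal Python sketch on a toy category graph. It is an illustration under stated assumptions, not the thesis's actual implementation: the graph contents, the function names, the choice of a directed parent-to-subcategory graph, and the word-based rule for labeling a category as men-related or women-related are all hypothetical. Betweenness centrality is computed here with Brandes' algorithm and left unnormalized.

```python
from collections import deque

# Hypothetical toy data: an edge parent -> child means "child is a subcategory of parent".
GRAPH = {
    "People": ["Men", "Women", "Scientists"],
    "Men": ["Male writers"],
    "Women": ["Women writers", "Women scientists"],
    "Scientists": ["Women scientists"],
    "Male writers": [],
    "Women writers": [],
    "Women scientists": [],
}

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def successor_intersection(graph, a, b):
    """Direct subcategories shared by categories a and b."""
    return set(graph[a]) & set(graph[b])

def betweenness_centrality(graph):
    """Unnormalized betweenness for a directed, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}  # dependency of s on each node
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

def average_centrality(scores, words):
    """Mean score over categories whose title contains any of the given whole words.

    Matching whole words avoids counting "women" as a match for "men".
    """
    vals = [s for name, s in scores.items() if set(name.lower().split()) & words]
    return sum(vals) / len(vals) if vals else 0.0

scores = betweenness_centrality(GRAPH)
men_avg = average_centrality(scores, {"men", "male"})        # assumed keyword set
women_avg = average_centrality(scores, {"women", "female"})  # assumed keyword set
```

The toy graph is deliberately small and does not reproduce the findings reported above. On the full Wikipedia category network, a graph library would likely be a more practical choice than this hand-written version.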
Recommended Citation
Marinina, Anna, "Overrepresentation of the Underrepresented: Gender Bias in Wikipedia" (2019). Honors College Theses. 277.
https://digitalcommons.pace.edu/honorscollege_theses/277
Comments
Computer Science
Advisor: Yegin Genc