Recorded on April 17, this video features a talk by Dr. Alex Hanna, Director of Research at the Distributed AI Research Institute (DAIR). The talk was part of a symposium series presented by the UC Berkeley Computational Research for Equity in the Legal System Training Program (CRELS), which trains doctoral students from a variety of degree programs and areas of expertise in the social sciences, computer science, and statistics.
The talk was co-sponsored by the UC Berkeley Department of Sociology, the Criminal Law & Justice Center, and the Berkeley Institute for Data Science (BIDS).
Abstract
Artificial intelligence (AI) technologies like ChatGPT, Stable Diffusion, and LaMDA have given rise to a multi-billion-dollar industry in generative AI, and a potentially much larger industry in AI more generally. However, these technologies would not exist were it not for the immense amounts of data mined to make them run, the low-paid and exploited annotation labor required for labeling and content moderation, and questionable arrangements around consent to use these data.
Although the datasets used to train and evaluate commercial models are often obscured from view under the shroud of trade secrecy, we can learn a great deal about these systems by interrogating certain publicly available datasets that are considered foundational in academic AI research.
In this talk, I investigate a single dataset, ImageNet. It is not an overstatement to say that without ImageNet, we may not have the current wave of deep learning techniques that power nearly all modern AI technologies. I approach it from three vantage points: the histories of ImageNet from the perspective of its curators and its linguistic predecessor, WordNet; the testimony of the data annotators who labeled millions of ImageNet images; and the data subjects and the creators of the images within ImageNet. Academically, I situate this analysis within a larger theory and practice of infrastructure studies. Practically, I point to a vision for technology that is not based on practices of unrestricted data mining, exploited labor, and the use of images without meaningful consent.
About the Speaker
Dr. Alex Hanna is Director of Research at the Distributed AI Research Institute (DAIR). A sociologist by training, her work centers on the data used in new computational technologies and the ways in which these data exacerbate racial, gender, and class inequality. She also works in the area of social movements, focusing on the dynamics of anti-racist campus protest in the US and Canada. She holds a BS in Computer Science and Mathematics and a BA in Sociology from Purdue University, and an MS and a PhD in Sociology from the University of Wisconsin-Madison. Dr. Hanna has published widely in top-tier venues across the social sciences, including the journals Mobilization, American Behavioral Scientist, and Big Data & Society, and in top-tier computer science conferences such as CSCW, FAccT, and NeurIPS. Dr. Hanna serves as a Senior Fellow at the Center for Applied Transgender Studies, and sits on the advisory board of the Human Rights Data Analysis Group and on the Scholars Council of the UCLA Center for Critical Internet Inquiry.