Lang Doc Ethics & AI - looking for case studies

We’re adding an extra day on ethics in our Methods in Lang Doc class to discuss how these issues may change in the context of the AI resurgence. E.g., if industry is interested in expanding to low-resource languages, making archival data potentially commercially valuable, (how) should informed consent forms be adjusted? To stimulate discussion I want to give the students some case studies where language documentation data took on new value or was used in unexpected ways, perhaps to build a language technology for a community that cost money (either sold for profit or to cover the cost of ongoing tech support). I have 1-2 based on issues I’ve encountered or been asked about, and I plan to invent a few realistic hypothetical situations as case studies. I would love, however, to give some real-life case studies.

Does anyone have stories, questions, issues, surprises, or struggles related to language documentation data and technology (data science, AI, language technology) and ethical fieldwork (informed consent, access, copyright, sensitivity, politics, etc.) that they would be willing to share with me? This week, if possible! :slight_smile:

Post here or send me via email: smoeller@ufl.edu (I will anonymize everything).


You could check out my paper on rights beyond copyrights… I address some of these issues and others… With AI, the legal claim is that copyright limits don’t apply to the models created through Machine Learning. If the materials are Open Access by the OA definition, then there is an implied use license, even if the materials are in copyright and no explicit use license is declared. On Rights Management in Anthropological and Linguistic Sound Collections | Hugh's Curriculum Vitae

Thanks! I had not heard about the legal claim, but it makes sense. This is the sort of thing I was thinking about - that AI does not need to use the data in the same way a linguist might, so how does/should that change best ethical practices? A company can access the data for training a model, but the data doesn’t need to be “seen” by anyone. That is, no sensitive information will be shared publicly. Instead, word counts may be gathered and then essentially thrown away once the language technology is trained. But the texts are still floating around outside the archive for a while, so should archives have a new designation of access rights so that language technology can be developed and possibly benefit the community? On the other hand, companies may make money by selling the technology, so should they be required to pay for using the data? How is that different from academics who are “only” advancing their careers by using the data? What about computational linguists who develop methods that make AI accessible to low-resource languages but, by doing so, also open the door for commercial exploitation of documentary data?

Those are the kind of questions I want my students to ask themselves…

Carefully consider the weight of the implications of one’s actions… and consider carefully the difference between ethics, morals, and legal obligations. Consider Oppenheimer’s reaction here, and the commentary on it here. As a scientist he helped make the modern era.

He quotes:

Now I am become Death, the destroyer of worlds

Then there is Paul Newman, the lawyer and linguist, who quotes here:

I have met the enemy and it is us.