We’re adding an extra day on ethics in our Methods in Lang Doc class to discuss how these issues may change in the context of the AI resurgence. For example, if industry becomes interested in expanding to low-resource languages, making archival data potentially valuable commercially, (how) should informed consent forms be adjusted? To stimulate discussion I want to give the students some case studies where language documentation data took on a new value or was used in unexpected ways, perhaps to build a language technology for a community that cost money (either sold for profit or priced to cover the cost of ongoing tech support). I have one or two based on issues I’ve encountered or been asked about, and I plan to invent some realistic hypothetical situations as additional case studies. I would love, however, to give some real-life case studies as well.
Does anyone have stories, questions, issues, surprises, or struggles related to language documentation data and technology (data science, AI, language technology) and ethical fieldwork (informed consent, access, copyright, sensitivity, politics, etc.) that they would be willing to share with me? This week, if possible!
Post here or send them to me via email: smoeller@ufl.edu (I will anonymize everything).
You could check out my paper, “On Rights Management in Anthropological and Linguistic Sound Collections,” where I address some of these issues and others. With AI, the legal claim is that copyright limits don’t apply to the models created through machine learning. And if the materials are Open Access by the OA definition, then there is an implied use license, even if the materials are in copyright and no explicit use license is declared.
Thanks! I had not heard about that legal claim, but it makes sense. This is the sort of thing I was thinking about: AI does not need to use the data in the same way a linguist might, so how does (or should) that change best ethical practices? The data can be accessed for training a model without ever being “seen” by anyone, that is, without any sensitive information being shared publicly. Instead, statistics such as word counts may be gathered and then essentially thrown away once the language technology is trained. But the texts are still floating around outside the archive for a while, so should archives have a new designation of access rights so that language technology can be developed and possibly benefit the community? On the other hand, companies may make money by selling the technology, so should they be required to pay for using the data? How is that different from academics who are “only” advancing their careers by using the data? And what about computational linguists who develop methods that make AI accessible to low-resource languages but, by doing so, also open the door for commercial exploitation of documentary data?
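To make the derived-versus-raw-data point concrete, here is a toy sketch. Everything in it is invented for illustration (the sentences, the name train_tiny_model), and counting words is a deliberately trivial stand-in for training; real systems learn statistical weights rather than raw counts, but the principle is the same:

```python
from collections import Counter

def train_tiny_model(texts):
    """Derive aggregate statistics (unigram counts) from a corpus.
    The returned 'model' holds only counts, not the original texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

# Invented example sentences standing in for archival texts
corpus = [
    "the river rises in spring",
    "the elders tell the story in spring",
]

model = train_tiny_model(corpus)
del corpus  # the raw texts can be discarded once the model is built

print(model.most_common(3))  # [('the', 3), ('in', 2), ('spring', 2)]
```

The sensitive texts exist only during training; what circulates afterward is an aggregate artifact, which is exactly why existing archival access designations may not map cleanly onto this kind of use.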
Those are the kind of questions I want my students to ask themselves…
Carefully consider the weight of the implications of one’s actions, and consider carefully the difference between ethics, morals, and legal obligations. Consider Oppenheimer’s reaction, and the commentary on it. As a scientist he helped make the modern era.
He famously quoted:
Now I am become Death, the destroyer of worlds
Then there is Paul Newman, the lawyer and linguist, whose writing on copyright for linguists is also worth considering.
With the release of GPT-4, there are so many ethical questions that have become immediately relevant. I see great potential in AI for supporting minority languages through analysis, creating pedagogical materials for schools, allowing digital interaction in the language, etc. Is it ethical for us to start experimenting with AI for these purposes with the data we have collected? With open-access archival collections on the internet? Is it okay to archive data we have already collected when the consent framework didn’t take AI applications into account? What conditions need to be met for that work to be ethical? What do we need to keep in mind?
The issue of language documentation ethics in the age of AI is relevant for all of us, and it has implications that are much broader and more imminent than what I (and, I suspect, many of us) had realized. As just one example, some AI systems need only about three seconds of audio of someone’s voice in order to speak in that voice, which can be used for very convincing scams.
I would love to coordinate a discussion of these issues together! We need to broaden our discussions of ethics, and soon, because AI is moving very quickly. Personally, I feel a great deal of uncertainty around these and related questions.