Summer Schools: New Languages for NLP: Building Linguistic Diversity in the Digital Humanities / USA

Perhaps of interest:

Date: 10-Dec-2020
From: Andrew Janco <>
Subject: New Languages for NLP: Building Linguistic Diversity in the Digital Humanities / USA

Host Institution: Princeton University

Dates: 14-Jun-2021 - 16-May-2022
Location: Remote and Princeton, New Jersey, USA

Focus: Digital Humanities, Natural Language Processing
Minimum Education Level: BA

Do you wish you could do large-scale text analysis on the languages you study? Is the lack of good linguistic data and tools a barrier to your research?

The Center for Digital Humanities at Princeton is calling for applications for New Languages for NLP: Building Linguistic Diversity in the Digital Humanities, a 3-part workshop series to be held between May 2021 and August 2022. Deadline for applications is January 10, 2021.

We are seeking a cohort of scholars working with diverse languages that currently lack NLP resources. No technical experience is necessary to participate. Institute participants will learn how to annotate linguistic data and train statistical language models using cutting-edge NLP tools and will advance their own research projects.

For more information and to apply, see our project website:

This workshop series is funded by a National Endowment for the Humanities Institutes for Advanced Topics in the Digital Humanities grant, and is a collaboration between the Princeton CDH, Haverford College, the Library of Congress Labs, and DARIAH.

Please feel free to contact the project directors with questions:
Natalia Ermolaev (
Andrew Janco (

Linguistic Field(s): Text/Corpus Linguistics

Registration: 10-Jan-2021 to 10-Jan-2021
Contact Person: Andrew Janco
Phone: 8572108078

Dear Linguist List, light gray text on a white background is bad. :skull::skull::skull:


Yes, this does look really interesting! I started filling out the form and reached this question:

“For the Institute you will need a collection of machine-readable texts containing 20,000 or more words/tokens in your language (here, machine-readable means OCRd text, transcribed documents, or born-digital). Do you presently have such a corpus? If yes, please describe how it was created. If no, how do you plan to gather these materials?”

That might rule out a number of people (myself included), unfortunately. :disappointed_relieved:


Ah, yes, that would probably rule me out too :sweat_smile::sweat_smile: Thanks for sharing!

Huh, yeah. That is a pretty big number.

Do you know where your word-count is hovering now? Maybe you could make an argument about the trajectory of your word-count over time.
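If it helps anyone figure out where their word count stands, here's a quick sketch for tallying tokens across a folder of plain-text files. This is just a rough proxy: it assumes UTF-8 `.txt` files and splits on whitespace, which won't match whatever tokenizer the Institute ultimately has in mind, but it gives a ballpark against the 20,000 threshold.

```python
from pathlib import Path

def corpus_word_count(directory: str, pattern: str = "*.txt") -> int:
    """Rough whitespace-token count across all text files under a directory."""
    total = 0
    for path in Path(directory).rglob(pattern):
        # errors="replace" keeps the count going even if a file has encoding issues
        text = path.read_text(encoding="utf-8", errors="replace")
        total += len(text.split())
    return total

# e.g.:
# count = corpus_word_count("my_corpus")
# print(count, "tokens;", "meets 20k" if count >= 20_000 else "below 20k")
```

For agglutinative or unsegmented languages, whitespace splitting will of course under- or over-count badly, so treat the number as an order-of-magnitude estimate.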

I think they’re going to run into problems if they think that a ton of projects are starting with those kinds of numbers. I think a lot of times we see this mismatch between the “big data” community and the documentation community.

I wonder if @Hilaria might have some insight into this problem?


Could you share the call for the summer school?

Thanks Hilaria


I got it, thank you. This is interesting. I would not know how to answer those questions either. Glad to see there is growing interest in this problem.