DoReCo [Language Documentation Reference Corpus] launch event

Passing this along from the LINGTYP mailing list, surely of interest to lots of folks here:

Dear typologists,

We are thrilled to announce the upcoming inauguration of the complete
DoReCo [Language Documentation Reference Corpus] database, with fully
processed data sets on a typologically diverse set of 50 (plus one!)
languages (see Languages – DoReCo). For each of these, we
will make available ca. 10,000 words of narrative texts that are
phonemically transcribed and time-aligned with the audio signal at the
segment level. For 36 of these languages, we will additionally provide
morphological segmentation and glossing.

To mark this occasion, there will be a public event on 29 July 2022,
3:30-5:30pm CEST, to be held at the ZAS in Berlin and online. We are
looking forward to a keynote address by Evangelia Adamou and we are also
very happy that many of the DoReCo corpus contributors will be present
to introduce themselves and the languages they work on. For the full
program see the announcements on the DoReCo website
(29 July 2022 DoReCo Inauguration Ceremony – DoReCo).

Online or on-site attendance is free, but registration (by 22 July) is
required, via this link: Microsoft Forms.

We hope to see you soon in Berlin or online.

Best wishes

Frank Seifart, on behalf of the DoReCo team

Frank Seifart
Linguist @ ZAS, Berlin

Check out the current language map (lots more languages on the site):

Local heroes @rgriscom and @Andrew_Harvey are contributors, for Asimjeeg Datooga and Gorwaa, respectively! Mad props.

Looking forward to learning more about this, it looks like a great step for language documentation.

The Inauguration Ceremony will be held:


More info here:

Don’t forget to register by July 22!


Let us pause to appreciate the mysterious U+FFFC character, , at the end of that URL.




According to Wikipedia:

> U+FFFC OBJECT REPLACEMENT CHARACTER: placeholder in the text for another unspecified object, for example in a compound document.

The original proposal for this character can be found here (search for “Embedded Objects”).
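For the curious, the character is easy to inspect programmatically. Here's a minimal Python sketch (the URL below is a made-up stand-in, not the actual link from the post above):

```python
import unicodedata

OBJ = "\ufffc"  # U+FFFC OBJECT REPLACEMENT CHARACTER

# A stand-in URL with the stray character glued onto the end,
# as can happen when a rich-text embed is flattened to plain text.
url = "https://example.com/doreco-inauguration" + OBJ

last = url[-1]
print(f"U+{ord(last):04X}")    # U+FFFC
print(unicodedata.name(last))  # OBJECT REPLACEMENT CHARACTER

# Stripping it recovers the clean URL.
clean = url.rstrip(OBJ)
print(clean)
```

Handy if you've ever pasted a link out of a forum post and wondered why it 404s.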


Hey @skalyan!

I love the fact that there are people in this site who find such things interesting! :joy:


It’s been great to be involved in this project! I think this kind of collaboration between people who want to use language data (in this case DoReCo) and people who collect that data (in this case me) is going to happen more and more in the future. Not only do I want to continue participating in it, I also want to encourage other people doing language documentation to do the same.

:large_blue_circle: Here’s a brief snapshot of what that looked like for me:

-In early March of 2019, @rgriscom sent me an email saying that a colleague of his was joining a project and that they were looking for samples of language that met the following criteria:

  1. a minimum of 10,000 transcribed words (typically distributed over various recording sessions/annotation files)

  2. translation into a major language

  3. primarily monological texts (e.g., personal or traditional narratives)

  4. time-alignment of transcription and translation with audio files at the level of sentences, paragraphs, utterances, or intonation units (i.e., “annotation units” in ELAN, time stamps in Toolbox records)

  5. audio is of reasonable quality (not too much overlapping speech or background noise)

  6. transcription/translation/annotation files (not audio/video files) can be made accessible within three years on the DoReCo platform under a Creative Commons Attribution 4.0 (CC BY 4.0) license, with strict rules for fair scientific use (see below)

…Contributors were also asked to indicate whether their data includes at least 10,000 words that are additionally morphologically annotated (typically using Toolbox/Shoebox), with (i) morpheme segmentation, (ii) morpheme glosses, and optionally (iii) part-of-speech tags.

There was an additional understanding that following this initial ‘donation’, I would, over the course of the project, provide:

  1. A chart specifying correspondences between the orthographic characters used in the transcription and IPA symbols

  2. Answers to their questions regarding e.g. inconsistencies between the audio and the transcription (e.g. glossed elements that are not transcribed)

  3. Basic metadata per recording session, if not already available (e.g. anonymized speaker codes, speaker sex, and approximate age)

-Once I had reached out to the DoReCo contact email and sent them a short example of a parsed and glossed text, I received a response very quickly (around a week later) saying that my material looked like a good fit. I was then asked to specify a subset of recordings from the archived collection of Gorwaa materials similar to the sample I had provided.

-Around early October 2019 (and after some short exchanges back and forth regarding small things like missing files), I was told that my data had been processed with the MAUS software and given automatic word-level alignments. I was then asked if I would like to take over post-processing of this material (things like identifying code-switching, filled pauses, missing transcriptions, etc.). I was also told that there was a small amount of funding available for this, which would cover any time I might have to put into it.

-I originally thought that I would be able to do this, but it turned out that I just didn’t have the capacity. For several reasons (primarily Covid and being separated from the computer I usually use to process my files for several months as a result of pandemic travel restrictions), I ended up making this decision around 12 months later (October 2020). Amazingly, DoReCo was still able to work with my data, and employed a plan B to have the material post-processed by an assistant on their side.

-In September (2020) DoReCo got back in touch, at which point my material had been successfully post-processed by their assistant. I was asked to review the ELAR files to make sure what they’d done was an accurate representation of the recordings, and was given several specific questions to respond to (things I might not have transcribed but were clearly in the recording, identifying typos, etc.). This was really straightforward, and required just a few hours’ work on my end. By mid-November (2020), I had responded to everything.

-Since that time, the Gorwaa materials processed by DoReCo have been used in a recent (2021) paper (DOI here; open access here), with my archive deposit as the citation for the data used therein.

:large_blue_circle: And now for some reflections:

-Having the Gorwaa data used as part of the DoReCo project involved real back-and-forth between the project team and me; it went FAR beyond my simply having my materials openly accessible online and DoReCo downloading them for use. This will probably be the case for most reuse of archived materials, and should be understood by those of us who want our materials to be reused: it’s not a passive process for the language documenter, and we should be aware of (and prepared for) this.

-DoReCo made things explicit from the very beginning (criteria for what kinds of data they were looking for, expectations of contributions from the documenter, etc.). This was crucial for my participation, because I knew from the start how much time I would have to put in.

-It should also be noted that DoReCo explicitly sought lesser-documented (or lesser-supported) languages for its sample. This approach should be recognised and acknowledged as an important step in the right direction.


It’s a bit of a shame, though, that papers are published without the data. The paper states

and will be publicly available in 2021

Let’s hope it will be in July.

If delays are shameful then I for one am positively dripping with shame. :grimacing:

Well, the paper used the data - so it’s there :slight_smile: - just not for us to see.

To elaborate, I think in a time of “reproducibility crisis” and services like OSF, etc. it seems out-of-place to publish a paper with the promise that people will be able to reproduce the results sometime later …

It’s part of DoReCo, which has a specified release date above. Open Access is a good thing but we have to keep in mind that these are still folks just like us with lives and obligations.

I mean, I agree that we should publish data at the time of publication as a default.

It is interesting to muse about paper-writing workflows that are making use of the database directly. That would encourage a pattern where the data and paper are ready simultaneously. I think right now we have more of a “parallel project” pattern, which results in things getting out of sync.

It’s really easy nowadays to get a DOI stamped on a set of files, ensuring no-one can steal your data. But I don’t really blame people for acting within the status quo - with no pressure from publishers or reviewers regarding publication of data. But from outside the discipline that might look strange.

But like… I don’t think “blame” is at all relevant here. It’s a bit much to criticize the same people who are about to publish 50 * 10K words of open-access, time-aligned documentation because of a single paper. Let’s focus on the success story here.
