📜 Anybody else working with historical documents?

Is anyone else working with linguistic data from historical documents? If so, it would be great to learn about your experiences. Here’s a page from the 1612 Aymara text that @pathall and I are working on. The left column is in Aymara, and the right column is a Spanish translation of the Aymara:

So there are several steps:

  1. transcribing the text
  2. putting the 17th-century Aymara spelling conventions into a normalized, modern orthography
  3. doing an interlinear analysis of the Aymara
  4. aligning the Aymara sentences with the corresponding Spanish sentences
  5. giving free translations of both
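For anyone curious what the end product of those steps might look like, here’s a minimal sketch of one aligned sentence pair as plain data. All the field names here are my own illustration, not any fixed schema:

```python
# One sentence pair from the text, as a plain Python dict.
# Field names are illustrative only, not an established format.
sentence_pair = {
    "aymara": {
        "transcription": "...",      # step 1: the 1612 spelling, as written
        "normalized": "...",         # step 2: modern orthography
        "interlinear": [             # step 3: one entry per word
            {"form": "...", "morphemes": ["..."], "glosses": ["..."]},
        ],
        "free_translation": "...",   # step 5
    },
    "spanish": {                     # step 4: the aligned column
        "transcription": "...",
        "free_translation": "...",   # step 5
    },
}
```

Something in this general shape keeps the five steps distinct while keeping each Aymara sentence tied to its Spanish counterpart.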

It’s a lot of work, but it’s a very rich text. If anyone wants to use Docling for similar purposes, let me know! Maybe we can develop some shared tools.


Hi @nqemlen!

Nick introduced me to the Aymara materials, which he had been working with for some time before I got involved. If only every language had this much bilingual material available.

Nick knows it much better than I do, but the corpus consists of at least these things:

  1. A nearly 1000-page Spanish to Aymara (to Spanish!) translation of a religious book
  2. Two dictionaries, Spanish - Aymara and Aymara - Spanish
  3. A grammar

So that’s a whole Boasian trilogy right there. It’s a really amazing corpus.


It sure is! But totally inaccessible without the kinds of tools you’re developing.


And the hard work of linguists and other language workers like you who are doing the actual work of transcription!

My own software, such as it is, is inspired in large part by precedents that already exist: ELAN, FLEx, Toolbox, even Praat. (Mad props to all of those.) My dream is to bring basic functionality from the varied domains of documentary linguistics (texts, media, lexicography…) into a single platform (the web), and to encourage all us linguist folks to talk about what we want our software to do going forward. I’m really hopeful that this forum can be a place where experts in particular languages (like you!) can participate in such a discussion.



Back to the historical documents topic, I once worked on the so-called “Kostromitinov Vocabulary” from 1833 (never published it, didn’t even finish the paper!):

Here’s the whole thing:


Here’s a rather preliminary web version of a transcription I did:

And the rest of that:


It’s an interesting story. In the early 19th century, there was a Russian outpost in California (it’s still there, a park now) called Fort Ross. This was (and is) Kashaya Pomo territory. Farther south, the brutality of the Spanish (Mexican) missions led various peoples to flee north, and some ended up at Fort Ross, particularly the Bodega Miwok, but many others as well. (There were also many Native Alaskan peoples at Ross, who had come down with the Russians.)

Unsurprisingly, then, the document includes several languages: German (it was published in Germany), Russian, Kashaya Pomo, and Bodega Miwok. All but the German entries are in a Cyrillic writing system, so half the work consisted of transcribing the original orthographies. The Russian transcription is of its time: ѣ’s and Ѣ’s abound, so I modernized those (since I don’t know much about Russian and had to look things up in modern dictionaries). I didn’t do much with the Bodega Miwok, except try to transcribe it.
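The modernization amounts to a small character mapping. Here’s a minimal sketch of the sort of thing involved (assuming only the best-known 1918 reforms; real modernization has more edge cases than this):

```python
import re

# Pre-1918 → modern Russian letter substitutions:
# ѣ (yat) became е; і became и; ѳ became ф; ѵ became и.
MODERNIZE = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

def modernize(text: str) -> str:
    # The reform also dropped the word-final hard sign (ъ).
    text = re.sub(r"ъ\b", "", text)
    return text.translate(MODERNIZE)

print(modernize("хлѣбъ"))  # → хлеб
```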

The Kashaya was most of the work, and it was mostly a matching game: trying to figure out how the Cyrillic transcriptions mapped onto the late Robert Oswalt’s materials and orthography.

My work on this is 6 or 7 years old now, and if I redid it today I would probably do it differently. Even so, the data isn’t in too bad a state (there’s a JSON file). The quality of the content in this old document is pretty amazing, and it’s pretty rare in California to have material that old at all.


Hi Nick and Pat,
I have been working on two sets of historical documents that may be of interest. One is a set of 300 notebook pages from a Sierra Leonean goldsmith who was active in the 1950s; he wrote in the Mende Kikakui script. Tools I’m starting to use for that include Mirador and a customized Unicode input application. The other is a 180-page diary from 1913, kept by Boima Kiakpomgbo in the Vai script.



Hi all!
I’ve recently started working with some of Jochelson’s Yukaghir legacy materials from the late 19th century. There’s a collection of 100+ texts (in Cyrillic and some sort of Roman transliteration), a grammar sketch, and a vocabulary list. I’m trying to develop a corpus from these materials and contemporary texts, and hopefully my own fieldwork.

I have a question for you @pathall, since you’ve mentioned JSON. I’m quite new to this; I’ve been mostly going with XML for each text (following the BNC structure). Would you recommend a different encoding?
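To make the question concrete, here’s a toy sketch of the kind of sentence markup I mean and its JSON equivalent. The tag and attribute names are simplified stand-ins, not my actual BNC-style markup:

```python
import json
import xml.etree.ElementTree as ET

# A BNC-style sentence: an <s> element wrapping <w> elements with
# part-of-speech attributes. Names are simplified stand-ins.
xml_text = '<s n="1"><w pos="N">text</w><w pos="V">goes</w><w pos="ADV">here</w></s>'

sentence = ET.fromstring(xml_text)
as_json = {
    "n": sentence.get("n"),
    "words": [{"form": w.text, "pos": w.get("pos")} for w in sentence.iter("w")],
}
print(json.dumps(as_json, ensure_ascii=False))
```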



Wow Pat, that’s a fascinating history, and a rich set of data. Looking forward to hearing more about it!


Hi Charles,
Nice to be back in touch! Those sound like really interesting corpora. Are these scripts widely used, or were they in the past?


Both are now past their period of heaviest use, but there is still some interest, and limited expertise can be found. In the Mende case, we’re running into some gaps and unknowns that the Unicode proposal as approved did not cover. The Vai diary has been through a rough first translation pass, but will now come under closer examination to prepare it for publication.



Sounds fascinating, Charles! Can’t wait to learn more about it.