How do we implement “orthographic”, “broad”, and “close” transcriptions as data?

Something I’ve been thinking about with regard to an ongoing fieldmethods class that my colleagues (and friends!) @squidtm and @Dani are working on:

What exactly do we mean by “orthographic”, “broad”, and “close” transcription, as far as data is concerned?

In fieldwork, no transcription is certain, so the same word can be written down in numerous ways. There are several reasons for this, and a lot depends on the stage of the fieldwork. In the earliest days, the phoneme inventory is completely or nearly completely unknown, so every transcription is essentially an informed guess.

There’s no better evidence of this than comparing multiple linguists’ transcriptions of the same word.

In 2009 I participated in the LSA Institute Fieldmethods class on Kashaya Pomo with Anita Silva, who has since passed away. Working with her was a wonderful experience for all involved.

So anyway, all the work from that class is now available on the California Language Archive.

Below, I have screenshots of the very first word we transcribed: camay ‘hello, goodbye’.

Things to note:

  • orthographic choices
  • corrections
  • marking of suprasegmentals
  • variations in glossing

Ricardo Lezama

Eun Joo Kim

Sverre Johnsen

Patrick Hall

Roey Gafter

Pamela Munro

Yen-ling Chen

So here’s a table containing roughly that information:

| form | gloss | linguist | note |
|---|---|---|---|
| dʒamay | hello, goodbye | Lezama | |
| tʒamaj | hello, goodbye | Kim | Acute arrow over last syllable |
| dʒaʔmái | hello, goodbye, good morning | Johnsen | replaces jaʔmái |
| dʒamaj | like aloha: hello or goodbye | Hall | replaces ǰamay, dʒamay |
| d͡ʒamaj | hello | Gafter | replaces jamay |
| ǰamáːy | hello, goodbye | Munro | replaces ǰamáy |
| tʃiamaj | Hello | Chen | |
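To make the "as data" question concrete, here is a minimal sketch of how the observations in that table could be held as records. The `Observation` class and its field names are my own invention for illustration, and I'm assuming that "replaces" in the note column lists struck-through earlier attempts in the order they were written.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One linguist's transcription of one word (hypothetical schema)."""
    form: str
    gloss: str
    linguist: str
    replaces: list = field(default_factory=list)  # earlier, struck-through forms
    note: str = ""


observations = [
    Observation("dʒamay", "hello, goodbye", "Lezama"),
    Observation("tʒamaj", "hello, goodbye", "Kim",
                note="Acute arrow over last syllable"),
    Observation("dʒaʔmái", "hello, goodbye, good morning", "Johnsen",
                replaces=["jaʔmái"]),
    Observation("dʒamaj", "like aloha: hello or goodbye", "Hall",
                replaces=["ǰamay", "dʒamay"]),
    Observation("d͡ʒamaj", "hello", "Gafter", replaces=["jamay"]),
    Observation("ǰamáːy", "hello, goodbye", "Munro", replaces=["ǰamáy"]),
    Observation("tʃiamaj", "Hello", "Chen"),
]

# How many linguists settled on each final form?
final_forms = Counter(o.form for o in observations)

# Every string anyone wrote down, including the struck-through ones.
all_forms = {f for o in observations for f in [o.form, *o.replaces]}
```

Even in this tiny sample no two linguists ended up with the same final form, which is the variability problem in a nutshell.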

The rest of the corpus of the class, like so many others, remains as scans (but fortunately it’s archived with metadata thanks to the hard work of the CLA). This means that we can’t easily find more of the documentary history of this word in the corpus, but this small dataset suffices to foreground some thorny questions.

Close transcriptions are valuable. As a set, the observations here tell us a fair amount about the word. While there are numerous transcriptions of the first phoneme, it’s pretty clear that we’re dealing with some kind of palatal affricate. It’s also of interest that only one transcription (Johnsen’s) has a medial glottal stop. Was it there? Maybe. In early days, we really don’t want to throw away anything.

Sadly, while this class was recorded and is available in its entirety on the CLA, this particular word was not recorded — the first comment on the recording is Pam Munro saying “Should have started recording with the the camaj, but that’s the way it goes!” :sweat_smile: So we can’t listen to it again.

Close transcriptions are highly variable. Yeah, it starts with some kind of palatal affricate (you and I “just know” that), but computers don’t know a palatal affricate from a salad fork. :fork_and_knife:

We kind of want to preserve everything. One of the unique aspects of fieldwork data, specifically, is that every scribble is potentially precious. This is why lots of fieldwork manuals recommend striking through transcriptions rather than scribbling them out. (In fact, I remember Pam instructing us to do just that in the class.) Overwriting a form in a digital file is more like scribbling out: it replaces the earlier version. Maybe we should represent a “fieldwork word” something like this (using my own transcription as an example, because I was particularly indecisive!)?

  "form": "dʒamaj",
  "gloss": "hello_goodbye",
  "history": [

Is that enough information?
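Here is a rough sketch of the "strike through, don't scribble out" idea in Python. Everything about the shape of a history entry (a dict with `form` and `reason`) is my assumption, as are the `FieldworkWord` class and the `revise` method; the only things taken from the class notes are the three forms and their order.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldworkWord:
    """A word whose earlier transcriptions are kept, not overwritten."""
    form: str
    gloss: str
    history: list = field(default_factory=list)

    def revise(self, new_form, reason=""):
        """Adopt a new form and archive the old one, like a strikethrough."""
        self.history.append({"form": self.form, "reason": reason})
        self.form = new_form


word = FieldworkWord("ǰamay", "hello_goodbye")
word.revise("dʒamay")
word.revise("dʒamaj")

# ensure_ascii=False keeps the IPA readable in the serialized file.
print(json.dumps(asdict(word), ensure_ascii=False, indent=2))
```

The point of routing every change through something like `revise` is that the current form can move freely while nothing that was ever written down gets lost.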

Anyway, in the class we came up with a standard orthography, and I believe the spelling we used would have standardized this word on «camay». This would become the representation of the word in the database.

  "phonetic": "dʒamaj",
  "form": "camay",
  "gloss": "hello_goodbye",
  "history": [

I just added that phonetic field because it seems reasonable. But I think it’s generally true that when people talk about “broad” and “close” transcriptions (which describe a cline, to be sure), those usages are often equivalent, respectively, to “orthographic” or “working orthographic” and “phonetic”.

Another factor, somewhat orthogonal, is the question of whether we’re talking about textual or lexical documentation. What people call a “dictionary entry” is invariably somewhat “standardizing” in nature (with all the baggage that brings), but annotations in a “textual” context probably should note any interesting phonetic detail. After all, lots of phonetic processes only occur in fluent speech.

Anyway, I’m just thinking about all this stuff, and I’d be interested to know what people’s thoughts on this are. How do you handle digitization of early-stage fieldwork? Do you even do it? Or do you wait till a working phonemic orthography has shaken out?


I’d like to take issue with

computers don’t know a palatal affricate from a salad fork.

CLTS is exactly about teaching computers to know about palatal affricates. So, here’s what pyclts (using the CLTS data as reference) would tell us about the “start”:

>>> from pyclts import CLTS
>>> clts = CLTS('clts-data')
>>> bipa = clts.bipa
>>> bipa['dʒ']
<pyclts.models.Consonant: voiced post-alveolar sibilant affricate consonant>
>>> bipa['tʒ']
UnknownSound(ts=<pyclts.transcriptionsystem.TranscriptionSystem object at 0x7f37b517d000>, grapheme='tʒ', source='tʒ', generated=False, note=None)
>>> bipa['d͡ʒ']
<pyclts.models.Consonant: voiced post-alveolar sibilant affricate consonant>
>>> bipa['ǰ']
UnknownSound(ts=<pyclts.transcriptionsystem.TranscriptionSystem object at 0x7f37b517d000>, grapheme='ǰ', source='ǰ', generated=False, note=None)
>>> bipa['tʃ']
<pyclts.models.Consonant: voiceless post-alveolar sibilant affricate consonant>

Not perfect, and UnknownSound is obviously pyclts’ equivalent of a salad fork. But it’s a start, I’d say.
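One hedged idea for at least one of the two salad forks: «ǰ» is the Americanist symbol for what IPA writes as «dʒ», so a small hand-written alias table could be applied before the lookup. The sketch below is plain Python, not part of CLTS or pyclts, and the `AMERICANIST_TO_IPA` table and `to_ipa` function are hypothetical names. «tʒ» is deliberately left alone, since it isn’t obvious what sound was intended.

```python
import unicodedata

# Hypothetical alias table: Americanist symbol -> IPA equivalent.
AMERICANIST_TO_IPA = {
    "ǰ": "dʒ",  # j with caron, voiced post-alveolar affricate
}


def to_ipa(form: str) -> str:
    """Rewrite known Americanist symbols as IPA, leaving the rest untouched."""
    # Decompose first so that precomposed and combining spellings of
    # the same symbol are matched alike.
    decomposed = unicodedata.normalize("NFD", form)
    for alias, ipa in AMERICANIST_TO_IPA.items():
        decomposed = decomposed.replace(unicodedata.normalize("NFD", alias), ipa)
    return unicodedata.normalize("NFC", decomposed)


print(to_ipa("ǰamay"))   # initial segment rewritten as dʒ
print(to_ipa("tʒamaj"))  # unchanged: no alias defined for tʒ
```

The output of `to_ipa` could then be segmented and passed to `bipa[...]` as above, while the original string stays in the record as the linguist wrote it.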

Well, I guess it would be truer to say that computers, by default, don’t know a phoneme from a salad fork!

Sorry for forgetting to mention CLTS (there is at least a short mention here). It must have been a crazy amount of work. :clap:

Features :arrow_right: Graphemes

From my experience in (air-conditioned) fieldwork, a primary difficulty in the early days of a project is input. And not just the questions of “where is my character?” and “how can I input characters efficiently?”. Both of those are challenging, but the one that I think is the most challenging (and most frequent) is “I heard a sound which I believe has features x, y, and z. Which IPA graphemes do I need?” So for instance, you might say, “that thing sounded like a voiced dental stop «d̪». But if it also sounds velarized…”

Then you have to remember that “velarized” maps to «ˠ», superscript gamma.
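The features-to-graphemes direction can be sketched with a tiny lookup table. This is only an illustration: the `DIACRITICS` dict and `compose` function are made-up names, the table covers just a handful of IPA diacritics, and a real tool would want to draw its feature names from something like the BIPA data rather than a hand-typed dict.

```python
# Feature name -> IPA diacritic (a very partial, hand-typed table).
DIACRITICS = {
    "dental": "\u032A",       # combining bridge below, as in d̪
    "velarized": "\u02E0",    # superscript gamma ˠ
    "palatalized": "\u02B2",  # superscript j ʲ
    "labialized": "\u02B7",   # superscript w ʷ
    "aspirated": "\u02B0",    # superscript h ʰ
}


def compose(base: str, *features: str) -> str:
    """Append the diacritic for each named feature to a base letter."""
    return base + "".join(DIACRITICS[f] for f in features)


# "A voiced dental stop, but it also sounds velarized"
print(compose("d", "dental", "velarized"))
```

With a table like this, the transcriber only has to remember the feature name, not where the superscript gamma lives on the keyboard.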


A database like CLTS is going to be the solution to working with featural phonetic encodings, but there are a lot of user interface questions related to how we actually access and insert that information in a transcription.

I built a sort of feature-based-button-panel thingy in the context of my dissertation. It looks like this:

Online here:

There is plenty of room for experimentation with interfaces like this. Various UI tools for search and input could be designed. So the tool above, for instance, should be rewritten with BIPA data as the source, but it needs (familiar) feature names for everything. I’m going to take a closer look at the source and see if perhaps it’s already there.

This particular topic in digital documentary data always makes me dizzy. So much detail!