Something I’ve been thinking about with regard to an ongoing fieldmethods class that my colleagues (and friends!) @squidtm and @Dani are working on:
What exactly do we mean by “orthographic”, “broad”, and “close” transcription, as far as data is concerned?
In fieldwork, no transcription is certain, so the same word can be written down in numerous ways. There are several reasons for this, and a lot depends on the stage of the fieldwork. In the earliest days, the phoneme inventory is completely or nearly completely unknown, so every transcription is essentially an informed guess.
There’s no better evidence of this than comparing multiple linguists’ transcriptions of the same word.
In 2009 I participated in the LSA Institute Fieldmethods class on Kashaya Pomo with Anita Silva, who has since passed away. Working with her was a wonderful experience for all involved.
So anyway, all the work from that class is now available on the California Language Archive.
Below, I have screenshots of the very first word we transcribed: camay ‘hello, goodbye’.
Things to note:
- orthographic choices
- corrections
- marking of suprasegmentals
- variations in glossing
Ricardo Lezama
Eun Joo Kim
Sverre Johnsen
Patrick Hall
Roey Gafter
Pamela Munro
Yen-ling Chen
So here’s a table containing roughly that information:
form | gloss | linguist | note |
---|---|---|---|
dʒamay | hello, goodbye | Lezama | |
tʒamaj | hello, goodbye | Kim | Acute accent over last syllable |
dʒaʔmái | hello, goodbye, good morning | Johnsen | replaces jaʔmái |
dʒamaj | like aloha: hello or goodbye | Hall | replaces ǰamay, dʒamay |
d͡ʒamaj | hello | Gafter | replaces jamay |
ǰamáːy | hello, goodbye | Munro | replaces ǰamáy |
tʃiamaj | Hello | Chen | |
The rest of the corpus of the class, like so many others, remains as scans (but fortunately it’s archived with metadata thanks to the hard work of the CLA). This means that we can’t easily find more of the documentary history of this word in the corpus, but this small dataset suffices to foreground some thorny questions.
Close transcriptions are valuable. As a set, the observations here tell us a fair amount about the word. While there are numerous transcriptions of the first phoneme, it’s pretty clear that we’re dealing with some kind of palatal affricate. Also, it’s of interest that there’s only one transcription of a medial glottal stop (Johnsen’s). Was it there? Maybe. In early days, we really don’t want to throw away anything.
Sadly, while this class was recorded and is available in its entirety on the CLA, this particular word was not recorded: the first comment on the recording is Pam Munro saying “Should have started recording with the camaj, but that’s the way it goes!”
So we can’t listen to it again.
Close transcriptions are highly variable. Yeah, it starts with some kind of palatal affricate (you and I “just know” that), but computers don’t know a palatal affricate from a salad fork.
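To make the salad-fork point concrete, here is a minimal Python sketch. It assumes a hand-written lookup table (`ONSET_CLASS`, a hypothetical name) that maps the onset spellings from the table above to one deliberately coarse label. The label and the mapping are illustrative assumptions, not an analysis of Kashaya.

```python
import unicodedata

# Hypothetical lookup: every onset spelling from the table, mapped by hand
# to one coarse label. This is an assumption for illustration only.
ONSET_CLASS = {
    "d͡ʒ": "palatal-ish affricate",
    "dʒ": "palatal-ish affricate",
    "tʒ": "palatal-ish affricate",
    "tʃ": "palatal-ish affricate",
    "ǰ": "palatal-ish affricate",
}


def nfc(text):
    """Normalize so precomposed and combining spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def onset_class(form):
    """Return the coarse class of a form's onset, or None if unknown."""
    form = nfc(form)
    # Try longer spellings first so a tie-barred onset is matched whole.
    for onset in sorted(ONSET_CLASS, key=len, reverse=True):
        if form.startswith(nfc(onset)):
            return ONSET_CLASS[onset]
    return None


forms = ["dʒamay", "tʒamaj", "dʒaʔmái", "dʒamaj", "d͡ʒamaj", "ǰamáːy", "tʃiamaj"]

# As raw strings, no two transcriptions are the same...
distinct_strings = len({nfc(f) for f in forms})
# ...but once someone spells out what "we just know", they collapse.
distinct_classes = {onset_class(f) for f in forms}

print(distinct_strings, distinct_classes)
```

Nothing here is clever: the software only "knows" the onsets belong together because a person typed that in. That is roughly the gap between a close transcription on paper and one in a database.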
We kind of want to preserve everything. One of the unique aspects of fieldwork data, specifically, is that every scribble is potentially precious. This is why lots of fieldwork manuals recommend striking through transcriptions rather than scribbling them out. (In fact, I remember Pam instructing us to do just that in the class.) Overwriting a form in a digital file is more like scribbling out: it replaces the earlier version. Maybe we should represent a “fieldwork word” something like this (using my own transcription as an example, because I was particularly indecisive!)?
{
  "form": "dʒamaj",
  "gloss": "hello_goodbye",
  "history": [
    "ǰamay",
    "dʒamay"
  ]
}
Is that enough information?
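One way to get "strike through, don't scribble out" behaviour in software is to make revision the only way to change a form. The following is a rough Python sketch under that assumption. The class name `FieldworkWord` and the method `revise` are made up for illustration; the fields mirror the record above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldworkWord:
    form: str
    gloss: str
    # Earlier transcriptions, oldest first.
    history: list = field(default_factory=list)

    def revise(self, new_form):
        """Replace the current form but keep the old one in history."""
        if new_form != self.form:
            self.history.append(self.form)
            self.form = new_form


# Replay my own indecision, in order.
word = FieldworkWord(form="ǰamay", gloss="hello_goodbye")
word.revise("dʒamay")
word.revise("dʒamaj")

# ensure_ascii=False keeps the IPA readable instead of \u escapes.
as_json = json.dumps(asdict(word), ensure_ascii=False, indent=2)
print(as_json)
```

This only records the order of the revisions. Who made each one, when, and why are all absent, which may be part of the answer to the question above.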
Anyway, in the class we came up with a standard orthography, and I believe the spelling we used would have standardized this word on «camay». This would become the representation of the word in the database.
{
  "phonetic": "dʒamaj",
  "form": "camay",
  "gloss": "hello_goodbye",
  "history": [
    "dʒamaj",
    "ǰamay",
    "dʒamay"
  ]
}
I just added that `phonetic` field because it seems reasonable. But I think it’s generally true that when people talk about “broad” and “close” transcriptions (which describe a cline, to be sure), those usages are often equivalent, respectively, to “orthographic” or “working orthographic” and “phonetic”.
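The step from the first record to the second can be written down as a small function. This is only a sketch in Python: `standardize` is a hypothetical name, and the choice to copy the last working transcription into both `phonetic` and `history` simply follows the example record above rather than any established convention.

```python
def standardize(record, orthographic):
    """Return a new record spelled in the working orthography.

    The last working transcription moves into "phonetic" and is also
    placed at the front of "history", so nothing is discarded.
    """
    return {
        "phonetic": record["form"],
        "form": orthographic,
        "gloss": record["gloss"],
        "history": [record["form"], *record["history"]],
    }


working = {
    "form": "dʒamaj",
    "gloss": "hello_goodbye",
    "history": ["ǰamay", "dʒamay"],
}

standard = standardize(working, "camay")
print(standard)
```

The function returns a new dictionary instead of editing the old one, so the pre-orthography record is still around if the orthography itself gets revised later.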
Another factor, somewhat orthogonal, is the question of whether we’re talking about textual or lexical documentation. What people call a “dictionary entry” is invariably somewhat “standardizing” in nature (with all the baggage that brings), but annotations in a “textual” context probably should note any interesting phonetic detail. After all, lots of phonetic processes only occur in fluent speech.
Anyway, just thinking about all this stuff and I’d be interested to know what people’s thoughts on this are. How do you handle digitization of early-stage fieldwork? Do you even do it? Or do you wait till a working phonemic orthography has shaken out?