🎙 💻 Automatic Speech Recognition and field recordings

There is a long-running debate in documentation about whether to “keep everything” or not. A prime candidate for “not” is the kind of recording you might make when working directly with a speaker in open-ended elicitation, rehearing, and the like. These recordings can be long, and there’s no way you’re going to transcribe the whole thing, so they’re pretty hard to use.

I’m sure other people have tried this, but I thought I’d bring up the topic: what about running those recordings through automatic speech recognition (ASR)? I did that with some old recordings today and it was pretty interesting to see what came out — they were one-on-one recordings, just me and one speaker, and of course the stuff that wasn’t English came out as garbage.

But the English side is quite usable, from both of us. For questions like “dagnabbit, I know we talked about ‘oysters’ at some point, and that’s when we came across that good verb…”, ASR can be a lifesaver.

In the case of Amazon Transcribe, it outputs a JSON file with timestamps down to the word level, which could, with a little massaging, be plonked into ELAN or something for further work.
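To make the “little massaging” concrete, here is a minimal Python sketch. It assumes the Transcribe output has the usual shape, a `results.items` list where spoken words have `type` set to `pronunciation` plus `start_time`, `end_time`, and an `alternatives` list; check that against your own file. The function names and the sample data are made up for illustration.

```python
import json

def load_transcribe(path):
    """Read a Transcribe-style JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def transcribe_words(doc):
    """Return (start_sec, end_sec, word) tuples for every spoken word."""
    words = []
    for item in doc["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items have no timestamps
        text = item["alternatives"][0]["content"]
        words.append((float(item["start_time"]), float(item["end_time"]), text))
    return words

def to_tsv(words):
    """One word per line: start, end, text (tab-separated)."""
    return "\n".join(f"{s:.3f}\t{e:.3f}\t{w}" for s, e, w in words)

def find_word(words, query):
    """Start times (in seconds) of words containing the query, case-insensitively."""
    q = query.lower()
    return [s for s, _e, w in words if q in w.lower()]

# Hypothetical sample in the assumed shape, not real output.
SAMPLE = {"results": {"items": [
    {"type": "pronunciation", "start_time": "12.5", "end_time": "13.0",
     "alternatives": [{"confidence": "0.98", "content": "oysters"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": "."}]},
    {"type": "pronunciation", "start_time": "13.25", "end_time": "13.5",
     "alternatives": [{"confidence": "0.91", "content": "Right"}]},
]}}

WORDS = transcribe_words(SAMPLE)
```

With a real file you would call `transcribe_words(load_transcribe("session.json"))`, where the filename is a placeholder. The tab-separated text from `to_tsv` should be the kind of thing ELAN's delimited-text import can read, though one annotation per word is probably too fine-grained for most purposes. `find_word(WORDS, "oyster")` tells you where in the recording to start listening.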

I think ultimately there might be better ways to manage the acquisition of field data in the first place (what if the annotations you wrote down during the recording were automatically timestamped?), but it’s probably true that we all have monolithic, more or less opaque recordings of this sort lying about, that we’d like to do something with.

Anybody else tried this before?


this sounds like a great workflow for people who work in English or other ASRable languages! (alas my recordings all pretty much only contain Syuba and Nepali)

For longer elicitation sessions I’d try to listen back to them through ELAN and mark up targeted examples that I’d make notes of. This let me confirm my notes (or change things), and also gave me a structured anchor to the elicitation if I came back to it.

It’s funny how I never thought to tell anyone about workflow choices like this before the forum made me realise they were actually choices I made!


This one is news to me!


Wat! Ohhh…


Surely this warrants a redirect.

Wait, what were we talking about?

Oh right!

Yeah, this bit is a drawback. With time, etc etc. Actually, in my Hiligaynon case, I was thinking that maybe I should have told the recognizer that the content was in English plus some other language with a similar phonology, maybe Indonesian, to see if that would have improved anything. (Tagalog probably would have, but Amazon doesn’t support that either.)

Right? I think this is actually reflective of a big problem in our field. Because we (’re forced to) rely on tools that are only designed for a subset of the kinds of things we actually do, there’s a lot of knowledge that we pass around kind of “guild-style”. It’s hard to externalize (let alone standardize or institutionalize) the workflows that we really use to get stuff done.

…I know her.

I would argue for “record everything and keep everything” – storage these days is so cheap and you never know what information is going to be of interest to others, including the community, now and into the future. A late colleague who was working in Australia in the 1970s was told by her supervisor to only record texts (story telling) and NOT the translations of them or elicitation sessions – she lived to regret that decision, and indeed it is a decision, as she had no way to go back and check discussion about meanings or contexts of use etc., let alone her own misunderstandings at the time which she later realised after doing further analysis. There is also lots of discussion in the contact language (English, Spanish, Nepali, Tagalog) etc. that may be of interest to people wanting to study contact varieties, or to the community and others for the content, rather than the linguistic form.


I for one agree with that… I’ve never seen how it makes sense to delete anything, really. It’s not too much of a burden to put some basic metadata in place, or better, some time-stamped notes of the sort that @laureng mentions, and then stash it somewhere (well, a few redundant somewheres!).

It seems likely that the generality of technologies like ASR will only improve in the future, after all: more languages, better accuracy. Who knows what kinds of uses will become much easier down the road?

Besides keeping everything, think about how to preserve it. My current sigline in emails is “Digital objects last forever–or five years, whichever comes first” (which I owe to Jeff Rothenberg). Seriously, no digital media–not hard drives, not SSDs, not tapes, not CDs or DVDs (especially the kind you write on your computer, as opposed to the ones you buy with pre-recorded content)–lasts forever. Most of those have a shelf life of a decade or two.

Your best bet is to find an archive repository. That way your data can (hopefully!) outlast you. Although I understand they can be hard to find if your data doesn’t come from the right part of the world, and they may have rules about required metadata. (Usually those rules are reasonable, although I’ve heard of exceptions.)


I had this idea recently! Specifically, I’ve been trying to use Google’s speech-to-text API, with the idea that it would at least make my elicitation & translation sessions (I opt to record almost everything) searchable. It’s been a bit challenging with my limited coding knowledge, especially to get something resembling an SRT file or suchlike, but I’m optimistic it would make life a lot easier.
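For the SRT part, here is a minimal Python sketch of one way to go from word-level timings to subtitle cues. It deliberately doesn't call any speech-to-text API: it assumes you have already pulled the recognizer's results into a list of `(start_seconds, end_seconds, word)` tuples. The function names, `max_words`, and `max_gap` are all made up for illustration.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=8, max_gap=1.0):
    """Group (start, end, word) tuples into numbered SRT cues.

    A new cue starts when the current one already holds max_words words,
    or when the silence before the next word is longer than max_gap seconds.
    """
    cues, current = [], []
    for word in words:
        if current and (len(current) >= max_words
                        or word[0] - current[-1][1] > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for number, cue in enumerate(cues, start=1):
        text = " ".join(w[2] for w in cue)
        timing = f"{srt_time(cue[0][0])} --> {srt_time(cue[-1][1])}"
        blocks.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(blocks)

# Hypothetical word timings, not real recognizer output.
EXAMPLE_WORDS = [(0.0, 0.4, "we"), (0.5, 0.9, "talked"), (3.0, 3.6, "oysters")]
EXAMPLE_SRT = words_to_srt(EXAMPLE_WORDS)
```

Splitting on pauses rather than on a fixed number of seconds tends to keep cues closer to utterance boundaries, which makes the result easier to scan in a subtitle viewer or in ELAN. The thresholds will need tuning for your recordings.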


Hi @mayhplumb, welcome! Please feel free to get into the nuts and bolts of your process if you like, there are lots of fellow code critters about.

I used Amazon’s ASR, which gives quite usable output in the form of a JSON document.

The system can also do some pretty interesting things, like try to identify speakers for instance, and it also gives timestamps down to the word level.

I’ve never tried to do this myself, but it’s also possible to expand the recognition vocabulary before running the recognition algorithm.

Thanks for this interesting post, Pat. This is perhaps not exactly what you were thinking of, but there have been a few initiatives to use ASR or other AI approaches to address the transcription bottleneck.


Prosodylab-Aligner (forced alignment)

ELPIS (still in development):

I’ve only used Prosodylab-Aligner for forced alignment myself. You can start with forced alignment to create training data for an ASR system, though.

ELPIS has been in development for a while now and I really hope they finish it soon. They switched from using Kaldi to ESPnet, so perhaps that is why it is taking a bit longer. My understanding is that LD practitioners could be integrating ASR tools into their workflow but much of the general purpose ASR software is challenging to use, and that is the gap that ELPIS is designed to fill.


Has anyone used Trint for this (or other tasks)? I ask because my university now has a site license. But it’s hard to know what languages they support.


Huh, interesting, hadn’t heard of this.

I think many know from experience already but I did some timed tasks with some RAs and the ‘transcription admin’ (identifying speech/non-speech and speakers) took as much if not more time than English transcription (Table 4: https://arxiv.org/pdf/2204.07272.pdf), so if the service gives you good enough speech activity detection/speaker diarisation as part of the ASR service, it might be worth looking into even if the transcriptions are all gibberish and you throw them away…

@cbowern: Not Trint, but I was recently talking to Ruth Singer and she says she’s been using Descript https://www.descript.com/ for speaker diarization and English transcription. She sent me an output file (.srt) which I was relatively easily able to wrangle into a tab-separated file in R for import into ELAN.
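For anyone who wants to try the same SRT-to-tab-separated wrangling without R, here is a minimal Python sketch. It is not the R code mentioned above. It assumes plain SRT cues (a number line, a `start --> end` timing line, then one or more lines of text) and makes no assumptions about how any particular app marks speakers. The function names and sample text are made up for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:03,250".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

def srt_to_rows(srt_text):
    """Return (start_sec, end_sec, text) for each cue in an SRT string."""
    rows = []
    text = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                rows.append((start, end, " ".join(lines[i + 1:])))
                break
    return rows

def rows_to_tsv(rows):
    """Tab-separated start, end, text, one cue per line."""
    return "\n".join(f"{s:.3f}\t{e:.3f}\t{t}" for s, e, t in rows)

# Hypothetical sample, not actual Descript output.
SAMPLE_SRT = """1
00:00:01,500 --> 00:00:03,250
So what about oysters?

2
00:00:04,000 --> 00:00:05,500
Oh, that one
is a good verb.
"""

ROWS = srt_to_rows(SAMPLE_SRT)
```

Write the output of `rows_to_tsv(ROWS)` to a `.txt` file and point ELAN's delimited-text import at it, telling it which columns hold begin time, end time, and annotation. If the export marks speakers inside the cue text, splitting that off into a tier column would be an extra step that depends on the app's exact format.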

@laureng: it’s not quite self serve-able like Trint/Descript/Otter/etc. (perhaps it could be with Elpis?) but Nepali is apparently ASRable (as of late last year). There’s a set of openly released ASR models for Indic languages (IndicWav2Vec, from AI4Bharat IndicNLP) with a 9-11% error rate (Table 3: https://arxiv.org/pdf/2111.03945.pdf).


I’ve been using Descript for a few months now to help transcribe some interviews that are mainly in English and, as Claire shows, it does export to ELAN quite well, even preserving speaker separation. However, in looking at our transcripts in ELAN now we have discovered a bit of a major problem: the timestamps get slightly out of alignment the further you look in time at each transcript. This is because sometimes when you edit the automated transcript in Descript by cutting out an entire word, it cuts some time out of the media file (which you import into Descript). We didn’t realise this though. So each of our 1 hour interviews has about 10 of these little accidental ‘edits’ and each media file in Descript is a little shorter than the original media file. It doesn’t seem possible to turn this ‘feature’ (bug?) off. I’m thinking now of going with another app. I emailed support at Otter.ai and this doesn’t seem like it can happen in Otter.ai - the original media file is never altered. So be wary of apps like Descript that are marketed as a full package for podcasters - a great way of editing media via transcripts! Not so good for us. Sonix looks like it might have the same problem as Descript from what I can garner from their help material.


Hi @Ruth, happy to see you here!

It’s pretty bonkers that Descript actually edits the audio from the transcription… it must have been intentional but dang, certainly qualifies as a bug in my book.
