🎙 💻 Automatic Speech Recognition and field recordings

There is a long-running debate in documentation about whether to “keep everything” or not. Prime candidates for “not” are the recordings you might make when working directly with a speaker in open-ended elicitation, rehearing sessions, and the like. These recordings can be long, there’s no way you’re going to transcribe the whole thing, and they’re pretty hard to use.

I’m sure other people have tried this, but I thought I’d bring up the topic: what about running those recordings through automatic speech recognition (ASR)? I did that with some old recordings today and it was pretty interesting to see what came out — they were one-on-one recordings, just me and one speaker, and of course the stuff that wasn’t English came out as garbage.

But the English side is quite usable, from both of us. For questions like “dagnabbit, I know we talked about ‘oysters’ at some point, and that’s when we came across that good verb…”, ASR can be a lifesaver.

In the case of Amazon Transcribe, it outputs a JSON file with timestamps down to the word level, which could, with a little massaging, be plonked into ELAN or something for further work.
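A sketch of what that massaging might look like (not exactly what I ran, just the general idea): it reads the `results.items` list from the Transcribe JSON (word-level `start_time`/`end_time`), groups words at pauses, and writes a tab-delimited file that ELAN’s CSV/tab-delimited import can read. The filenames and the pause threshold here are made up; adjust to taste.

```python
import json

# Sketch: convert an Amazon Transcribe JSON file into a tab-delimited file
# (begin time, end time, text) that ELAN's "Import CSV / Tab-delimited Text"
# function can read. Field names follow the Transcribe output I've seen;
# double-check them against your own file.

def transcribe_to_tsv(json_path, tsv_path, max_gap=1.0):
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)["results"]["items"]

    rows, words, start, end = [], [], None, None
    for item in items:
        if item["type"] != "pronunciation":   # skip punctuation-only items
            continue
        s, e = float(item["start_time"]), float(item["end_time"])
        if start is None:
            start = s
        elif s - end > max_gap:               # pause longer than max_gap: new annotation
            rows.append((start, end, " ".join(words)))
            words, start = [], s
        words.append(item["alternatives"][0]["content"])
        end = e
    if words:
        rows.append((start, end, " ".join(words)))

    with open(tsv_path, "w", encoding="utf-8") as f:
        for s, e, text in rows:
            f.write(f"{s:.3f}\t{e:.3f}\t{text}\n")

transcribe_to_tsv("session.json", "session.tsv")
```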

I think ultimately there might be better ways to manage the acquisition of field data in the first place (what if the annotations you wrote down during the recording were automatically timestamped?), but it’s probably true that we all have monolithic, more or less opaque recordings of this sort lying about that we’d like to do something with.
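(To make that parenthetical concrete: I’m imagining something as small as a note-taking script that stamps each note with the elapsed time since you hit record. Purely a hypothetical sketch, not something I actually use.)

```python
import time

# Hypothetical sketch: type notes during a recording session and have each one
# timestamped relative to when the script (and, ideally, the recorder) started.
start = time.monotonic()
with open("session_notes.tsv", "a", encoding="utf-8") as f:
    print("Taking notes; enter an empty line to stop.")
    while True:
        note = input("> ").strip()
        if not note:
            break
        elapsed = time.monotonic() - start
        f.write(f"{elapsed:.1f}\t{note}\n")   # seconds from start, then the note
```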

Anybody else tried this before?


This sounds like a great workflow for people who work in English or other ASRable languages! (Alas, my recordings pretty much all contain only Syuba and Nepali.)

For longer elicitation sessions I’d listen back to them in ELAN and mark up the targeted examples I’d made notes of. This let me confirm my notes (or change things), and also gave me a structured anchor back into the elicitation if I came back to it later.

It’s funny how I never thought to tell anyone about workflow choices like this before the forum made me realise they were actually choices I made!


This one is news to me!

https://en.wikipedia.org/wiki/Syuba

Wat! Ohhh…

https://en.wikipedia.org/wiki/Yolmo_language#Syuba_%28Kagate%29

Surely this warrants a redirect.

Wait, what were we talking about?

Oh right!

Yeah, this bit is a drawback, though presumably that will improve with time. Actually, in my Hiligaynon case, I was thinking that maybe I should have told the recognizer that the content was in English plus some other language with a similar phonology, maybe Indonesian, to see if that would have improved anything. (Tagalog probably would have helped, but Amazon doesn’t support that either.)

Right? I think this is actually reflective of a big problem in our field. Because we (’re forced to) rely on tools that are only designed for a subset of the kinds of things we actually do, there’s a lot of knowledge that we pass around kind of “guild-style”. It’s hard to externalize (let alone standardize or institutionalize) the workflows we actually use to get stuff done.

…I know her.

I would argue for “record everything and keep everything” – storage these days is so cheap, and you never know what information is going to be of interest to others, including the community, now and into the future. A late colleague who was working in Australia in the 1970s was told by her supervisor to only record texts (storytelling) and NOT the translations of them or elicitation sessions. She lived to regret that decision (and it is a decision): she had no way to go back and check discussion about meanings or contexts of use, let alone her own misunderstandings at the time, which she only realised after doing further analysis. There is also a lot of discussion in the contact language (English, Spanish, Nepali, Tagalog, etc.) that may be of interest to people wanting to study contact varieties, or to the community and others for the content rather than the linguistic form.


I for one agree with that… I’ve never seen how it makes sense to delete anything, really. It’s not too much of a burden to put some basic metadata in place, or better, some time-stamped notes of the sort that @laureng mentions, and then stash it somewhere (well, a few redundant somewheres!).

It seems likely that the generality of technologies like ASR will only improve in the future, after all: more languages, better accuracy. Who knows what kinds of uses will become much easier down the road?

Besides keeping everything, think about how to preserve it. My current sigline in emails is “Digital objects last forever–or five years, whichever comes first” (which I owe to Jeff Rothenberg). Seriously, no digital media–not hard drives, not SSDs, not tapes, not CDs or DVDs (especially the kind you write on your computer, as opposed to the ones you buy with pre-recorded content)–lasts forever. Most of those have a shelf life of a decade or two.

Your best bet is to find an archive repository. That way your data can (hopefully!) outlast you. Although I understand they can be hard to find if your data doesn’t come from the right part of the world, and they may have rules about required metadata. (Usually those rules are reasonable, although I’ve heard of exceptions.)


I had this idea recently! Specifically, I’ve been trying to use Google’s speech-to-text API, with the idea that it would at least make my elicitation & translation sessions (I opt to record almost everything) searchable. It’s been a bit challenging with my limited coding knowledge, especially to get something resembling an SRT file or suchlike, but I’m optimistic it would make life a lot easier.
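For context, this is roughly the shape of thing I’ve been trying to cobble together, assuming I can get each word with its start and end time out of the API (times in seconds; the ten-words-per-subtitle grouping is arbitrary, and none of this is verified against my real data yet):

```python
# Rough sketch: turn word-level timestamps (however you got them out of the
# speech-to-text API) into an SRT file. Assumes a list of (word, start, end)
# tuples with times in seconds.

def to_srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, srt_path, chunk=10):
    with open(srt_path, "w", encoding="utf-8") as f:
        for i in range(0, len(words), chunk):
            group = words[i:i + chunk]
            start, end = group[0][1], group[-1][2]
            text = " ".join(w for w, _, _ in group)
            f.write(f"{i // chunk + 1}\n")
            f.write(f"{to_srt_time(start)} --> {to_srt_time(end)}\n")
            f.write(text + "\n\n")

words_to_srt([("hello", 0.0, 0.4), ("world", 0.5, 0.9)], "session.srt")
```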


Hi @mayhplumb, welcome! Please feel free to get into the nuts and bolts of your process if you like; there are lots of fellow code critters about.

I used Amazon’s ASR, which gives quite usable output in the form of a JSON document.

The system can also do some pretty interesting things, like trying to identify speakers, and it also gives timestamps down to the word level.
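If it helps, here’s roughly how you can pull those two things apart in the JSON. The field names (`speaker_labels`, `segments`, `items`, and so on) match the output I’ve seen, but treat this as a sketch and check it against your own file:

```python
import json
from bisect import bisect_right

# Sketch: attach a speaker label to each recognised word in a Transcribe job
# that was run with speaker identification enabled.
with open("session.json", encoding="utf-8") as f:
    results = json.load(f)["results"]

# Each segment says which speaker was talking between which timestamps.
segments = results["speaker_labels"]["segments"]
starts = [float(seg["start_time"]) for seg in segments]

def speaker_at(t):
    """Return the speaker label ("spk_0", "spk_1", ...) active at time t."""
    i = bisect_right(starts, t) - 1
    return segments[max(i, 0)]["speaker_label"]

for item in results["items"]:
    if item["type"] == "pronunciation":
        t = float(item["start_time"])
        word = item["alternatives"][0]["content"]
        print(f"{t:8.2f}  {speaker_at(t)}  {word}")
```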

I’ve never tried to do this myself, but it’s also possible to expand the recognition vocabulary before running the recognition algorithm.
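I believe that’s done with a “custom vocabulary” that you register before starting the transcription job; something along these lines with boto3, though again I haven’t actually run this, so treat the names and details as approximate:

```python
import boto3

# Sketch only (untested): bucket, file, and vocabulary names are made up.
transcribe = boto3.client("transcribe")

# 1. Register the extra words/names you expect to come up in the session.
#    (In practice you'd wait for the vocabulary to finish processing before
#    starting the job.)
transcribe.create_vocabulary(
    VocabularyName="fieldwork-terms",
    LanguageCode="en-US",
    Phrases=["Hiligaynon", "reduplication", "oysters"],
)

# 2. Point a transcription job at audio already uploaded to S3, using that
#    vocabulary and asking for speaker labels as well.
transcribe.start_transcription_job(
    TranscriptionJobName="elicitation-2020-06-01",
    LanguageCode="en-US",
    MediaFormat="wav",
    Media={"MediaFileUri": "s3://my-bucket/elicitation-2020-06-01.wav"},
    Settings={
        "VocabularyName": "fieldwork-terms",
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 2,
    },
)
```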

Thanks for this interesting post, Pat. This is perhaps not exactly what you were thinking of, but there have been a few initiatives to use ASR or other AI approaches to address the transcription bottleneck.

Persephone
https://scholarspace.manoa.hawaii.edu/handle/10125/24793

Prosodylab-Aligner (forced alignment)
https://scholarspace.manoa.hawaii.edu/bitstream/10125/24763/1/johnson_et_al.pdf

ELPIS (still in development):

I’ve only used Prosodylab-Aligner for forced alignment myself. You can start with forced alignment to create training data for an ASR system, though.

ELPIS has been in development for a while now and I really hope they finish it soon. They switched from using Kaldi to ESPnet, so perhaps that is why it is taking a bit longer. My understanding is that LD practitioners could be integrating ASR tools into their workflows, but much of the general-purpose ASR software is challenging to use, and that is the gap ELPIS is designed to fill.
