Segmenting speech recordings, automatically and otherwise

This is a topic that I run into all ye olde time with colleagues:

Given a recording of audio, how should we segment it into smaller units of speech?

Purposefully non-committal on the phraseology there: this task can apply to segmenting into “utterances”, “sentences”, “phones”, whatever. I thought it would be cool to start a topic here where we simply collect links to tools. There are a lot out there. Some of them are rather obscure; others are familiar but do lots of other stuff too (looking at you, Audacity), so it’s not immediately obvious that they can even be used for segmentation of this kind.

Also, if anyone else is interested in this topic, it would be great to have separate tutorials and HOWTOs (or links to same) for some of the more advanced methods out there (forced alignment, for instance).

I’ll break the seal with a venerable one:

http://www.billposer.org/Software/SndBiteManual/Manual.html

Holy eighties colorscheme, Batman!

The docs for this (as always) are interesting. I haven’t managed to get it to run, myself.

ELAN is (and has always been) the standard software I use for transcription and translation, and is a tool I spend a LOT of my working time using.
https://archive.mpi.nl/tla/elan
All of my segmentation is done manually, but not necessarily by me. I’m in the privileged position of being able to employ five Tanzanian local researchers, who use ELAN to segment audio recordings of materials in their own languages, provide rough transcriptions of what they hear in working orthographies of the target languages, and then provide Swahili translations of these.

Thanks for mentioning ELAN @Andrew_Harvey!

One thing I have always wondered about (but never bothered to try) is ELAN’s built-in “recognizers” for automatic speech segmentation. (Has your team ever made use of this, by the way, Andrew?)

I found this interface rather opaque. Some of the recognizers are web-based services, but it seems that people mostly work with the MPI-PL Silence Recognizer, which is built into ELAN and seems to work well enough.
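
For anyone curious what a silence-based recognizer does in principle, here is a minimal Python sketch of energy-threshold segmentation. To be clear, this is not ELAN’s actual algorithm (I have not seen how the MPI-PL recognizer is implemented). The function name `segment_by_silence` and all the defaults (10 ms frames, a 0.02 RMS threshold, 200 ms minimum silence) are assumptions picked for illustration.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def segment_by_silence(samples, sample_rate, frame_ms=10,
                       threshold=0.02, min_silence_ms=200):
    """Return (start_s, end_s) spans of non-silent audio.

    samples: floats in [-1, 1]. The defaults are illustrative guesses,
    not values taken from ELAN or any other tool.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    # Classify each frame as loud (True) or quiet (False).
    loud = [rms(samples[i:i + frame_len]) >= threshold
            for i in range(0, len(samples), frame_len)]

    segments, start, quiet_run = [], None, 0
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx          # a new segment begins here
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silent_frames:
                # Close the segment at the first frame of the quiet run.
                segments.append((start, idx - quiet_run + 1))
                start, quiet_run = None, 0
    if start is not None:
        # The recording ended mid-segment; drop any short trailing quiet run.
        segments.append((start, len(loud) - quiet_run))

    frame_s = frame_len / sample_rate
    return [(s * frame_s, e * frame_s) for s, e in segments]
```

Pauses shorter than `min_silence_ms` stay inside a segment, which is what keeps brief stop closures from splitting a word in two. Real recordings with background noise will need the threshold tuned per file.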

This nice video tutorial by Dr. David Ruskin gives a step-by-step guide to automatic segmentation in both ELAN and Praat.

Dr. Ruskin compares both processes, and of course the resulting alignments differ slightly. In the end he prefers the Praat process to ELAN’s, apparently because the results are better. It does impose an extra import step to get that output into ELAN, though.

In any case, the video is very useful because it covers just enough to try out automatic segmentation in both tools, and you can follow along with it.

By the way, does anyone know Dr. Ruskin? He seems very docling; I’d like to invite him to join us. But I couldn’t find an email…

I’ve never tried the recognisers myself – but I can see how they could possibly save a lot of time!

Here’s another tool you can use to do segmentation, one which doesn’t get much attention in documentation circles for some reason: Audacity. I made a short (about 3 minutes) screencast demonstrating how to use it, but somehow it didn’t record the audio. Le sigh. Anyways, here it is:

I’ll try to fix it at some point, but if you feel like taking a look, there it is.

Ugh. There’s a “no microphone” option in QuickTime. Very helpful.

I actually met David at CoLang this summer! I can hit him up and see if he’d like to come hang out here 🙂

Oh cool!

He’s got some very cool stuff on his site, including interactive modeling demos and what looks like a work in progress with a real-time, browser-based spectrogram. (The source code is interesting, quite compact: https://guamlinguistics.com/spectrogram/spect.js.) A web component version of this would be very useful.

Wow, neat!

I wasn’t sure anyone would find that tutorial useful! Though I will say that I’ve actually referenced it myself just about every time I parse a file to see what the settings were that I used, lol.
My experience with the above methods (and I’d never heard of SndBite, so I may have to check that out) is that ELAN is fast, but could use more customization. I would love it if you could add a small buffer before and after the segment. This would really help catch soft sounds at the edges of words that otherwise get cut off.
Praat is my go-to, but it’s sloooow. And I don’t quite understand why it’s so slow, considering how quickly ELAN does the same thing. Praat can take several minutes on longer files and just looks like it’s frozen. It ALSO will overflow my computer’s memory and crash if I’m working on a longer file and have anything else open. So keep that in mind.
Audacity I played around with, but I didn’t care for the way it imports into ELAN. I use it for other pre-processing tasks, though, and find it works well for those purposes. Also, its spectrograms are prettier than Praat’s, for what it’s worth.
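
On the buffer idea mentioned above for ELAN: if a tool does not offer padding, it can be applied afterwards to the exported segment times. Here is a rough Python sketch, assuming the segments are already available as `(start, end)` pairs in seconds. The function name `pad_segments` and its parameters are hypothetical, not part of ELAN, Praat, or Audacity.

```python
def pad_segments(segments, pad_s, total_duration_s):
    """Widen each (start, end) span by pad_s seconds on both sides.

    Spans are clamped to [0, total_duration_s], and spans that touch or
    overlap after padding are merged so no audio is covered twice.
    """
    padded = []
    for start, end in sorted(segments):
        start = max(0.0, start - pad_s)
        end = min(total_duration_s, end + pad_s)
        if padded and start <= padded[-1][1]:
            # Overlaps the previous span: extend that one instead.
            prev_start, prev_end = padded[-1]
            padded[-1] = (prev_start, max(prev_end, end))
        else:
            padded.append((start, end))
    return padded
```

Something like `pad_segments(segments, 0.05, duration)` would add 50 ms on each side. Whether merging is what you want depends on the workflow: if each segment carries its own annotation, you may prefer to cap the padding at the midpoint of the gap instead.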

Anyway, so cool! Glad that someone else might have found that video useful.

Wow, Pat, that’s awesome! My website is woefully out of date, but thanks for the thumbs up. The realtime browser spectrogram tool has actually been reaaaaaallly great to use with students and has been a lot of fun to play around with. Humming, whistling, different vowels: it’s neat to see it as you produce it. I also just added different color schemes (the palette button in the upper-right corner). I was hoping to do a real-time vowel-space analysis and tried a first-pass estimate with peak-finding and the second derivative of the spectrum (which is fast and easy to calculate). It works for high vowels, where F1 and F2 are separate enough, but for low vowels the formants cross over each other and you can’t do it that way. I’ve been looking into doing a JavaScript LPC analysis, but that’s an “oof, headache” sort of project and I’m not sure if it will be fast enough to do in realtime. I think I can use some of the code from this post, but only when I get some free time, which is… never? The post: “Linear Least Squares: A Javascript Implementation and a Definitional Question”, by tiefengeist on Medium.
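
To make the peak-finding approach and its failure case concrete, here is a small Python sketch of picking spectral peaks with a discrete second derivative. It is not the code from the spectrogram tool. The function names, the `min_curvature` parameter, and the toy input are assumptions for illustration, and a real spectrum would need smoothing first.

```python
def second_derivative_peaks(spectrum, min_curvature=0.0):
    """Return indices of local maxima in a magnitude spectrum.

    A bin counts as a peak if it is a local maximum and the negated
    discrete second derivative there exceeds min_curvature (i.e. the
    peak is sharp enough).
    """
    peaks = []
    for i in range(1, len(spectrum) - 1):
        # Discrete second derivative: strongly negative at a sharp peak.
        d2 = spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
        is_local_max = (spectrum[i] > spectrum[i - 1]
                        and spectrum[i] >= spectrum[i + 1])
        if is_local_max and -d2 > min_curvature:
            peaks.append(i)
    return peaks


def bin_to_hz(bin_index, sample_rate, fft_size):
    """Centre frequency in Hz of an FFT bin."""
    return bin_index * sample_rate / fft_size
```

With two well-separated bumps in the spectrum this returns two peaks, which can be read as F1 and F2 candidates. When the bumps sit close together they merge into a single maximum and only one peak comes back, which is the low-vowel problem described above. LPC sidesteps this by fitting a model of the resonances rather than looking for visible maxima.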

Belatedly: I’ve used the silence recognizer a bunch and found it very useful. The language transcription for Australian languages was another story; I was never able to get usable output (that said, I probably could have tried harder, so some of this was likely on me, not on the system).

Another program that some people use for segmenting audio is SayMore. Its segmentation/transcription tab is actually a shell for ELAN. It outputs an .eaf file, but I’m not sure if its auto-segmenter or other code is taken from ELAN or self-developed. Here’s a site that explains how to segment/transcribe in SayMore: “4.3 Using SayMore” on CORSAL (Computational Resource for South Asian Languages).
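
Since both ELAN and SayMore end up writing .eaf files, and .eaf is plain XML, the segment boundaries are easy to pull out with a script whichever tool produced them. Below is a minimal Python sketch using only the standard library. The function name `read_eaf_segments` is made up for this example, and the sketch only handles time-aligned annotations (`ALIGNABLE_ANNOTATION`), not annotations on dependent tiers that refer to a parent annotation.

```python
import xml.etree.ElementTree as ET


def read_eaf_segments(root):
    """Yield (tier_id, start_ms, end_ms, text) for time-aligned annotations.

    root: the parsed ANNOTATION_DOCUMENT element of an .eaf file.
    """
    # Map time slot ids to millisecond values; unaligned slots have no value.
    slots = {ts.get("TIME_SLOT_ID"): ts.get("TIME_VALUE")
             for ts in root.iter("TIME_SLOT")}
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without concrete times
            text = ann.findtext("ANNOTATION_VALUE") or ""
            yield tier.get("TIER_ID"), int(start), int(end), text


# Usage with a hypothetical file name:
# root = ET.parse("session.eaf").getroot()
# for tier_id, start_ms, end_ms, text in read_eaf_segments(root):
#     print(tier_id, start_ms, end_ms, text)
```

Having the boundaries in this form makes it straightforward to compare what different segmenters do with the same recording.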