This is a topic that I run into all ye olde time with colleagues:
Given a recording of audio, how should we segment smaller units of speech?
Purposefully non-committal on the phraseology there — this task can apply to segmenting into “utterances”, “sentences”, “phones”, whatever. I thought it would be cool to start a topic here where we simply collect links to tools. There are a lot out there, and some of them are rather obscure, others are familiar but do lots of other stuff too (looking at you, Audacity), so it’s not immediately obvious that some tools even can be used for segmentation of this kind.
Also, if anyone else is interested in this topic, it would be great to have separate tutorials and HOWTOs (or links to same) for some of the more advanced methods out there (forced alignment, for instance).
ELAN is (and has always been) the standard software I use for transcription and translation, and is a tool I spend a LOT of my working time using. https://archive.mpi.nl/tla/elan
All of my segmentation is done manually, but not necessarily by me. I’m in the privileged position of being able to employ five Tanzanian local researchers, who use ELAN to segment audio recordings of materials in their own languages, provide rough transcriptions of what they hear in working orthographies of the target languages, and who then provide Swahili translations of this.
One thing I have always wondered about (but never bothered to try) is ELAN’s built-in “recognizers” for automatic speech segmentation. (Has your team ever made use of this, by the way, Andrew?)
I found this interface rather opaque. Some of the recognizers are web-based services, but it seems that people mostly work with the MPI-PL Silence Recognizer, which is built into ELAN and seems to work well enough.
In this nice video tutorial by Dr. David Ruskin, there is a step-by-step guide to automatic segmentation in both ELAN and PRAAT:
Dr. Ruskin compares both processes, and of course the alignments they output differ slightly. In the end he prefers the PRAAT process to the ELAN one, apparently because the results are better. It does impose an extra import step to get that output into ELAN, though.
In any case, the video is very useful because it covers just enough to try out automatic segmentation in both tools, and you can follow it along to try out both.
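For anyone curious what silence-based segmentation boils down to, here is a minimal sketch in Python. To be clear, this is not the algorithm used by the MPI-PL Silence Recognizer or by PRAAT; it is a generic energy-threshold approach, and the function name, the parameter names, and the default values (`frame_ms`, `threshold`, `min_silence_ms`) are my own assumptions for illustration.

```python
import math

def find_speech_segments(samples, sample_rate, frame_ms=20,
                         threshold=0.02, min_silence_ms=300):
    """Return (start_sec, end_sec) spans whose frame RMS is above `threshold`.

    `samples` is a sequence of floats in [-1, 1]. A span is closed only
    after at least `min_silence_ms` of consecutive quiet frames, so short
    pauses inside an utterance do not split it.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silence_frames = max(1, int(min_silence_ms / frame_ms))

    # Mark each frame as loud or quiet using its RMS energy.
    loud = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        loud.append(rms >= threshold)

    segments = []
    start = None      # index of the first loud frame of the open span
    silent_run = 0    # quiet frames seen since the last loud frame
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end = idx - silent_run + 1  # first quiet frame after the span
                segments.append((start * frame_len / sample_rate,
                                 end * frame_len / sample_rate))
                start = None
                silent_run = 0

    # Close a span that is still open when the audio ends.
    if start is not None:
        end = len(loud) - silent_run
        segments.append((start * frame_len / sample_rate,
                         min(end * frame_len, len(samples)) / sample_rate))
    return segments
```

In practice the threshold and the minimum silence duration are the two settings you end up tuning per recording, which matches the experience of adjusting settings in either tool.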
By the way, does anyone know Dr. Ruskin? He seems very docling, and I’d like to invite him to join us. But I couldn’t find an email…
Here’s another tool you can use to do segmentation, one which doesn’t quite get the attention in documentation circles for some reason: Audacity. Made a short (about 3 minutes) screencast demonstrating how to use it, but somehow it didn’t record the audio. Le sigh. Anyways, here it is:
I’ll try to fix it at some point, but if you feel like taking a look there it is.
Ugh. There’s a “no microphone” option in Quicktime. Very helpful.
I wasn’t sure anyone would find that tutorial useful! Though I will say that I’ve actually referenced it myself just about every time I parse a file, to see what settings I used, lol.
My experience with the above methods (and I’d never heard of SndBite, so I may have to check that out) is that ELAN is fast, but could use more customization. I would love if you could add a small buffer before and after the segment. This would really help catch soft sounds at the edges of words that otherwise get cut off.
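The buffer idea is easy to apply as a post-processing step once the segment boundaries are out of the tool. Here is a minimal Python sketch, assuming the segments are already available as sorted, non-overlapping `(start, end)` pairs in seconds. The function name and the choice to stop each buffer at the midpoint of the gap between neighbouring segments (so padded segments never overlap) are my own assumptions, not something ELAN provides.

```python
def pad_segments(segments, pad_sec, total_duration):
    """Widen each (start, end) segment by `pad_sec` on both sides.

    Padding is clamped to the start and end of the recording and to the
    midpoint of the gap between neighbouring segments, so the padded
    segments stay non-overlapping. Assumes `segments` is sorted by time.
    """
    padded = []
    last = len(segments) - 1
    for i, (start, end) in enumerate(segments):
        # Earliest and latest times this segment is allowed to reach.
        lo = 0.0 if i == 0 else (segments[i - 1][1] + start) / 2
        hi = total_duration if i == last else (end + segments[i + 1][0]) / 2
        padded.append((max(lo, start - pad_sec), min(hi, end + pad_sec)))
    return padded
```

For example, `pad_segments([(0.5, 1.0), (3.0, 4.0)], 0.25, 5.0)` gives `[(0.25, 1.25), (2.75, 4.25)]`.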
Praat is my go-to, but it’s sloooow. And I don’t quite understand why it’s so slow, considering how quickly ELAN does the same thing. Praat can take several minutes on longer files and just looks like it’s frozen. It ALSO will overflow my computer’s memory and crash if I’m working on a longer file and have anything else open. So keep that in mind.
Audacity I played around with, but I didn’t care for the way it imports into ELAN. I use it for other pre-processing tasks, though, and find it works well for those purposes. Also, its spectrograms are prettier than Praat’s, for what it’s worth.
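If the import step is the sticking point, it may help to know that an exported Audacity label track is plain tab-separated text: start time, end time, and label on each line, with times in seconds. Here is a minimal Python sketch for reading that text into tuples that can then be rewritten in whatever layout your ELAN import expects. The function name is hypothetical, and skipping lines that begin with a backslash (the extra frequency-range lines Audacity writes for spectral selections) is my assumption about what most people want.

```python
def parse_audacity_labels(text):
    """Parse exported Audacity label text into (start, end, label) tuples."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and the backslash-prefixed frequency-range lines.
        if not line.strip() or line.startswith("\\"):
            continue
        parts = line.split("\t")
        start, end = float(parts[0]), float(parts[1])
        label = parts[2] if len(parts) > 2 else ""
        rows.append((start, end, label))
    return rows
```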
Anyway, so cool! Glad that someone else might have found that video useful.
Wow, Pat, that’s awesome! My website is woefully out of date, but thanks for the thumbs up. The realtime browser spectrogram tool has actually been reaaaaaallly great to use with students and has been a lot of fun to play around with. Humming, whistling, different vowels: it’s neat to see it as you produce it. I also just added different color schemes, via the palette button in the upper-right corner. I was hoping to do a real-time vowel-space analysis and tried a first-pass estimate with peak-finding and the second derivative of the spectrum (which is fast and easy to calculate). It works for high vowels, where F1 and F2 are separate enough, but for low vowels the formants cross over each other and you can’t do it that way. I’ve been looking into doing a JavaScript LPC analysis, but that’s an “oof headache” sort of project and I’m not sure if it will be fast enough to do it in realtime. I think I can use some of the code from this post, but only when I get some free time, which is… never? Linear Least Squares: A Javascript Implementation and a Definitional Question | by tiefengeist | Medium
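For readers wondering what peak-finding with the second derivative of a spectrum might look like, here is a rough Python sketch. It is only my guess at the general idea, not the code behind the browser tool: the function names and the `min_curvature` parameter are hypothetical, and the spectrum is assumed to be a plain list of magnitudes, one per frequency bin.

```python
def second_difference(spectrum):
    """Discrete second derivative; entry j corresponds to spectrum[j + 1]."""
    return [spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
            for i in range(1, len(spectrum) - 1)]

def find_peaks(spectrum, min_curvature=0.0):
    """Return bin indices where the spectrum is most sharply peaked.

    A bin counts as a peak when the second difference there is negative
    (beyond `min_curvature`) and is a local minimum of the second difference.
    """
    d2 = second_difference(spectrum)
    peaks = []
    for j in range(1, len(d2) - 1):
        if d2[j] < -min_curvature and d2[j] <= d2[j - 1] and d2[j] < d2[j + 1]:
            peaks.append(j + 1)  # shift back to an index into `spectrum`
    return peaks
```

When two formants sit close together, their bumps merge into one broad hump with a single curvature minimum, which is consistent with the failure on low vowels described above.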
Belatedly: I’ve used the silence recognizer a bunch and found it very useful. The language transcription for Australian languages was another story; I was never able to get usable output (that said, I probably could have tried harder, so some of this was likely on me, not on the system).
Another program that some people use for segmenting audio is SayMore. Their segmentation/transcription tab is actually a shell for ELAN. It outputs an .eaf file, but I’m not sure if their auto-segmenter or other code is taken from ELAN or self-developed. Here’s a site that explains how to segment/transcribe in SayMore: 4.3 Using SayMore | CORSAL - Computational Resource for South Asian Languages.