I got a query from someone I work with in Australia about audio searching. He has a large collection of songs recorded with family members and would like to be able to search them for particular words. As far as I know, there are no transcripts. I remember seeing some papers about transcription without pretraining from early in the pandemic, but I haven’t seen anything recently, and it wasn’t clear what was needed (presumably songs are going to be extra tricky).
From what I understand, Elpis is the state of the art in speech-to-text for low-resource languages; see "Welcome to the Elpis ASR documentation!" (Elpis 1.0.6 documentation).
It still requires transcripts, though (in the form of ELAN files).