Building an audio lexicon with CLDF

Here’s a rather technical (and long) explanation of what I meant above by “coding to the interface”:

Since version 1.1, CLDF includes a MediaTable and a mediaReference property to link rows in other tables to media files.

This can be used to e.g. link audio files to forms in a Wordlist, as is done in Henrik Liljegren’s Hindukush data.
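To illustrate the linkage (with made-up rows, not the actual Hindukush data): the MediaTable lists the media files, and a column in the FormTable carrying the mediaReference property points at them. A sketch might look like this:

```
# media.csv (MediaTable)
ID,Name,Media_Type,Download_URL
hand-audio,hand.mp3,audio/mpeg,https://example.org/audio/hand.mp3

# forms.csv (FormTable, with Media_ID declared as mediaReference)
ID,Language_ID,Parameter_ID,Form,Media_ID
1,palula,hand,hast,hand-audio
```

Since the spec only fixes the property, not the column name, tools can discover the link in any conformant dataset by looking up the mediaReference property in the metadata.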

In the following, I’ll try to show how such a specified data format helps with tool development (enabling things such as the audio glossary mentioned above).

The price of a flexible spec is higher implementation complexity. So flexible URL discovery and support for different URL schemes (such as http:, file: or data: URLs) are best implemented once, in a basic CLDF library such as pycldf, rather than in bits and pieces in higher-level tools. As of version 1.26, pycldf provides a Python API that abstracts away the details of the spec and instead offers straightforward methods on File objects (which can easily be derived from rows in a MediaTable). With this API, implementing a commandline program to download all media files for a dataset boils down to a handful of lines of code.

So with a specification and implementation in place, we can go about putting our audio glossary together. Since HTML creation is typically done using templates, the process has two steps: assembling the data for the template, and writing the template itself.
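For instance, the assembling step might yield pairs of forms and audio URLs, which a (deliberately simplistic, hypothetical) template turns into an HTML table of forms with audio players:

```python
import html

# Hypothetical data, as it might be assembled from a Wordlist:
# pairs of (form, URL of the downloaded audio file).
items = [
    ('hast', 'media/palula_hand.mp3'),
    ('host', 'media/dameli_hand.mp3'),
]

TEMPLATE = '<html><body><table>\n{rows}\n</table></body></html>'
ROW = '<tr><td>{form}</td><td><audio controls src="{url}"></audio></td></tr>'

# Escape the data before interpolating it into the markup.
page = TEMPLATE.format(rows='\n'.join(
    ROW.format(form=html.escape(form), url=html.escape(url, quote=True))
    for form, url in items))
print(page)
```

Real tools would use a proper template engine rather than string formatting, but the division of labour is the same: data assembly on one side, presentation on the other.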

Then, creating an HTML page like the attached file is as “easy” as running

```shell
cldfbench cldfviz.audiowordlist liljegrenhindukush/cldf/cldf-metadata.json cldf:name=hand -o test.html
```

Note: The cldfviz functionality demonstrated above isn’t released yet. Thus, you’d need to install cldfviz from a repository clone to reproduce this locally.

test.html (22.7 KB)


I should add that even though this process seems convoluted - having to do stuff in three places (spec, pycldf, cldfviz), in particular if the same person does all three - it feels pretty liberating to me that this does not result in one-off code, but helps me work with tens or hundreds more datasets.
