The simple elicitation web app linguists didn't know they needed

A lot of attention has been given to minimizing the transcription bottleneck associated with processing natural speech texts, perhaps because it is a good problem for AI/ML to solve. Much less attention has been paid to minimizing the manual processing tasks associated with linguistic elicitation, which for many linguists include manually segmenting audio and either copying and pasting text or typing it directly into an application like ELAN or Praat.

I’ve previously tried to get at this problem by creating audio which can be more easily segmented automatically (e.g. using “Voice Activity Detection” or a function like Praat’s “Annotate to TextGrid (silences)”). This requires training speakers to produce elicited items in a very consistent manner, maximizing the signal-to-noise ratio (e.g. using a headset microphone), and creating a recording specifically for productions of elicited items with no metalinguistic information. It also helps, of course, if you create digital data directly by typing transcriptions into a spreadsheet app rather than writing by hand. Putting the timecode data and the text data together still requires a separate processing step, however. Not all speakers are good at producing repetitions consistently, either, so an automated system can end up parsing segments of audio which contain mistakes, and the method is effectively restricted to speakers who have a natural aptitude for elicitation.

Enter the linguistic elicitation web app. It’s actually a very simple idea, and I imagine that anyone reasonably familiar with web programming could put it together in just a few hours. Here are the basic features:

  1. A dynamic spreadsheet-like table with form boxes, something like this quick mock-up I made:
  2. The ability to add and remove columns/rows using buttons.
  3. A CSV/JSON import function
  4. A stopwatch/timer function. This is the critical piece! The user starts the timer at the same time they start the audio recording, so that the two are more or less synced (with a degree of accuracy sufficient for aligning text at the utterance/elicited-item level). The standard start/stop + reset buttons should work, perhaps with a confirmation dialogue before stopping or resetting, to avoid mistakes.
  5. An “insert time” function which takes the current time and inserts it into either the start time or end time. This creates the timecode data for each utterance. Ideally this would automatically advance, so that if all is going well the linguist simply hits the “insert time” button before and after each utterance/elicitation item.
  • Critically, the user also needs to be able to remove timecode data in the event that they want to redo an item.
  • If the speaker produces a slight variation of the target data which the linguist finds acceptable (and thus not requiring a total redo), they can directly modify the transcription. This should be done in the moment, not at a later time.
  6. An export function which can produce CSV, JSON, .eaf, .flextext, and .TextGrid based on the data entered into the forms.
  7. An adaptive/responsive design which makes it possible to use the app on either a computer or a mobile device.
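To make the “insert time” idea concrete, here is a rough sketch of the data model behind features 4 and 5. It is written in Python for readability rather than as the app's actual code, and every name in it (`Row`, `ElicitationTable`, `insert_time`, `clear_times`) is hypothetical. The only assumption is that the current stopwatch value is available in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    translation: str
    transcription: str = ""
    notes: str = ""
    start: Optional[float] = None  # seconds since the timer was started
    end: Optional[float] = None


class ElicitationTable:
    """Rows of elicited items plus a cursor marking the row being timed."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.cursor = 0  # index of the row that receives the next time

    def insert_time(self, now: float) -> None:
        """Fill the start slot, then the end slot, then advance to the next row."""
        if self.cursor >= len(self.rows):
            return  # every row already has both times
        row = self.rows[self.cursor]
        if row.start is None:
            row.start = now
        else:
            row.end = now
            self.cursor += 1

    def clear_times(self, index: int) -> None:
        """Remove a row's timecodes and move the cursor back so it can be redone."""
        self.rows[index].start = None
        self.rows[index].end = None
        self.cursor = index


# Hypothetical session: align the first item, then decide to redo it.
table = ElicitationTable([Row("item 1", "sitaweza"), Row("item 2")])
table.insert_time(12.4)   # pressed just before the utterance
table.insert_time(13.9)   # pressed just after; the cursor moves to "item 2"
table.clear_times(0)      # the speaker stumbled, so wipe the times and try again
```

Keeping a single cursor means the linguist only ever presses one button during a smooth session, and a redo is just a matter of moving the cursor back.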

What’s cool about it
The first cool thing about this method is that after you have exported your data there are no more steps. Aside from transferring your audio recording to your computer, you are done! You can literally open the .TextGrid/.eaf/.flextext immediately and run searches across the time-aligned data.

A second cool thing about it is that it doesn’t require specialized training for the speaker, and you can still work with a less-than-ideal signal-to-noise ratio (e.g. a noisy environment or a lower-quality microphone).

Third cool thing: you can create a recording with both metalinguistic information and linguistic data together, without having to worry about distinguishing between the two.

Finally, a fourth thing that is cool about it is that you can use the same app on either a computer or a mobile device with a Bluetooth keyboard. This is really critical because it makes it possible to use the method literally anywhere in the world.

The take-home message
This could have big implications for fieldworkers and field methods courses, and would strongly encourage descriptive linguists to adopt reproducible research practices (e.g. making their primary data available).

It also should be relatively easy to create…so who wants to make it with me?! :smiley:


I am so stoked about this post. Need to work up my thoughts! Thank you!

Here is a live and messy mock-up of the timer function with a dynamic-ish table. Start the timer, select the first radio button, and then click “Assign” to assign the current stopwatch time to the data associated with the selected radio button and progress the selection to the next radio button. Currently, to re-assign a time you have to press the “Clear Selected Time(s)” button, because I can’t figure out how to uncheck the other radio buttons when checking an unchecked button.

This was mostly made by mashing together code from W3Schools.

While working on this I realized that in order for the ELAN/FLEx export functions to work correctly the app would need a dropdown menu for each column, to specify whether it is translation, transcription, or “other” (notes, etc.), and the ISO 639-3 code would need to be entered, too.

Today I added a basic JSON export feature and fixed the radio button selection so that you can more easily go back to redo a previous word. I also added fields for the ISO language codes for the three columns (which are now the only available columns, for simplicity’s sake).

The JSON export appears to be printed out on one line; I’ll have to fix that.
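For what it's worth, single-line JSON is usually just a matter of calling the serializer without an indent setting. Here is a minimal illustration in Python (not the app's own code, and the field names are made up). In browser JavaScript the third argument of `JSON.stringify` plays the same role as `indent` does here.

```python
import json

# Hypothetical export data: one aligned row.
rows = [
    {"translation": "item 1", "transcription": "sitaweza", "start": 12.4, "end": 13.9},
]

compact = json.dumps(rows, ensure_ascii=False)           # everything on one line
pretty = json.dumps(rows, ensure_ascii=False, indent=2)  # one key per line

print(pretty)
```

`ensure_ascii=False` keeps IPA and other non-ASCII characters readable in the output instead of turning them into `\u` escapes.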

Next up, .TextGrid export!

A couple of updates. First, and probably most interesting, is the .TextGrid export function. This was generally straightforward, aside from the fact that Praat interval tiers require an end time, which ideally would be equivalent to the duration of the audio recording. Although not 100% accurate, this can be approximated by pressing the “stop” button on the timer at the same time you stop the recording. Here is an example .TextGrid output and the corresponding .WAV file. I haven’t done a long recording yet to check for drift. I have to say, though, it is quite thrilling to see the time-aligned text data within ~30 seconds of finishing the recording.
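For anyone curious what the export involves, here is a rough sketch of writing a single interval tier in Praat's long TextGrid text format. It is in Python rather than the app's own code, and `to_textgrid` and its parameters are hypothetical names. It assumes the segments are sorted and non-overlapping, and it pads the gaps between them with empty intervals, since an interval tier has to cover the whole time span.

```python
def to_textgrid(segments, total_duration, tier_name="transcription"):
    """Build a long-format Praat TextGrid with one interval tier.

    segments: list of (start, end, text) tuples in seconds, sorted, non-overlapping.
    total_duration: timer value when "stop" was pressed (approximates audio length).
    """
    # Pad the gaps so the intervals tile the tier from 0 to the end.
    intervals = []
    cursor = 0.0
    for start, end, text in segments:
        if start > cursor:
            intervals.append((cursor, start, ""))
        intervals.append((start, end, text))
        cursor = end
    xmax = max(total_duration, cursor)  # never end before the last segment does
    if cursor < xmax:
        intervals.append((cursor, xmax, ""))

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (lo, hi, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # Praat escapes a quote by doubling it
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {lo}",
            f"            xmax = {hi}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# One aligned item in a four-second recording gives three intervals:
# silence, the item, silence.
print(to_textgrid([(1.0, 2.5, "sitaweza")], 4.0))
```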

The app:

I made a couple of additional changes related to time. I updated the timer to include a drift history and predictive correction code, and I took the milliseconds out of the display in order to decrease the drift. I also changed the assignment of time values for text data so that it runs an independent time check rather than taking the values from the visual timer display. This makes it possible to get milliseconds for the assigned times without having to update a millisecond display.
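The “independent time check” idea can be sketched as a stopwatch that never counts display ticks at all, and instead subtracts a stored start reading from a monotonic clock whenever a time is needed. To be clear, this is a simpler technique than the drift history and predictive correction mentioned above, it is written in Python rather than the app's own code, and the `Stopwatch` class and its method names are hypothetical.

```python
import time


class Stopwatch:
    """Elapsed time comes from a monotonic clock, not from counting display ticks."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the class can be tested
        self._started_at = None    # clock reading at the most recent start
        self._accumulated = 0.0    # time banked from earlier start/stop cycles

    def start(self):
        if self._started_at is None:
            self._started_at = self._clock()

    def stop(self):
        if self._started_at is not None:
            self._accumulated += self._clock() - self._started_at
            self._started_at = None

    def elapsed(self):
        """Seconds elapsed, read fresh from the clock on every call."""
        running = 0.0 if self._started_at is None else self._clock() - self._started_at
        return self._accumulated + running

    def display(self):
        """Whole-second h:mm:ss string for the on-screen timer."""
        total = int(self.elapsed())
        return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

With this shape, the display can refresh once a second (or even lag a little) without affecting the assigned times, because assignment calls `elapsed()` directly and keeps the sub-second precision.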

Dang, you are in high gear! How do you access the TextGrid export from the UI?

Oh whoops, apparently I forgot to change the button text from “Export JSON” to “Export Data”! :stuck_out_tongue:

But even better would be a separate “Export TextGrid” button, so I put that in instead (now updated).

Question: how do you imagine the relationship between transcription and assigning timestamps? Are the transcription and translation fields filled out beforehand, or is that done “between” the assignments?

I could imagine multiple workflows that might be useful, depending on your starting data and output goals. Did you have a particular workflow pattern in mind or did you design this interface as “non-committal” in that regard?

Thanks, Pat, that is a :star2:great question:star2: and you are right that there are many ways you could go about it. In the previous elicitation methods I’ve used, I would divide the session into two parts (with an initial preparation stage):

  1. Preparation - Prepare elicitation stimuli, such as a list of words or phrases in a contact language (i.e. the “Translation” column in the app).
  2. Exploratory elicitation - Work with the speaker to create transcriptions, collect metalinguistic information, and add or change stimuli as needed. (you can optionally record this stage)
  3. Targeted elicitation - Work with the speaker to create recordings of items which were transcribed during the exploratory stage. The transcriptions for this stage are based on the transcriptions from that stage, but may be changed as necessary.

The benefit of working this way is that each stage orients the linguist and speaker towards a particular goal: either creating text data and metadata, or creating a high-quality recording to have text data aligned to. In the previous methods I’ve used, this was actually necessary, as the audio recording needed to be very structured and as such could not include the dialogic exchange of the exploratory session. The downside to working this way is that it takes more time, and the speaker can become tired during extended periods of rapid-fire targeted elicitation.

With an app like the one I started working on, you could in theory put the two stages back together. This might save time and avoid speakers getting so tired. It would also indirectly give some structure to the audio segments of exploratory exchanges by ending each one with a time-aligned segment (i.e. each time-aligned recording of an elicited item would be directly preceded by audio which includes the dialogic exchange about that item).

One key piece here, though, is that the speech that is produced during the aligned segments needs to have corresponding text data which matches 100% (or to the best of your ability). That means that if, during an exploratory exchange, you get sitaweza, but during the recording of the aligned segment you get sintaweza, you either need to re-record/assign and ask the speaker to say sitaweza, change the transcription for that segment to sintaweza, or add another row for sintaweza. You can have metalinguistic notes which explain the existence of these two variants, but you can only have one of them in the transcription associated with a given segment of audio. Also, data correction like this should be done in the moment and not at a later time, to avoid the creation of a time-consuming data correction task for your future self or someone else.

I do think that it would be best for any elicitation app to have the flexibility to support either the two-stage or single-stage approach, and as far as I can tell that is true of this one. Personally I’m a bit on the fence about the one-stage approach, because I do appreciate having a nice clean recording of elicited items without the dialogic exchange, but I do see the benefits of putting the two together.

I added EAF and FlexText support:

Still need to add CSV and clean up the code.

Some new features today (click here to try them out!):

  1. CSV export. All export formats are now also exported with a single button (FlexText/TextGrid/EAF/CSV/JSON).
  2. Separated .html, .css, and .js files for easier human reading :smiley:
  3. A basic CSV import function, which reads text from the first three columns and adds it to the Translation, Transcription, and Notes input boxes. This makes it easier to create data in a spreadsheet application and import it into the app for time alignment/elicitation.
  4. When using a computer, the “Align time” button now works when you press the button and again when you release, but on mobile devices it works only once each time you tap the button.
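The CSV import described in item 3 boils down to something like the following sketch. Again, this is Python rather than the app's own code, and `import_rows` is a hypothetical name. It assumes the file has no header row and that any columns past the third can be ignored.

```python
import csv
import io


def import_rows(csv_text):
    """Read the first three columns of each CSV row as translation, transcription, notes."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not any(cell.strip() for cell in record):
            continue  # skip blank lines
        padded = (record + ["", "", ""])[:3]  # tolerate rows with fewer than three cells
        rows.append(dict(zip(("translation", "transcription", "notes"), padded)))
    return rows


# Hypothetical spreadsheet export: a full row, then a row with only a stimulus.
print(import_rows('item 1,sitaweza,"note, with comma"\nitem 2\n'))
```

Using a real CSV parser rather than splitting on commas matters here, because notes and translations often contain commas inside quoted cells.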

Also, some minimal documentation.
