Comparative Pahoturi River Website

Fixed the table! Yamfinder makes sense now :slight_smile:

Haha great! I was amazed by a language that had the form /jamfinder/ for so many glosses :rofl:

I’ll be back later this afternoon to talk proverbial turkey. :turkey:

So how are the audio files named?

Agob_Kibuli_KUK_KL_canoe.wav (Variety_word.wav). I avoided putting the transcription in the file name because (a) the transcriptions change, and (b) IPA symbols.
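
For anyone scripting against these names, here is a minimal sketch of reading the pattern back. It assumes the gloss is always the final underscore-separated piece and that everything before it identifies the variety (including any speaker or transcriber codes). `parse_audio_name` is a hypothetical helper, and a multi-word gloss containing an underscore would need a different rule.

```python
from pathlib import Path
from typing import NamedTuple


class AudioName(NamedTuple):
    variety: str  # everything before the last underscore, e.g. "Agob_Kibuli_KUK_KL"
    gloss: str    # the final piece, e.g. "canoe"


def parse_audio_name(filename: str) -> AudioName:
    """Split a 'Variety_word.wav' file name into its variety and gloss parts."""
    stem = Path(filename).stem  # drop the .wav extension
    variety, sep, gloss = stem.rpartition("_")  # split on the LAST underscore
    if not sep or not variety or not gloss:
        raise ValueError(f"unexpected file name: {filename!r}")
    return AudioName(variety, gloss)


print(parse_audio_name("Agob_Kibuli_KUK_KL_canoe.wav"))
```

Because the transcription is not part of the name, the parsed pair can act as a stable key into the spreadsheet even when transcriptions are revised.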

Indeed, makes sense, thanks.

Update: I’ve identified an RA who wants to help get this website up and running as a summer project. I’m not sure how much HTML she knows, but if we point her in the right direction, I think she could do it!


Well, it’s been a year :woman_shrugging: but I’ve found new inspiration to continue this project.

  1. The Yamfinder database that kicked off this data collection is now back up and running. It’s not the exact format that I would like to see for the Pahoturi River database (e.g., Sound Comparisons...), but it would be great if I could both get my data in good shape for inclusion on the Yamfinder site and even pull from the Yamfinder site to get the data in a more useful format.

  2. I just hired an RA for 120 hours of work to make this happen.

What I know about the Yamfinder site:

  1. You can easily download data in a .csv file, but it doesn’t look like you can download the audio with it. I’ve written to Matt (Carroll) to see if I can get the audio pulled too.

Very cool @katelynnlindsey! Was just looking at this and chatting with @meaganvigus about it. I wonder if Matt might be interested in joining us here to talk about the project too?

I’ll send him the link :slight_smile:

Hi Everybody! Thanks for inviting me to join the conversation.

Just a quick (edit: ha!) post about datasets, websites and how we serve data. After a few years working on various large data collection projects, I have become very skeptical of the usefulness of sites like soundcomparisons or the old version of Yamfinder (which preceded soundcomparisons). This is not to diminish the achievements of these sites; they are beautiful sites with some amazing features.

However, I have found that once a researcher starts to really analyse the data, they will typically download it into their own workflow, for the following reasons (and many others):

  1. There are so many tools for analysing and visualising data these days that there is no point trying to replicate them on a website. You can do more with just Excel than you would ever want to include on a website, never mind the data analytics power of Python, R, Watson, SPSS, etc…
  2. Each person has their own workflow derived from the way they think, their research questions and the types of patterns visible in the data.
  3. Any level of automation / custom scripting will require downloading the data.

IMO this leaves these websites better suited for casually browsing the data and serving as public-facing points for comparative projects. Consider something like the 50 Words Project.

The old Yamfinder site, which we started in 2012, took hundreds of hours of custom development and iteration, and in the end most people just exported the data to Excel (#facepalm). My current philosophy is that datasets are better published on places like Zenodo or GitHub, where they get a DOI and you don’t have to pay for server space (unlike Yamfinder), and that website databases should really just be a place to view and download the data.

I really hope this post doesn’t sound too dogmatic or arrogant. I just worry that across our discipline so many projects have spent thousands of hours and research dollars developing custom databases for each project when in 90% of cases Zenodo is a better choice.

Back to your original post:

I would like to have an interactive website that allows you to filter by variety/word/IPA symbol, and let you click on the word like a button so that you can hear the audio file. I would like this to be automatically updateable as I add to the excel sheet and add to the folder of .wav files.

In this case, you can embed sound files in Excel. If you need it online, i.e. for multiple researchers, you could use Google Sheets or Excel 365 (although that doesn’t allow embedding of sound files, you can link to files hosted on GitHub or somewhere else).

Anyone is also welcome to use what we have done for Yamfinder for their own project. It is fairly trivial to change the data structure and the display. I would be more than happy to help. We’ll just need to double-check with Wolfgang, who did most of the original coding, but I have no doubt he would be fine with that.

Sorry for the brain dump but I hope you find some of what I said useful : )


Hi @mjcarroll! Welcome aboard! And thanks for pointing him this way, @katelynnlindsey :smiley: Looking back I totally implied I was going to work with you and then did nothing! :grimacing: life.

Anyway, so many interesting observations here, @mjcarroll. I definitely agree on the availability of simple data being a huge plus in a project, and that all kinds of tools are useful in linguistic analysis. My opinion on software for documentation is, if it helps someone do language work of any kind — research, revitalization, pedagogy, whatever — it’s a net positive.

I don’t see it as a question of replication. There are some features of the web that essentially no other analytical tools offer: advanced layout (CSS grid, flexbox), incredible (and constantly improving) Unicode support, writing modes, and on and on.

Certainly, research patterns vary from person to person, and the web platform is not always the best home for certain kinds of research. Stats? Probably better off using R. Machine learning and stuff like that? Probably Python. And so on for several of the other tools you mention.

But those tools don’t match the accessibility of the web. Just installation alone (or cost) can be a significant barrier.

I actually would love to hear more about this history. I tried looking up yamfinder in the Wayback Machine but couldn’t find any old versions :sweat_smile:

Certainly neither dogmatic nor arrogant. Science needs lots of viewpoints after all. I confess I have never really dug into Zenodo, although it’s been mentioned here a few times (@rgriscom is a local guru on that topic).

This discussion right here… dang, this touches on so many of the issues we face as a field right now. I think the best way to start is to try to enumerate a set of desiderata. The solutions will be interrelated, but:


  • Online - We want documentation to be widely available (where appropriate). Hosting is a hard problem.
  • Linked, playable media - it should be possible to get playback next to the transcriptions
  • Collaboration - several people should be able to update the content. Authentication and security are hard problems.
  • Searchable/filterable/interactive - Online documentation should be more useful than a print equivalent. Even beyond playable media, we want to be able to do stuff with documentation.

These things are all pretty complicated. For some problems, a shared Excel/Office 365 whatever online spreadsheet could be fine (for instance, say, historical comparison). But for making research available to a speech community, for example, or for pedagogical purposes, Excel is going to be less ideal than the kind of thing is providing.

I hope we can continue this discussion (perhaps in a separate topic so this one can stay related to the Pahoturi River languages content), because there are many paths to meeting all these desiderata (and others). What is most important, I think, is that we embrace experimentation and variation.

Hey @mjcarroll, a wee observation: some lines in the .csv file seem to have an inconsistent number of commas:

| Number of fields | Frequency |
|---|---|
| 10 | 11867 |
| 11 | 349 |
| 12 | 17 |
| 13 | 34 |
| 15 | 1 |
| 16 | 3 |
| 19 | 1 |

I think some of them are due to commas in the comment field? Not sure about other cases.
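
In case it helps anyone reproduce the tally, here is a rough sketch of that kind of check. It contrasts a naive split on commas with Python's `csv` module, which respects quoted fields. The sample rows and the `field_count_histogram` helper are made up for illustration and are not the real Yamfinder export.

```python
import csv
import io
from collections import Counter


def field_count_histogram(text: str, naive: bool = False) -> Counter:
    """Count how many rows have each number of fields."""
    if naive:
        # Splits on every comma, including commas inside quoted comments.
        rows = (line.split(",") for line in text.splitlines() if line)
    else:
        # csv.reader keeps a quoted "a, b" comment together as one field.
        rows = (row for row in csv.reader(io.StringIO(text)) if row)
    return Counter(len(row) for row in rows)


# Made-up sample: the last row has an UNQUOTED comma in its comment field.
sample = (
    "variety,gloss,form,comment\n"
    'A,canoe,xxx,"quoted, so fine"\n'
    "B,canoe,yyy,unquoted, so broken\n"
)

print("naive split:", sorted(field_count_histogram(sample, naive=True).items()))
print("csv module: ", sorted(field_count_histogram(sample).items()))
```

If the extra fields disappear when the file is read with a proper CSV parser, the commas were quoted correctly and only the counting method was at fault. Rows that still come out long point to unquoted commas in the data itself.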

Good catch! @mjcarroll and I (and others) have a separate conversation going to work out the little bugs, and commas are one of them!

The issue here is that I have hundreds of files and I’d like to add more without having to manually embed an audio file link that can get broken over time. I want to dump my audio in a folder and have a program arrange the files into a grid so that I can listen to them just by clicking on a button. This interactive platform is for me (and I’d like it to be online so other researchers can use it too!).
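
To make the "dump audio in a folder, get a grid" idea concrete, here is a minimal sketch, not a finished tool. It assumes every file follows the Variety_word.wav pattern with the gloss as the last underscore-separated piece. `build_grid` and `render_html` are hypothetical names, and the output is a bare HTML table with one audio player per cell.

```python
from collections import defaultdict
from html import escape
from pathlib import Path


def build_grid(paths):
    """Map gloss -> variety -> audio path, using the 'Variety_word.wav' pattern."""
    grid = defaultdict(dict)
    for path in sorted(Path(p) for p in paths):
        variety, sep, gloss = path.stem.rpartition("_")
        if sep and variety and gloss:  # skip files that don't fit the pattern
            grid[gloss][variety] = path
    return grid


def render_html(grid):
    """Render the grid as an HTML table: one row per gloss, one column per variety."""
    varieties = sorted({v for row in grid.values() for v in row})
    header = "".join(f"<th>{escape(v)}</th>" for v in varieties)
    lines = ["<table>", f"<tr><th>gloss</th>{header}</tr>"]
    for gloss in sorted(grid):
        cells = []
        for v in varieties:
            path = grid[gloss].get(v)
            if path is None:
                cells.append("<td></td>")  # no recording for this variety
            else:
                src = escape(path.as_posix())
                cells.append(f'<td><audio controls src="{src}"></audio></td>')
        lines.append(f"<tr><th>{escape(gloss)}</th>{''.join(cells)}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Example usage ("audio" is a placeholder folder name):
#   page = render_html(build_grid(Path("audio").glob("*.wav")))
#   Path("grid.html").write_text(page, encoding="utf-8")
```

Rerunning the script after adding new .wav files regenerates the table, so nothing has to be linked by hand. Filtering by variety, word, or IPA symbol would still need some JavaScript or a spreadsheet on top.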


Hi everyone!

I wanted to give you all an update on my progress on this project. I took @mjcarroll’s advice and tried to get Google Sheets to serve the purposes of this endeavor.

I created a sheet that includes a dynamic table that allows you to select a language/transcriber (organized by column) and a gloss (organized by row). The spreadsheet will then automatically fill in the table with a transcription of that gloss for that language by that transcriber AND a link to the audio file in my Google Drive.

I also found a way to automatically extract a list of URLs that correspond to each file in a Google Drive folder so that I didn’t have to manually link each word to its audio file by copying and pasting the link. (Very helpful, since there are more than 2000 audio files.)

I’m still messing around with the formatting and functionality (with help from @rchon), and the end goal is to embed the editable spreadsheet into a WordPress-type website so that researchers can view the comparative data without seeing all the backend Google Sheets mess. I’ll also be able to include instructions and tips.

I’ll update again soon as the project continues!

Thanks so much for sharing this, really interesting, and congrats on your progress.

I’m not a historical linguist myself, so I don’t do much in the way of comparative tables like these, and I was wondering if I could ask you a question about working with this kind of data. If I understand correctly, you’ve got, for instance, in the third row six forms which correspond to canoe. So my question is, how do you track cases where one of the languages has undergone semantic shift? Like, I could imagine canoe coming to mean something else: raft or rowboat or rocket ship :sweat_smile: or whatever.

Is that sort of information stored in some other context, or do you find that working with a cognate-level gloss doesn’t really turn out to be a problem, even if it’s… er… glossing over some subtleties?

Great question. For historical/comparative work, it can be useful to have the data organized both by cognate sets, that is, sets of words that all have a common origin and may or may not refer to identical semantic categories (this would be your canoe-canoe-raft set), and by current synchronic glosses, to look at semantic drift. Cognates, though, are key for the comparative method, and the semantics have to be close enough that you can argue for a common origin before using a set for reconstruction work.

In the spreadsheet, each transcription-gloss pairing has a column with gloss notes (to discuss semantic differences or shifts) and transcription notes (to discuss different analyses in transcription).
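
As a rough picture of that row structure, here is a sketch of one entry as a record. The two notes fields mirror the columns described above. The `cognate_set` field and the `by_cognate_set` helper are assumptions added for illustration, picking up on the cognate-set point, and are not something the spreadsheet is confirmed to contain. The forms are placeholders, not real data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    variety: str
    gloss: str                          # current synchronic gloss
    transcription: str
    gloss_notes: str = ""               # semantic differences or shifts
    transcription_notes: str = ""       # alternative analyses of the transcription
    cognate_set: Optional[str] = None   # hypothetical label, not a confirmed column


def by_cognate_set(entries):
    """Group entries by cognate-set label, ignoring unlabelled entries."""
    groups = defaultdict(list)
    for entry in entries:
        if entry.cognate_set is not None:
            groups[entry.cognate_set].append(entry)
    return dict(groups)


# Placeholder forms only: a "canoe-canoe-raft" style set.
entries = [
    Entry("A", "canoe", "form1", cognate_set="CANOE"),
    Entry("B", "canoe", "form2", cognate_set="CANOE"),
    Entry("C", "raft", "form3", gloss_notes="shifted from 'canoe'", cognate_set="CANOE"),
]

for label, members in by_cognate_set(entries).items():
    print(label, [(m.variety, m.gloss) for m in members])
```

Keeping the synchronic gloss and the cognate label as separate fields lets the same rows be viewed either way: grouped by meaning to look at drift, or grouped by origin for reconstruction.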

How fantastic! Great to see you’ve managed to pull off the dynamic display and linking to your sound files!

This is exactly what I wanted for Yamfinder when we did it the first time, and I wish I had started here (although Google Docs wasn’t nearly as developed in 2012). Can’t wait to follow this as it develops.


Wow yamfinder has been around since 2012? That’s awesome, well done. :bowing_man: