Faffing about with ChatGPT for documentary linguistics

Right, so, ChatGPT.

(Pretty sure some people just skipped over this topic, it’s everywhere, I know!)

I have to admit that my initial reaction to ChatGPT’s ability to write JavaScript/HTML/CSS applications was kind of… well…

Here’s a repo containing the source code from this example if you are interested:

GitHub - amundo/chatgpt-experiments

Some of it was written by me and some of it was written by… the AI… :person_shrugging:


Searching for emojis

It is kind of amazing that it can take a plaintext description of an application and turn it into functioning, commented code. So for instance, this request worked pretty well:

I realize that this task hardly qualifies as “documentary linguistics”, but the pattern of querying an array and rendering the output is one that comes up again and again — searching wordlists or texts, for instance. I think it’s a good learning task, so I used it as an example in the class I’m teaching at Yale (which is almost ending :face_holding_back_tears:).

The sample data is almost exactly the same structure as what I used as a demo in my current class for this task. And the code does… well, exactly what our demo did in class.

It’s worth noting that the output doesn’t really take “end users” into account very much, because it doesn’t have much in the way of help text (well, I guess there is the Search for an emoji… placeholder attribute). Still, though, kind of nuts, right?

You can try it yourself here:


All I added was the skeleton HTML page to put the generated form inside.
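For anyone who hasn't seen the "query an array, render the output" pattern, here is a minimal sketch of it. This is not the generated code: the sample data, the field names `emoji` and `name`, and the function names `search` and `render` are all assumptions for illustration.

```javascript
// Hypothetical sample data; the field names are assumptions for illustration.
const emojis = [
  { emoji: '😀', name: 'grinning face' },
  { emoji: '🐱', name: 'cat face' },
  { emoji: '🐶', name: 'dog face' },
  { emoji: '🌵', name: 'cactus' },
];

// Return every item whose name contains the query, ignoring case.
function search(items, query) {
  const needle = query.trim().toLowerCase();
  if (needle === '') return [];
  return items.filter(item => item.name.toLowerCase().includes(needle));
}

// Turn the matches into an HTML list. It returns a string so it also runs
// outside a browser; there is no HTML escaping, so this assumes trusted data.
function render(matches) {
  return '<ul>' + matches.map(m => `<li>${m.emoji} ${m.name}</li>`).join('') + '</ul>';
}

console.log(render(search(emojis, 'face')));
```

In a page, the string from `render` would typically be assigned to some element's `innerHTML` inside an `input` event listener. Swapping the array for a wordlist (say, objects with a form and a gloss) should leave the two functions nearly unchanged.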

Finding minimal pairs

This example also seemed pretty :exploding_head: at first, but it took some work to realize 1) what it’s trying to do and 2) the fact that it doesn’t actually work right.

Typically one would use something like the Levenshtein distance to find minimal pairs.

If you read the comments of the generated content, it looks like it should do exactly what you’re asking:

// The map should contain the following minimal pairs:
// - bat and hat
// - cat and hat
// - rat and hat

I mean, modulo a crummy definition of phonemes, those are indeed words that differ in one letter. But guess what? The algorithm doesn’t actually detect what it purports to detect in its own comments. Instead, the minimal pairs it does detect are only those where the final segments differ. So, bat and bag, but not bat and hat.
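To make the difference concrete, here is a rough sketch of the two checks side by side. Neither function is the generated code: `differsOnlyInFinal` is a hypothetical reconstruction of the behavior described above, and `differsInOnePosition` is one simple way to do what the comments promise, assuming one character per segment.

```javascript
// Treat each character as one segment (the same crude "letter = phoneme" assumption).
// True when the words have equal length and differ in exactly one position.
function differsInOnePosition(a, b) {
  if (a.length !== b.length) return false;
  let differences = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differences++;
  }
  return differences === 1;
}

// The narrower check: identical except for the final segment.
function differsOnlyInFinal(a, b) {
  if (a.length !== b.length || a.length === 0) return false;
  const last = a.length - 1;
  return a.slice(0, last) === b.slice(0, last) && a[last] !== b[last];
}

console.log(differsInOnePosition('bat', 'hat')); // true
console.log(differsOnlyInFinal('bat', 'hat'));   // false
console.log(differsOnlyInFinal('bat', 'bag'));   // true
```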

I rigged up a quick little testing interface (which also doesn’t have much in the way of help text!) where you can try it out:


You will see that the default examples from the generated code produce no output; try adding bag on its own line, and there will be output:

So, I had to read the code and the algorithm to figure out what was going on, and what was wrong. It’s quite reminiscent of a task that translators face when dealing with machine translation or transcribers when dealing with OCR output: is it worth the effort to post-edit, or is it easier to just start from scratch?

I would be interested to see experiments from any of you related to documentation topics!


Welp, can’t resist this follow-up:

Try it here:


const levenshtein = require('levenshtein-edit-distance');

// An array of words to find minimal pairs in
const words = ['cat', 'bat', 'rat', 'mat', 'pat', 'hat'];

// A map to store the minimal pairs we find
const pairs = new Map();

// Loop through all pairs of words in the array
for (let i = 0; i < words.length; i++) {
  for (let j = i + 1; j < words.length; j++) {
    // Calculate the Levenshtein distance between the two words
    const distance = levenshtein(words[i], words[j]);

    // If the distance is 1, the words are minimal pairs
    if (distance === 1) {
      // Add the pair to the map, using the first word as the key
      pairs.set(words[i], words[j]);
    }
  }
}

// Print the pairs we found
for (const [key, value] of pairs) {
  console.log(`${key} - ${value}`);
}


  1. Why in the heck is ChatGPT so convinced that minimal pairs are only to do with words that differ in their final segment?
  2. Here it’s using a library. But it’s a correct choice of library.
  3. The code is a bit out of date; the require() syntax only works in Node.js, as the library’s documentation explains.
  4. ChatGPT is now down so maybe I can go do some real work :slight_smile:
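One more thing worth noticing in the generated code: `pairs.set(words[i], words[j])` uses the first word as the key, and a `Map` holds only one value per key, so each word keeps just the last partner found. Here is a sketch that collects every pair in an array instead. To keep it runnable without installing anything, it includes a plain dynamic-programming Levenshtein function rather than the library; `findPairs` is a hypothetical name.

```javascript
// Plain dynamic-programming Levenshtein distance, so the sketch needs no library.
function levenshtein(a, b) {
  // previous[j] = distance between the first i-1 characters of a and the first j of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution (or match)
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Collect every pair of words at distance 1, not just one partner per word.
function findPairs(words) {
  const pairs = [];
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      if (levenshtein(words[i], words[j]) === 1) {
        pairs.push([words[i], words[j]]);
      }
    }
  }
  return pairs;
}

const words = ['cat', 'bat', 'rat', 'mat', 'pat', 'hat'];
const pairs = findPairs(words);
console.log(pairs.length); // 15: every pair in this list differs by one letter
```

Note that a distance of 1 also lets through pairs like cat and cart, which differ by an insertion rather than a substitution, so whether that counts as a minimal pair depends on your definition.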

This one is kind of interesting:


I asked it to generate a French vocabulary quiz, and to randomize the quiz from a list of words.

It works, but oddly, what it did was to hard code the translations and randomize which word is being quizzed, so the answer choices remain the same on reload, but the correct answer changes.

Obviously, this is not a scalable approach for a larger vocabulary. Also obviously, no human would have written it this way…
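For comparison, here is a rough sketch of how one might generate both the question and the answer choices from the word list itself, so nothing is hard coded. The vocabulary, the `french`/`english` field names, and the function names are all assumptions for illustration, not what ChatGPT produced.

```javascript
// Hypothetical vocabulary; any list of { french, english } entries would do.
const vocabulary = [
  { french: 'chat', english: 'cat' },
  { french: 'chien', english: 'dog' },
  { french: 'maison', english: 'house' },
  { french: 'livre', english: 'book' },
  { french: 'pomme', english: 'apple' },
];

// Fisher-Yates shuffle on a copy; `random` is injectable so results can be reproduced.
function shuffle(items, random = Math.random) {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Pick one entry to quiz and draw the wrong answers from the rest of the list.
function makeQuestion(entries, choiceCount = 4, random = Math.random) {
  const [target, ...others] = shuffle(entries, random);
  const distractors = others.slice(0, choiceCount - 1).map(e => e.english);
  return {
    prompt: target.french,
    answer: target.english,
    choices: shuffle([target.english, ...distractors], random),
  };
}

const question = makeQuestion(vocabulary);
console.log(`What does "${question.prompt}" mean?`, question.choices);
```

Because the choices are drawn from the same list as the question, adding words to `vocabulary` is the only change needed to grow the quiz.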