Converting diacritics into tone melody annotations with Javascript

Tone to Tone Melody

Tone is often (but not always!) represented with diacritics. Common diacritics used include the acute (á), grave (à) or macron (ā). These often represent high, low, and mid tone, respectively.

Of course, tone is a tricky business, so what diacritics mean is project-specific. Because notations vary, it’s often useful to use an alternative “tone melody” notation where tones are represented independently, usually with uppercase letters.

So a word like ámà might have a “tone melody” spelling of HL.

Entering this notation manually alongside the transcription proper can be a drag. Could we automate it? Let’s try.

Finding diacritics with Unicode Property Escapes

Unicode Property Escapes have come up once before on this forum, but we haven’t gone into them too deeply. I personally think linguists will love them. Here’s the gist:

Suppose you wanted to find (or even remove) every Greek-script character from a string. You could find them with either of the following approaches:

let greekAlphabetRE = /[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]/

Gulp.

Or as (more complex) code point ranges:

let numericGreekAlphabetRE = /[\u0391-\u03A9\u03B1-\u03C9]/

The first one is easier to read than the second, but neither is as easy to read as this:

let greekScriptRE = /\p{Script=Greek}/gu

Waaaat. This is ridonkulously useful, because there are lots of properties you can match.
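
To make the Greek case concrete, here is a minimal sketch of finding and removing Greek-script characters with the property escape. The sample string and the variable names are made up for illustration, and it assumes a runtime that supports `\p{…}` together with the `u` flag.

```javascript
// Match any single character whose Unicode Script property is Greek.
// The "u" flag is required for \p{…}; "g" finds every match, not just the first.
const greekScriptRE = /\p{Script=Greek}/gu

// Hypothetical sample text, written with escapes so the encoding is unambiguous.
const sample = "alpha is \u03B1, beta is \u03B2" // α and β

// match() returns null when nothing matches, so fall back to an empty array.
const greekOnly = (sample.match(greekScriptRE) || []).join("")

// replace() with a global regex removes every Greek-script character.
const withoutGreek = sample.replace(greekScriptRE, "")

console.log(greekOnly)    // αβ
console.log(withoutGreek) // alpha is , beta is
```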

A normalization gotcha

Before we get to the part where we find diacritics, we need to address a wrinkle. Quick, how many characters are used to make each of the following two lines?

á

á

Trick question. The answer, unfortunately, is “it depends”. For historical reasons, there are two ways to encode some characters in Unicode. Characters with the acute accent are such characters. You can either encode such a character as a “unitary” character of length 1, like U+00E1 LATIN SMALL LETTER A WITH ACUTE, or else you can encode it as plain old U+0061 LATIN SMALL LETTER A plus U+0301 COMBINING ACUTE ACCENT. They look the same either way, because fonts know how to stack combining characters in the right place.

This means that just looking at a character isn’t enough to know how it’s encoded. And it can be an issue if you want to encode the Lakota word kákhi “in that direction”: is it k + á + k + h + i or k + a + ◌́ + k + h + i? And what if your database contains one version, but a user runs a search using the other version? Will they match? (The answer is “maybe”.)

This whole business can get complicated (“Unicode equivalence” is this thing that… eh, let’s not worry about it.)

Suffice it to say that you can use this magical Javascript trick to go between “most unsquished” and “most squished together”:

"á".normalize("NFKD") // Normalization Form Compatibility Decomposition (most unsquished)
"á".normalize("NFKC") // Normalization Form Compatibility Composition (most squished together)

Note:

[
"á".normalize("NFKD").length,
"á".normalize("NFKC").length
]

gives

[ 2, 1 ]
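
This is also what settles the search question from earlier: two strings that look identical only compare equal once both are normalized to the same form. A minimal sketch, where `sameText` is a hypothetical helper name and the choice of NFC is just an assumption (any one form works, as long as both sides use it):

```javascript
// Two encodings of the Lakota word from above.
const composed = "k\u00E1khi"    // á as one code point (U+00E1)
const decomposed = "ka\u0301khi" // a + combining acute (U+0061 U+0301)

console.log(composed === decomposed)            // false
console.log(composed.length, decomposed.length) // 5 6

// Hypothetical helper: compare after normalizing both sides to the same form.
const sameText = (a, b) => a.normalize("NFC") === b.normalize("NFC")

console.log(sameText(composed, decomposed)) // true
```

The same idea applies to a database: normalizing both the stored forms and the search query on the way in avoids the “maybe”.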

Matching Diacritics

So, here’s our plan to convert diacritical notation to tone melody notation:

  1. Decompose: (Obligatory zombie: :zombie:) We want to find just diacritics. That means we want to decompose or “unsquish” our string.
  2. Match diacritics: When we have normalized our string into decomposed form, we can use the Unicode Property Escape \p{Diacritic} to find the diacritics.
  3. Convert: Assuming we have a table of diacritic-to-tone for our project, we can then just look up each diacritic and convert it to a tone melody character.

Here is some Javascript that does all that:

let projectToneNotation = [
  { "diacritic": "́", "melodic": "H", "name": "High",      "character": "Acute Accent" },
  { "diacritic": "̀", "melodic": "L", "name": "Low",       "character": "Grave Accent" },
  { "diacritic": "̄", "melodic": "M", "name": "Mid",       "character": "Macron Accent" },
  { "diacritic": "̂", "melodic": "F", "name": "Falling",   "character": "Circumflex Accent" }
]

let diacriticToMelody = (form, notation) => {
  let decomposed = form.normalize("NFKD") // unsquish
  let diacritics = decomposed.match(/\p{Diacritic}/gu) || [] // find diacritics (match() gives null if there are none)
  return diacritics
    .map(diacritic => { // map each diacritic…
      let entry = notation.find(tone => tone.diacritic === diacritic)
      return entry ? entry.melodic : "?" // to its “melody” ("?" if it isn’t in the table)
    })
    .join("") // back to a string
}
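
| diacritic | codePoint | tone | character | melody |
| --- | --- | --- | --- | --- |
| ◌́ | U+0301 | High | Acute Accent | H |
| ◌̀ | U+0300 | Low | Grave Accent | L |
| ◌̄ | U+0304 | Mid | Macron Accent | M |
| ◌̂ | U+0302 | Falling | Circumflex Accent | F |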
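
Combining characters are hard to tell apart by eye, so it can help to print their code points. A small sketch that produces the code point labels listed in the table above; `codePointLabel` is a made-up helper name, and the diacritics are written as escapes to keep them unambiguous:

```javascript
// Hypothetical helper: format a character's first code point as "U+XXXX".
const codePointLabel = character =>
  "U+" + character.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")

// The four combining diacritics from the project table.
const toneDiacritics = ["\u0301", "\u0300", "\u0304", "\u0302"]

console.log(toneDiacritics.map(codePointLabel).join(" "))
// U+0301 U+0300 U+0304 U+0302
```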

Here’s a little demo that uses this function that you can play with:

  1. Type a letter (probably a vowel) in the input.
  2. Click an accent button.
  3. The tone melody will update.

https://docling.net/tone-to-melody/tone-melody.html

Obviously, this is just a starting point; in practice the function would be used to bulk-annotate a lexicon or something like that.
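
As a rough idea of what bulk annotation could look like, here is a self-contained sketch. The lexicon entries, the `form` and `melody` field names, and the `toMelody` helper are all invented for illustration (only ámà comes from the post). Unknown diacritics are marked with `?` rather than throwing, which is one possible policy among several.

```javascript
// Diacritic-to-tone lookup (same four tones as the project table).
const toneByDiacritic = new Map([
  ["\u0301", "H"], // combining acute
  ["\u0300", "L"], // combining grave
  ["\u0304", "M"], // combining macron
  ["\u0302", "F"]  // combining circumflex
])

// Hypothetical helper: decompose, find diacritics, look each one up.
const toMelody = form =>
  (form.normalize("NFKD").match(/\p{Diacritic}/gu) || [])
    .map(diacritic => toneByDiacritic.get(diacritic) || "?")
    .join("")

// Hypothetical lexicon; real entries would come from your own data.
const lexicon = [
  { form: "\u00E1m\u00E0" }, // ámà
  { form: "\u00E0m\u00E1" }, // àmá (made-up form)
  { form: "ama" }            // no tone marks at all (made-up form)
]

// Add a melody field to a copy of each entry.
const annotated = lexicon.map(entry => ({ ...entry, melody: toMelody(entry.form) }))

// Prints each form with its melody: HL, LH, and an empty melody for "ama".
for (const entry of annotated) console.log(entry.form, entry.melody)
```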


super super super cool!


I have something similar in R, but it also notes syllable boundaries with a period, because in Mixtec it’s often instructive to know if you’re dealing with HL on a monosyllable or H.L on a bisyllabic word. My code is overall a bit more complicated, because in our practical orthography, mid tone is unmarked. But the gist of it is the same :slight_smile:
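
For anyone who wants to try a syllable-aware version in Javascript, here is a rough sketch of the idea (it is not the R code mentioned above). It assumes the input is already syllabified with periods, that the vowels are just a e i o u, that only acute and grave are written, and that a vowel with no tone mark is mid. All of those are simplifying assumptions that a real Mixtec orthography would need to adjust, and the test forms are made up.

```javascript
// Assumed tone marks: acute = H, grave = L; anything else on a vowel counts as unmarked.
const markedTones = new Map([["\u0301", "H"], ["\u0300", "L"]])
const vowels = new Set("aeiou") // assumed vowel inventory

// Melody of one syllable: one tone letter per vowel, "M" when the vowel is unmarked.
const syllableMelody = syllable => {
  const characters = [...syllable.normalize("NFD")] // unsquish, then split into code points
  let melody = ""
  characters.forEach((character, index) => {
    if (!vowels.has(character)) return
    melody += markedTones.get(characters[index + 1]) || "M"
  })
  return melody
}

// Melody of a whole word, keeping syllable boundaries as periods.
const wordMelody = word => word.split(".").map(syllableMelody).join(".")

console.log(wordMelody("k\u00E1.m\u00E0")) // H.L (two syllables)
console.log(wordMelody("k\u00E1\u00E0"))   // HL  (one syllable, two vowels)
console.log(wordMelody("ka.m\u00E1"))      // M.H (unmarked vowel is mid)
```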


Yeah the example I gave here is super simple, and it assumes every vowel has an accent mark. I had some Loma examples in mind (@Dani and @squidtm could tell you much more, I’m not in the class, just a fanboy :smile: ). In that case if I understand correctly there are just two tones, and pretty much every vowel in the corpus has an accent mark, so something like the approach given here will do (although I believe I don’t have all the tone diacritics they use).

In any case it would be interesting to see your code if you’d like to share! I think the forum supports highlighting R code if you “fence” it like this:

```R

```

Also curious to know how you syllabify Mixtec, seems complicated!

Using Greek script as an example for Unicode Property Escapes seems very appropriate, because it touches one of the bigger practical issues with Unicode: Unicode blocks, categories or properties are useful, but not necessarily transparent, and thus actual usage may not match the Unicode semantics. E.g. there are quite a few “omegas” in Unicode, see Omega - Wikipedia
So before using a technique as described here on a corpus, I’d recommend computing summary statistics on the contents of the corpus on Unicode code-point level.
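
A minimal sketch of such a code point inventory in Javascript. `codePointCounts` is a hypothetical helper, and the sample string mixes U+03A9 GREEK CAPITAL LETTER OMEGA with U+2126 OHM SIGN, which look alike. The text is deliberately not normalized first, since normalization could hide exactly the differences being counted.

```javascript
// Hypothetical helper: count how often each code point occurs in a text.
const codePointCounts = text => {
  const counts = new Map()
  for (const character of text) { // for…of walks code points, not UTF-16 units
    const label = "U+" + character.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    counts.set(label, (counts.get(label) || 0) + 1)
  }
  return counts
}

// Made-up sample: two Greek capital omegas and one ohm sign.
const corpusSample = "\u03A9\u2126\u03A9"
const inventory = codePointCounts(corpusSample)

// Most frequent first.
const ranked = [...inventory].sort((a, b) => b[1] - a[1])
console.log(ranked.map(([label, count]) => label + ": " + count).join(", "))
// U+03A9: 2, U+2126: 1
```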


You code up such interesting things. Well done.

I wrote a paper on this topic you might be interested in… Phonetic Transcription of Tone in the IPA | Hugh's Curriculum Vitae

It is important in these transcriptions to note which pitches are annotated phonetically, and which are phonologically annotated. There is a rather wide use of “bar-notation” for phonetic transcription as falling or rising pitches may be reducible due to the merger of phonologically HL or LH sequences. Bar-notation was implemented in some SIL fonts in the Unicode PUA area. There is also a non-Unicode font for bar-notation floating around… :wink:

In my work I have found Perl regular expressions to be very helpful in dealing with these cases because they can target Unicode character attributes (formally called properties) much better than regex in other languages. For examples see: perlunicode - Unicode support in Perl - Perldoc Browser
