Tone to Tone Melody
Tone is often (but not always!) represented with diacritics. Common diacritics used include the acute (á), grave (à) or macron (ā). These often represent high, low, and mid tone, respectively.
Of course, tone is a tricky business, so what diacritics mean is project-specific. Because notations vary, it’s often useful to use an alternative “tone melody” notation where tones are represented independently, usually with uppercase letters.
So a word like ámà might have a “tone melody” spelling of HL.
Entering this notation manually alongside the transcription proper can be a drag. Could we automate it? Let’s try.
Finding diacritics with Unicode Property Escapes
Unicode Property Escapes have come up once before on this forum, but we haven’t gone into them too deeply. I personally think linguists will love them. Here’s the gist:
Suppose you wanted to find (or even remove) every Greek-script character from a string. You could find them with either of the following approaches:
let greekAlphabetRE = /[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]/
Gulp.
Or as (more complex) code point ranges:
let numericGreekAlphabetRE = /[\u0391-\u03A9\u03B1-\u03C9]/
The first one is easier to read, but it’s not as easy to read as this!
let greekScriptRE = /\p{Script=Greek}/gu
Waaaat. This is ridonkulously useful, because there are lots of properties you can match.
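To make the Greek example concrete, here is a small sketch that both finds and removes Greek-script characters with the property escape. The sample string and the helper name stripGreek are made up for illustration. Note the u flag, which property escapes require.

```javascript
// \p{…} only works when the regex carries the u (Unicode) flag.
const greekScriptRE = /\p{Script=Greek}/gu

// Hypothetical sample string mixing Latin and Greek letters.
const sample = "abc αβγ def"

// Find every Greek-script character…
const found = sample.match(greekScriptRE)

// …or remove them all (stripGreek is an illustrative name, not a built-in).
const stripGreek = text => text.replace(greekScriptRE, "")

console.log(found)              // [ 'α', 'β', 'γ' ]
console.log(stripGreek(sample)) // prints: abc  def (the two spaces remain)
```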
A normalization gotcha
Before we get to the part where we find diacritics, we need to address a wrinkle. Quick, how many characters are used to make each of the following two lines?
á
á
Trick question. The answer, unfortunately, is “it depends”. For historical reasons, there are two ways to encode some characters in Unicode, and characters with the acute accent are among them. You can either encode such a character as a “unitary” character of length 1, like U+00E1 LATIN SMALL LETTER A WITH ACUTE, or else as plain old U+0061 LATIN SMALL LETTER A plus U+0301 COMBINING ACUTE ACCENT. They look the same either way, because fonts know how to stack combining characters in the right place.
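One way to see the difference is to list the code points in each string. The sketch below uses a made-up helper, codePoints, and writes both versions with \u escapes so there is no doubt about which encoding is which.

```javascript
// Illustrative helper: list a string's code points as U+XXXX labels.
const codePoints = text =>
  [...text].map(ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))

const composed = "\u00E1"    // á as one precomposed character
const decomposed = "a\u0301" // a followed by a combining acute accent

console.log(codePoints(composed))    // [ 'U+00E1' ]
console.log(codePoints(decomposed))  // [ 'U+0061', 'U+0301' ]
console.log(composed === decomposed) // false
```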
This means that just looking at a character isn’t enough to know how it’s encoded. And it can be an issue if you want to encode the Lakota word kákhi “in that direction”: is it k + á + k + h + i or k + a + ◌́ + k + h + i? And what if your database contains one version, but a user runs a search using the other version? Will they match? (The answer is “maybe”.)
This whole business can get complicated (“Unicode equivalence” is this thing that… eh, let’s not worry about it.)
Suffice it to say that you can use this magical JavaScript trick to go between “most unsquished” and “most squished together”:
"á".normalize("NFKD") // Normalization Form KD: compatibility decomposition (unsquish)
"á".normalize("NFKC") // Normalization Form KC: compatibility composition (squish)
Note:
[
"á".normalize("NFKD").length,
"á".normalize("NFKC").length
]
gives
[ 2, 1 ]
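This also suggests an answer to the search question above: normalize both the stored form and the query to the same form before comparing them. A minimal sketch, where sameText is a hypothetical helper and NFC is just one reasonable choice of common form:

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
const sameText = (a, b) => a.normalize("NFC") === b.normalize("NFC")

const stored = "k\u00E1khi" // kákhi with a precomposed á
const query = "ka\u0301khi" // kákhi with a + combining acute accent

console.log(stored === query)        // false
console.log(sameText(stored, query)) // true
```

Any normalization form would do here, as long as both sides use the same one.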
Matching Diacritics
So, here’s our plan to convert diacritical notation to tone melody notation:
- Decompose: We want to find just diacritics. That means we first want to decompose, or “unsquish”, our string.
- Match diacritics: Once we have normalized our string into decomposed form, we can use the Unicode Property Escape \p{Diacritic} to find the diacritics.
- Convert: Assuming we have a table of diacritic-to-tone for our project, we can then just look up each diacritic and convert it to a tone melody character.
Here is some JavaScript that does all that:
// Diacritics are written as \u escapes so the combining characters are visible.
let projectToneNotation = [
  { "diacritic": "\u0301", "melodic": "H", "name": "High", "character": "Acute Accent" },
  { "diacritic": "\u0300", "melodic": "L", "name": "Low", "character": "Grave Accent" },
  { "diacritic": "\u0304", "melodic": "M", "name": "Mid", "character": "Macron Accent" },
  { "diacritic": "\u0302", "melodic": "F", "name": "Falling", "character": "Circumflex Accent" }
]
let diacriticToMelody = (form, notation) => {
  let decomposed = form.normalize("NFKD") // unsquish
  // match() returns null when nothing matches, so fall back to an empty array
  let diacritics = decomposed.match(/\p{Diacritic}/gu) || [] // find diacritics
  return diacritics
    .map(diacritic => notation.find(tone => tone.diacritic === diacritic)) // look up each diacritic…
    .filter(tone => tone !== undefined) // skip marks the project table does not list
    .map(tone => tone.melodic) // …and map it to its “melody”
    .join("") // back to a string
}
diacritic | codePoint | name | character | melodic |
---|---|---|---|---|
́ | U+0301 | High | Acute Accent | H |
̀ | U+0300 | Low | Grave Accent | L |
̄ | U+0304 | Mid | Macron Accent | M |
̂ | U+0302 | Falling | Circumflex Accent | F |
Here’s a little demo that uses this function that you can play with:
- Type a letter (probably a vowel) in the input.
- Click an accent button.
- The tone melody will update.
https://docling.net/tone-to-melody/tone-melody.html
Obviously, this is just a starting point; in practice, the function could be used to bulk-annotate a lexicon or something like that.
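As a rough idea of what bulk annotation might look like, here is a self-contained sketch. Everything in it is an assumption made for illustration: the toneLetters table, the toMelody helper (a compact variant of the decompose, match, convert steps), and the shape of the lexicon entries with their form and melody fields.

```javascript
// Hypothetical project table: combining diacritic -> tone letter.
const toneLetters = new Map([
  ["\u0301", "H"], // combining acute accent
  ["\u0300", "L"], // combining grave accent
  ["\u0304", "M"], // combining macron
  ["\u0302", "F"], // combining circumflex accent
])

// Decompose, match diacritics, convert; marks missing from the table are skipped.
const toMelody = form =>
  (form.normalize("NFKD").match(/\p{Diacritic}/gu) || [])
    .map(mark => toneLetters.get(mark) ?? "")
    .join("")

// Hypothetical lexicon; a real one would come from a file or database.
const lexicon = [
  { form: "\u00E1m\u00E0" }, // ámà
  { form: "k\u00E1khi" },    // kákhi
  { form: "ama" },           // no tone marks at all
]

// Add a melody field to every entry without modifying the originals.
const annotated = lexicon.map(entry => ({ ...entry, melody: toMelody(entry.form) }))

console.log(annotated.map(entry => entry.melody)) // [ 'HL', 'H', '' ]
```

Whether an unmarked word should get an empty melody, or some default tone, is a project-specific decision that this sketch does not try to make.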
See also: