Matching strings in JavaScript

Nothing makes me grrr like a :lion: more than a regular expression like this:

let wordRegExp = /[a-zA-Z]+/g  // UGH

That says that tokens are sequences of lowercase or uppercase ASCII letters.

Feel free to scoff dismissively!
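To see why, here is a quick sketch of what that pattern does to a word with an accented letter, using «topíc» from the text discussed below. The variable names are just for illustration, and the í is written as the escape \u00ED so there's no doubt that it's the single precomposed character.

```javascript
// The ASCII-only pattern from above
const wordRegExp = /[a-zA-Z]+/g

// "topíc", with í written as a single precomposed character (U+00ED)
const word = "top\u00EDc"

// í falls outside a-z and A-Z, so the "word" gets chopped in two
const asciiTokens = word.match(wordRegExp)
console.log(asciiTokens) // [ 'top', 'c' ]
```

One accent, and our single token has turned into two.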

This is not to say that tokenization is simple in a generic sense by any means, but there are much better defaults than using something like the regex above. In this post I’ll give you a brief tour of one of them, Intl.Collator.

The docs I’ve found on this stuff are a little thin, so I’m going to try to show you some snippets that help with string matching.

I’ll be using the transcribed forms from some old fieldwork of my own, which you can see here:

https://docling.net/book/data/languages/hiligaynon/corpus/education_in_jaro/education_in_jaro.html

The word we’ll look at (a borrowing from English) is the word ‘topic’, which shows up three times in the text. In one of them, the stress is marked on the second syllable, so it’s spelled «topíc». (Yeah, I know, it would probably be better to spell it «topík» and «topik». :person_shrugging: )

Anyway, these two spellings obviously aren’t the same string:

"topíc" == "topic"

That’s false. But sometimes we might want to be able to treat it as true, say, if you’re in a field methods class and you are still working out the stress system. (Or trying to. Stress in Hiligaynon depends on whether Venus is in the house of Aquarius. Or something.)

Intl.Collator can be made to do this without too much trouble. When you create a Collator instance (using the new keyword), it has a .compare method which accepts two strings as arguments:

new Intl.Collator('en').compare("topíc", "topic")

This returns 1, which means “these are NOT alphabetized.”

new Intl.Collator('en').compare("topic","topíc")

This returns -1, which means “these ARE alphabetized.”

There’s one more possibility: the strings are exactly the same. That gets a 0 as a return value:

new Intl.Collator('en').compare("topic","topic")

This returns 0, which means “these are the same string”.

Here’s a summary table:

strings          .compare(a, b)   meaning
topic, topíc     -1               in order
topíc, topic     1                out of order
topíc, topíc     0                identical
topic, topic     0                identical

I don’t know about you, but I find this precisely the opposite of what I would expect. Shouldn’t ‘in order’ be 1 and ‘out of order’ be -1??

:person_shrugging:
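For what it’s worth, this looks like the same convention that Array.prototype.sort expects from a comparison function: a negative number means “the first argument goes first”. So a collator’s .compare can be handed straight to .sort(). Here’s a small sketch (the forms array is just a made-up sample for illustration):

```javascript
const collator = new Intl.Collator('en')

// A small, deliberately unsorted sample
const forms = ["topíc", "na", "topic", "ha"]

// .compare is already bound to its collator, so it can be passed directly;
// the spread makes a copy so the original array is left alone
const sorted = [...forms].sort(collator.compare)

console.log(sorted) // [ 'ha', 'na', 'topic', 'topíc' ]
```

The unaccented «topic» lands before «topíc», which matches the -1 in the table above.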

So essentially, what we want is some way to make the first two rows return 0.

Here’s how you do that. You add a second argument to the Intl.Collator instantiation (not to the .compare() call!). This is the options object, with the property sensitivity set to the value "base". “Base” here basically means “compare only the base letters”: differences in diacritics (and, as it happens, in upper versus lower case) are ignored.

new Intl.Collator('en', {sensitivity: "base"})
  .compare("topíc", "topic")

That returns 0. Of course, given that we are ignoring the diacritics, the order no longer matters:

new Intl.Collator('en', {sensitivity: "base"})
  .compare("topic", "topíc")

Also 0.

Or to summarize again:

strings          {sensitivity: "base"}   meaning
topic, topíc     0                       identical
topíc, topic     0                       identical
topíc, topíc     0                       identical
topic, topic     0                       identical
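One thing worth knowing: "base" ignores case as well as diacritics. If you want accents to count but case to be ignored, the sensitivity option also accepts "accent". A small sketch to compare the two (the variable names are mine):

```javascript
const base = new Intl.Collator('en', {sensitivity: "base"})
const accent = new Intl.Collator('en', {sensitivity: "accent"})

// "base": neither the capital T nor the accent on í makes a difference
console.log(base.compare("Topic", "topíc")) // 0

// "accent": case is still ignored...
console.log(accent.compare("Topic", "topic")) // 0

// ...but the accent now makes the strings differ
console.log(accent.compare("topíc", "topic") !== 0) // true
```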

Using this to filter an array

Okay, so what? Well, this can be used as the basis for a simple diacritic-insensitive search.

let words = [ "akó", "ha", "ko", "ko", "ma-topic", "na", "next", 
"pa-súnud-òn", "subóng", "topic", "topíc" ]

// Create the collator once and reuse it for every comparison
let collator = new Intl.Collator('en', {sensitivity: "base"})

// Keep only the forms that match "topic" when diacritics are ignored
let matches = words.filter(form => collator.compare("topic", form) === 0)

Now matches contains:

[
  "topic",
  "topíc"
]

Notice that we are missing "ma-topic", which in fact does contain our word. This is because the approach above matches whole strings.

We could do other stuff to catch that one too, but I gotta get lunch, people. :sandwich:
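For the impatient, here is one rough sketch of what that “other stuff” might look like. It assumes that hyphens mark morpheme boundaries in the transcription, so each form is split on "-" and every piece is compared separately. The helper containsMorpheme is hypothetical, a name made up for this example, and this is only one of several ways to go about it.

```javascript
const collator = new Intl.Collator('en', {sensitivity: "base"})

// Hypothetical helper: true if any hyphen-separated piece of `form`
// matches `target` once diacritics are ignored.
// Assumption: hyphens mark morpheme boundaries in these transcriptions.
const containsMorpheme = (form, target) =>
  form.split("-").some(piece => collator.compare(piece, target) === 0)

const words = [ "akó", "ha", "ko", "ko", "ma-topic", "na", "next",
  "pa-súnud-òn", "subóng", "topic", "topíc" ]

const matches = words.filter(form => containsMorpheme(form, "topic"))

console.log(matches) // [ 'ma-topic', 'topic', 'topíc' ]
```

This still compares whole pieces, so it won’t find a word buried inside a longer unhyphenated string.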