I did a little procrasti-project today… you know, when you have stuff you must do, and your brain says "Don't do that sensible thing!", and you end up putting your mental energies into something else… y'all know what I'm talking about. Might as well make something of it.
I'm sure you all spend a fair amount of time looking at linguistics and language stuff on Wikipedia. It occurred to me that there is actually a fair amount of fairly well-structured interlinear text there. Wouldn't it be cool to try to get it all into some kind of queryable form? So I had to start looking into it.
First thing was just to poke around and try to find some. I happened to be looking at the article on the language of the Shawnee people, also called Shawnee, a central Algonquian language:
There are maybe 30 interlinears on that page: not a huge database, but not nothing.
Should we scrape the HTML?
Scraping is the programming word for writing a program that picks apart the bits and pieces of an HTML page. Scraping can be painstaking or straightforward, depending on how the HTML is written.
So, let's just pick a random bit of the page which has some significant interlinear content:
The way to get a feel for what the markup looks like for this particular bit is to "inspect" the markup. We used to use a feature called "Viewing the source", but "Inspecting the source" is a much better way to learn about how HTML works in practice. (Here's a tutorial on how to do that: How to try Javascript now.)
…But in this case, the markup is not super informative: you can see in the screenshot below that I'm inspecting example (1), but all the tags are just <div>s, without any classes to distinguish how the levels are set up. That's going to be a royal pain to parse. So let's… just not try to parse the HTML.
Wikipedia is made of wikitext
Oh boy. Wikitext. Love it, hate it… It will never go away because the biggest human-written anything ever, Wikipedia, is inextricably bound up in its complexities. Wikitext is what you see when you edit an article. It is not… well, it is fugly.
The pattern goes like this:
For most practical purposes, the Parser that converts Wikitext into HTML is a black box. It's like an evolved organism. It's bonkers. But the input to that parser is at least kind of understandable, if you can control your rage at how bewildering it is. So rather than inspecting the HTML, let's see what the wikitext corresponding to the "this stranger…" example looks like. How do you do that? Well, you click edit. It's easier to just edit a single section at a time, so we'll click the edit link next to Demonstrative pronouns:
Which gets us:
===Demonstrative pronouns===
Refer to the examples below. 'Yaama' meaning 'this' in examples 1 and 2 refers to someone in front of the speaker. The repetition of 'yaama' in example 1 emphasizes the location of the referent in the immediate presence of the speaker.
{{interlinear|number=(1)|glossing=no abbr
|yaama- kookwe- nee -θa -yaama
|this- strange- appearing -PERSON -this
|'this stranger (the one right in front of me)'}}
{{interlinear|number=(2)|glossing=no abbr
|mata- yaama- ha'- pa-skoolii -wi ni-oosθe' -0a
|not this TIME- go-school -AI 1-grandchild -PERSON
|'this grandchild of mine does not go to school'}}
…etc…
Okay, that's not too bonkers. At least we can see that there are chunks, and each interlinear begins and ends with double curly brackets. And the first line looks like this:
{{interlinear|number=(1)|glossing=no abbr
Okay, so it starts with {{interlinear, makes sense. In fact, this is what's called a Template in Wikipedia parlance. If I understand correctly (correct me @sunny!), a template is basically a kind of syntax for indicating that the content "within" should be transformed before being handed to the parser. So it goes like this:
Or I mean, I dunno if that's how it actually goes down; the point is, by the time you see the rendered HTML page, the template has been transformed.
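To make the shape of a template call a bit more concrete, here is a minimal Python sketch that splits one {{...}} chunk into its name, its named parameters (like number=(1)) and its unnamed lines. The function name parse_template is made up for illustration, and this is not how MediaWiki's parser or my little app actually does it. It assumes that anything containing "=" is a named parameter (which, as far as I know, mirrors MediaWiki's rule) and that there are no nested templates or [[link|label]] pipes inside.

```python
def parse_template(template: str):
    """Split one {{...}} template call into (name, named params, positional params).

    Hypothetical helper for illustration; a naive split on "|" that will
    break on nested templates or [[link|label]] inside a parameter.
    """
    if not (template.startswith("{{") and template.endswith("}}")):
        raise ValueError("expected text wrapped in {{ ... }}")

    # Drop the outer braces, then cut on every pipe.
    parts = [part.strip() for part in template[2:-2].split("|")]
    name = parts[0]

    named = {}
    positional = []
    for part in parts[1:]:
        key, separator, value = part.partition("=")
        if separator:
            # Assumption: the first "=" separates a parameter name from its value.
            named[key.strip()] = value.strip()
        else:
            positional.append(part)
    return name, named, positional
```

On example (1) from the Shawnee article this should give the name interlinear, the two named parameters number and glossing, and three unnamed lines: the Shawnee words, the glosses, and the free translation.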
The interlinear template
In point of fact, Wikipedia's interlinear template is powerful. Real powerful. It can do an awful lot of stuff. Which means the "syntax" of the interlinear template gets a little hairy in its own right.
But just check out the documentation for the interlinear template:
Wouldja look at that. Pretty glosses! Small caps! You can add your own abbreviations. It lines up the words right! There is numbering! You can tweak stuff! It is, in short, pretty impressive.
But I don't want all the other stuff, just the templates
So, maybe we just parse the wikitext and slurp out all the instances of the {{interlinear}} template? Well, yeah. That's what I did. I wrote a little app, and stuck it here:
So what you do is, you paste all the wikitext from a page into the left panel, and it tries to slurp out the interlinear templates and turn them into plain ol' text in the right panel.
Like this:
The "extraction" code tries to fix a few things, but I know for a fact that it does some bad things. But I've found that you have to start somewhere, and it's better to try to make something.
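For anyone who wants to play along at home, here is a rough Python sketch of the same idea: find every {{interlinear ...}} chunk by counting braces, then turn each one into aligned plain text. This is not the code in my app, and the function names (find_templates, align, to_plain_text) are made up. It assumes the last unnamed line is the free translation and the ones before it are word-aligned, and it will mangle anything with nested templates or links inside the lines.

```python
import re
from itertools import zip_longest


def find_templates(wikitext, name="interlinear"):
    """Return every {{name ...}} call, matching nested {{ }} by counting depth."""
    pattern = re.compile(r"\{\{\s*" + re.escape(name) + r"\s*(?=\||\}\})", re.IGNORECASE)
    found, pos = [], 0
    while True:
        match = pattern.search(wikitext, pos)
        if match is None:
            break
        depth, i, end = 0, match.start(), None
        while i < len(wikitext) - 1:
            pair = wikitext[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    end = i
                    break
            else:
                i += 1
        if end is None:
            break  # unbalanced braces: stop rather than guess
        found.append(wikitext[match.start():end])
        pos = end
    return found


def align(lines):
    """Pad whitespace-separated words so each column lines up across lines."""
    rows = [line.split() for line in lines]
    columns = list(zip_longest(*rows, fillvalue=""))
    widths = [max(len(word) for word in column) for column in columns]
    return [
        " ".join(word.ljust(width) for word, width in zip(row, widths)).rstrip()
        for row in zip(*columns)
    ]


def to_plain_text(template):
    """Render one template as plain text (naive: splits on every "|")."""
    parts = [part.strip() for part in template[2:-2].split("|")][1:]
    named = {}
    positional = []
    for part in parts:
        if "=" in part:  # assumption: "=" marks a named parameter
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(part)
    if not positional:
        return ""
    *glossed, translation = positional  # assumption: last line is the translation
    label = named.get("number", "")
    body = align(glossed) + [translation]
    prefix = label + " " if label else ""
    indent = " " * len(prefix)
    return "\n".join([prefix + body[0]] + [indent + line for line in body[1:]])
```

Calling to_plain_text on each item of find_templates(page_wikitext) gives one little block of text per interlinear, with the example number in front when there is one.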
A clunky workflow (but some results)
So I ended up doing this… uh… 70 times.
- Search Wikipedia for pages using this query (see these docs)
- Open a bunch of tabs with all those articles
- Edit each one, cut and paste the content
- Paste it into the extractor
- Cut and paste the output into a file, save that.
Like I say, clunky. I'm a little obsessive about things like this, though, once I get started. And honestly I found it kind of fun, even just glancing at all those articles. The real fun should be trying to do something with them all, however.
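The tab-opening and cutting-and-pasting could probably be replaced by a few calls to the MediaWiki API. Here is a hedged Python sketch, not something I actually ran for this project: it builds a search URL using the hastemplate: search keyword (an assumption about what a suitable query looks like, not necessarily the query I used) and a URL that asks for a page's raw wikitext. The helper names and the User-Agent string are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def api_url(**params):
    """Build an API URL that asks for JSON in the newer, flatter format."""
    return API + "?" + urlencode({**params, "format": "json", "formatversion": "2"})


def search_url(template="Interlinear", limit=50):
    """URL for a full-text search restricted to pages that use a template."""
    return api_url(action="query", list="search",
                   srsearch=f"hastemplate:{template}", srlimit=limit)


def wikitext_url(title):
    """URL that returns the raw wikitext of one page."""
    return api_url(action="parse", page=title, prop="wikitext")


def get_json(url):
    """Fetch a URL and decode the JSON body (this is the part that hits the network)."""
    # Placeholder User-Agent: swap in something that identifies you.
    request = Request(url, headers={"User-Agent": "interlinear-sketch/0.1 (hypothetical)"})
    with urlopen(request) as response:
        return json.load(response)


def titles_from_search(data):
    """Pull page titles out of a list=search response."""
    return [hit["title"] for hit in data["query"]["search"]]


def wikitext_from_parse(data):
    """Pull the wikitext string out of an action=parse response."""
    return data["parse"]["wikitext"]


# Usage (network access needed, so left as comments):
# for title in titles_from_search(get_json(search_url())):
#     wikitext = wikitext_from_parse(get_json(wikitext_url(title)))
```

Search results come back in batches, so a real version would also need to follow the API's continuation parameters and be polite about request rates.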
How could it be better?
Well, I'm going to wrap it up for tonight, but there are lots of things that could be done:
- Try it on Wikipedias in other languages (Are template names in the wikitext the same on Wikipedias in every language?)
- Put this stuff in a GitHub repo instead of cramming it in articles here.
- Figure out how to download all the articles at once and run the extraction offline
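On that last point, one possible route (a sketch I have not tried at scale) is to stream through a Wikipedia XML dump and keep only the pages whose wikitext mentions the template. The Python below assumes the usual dump layout of page elements containing a title and a revision text, ignores the XML namespace so it does not depend on the export version, and uses made-up function names. The dump file name in the usage comment is also an assumption.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a dump, one page at a time.

    `source` is a file name or a binary file object.
    """
    title, text = None, None
    for _event, element in ET.iterparse(source, events=("end",)):
        name = local_name(element.tag)
        if name == "title":
            title = element.text
        elif name == "text":
            text = element.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            element.clear()  # free the memory held by this page's children


def pages_with_interlinear(source):
    """Yield only the pages whose wikitext contains an interlinear template."""
    for title, text in iter_pages(source):
        if "{{interlinear" in text.lower():
            yield title, text


# Usage (the dumps are huge, so left as comments):
# import bz2
# with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump:
#     for title, text in pages_with_interlinear(dump):
#         print(title)
```

The plain substring check is deliberately crude: it is only a cheap first filter before handing the page text to whatever extraction code you trust.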
So that's really it.
When ideas for low-hanging fruit like this get into my head, I feel pretty much compelled to hack something together. I'm curious to know if anyone else finds this project interesting, or has any ideas about how to improve the workflow or make use of the output.
Gnight friends.