Apropos of not much, I made this little repo with the examples from the Leipzig Glossing Rules represented as JSON. It’s just a start, but I think it’s worth thinking about as a test case for representing glossed data.
Nice! I’m curious, would it be feasible to have morpheme-level form-gloss pairs rather than word-level pairs? What might you do for infixes, non-concatenative morphology, etc.?
It raises this larger question of how useful we want our published example data to be for secondary research purposes. Do we want to go the extra mile so that someone can more easily repurpose our data? Of course, if our publications were simply extensions of our databases, then it might not require any extra steps.
Also, it would be great to see this rendered with HTML/CSS.
Rule 2 of the Leipzig rules covers morpheme-level glossing, and rule 9 covers infixing. Various rules talk about different kinds of non-concatenative morphology, e.g. rule 4D is for stem ablaut as a signal of a grammatical property.
Listing the morphemes separately is somewhat redundant, since that information is already “there” in the form/gloss values of the original word, but it is more explicit, and it’s easy to imagine scenarios where you want to annotate an individual morpheme for some reason.
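To make the redundancy concrete, here is a minimal sketch of deriving morpheme-level pairs from a word-level pair. The key names (`form`, `gloss`, `morphemes`) and the helper `split_word` are assumptions for illustration, not necessarily what the repo uses, and the sketch only handles the `-` and `=` separators from Rule 2 (no infixes or other non-concatenative marking). The word is taken from the Lezgian example in the Leipzig Glossing Rules.

```python
import re

# Rule 2 separators: "-" for affix boundaries, "=" for clitic boundaries.
SEPARATORS = re.compile(r"[-=]")


def split_word(form: str, gloss: str) -> list[dict]:
    """Pair each morpheme of a word-level form with its gloss."""
    forms = SEPARATORS.split(form)
    glosses = SEPARATORS.split(gloss)
    # Rule 2 requires the same number of separators in form and gloss.
    if len(forms) != len(glosses):
        raise ValueError(f"mismatched segmentation: {form!r} vs {gloss!r}")
    return [{"form": f, "gloss": g} for f, g in zip(forms, glosses)]


# Hypothetical word object with the morpheme level added alongside the word level.
word = {"form": "abur-u-n", "gloss": "they-OBL-GEN"}
word["morphemes"] = split_word(word["form"], word["gloss"])
print(word["morphemes"])
```

Note that the split throws away which separator was used, which is part of why an explicit morpheme array can carry more than the word-level strings do.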
Things get more complicated quickly, as you point out. Even just using = to indicate a clitic can be ambiguous — which of the two morphemes is the clitic and which is something else? This is one possibility, I guess?
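As a rough illustration (a hypothetical sketch, not necessarily the representation meant above), each morpheme could carry an explicit `type`, so that nothing has to be inferred from the `=` sign. The type labels (`stem`, `enclitic`, `proclitic`) and the helper `clitics` are assumptions; the word is from the West Greenlandic example in the Leipzig Glossing Rules.

```python
# Hypothetical record for "palasi=lu" 'priest=and' with typed morphemes.
word = {
    "form": "palasi=lu",
    "gloss": "priest=and",
    "morphemes": [
        {"form": "palasi", "gloss": "priest", "type": "stem"},
        {"form": "lu", "gloss": "and", "type": "enclitic"},
    ],
}

CLITIC_TYPES = {"proclitic", "enclitic"}


def clitics(word: dict) -> list[str]:
    """Return the forms of all morphemes explicitly typed as clitics."""
    return [m["form"] for m in word["morphemes"] if m["type"] in CLITIC_TYPES]


print(clitics(word))  # ['lu']
```

With the type stored on the morpheme, the question "which side of the `=` is the clitic?" is answered by the data rather than by convention.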
I’m curious, if you can include both the word-level form-meaning pairs and the morpheme-level form-meaning pairs in the same data structure, how important is it that you be able to generate one from the other? There is already redundancy in including both the sentence-level form-meaning pair together with the word-level form-meaning pairs, for example.
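One way to think about this: if both levels are stored, generating one from the other is mainly useful as a consistency check. Below is a minimal sketch, assuming hypothetical morpheme types (`stem`, `prefix`, `suffix`, `proclitic`, `enclitic`) and surface forms at the morpheme level; `join_morphemes` and `is_consistent` are made-up helper names.

```python
# Which Leipzig delimiter each (assumed) morpheme type contributes.
DELIMITER = {"prefix": "-", "suffix": "-", "proclitic": "=", "enclitic": "="}
# Types whose delimiter comes after the morpheme rather than before it.
PRECEDES_HOST = {"prefix", "proclitic"}


def join_morphemes(morphemes: list[dict], key: str) -> str:
    """Rebuild a word-level string ('form' or 'gloss') from typed morphemes."""
    parts = []
    for m in morphemes:
        kind = m["type"]
        if kind == "stem":
            parts.append(m[key])
        elif kind in PRECEDES_HOST:
            parts.append(m[key] + DELIMITER[kind])
        else:  # suffix or enclitic; unknown types raise KeyError
            parts.append(DELIMITER[kind] + m[key])
    return "".join(parts)


def is_consistent(word: dict) -> bool:
    """True if the word-level strings match what the morpheme level generates."""
    return all(
        join_morphemes(word["morphemes"], key) == word[key]
        for key in ("form", "gloss")
    )


word = {
    "form": "abur-u-n",
    "gloss": "they-OBL-GEN",
    "morphemes": [
        {"form": "abur", "gloss": "they", "type": "stem"},
        {"form": "u", "gloss": "OBL", "type": "suffix"},
        {"form": "n", "gloss": "GEN", "type": "suffix"},
    ],
}
print(is_consistent(word))  # True
```

A check like this only works in the morpheme-to-word direction, and only when the morpheme level stores the forms as they surface in the word.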
My impression is that if you can include all three levels, then you really only need to decide on the linear ordering of the morphemes in the morpheme array and the set of morpheme types. In flextext files, for example, only the word and morpheme levels are used (but with sentence-level free translations). An example sentence (“phrase”) in a flextext file has word-level and morpheme-level form-meaning pairs (“txt” is the form, “gls” is the meaning, usually towards the end of each nested section), and only a free translation at the sentence/phrase level (the “gls” at the very bottom).
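For a sense of how the flextext nesting maps onto a JSON-like structure, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The XML fragment is hand-written and heavily simplified for illustration (a real export wraps phrases in further elements and carries more attributes and item types), and the output key names and the helpers `item` and `phrase_to_dict` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified flextext-style fragment (not a real export).
FLEXTEXT_FRAGMENT = """
<phrase>
  <words>
    <word>
      <item type="txt" lang="kl">palasilu</item>
      <morphemes>
        <morph type="stem">
          <item type="txt" lang="kl">palasi</item>
          <item type="gls" lang="en">priest</item>
        </morph>
        <morph type="enclitic">
          <item type="txt" lang="kl">lu</item>
          <item type="gls" lang="en">and</item>
        </morph>
      </morphemes>
      <item type="gls" lang="en">priest=and</item>
    </word>
  </words>
  <item type="gls" lang="en">both the priest and the shopkeeper</item>
</phrase>
"""


def item(node, kind):
    """Text of the direct-child <item> with the given type, or None."""
    for child in node.findall("item"):
        if child.get("type") == kind:
            return child.text
    return None


def phrase_to_dict(phrase):
    """Convert one <phrase> element into nested word/morpheme dictionaries."""
    words = []
    for w in phrase.findall("words/word"):
        morphemes = [
            {"form": item(m, "txt"), "gloss": item(m, "gls"), "type": m.get("type")}
            for m in w.findall("morphemes/morph")
        ]
        words.append(
            {"form": item(w, "txt"), "gloss": item(w, "gls"), "morphemes": morphemes}
        )
    # Only a free translation exists at the phrase level.
    return {"translation": item(phrase, "gls"), "words": words}


result = phrase_to_dict(ET.fromstring(FLEXTEXT_FRAGMENT.strip()))
print(result["translation"])
print(result["words"][0]["morphemes"])
```

Because `findall("item")` only looks at direct children, the word-level and phrase-level “gls” items are kept apart from the morpheme-level ones even though they share the same tag and type.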
There is also the question of whether you want the “underlying form” to be the value at the morpheme level, rather than, say, the context-dependent allomorph which appears in the example. If so, then you wouldn’t expect to be able to generate the morpheme level from the word level. Perhaps this depends on whether you are creating a system for representing text examples from publications or one designed to represent text examples from a database. Ideally, though, these would be the same system, right?