JSON in the Middle

Over in Flextext to Plaintext, @rgriscom shared his cool code for converting a flextext file to plain text (I just learned that apparently some people also think that ‘plaintext’ means something else, but those people are clearly wrong :stuck_out_tongue:).

This got me thinking about a topic that we might want to discuss here. As Richard mentioned, I often ramble on about something called JSON, but without really getting into what it is. (We will get into it a little next Monday!)

Here I would like to introduce an idea that might help to explain why JSON is so great. Let’s call it JSON in the Middle.

[Image: json-in-the-middle]

I’m way too old to be referencing Malcolm in the Middle. Am I cool yet?

The reality of our field is that we have a lot of file formats. Converting between formats is just a fact of our documentary lives. @rgriscom’s code addresses the following scenario:

[Diagram: flextext → plaintext]

Exactly what kind of plaintext one might want could vary in different circumstances (which tiers? how much spacing? etc), but that is the problem that @richard had, and he solved it.

In fact, we do this sort of thing in lots of other circumstances too. Like:

[Diagram: eaf → HTML]

Among many other features, LingView is a tool that carries out this conversion. Come to think of it, it actually does:

[Diagram: flextext → HTML; eaf → HTML]

I once worked on a project with Brad McDonnell that did:

[Diagram: toolbox → latex]

You get the idea. People convert from one format to another all the time.

Two things

But there are actually two things happening in what we call “conversion”:

  1. Import from a format
  2. Export to some other format.

If we write code to convert N input formats to M output formats, we are going to end up writing M * N programs. It’s probably not the case that we always want some import format to go to all other export formats, but let’s just imagine we want to for the sake of argument. Just in the examples above, we’ve mentioned flextext, plaintext, eaf, HTML, toolbox, and latex. That’s six… do we need 6² = 36 programs for conversion? When we add some other format, that’s 7² = 49 programs… that gets yikes real quick. :scream_cat:

(By the way, some formats suck as import formats: I’m looking at you LaTeX… :eyes:)

What’s the alternative?

We’re sort of already creating JSON anyway

When you write a conversion program, what do you actually do? You “parse” the content of the input format, which is to say, you go through the content in some way (by using string parsing methods, say, or a more abstract tool like an XML parser), and then you end up with “native” data structures in the programming language you’re working with. Then, you take that data structure, and you go through it and generate the output format, usually writing it to a file.
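
As a minimal sketch of those two steps, assuming a made-up, simplified XML input (the element names here are hypothetical and much simpler than real flextext or eaf files), the parse step and the generate step might look like this in Python:

```python
import xml.etree.ElementTree as ET

# A made-up, simplified XML input; real flextext or eaf files are more involved.
SOURCE = """
<text title="Hello">
  <sentence>
    <transcription>Mia nomo estas Pat.</transcription>
    <translation>My name is Pat.</translation>
  </sentence>
</text>
""".strip()


def import_text(xml_string):
    """Step 1: parse the input format into plain dictionaries and lists."""
    root = ET.fromstring(xml_string)
    return {
        "metadata": {"title": root.get("title")},
        "sentences": [
            {
                "transcription": s.findtext("transcription"),
                "translation": s.findtext("translation"),
            }
            for s in root.iter("sentence")
        ],
    }


def export_plaintext(text):
    """Step 2: walk the data structure and generate the output format."""
    lines = [text["metadata"]["title"], ""]
    for sentence in text["sentences"]:
        lines.append(sentence["transcription"])
        lines.append(sentence["translation"])
        lines.append("")
    return "\n".join(lines)


text = import_text(SOURCE)
print(export_plaintext(text))
```

Note that `export_plaintext` never sees any XML: it only needs the dictionary that `import_text` returns.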

But here’s the thing: in the “data structure” stage of that process, what you’re dealing with is going to be pretty dang close to JSON anyway. The main feature of JSON is that it’s an easy way to write down objects and arrays. The names for the different structures differ from language to language, but there is almost always something similar:

Language     Object       Array
JavaScript   Object       Array
Python       Dictionary   List
Ruby         Hash         Array
R            Something    Something else*

* R hates me, somebody tell me the equivalent?

JSON stands for “JavaScript Object Notation”, but that name doesn’t really reflect the way it’s used now. It’s really more like “generic data language”, because basically every language supports it.

No seriously, so many languages. 8th, ActionScript, Ada, AdvPL, APL, ASP, AWK, BlitzMax, C, C++, C#, Clojure, Cobol, ColdFusion, D, Dart, Delphi, E, Fantom, FileMaker, Fortran, Go, Groovy, Haskell, Java, JavaScript, LabVIEW, Lisp, LiveCode, LotusScript, Lua, M, Matlab, Net.Data, Nim, Objective C, OCaml, PascalScript, Perl, Photoshop, PHP, PicoLisp, Pike, PL/SQL, Prolog, PureBasic, Puredata, Python, R, Racket, Rebol, RPG, Rust, Ruby, Scala, Scheme, Shell, Squeak, Tcl, Visual Basic, Visual FoxPro
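
To make the “pretty dang close to JSON” point concrete, here is a small Python sketch using only the standard `json` module. A dictionary and a list go out as a JSON object and array, and come back unchanged:

```python
import json

sentence = {
    "transcription": "Mia nomo estas Pat.",
    "translation": "My name is Pat.",
    "words": [
        {"form": "mi-a", "gloss": "ego-ADJ"},
        {"form": "nom-o", "gloss": "name-1.SG"},
    ],
}

# A Python dictionary and list become a JSON object and array...
as_json = json.dumps(sentence, ensure_ascii=False, indent=2)

# ...and come back as the same dictionary and list.
round_tripped = json.loads(as_json)

print(as_json)
print(round_tripped == sentence)
```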

JSON in the Middle

The alternative is to embrace the fact that JSON is so universal, and use it as the endpoint of “importing” and the starting point of “exporting”. Then you end up with a graph like this:

[Diagram: toolbox, flextext, eaf, html, and xml each import into JSON; JSON in turn exports to toolbox, latex, flextext, eaf, html, and xml]

There are so many advantages to this world. Any one of those arrows could be written in any programming language, for instance. If our community has expertise in the form of someone who knows Haskell, well, great! How easy that particular piece of code will be for the average documentary linguist to run is another question, but there’s no reason to rule that language out, if the result is a format that “swims downstream” — that is to say, it exports or imports JSON in the standardized flavor.
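
The savings are easy to put numbers on. This little Python sketch keeps the simplifying assumption from earlier (every format is both an input and an output) and compares one program per pair against one importer plus one exporter per format; the function names are just for illustration:

```python
def direct_converters(n_inputs, n_outputs):
    """One program for every (input format, output format) pair."""
    return n_inputs * n_outputs


def hub_converters(n_inputs, n_outputs):
    """One importer per input format plus one exporter per output format."""
    return n_inputs + n_outputs


for n in (6, 7, 20):
    print(n, "formats:", direct_converters(n, n), "direct vs", hub_converters(n, n), "via JSON")
```

With six formats that is 36 programs versus 12, and each new format adds at most two programs rather than a whole new row and column.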

Okay, but what flavor of JSON?

Did I just say “standardized”? But…

The difference between what I’m proposing here and the XKCD scenario is that this putative JSON “flavor” would encompass something very close to the kind of data we’re already dealing with in documentation, but it would encompass all of the basic data types. That means, all three parts of the Boasian trilogy. If we can do that, then we have a “lingua franca” JSON flavor, so that all these various conversion programs will have a well-defined starting point.

So, I’m going to be bold here and just throw in a kitchen-sink (but small) example that shows my own opinion of the things that we have to have if we’re going to encode a documentary database as JSON. If you click the arrow below, be warned, you’re going to see a bunch of stuff — if you’re not familiar with JSON it may seem a little kookytimes. But try looking through it anyway. You might have more success looking at what lies behind the JSON example…

A very simple JSON “Boasian database”
{
  "language": {
    "metadata": {
      "name": "Esperanto",
      "codes": {
        "glottocode": "espe1235",
        "iso639": "epo"
      },
      "notes": [
        "This is (very) simple example of a JSON structure that contains a corpus, lexicon, and grammar."
      ]
    }
  },
  "corpus": {
    "metadata": {
      "title": "A tiny Esperanto corpus"
    },
    "texts": [
      {
        "metadata": {
          "title": "Hello"
        },
        "sentences": [
          {
            "transcription": "Mia nomo estas Pat.",
            "translation": "My name is Pat.",
            "words": [
              {
                "form": "mi-a",
                "gloss": "ego-ADJ"
              },
              {
                "form": "nom-o",
                "gloss": "name-1.SG"
              },
              {
                "form": "est-as",
                "gloss": "to_be-PRES"
              },
              {
                "form": "Pat",
                "gloss": "Pat"
              }
            ]
          }
        ]
      },
      {
        "metadata": {
          "title": "Advice"
        },
        "sentences": [
          {
            "transcription": "Amikon montras malfeliĉo.",
            "translation": "A friend shows in misfortune.",
            "words": [
              {
                "form": "amik-o-n",
                "gloss": "friend-N-ACC"
              },
              {
                "form": "montr-as",
                "gloss": "show-PRES"
              },
              {
                "form": "malfeliĉ-o",
                "gloss": "misfortune-N"
              }
            ]
          }
        ]
      }
    ]
  },
  "lexicon": {
    "metadata": {
      "title": "Lexicon derived from corpus."
    },
    "words": [
      {
        "form": "amik-o-n",
        "gloss": "friend-N-ACC"
      },
      {
        "form": "est-as",
        "gloss": "to_be-PRES"
      },
      {
        "form": "malfeliĉ-o",
        "gloss": "misfortune-N"
      },
      {
        "form": "mi-a",
        "gloss": "ego-ADJ"
      },
      {
        "form": "montr-as",
        "gloss": "show-PRES"
      },
      {
        "form": "nom-o",
        "gloss": "name-1.SG"
      },
      {
        "form": "Pat",
        "gloss": "Pat"
      }
    ]
  },
  "grammar": {
    "metadata": {
      "title": "Esperanto grammatical category index."
    },
    "categories": [
      {
        "category": "pos",
        "value": "adj",
        "symbol": "ADJ"
      },
      {
        "category": "person",
        "value": "first",
        "symbol": "1"
      },
      {
        "category": "number",
        "value": "singular",
        "symbol": "SG"
      },
      {
        "category": "tense",
        "value": "present",
        "symbol": "PRES"
      },
      {
        "category": "pos",
        "value": "noun",
        "symbol": "N"
      },
      {
        "category": "case",
        "value": "accusative",
        "symbol": "ACC"
      }
    ]
  }
}
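
One nice consequence of having everything in one structure: the lexicon above really can be derived from the corpus. The following Python sketch is only one guess at how that derivation might go (collect each distinct form/gloss pair, then sort by form); the function name `derive_lexicon` is hypothetical, and the corpus is trimmed down to just the words.

```python
def derive_lexicon(corpus):
    """Collect each distinct (form, gloss) pair in the corpus, sorted by form."""
    seen = set()
    words = []
    for text in corpus["texts"]:
        for sentence in text["sentences"]:
            for word in sentence["words"]:
                key = (word["form"], word["gloss"])
                if key not in seen:
                    seen.add(key)
                    words.append({"form": word["form"], "gloss": word["gloss"]})
    # Sort case-insensitively so that "Pat" does not jump ahead of lowercase forms.
    words.sort(key=lambda w: w["form"].casefold())
    return {"metadata": {"title": "Lexicon derived from corpus."}, "words": words}


corpus = {
    "texts": [
        {"sentences": [{"words": [
            {"form": "mi-a", "gloss": "ego-ADJ"},
            {"form": "nom-o", "gloss": "name-1.SG"},
            {"form": "est-as", "gloss": "to_be-PRES"},
            {"form": "Pat", "gloss": "Pat"},
        ]}]},
        {"sentences": [{"words": [
            {"form": "amik-o-n", "gloss": "friend-N-ACC"},
            {"form": "montr-as", "gloss": "show-PRES"},
            {"form": "malfeliĉ-o", "gloss": "misfortune-N"},
        ]}]},
    ]
}

lexicon = derive_lexicon(corpus)
print([w["form"] for w in lexicon["words"]])
```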

Below is another presentation of the same data. Unfortunately I can’t control how tables are formatted (without work I don’t have time to do!) in this forum, so I’ll just let you take a look as-is. Hopefully you’ll be able to glean some idea of the way this structure contains a corpus, a lexicon, and a very, very bare representation of “grammar”.

Tabular demonstration of a JSON “Boasian database”
language
  metadata
    name: Esperanto
    codes
      glottocode: espe1235
      iso639: epo
    notes
      This is (very) simple example of a JSON structure that contains a corpus, lexicon, and grammar.
corpus
  metadata
    title: A tiny Esperanto corpus
  texts
    metadata
      title: Hello
    sentences
      transcription: Mia nomo estas Pat.
      translation: My name is Pat.
      words
        form          gloss
        mi-a          ego-ADJ
        nom-o         name-1.SG
        est-as        to_be-PRES
        Pat           Pat
    metadata
      title: Advice
    sentences
      transcription: Amikon montras malfeliĉo.
      translation: A friend shows in misfortune.
      words
        form          gloss
        amik-o-n      friend-N-ACC
        montr-as      show-PRES
        malfeliĉ-o    misfortune-N
lexicon
  metadata
    title: Lexicon derived from corpus.
  words
    form          gloss
    amik-o-n      friend-N-ACC
    est-as        to_be-PRES
    malfeliĉ-o    misfortune-N
    mi-a          ego-ADJ
    montr-as      show-PRES
    nom-o         name-1.SG
    Pat           Pat
grammar
  metadata
    title: Esperanto grammatical category index.
  categories
    category    value         symbol
    pos         adj           ADJ
    person      first         1
    number      singular      SG
    tense       present       PRES
    pos         noun          N
    case        accusative    ACC

Also, if you feel like it, you can try messing around with that JSON/tabulation thing interactively here.

I’ll stop here but I’m hoping this is of some interest to some people (especially my committee, since this sort of thing is what my dissertation is about!! :sweat_smile: )


Very cool! I think it makes sense to have the same thing in the middle between all these different formats. I guess any automatic conversion doohickey, even just between two file formats, has to deal with the same issue that not everyone wants their outputs to look the same. Like even if the pieces of the input and output are matched correctly, the same doohickey won’t work for everyone.

For example, there are many ways to format the same info in LaTeX. Here is a terrible half-broken doohickey that I made to output tree diagrams in LaTeX because I got tired of making them by hand. There isn’t really a conversion because the input is just, well, inputted on the page. But the output is LaTeX that makes trees how I like my LaTeX trees to look in one document.

My question for people who actually have some idea what they’re doing is, are there ways to deal with that? Or do you just need multiple outputs for different types of LaTeX styles (or whatever)?


Indeed! One way to think about it is that there could be many export programs, even to the same file type. So it would really be more like:

[Diagram: whatever → JSON; JSON → latex1, latex2, latex3, HTML1, HTML2, HTML3]

The advantage is that import is factored out from export, even if there is a lot of experimentation going on in what the structure of the export is within a given format — getting to a “templating” situation where there is a single workflow but lots of “templates” is still progress.
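
Here is a rough Python sketch of that “many exporters, even to the same file type” idea. The three exporter functions and the `EXPORTERS` registry are hypothetical names, and the LaTeX one uses only core commands and makes no attempt to escape special characters:

```python
from html import escape

sentence = {
    "transcription": "Amikon montras malfeliĉo.",
    "translation": "A friend shows in misfortune.",
}


def export_html_paragraph(s):
    """One HTML 'template': both lines in a single paragraph."""
    return "<p><i>{}</i> {}</p>".format(escape(s["transcription"]), escape(s["translation"]))


def export_html_list(s):
    """Another HTML 'template' for the same data: a two-item list."""
    items = "".join("<li>{}</li>".format(escape(s[key])) for key in ("transcription", "translation"))
    return "<ul>" + items + "</ul>"


def export_latex_quote(s):
    """A LaTeX 'template'; special characters are not escaped in this sketch."""
    lines = [
        "\\begin{quote}",
        "\\textit{" + s["transcription"] + "}\\\\",
        "`" + s["translation"] + "'",
        "\\end{quote}",
    ]
    return "\n".join(lines)


# Every exporter takes the same structure, so adding a style means adding one function.
EXPORTERS = {
    "html-paragraph": export_html_paragraph,
    "html-list": export_html_list,
    "latex-quote": export_latex_quote,
}

for name, export in EXPORTERS.items():
    print("--", name)
    print(export(sentence))
```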

In the case of your (very :sunglasses:) tree builder, I would suggest you have actually written two “renderers”:

  1. Your own renderer code, which draws a tree to a <canvas> element (wow!).
  2. A LaTeX generator, which produces code that can be pasted into a LaTeX document.

If I may, here’s how you might consider using a “JSON in the Middle” approach for this project. Right now, you’ve got a system which has a kind of “wizard” UI — buttons are generated dynamically as the user is progressing down the tree (again, :sunglasses:).

I would argue that there is a kind of conversion going on here, but it’s an implicit one, and in your current design, it’s done progressively. So you fill out an input and then click to add a node, and then 1) the LaTeX tree and 2) the canvas are updated. So it’s essentially like (yes, I am having too much fun with the graphviz plugin):

[Diagram: click → update canvas; click → update LaTeX]

But those things could be factored into their own modules, I guess you would say:

[Diagram: gui → JSON; JSON → update canvas; JSON → update LaTeX]

In other words, you think of it as a two-step process. This is good because maybe you can re-use the canvas or latex generation code. Or maybe you can plug your tree structure into some other visualization library.

I think it might sound like I’m in the land of wishful thinking in terms of how realistic it is that JSON is any more “universal” than anything else. It’s not, it’s just that it’s much easier to parse in most programming languages.

Tree data specifically is kind of quasi-standardized: Glottolog, for instance, offers something called Newick format, which is actually from biology. (Their JSON format is kind of disappointing, actually, since it just sticks Newick in a string.)

But you basically have nodes with attributes and a list of children; these are typically represented as an object with an array of other nodes as the value of a “children” or “data” property. In fact your TikZ thing is quite close to that.
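
A minimal Python sketch of that shape, with two small renderers hanging off the same structure. The tree itself is a made-up toy example and the “label” property name is an assumption. The bracket output is a generic labelled bracketing rather than the exact syntax of any particular LaTeX tree package.

```python
# Each node is an object with a label and an array of child nodes.
tree = {
    "label": "S",
    "children": [
        {"label": "NP", "children": [{"label": "Pat", "children": []}]},
        {"label": "VP", "children": [
            {"label": "V", "children": [{"label": "sleeps", "children": []}]},
        ]},
    ],
}


def to_brackets(node):
    """Render as labelled bracketing: [label child child ...]."""
    if not node["children"]:
        return node["label"]
    inner = " ".join(to_brackets(child) for child in node["children"])
    return "[{} {}]".format(node["label"], inner)


def to_newick(node):
    """Render Newick-style: children in parentheses, comma-separated, then the label."""
    if not node["children"]:
        return node["label"]
    inner = ",".join(to_newick(child) for child in node["children"])
    return "({}){}".format(inner, node["label"])


print(to_brackets(tree))      # [S [NP Pat] [VP [V sleeps]]]
print(to_newick(tree) + ";")  # ((Pat)NP,((sleeps)V)VP)S;
```

Neither renderer knows about the other, and a third one (say, for a canvas) could be added without touching the code that builds the tree.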

Anyway. I’m not sure I’m making much sense, but I do think there is a commonality between the problem you’re solving and the one @rgriscom was solving in his Flextext to Plaintext post. First build the data structure, then build the output: that could work in both cases.


Love this topic and have a lot to say—will respond fully later but for now let me link to a relevant project: Pepper (corpus-tools.org)

edit: oh also, Pandoc: https://pandoc.org/


Thanks for the links, Luke! It looks like Pepper already has an ELAN import module, but maybe it doesn’t have a JSON export module.

So it sounds like they are essentially trying to do what @pathall had in mind, but use the Salt model in the middle instead of JSON:

The aim of Salt is to consolidate all kinds of annotations within a single model. For doing so we need a powerful base structure, which can cover all the different necessities at once. A very well-known and powerful structure in mathematics and informatics is the common graph, which is widely used for modeling very different kinds of data. The graph structure has a further benefit in that it helps to keep the model simple with its small set of different model elements. Our graph structure is rather simple, it only contains four model elements: node, relation, label and layer.

Could we say then that Salt has the advantage of comprehensive support for all potentially relevant features, but JSON has the advantage of ease-of-use/interoperability?


@pathall I think your core observation is spot on, that there could be a lot more re-use of reformatting code if we could all agree on an interlingua format. The issue, which you observe, is that this will only be an interlingua insofar as it is actually used by everyone, and if there’s anything we know it’s that every project is going to have different perspectives on data and how to format it.

So how do you pull off a format like this? At least two extremes have been pursued: one in which you attempt to anticipate every possible need, and one in which you try to strip away everything from the data model until only core structures remain. SALT is an exemplification of the latter approach, and FoLiA (and outside of linguistics, TEI) is an exemplification of the former.

My opinionated take here is that the issue with SALT is that it’s essentially unusable except for the task of being a format conversion interlingua (and I’d argue is somewhat inadequate even for that, though this is a separate matter), and that formats like FoLiA, even though they are so big, still cannot anticipate every formatting need and because of their size are daunting to learn and therefore not attractive for use by anyone who simply wants to get their data from A to B. (FoLiA is also a product of the XML era, and XML has its issues with approachability. See an example.)

What if we had FoLiA with JSON instead of XML? I think that would clear away the XML part of its usability issues, but having an enormous data format standard will turn off many users, and for users that do use your format anyway, they will likely not use it correctly in ways that would often be subtle but not inconsequential on the scale of all usages of the standard.

So what’s the way out? It doesn’t seem like any attempted universal-to-all-of-linguistics format has yet panned out, but within certain domains, there have been remarkable successes. The CoNLL-U format has become the de facto standard for NLP tasks like syntactic dependency parsing and part of speech tagging. It doesn’t attempt to serve all of linguistics or even all of NLP, but grew out of the format used for the CoNLL-X shared task on dependency parsing and came to prominence with its adoption by the Universal Dependencies project, which is explicitly concerned just with morphological, syntactic, and lexical annotations. What would a restriction look like for language documentation? Perhaps a focus just on the four core kinds of annotation (orthographic transcription, phonological transcription with morphological segmentation, interlinear glosses, and free translation) could allow a universal format to succeed in giving people who do language documentation an easy way to go between all their other formats.
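
To give a feel for how small such a restricted format could be, here is a Python sketch of a record with just those four kinds of annotation, plus a basic alignment check. The class name, field names, and the hyphen-counting check are all assumptions made for illustration, not part of any existing standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    orthographic: str     # orthographic transcription
    segmented: List[str]  # one morpheme-segmented form per word
    glosses: List[str]    # one interlinear gloss per word
    translation: str      # free translation


def alignment_problems(sentence):
    """Return a list of human-readable problems; an empty list means the tiers line up."""
    problems = []
    if len(sentence.segmented) != len(sentence.glosses):
        problems.append("word count differs between segmentation and glosses")
    for form, gloss in zip(sentence.segmented, sentence.glosses):
        # Hyphens mark morpheme boundaries, so both tiers should have the same number.
        if form.count("-") != gloss.count("-"):
            problems.append("morpheme count differs in {!r} / {!r}".format(form, gloss))
    return problems


good = Sentence(
    orthographic="Mia nomo estas Pat.",
    segmented=["mi-a", "nom-o", "est-as", "Pat"],
    glosses=["ego-ADJ", "name-1.SG", "to_be-PRES", "Pat"],
    translation="My name is Pat.",
)
bad = Sentence(
    orthographic="Amikon montras malfeliĉo.",
    segmented=["amik-o-n", "montr-as", "malfeliĉ-o"],
    glosses=["friend-ACC", "show-PRES", "misfortune-N"],
    translation="A friend shows in misfortune.",
)

print(alignment_problems(good))
print(alignment_problems(bad))
```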

Could we say then that Salt has the advantage of comprehensive support for all potentially relevant features, but JSON has the advantage of ease-of-use/interoperability?

I’d almost agree, though I think it’s a little generous to say that SALT has “support” for all relevant features. It’s a bit like saying that a toolshed and a forest will “allow” you to construct a sailboat. Technically true, but you’d probably want some more specialized materials and tools.

One more sidenote: it’s important when thinking about these issues to distinguish the content of the data from the syntax of the data format. If linguistic annotations were cocktails, then the content of the data (morphological analyses, syntax trees, glosses) is like the ingredients, and the syntax of the data format (JSON, XML, SALT) is like the glass it’s served in. Let’s not push this metaphor too far, but what makes a mimosa a mimosa is that it has orange juice and something like champagne, and while it is often served in a flute, it would not change the fact that it is a mimosa if it were served in a martini glass, a wine glass, a margarita glass, or a rocks glass. Similarly, IGT is still essentially IGT, whether it’s expressed in the syntax of XML, JSON, or SALT: you have three or four parallel layers which all mean certain things, and it does not change the character of the data itself whether it is expressed in XML or JSON. It’s true that JSON tends to be more usable for most people who write code, but I think the much harder problem that needs to be solved here is that of universalizing the content of the data, rather than settling on the syntax of the data format.
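
The glass-versus-ingredients point can be shown in a few lines of Python. In this sketch the same word list is poured into JSON and into XML and recovered unchanged from both; the XML element and attribute names are made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# The "ingredients": the content, independent of any file syntax.
words = [
    {"form": "amik-o-n", "gloss": "friend-N-ACC"},
    {"form": "montr-as", "gloss": "show-PRES"},
    {"form": "malfeliĉ-o", "gloss": "misfortune-N"},
]


def to_json(words):
    return json.dumps(words, ensure_ascii=False)


def from_json(text):
    return json.loads(text)


def to_xml(words):
    root = ET.Element("words")
    for word in words:
        ET.SubElement(root, "word", form=word["form"], gloss=word["gloss"])
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return [dict(element.attrib) for element in ET.fromstring(text).iter("word")]


# Two different "glasses", same contents.
print(to_json(words))
print(to_xml(words))
print(from_json(to_json(words)) == from_xml(to_xml(words)) == words)
```

The serialized strings look nothing alike, but what comes back out is identical, which is why agreeing on the content is the harder and more important part.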


Thanks so much for your thoughts @ldg, especially on the content-syntax distinction. I think you are right that if you restrict yourself to language documentation or basic descriptive linguistics then there is a more or less established standard for the content side (interlinearized text). That is what @pathall has proposed for the docling.js data types, too.

I wonder what the remaining issues related to universalizing the data content within the scope of LD are then? I remember Pat mentioned the issue of tone. Templatic/non-concatenative morphology has also caused me a lot of problems in the past when using FLEx/ELAN. It is when you get to the parsing and glossing stage that everything complexifies exponentially.

The value of a system like docling.js, though, even if it doesn’t solve these problems directly, is that it opens the path for linguists to get more closely involved in the design of the data structures that they use, and to better integrate their data into other aspects of their work (e.g. writing publications, presenting, etc.). This is the path forward, right? The general level of data literacy in the LD sub-discipline is continuing to increase each year, and we are getting closer and closer to a critical mass of linguists with basic coding skills.

(Edited for pessimism! :D)