Metadata thoughts

ok, so after our Chirila party (plus other discussion) this afternoon I was thinking about metadata and viewers and such, pinging @Dani and @coralie on this topic too, and also @pathall 's metadata formats thoughts from earlier.
Pat, my Bardi metadata problems are somewhat similar (actually very similar) to my Chirila referencing/source problems, I think. Lots of different formats, lots of partial data, and no clear “organizing unit” (so I can’t use LaMeta, which uses the “session”; an “object” is the closest, but objects are linked). Zotero may actually be closest, but it has closed metadata categories which work for bibliographic references but don’t cover all the linguistic metadata (e.g. the kinds of things Ryan Sullivant talks about). But it would in principle be a nice way to share materials with communities, or to use for a database that also has a community-facing view. Zotero is very nice in general because it allows PDF annotation, note-taking, tagging, and linking related refs (e.g. transcripts for tapes, etc.). It’s not usable as is because it doesn’t have good file metadata (i.e. it’s not Omeka), and you can’t add custom categories: everything custom has to be a tag. It’s highly exportable (including to JSON) and has various visualizations and sharing possibilities. You can do linked attachments (e.g. have an “item” that’s a PDF, along with an audio file in the same record, and a .eaf file, but the .eaf file is a link so it’s always linked to the most recent version).
But ultimately, this comes back to our issues with “one tool to rule them all” vs splintered resources - it’s really helpful to have everything linked, or in one place, but the “things” in the “everything” are so heterogeneous that it doesn’t make a lot of sense to link all the pieces together.

3 Likes

Boy, do I feel this. Every field methods class and project I have been through has had problems of this sort. I still have lots of data guilt about much of my previous work because of the state of the outputs: it’s the heterogeneity that is such a challenge.

I thought I would share one approach that I’ve been working on that might at least be an interesting point of comparison for you. Given my JSON obsession it will not be surprising that I strive to get everything into JSON, but I feel like there are pretty significant advantages worth considering.

Click me for tangent with sand worms

I found myself thinking about this this morning when I happened upon A Research Guide to Frank Herbert’s Dune. A collection like this is in some ways similar to what we end up with in documentation/linguistics: all kinds of resources, in all kinds of states of processing. Some typescript manuscripts. Some manuscripts with marginalia. Some recordings. Some transcriptions of recordings. Etc, etc. I feel like the librarian’s “research guide” as

Metadata for Loma stuff

As you know I have been collaborating with @clriley and Balla on some stuff related to Loma [glottolog:loma1259]. There are all kinds of artefacts popping up as we continue to try to work on getting the Loma syllabary ready for Unicode. And I do mean all kinds of stuff: images, scans of books, references to books, scans of manuscripts, transcriptions of scans of manuscripts, emails, audio, video, transcriptions of audio and video, etc, etc. And then there are all the derived “born-digital” derivatives that we want to have available — docling.js-style JSON files with strictly structured data.

This stuff is, like your collection (although on a much smaller scale), very heterogeneous. It’s hard for me to imagine a generic metadata model that would harmonize over such different kinds of things. So, I have been experimenting with an approach that simply doesn’t even try to harmonize things (at least at first). The model is: create a metadata object in a JSON file, and include at least a title and a description field, and whatever else seems useful.
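To make that concrete, here is a sketch (in JavaScript, since the tooling below is Deno-based) of what one such per-file metadata object might look like. Only `metadata.title` and `metadata.description` are the convention described above; the `source` and `tags` fields, and the record contents, are hypothetical illustrations, not a fixed schema:

```javascript
// A hedged sketch of one per-file metadata object, following the
// "at least a title and a description" convention described above.
const exampleRecord = {
  metadata: {
    title: "Scan of a Loma syllabary chart",
    description: "Photograph of a handwritten syllabary chart from a manuscript.",
    source: "hypothetical-scans-directory", // illustrative only
    tags: ["scan", "syllabary"]             // illustrative only
  },
  data: {} // whatever the file actually contains
}

// The indexing script below only relies on the presence of `metadata`:
console.log("metadata" in exampleRecord) // true
```

The point of leaving everything beyond `title` and `description` open is that each kind of artefact can carry whatever fields make sense for it, without a shared schema up front.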

Early attempts at automating the process

Here’s the ChatGPT summary of the code I wrote to generate the metadata index file; it’s pretty good!

This code imports the walk function from the fs module in the Deno standard library. It then creates an object called fileMetadataIndex, which contains an empty files array. The code then uses the walk function to traverse the current directory and collect all JSON files that match the specified criteria (files with the .json extension, but not those that contain AUD or everson in the path, and only those that contain a hyphen in their filename). For each matching file, the code reads the file’s content, parses it as JSON, extracts the metadata, and adds an object containing the file’s path and metadata to the files array in fileMetadataIndex. Finally, the code writes the fileMetadataIndex object to a file called loma-metadata-index.json.

Here’s the actual code:

`generate-metadata-index.js`
import { walk } from "https://deno.land/std@0.168.0/fs/mod.ts"

let fileMetadataIndex = {
  metadata: {
    title: "Index of JSON files in current corpus of Loma",
    description: "Metadata objects extracted from all JSON data in this archive.",
    view: "./loma-metadata-index.html",
    sourceData: "./loma-metadata-index.json"
  }, 
  files: []
}

let files = walk('./', { 
  exts: ['.json'],
  includeDirs: false,
  skip: [
    /AUD/,     // skip any path containing "AUD"
    /everson/  // skip any path containing "everson"
  ],
  match: [
    /-/
  ]
})

for await (const e of files){
  if (e.isFile && e.name.endsWith('json')){
    let path = e.path
    
    try {
      let json = await Deno.readTextFile(path)
      let data = JSON.parse(json)

      let metadata
      if(!data.metadata) {
        metadata = { "error": `${path} has no metadata.` }
      } else {
        metadata = data.metadata
      }
      
      fileMetadataIndex.files.push({
        path: `./${path}`,
        metadata
      })
    
    } catch(error){
      console.log(e.path, error)
    }
  }
}
    
Deno.writeTextFileSync('loma-metadata-index.json', JSON.stringify(fileMetadataIndex, null, 2))



I run that script once in a while like this:

$ deno run --allow-read --allow-write generate-metadata-index.js

The result of running this code is that a file called loma-metadata-index.json is generated and saved. Then, that can be viewed with a web component I wrote called <data-viewer>. (There’s a demo for how that component works here.)

Ultimately I use a small HTML file called loma-metadata-index.html which includes a <data-viewer> (and a css file to format it a bit). It’s pretty short:

<!doctype html>
<html lang="en">
<head>
  <title>Loma Metadata Index</title>
  <meta charset="utf-8">
  <link 
    rel="stylesheet" 
    href="https://docling.net/book/docling/components/data-viewer/data-viewer.css">
</head>
<body>
<header><h1>Loma Metadata Index</h1></header>

<data-viewer src="loma-metadata-index.json"></data-viewer>

<script type=module src='https://docling.net/book/docling/components/data-viewer/DataViewer.js'></script>
</body>
</html>

The ultimate result of all this kerfuffle is this:


Now, that in and of itself isn’t like… great. It’s more like a starting point. The <data-viewer> component is very generic (it isn’t even called <metadata-viewer>!), but something new could be created. Just a few easy ideas:

  • search (!!!)
  • different ways to sort
  • a convention for adding dates of creation and ingestion (then we could sort by “what’s new?”)
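For that last idea, here’s a minimal sketch of how a “what’s new?” sort over the index could work, assuming a hypothetical `dateAdded` field on each metadata object (ISO 8601 date strings compare correctly as plain strings; the paths and titles here are made up):

```javascript
// A toy index in the same shape as loma-metadata-index.json,
// with a hypothetical `dateAdded` field on each metadata object.
const index = {
  files: [
    { path: "./a.json", metadata: { title: "A", dateAdded: "2022-11-03" } },
    { path: "./b.json", metadata: { title: "B", dateAdded: "2023-01-15" } },
    { path: "./c.json", metadata: { title: "C" } } // no date yet: sorts last
  ]
}

// Newest first; entries without a date fall to the end.
const newestFirst = [...index.files].sort((a, b) =>
  (b.metadata.dateAdded ?? "").localeCompare(a.metadata.dateAdded ?? ""))

console.log(newestFirst.map(f => f.metadata.title)) // [ "B", "A", "C" ]
```

A viewer component could offer this as one of several sort orders without the index format having to change at all.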

Still, I find this thing very useful, for several reasons:

Blockheaded search over this is still useful. Even the dumb browser-based find function can help me find random stuff: “Oh, what was that thing so-and-so was working on…” I just search for so-and-so across that page. Much easier than trying to find something in a heterogeneous file hierarchy.

Messy can still be useful. I just added a directory full of scans of pamphlets. I haven’t done it yet, but I know what the next step is: create a file called primer-index.json or something like that, which includes all the necessary citations and source history (the kind of thing Ryan mentioned in that article). Then, I can rerun my script and re-deploy everything. Note that this is not the end of the story for this content: ultimately maybe we would run OCR over it and then store the resulting output, with its own metadata, maybe we produce some derivative online quizzes or some such. The point is that throwing up some metadata is always a good idea, especially when you can automatically generate a usable index of that metadata.
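As a sketch of what that hypothetical primer-index.json might contain (every field name beyond `title` and `description`, and the file names, are illustrative guesses, not a schema):

```javascript
// Hedged sketch of a hypothetical primer-index.json for a directory of scans:
// one metadata object for the whole directory, with citation and provenance
// fields, plus per-scan notes.
const primerIndex = {
  metadata: {
    title: "Scanned Loma primers",
    description: "Page scans of printed pamphlets, with source history.",
    citation: "Full bibliographic citation of the printed source goes here.",
    sourceHistory: "Who scanned which physical copy, and when."
  },
  items: [
    { file: "./primer-01.png", note: "cover" },
    { file: "./primer-02.png", note: "title page" }
  ]
}

// Because it has a top-level `metadata` object, rerunning the index
// script would pick it up automatically.
console.log(primerIndex.metadata.title)
```
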

Sorry for the long post — and this approach may or may not be directly applicable to your Bardi or Chirila collections — but I thought it might be an interesting point of comparison.

1 Like

both that and the lack of a clear dataflow: that things can come in at any stage of the project, and that we need reverse compatibility. If it were just one set of heterogeneous data sources we could normalise them and that’d be “done”

2 Likes