Preserving community control with community-owned repos

Here’s something that’s been swimming in my head for years. I see a lot of indigenous language activists justifiably hesitant to hand their language over to universities or for-profit companies. There’s a lot that I worry about: what if the host becomes unable to host? What if the host goes rogue and sells their data without consent? What if the speaker dies and their children want the data removed and the host says no? All things we have to worry about because people suck, and all things I want to ensure that I never have the power to do.

So can we be super idealistic and make every step, from data entry to publishing a language site on the web, completely community-owned? I think we can. Here’s how:

  1. Someone (“the coder”) makes a fancy user interface for some data. Crucially, the files that define the user interface and the files that hold the data live in the same top-level folder.
  2. The coder puts the files in a repository (“repo”).
  3. A community member forks the repo - they make a copy of the original ur-repo. The community member has complete control over the fork - the coder doesn’t even have to know the fork exists.
  4. The community member adds all their language data to the repo.*
  5. The community member publishes the repo**, making the website defined within accessible to whoever they choose.

Bonus: say the coder makes an update and oooh shiny new feature. Because the community member’s fork of the repo is still connected to that original repo, the community member can click a single button and inherit all the updates as well.

* You might ask, doesn’t this mean the community member has to edit the repo files themself? Thankfully, there’s an app for that called Decap CMS (CMS = Content Management System). Decap is basically a few small files that you add to your repo, and those files define a pretty user interface that lets you log in and edit the files of the repo it’s in. (Put more technically: it commits changes to its own repo.)

Aside: The annoying thing comes with the “log in” part. The most popular repo-hosting site is GitHub, but they don’t let you handle logins like that for free. (Technical: you need an authentication server to authenticate through GitHub.) The second most popular site is GitLab, and thankfully they have a different way of doing logins that is free. (Technical: it’s called “Authorization Code with PKCE Flow”, not that I know what that means.) This does mean that, if you want to use Decap CMS to allow community member-owners to edit language data without editing files directly, you have to put the repo on GitLab instead of the more popular GitHub.
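If you want a concrete picture of what “it commits changes to its own repo” means, here’s a rough sketch in Python of the kind of request Decap CMS ends up making against GitLab’s commits API when you hit save. Everything here is a placeholder: the project path, the file name, and the token (Decap itself would use the OAuth token it got from that PKCE login, not a personal access token).

```python
# Sketch only: roughly what "saving" in Decap CMS amounts to on GitLab.
# The project path, file path, and token below are hypothetical placeholders.
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "my-group%2Fmy-language-site"  # URL-encoded "group/repo" path (made up)
TOKEN = "glpat-placeholder"              # stand-in for the OAuth token Decap gets via PKCE

def save_entry(file_path: str, content: str, message: str) -> None:
    """Commit one edited file straight back into the repo that hosts the site."""
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/repository/commits",
        headers={"PRIVATE-TOKEN": TOKEN},
        json={
            "branch": "main",
            "commit_message": message,
            "actions": [
                # "update" assumes the file already exists; use "create" for a new one
                {"action": "update", "file_path": file_path, "content": content},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()

save_entry("data/words.yml", "word: miigwech\ngloss: thank you\n", "Add a word via the editor")
```

The point is just that every edit is a normal commit in the community member’s own repo — there’s no separate database living somewhere else.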

** How do you publish? It depends. If you have a “one-way” site, where the user just sees a buncha language data, end of story, GitLab gives you a one-click “publish” button and boom, it’s on the internet.

I’ve heard some communities clamor for more sophisticated websites that perhaps let users create accounts and save their progress, etc. This is a “two-way” site that needs to store user-supplied data as it comes in. GitLab won’t let you do this, so you actually need to deploy the code from GitLab to another company that’s set up to handle “two-way” sites and store user data. (Technical: you need a PaaS that lets you run a server and a database; GitLab just serves static files.) Fly.io seems to be the best free option for this right now. You have to put a credit card down, but the free limits should be plenty to serve most communities, and I imagine there are ways to limit things before they start charging you.
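To make “two-way” concrete, here’s a minimal sketch of the kind of thing Fly.io can run and GitLab Pages can’t: a little server (Python with Flask and SQLite, purely as an illustration — none of these names come from any real site) that accepts a user’s progress and stores it in a database.

```python
# Minimal "two-way" site sketch: a server plus a database, which is exactly
# what static hosting like GitLab Pages can't give you. All names are made up.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "progress.db"

def init_db() -> None:
    with sqlite3.connect(DB) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS progress (user TEXT, lesson TEXT, score INTEGER)"
        )

@app.post("/progress")
def save_progress():
    data = request.get_json()
    with sqlite3.connect(DB) as conn:
        conn.execute(
            "INSERT INTO progress VALUES (?, ?, ?)",
            (data["user"], data["lesson"], data["score"]),
        )
    return jsonify(ok=True)

@app.get("/progress/<user>")
def get_progress(user: str):
    with sqlite3.connect(DB) as conn:
        rows = conn.execute(
            "SELECT lesson, score FROM progress WHERE user = ?", (user,)
        ).fetchall()
    return jsonify([{"lesson": lesson, "score": score} for lesson, score in rows])

if __name__ == "__main__":
    init_db()
    app.run()
```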

COST There’s fortunately enough free stuff out there to make this idea work. GitLab will probably be free forever since storing a code repo and publishing a “one-way” site are so standard that nobody will ever pay for them. Decap CMS is open-source, so that’s free forever as well (can’t take it back once the files are given away for free). Put together, GitLab and Decap CMS should allow the creation of a “one-way” site for free, forever.

The “two-way” site is trickier. It’s more expected that you’ll have to pay for that kinda stuff, and I think the cheapest I’ve found outside of Fly.io came in at like 17 bucks a month or something. (I leave it to Pat whether he wants to reveal how much the docling forum costs!) It’s less certain that Fly.io will remain free forever. For years, the go-to free way to do this was Heroku, but a yearish ago they killed their free plan. Grrr!

EASE The “one-way” site requires a community member to make a GitLab account and click a few buttons. The communities I work with have incredibly capable, computer-literate people who could follow well-written directions, but I don’t know how much computer literacy is safe to assume more generally. The “two-way” site, if it uses Fly.io, requires a lot of click clack computer garbo I don’t even know how to do. (Heroku lets you click buttons, but I guess you gotta pay for the convenience.) My dream would be to create a “control panel” program that you’d run on your computer. It would know all the steps to do and provide pretty buttons that did them for you (technical: provide a UI on top of the CLI for whatever PaaS is being used to deploy).
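For a taste of what that “control panel” could look like, here’s a toy Python sketch: a window with three buttons that just run the usual Fly.io CLI steps (flyctl auth login, flyctl launch, flyctl deploy) for you. It’s a thought experiment, not a finished tool — a real one would need to handle errors, show output nicely, and so on.

```python
# Toy "control panel" sketch: pretty buttons on top of the flyctl CLI.
# Assumes flyctl is installed; command output still shows up in the terminal.
import subprocess
import tkinter as tk

def run(*cmd: str) -> None:
    """Run one flyctl command, e.g. run("deploy") -> `flyctl deploy`."""
    subprocess.run(["flyctl", *cmd], check=False)

root = tk.Tk()
root.title("Language site control panel (sketch)")

for label, cmd in [
    ("1. Log in to Fly.io", ("auth", "login")),
    ("2. Set up the app", ("launch",)),
    ("3. Publish the site", ("deploy",)),
]:
    tk.Button(root, text=label, width=30, command=lambda c=cmd: run(*c)).pack(pady=4)

root.mainloop()
```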

More generally, I still suffer from “computers are scary” syndrome and I understand all too well how it can be an obstacle. And I write software for a living! I live in constant fear.

WHO OWNS IT?

The good: not the linguist!

The bad: The site would be subject to the whims of the third-party tools it relies on. I did my best to choose tools with good reputations, but things ens***tify all the time these days. However, even something linguist-controlled or for-profit-company-controlled would still rely on certain third-party tools and have to place its trust in something out of its control.

Am I closer to my ideal of community-owned web publishing? I think so.

Will we ever achieve that completely? Short of a community member keeping a literal physical computer running 24/7, probably not.

Would communities find value in this or is this just a fun thought experiment? Iunno.

(BTW: is this what “data sovereignty” means? I avoided using that term since I’m still not sure :slight_smile:)

You may find this article interesting: How Indigenous Groups Are Leading the Way on Data Privacy - Scientific American

All the questions you ask at the beginning NEED to be considered. But here are a few other points that need to be considered as well:

  • long-term storage: language archives (I’m assuming that is included under your “university”) are supposedly funded for the long term, including upgrading storage and formats, and maintaining accessibility even as technology changes. Archivists are professionals who specialize in this stuff. What is the risk of NOT “handing over the data”? What guarantees can the community make that the data will be accessible, let alone exist, in 100 years when (worst case scenario) all native speakers have died and their grandchildren are looking for resources to revive the language?

  • access management, IPR, rights, etc.: good archives will have policies about honoring the access management and copyrights given by the depositors; I don’t know the level of legal binding, but it is worth finding out and comparing it to any legal binding the community can put in place; granted that policies can change and funding can be lost, etc. But what are the risks of NOT taking advantage of professional support?

IMO, the main thing to ask is: Does the community who wants to own and manage the data have an accurate cost-benefit analysis? Does the programmer or linguist they might be relying on as a consultant have a good grasp on the various risks?

I’m all for community language rights and protecting data from inequitable exploitation (intentional or otherwise) by for-profit entities. (I think that might be a definition of data sovereignty, but I don’t know either.) However, I have a gnawing unease when I hear about efforts like those in the article above, because I wonder if the idealistic decision was a fully informed one.

Related to your questions: who owns the data uploaded to GitHub? What if they go rogue?
