
Replicating the services database #119

Open
progval opened this issue Apr 20, 2024 · 1 comment

progval commented Apr 20, 2024

Currently, sable_services writes its database as a single JSON file on disk. This is similar to what Atheme does, so we know it works at least at Libera.Chat's scale.
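For context, the usual way to keep a single-file snapshot like this crash-safe is to write to a temporary file and atomically rename it over the old one. A minimal Rust sketch (names and paths are illustrative, not the actual sable_services code):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn save_snapshot(path: &Path, json: &[u8]) -> std::io::Result<()> {
    // Write to a sibling temp file first, so a crash mid-write can
    // never corrupt the existing snapshot.
    let tmp = path.with_extension("json.tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(json)?;
    file.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, path) // atomic replacement on POSIX filesystems
}

fn main() -> std::io::Result<()> {
    let state = br#"{"accounts":{"alice":{"id":1}}}"#;
    save_snapshot(Path::new("services.json"), state)
}
```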

While this can easily be replicated to other services, it means sable_services going down causes an outage where people cannot log in, channel ops cannot be opped, etc. This happens on Libera.Chat from time to time.

Given Sable's distributed architecture, we can do better here. @spb's idea is to have multiple sable_services nodes, one of which would be the leader and would stream its database to the others.
The database could remain a single JSON file, but copying this file over and over might become a scaling concern. We see a few options to solve this:

  1. use a database that supports streaming replication, like PostgreSQL
  2. make sable_services nodes coordinate over the Sable network, and each have their own independent database
  3. make sable_services nodes share a single replicated database (Cassandra, something on top of Ceph, CockroachDB, ...)

With options 1 and 2, if we want high availability, sable_services needs some form of leader election, because we can't allow writes to the same objects from multiple nodes at the same time. PostgreSQL does not provide a solution for this, and expects users to tell it when to switch between follower and leader states.
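To make the constraint concrete, here is a minimal sketch of the write gate a leader election would give us, assuming some election mechanism (not shown here) hands out monotonically increasing epochs; none of these names exist in sable_services:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Epoch-based write gate: writes are allowed only while this node
/// holds the leadership for the newest epoch it has seen.
struct LeaderState {
    /// Highest election epoch observed anywhere on the network.
    current_epoch: AtomicU64,
    /// Epoch this node won an election for; 0 means "follower".
    held_epoch: AtomicU64,
}

impl LeaderState {
    fn new() -> Self {
        Self {
            current_epoch: AtomicU64::new(0),
            held_epoch: AtomicU64::new(0),
        }
    }

    /// Called when this node wins the election for `epoch`.
    fn become_leader(&self, epoch: u64) {
        self.current_epoch.fetch_max(epoch, Ordering::SeqCst);
        self.held_epoch.store(epoch, Ordering::SeqCst);
    }

    /// Called whenever an epoch is observed from another node.
    /// Seeing a newer epoch immediately demotes this node.
    fn observe_epoch(&self, epoch: u64) {
        let previous = self.current_epoch.fetch_max(epoch, Ordering::SeqCst);
        if epoch > previous {
            self.held_epoch.store(0, Ordering::SeqCst);
        }
    }

    /// Every database write must pass this check first.
    fn may_write(&self) -> bool {
        let held = self.held_epoch.load(Ordering::SeqCst);
        held != 0 && held == self.current_epoch.load(Ordering::SeqCst)
    }
}

fn main() {
    let state = LeaderState::new();
    assert!(!state.may_write()); // starts as a follower

    state.become_leader(1);
    assert!(state.may_write()); // sole leader for epoch 1

    state.observe_epoch(2); // another node won epoch 2
    assert!(!state.may_write()); // we must stop writing immediately
}
```

The important property is the demotion path: as soon as a node sees a newer epoch it stops writing, so two nodes can never mutate the same objects concurrently.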

And option 3 may be unsustainable for Libera, as all the solutions I'm aware of in this space require extensive specialized knowledge of that particular system (maybe not CockroachDB though? I've never tried it). In particular, Cassandra and Ceph are designed for petabyte-scale data, which is far beyond what we need here. Additionally, they often come with constraints/caveats on what software developers can do with the database.


spb commented Apr 20, 2024

> While this can easily be replicated to other services, it means sable_services going down causes an outage where people cannot log in, channel ops cannot be opped, etc. This happens on Libera.Chat from time to time.

Minor correction: in the current code, if services are down then you can't log in and can't add new channel access, but anyone already logged in can use the channel access they already have. That already makes short downtime less of a problem than it is today, so what I'm most concerned about is preventing data loss when switching over.

More generally:

Option 3 would have the advantage that the rest of the network wouldn't need to care about which node is active; anything that requires services involvement could just be sent to any node with that capability. The main requirement we'd have on the database in that scenario is that changes are committed immediately and can't later fail or be rolled back; I suspect the main roadblocks here would be the operational ones you mentioned.
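To sketch what that requirement looks like at the API boundary (purely illustrative, not an existing Sable interface): a write either commits durably or is rejected up front, with no intermediate state that could later fail or roll back.

```rust
use std::sync::Mutex;

/// Outcome of a write: either durably committed or rejected up front.
/// There is deliberately no "pending" state that could roll back later.
enum WriteOutcome {
    Committed,
    Rejected(String),
}

trait ServicesDb {
    /// Apply `change` and return only once its fate is final, so any
    /// node can accept writes without the rest of the network caring
    /// which services node is "active".
    fn apply(&self, change: &str) -> WriteOutcome;
}

/// Toy single-node stand-in; a real backend would answer `Committed`
/// only after a quorum of replicas acknowledged the change.
struct InMemoryDb {
    log: Mutex<Vec<String>>,
}

impl ServicesDb for InMemoryDb {
    fn apply(&self, change: &str) -> WriteOutcome {
        self.log.lock().unwrap().push(change.to_owned());
        WriteOutcome::Committed
    }
}

fn main() {
    let db = InMemoryDb { log: Mutex::new(Vec::new()) };
    match db.apply("register account alice") {
        WriteOutcome::Committed => println!("safe to confirm to the user"),
        WriteOutcome::Rejected(reason) => println!("rejected: {reason}"),
    }
}
```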

Between options 1 and 2 I suspect it comes down to trading off development versus operational effort, which is probably a conversation to have internally before committing to either route.
