
High availability #169

Open
6 tasks
mostafa opened this issue Feb 26, 2023 · 3 comments
Labels
enhancement (New feature or request), epic (To be broken down into multiple tasks), needs investigation (Investigation is needed to flesh out the details and possibly create new tickets)
Milestone

Comments

@mostafa
Member

mostafa commented Feb 26, 2023

This is to ensure HA of GatewayD by running a cluster of machines that can connect to each other and serve clients. So, plan and create tickets for all of the following features, then start implementing them.

  • Distributed state management (using gossip protocols)
  • High-availability
  • Fail-over (fault detection)
  • Clustering
  • Service mesh
  • Control plane?

Resources

@mostafa mostafa converted this from a draft issue Feb 26, 2023
@mostafa mostafa self-assigned this Feb 26, 2023
@mostafa mostafa added the enhancement New feature or request label Feb 26, 2023
@mostafa mostafa changed the title Clustering Clustering and service mesh Feb 26, 2023
@mostafa mostafa changed the title Clustering and service mesh High availability Oct 31, 2023
@mostafa mostafa moved this from ✨ New to 📋 Backlog in GatewayD Core Public Roadmap Oct 31, 2023
@mostafa mostafa added this to the v0.9.x milestone Oct 31, 2023
@mostafa mostafa added the epic To be broken down into multiple tasks label Nov 1, 2023
@mostafa mostafa removed their assignment Dec 9, 2023
@mostafa mostafa added the needs investigation Investigation is needed to flesh out the details and possibly create new tickets label May 1, 2024
@mostafa mostafa modified the milestones: v0.9.x, v0.10.x Oct 15, 2024
@sinadarbouy
Collaborator

For this issue, I think we can solve it by using github.com/hashicorp/raft (as mentioned in the issue description) to handle the state and coordination between nodes.

Here’s how I see it working:

  1. Expose a Raft Port: We’ll need to open up an extra port for Raft. Then, during startup, all the nodes can connect and form a Raft cluster.

  2. Single Raft Cluster for All Config Groups: Instead of having a separate Raft cluster for each configuration group, we can just have one for all of them. It should simplify things and reduce overhead.

  3. Handling Stateful Parameters: We can store stateful parameters as key-value pairs, similar to how we handle them in the Redis plugin (configurationGroup-Configurationblock-Key). Raft will help ensure all nodes stay in sync on these values.

  4. Fetching State Variables from Files: For things like connection counts, we can store them in a file and fetch them when needed. Since this usually happens in the OnOpen phase and during connection setup, performance shouldn’t be an issue.

With this approach, if we have three instances of GatewayD running, they can all receive requests, but they'll rely on Raft's consensus (voting) process to fetch the stateful variables, ensuring everything stays consistent before creating a connection between the client and the DB.

If this approach sounds good, I can start working on it.

@mostafa
Member Author

mostafa commented Oct 17, 2024

After some investigation, and given that the gossip protocol libraries are old and unmaintained, I think the go-to approach is to use Raft, considering that Kafka also used it to move away from ZooKeeper. I think we should stick with simplicity and ease of use, as you also mentioned, rather than creating a Raft cluster per tenant. We can also consider storing the state variables in SQLite or ObjectBox.

Let's create another ticket and link it to this one.

@sinadarbouy
Collaborator

I checked again, and it turns out we don’t need to store our state in a file. HashiCorp Raft already uses BoltDB to handle the Raft logs for persistence and recovery. We can just use sync.Map to keep our state in memory since we’re only working with simple key-value data. Since we don’t have complex data, skipping a database like ObjectBox should be fine, as long as we rely on Raft for consistency and recovery.
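A rough sketch of that in-memory approach, assuming the `inMemoryState` type and its `Replay` hook are hypothetical names (durability would come from the Raft log store, e.g. raft-boltdb, not from this struct):

```go
package main

import (
	"fmt"
	"sync"
)

// inMemoryState keeps the replicated key-value state in a sync.Map,
// relying on Raft's persisted log for durability and recovery.
type inMemoryState struct {
	m sync.Map // composite key -> value
}

func (s *inMemoryState) Set(key, value string) { s.m.Store(key, value) }
func (s *inMemoryState) Delete(key string)     { s.m.Delete(key) }

func (s *inMemoryState) Get(key string) (string, bool) {
	v, ok := s.m.Load(key)
	if !ok {
		return "", false
	}
	return v.(string), true
}

// Replay restores state after a restart by re-applying entries that
// Raft hands back while replaying the committed log during recovery.
func (s *inMemoryState) Replay(entries map[string]string) {
	for k, v := range entries {
		s.m.Store(k, v)
	}
}

func main() {
	st := &inMemoryState{}
	st.Set("default-clients-count", "3")
	if v, ok := st.Get("default-clients-count"); ok {
		fmt.Println(v) // prints 3
	}
}
```

sync.Map fits here because the access pattern is many concurrent reads of disjoint keys with comparatively rare writes, which is exactly the workload it is optimized for.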
