Elastic search: add a synchronization mechanism #7689

Open
Betree opened this issue Nov 26, 2024 · 0 comments · May be fixed by opencollective/opencollective-api#10521
Labels
api Issues that require some work on the API (https://github.com/opencollective/opencollective-api)

Betree commented Nov 26, 2024

Needs

  1. We want all data to be reflected.
  2. We want data updates to be reflected.
  3. We don't want a significant delay between an item creation and its addition to the search (e.g. when creating an expense, it should be surfaced in the search in less than a minute).
  4. If the synchronization mechanism crashes for any reason, there must be a way to catch up with the missed entries.
  5. In some cases, like a collective being unhosted, we want to trigger a full re-indexing of the account and its related data (expenses, comments, orders...etc).

The naive solution: a CRON job

Today, the api/scripts/search.ts script lets us fully synchronize the DB with Elastic Search, supporting incremental updates based on deletedAt/updatedAt. The simplest approach would therefore be to re-use these functions in a CRON job called every 10 minutes. This would satisfy (1), (2), and (4), but:

  • The latency (up to 10 minutes) would not be acceptable.
  • Performance could become an issue, depending on how updatedAt/deletedAt values are indexed in different tables. For this reason, simply running the same CRON with a 1-minute interval is not a great solution.
  • (5) could only be implemented through hacky solutions.
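
For reference, the incremental selection a CRON job would perform can be sketched as below. This is only an illustration of the updatedAt/deletedAt idea, not the logic of api/scripts/search.ts: the `Row` shape, `planIncrementalSync`, and the in-memory filtering are all assumptions (a real job would filter in SQL).

```typescript
// Hypothetical row shape: only the timestamp columns mentioned above matter here.
interface Row {
  id: number;
  updatedAt: Date;
  deletedAt: Date | null;
}

interface SyncPlan {
  toIndex: number[];
  toDelete: number[];
}

// Splits the rows touched since the last run into "index" and "delete" sets.
// Rows deleted before the last run are assumed to be already gone from the index.
function planIncrementalSync(rows: Row[], lastSyncedAt: Date): SyncPlan {
  const since = lastSyncedAt.getTime();
  const plan: SyncPlan = { toIndex: [], toDelete: [] };
  for (const row of rows) {
    if (row.deletedAt !== null) {
      if (row.deletedAt.getTime() > since) {
        plan.toDelete.push(row.id);
      }
    } else if (row.updatedAt.getTime() > since) {
      plan.toIndex.push(row.id);
    }
  }
  return plan;
}
```

Each run has to scan every table for rows newer than the last run, which is why the cost depends on how updatedAt/deletedAt are indexed.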

Using a message queue

According to personal research and Elastic Search official recommendations, the best practice for synchronizing Postgres with Elastic Search in use cases like ours is:

  1. Have a message queue for search indexation requests
  2. Whenever something is updated/created/deleted, push a message to this queue to record that the item needs indexing
  3. Have a consumer job that takes the last 𝑥 items in the queue and indexes them using Elastic Search's bulk method

This guarantees:

  • Speed of synchronization
  • Predictable/configurable delays
  • No performance overload

Combined with our existing scripts that are able to fix discrepancies between Postgres & Elastic Search (and that we could automate in a daily script if needed), this should be able to satisfy all needs listed above.
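
The queue-and-consumer flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: `IndexQueue` is an in-memory stand-in for the real queue, and `IndexRequest`, `dedupe`, `toBulkOperations`, and `loadDocument` are hypothetical names. The output follows the bulk API convention of one action entry followed, for index actions, by the document itself.

```typescript
type Action = 'sync' | 'delete';

interface IndexRequest {
  action: Action;
  index: string;
  id: number;
}

// In-memory stand-in for the message queue (assumption: the real queue is external).
class IndexQueue {
  private items: IndexRequest[] = [];

  push(request: IndexRequest): void {
    this.items.push(request);
  }

  // Removes and returns up to `max` of the oldest pending requests.
  take(max: number): IndexRequest[] {
    return this.items.splice(0, max);
  }
}

// Keeps only the most recent request for each (index, id) pair in a batch.
function dedupe(batch: IndexRequest[]): IndexRequest[] {
  const latest = new Map<string, IndexRequest>();
  for (const request of batch) {
    const key = `${request.index}:${request.id}`;
    latest.delete(key); // re-insert so the entry moves to its latest position
    latest.set(key, request);
  }
  return Array.from(latest.values());
}

// Builds a bulk operations array: an action entry, then the document for index actions.
function toBulkOperations(
  batch: IndexRequest[],
  loadDocument: (request: IndexRequest) => Record<string, unknown>,
): Record<string, unknown>[] {
  const operations: Record<string, unknown>[] = [];
  for (const request of dedupe(batch)) {
    const target = { _index: request.index, _id: String(request.id) };
    if (request.action === 'delete') {
      operations.push({ delete: target });
    } else {
      operations.push({ index: target });
      operations.push(loadDocument(request));
    }
  }
  return operations;
}
```

A consumer would call `take(x)` on a timer and pass the resulting array to the Elastic Search client's bulk call. Deduplicating first means an item updated several times within one batch is only indexed once.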

Which message queue?

There are many solutions out there, but 2 stand out for us:

  1. RabbitMQ
  2. Postgres listen/notify

The main benefit of RabbitMQ is the persistence of the queue: if the sync process goes down, it will be able to gracefully catch up.

This benefit can also be achieved through the synchronization scripts that already exist. Compared to RabbitMQ, Postgres listen/notify has some other advantages:

  • It doesn't require any new service (simpler setup, fewer failure points)
  • It gives us the native ability to hook into Postgres triggers to automatically synchronize tables, even when the update is made manually with a different tool.
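
As an illustration of the trigger-based approach, the helper below generates SQL for a trigger that calls `pg_notify` whenever a row changes. It is a sketch, not a proposed migration: `buildNotifyTriggerSql`, the `search_sync` channel, and the function and trigger names are hypothetical, and the SQL assumes the table has `id` and `"deletedAt"` columns (soft deletes). Table and index names are interpolated directly, so they must come from trusted code.

```typescript
// Generates SQL for a row-level trigger that publishes a message on a Postgres
// notification channel after every insert, update, or delete on `table`.
// Assumptions: the table has `id` and "deletedAt" columns; all names are trusted.
function buildNotifyTriggerSql(table: string, index: string, channel = 'search_sync'): string {
  const fn = `notify_${index}_search_sync`;
  return `
CREATE OR REPLACE FUNCTION ${fn}() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    PERFORM pg_notify('${channel}', 'delete:${index}:' || OLD.id);
    RETURN OLD;
  ELSIF NEW."deletedAt" IS NOT NULL THEN
    -- Soft-deleted rows are removed from the search index
    PERFORM pg_notify('${channel}', 'delete:${index}:' || NEW.id);
  ELSE
    PERFORM pg_notify('${channel}', 'sync:${index}:' || NEW.id);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ${fn}_trigger
AFTER INSERT OR UPDATE OR DELETE ON "${table}"
FOR EACH ROW EXECUTE FUNCTION ${fn}();
`.trim();
}
```

The synchronization job would then run `LISTEN search_sync` on a dedicated connection and handle each notification payload. Notifications are only delivered to sessions that are listening at the time, which is why the existing catch-up scripts remain necessary.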

Technical specs

Setup
Based on the configuration, the synchronization job will be started either alongside the server (preferred option in dev) or as a standalone process (to better isolate it in production).

Messages

  • delete:{index}:{id}: Removes a single entry from the index
  • sync:{index}:{id}: Synchronizes a single entry with the DB
  • sync_full:{index}:{id}: Synchronizes an entry and all its relations
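
The message format above could be encoded and decoded along these lines. This is a sketch only: `SyncMessage`, `formatMessage`, and `parseMessage` are hypothetical names, and it assumes that ids are positive integers and that index names contain no colon.

```typescript
type MessageType = 'delete' | 'sync' | 'sync_full';

interface SyncMessage {
  type: MessageType;
  index: string;
  id: number;
}

const MESSAGE_TYPES: readonly string[] = ['delete', 'sync', 'sync_full'];

// Serializes a message as `{type}:{index}:{id}`.
function formatMessage(message: SyncMessage): string {
  return `${message.type}:${message.index}:${message.id}`;
}

// Parses a raw payload; returns null when it does not match the expected format.
function parseMessage(raw: string): SyncMessage | null {
  const parts = raw.split(':');
  if (parts.length !== 3) {
    return null;
  }
  const [type, index, rawId] = parts;
  const id = Number(rawId);
  if (MESSAGE_TYPES.indexOf(type) === -1 || index === '' || !Number.isInteger(id) || id <= 0) {
    return null;
  }
  return { type: type as MessageType, index, id };
}
```

Returning null for malformed payloads lets the consumer log and skip them instead of crashing the synchronization job.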
@Betree Betree added the api Issues that require some work on the API (https://github.com/opencollective/opencollective-api) label Nov 26, 2024
@Betree Betree self-assigned this Nov 28, 2024