Needs
We don't want a significant delay between an item's creation and its addition to the search index (e.g. when creating an expense, it should be surfaced in the search in less than a minute).
If the synchronization mechanism crashes for any reason, there must be a way to catch up with the missed entries.
In some cases, like a collective being unhosted, we want to trigger a full re-indexing of the account and its related data (expenses, comments, orders, etc.).
The naive solution: a CRON job
Today, the api/scripts/search.ts script lets us fully synchronize the DB with Elastic Search, supporting incremental updates based on deletedAt/updatedAt. The simplest approach would therefore be to reuse these functions in a CRON job called every 10 minutes. While this would satisfy (1), (2), and (4), it has drawbacks:
The latency (up to 10 minutes) would not be acceptable.
Performance could become an issue, depending on how updatedAt/deletedAt values are indexed in different tables. For this reason, simply running the same CRON with a 1-minute interval is not a great solution.
(5) could only be implemented through hacky solutions.
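For reference, the incremental pass such a CRON job would run can be sketched as follows. This is illustrative only: the row shape and function name are hypothetical, not the actual api/scripts/search.ts API.

```typescript
// Sketch of an incremental sync pass, assuming rows carry
// updatedAt/deletedAt timestamps. Names are illustrative only.
type Row = { id: number; updatedAt: Date; deletedAt: Date | null };

// Split rows touched since the last run into "to index" and "to delete".
function incrementalSync(rows: Row[], lastRunAt: Date) {
  const toIndex: Row[] = [];
  const toDelete: Row[] = [];
  for (const row of rows) {
    if (row.deletedAt && row.deletedAt > lastRunAt) {
      toDelete.push(row); // deleted since last run: drop from the index
    } else if (!row.deletedAt && row.updatedAt > lastRunAt) {
      toIndex.push(row); // created or updated since last run: (re)index
    }
  }
  return { toIndex, toDelete };
}
```

The cost of this pass is exactly the performance concern above: it requires scanning (or index-seeking) every synchronized table on updatedAt/deletedAt at each run.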
Using a message queue
According to personal research and Elastic Search official recommendations, the best practice for synchronizing Postgres with Elastic Search in use cases like ours is:
Have a message queue for search indexation requests
Whenever something is updated/created/deleted, push a message in this queue to document that the item needs indexing
Have a consumer job that takes the last 𝑥 items in the queue and indexes them using Elastic Search's bulk method
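The consumer step above can be sketched as follows. This is only a sketch: the queue is drained in-memory here, and the function merely assembles the payload that would be passed to Elastic Search's bulk API (a real job would also fetch the documents from Postgres before indexing).

```typescript
// Sketch of the consumer job: drain up to `batchSize` messages, dedupe so
// each (index, id) pair is handled once, and build an Elastic Search
// bulk-API payload. The queue transport itself is out of scope here.
type SyncMessage = { action: "sync" | "sync_full" | "delete"; index: string; id: string };

function buildBulkBody(queue: SyncMessage[], batchSize: number): object[] {
  const batch = queue.splice(0, batchSize);
  // Keep only the latest message per (index, id): later messages win.
  const latest = new Map<string, SyncMessage>();
  for (const msg of batch) {
    latest.set(`${msg.index}:${msg.id}`, msg);
  }
  const body: object[] = [];
  for (const msg of latest.values()) {
    if (msg.action === "delete") {
      body.push({ delete: { _index: msg.index, _id: msg.id } });
    } else {
      // "sync" and "sync_full": re-index the document (document source
      // would be fetched from Postgres in the real job).
      body.push({ index: { _index: msg.index, _id: msg.id } });
    }
  }
  return body;
}
```

Deduplicating within the batch is what keeps the load predictable: a burst of updates to the same expense results in a single index operation.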
This guarantees:
Speed of synchronization
Predictable/configurable delays
No performance overload
Combined with our existing scripts that are able to fix discrepancies between Postgres & Elastic Search (and that we could automate in a daily script if needed), this should be able to satisfy all needs listed above.
Which message queue?
There are many solutions out there, but 2 stand out for us:
RabbitMQ
Postgres Listen/notify
The main benefit of RabbitMQ is the persistence of the queue: if the sync process goes down, it will be able to gracefully catch up.
This benefit can also be achieved through the synchronization scripts that already exist. Compared to RabbitMQ, Postgres listen/notify has some other advantages:
It doesn't require any new service (simpler setup, fewer points of failure)
It gives us the native ability to hook into Postgres triggers to automatically synchronize tables, even when the update is made manually with a different tool.
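As a sketch of how trigger-driven notifications could feed the queue: a Postgres trigger would NOTIFY a channel with a small payload identifying the changed row, and a handler on the Node side would translate it into a queue message. The payload shape, channel, and table-to-index mapping below are assumptions for illustration, not the final design.

```typescript
// Hypothetical mapping from Postgres table names to Elastic Search indexes.
const TABLE_TO_INDEX: Record<string, string> = {
  Expenses: "expenses",
  Comments: "comments",
  Orders: "orders",
};

// Turn a NOTIFY payload (assumed shape: {"table": ..., "op": ..., "id": ...})
// into one of the queue messages described in the "Messages" section.
function notificationToMessage(payload: string): string | null {
  const { table, op, id } = JSON.parse(payload);
  const index = TABLE_TO_INDEX[table];
  if (!index) return null; // table not mirrored in Elastic Search
  return op === "DELETE" ? `delete:${index}:${id}` : `sync:${index}:${id}`;
}
```

Because the trigger fires inside Postgres itself, this path catches manual updates made with psql or any other tool, which a purely application-level hook would miss.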
Technical specs
Setup
Based on the configuration, the synchronization job will be started either alongside the server (the preferred option in development) or as a standalone process (to better isolate it in production).
Messages
delete:{index}:{id}: Deletes a single entry from the index
sync:{index}:{id}: Synchronizes a single entry with the DB
sync_full:{index}:{id}: Synchronizes an entry and all its relations
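Parsing these messages on the consumer side is straightforward; a minimal sketch (the type and function names are illustrative):

```typescript
// Sketch of a parser for the three message formats above.
type ParsedMessage = {
  action: "delete" | "sync" | "sync_full";
  index: string;
  id: string;
};

function parseMessage(raw: string): ParsedMessage {
  const [action, index, id] = raw.split(":");
  if ((action === "delete" || action === "sync" || action === "sync_full") && index && id) {
    return { action, index, id };
  }
  throw new Error(`Malformed search sync message: ${raw}`);
}
```

Rejecting malformed messages loudly (rather than skipping them) matters here, since a silently dropped message means an entry that never gets indexed.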
Betree added the api label (Issues that require some work on the API: https://github.com/opencollective/opencollective-api) on Nov 26, 2024