-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Determine how to handle recipes when a less-recent earliest source URL is found #71
Comments
I think that this is essentially a redirection problem: we have content that has been available using a given identifier that has been updated to use a different identifier. HTTP -- the transport layer for our APIs -- provides status codes that are intended for exactly that purpose, and it may be possible to update the recipe retrieval API endpoints to respond with those redirection codes. (note: it would alternatively be possible to have the API endpoint silently return the data from the updated content when a client asks for content for the previous identifier... I think it's more transparent to indicate to the client that the content has been moved to an updated location) For that approach, a database table to store the redirections that have taken place will be required, and the information from that table would need to be made available in the search index. Additionally, the Finally, we would need to detect the migration of recipe content from the previous identifier to an updated identifier during-or-before reindexing of recipes, and remove (delete) the stale recipes from the search index. Note: deletion of stale recipes from the search index is not required for redirection handling, but is required so that users do not see duplicate entries when searching across recipes (that is, when using access methods that do not specify individual identifiers). At the moment, I do not believe that we have any API endpoints that accept more than a single recipe identifier during a request, and so there should not be any challenging edge-cases to consider due to the presence of mixed stale-and-fresh recipe contents when querying by recipe IDs plural -- because that form of query is not supported. |
Question: is it possible to achieve the intended result without adding a secondary search index purely to store redirections? Thoughts: perhaps the search-indexed The benefit of doing that would be that the |
To implement this and resolve #47 at the same time, I think that the recipe query-by-ID endpoint should be updated to:
The |
Rollout plan:
|
Taking that approach could cause a fairly significant reduction in the throughput of recipe indexing due to an additional database query round-trip, so I think that we should instead try to detect a change-of-identifier elsewhere in the code at runtime. The most likely place to do this seems to be the |
This has been implemented, except for the suggestion to use HTTP response codes for recipe redirection; currently this is handled within the server. That decision may be worth revisiting later. |
Is your feature request related to a problem? Please describe.
As described in #58, the system performs de-duplication of recipe IDs by finding the earliest-known crawl URL for any given recipe before it is re-indexed, and this value is used to create the recipe ID. This allows multiple recipes that originate from the same earliest-known origin URL to share the same recipe identifier.
Under some circumstances, a recipe may be indexed into the search engine and then it later becomes known that an earlier-crawled URL in fact refers to the same recipe. This presents a challenge, because the code to determine the recipe ID will now produce a different value for the recipe. Subsequent reindexing of the recipe will use the updated identifier, but any existing copy of the recipe in the search index will not be removed. This causes stale data to remain in the system, and some people may have kept bookmarks to the old recipes.
This was indirectly hinted by a comment in issue #58.
Describe the solution you'd like
Primarily, the requirements are:
Items one and three are similar; a review of the application functionality and related API calls may be required to determine what client-side and/or server-side changes are necessary to address those.
Describe alternatives you've considered
Allowing duplicates to exist within the search index was considered, but the presence of stale data would lead to confusing and suboptimal user experiences.
Allowing search-by-URL instead of using opaque recipe IDs has also been considered - there are some good arguments for this.
Additional context
Closely-related to #47.
The text was updated successfully, but these errors were encountered: