Determine how to handle recipes when a less-recent earliest source URL is found #71

jayaddison · 2023-08-10T08:34:53Z

Is your feature request related to a problem? Please describe.
As described in #58, the system performs de-duplication of recipe IDs by finding the earliest-known crawl URL for any given recipe before it is re-indexed, and this value is used to create the recipe ID. This allows multiple recipes that originate from the same earliest-known origin URL to share the same recipe identifier.

Under some circumstances, a recipe may be indexed into the search engine and then it later becomes known that an earlier-crawled URL in fact refers to the same recipe. This presents a challenge, because the code to determine the recipe ID will now produce a different value for the recipe. Subsequent reindexing of the recipe will use the updated identifier, but any existing copy of the recipe in the search index will not be removed. This causes stale data to remain in the system, and some people may have kept bookmarks to the old recipes.

This was indirectly hinted by a comment in issue #58.

Describe the solution you'd like
Primarily, the requirements are:

Bookmarked hyperlinks to recipes should continue to work despite the discovery of less-recent source URLs for those recipes.
Recipes should be de-duplicated in the search index.
Application functionality that retrieves recipe data based on a recipe ID should receive fresh data despite the discovery of a less-recent source URL for the recipe.

Items one and three are similar; a review of the application functionality and related API calls may be required to determine what client-side and/or server-side changes are necessary to address those.

Describe alternatives you've considered
Allowing duplicates to exist within the search index was considered, but the presence of stale data would lead to confusing and suboptimal user experiences.

Allowing search-by-URL instead of using opaque recipe IDs has also been considered - there are some good arguments for this.

Additional context
Closely-related to #47.

jayaddison · 2023-08-10T08:47:26Z

I think that this is essentially a redirection problem: we have content that has been available using a given identifier that has been updated to use a different identifier. HTTP -- the transport layer for our APIs -- provides status codes that are intended for exactly that purpose, and it may be possible to update the recipe retrieval API endpoints to respond with those redirection codes.

(note: it would alternatively be possible to have the API endpoint silently return the data from the updated content when a client asks for content for the previous identifier... I think it's more transparent to indicate to the client that the content has been moved to an updated location)

For that approach, a database table to store the redirections that have taken place will be required, and the information from that table would need to be made available in the search index.

Additionally, the api service could be updated to check the relevant search index for the presence of redirections and to respond with an appropriate HTTP status code and the updated location to retrieve content from.

Finally, we would need to detect the migration of recipe content from the previous identifier to an updated identifier during-or-before reindexing of recipes, and remove (delete) the stale recipes from the search index.

Note: deletion of stale recipes from the search index is not required for redirection handling, but is required so that users do not see duplicate entries when searching across recipes (that is, when using access methods that do not specify individual identifiers).

At the moment, I do not believe that we have any API endpoints that accept more than a single recipe identifier during a request, and so there should not be any challenging edge-cases to consider due to the presence of mixed stale-and-fresh recipe contents when querying by recipe IDs plural -- because that form of query is not supported.

jayaddison · 2023-08-10T08:51:33Z

Question: is it possible to achieve the intended result without adding a secondary search index purely to store redirections?

Thoughts: perhaps the search-indexed hidden property could be used to avoid the need to delete recipes. And a singleton redirected_id property -- perhaps in combination with a timestamp to indicate when the redirection was first detected -- could be placed onto the recipe in the OpenSearch recipes search index.

The benefit of doing that would be that the api service shouldn't have to perform two queries (and thus two network roundtrips) when querying for a recipe by ID.

jayaddison · 2023-08-10T09:07:04Z

To implement this and resolve #47 at the same time, I think that the recipe query-by-ID endpoint should be updated to:

Attempt to retrieve the identified recipe (no change).
Return an HTTP 404 (not found) status if the recipe is not found (no change).
Return an HTTP 302 (found at other location) status if the redirected_to property exists on the recipe document (updated logic).
Return an HTTP 404 (not found) status if the recipe document has the hidden property set to true.

The hidden property should be set to true on recipes when a content migration takes place so that they can be excluded from recipe search results using existing search logic.

jayaddison · 2023-08-10T09:26:40Z

Rollout plan:

Database content updates
- Add redirected_id and redirected_at columns to the Recipe table in the database. (Add redirection columns to recipe database model #72)
- Set the values of the redirected_id and redirected_at fields within an overridden implementation of the Storable.index method on the Recipe model.
- Reindex all affected recipes in production and confirm that the additional columns are populated as expected on the associated database records.
Recipe search index content updates
- Update the Recipe.hidden property to also (additionally) return True when the redirected_id is non-empty.
- Update the Recipe.to_doc method to populate the redirected_id and redirected_at fields in the recipe search index.
Display layer updates
- ~~Update the recipe lookup-by-ID logic to implement the algorithm described in the previous comment in this thread.~~

jayaddison · 2023-08-10T10:12:00Z

Set the values of the redirected_id and redirected_at fields within an overridden implementation of the Storable.index method on the Recipe model.

Taking that approach could cause a fairly significant reduction in the throughput of recipe indexing due to an additional database query round-trip, so I think that we should instead try to detect a change-of-identifier elsewhere in the code at runtime. The most likely place to do this seems to be the backend recipe worker module because that's where an identifier for a recipe to reindex is received, and where the resulting request to reindex that recipe subsequent occurs.

jayaddison · 2023-12-13T01:45:43Z

This has been implemented, except for the suggestion to use HTTP response codes for recipe redirection; currently this is handled within the server. That decision may be worth revisiting later.

jayaddison added the enhancement New feature or request label Aug 10, 2023

jayaddison mentioned this issue Aug 10, 2023

Remove ingredient descriptions that begin with the word 'very' #70

Closed

jayaddison mentioned this issue Aug 10, 2023

Add redirection columns to recipe database model #72

Merged

This was referenced Dec 11, 2023

Inconsistencies in recrawling/reindexing framework #82

Open

Redirect recipes during indexing when an earlier-known-origin-URL is discovered #84

Merged

Add recipe redirection support to the API openculinary/api#118

Merged

jayaddison closed this as completed Dec 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine how to handle recipes when a less-recent earliest source URL is found #71

Determine how to handle recipes when a less-recent earliest source URL is found #71

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023 •

edited

Loading

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023 •

edited

Loading

jayaddison commented Aug 10, 2023 •

edited

Loading

jayaddison commented Dec 13, 2023

Determine how to handle recipes when a less-recent earliest source URL is found #71

Determine how to handle recipes when a less-recent earliest source URL is found #71

Comments

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023 • edited Loading

jayaddison commented Aug 10, 2023

jayaddison commented Aug 10, 2023 • edited Loading

jayaddison commented Aug 10, 2023 • edited Loading

jayaddison commented Dec 13, 2023

jayaddison commented Aug 10, 2023 •

edited

Loading

jayaddison commented Aug 10, 2023 •

edited

Loading

jayaddison commented Aug 10, 2023 •

edited

Loading