Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine how to handle recipes when a less-recent earliest source URL is found #71

Closed
jayaddison opened this issue Aug 10, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@jayaddison
Copy link
Member

Is your feature request related to a problem? Please describe.
As described in #58, the system performs de-duplication of recipe IDs by finding the earliest-known crawl URL for any given recipe before it is re-indexed, and this value is used to create the recipe ID. This allows multiple recipes that originate from the same earliest-known origin URL to share the same recipe identifier.

Under some circumstances, a recipe may be indexed into the search engine and then it later becomes known that an earlier-crawled URL in fact refers to the same recipe. This presents a challenge, because the code to determine the recipe ID will now produce a different value for the recipe. Subsequent reindexing of the recipe will use the updated identifier, but any existing copy of the recipe in the search index will not be removed. This causes stale data to remain in the system, and some people may have kept bookmarks to the old recipes.

This was indirectly hinted by a comment in issue #58.

Describe the solution you'd like
Primarily, the requirements are:

  • Bookmarked hyperlinks to recipes should continue to work despite the discovery of less-recent source URLs for those recipes.
  • Recipes should be de-duplicated in the search index.
  • Application functionality that retrieves recipe data based on a recipe ID should receive fresh data despite the discovery of a less-recent source URL for the recipe.

Items one and three are similar; a review of the application functionality and related API calls may be required to determine what client-side and/or server-side changes are necessary to address those.

Describe alternatives you've considered
Allowing duplicates to exist within the search index was considered, but the presence of stale data would lead to confusing and suboptimal user experiences.

Allowing search-by-URL instead of using opaque recipe IDs has also been considered - there are some good arguments for this.

Additional context
Closely-related to #47.

@jayaddison
Copy link
Member Author

I think that this is essentially a redirection problem: we have content that has been available using a given identifier that has been updated to use a different identifier. HTTP -- the transport layer for our APIs -- provides status codes that are intended for exactly that purpose, and it may be possible to update the recipe retrieval API endpoints to respond with those redirection codes.

(note: it would alternatively be possible to have the API endpoint silently return the data from the updated content when a client asks for content for the previous identifier... I think it's more transparent to indicate to the client that the content has been moved to an updated location)

For that approach, a database table to store the redirections that have taken place will be required, and the information from that table would need to be made available in the search index.

Additionally, the api service could be updated to check the relevant search index for the presence of redirections and to respond with an appropriate HTTP status code and the updated location to retrieve content from.

Finally, we would need to detect the migration of recipe content from the previous identifier to an updated identifier during-or-before reindexing of recipes, and remove (delete) the stale recipes from the search index.

Note: deletion of stale recipes from the search index is not required for redirection handling, but is required so that users do not see duplicate entries when searching across recipes (that is, when using access methods that do not specify individual identifiers).

At the moment, I do not believe that we have any API endpoints that accept more than a single recipe identifier during a request, and so there should not be any challenging edge-cases to consider due to the presence of mixed stale-and-fresh recipe contents when querying by recipe IDs plural -- because that form of query is not supported.

@jayaddison
Copy link
Member Author

jayaddison commented Aug 10, 2023

Question: is it possible to achieve the intended result without adding a secondary search index purely to store redirections?

Thoughts: perhaps the search-indexed hidden property could be used to avoid the need to delete recipes. And a singleton redirected_id property -- perhaps in combination with a timestamp to indicate when the redirection was first detected -- could be placed onto the recipe in the OpenSearch recipes search index.

The benefit of doing that would be that the api service shouldn't have to perform two queries (and thus two network roundtrips) when querying for a recipe by ID.

@jayaddison
Copy link
Member Author

To implement this and resolve #47 at the same time, I think that the recipe query-by-ID endpoint should be updated to:

  • Attempt to retrieve the identified recipe (no change).
  • Return an HTTP 404 (not found) status if the recipe is not found (no change).
  • Return an HTTP 302 (found at other location) status if the redirected_to property exists on the recipe document (updated logic).
  • Return an HTTP 404 (not found) status if the recipe document has the hidden property set to true.

The hidden property should be set to true on recipes when a content migration takes place so that they can be excluded from recipe search results using existing search logic.

@jayaddison
Copy link
Member Author

jayaddison commented Aug 10, 2023

Rollout plan:

  • Database content updates

    • Add redirected_id and redirected_at columns to the Recipe table in the database. (Add redirection columns to recipe database model #72)
    • Set the values of the redirected_id and redirected_at fields within an overridden implementation of the Storable.index method on the Recipe model.
    • Reindex all affected recipes in production and confirm that the additional columns are populated as expected on the associated database records.
  • Recipe search index content updates

    • Update the Recipe.hidden property to also (additionally) return True when the redirected_id is non-empty.
    • Update the Recipe.to_doc method to populate the redirected_id and redirected_at fields in the recipe search index.
  • Display layer updates

@jayaddison
Copy link
Member Author

jayaddison commented Aug 10, 2023

Set the values of the redirected_id and redirected_at fields within an overridden implementation of the Storable.index method on the Recipe model.

Taking that approach could cause a fairly significant reduction in the throughput of recipe indexing due to an additional database query round-trip, so I think that we should instead try to detect a change-of-identifier elsewhere in the code at runtime. The most likely place to do this seems to be the backend recipe worker module because that's where an identifier for a recipe to reindex is received, and where the resulting request to reindex that recipe subsequent occurs.

@jayaddison
Copy link
Member Author

This has been implemented, except for the suggestion to use HTTP response codes for recipe redirection; currently this is handled within the server. That decision may be worth revisiting later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests

1 participant