As it seems, at least as of 2017, one of the scrapers (epicurious) did not discard certain URLs. This could be either an acceptable weakness resulting from a design decision, or a missing feature in the design.
Actual problem: some parsed Epicurious recipes do not contain the element "ingredients". It is simply not there. Possible reasons:
- A URL formally qualifies as a recipe by template, but actually is not one (extreme examples: untitled pages, or /recipes/food/views/reserve-this-recipe-id-for-future-use-51234840).
- Some other reason.
I understand this is a problem with data outliers rather than with the scrapers, so perhaps there is a need to clarify how much scraping intelligence and data-model sensitivity is required at this level, and if none, how best to implement/integrate it, since this seems to be a generally relevant use case in this context (i.e. one outlier is a marginal problem, but across many large datasets this can add up to a bigger data-quality issue).
The issue, of course, needs to be validated against an up-to-date version.
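For the "URL qualifies by template but is not a recipe" case, one pre-filter idea is to recognize placeholder pages by their slug before scraping. This is only a sketch based on the single example URL above; the pattern and function name are assumptions, not part of recipe-scrapers:

```python
import re

# Hypothetical pre-filter: the slug pattern is taken from the one example
# placeholder URL mentioned in this issue, and may not generalize.
PLACEHOLDER = re.compile(r"reserve-this-recipe-id", re.IGNORECASE)

def looks_like_placeholder(url: str) -> bool:
    """Return True if the URL slug suggests a reserved/placeholder page."""
    return bool(PLACEHOLDER.search(url))

print(looks_like_placeholder(
    "/recipes/food/views/reserve-this-recipe-id-for-future-use-51234840"))  # True
print(looks_like_placeholder(
    "/recipes/food/views/some-real-recipe-12345"))  # False
```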
URL of recipes producing recipe data without ingredients:
The recipes given indeed do not have ingredients listed on the site. The scrapers are functioning as intended, so I'll close the issue. Feel free to reopen if I've missed your point 🙂
This package is intended to be a super simple tool handling the operation of parsing the HTML. If no data is found in the HTML, the scrapers won't assume anything. They will return default values and that's it.
Depending on your use case and aim, you can:
- omit saving/analyzing the recipes with missing data
- try to implement a clever mechanism that fills in the missing information based on what's at hand
However, that decision is beyond the package's responsibilities.
One can ask for advice on how to normalize recipe data, speed up scraping, elude bot-protection mechanisms, and whatever else comes up when building a scraping-related project, but these things are not recipe-scrapers' job.
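The first option above (omit recipes with missing data) can be sketched as a simple post-processing filter on the scraped results. The dict keys ("title", "ingredients") are illustrative assumptions about how you store the scraper output, not part of the package's API:

```python
# Sketch of option 1: keep only recipes whose ingredients list is non-empty,
# instead of expecting the scraper itself to reject such pages.

def filter_complete(recipes):
    """Drop recipe records whose 'ingredients' field is missing or empty."""
    return [r for r in recipes if r.get("ingredients")]

scraped = [
    {"title": "Pasta", "ingredients": ["spaghetti", "olive oil"]},
    {"title": "Reserved placeholder", "ingredients": []},  # outlier page
]

complete = filter_complete(scraped)
print([r["title"] for r in complete])  # only "Pasta" survives
```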