When a page is crawled, some data is extracted. Sometimes the complete data for a given item is split across several pages. It may be necessary to store some state when crawling a page so that it is available when a "child" page is crawled.
For example, the page /author is crawled and the author's information is saved in a DB, with an ID. The URL /author/book1 is then enqueued, but if that page is crawled statelessly, it has no way to link its information back to the previously crawled author. (It could find the author's name in the book page, but let's pretend it's not there; even if it were, there may be many authors with the same name, or there might be a typo, etc.)
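To make the author/book scenario concrete, here is a minimal sketch of one possible approach: each enqueued URL carries an opaque state value set by the page that enqueued it. Everything in it (`Item`, `State`, `Book`, `crawl`, and the in-memory `links` map standing in for real fetching and link extraction) is hypothetical and for illustration only; none of it is gocrawl's API.

```go
package main

import "fmt"

// Item pairs a URL with state handed down by the page that enqueued it.
// Hypothetical type for illustration; not part of gocrawl.
type Item struct {
	URL   string
	State interface{} // nil for seed URLs in this sketch
}

// Book is the record "extracted" from a book page.
type Book struct {
	URL      string
	AuthorID int
}

// crawl walks a fake site described by links (page URL -> child URLs).
// A page without state is treated as an author page; a page whose state
// is an int is treated as a book page belonging to that author ID.
func crawl(seeds []Item, links map[string][]string) []Book {
	var books []Book
	nextAuthorID := 1 // stands in for the ID a DB insert would return
	queue := append([]Item(nil), seeds...)
	for len(queue) > 0 {
		it := queue[0]
		queue = queue[1:]
		if authorID, ok := it.State.(int); ok {
			// Child page: the state links the book back to its author.
			books = append(books, Book{URL: it.URL, AuthorID: authorID})
			continue
		}
		// Author page: "save" the author, then pass the ID to each child.
		id := nextAuthorID
		nextAuthorID++
		for _, child := range links[it.URL] {
			queue = append(queue, Item{URL: child, State: id})
		}
	}
	return books
}

func main() {
	links := map[string][]string{
		"/author": {"/author/book1", "/author/book2"},
	}
	for _, b := range crawl([]Item{{URL: "/author"}}, links) {
		fmt.Printf("%s -> author %d\n", b.URL, b.AuthorID)
	}
}
```

In this sketch the seed URL simply has a nil state, which is one possible answer to the seed question; allowing seeds to carry state would only mean filling in `State` when building the seed list.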
Not sure yet whether this should be managed by gocrawl. Should seed URLs also be allowed to have state? How much of a pain would it be to implement, and how much would it complicate the API?