URL Crawler

Context

This is a simple CLI application that crawls a single URL / domain, extracts all links and stores only the non-external ones. It is written in Rust 1.75.0 and uses the Tokio async runtime for asynchronicity/concurrency.

Please note: making this project production ready (handling all the edge cases, confirming scope, etc.) would take more time. I am aware of many trade-offs and possible future improvements, and have documented these below.

Instructions

cargo has to be installed in order to run the application - https://doc.rust-lang.org/cargo/getting-started/installation.html

To run the app from the terminal, cd into the root of the project and then run - cargo run -- --url <url_to_crawl>.

To run tests - cargo test

Additional CLI options can be provided (a combined example follows this list):

  • --workers_n <number_of_workers_to_create> (defaults to 1)
  • --delay <delay_in_seconds> (to delay requests to the host, defaults to 2)
  • --print <bool> (whether the data store should be printed at the end of the crawl, defaults to false)
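For example, a crawl with four workers, a one-second delay and the final data store printed could be started like this (assuming boolean flags take an explicit value):

cargo run -- --url https://example.com --workers_n 4 --delay 1 --print true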

Components

  • URL Frontier - a very simple implementation of a component that manages the URLs to be crawled. It makes use of crossbeam's SegQueue, which is a thread-safe queue (see the sketch after this list).
  • Data store - a simple in-memory data store that uses a HashMap to track downloaded and visited URLs
  • Link - link/URL builder and filter
  • Fetch - HTTP client abstraction
  • Parser - content parser and link extractor
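
A minimal sketch of what the first two components could look like, assuming crossbeam 0.8 (where SegQueue::pop returns an Option) and a HashMap keyed by URL. The names and methods here are illustrative, not the project's actual API:

```rust
use std::collections::HashMap;
use crossbeam::queue::SegQueue;

/// Illustrative URL Frontier: a thread-safe FIFO of URLs waiting to be crawled.
pub struct UrlFrontier {
    queue: SegQueue<String>,
}

impl UrlFrontier {
    pub fn new() -> Self {
        Self { queue: SegQueue::new() }
    }

    /// Enqueue a URL for a later crawl.
    pub fn push(&self, url: String) {
        self.queue.push(url);
    }

    /// Dequeue the next URL, or None if the frontier is empty.
    pub fn pop(&self) -> Option<String> {
        self.queue.pop()
    }
}

/// Illustrative in-memory data store: tracks which URLs have been seen and visited.
#[derive(Default)]
pub struct DataStore {
    urls: HashMap<String, bool>, // URL -> has it been visited yet?
}

impl DataStore {
    /// Record a URL without marking it as visited.
    pub fn add(&mut self, url: &str) {
        self.urls.entry(url.to_owned()).or_insert(false);
    }

    /// Mark a URL as visited (inserting it if necessary).
    pub fn mark_visited(&mut self, url: &str) {
        self.urls.insert(url.to_owned(), true);
    }

    pub fn contains(&self, url: &str) -> bool {
        self.urls.contains_key(url)
    }

    pub fn is_visited(&self, url: &str) -> bool {
        self.urls.get(url).copied().unwrap_or(false)
    }
}
```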

Basic flow

  1. The URL Frontier gets a seed URL
  2. N tasks (green threads) get created
  3. Each task gets a pointer to
    • the URL Frontier, to populate it with new URLs
    • the Data store, to track visited and downloaded URLs
  4. The URL Frontier pops a URL, which is checked for visited status
  5. Data from the URL gets downloaded
  6. The URL gets marked as visiting in the data store
  7. Content gets parsed and links extracted
  8. URLs/links get filtered based on the initial / seed URL
  9. Each new URL gets updated / added in the data store (if not already in it)
  10. Each new, unvisited URL gets added to the URL Frontier to be crawled later (a simplified sketch of this loop follows)
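
The following is a simplified, hypothetical sketch of a single worker task's loop, building on the UrlFrontier / DataStore sketch from the Components section; fetch and extract_links are stand-ins for the Fetch and Parser/Link components, not the project's actual API:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

// `UrlFrontier` and `DataStore` are the illustrative types sketched in the
// Components section; `fetch` and `extract_links` are hypothetical stand-ins
// for the Fetch and Parser/Link components.
async fn worker(
    frontier: Arc<UrlFrontier>,
    store: Arc<Mutex<DataStore>>,
    delay_secs: u64,
) {
    // Keep pulling URLs until the frontier is empty (step 4).
    while let Some(url) = frontier.pop() {
        if store.lock().await.is_visited(&url) {
            continue; // already crawled, skip it
        }

        // Politeness delay between requests to the host.
        tokio::time::sleep(Duration::from_secs(delay_secs)).await;

        // Download the page (step 5); skip the URL if the request fails.
        let Ok(body) = fetch(&url).await else { continue };

        // Mark the URL as visited/visiting (step 6).
        store.lock().await.mark_visited(&url);

        // Parse the content and extract/filter links (steps 7-8).
        for link in extract_links(&body, &url) {
            let mut store = store.lock().await;
            if !store.contains(&link) {
                store.add(&link);    // step 9: record the new URL
                frontier.push(link); // step 10: queue it for a later crawl
            }
        }
    }
}
```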

Potential future improvements / trade-offs (in no particular order)

  • User input validation and initial URL validation
  • Instead of using a simple HashMap as the data store, use a combination of in-memory and on-disk databases (and Docker Compose to bring all the components up)
  • Store webpage content and compare it in future crawls to avoid fetching stale data/pages
  • For JS-only sites a different technique is needed, e.g. a WebDriver
  • Store the date/time when a URL was visited and use it to decide whether the page is potentially stale
  • Instead of storing URLs, store URL checksums to save space (see the sketch after this list)
  • In order to expand crawling to other domains, separate queues can be introduced inside the URL Frontier component
  • If a request to a URL fails, a retry mechanism can be implemented (or the URL can be re-enqueued and tried again)
  • A graph data structure (instead of a queue) that links all pages can be introduced in order to construct a sitemap
  • The politeness factor / delay could also be a float type (to work at the millisecond level)
  • Check for "https://" when reading CLI arguments and, if it is missing, prepend it
  • Add test coverage tooling setup
  • A data collector instead of printing info at the end of the crawl
  • Differentiate between different link types (images, text, etc.)
  • Mock dependencies and check the number of calls to each method, and add a couple of separate integration tests for execute()
  • Handling of fragment URLs (e.g. /about#section1)
  • Better handling of domain-specific politeness / delay
  • Provide an abstraction to write output to a file
  • Environment-dependent logger
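
As a rough illustration of the checksum idea above, URL strings could be replaced by 64-bit hashes using the standard library's DefaultHasher (a sketch only; a real implementation would need to weigh the risk of hash collisions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Reduce a URL to a 64-bit checksum so the data store holds u64s
/// instead of full strings. Note: 64-bit hashes can collide, so this
/// trades exactness for memory.
fn url_checksum(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let mut seen: HashSet<u64> = HashSet::new();

    // First time the URL is seen, its checksum gets inserted.
    assert!(seen.insert(url_checksum("https://example.com/about")));
    // The same URL hashes to the same value, so it is recognised as seen.
    assert!(!seen.insert(url_checksum("https://example.com/about")));
}
```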
