Asynchronous, multi-threaded web crawler based on the Vert.x framework.
Contains two main commands:
- crawl - for crawling the Web
- find - for multi-threaded, recursive searching for a word in a directory
Limitations:
- Crawls only HTML files
- Only UTF-8 encoding is supported (for searching too)
- The --resolveLinks option is experimental and works only with anchor links (<a> tags) that point to HTML files
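
As an illustration of what resolving a relative link to an absolute one involves, here is a minimal sketch that uses java.net.URI from the Java standard library. The LinkResolver class and its resolve method are hypothetical names chosen for this example; they are not taken from the crawler's source, which may handle links differently.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of the crawler.
final class LinkResolver {

    // Resolves an href found in a page against the URL of that page.
    // Absolute hrefs are returned unchanged; relative ones are resolved
    // and normalized (for example, "../" segments are collapsed).
    // Note: URI.create throws IllegalArgumentException for hrefs that are
    // not valid URIs (for example, ones containing unescaped spaces).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        System.out.println(resolve(page, "guide.html"));
        System.out.println(resolve(page, "../about.html"));
        System.out.println(resolve(page, "/contact.html"));
        System.out.println(resolve(page, "http://other.example/a.html"));
    }
}
```

A relative href such as guide.html only makes sense next to the page it came from, which is why a saved copy needs absolute links to stay clickable.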
To build the project, run:
mvn clean package
The file vertx-crawler-1.0-SNAPSHOT-fat.jar will be created in the target directory.
java -jar target/vertx-crawler-1.0-SNAPSHOT-fat.jar crawl [--conf=<config>] [--dir=<directory>] [--depth=<depth>]
    [--delay=<delay>] [--downloads=<downloads>]
    [--loaders=<loaders>] [--parsers=<parsers>]
    [--resolveLinks=<resolveLinks>] [--linksToFiles=<linksToFiles>]
    [--storeOriginals=<storeOriginals>] url
Options and Arguments:
--conf <config>                    Specifies the configuration passed to
                                   the verticle. <config> must reference
                                   either a text file containing a valid
                                   JSON object that represents the
                                   configuration or be a JSON string.
                                   There is a sample config file in the
                                   project root.
--dir <directory>                  Specifies the directory for downloaded
                                   files. Default is 'output'.
--depth <depth>                    Specifies how deep the crawler follows
                                   links. Default is 5.
--delay <delay>                    Specifies the delay between requests,
                                   in milliseconds. Default is 200.
--downloads <downloads>            Specifies how many simultaneous
                                   downloads one loader can start.
                                   Default is 10.
--loaders <loaders>                Specifies how many loader instances
                                   are deployed. Default is 1.
--parsers <parsers>                Specifies how many parser instances
                                   are deployed. Default is the number of
                                   available processors.
--resolveLinks <resolveLinks>      (Experimental) Specifies whether links
                                   with relative URLs are resolved to
                                   absolute ones (otherwise they will not
                                   be clickable). Default is true.
--linksToFiles <linksToFiles>      (Experimental) Specifies whether links
                                   in HTML documents are changed to point
                                   to the downloaded files. Takes effect
                                   only if --resolveLinks is true.
                                   Default is false.
--storeOriginals <storeOriginals>  Specifies whether the original HTML
                                   documents are kept after their links
                                   are updated. Default is false.
<url>                              URL of the web site to crawl.
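
To make the effect of --depth concrete, the following sketch walks a small in-memory link graph breadth-first and stops at a given depth. It rests on assumptions: the start URL is counted as depth 0, each page is visited at most once, and the DepthLimitedCrawl class with its crawl method is a hypothetical name. The real crawler fetches pages asynchronously and may count depth differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a depth limit; not the crawler's own code.
final class DepthLimitedCrawl {

    // Returns the pages visited, in visiting order. The start page is at
    // depth 0 (an assumption); pages deeper than maxDepth are never visited.
    static List<String> crawl(Map<String, List<String>> links, String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> current = new ArrayDeque<>();
        current.add(start);
        seen.add(start);
        for (int depth = 0; depth <= maxDepth && !current.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String page : current) {
                visited.add(page);
                // Queue every link target that has not been seen before.
                for (String target : links.getOrDefault(page, List.of())) {
                    if (seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            current = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        // A stand-in for a web site: page -> pages it links to.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/a1", "/"),
                "/a1", List.of("/a2"));
        System.out.println(crawl(site, "/", 1));
        System.out.println(crawl(site, "/", 2));
    }
}
```

Under these assumptions, a larger depth value reaches pages that are more clicks away from the start URL, so the number of downloads can grow quickly.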
java -jar target/vertx-crawler-1.0-SNAPSHOT-fat.jar find [--dir=<directory>] [--ext=<extension>]
    [--finders=<finders>] [--sensitive] [--whole] word
Options and Arguments:
--dir <directory>    Specifies the directory to search. Default is the
                     current directory.
--ext <extension>    Specifies the file extension to search. If present,
                     the word is searched for only in files with that
                     extension. Default is *, which means all
                     extensions.
--finders <finders>  Specifies how many finder instances are deployed.
                     Default is 2.
--sensitive          Makes the search case sensitive. Default is false.
--whole              Matches whole words only. Default is false.
<word>               The word to search for.
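
The --sensitive and --whole flags can be pictured as two switches on a text match. The sketch below shows one way to express them with java.util.regex.Pattern. It is an assumption about the semantics, not the tool's actual implementation, and the WordMatcher class with its compile and contains methods is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two search flags; not the tool's own code.
final class WordMatcher {

    // Builds a pattern for a literal word.
    // sensitive = false adds case-insensitive matching;
    // whole = true wraps the word in \b word boundaries, which assumes the
    // word starts and ends with a letter, digit, or underscore.
    static Pattern compile(String word, boolean sensitive, boolean whole) {
        String regex = Pattern.quote(word); // treat the word literally
        if (whole) {
            regex = "\\b" + regex + "\\b";
        }
        int flags = sensitive ? 0 : Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags);
    }

    // Returns true if the line contains the word under the given flags.
    static boolean contains(String line, String word, boolean sensitive, boolean whole) {
        return compile(word, sensitive, whole).matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "Vert.x Crawler can crawl the web";
        System.out.println(contains(line, "crawler", false, false));
        System.out.println(contains(line, "crawler", true, false));
        System.out.println(contains(line, "craw", false, true));
    }
}
```

Quoting the word with Pattern.quote keeps characters such as the dot in "Vert.x" from being interpreted as regular expression syntax.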