GitHub - mmamedov/favicon-finder: Domain Favicon Finder

Favicon Finder

Attempts to find the favicon for a given domain name, or for every domain listed in Alexa's top-ranked domains CSV file.

How to Use

The app runs from the command line. See the configuration options in the config/params.php file.

Single domain lookup

php app.php example.com

Top Alexa domains lookup

Unzip Alexa's top domains file into the input/ directory. The file is too large to load into memory completely, so each worker loads only the portion of the file it is responsible for.

The script below spawns 200 PHP processes in the background, each with 1000 domains to look up (400 processes with 500 domains each when $doubleWorkers is set to true), in order to get favicons for the Alexa top 200k ranked URLs file.
This task is light on CPU, and each process uses ~10MB of RAM.

Each process has its own log and output CSV files, stored in the output/worker_log and output/worker_csv directories respectively. Additionally, each worker saves its runtime stats in output/workers.log.

Please note that this might take several hours, depending on your machine and network speed. With a good connection, the processes usually complete in 1-2 hours.

php init_200k_worker.php 1
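The per-worker slicing described above can be sketched roughly as follows. This is an illustration only, not the project's actual worker code: the function names `workerLineRange` and `readDomainSlice` are hypothetical, and the input is assumed to be a CSV of `rank,domain` rows. The file is streamed line by line, so a worker never holds more than its own slice in memory.

```php
<?php
declare(strict_types=1);

/**
 * Returns the 0-based [start, end) line range handled by one worker.
 * Hypothetical helper; names are not taken from the project.
 */
function workerLineRange(int $workerIndex, int $domainsPerWorker): array
{
    $start = $workerIndex * $domainsPerWorker;
    return [$start, $start + $domainsPerWorker];
}

/**
 * Streams the CSV and collects only the domains on lines [start, end).
 * Assumes each row looks like "rank,domain".
 */
function readDomainSlice(string $csvPath, int $start, int $end): array
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$csvPath}");
    }

    $domains = [];
    $lineNo = 0;
    // Stop reading as soon as the end of this worker's slice is reached.
    while ($lineNo < $end && ($line = fgets($handle)) !== false) {
        if ($lineNo >= $start) {
            $parts = explode(',', trim($line), 2);
            if (isset($parts[1]) && $parts[1] !== '') {
                $domains[] = $parts[1];
            }
        }
        $lineNo++;
    }
    fclose($handle);

    return $domains;
}
```

With 200 workers of 1000 domains each, worker 0 would cover lines 0-999 and worker 199 lines 199000-199999, which together span the 200k rows.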

Check the number of PHP processes running. This should return a bit more than 200 (or 400, depending on configuration), because the count also includes the grep command itself and anything else in the output that matches php.

ps aux | grep -c php

Check the number of domains processed so far and saved in the CSV files:

wc -l output/worker_csv/*.csv

At the end, concatenate all CSV files into a single CSV file. This file should have 200k rows.

cat output/worker_csv/*.csv > all.csv

To include it in your application using Composer:

composer require mmamedov/favicon-finder

To run from the command line, clone this repo and run:

composer install -o

Prerequisites

PHP 7.3 or higher (7.3-specific cURL options are used, e.g. CURLINFO_SCHEME)
Latest cURL / SSL libraries

How it works

FaviconFinder uses Inspectors to look up favicons. Two Inspectors are currently implemented. They are called one after another: the next Inspector runs only if the previous one fails to find a result.
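The fallback order described above amounts to a simple chain. The sketch below is a minimal illustration under assumptions: the `Inspector` interface, its `inspect()` method, the `findFavicon()` function, and `StubInspector` are hypothetical names, and the project's real classes may look different.

```php
<?php
declare(strict_types=1);

// Hypothetical interface; the project's real Inspector API may differ.
interface Inspector
{
    /** Returns the favicon URL, or null when nothing was found. */
    public function inspect(string $domain): ?string;
}

/** Tries each inspector in order and stops at the first one that returns a result. */
function findFavicon(string $domain, Inspector ...$inspectors): ?string
{
    foreach ($inspectors as $inspector) {
        $url = $inspector->inspect($domain);
        if ($url !== null) {
            return $url; // later inspectors are never called
        }
    }
    return null;
}

/** Stub standing in for a real inspector; it returns a fixed result and counts calls. */
final class StubInspector implements Inspector
{
    /** @var string|null */
    private $result;
    /** @var int */
    public $calls = 0;

    public function __construct(?string $result)
    {
        $this->result = $result;
    }

    public function inspect(string $domain): ?string
    {
        $this->calls++;
        return $this->result;
    }
}
```

Ordering the cheaper check first means the more expensive one runs only for the domains where the cheap check found nothing.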

`HeadersInspector`

Inspects HTTP response headers and follows redirects given in the Location header as necessary. It starts with https://<domain>/favicon.ico, as this is the most likely location.
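A header-only lookup of this kind might look like the sketch below. It is not the project's implementation: `curlHeadFetcher` and `findFaviconByHeaders` are hypothetical names, the redirect limit of 5 and the 10-second timeout are arbitrary choices, and treating any 200 response as a hit is a simplification. The fetcher is passed in as a callable so the redirect logic can be exercised without network access.

```php
<?php
declare(strict_types=1);

/**
 * Sends a HEAD request with cURL and returns [status code, redirect target or null].
 * Redirects are not followed automatically, so the caller sees every hop.
 */
function curlHeadFetcher(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // headers only
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // redirects are handled manually
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    // With FOLLOWLOCATION off, cURL reports the absolute URL of the next hop here.
    $redirect = curl_getinfo($ch, CURLINFO_REDIRECT_URL) ?: null;

    return [$status, $redirect];
}

/**
 * Starts at https://<domain>/favicon.ico and follows redirects until a 200
 * response is found, an error status is returned, or $maxRedirects is exceeded.
 */
function findFaviconByHeaders(string $domain, callable $fetch, int $maxRedirects = 5): ?string
{
    $url = "https://{$domain}/favicon.ico";
    for ($hop = 0; $hop <= $maxRedirects; $hop++) {
        [$status, $redirect] = $fetch($url);
        if ($status === 200) {
            return $url;
        }
        if ($status >= 300 && $status < 400 && $redirect !== null) {
            $url = $redirect;
            continue;
        }
        return null; // 4xx/5xx, or a redirect without a target
    }
    return null; // too many redirects
}
```

A real lookup would be called as `findFaviconByHeaders('example.com', 'curlHeadFetcher')`, which requires the cURL extension.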

`HtmlInspector`

Gets the HTML from https://<domain> and looks for variations of the <link rel> tag that give the favicon location. It uses the cURL library and follows redirects to reach the actual page URL.
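The tag matching can be illustrated with the sketch below, which works on HTML that has already been downloaded. It is an assumption-laden example rather than the project's code: `extractFaviconHref` and `attributeValue` are hypothetical names, and it uses regular expressions to stay dependency-free, whereas a DOM parser such as PHP's DOMDocument would cope better with unusual markup. A `rel` value counts as a favicon link when it contains the token `icon`, which covers both `rel="icon"` and `rel="shortcut icon"`.

```php
<?php
declare(strict_types=1);

/** Returns the value of one attribute inside a single tag, or null if it is absent. */
function attributeValue(string $tag, string $name): ?string
{
    // Accepts double-quoted, single-quoted and unquoted attribute values.
    $pattern = '/\s' . preg_quote($name, '/') . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    if (!preg_match($pattern, $tag, $m)) {
        return null;
    }
    // Only one alternative matches; its capture group is the last array element.
    return $m[count($m) - 1];
}

/** Returns the href of the first <link> whose rel contains the token "icon", or null. */
function extractFaviconHref(string $html): ?string
{
    if (!preg_match_all('/<link\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        $rel = attributeValue($tag, 'rel');
        $href = attributeValue($tag, 'href');
        if ($rel === null || $href === null || $href === '') {
            continue;
        }
        $tokens = preg_split('/\s+/', strtolower(trim($rel)));
        if (in_array('icon', $tokens, true)) {
            return $href;
        }
    }
    return null;
}
```

The returned `href` may be relative (for example `/static/fav.png`), so it would still need to be resolved against the final page URL reached after redirects.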

