Unable to resolve onion hostnames when using http proxy #443

Open
codsane opened this issue Dec 3, 2019 · 3 comments

codsane commented Dec 3, 2019

grab-site caught my eye, so I've started moving my onion archive project from a collection of wget scripts over to it. wpull looks like exactly the replacement I was looking for, but I'm having a hard time getting it to work with my proxies and resolve onion hostnames.

Thanks to a wonderful project called multitor, I have set up an HTTP proxy that acts as a gateway for Tor connections. This lets me simply set http_proxy and https_proxy, run my wget scripts against .onion URLs, and archive them like any other website.

What I expect: To set my HTTP proxies with the --http-proxy and --https-proxy options, run my wpull scripts against .onion URLs, and archive them like any other website.

What happened: When I ran wpull against a .onion URL with the HTTP proxy options set, wpull was unable to complete any of the requests.

Wpull command:

wpull "https://3g2upl4pq6kufc4m.onion" \
    --warc-file wpull_test \
    --no-check-certificate \
    --no-robots --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --escaped-fragment --strip-session-id \
    --sitemaps \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --delete-after --database wpull-test.db \
    --output-file wpull_test.log \
    --http-proxy="0.0.0.0:16379" --https-proxy="0.0.0.0:16379"

OS: Tested in two environments: a machine running Debian 9, and the grab-site Docker container.

Python version: Python 3.6.4 on Debian 9, Python 3.7.3 within the grab-site container.

Wpull version: v2.0.3 on Debian 9; v3.0.7 appears to be the version of wpull inside grab-site (originally reported here as v2.1.15 by mistake).
In one instance, I reverted to v1.2.3, as it seemed v2.0's network stack had "various other problems that did not exist in 1.2.3" (issue #406).

Log/Output:

codsane@server:~/wpull$ ./wpull_test.sh 
Link to pillage: http://3g2upl4pq6kufc4m.onion
INFO Fetching ‘http://3g2upl4pq6kufc4m.onion/’.
ERROR Fetching ‘http://3g2upl4pq6kufc4m.onion/’ encountered an error: Connect network error: 
/usr/local/lib/python3.6/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
Author

codsane commented Dec 3, 2019

As a possible temporary workaround, I've finally been able to get a proxychains-ng configuration working on my system. Under proxychains, wpull seems to behave properly with onion URLs.

Since my original goal was to get my onion archive working under grab-site, I may check whether this also lets onion links work there. Proxies seem to be broken in grab-site anyway, so proxychains may be a usable temporary workaround there as well.

codsane@server:~/wpull$ proxychains4 ./wpull_test.sh 
[proxychains] config file found: /etc/proxychains.conf
[proxychains] preloading /usr/lib/libproxychains4.so
[proxychains] DLL init: proxychains-ng 4.14
Link to pillage: https://3g2upl4pq6kufc4m.onion
[proxychains] DLL init: proxychains-ng 4.14
[proxychains] DLL init: proxychains-ng 4.14
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/’.
[proxychains] Strict chain  ...  0.0.0.0:16379  ...  3g2upl4pq6kufc4m.onion:443  ...  OK
  100.0% [=========================] 6.0 KiB 0:00:03 1.3 KiB/s
INFO Fetched ‘https://3g2upl4pq6kufc4m.onion/’: 200 OK. Length: 6174 [text/html; charset=UTF-8].
/usr/local/lib/python3.6/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/robots.txt’.
  100.0% [=========================] 26.0 B 0:00:04 6.2 B/s
INFO Fetched ‘https://3g2upl4pq6kufc4m.onion/robots.txt’: 200 OK. Length: 26 [text/plain; charset=UTF-8].
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/sitemap.xml’.
  100.0% [=========================] 2.5 KiB 0:00:05 509.9 B/s
INFO Fetched ‘https://3g2upl4pq6kufc4m.onion/sitemap.xml’: 200 OK. Length: 2553 [text/xml; charset=UTF-8].
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/assets/icons/meta/DDG-iOS-icon_152x152.png’.
  100.0% [=========================] 2.0 KiB 0:00:06 328.8 B/s
INFO Fetched ‘https://3g2upl4pq6kufc4m.onion/assets/icons/meta/DDG-iOS-icon_152x152.png’: 200 OK. Length: 2034 [image/png].
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/util/u414.js’.
  100.0% [=========================] 75.6 KiB 0:00:07 6.9 KiB/s
INFO Fetched ‘https://3g2upl4pq6kufc4m.onion/util/u414.js’: 200 OK. Length: 77421 [application/x-javascript].
INFO Fetching ‘https://3g2upl4pq6kufc4m.onion/o1838.css’.
  100.0% [=========================] 18.5 KiB 0:00:08 459.2 B/s
...

Contributor

JustAnotherArchivist commented Dec 4, 2019

grab-site currently uses the ludios_wpull fork, not this repo. I have no idea what version "2.1.15" is supposed to be; ludios_wpull is using version numbers of "3.0.x".

"Connect network error:" is not exactly overflowing with details (there's supposed to be a more detailed error message after that colon). Does wpull print anything useful with --debug?

Author

codsane commented Dec 4, 2019

Ahh yeah, that must've been a mix-up, sorry. I ran a multitude of tests yesterday, so it is very possible I wasn't actually inside the container when I ran wpull --version. I've corrected the original post: 3.0.7 is the correct version, and it is the version the tests below were run on.

Not realizing there was a --debug option was a bit of an oversight, I'll admit. I reran my tests with --debug on.

Using built-in proxies:

wpull "https://3g2upl4pq6kufc4m.onion" \
    --warc-file http-proxy_onion_test \
    --no-robots --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0" \
    --wait 0.5 --random-wait --waitretry 60 \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --output-file http-proxy_onion_test.log \
    --http-proxy="172.17.0.1:16379" --https-proxy="172.17.0.1:16379" \
    --debug

Output: http-proxy_onion_test.log

Using proxychains:

proxychains4 wpull "https://3g2upl4pq6kufc4m.onion" \
    --warc-file proxychains_onion_test \
    --no-robots --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0" \
    --wait 0.5 --random-wait --waitretry 60 \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --output-file proxychains_onion_test.log \
    --debug

Output: proxychains_onion_test.log

Extra test (for sanity), using built-in proxies against a clearnet site:

wpull "https://duckduckgo.com" \
    --warc-file http-proxy_clearnet_test \
    --no-robots --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0" \
    --wait 0.5 --random-wait --waitretry 60 \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --output-file http-proxy_clearnet_test.log \
    --http-proxy="172.17.0.1:16379" --https-proxy="172.17.0.1:16379" \
    --debug

Output: http-proxy_clearnet_test.log
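
For anyone who wants to repeat the three runs above without copy-pasting, here is a rough sketch that builds the same command lines from one place. The helper name `build_test_command` and its parameters are my own invention for illustration; the wpull options, URLs, and proxy address are the ones used in the commands above. It only prints the commands, it does not run them.

```python
import shlex

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"


def build_test_command(url, name, proxy=None, use_proxychains=False):
    """Return the argv list for one test run (hypothetical helper, not part of wpull)."""
    # Optionally wrap the whole command in proxychains4 instead of using wpull's proxy options.
    argv = ["proxychains4"] if use_proxychains else []
    argv += [
        "wpull", url,
        "--warc-file", name,
        "--no-robots", "--user-agent", USER_AGENT,
        "--wait", "0.5", "--random-wait", "--waitretry", "60",
        "--tries", "3", "--retry-connrefused", "--retry-dns-error",
        "--timeout", "60", "--session-timeout", "21600",
        "--output-file", f"{name}.log",
    ]
    # Built-in proxy options are only added when a proxy address is given.
    if proxy is not None:
        argv += [f"--http-proxy={proxy}", f"--https-proxy={proxy}"]
    argv.append("--debug")
    return argv


if __name__ == "__main__":
    onion = "https://3g2upl4pq6kufc4m.onion"
    proxy = "172.17.0.1:16379"
    runs = [
        build_test_command(onion, "http-proxy_onion_test", proxy=proxy),
        build_test_command(onion, "proxychains_onion_test", use_proxychains=True),
        build_test_command("https://duckduckgo.com", "http-proxy_clearnet_test", proxy=proxy),
    ]
    for argv in runs:
        # shlex.join quotes arguments such as the user agent for safe pasting into a shell.
        print(shlex.join(argv))
```

Keeping the shared options in one list makes it easier to be sure that the only difference between the runs is how the proxy is applied.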


I did notice that the DNS lookup addresses reported by wpull differ between the tests.

As can be seen in proxychains_onion_test.log, line 282 reports the lookup address as the hostname provided to wpull, 3g2upl4pq6kufc4m.onion.

However, when passing HTTP proxies to wpull, line 285 shows the lookup address as the IP of the proxy I passed, in this case 172.17.0.1. The same behavior can be seen in the clearnet test log as well.

After taking a look at the source of the generic "Network error" we've been referring to, I wonder whether this is just a DNS issue. Proxychains proxies DNS requests by default, so maybe that is why requests seem to be getting through. Perhaps wpull's proxy implementation needs to be amended to proxy DNS requests as well?
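
For reference, here is a minimal sketch of what a client would typically send to an HTTP proxy to open a tunnel to an HTTPS target, assuming a standard CONNECT request. `build_connect_request` is a hypothetical helper for illustration, not wpull's actual code. In this scheme the target hostname travels to the proxy as plain text and is never resolved locally, so the only address the client itself needs to look up is the proxy's.

```python
def build_connect_request(target_host: str, target_port: int = 443) -> bytes:
    """Build an HTTP CONNECT request for a proxy tunnel.

    Hypothetical helper for illustration only; not taken from wpull.
    The hostname is passed through as-is, with no local DNS lookup.
    """
    authority = f"{target_host}:{target_port}"
    lines = [
        f"CONNECT {authority} HTTP/1.1",
        f"Host: {authority}",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    # The bytes below would be written to the socket connected to the proxy
    # (172.17.0.1:16379 in the tests above), not to the onion host itself.
    request = build_connect_request("3g2upl4pq6kufc4m.onion")
    print(request.decode("ascii"))
```

If wpull follows this pattern, a lookup of the proxy IP in the log would be expected on its own, and the open question is whether the .onion name is also being resolved locally somewhere before the request reaches the proxy.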
