
Concurrent GET requests lead to ClientConnectorError(8, 'nodename nor servname provided, or not known') #3549

Closed
bsolomon1124 opened this issue Jan 16, 2019 · 13 comments

@bsolomon1124

bsolomon1124 commented Jan 16, 2019

Long story short

I am stumped by a problem, seemingly related to asyncio + aiohttp, where sending a large number of concurrent GET requests causes over 85% of the requests to raise an aiohttp.client_exceptions.ClientConnectorError exception that ultimately stems from

socket.gaierror(8, 'nodename nor servname provided, or not known')

while sending single GET requests or doing the underlying DNS resolution on the host/port does not raise this exception.

(But hey, at least we know that #2423 is working 😉 .)

Expected behaviour

Successful DNS resolution.

Actual behaviour

Currently, 21205 of 24934 input URLs fail resolution, raising from the aiohttp.TCPConnector._resolve_host() coroutine.

Steps to reproduce

While my real code does a good amount of customization, such as using a custom TCPConnector instance, I can reproduce the issue using just the "default" aiohttp class instances & arguments, exactly as below.

I've followed the traceback and the root of the exception is related to DNS resolution. It comes from the _create_direct_connection method of aiohttp.TCPConnector, which calls ._resolve_host().

I have also tried:

  • Using (and not using) aiodns
  • sudo killall -HUP mDNSResponder
  • Using family=socket.AF_INET as an argument to TCPConnector (though I am fairly sure this is used by aiodns anyway)
  • With ssl=True and ssl=False

All to no avail.
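
For concreteness, the connector variants I tried looked roughly like this (a sketch, not my exact code):

import socket

import aiohttp
from aiohttp.resolver import AsyncResolver

conn = aiohttp.TCPConnector(
    family=socket.AF_INET,     # force IPv4
    resolver=AsyncResolver(),  # aiodns-backed; omit to use the default resolver
    ssl=False,
)
session = aiohttp.ClientSession(connector=conn)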


Full code to reproduce is below. The input URLs are at https://gist.github.com/bsolomon1124/fc625b624dd26ad9b5c39ccb9e230f5a.

import asyncio
import itertools

import aiohttp
import aiohttp.client_exceptions

from yarl import URL

ua = itertools.cycle(
    (
        "Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:62.0) Gecko/20100101 Firefox/62.0",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; ko; rv:1.9.1b2) Gecko/20081201 Firefox/60.0",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    )
)

async def get(url, session) -> str:
    async with session.request(
        "GET",
        url=url,
        raise_for_status=True,
        headers={'User-Agent': next(ua)},
        ssl=False
    ) as resp:
        text = await resp.text(encoding="utf-8", errors="replace")
        print("Got text for URL", url)
        return text

async def bulk_get(urls) -> list:
    async with aiohttp.ClientSession() as session:
        htmls = await asyncio.gather(
            *(
                get(url=url, session=session)
                for url in urls
            ),
            return_exceptions=True
        )
        return htmls


# See https://gist.github.com/bsolomon1124/fc625b624dd26ad9b5c39ccb9e230f5a
with open("/path/to/urls.txt") as f:
    urls = tuple(URL(i.strip()) for i in f)

res = asyncio.run(bulk_get(urls))  # urls: Tuple[yarl.URL]

c = 0
for i in res:
    if isinstance(i, aiohttp.client_exceptions.ClientConnectorError):
        print(i)
        c += 1

print(c)  # 21205 !!!!! (85% failure rate)
print(len(urls))  # 24934

Printing each exception from res produces strings like:

Cannot connect to host sigmainvestments.com:80 ssl:False [nodename nor servname provided, or not known]
Cannot connect to host giaoducthoidai.vn:443 ssl:False [nodename nor servname provided, or not known]
Cannot connect to host chauxuannguyen.org:80 ssl:False [nodename nor servname provided, or not known]
Cannot connect to host www.baohomnay.com:443 ssl:False [nodename nor servname provided, or not known]
Cannot connect to host www.soundofhope.org:80 ssl:False [nodename nor servname provided, or not known]
# And so on...

What's frustrating is that I can ping these hosts with no problem and even call the underlying ._resolve_host():

Bash/shell:

 [~/] $ ping -c 5 www.hongkongfp.com
PING www.hongkongfp.com (104.20.232.8): 56 data bytes
64 bytes from 104.20.232.8: icmp_seq=0 ttl=56 time=11.667 ms
64 bytes from 104.20.232.8: icmp_seq=1 ttl=56 time=12.169 ms
64 bytes from 104.20.232.8: icmp_seq=2 ttl=56 time=12.135 ms
64 bytes from 104.20.232.8: icmp_seq=3 ttl=56 time=12.235 ms
64 bytes from 104.20.232.8: icmp_seq=4 ttl=56 time=14.252 ms

--- www.hongkongfp.com ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.667/12.492/14.252/0.903 ms

Python:

In [1]: import asyncio 
   ...: from aiohttp.connector import TCPConnector 
   ...: from clipslabapp.ratemgr import default_aiohttp_tcpconnector 
   ...:  
   ...:  
   ...: async def main(): 
   ...:     conn = default_aiohttp_tcpconnector() 
   ...:     i = await asyncio.create_task(conn._resolve_host(host='www.hongkongfp.com', port=443)) 
   ...:     return i 
   ...:  
   ...: i = asyncio.run(main())                                                                                                                               

In [2]: i                                                                                                                                                     
Out[2]: 
[{'hostname': 'www.hongkongfp.com',
  'host': '104.20.232.8',
  'port': 443,
  'family': <AddressFamily.AF_INET: 2>,
  'proto': 6,
  'flags': <AddressInfo.AI_NUMERICHOST: 4>},
 {'hostname': 'www.hongkongfp.com',
  'host': '104.20.233.8',
  'port': 443,
  'family': <AddressFamily.AF_INET: 2>,
  'proto': 6,
  'flags': <AddressInfo.AI_NUMERICHOST: 4>}]

Information on the exception itself:

The exception is aiohttp.client_exceptions.ClientConnectorError, which wraps socket.gaierror as the underlying OSError.

Since I have return_exceptions=True in asyncio.gather(), I can get the exception instances themselves for inspection. Here is one example:

In [18]: i
Out[18]:
aiohttp.client_exceptions.ClientConnectorError(8,
                                               'nodename nor servname provided, or not known')

In [19]: i.host, i.port
Out[19]: ('www.hongkongfp.com', 443)

In [20]: i._conn_key
Out[20]: ConnectionKey(host='www.hongkongfp.com', port=443, is_ssl=True, ssl=False, proxy=None, proxy_auth=None, proxy_headers_hash=None)

In [21]: i._os_error
Out[21]: socket.gaierror(8, 'nodename nor servname provided, or not known')

In [22]: raise i.with_traceback(i.__traceback__)
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/connector.py in _create_direct_connection(self, req, traces, timeout, client_error)
    954                 port,
--> 955                 traces=traces), loop=self._loop)
    956         except OSError as exc:

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/connector.py in _resolve_host(self, host, port, traces)
    824                 addrs = await \
--> 825                     self._resolver.resolve(host, port, family=self._family)
    826                 if traces:

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/resolver.py in resolve(self, host, port, family)
     29         infos = await self._loop.getaddrinfo(
---> 30             host, port, type=socket.SOCK_STREAM, family=family)
     31

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py in getaddrinfo(self, host, port, family, type, proto, flags)
    772         return await self.run_in_executor(
--> 773             None, getaddr_func, host, port, family, type, proto, flags)
    774

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:

/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    747     addrlist = []
--> 748     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    749         af, socktype, proto, canonname, sa = res

gaierror: [Errno 8] nodename nor servname provided, or not known

The above exception was the direct cause of the following exception:

ClientConnectorError                      Traceback (most recent call last)
<ipython-input-22-72402d8c3b31> in <module>
----> 1 raise i.with_traceback(i.__traceback__)

<ipython-input-1-2bc0f5172de7> in get(url, session)
     19         raise_for_status=True,
     20         headers={'User-Agent': next(ua)},
---> 21         ssl=False
     22     ) as resp:
     23         return await resp.text(encoding="utf-8", errors="replace")

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/client.py in _request(self, method, str_or_url, params, data, json, cookies, headers, skip_auto_headers, auth, allow_redirects, max_redirects, compress, chunked, expect100, raise_for_status, read_until_eof, proxy, proxy_auth, timeout, verify_ssl, fingerprint, ssl_context, ssl, proxy_headers, trace_request_ctx)
    474                                 req,
    475                                 traces=traces,
--> 476                                 timeout=real_timeout
    477                             )
    478                     except asyncio.TimeoutError as exc:

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/connector.py in connect(self, req, traces, timeout)
    520
    521             try:
--> 522                 proto = await self._create_connection(req, traces, timeout)
    523                 if self._closed:
    524                     proto.close()

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/connector.py in _create_connection(self, req, traces, timeout)
    852         else:
    853             _, proto = await self._create_direct_connection(
--> 854                 req, traces, timeout)
    855
    856         return proto

~/Scripts/python/projects/clab/lib/python3.7/site-packages/aiohttp/connector.py in _create_direct_connection(self, req, traces, timeout, client_error)
    957             # in case of proxy it is not ClientProxyConnectionError
    958             # it is problem of resolving proxy ip itself
--> 959             raise ClientConnectorError(req.connection_key, exc) from exc
    960
    961         last_exc = None  # type: Optional[Exception]

ClientConnectorError: Cannot connect to host www.hongkongfp.com:443 ssl:False [nodename nor servname provided, or not known]

Your environment

  • Python 3.7.1
  • aiohttp client 3.5.4
  • Occurs on Mac OSX High Sierra and Ubuntu 18.04

DNS info

Why do I not think this is a problem with DNS resolution at the OS level itself?

I can successfully ping the IP address of my ISP’s DNS Servers, which are given in (Mac OSX) System Preferences > Network > DNS:

 [~/] $ ping -c 2 75.75.75.75
PING 75.75.75.75 (75.75.75.75): 56 data bytes
64 bytes from 75.75.75.75: icmp_seq=0 ttl=57 time=16.478 ms
64 bytes from 75.75.75.75: icmp_seq=1 ttl=57 time=21.042 ms

--- 75.75.75.75 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 16.478/18.760/21.042/2.282 ms
 [~/] $ ping -c 2 75.75.76.76
PING 75.75.76.76 (75.75.76.76): 56 data bytes
64 bytes from 75.75.76.76: icmp_seq=0 ttl=54 time=33.904 ms
64 bytes from 75.75.76.76: icmp_seq=1 ttl=54 time=32.788 ms

--- 75.75.76.76 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 32.788/33.346/33.904/0.558 ms

 [~/] $ ping6 -c 2 2001:558:feed::1
PING6(56=40+8+8 bytes) 2601:14d:8b00:7d0:6587:7cfc:e2cc:82a0 --> 2001:558:feed::1
16 bytes from 2001:558:feed::1, icmp_seq=0 hlim=57 time=14.927 ms
16 bytes from 2001:558:feed::1, icmp_seq=1 hlim=57 time=14.585 ms

--- 2001:558:feed::1 ping6 statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 14.585/14.756/14.927/0.171 ms
 [~/] $ ping6 -c 2 2001:558:feed::2
PING6(56=40+8+8 bytes) 2601:14d:8b00:7d0:6587:7cfc:e2cc:82a0 --> 2001:558:feed::2
16 bytes from 2001:558:feed::2, icmp_seq=0 hlim=54 time=12.694 ms
16 bytes from 2001:558:feed::2, icmp_seq=1 hlim=54 time=11.555 ms

--- 2001:558:feed::2 ping6 statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 11.555/12.125/12.694/0.569 ms
@aio-libs-bot

GitMate.io thinks the contributor most likely able to help you is @asvetlov.

Possibly related issues are #141 (body for HTTP GET request), #2547 (Drop deprecated request.GET property), #99 (Request for 0.8.4), #388 (request.GET ignores blank values), and #436 (Request.GET drops empty params).

@asvetlov
Member

Thanks for the comprehensive report.
I recall a similar problem under high load.
DNS did not respond as fast as the program requested.
Request throttling solved it for us.
Another solution could be a custom caching DNS server connected to several upstream DNS servers. We tried it, but it requires a more complex server configuration (sorry, I don't recall the exact configs).

@bsolomon1124
Author

After some further investigation, this issue does not appear to be directly caused by aiohttp/asyncio, but rather by limits in two places:

  • The capacity/rate limiting of your DNS servers
  • The max number of open files at the system level

Firstly, for those looking to get some beefed-up DNS servers (I will probably not go that route), the big-name options seem to be:

  • 1.1.1.1 (Cloudflare)
  • 8.8.8.8 (Google Public DNS)
  • Amazon Route 53

(A good intro to DNS for those, like me, whose networking concepts are lacking.)
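
For reference, pointing aiohttp at one of these looks roughly like the following; a sketch assuming aiodns is installed (the nameserver choice is illustrative):

import aiohttp
from aiohttp.resolver import AsyncResolver

# AsyncResolver uses aiodns under the hood; give it explicit
# upstream nameservers instead of relying on the OS resolver.
resolver = AsyncResolver(nameservers=["1.1.1.1", "8.8.8.8"])
connector = aiohttp.TCPConnector(resolver=resolver)

async with aiohttp.ClientSession(connector=connector) as session:
    ...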

The first thing that I did was to run the above on a beefed-up AWS EC2 instance, an h1.16xlarge running Ubuntu, which is I/O optimized. I can't say this in itself helped, but it certainly cannot hurt. I'm not too familiar with the default DNS server used by an EC2 instance, but the OSError with errno == 8 from above went away when I replicated the script there.

However, that presented a new exception in its place, OSError with code 24, "Too many open files." My hotfix solution (not arguing this is the most sustainable or safest) was to increase the max file limits. I did this via:

sudo vim /etc/security/limits.conf
# Add these lines
root    soft    nofile  100000
root    hard    nofile  100000
ubuntu    soft    nofile  100000
ubuntu    hard    nofile  100000

sudo vim /etc/sysctl.conf
# Add this line
fs.file-max = 2097152

sudo sysctl -p

sudo vim /etc/pam.d/common-session
# Add this line
session required pam_limits.so

sudo reboot

I am admittedly feeling around in the dark, but coupling this with asyncio.Semaphore(1024) (example here; see also the sketch below) led to neither of the two exceptions above being raised. Of the ~25k input URLs, only ~100 GET requests returned exceptions, mainly due to those websites being legitimately broken, and total time to completion came in at a few minutes, which is acceptable in my opinion.
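
A minimal sketch of the semaphore approach, reusing the get() coroutine from the script above (the 1024 value is just what worked for me, not a magic number):

import asyncio

import aiohttp

async def bounded_get(url, session, sem) -> str:
    # At most N requests are in flight at once, which also caps
    # concurrent DNS lookups and open sockets.
    async with sem:
        return await get(url=url, session=session)

async def bulk_get_bounded(urls) -> list:
    # Create the semaphore inside the running loop to avoid
    # "attached to a different loop" errors under asyncio.run().
    sem = asyncio.Semaphore(1024)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(bounded_get(url, session, sem) for url in urls),
            return_exceptions=True,
        )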

@asvetlov
Member

Thanks for sharing your solution!

@owurman

owurman commented Aug 22, 2019

Thank you for this. It was the only clue I found as to why I was getting this error, which had to do with the number of open files and not at all with DNS, despite what the error message indicated.

@BVemployee

BVemployee commented Jan 10, 2020

The solution I found was twofold. First, if possible, don't create a new ClientSession object per request; reusing the same session fixed the DNS error for me. Second, for the file error, I limited the number of in-flight requests with an asyncio.Semaphore set to a value of about 1k.

@tmo-trustpilot

I was also getting this, and again it had nothing to do with DNS, but rather with the connection pool limit on the shared session (docs). You can make it unlimited like this:

connector = aiohttp.TCPConnector(limit=0)  # need unlimited connections
async with aiohttp.ClientSession(connector=connector) as session:
    ...
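
If fully unlimited feels risky, the connector can also cap concurrent connections per host rather than globally (the values here are illustrative):

connector = aiohttp.TCPConnector(limit=100, limit_per_host=10)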

You can also change the open file limit from within your Python code with the resource library. You can't raise the hard limit, so if it's not RLIM_INFINITY on your platform you'll have to adjust that separately (or take the value from getrlimit), but bumping up the soft limit got me past the open-files exceptions:

import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (2 ** 14, resource.RLIM_INFINITY))
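
On platforms where the hard limit isn't RLIM_INFINITY, a variant that respects the existing hard limit might look like this (the 2 ** 14 target is arbitrary):

import resource

# Raise the soft limit as far as the current hard limit allows;
# only a privileged process can raise the hard limit itself.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(2 ** 14, hard), hard))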

@ozgunozerk

I have tried the semaphore approach from bsolomon1124's comment and the advice given in tmo-trustpilot's comment, and additionally I'm not creating a new session per request; I'm reusing sessions.

I'm still getting the error ('nodename nor servname provided, or not known').

Any idea why? Or any update on the topic?

@PeqNP

PeqNP commented Mar 11, 2022

I had some luck using @tmo-trustpilot's solution. I used a semaphore of 32. The default resource limits:

>>> import resource
>>> resource.getrlimit(resource.RLIMIT_NOFILE)
(256, 9223372036854775807)

I don't understand how limiting the connections to 32 would somehow consume more than 256 of the available open file resources.

My specific use case is calling a little over 2k URLs with different hosts. I did not seem to encounter this when calling the same host. Does that make a difference? Is the ClientSession intended to be used for a single host?

@PeqNP

PeqNP commented Mar 11, 2022

I may have answered my own question. I just saw this in the docs under the "Client Quickstart" guide:

More complex cases may require a session per site, e.g. one for Github and other one for Facebook APIs. Anyway making a session for every request is a very bad idea.

I would love to get some feedback on this, but it seems like I may have to map a session to every host, along the lines of the sketch below.
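
A minimal sketch of the session-per-host mapping (the names are illustrative, not from the aiohttp docs):

import aiohttp
from yarl import URL

sessions = {}  # host -> ClientSession

def session_for(host):
    # Lazily create one ClientSession per host so each host gets its
    # own connection pool (and, later, its own timeout settings).
    # Call from inside a running event loop, and remember to close
    # each session on shutdown.
    if host not in sessions:
        sessions[host] = aiohttp.ClientSession()
    return sessions[host]

async def fetch(url) -> str:
    async with session_for(URL(url).host).get(url) as resp:
        return await resp.text()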

@PeqNP

PeqNP commented Mar 12, 2022

Sorry for the spam. I believe I better understand what's going on. Please let me know if this is correct.

I was making thousands of requests against 2,637 unique domains. Is it possible that a new file descriptor is opened for each domain? If that's the case, that would definitely explain the issue.

I also created a proxy wrapper class around the ClientSession class in order to create a new ClientSession for each domain to see if that would solve the problem. It didn't. I also didn't see any difference in speed. However, by using the proxy it allows me to configure each host's session w/ different timeouts (a future nice-to-have).

Time to execute when using proxy:

6.78s user 3.83s system 11% cpu 1:30.31 total

Time to execute w/o proxy:

7.57s user 3.68s system 12% cpu 1:29.07 total

I hope this helps someone else. In short, @tmo-trustpilot's suggestion of increasing the resource limit did the trick. Thank you!

I would love to know if my assumption is correct regarding a new fd being created per host.

@PeqNP

PeqNP commented Mar 12, 2022

OMG I'm such a dummy. I thought this was still an open issue... but, nope, it's closed. Well, prob won't get a response then. I have become "that person" reviving dead threads. fml 😂

@jtlz2

jtlz2 commented Dec 20, 2022

Does

ulimit -n

help here (as a temporary, per-shell test)?
