Leaking file descriptors when generating too many http requests #3279

Closed
adrai opened this issue Feb 4, 2022 · 34 comments
Labels: guidance, workaround-available

adrai commented Feb 4, 2022

This issue is extracted from the original issue #3019, as requested in #3019 (comment).

#3019 concentrates on the readFile (leaking file descriptor) "issue", and this issue concentrates on the network (leaking file descriptor) "issue".


It seems Lambda is not "waiting" for the file descriptors to be closed.

This can be observed especially with warm Lambda executions that make a lot of SDK calls, e.g. to DynamoDB or S3.
Each HTTP request opens a network socket, which results in an open file descriptor.
Since the default socket timeout in Node.js is 120000 ms (2 minutes), the Lambda may already be finished while the sockets are still open. When the Lambda is "restarted" for the next invocations, those file descriptors may still be open.
This leads to EMFILE errors like these:

Error: connect EMFILE 52.94.5.100:443 - Local (undefined:undefined)
Error: getaddrinfo EMFILE dynamodb.eu-west-1.amazonaws.com
Error: A system error occurred: uv_os_homedir returned EMFILE (too many open files)
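
One way to observe the leak directly (a minimal sketch of mine, not from the SDK) is to count the entries in /proc/self/fd inside the handler; this works on the Linux-based Node.js Lambda runtimes, where every open socket shows up as one descriptor:

import { readdirSync } from 'fs'

export const handler = async () => {
  // every entry in /proc/self/fd is one open descriptor (files, sockets, pipes, ...)
  console.log(`open file descriptors before: ${readdirSync('/proc/self/fd').length}`)
  // ... make the SDK calls here, then log again to watch the descriptors pile up
  console.log(`open file descriptors after: ${readdirSync('/proc/self/fd').length}`)
}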

These basic tests show the emfile count (and the leaks):

The tests were originally done in this issue: #3019 (comment)

[screenshots of the emfile counts]

compared to these tests: #3019 (comment)

Defining a custom requestHandler with a very low socketTimeout drastically reduces the emfile count:

  requestHandler: new NodeHttpHandler({
    socketTimeout: 10000 // <- this decreases the emfiles count, the Node.js default is 120000
  })
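
For context, a complete version of this workaround (my sketch; the client type and region are just placeholders) applied to a DynamoDB client looks like this:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { NodeHttpHandler } from '@aws-sdk/node-http-handler'

const client = new DynamoDBClient({
  region: 'eu-west-1', // placeholder region
  requestHandler: new NodeHttpHandler({
    socketTimeout: 10000 // well below the 120000 ms Node.js default
  })
})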

[screenshots of the emfile counts with the lower socketTimeout]

That's why I suggest setting a low socketTimeout by default, as proposed here: #3019 (comment)

proposal:

// https://github.com/aws/aws-sdk-js-v3/blob/main/packages/node-http-handler/src/node-http-handler.ts#L62
socketTimeout: socketTimeout || 10000

and probably also here?

// https://github.com/aws/aws-sdk-js-v3/blob/main/packages/node-http-handler/src/node-http2-handler.ts#L44
this.requestTimeout = requestTimeout || 10000;

PS. btw. it seems it got worse (more file descriptors) when updating from v3.46.0 to v3.49.0

adrai added the bug and needs-triage labels Feb 4, 2022

adrai commented Feb 6, 2022

maybe also relevant: #2569 @artur-ma ?


trivikr commented Feb 7, 2022

The fix in #3285 is most likely going to fix this issue too, as the number of file reads is reduced from 16 to 2 for client+operation call.

ToDo: Reproduce this issue using repo https://github.com/samswen/lambda-emfiles when the fix from #3285 is out.


adrai commented Feb 7, 2022

The fix in #3285 is most likely going to fix this issue too, as the number of file reads is reduced from 16 to 2 for client+operation call.

ToDo: Reproduce this issue using repo https://github.com/samswen/lambda-emfiles when the fix from #3285 is out.

No, I don't think so.
This is not about readFile, but about http requests.
That's why you told me to open a new issue.


adrai commented Feb 7, 2022

This issue (http requests) is also more important for you @ffxsam, right?


trivikr commented Feb 7, 2022

No, I don't think so.

Yup, I'm also not sure, and this request will require a deep dive.
That's why I did not mark this issue as "fixes" but as "refs" in the PR, and mentioned that we'll try to repro with https://github.com/samswen/lambda-emfiles once the fix is out.

This is not about readFile, but about http requests.

Looking at the bug description, it appears to be because of readFile, for the following reasons:

  • EMFILE errors are thrown when there are too many open files.
  • It seems it got worse (more file descriptors) when updating from v3.46.0 to v3.49.0, where we added another readFile call for defaults.

That's why you told me to open a new issue.

The reason I asked to create a separate issue is for better tracking.


adrai commented Feb 7, 2022

fyi: the tests I did were without extra readFile calls, so these are all coming from http requests.
So the tests ran with:

// stub the shared ini file loader so no config/credentials files are read
import sharedIniFileLoader from '@aws-sdk/shared-ini-file-loader'
Object.assign(sharedIniFileLoader, { loadSharedConfigFiles: async () => ({ configFile: {}, credentialsFile: {} }) })

That's why my proposal is to have a very low socketTimeout by default.
In my tests, setting a low socketTimeout decreased the EMFILE count from over 600 to under 100 => 6x.

trivikr removed the needs-triage label Feb 7, 2022
adrai added a commit to adrai/aws-sdk-js-v3 that referenced this issue Feb 8, 2022
…tion, to prevent (aws#3279)

to prevent leaking file descriptors, like described in aws#3279

trivikr commented Feb 10, 2022

PS. btw. it seems it got worse (more file descriptors) when updating from v3.46.0 to v3.49.0
In my tests setting a low socketTimeout decreased the EMFILE count from up to over 600 to under 100 => 6x.

This clearly shows that most of the EMFILE errors are caused by file reads being fixed in #3285

  • Updating from v3.46.0 to v3.49.0 added an additional readFile call to read defaults.
  • The test setting removes readFile calls for all configurations (except credentials).

The default socketTimeout for the SDK will be applicable to all environments.
I'll revisit this issue after #3285 is out (it will most likely be released with v3.51.0 on 02/11)


adrai commented Feb 14, 2022

v3.51.0 has been released, any update regarding #3279 / #3291 ?

@dobrynin

Also facing this. Thanks for the deep dive @adrai.

@Muthuveerappanv

I'm facing the same issue as well. I was doing 10000 DynamoDB updates in one chunk and benchmarked the performance with 3.33.x; with 3.49 I ran into these EMFILE issues. The only options were to reduce the chunk size to 800 (even 1000 didn't work) or roll back to 3.46.

For many in production (like me) this is a breaking change. Could we have some updates soon?

@adrai thank you for the detailed description


adrai commented Feb 19, 2022

best advice from my side is really to stay at 3.46 and/or set a lower socketTimeout


trivikr commented Feb 19, 2022

@Muthuveerappanv we've reduced readFile calls in #3285 which was released in v3.51.0.

Can you test with >3.51.0 and share your results?


adrai commented Feb 19, 2022

fyi: with 3.51.0 in my setup, I still get EMFILE errors, too many open sockets... because of the socketTimeout... but it seems it is a very little bit better, but still not good

@Muthuveerappanv

My load test results still haven't improved; the test didn't pass even for 1000 parallel DynamoDB updates (Promise.all), so not much improvement [I had 10000 updates working fine in 3.33.x, now my chunk size is 800].

@Muthuveerappanv we've reduced readFile calls in #3285 which was released in v3.51.0.
Can you test with >3.51.0 and share your results?

@Muthuveerappanv

fyi: with 3.51.0 in my setup, I still get EMFILE errors, too many open sockets... because of the socketTimeout... but it seems it is a very little bit better, but still not good

I concur, I tried with version 3.52.0


trivikr commented Feb 21, 2022

I concur, I tried with version 3.52.0

@Muthuveerappanv is the issue in 3.52.0 worse than that in 3.46.0?
We think it should be improved, as readFile calls have been reduced from 12 to 2.


adrai commented Feb 22, 2022

tl;dr: the number of EMFILEs generated by readFile is nothing compared to the number generated by the http requests... that's why you will not notice a decrease of 10 EMFILEs when there are hundreds or thousands of other EMFILEs caused by the http requests

@Muthuveerappanv

I concur, I tried with version 3.52.0

@Muthuveerappanv is the issue in 3.52.0 worse than that in 3.46.0? We think it should be improved, as readFile calls have been reduced from 12 to 2.

I wouldn't say worse, but it definitely hasn't gotten better either. The SDK isn't coping with the burst of HTTP requests as well as it did prior to 3.46.0.

@Muthuveerappanv

@trivikr - any ETA on this bug?

trivikr added the workaround-available label Mar 15, 2022

trivikr commented Mar 15, 2022

There's no ETA. Marking this issue as workaround-available, as customers can always reduce the socketTimeout during client creation in their high-traffic Lambdas:

import { S3 } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";

const client = new S3({
  requestHandler: new NodeHttpHandler({
    socketTimeout: 10000 // <- this decreases the emfiles count, the Node.js default is 120000
  })
});


trivikr commented Mar 15, 2022

Marking this as guidance, as the issue is specific to high-traffic Lambdas.
Whenever we revisit this issue in the future, we need to discuss the impact of passing a custom socketTimeout value for all environments we support.

trivikr added the guidance label and removed the bug label Mar 15, 2022

adrai commented Mar 15, 2022

The question is: Why was this workaround NOT necessary in <= v3.46.0 ?


trivikr commented Mar 15, 2022

Why was this workaround NOT necessary in <= v3.46.0 ?

We suspect that v3.47.0 added support for defaults mode in #3192, which increased the number of readFile calls. I do not see any other change that could cause it in the https://github.com/aws/aws-sdk-js-v3/releases/tag/v3.47.0 release notes.

If you can provide concrete data comparing the following versions, it will help us debug this further:

  • v3.46.0 which doesn't contain default mode
  • v3.47.0 which contains default mode.
  • v3.51.0 which reduces readFile calls.

By concrete data, I mean: at what number of concurrent requests on Lambda do the EMFILE errors occur in these versions?

The previous responses were generic, like the one below:

i wouldn't say worse, but definitely hasn't gotten better either


adrai commented Mar 15, 2022

Here I compared v3.46.0 with v3.49.0: #3019 (comment)


adrai commented Mar 15, 2022

Also @Muthuveerappanv has some insights: #3279 (comment)


adrai commented Mar 16, 2022

@ffxsam should also have an example with concurrent S3 requests: #3270


adrai commented Mar 19, 2022

@trivikr I think I found what is causing all these extra open EMFILEs....

It's probably exactly what @AllanZhengYP commented here: https://github.com/aws/aws-sdk-js-v3/blame/main/packages/node-http-handler/src/node-http-handler.ts#L75

When doing all these hundreds of concurrent requests, the code does not wait for this.config to be ready, and will initialize a lot of new http(s) clients here: https://github.com/aws/aws-sdk-js-v3/blame/main/packages/node-http-handler/src/node-http-handler.ts#L64

All this was introduced in v3.47.0 with this commit: 9152e21

I tested with this little hack, and it seems to work much better like this:
[screenshot of the patched node-http-handler]
// cc: @mcollina you may know of a more performant way to sync up these concurrent calls?
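
A hedged sketch (mine, not the actual patch in the screenshot) of the kind of synchronization this needs: memoize the async config resolution so that all concurrent calls await the same promise and share one agent, instead of each call constructing its own http(s) client:

import { Agent } from 'https'

let configPromise

const resolveConfig = () => {
  // all concurrent callers share the single in-flight promise,
  // so only one Agent (with its socket pool) is ever created
  if (!configPromise) {
    configPromise = Promise.resolve({
      httpsAgent: new Agent({ keepAlive: true, maxSockets: 50 })
    })
  }
  return configPromise
}

// two "concurrent" requests end up with the same agent
Promise.all([resolveConfig(), resolveConfig()]).then(([a, b]) => {
  console.log(a.httpsAgent === b.httpsAgent) // true
})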

btw: to generate some concurrent requests, it is enough to do something like this:

import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
const s3Client = new S3Client({ region: 'eu-west-1' })

for (let index = 0; index < 1000; index++) {
  s3Client.send(new GetObjectCommand({
    Bucket: 'some-bucket',
    Key: 'some-key'
  })).then(() => {}).catch(() => {})  
}


moltar commented Mar 20, 2022

Ooo nice find @adrai!!


adrai commented Mar 27, 2022

@trivikr any updates on this?


trivikr commented Mar 27, 2022

@AllanZhengYP will take a look at this issue.


adrai commented Apr 12, 2022

@alexforsyth @trivikr any update on this?
When will we be able to update from v3.46.0 to a newer fixed version?


kuhe commented Apr 21, 2022

hi @adrai,

3.74.0 has been released, which includes a fix based on your screenshot.

In my test lambda, using the code snippet you gave with the looped .send(new GetObjectCommand(...)) and wrapping it in https://github.com/samswen/lambda-emfiles, the comparison of 3.72.0 and 3.74.0 shows a greatly reduced emfile count: from about 1 per loop iteration down to a total of 50, which I believe is consistent with the default maxSockets value in the node http handler config.
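
For anyone who wants to cap the socket count explicitly, a sketch of my own (not part of the fix itself) that passes an agent with the standard Node.js maxSockets option to NodeHttpHandler:

import { S3Client } from '@aws-sdk/client-s3'
import { NodeHttpHandler } from '@aws-sdk/node-http-handler'
import { Agent } from 'https'

const client = new S3Client({
  requestHandler: new NodeHttpHandler({
    // keepAlive reuses sockets; maxSockets limits how many can be open per host
    httpsAgent: new Agent({ keepAlive: true, maxSockets: 50 })
  })
})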


adrai commented Apr 21, 2022

Your code changes look good... thank you.
Will test this with the next prod deployment in the coming days...

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

github-actions bot locked as resolved and limited conversation to collaborators May 14, 2022