
Discussion: Preview of improved Functions http scaling behavior #38

Closed
davidebbo opened this issue Mar 12, 2018 · 13 comments

Comments

@davidebbo
Contributor

Discussion thread for Azure/app-service-announcements#90.

@nzthiago
Member

@davidebbo - do you know if the flag is still needed? Or is it now enabled by default for all functions?

@davidebbo
Contributor Author

Great question: we actually had it enabled by default for a few days, but had to turn it off due to an issue. So for now, yes, you still need the flag. It should become the default again in another few weeks. /cc @suwatch

@Fabian-Schmidt

Does this change only apply to Consumption-based plans?
Can App Service plans also benefit from the changes?

@davidebbo
Contributor Author

@Fabian-Schmidt yes, it's only for Consumption.

@rikvandenberg

rikvandenberg commented May 14, 2018

To add to the discussion: a blog post from @JamesRandall led me here.
https://www.azurefromthetrenches.com/azure-functions-significant-improvements-in-http-trigger-scaling/

We seem to be seeing a sharp performance drop in the form of small latency peaks in our HTTP functions: responses that were consistently ~6ms suddenly increase to ~600ms.

@davidebbo
Contributor Author

@rikvandenberg please provide more details. Are you referring to cold start, or is that the response time you always see? Is this under a high-load scenario?

@rikvandenberg

@davidebbo I'll try my best to explain what we are seeing.

Intro
We have two simple Azure Functions that do the following (a simplified sketch of the distance calculation follows this list):

  1. DistanceFunction: calculates the geographical distance between one origin (lat/lon) and a maximum of 25 destination locations (lat/lon). Uses System.Device.Location.GeoCoordinate.

  2. RouteFunction: calculates the driving distance between one origin (lat/lon) and a maximum of 25 destination locations (lat/lon). It uses the Google Maps API and caches the results in a Redis cache.
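For context, here is a minimal sketch of the kind of calculation DistanceFunction performs. This is illustrative only, not our actual code; the class and method names are made up, and it assumes .NET Framework, where System.Device.Location is available:

```csharp
// Illustrative sketch: straight-line distance from one origin to up to 25
// destinations, using System.Device.Location.GeoCoordinate.
// Requires a reference to the System.Device assembly (.NET Framework).
using System.Collections.Generic;
using System.Device.Location;
using System.Linq;

public static class DistanceCalculator
{
    private const int MaxDestinations = 25;

    // Returns the distance in meters from the origin to each destination.
    public static IReadOnlyList<double> DistancesFrom(
        GeoCoordinate origin, IEnumerable<GeoCoordinate> destinations)
    {
        return destinations
            .Take(MaxDestinations)                // cap at 25 destinations
            .Select(d => origin.GetDistanceTo(d)) // distance in meters
            .ToList();
    }
}
```

In the real function this sits behind an HTTP trigger that parses the lat/lon pairs from the request.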

We call both functions asynchronously, at the same time, with the exact same origin and destination parameters, from an ASP.NET Web API application, and use Task.WaitAll(tasks, 500) to time out after 500ms.

We require this timeout to prevent the user request from blocking: if the functions don't respond in time, we pre-emptively continue the request, since upon refresh the route information will most likely already be in the cache. The calling pattern is sketched below.
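A simplified sketch of that calling pattern (the URLs and names are placeholders, not our real endpoints, and error handling is omitted):

```csharp
// Simplified sketch of the caller: invoke both functions in parallel and wait
// at most 500 ms, continuing without the results on a timeout.
using System.Net.Http;
using System.Threading.Tasks;

public class FunctionCaller
{
    private static readonly HttpClient Http = new HttpClient();

    // Placeholder URLs; the real function endpoints are not shown here.
    private const string DistanceUrl = "https://example.azurewebsites.net/api/DistanceFunction";
    private const string RouteUrl    = "https://example.azurewebsites.net/api/RouteFunction";

    public static (string Distance, string Route)? TryGetBoth(string query)
    {
        Task<string> distanceTask = Http.GetStringAsync(DistanceUrl + query);
        Task<string> routeTask    = Http.GetStringAsync(RouteUrl + query);

        // Wait at most 500 ms for both calls; WaitAll returns false on timeout.
        // (Error handling for faulted tasks omitted for brevity.)
        bool completed = Task.WaitAll(new Task[] { distanceTask, routeTask }, 500);
        if (!completed)
        {
            // Pre-emptively continue without results; the background calls still
            // finish and populate the cache for the next request/refresh.
            return null;
        }

        return (distanceTask.Result, routeTask.Result);
    }
}
```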

Performance Test

  • Over the span of ~9 minutes we received 265 requests, ~130 for each function. Not a high-load scenario.
  • We always used the EXACT same origin and destinations, so the functions would always use the CACHED results.
  • The Live Metrics Stream in Application Insights indicated that our function app had 10 cloud role instances at the time.

Performance Test Results

  • Of those 265 requests, 13 peaked above 500ms (results from Application Insights analytics).
  • 12 of those requests came from the Route function, as it has the heavier workload.
  • The peaks don't seem to correlate with a "cold" cloud_roleinstance, as the instances had already been warmed up.
  • WEBSITE_HTTPSCALEV2_ENABLED=1 was set.

Possible Causes
It seems to me that the switch to another cloud_roleinstance is sometimes the cause of these "random" peaks, but I can't think of any other plausible explanation 🤔

If you have any suggestions on how to approach these tests to give you more insights as well, please let me know and I'll see what I can do.

@davidebbo
Contributor Author

/cc @suwatch who is the expert.

@rikvandenberg What you're observing is likely the flip side of the new scale behavior. It scales out faster (you're getting 10 instances), but at the same time you end up hitting more cold starts (one per instance). We may still need to tweak the system further to balance things.

BTW, WEBSITE_HTTPSCALEV2_ENABLED is now on by default, which is likely why you saw a change. You can also try setting WEBSITE_HTTPSCALEV2_ENABLED=0 to revert to the previous behavior (one way to do that is shown below).
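For example, the app setting can be changed with the Azure CLI as well as through the portal (the app and resource group names below are placeholders):

```bash
# Revert one function app to the previous scaling behavior (placeholder names).
az functionapp config appsettings set \
  --name my-function-app \
  --resource-group my-resource-group \
  --settings WEBSITE_HTTPSCALEV2_ENABLED=0
```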

Would you say that your functions are CPU bound, or more I/O bound? I would think the latter, as waiting for the Google Maps result should take very few resources. In that sense, it is odd that the system decided to scale out this much. @suwatch will dig into it further.

@suwatch

suwatch commented May 16, 2018

It seems to me that the switch to another cloud_roleinstance is sometimes the cause of these "random" peaks, but I can't think of any other plausible explanation

@rikvandenberg Thanks for reporting. This was a result of unwarranted cold starts. Our current scale implementation has a flaw when it comes to low load with occasional bursts of concurrent requests. The spikes caused us to scale out to more instances. Since the spikes were not sustained, our scale-in logic kicked in and removed the instances. This happened alternately every 1-2 minutes, and as a result a moving set of instances was assigned to the function. That explains the 10 instances in Application Insights: they were not active at the same time, but rather a changing set over a 10-minute period. Each newly assigned instance caused a cold start (a spike of long latency).

The good news is that we have improved this logic by making the scale-in less aggressive in this specific situation. The ETA is about 2 weeks. We will let you know when you can retry your scenario.

@rikvandenberg

@suwatch Thanks for the quick response! I look forward to testing this improvement.

@suwatch

suwatch commented Jun 3, 2018

@rikvandenberg The fix rollout is taking longer than expected. It will likely be another week before the fix is available everywhere. If you are eager to experiment, try creating a function app in the West Central US region, where the fix is already available. Otherwise, wait about a week.

@suwatch

suwatch commented Jun 20, 2018

@rikvandenberg The improvement has been fully rolled out. Do try it when you get a chance and provide any feedback.

@rikvandenberg

rikvandenberg commented Jun 21, 2018

@suwatch We just had our sprint planning yesterday and have some time available to do a small test. I'll try to use the same test scenario and will let you know when we have something.

fabiocav closed this as completed Dec 5, 2019