Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve/standardize rate limiting logic in Monitor #4894

Open
wants to merge 3 commits into
base: 2.1
Choose a base branch
from

Conversation

DomGarguilo
Copy link
Member

This PR uses Guavas Suppliers.memoizeWithExpiration() for caching, replacing the manual caching logic. This allows us to:

  • Remove synchronized from the get methods since Suppliers.memoizeWithExpiration() is thread safe
  • Remove the class-level maps that were previously being used for caching. Now, the fetch() methods that update the data, directly return the data instead of using these maps. These fetch() methods are called directly from the suppliers now.
  • Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.

@DomGarguilo DomGarguilo added this to the 2.1.4 milestone Sep 17, 2024
@DomGarguilo DomGarguilo self-assigned this Sep 17, 2024
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.

That is a nice simplification.

@DomGarguilo
Copy link
Member Author

Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.

That is a nice simplification.

Yea definitely nice to be able to remove some code. Wasn't sure if this would be an issue though since before, the maps would contain values that would only be aged off after they were 15 minutes old. And now, since the data is recreated once every minute, all of the returned data will only ever be that old. Not too sure what the implications of this are or what the old values were being used for.

@keith-turner
Copy link
Contributor

Wasn't sure if this would be an issue though since before, the maps would contain values that would only be aged off after they were 15 minutes old.

I misread that code when I was reviewing. I thought they were aging off after 15 milliseconds, so did not think it was an issue. But its actually 15 minutes. It would be good to understand that change in behavior a bit better. So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

@DomGarguilo
Copy link
Member Author

So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

Yea, so after looking at the code a bit more I think that the purpose of the age-off logic was to remove entries whose server was no longer present. The code that collects this data iterates through each server returned from context.instanceOperations().getTabletServers() and uses the HostAndPort of that server as the key in the map. So if there is new data for that server, the old data will just be replaced, but in the case where there is some data for a server and then that server dies, then it will not be replaced by anything. So after 15 mins, if the server did not come back up, those entries will be aged off. It seems like we want to keep this functionality so I'm going to work on adding this age-off functionality back.

@keith-turner
Copy link
Contributor

@DomGarguilo for these these dead servers that hang around in the monitor for 15mins, do you know if their last contact time continues to increase in monitor display?

@DomGarguilo
Copy link
Member Author

DomGarguilo commented Sep 23, 2024

@DomGarguilo for these these dead servers that hang around in the monitor for 15mins, do you know if their last contact time continues to increase in monitor display?

Okay so i was a bit mistaken. The info for servers that have dies did stick around in the map and then it depends on what the calling code does with it. For example, ScansResource sets up the rest endpoint for scans and it filters the data to only return scans from tservers that are alive, where as for scan server scans, it will return all of the data in the map. This means that if a scan server dies, as long as the 15 min age-off window has not passed, scans that were running on that scan server will still be displayed. This is just one example of how this potentially stale data is used in the monitor. I am taking a look at the other usages of this data that is kept around. I am not sure why we want to keep this data around on the monitor for scan servers but not for tservers.

Edit:

I am also seeing this scenario where I start a script that scans a table on a running scan server, then once that scan appears on the active scans page in the monitor, I kill the scan server and eventually it looks like that scan from the script starts running on the tserver. At that point the active scans page shows two active scans, one on the dead scan server and that same scan on a tserver. This seems like a bug to me.

@DomGarguilo
Copy link
Member Author

Regarding my previous comment:

I looked into the other spots where data was being pulled from these maps in the Monitor code (tserverScans, sserverScans, and allCompactions). Unless I am missing something, the only one that seems to use this stale data before it is aged off is the scan servers scenario outlined above. This leads me to believe that we might not need to keep this age off functionality unless it is deemed necessary for the scan server scans info.

}
return Map.copyOf(tserverScans);
public Map<HostAndPort,ScanStats> getScans() {
return tserverScansSupplier.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these methods returned an immutable collection previously, but might not be doing that now. Might want to change the fetch* methods to return immutable collections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use common technique in monitor code to rate limit refreshing data
3 participants