Improve/standardize rate limiting logic in Monitor #4894

DomGarguilo · 2024-09-17T19:46:15Z

This PR uses Guava's Suppliers.memoizeWithExpiration() for caching, replacing the manual caching logic. This allows us to:

Remove synchronized from the get methods, since Suppliers.memoizeWithExpiration() is thread-safe.
Remove the class-level maps that were previously used for caching. The fetch() methods that update the data now return it directly instead of writing to these maps, and they are called directly from the suppliers.
Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.
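
To make the caching behavior described above concrete, here is a minimal JDK-only sketch of an expiring memoized supplier. It is a stand-in written for illustration, not Guava's implementation and not the code in this PR; with Guava on the classpath the same role would be played by Suppliers.memoizeWithExpiration(delegate, duration, unit). The names ExpiringSupplier, MonitorCacheSketch, and fetchScanCount are hypothetical, and the injectable clock is an assumption added so that expiry can be exercised without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical JDK-only stand-in for an expiring memoized supplier; NOT Guava's implementation. */
final class ExpiringSupplier<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private final long ttlNanos;
  private final LongSupplier nanoClock; // injectable so expiry can be exercised without sleeping
  private T value;
  private long loadedAtNanos;
  private boolean loaded;

  ExpiringSupplier(Supplier<T> delegate, long duration, TimeUnit unit, LongSupplier nanoClock) {
    this.delegate = delegate;
    this.ttlNanos = unit.toNanos(duration);
    this.nanoClock = nanoClock;
  }

  // All locking lives here, so the callers' getters do not need to be synchronized.
  @Override
  public synchronized T get() {
    long now = nanoClock.getAsLong();
    if (!loaded || now - loadedAtNanos >= ttlNanos) {
      value = delegate.get(); // rebuilds the whole value; nothing stale is carried over
      loadedAtNanos = now;
      loaded = true;
    }
    return value;
  }
}

final class MonitorCacheSketch {
  private static final AtomicInteger FETCHES = new AtomicInteger();

  // Hypothetical fetch method: returns fresh data instead of mutating a class-level map.
  private static Integer fetchScanCount() {
    return FETCHES.incrementAndGet();
  }

  public static void main(String[] args) {
    Supplier<Integer> scans = new ExpiringSupplier<>(MonitorCacheSketch::fetchScanCount, 1,
        TimeUnit.MINUTES, System::nanoTime);
    scans.get();
    scans.get(); // within the one-minute window, so the cached value is reused
    System.out.println("fetches so far: " + FETCHES.get());
  }
}
```

Because each refresh replaces the cached value wholesale, entries for servers that no longer report simply never appear in the new value, which is the behavior change discussed in the rest of this thread.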

keith-turner

> Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.

That is a nice simplification.

DomGarguilo · 2024-09-18T16:26:23Z

> > Remove the age-off logic that was being done in the fetch() methods. It should be noted that the functionality has slightly changed. Since we just re-create a new map whenever the data is refreshed in the fetch() methods, there are no old values that need to be aged off.
>
> That is a nice simplification.

Yeah, definitely nice to be able to remove some code. I wasn't sure whether this would be an issue, though: before, the maps would contain values that were only aged off once they were 15 minutes old. Now, since the data is recreated once every minute, all of the returned data will be at most about a minute old. I'm not too sure what the implications of this are or what the old values were being used for.

keith-turner · 2024-09-18T18:54:20Z

> Wasn't sure if this would be an issue though since before, the maps would contain values that would only be aged off after they were 15 minutes old.

I misread that code when I was reviewing. I thought they were aging off after 15 milliseconds, so I did not think it was an issue. But it's actually 15 minutes. It would be good to understand that change in behavior a bit better. So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

DomGarguilo · 2024-09-20T16:00:02Z

> So it seems like some things used to hang around for 15 minutes in the monitor display even after they were gone?

Yeah, so after looking at the code a bit more, I think the purpose of the age-off logic was to remove entries whose server is no longer present. The code that collects this data iterates through each server returned from context.instanceOperations().getTabletServers() and uses the HostAndPort of that server as the key in the map. If there is new data for that server, the old data is simply replaced, but if a server has data in the map and then dies, nothing replaces its entry. After 15 minutes, if the server has not come back up, those entries are aged off. It seems like we want to keep this functionality, so I'm going to work on adding the age-off logic back.
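
The age-off behavior described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the Monitor's actual code: servers are keyed by a plain String rather than HostAndPort, and the names AgeOffSketch, Timestamped, and refresh are hypothetical. Whether the real comparison against the 15-minute window is strict or inclusive is also an assumption here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

final class AgeOffSketch {
  // 15-minute window taken from the discussion; the constant name is hypothetical.
  static final long AGE_OFF_MILLIS = TimeUnit.MINUTES.toMillis(15);

  /** A cached value plus the time it was last refreshed. */
  static final class Timestamped<T> {
    final T stats;
    final long fetchedAtMillis;

    Timestamped(T stats, long fetchedAtMillis) {
      this.stats = stats;
      this.fetchedAtMillis = fetchedAtMillis;
    }
  }

  /**
   * Overwrites entries for servers that reported data in this round, then drops any entry
   * that has not been refreshed within the age-off window (for example, a dead server).
   */
  static <T> void refresh(Map<String,Timestamped<T>> cache, Map<String,T> latest,
      long nowMillis) {
    latest.forEach((server, stats) -> cache.put(server, new Timestamped<>(stats, nowMillis)));
    cache.values().removeIf(entry -> nowMillis - entry.fetchedAtMillis > AGE_OFF_MILLIS);
  }

  public static void main(String[] args) {
    Map<String,Timestamped<Integer>> cache = new HashMap<>();
    refresh(cache, Map.of("host1:9997", 3, "host2:9997", 5), 0L);
    // host2 stops reporting; 16 minutes later its entry is older than the window.
    refresh(cache, Map.of("host1:9997", 4), TimeUnit.MINUTES.toMillis(16));
    System.out.println(cache.keySet());
  }
}
```

With this shape, a dead server's last-known entry stays readable for up to 15 minutes, which is why what the monitor displays depends on whether the calling code filters it out.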

keith-turner · 2024-09-20T21:48:28Z

@DomGarguilo for these dead servers that hang around in the monitor for 15 minutes, do you know if their last contact time continues to increase in the monitor display?

DomGarguilo · 2024-09-23T18:08:44Z

> @DomGarguilo for these dead servers that hang around in the monitor for 15 minutes, do you know if their last contact time continues to increase in the monitor display?

Okay, so I was a bit mistaken. The info for servers that have died did stick around in the map, and what happens next depends on what the calling code does with it. For example, ScansResource sets up the REST endpoint for scans, and it filters the data to only return scans from tservers that are alive, whereas for scan server scans it returns all of the data in the map. This means that if a scan server dies, scans that were running on it will still be displayed until the 15-minute age-off window has passed. This is just one example of how this potentially stale data is used in the monitor. I am taking a look at the other usages of this data that is kept around. I am not sure why we want to keep this data around on the monitor for scan servers but not for tservers.
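
The difference between the two code paths can be illustrated with a small sketch of the tserver-style filtering. This is not the ScansResource code; the class and method names are hypothetical, and servers are represented as plain strings for brevity.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class LiveServerFilterSketch {
  /** Keeps only cached entries whose server is in the live set, as described for tserver scans. */
  static <V> Map<String,V> onlyLive(Map<String,V> cached, Set<String> liveServers) {
    return cached.entrySet().stream()
        .filter(e -> liveServers.contains(e.getKey()))
        .collect(Collectors.toUnmodifiableMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    Map<String,Integer> cachedScans = Map.of("tserver1:9997", 2, "deadserver:9997", 1);
    // The scan server path described above would return cachedScans as-is, stale entry included.
    System.out.println(onlyLive(cachedScans, Set.of("tserver1:9997")));
  }
}
```

Applying a filter like this on one path but not the other is what lets a dead scan server's scans remain visible while a dead tserver's scans do not.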

Edit:

I am also seeing this scenario: I start a script that scans a table on a running scan server. Once that scan appears on the active scans page in the monitor, I kill the scan server, and eventually the scan from the script appears to start running on a tserver. At that point the active scans page shows two active scans: one on the dead scan server and the same scan on a tserver. This seems like a bug to me.

DomGarguilo · 2024-09-23T18:25:23Z

Regarding my previous comment:

I looked into the other spots where data was being pulled from these maps in the Monitor code (tserverScans, sserverScans, and allCompactions). Unless I am missing something, the only one that seems to use this stale data before it is aged off is the scan server scenario outlined above. This leads me to believe that we might not need to keep this age-off functionality unless it is deemed necessary for the scan server scans info.

dlmarion · 2024-10-03T13:43:05Z

server/monitor/src/main/java/org/apache/accumulo/monitor/Monitor.java

-    }
-    return Map.copyOf(tserverScans);
+  public Map<HostAndPort,ScanStats> getScans() {
+    return tserverScansSupplier.get();


It looks like these methods returned an immutable collection previously, but might not be doing that now. Might want to change the fetch* methods to return immutable collections.
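
One way to act on this suggestion is sketched below, assuming a fetch method that builds a mutable map internally. The class and method names are hypothetical and the value type is simplified; Map.copyOf is the call the removed line in the diff above used, and Collections.unmodifiableMap would be an alternative that wraps the map instead of copying it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ImmutableFetchSketch {
  /** Hypothetical fetch method: builds the result in a mutable map, then returns an immutable copy. */
  static Map<String,Long> fetchScans(Iterable<String> servers) {
    Map<String,Long> scans = new HashMap<>();
    for (String server : servers) {
      scans.put(server, 0L); // placeholder value; a real fetch would query the server
    }
    return Map.copyOf(scans); // callers cannot mutate the value cached by the supplier
  }

  public static void main(String[] args) {
    Map<String,Long> scans = fetchScans(List.of("host1:9997"));
    try {
      scans.put("host2:9997", 1L);
    } catch (UnsupportedOperationException e) {
      System.out.println("map is immutable");
    }
  }
}
```

Since the supplier hands the same cached instance to every caller until it expires, returning an immutable map keeps one caller from changing what the others see.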

Improve/standardize rate limiting logic in monitor

f0f7c2a

DomGarguilo added this to the 2.1.4 milestone Sep 17, 2024

DomGarguilo requested a review from keith-turner September 17, 2024 19:46

DomGarguilo self-assigned this Sep 17, 2024

DomGarguilo linked an issue Sep 17, 2024 that may be closed by this pull request

Use common technique in monitor code to rate limit refreshing data #4885

Open

keith-turner approved these changes Sep 17, 2024

DomGarguilo mentioned this pull request Sep 25, 2024

Use Metric.getDescription() for all micrometer instruments #4925

Merged

dlmarion reviewed Oct 3, 2024

DomGarguilo added 2 commits October 3, 2024 13:54

Return unmodifiable maps from fetch methods

1dc836b

refactor common code into new method

26c8a3d

dlmarion approved these changes Oct 3, 2024
