Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group is sometimes missing state updates from members #3578

Closed
florian-h05 opened this issue Apr 25, 2023 · 26 comments
Closed

Group is sometimes missing state updates from members #3578

florian-h05 opened this issue Apr 25, 2023 · 26 comments
Assignees
Labels
external bug A problem or unintended behavior of an external library

Comments

@florian-h05
Copy link
Contributor

Expected Behavior

If a member of a group changes or updates it's state, the group's StateChangeListener should get informed.

Current Behavior

Most times when a group member updates or changed its state, the group's listeners are not informed.
The weird behaviour is, that this is not always the case.

protected void notifyListeners(final State oldState, final State newState) {
// if nothing has changed, we send update notifications
Set<StateChangeListener> clonedListeners = new CopyOnWriteArraySet<>(listeners);
ExecutorService pool = ThreadPoolManager.getPool(ITEM_THREADPOOLNAME);
try {
final boolean stateChanged = newState != null && !newState.equals(oldState);
clonedListeners.forEach(listener -> pool.execute(() -> {
try {
listener.stateUpdated(GenericItem.this, newState);
if (stateChanged) {
listener.stateChanged(GenericItem.this, oldState, newState);
}
} catch (Exception e) {
logger.warn("failed notifying listener '{}' about state update of item {}: {}", listener,
GenericItem.this.getName(), e.getMessage(), e);
}
}));
} catch (IllegalArgumentException e) {
logger.warn("failed comparing oldState '{}' to newState '{}' for item {}: {}", oldState, newState,
GenericItem.this.getName(), e.getMessage(), e);
}
}

I found out that:

  1. notifyListeners is called and clonedListeners contains the listeners as expected.
  2. for each of clonedListeners pool.execute is called.
  3. Without any exception being thrown and logged, listener.stateUpdate and listener.stateChanged if stateChanged are never called.

Possible Solution

Steps to Reproduce (for Bugs)

  1. Create a Group:Switch:OR(ON,OFF)
  2. Put in at least one Switch Item
  3. Switch that group member from the UI and have a look at the logs. Sometimes, the group Item won't update it's state.

Context

Your Environment

  • Version used: 4.0.0.M2
  • Environment name and version (e.g. Chrome 111, Java 17, Node.js 18.15, ...):
  • Operating System and version (desktop or mobile, Windows 11, Raspbian Bullseye, ...): Debian 11 64-bit
@florian-h05 florian-h05 added the bug An unexpected problem or unintended behavior of the Core label Apr 25, 2023
@J-N-K J-N-K self-assigned this Apr 25, 2023
@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

I've seen it once today but wasn't able to reproduce it. Can you reproduce it? In that case I would like to provide a bundle for debugging that.

@florian-h05
Copy link
Contributor Author

It happens very very regularly, so yes I can reproduce it, even though I have no idea what is different if it works.

@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

Update the core bundle:

update org.openhab.core https://janessa.me/esh/org.openhab.core-4.0.0-SNAPSHOT.jar

and stop openHAB, edit <userdata>/etc/org.apache.karaf.log.cfg and change the last line to

pattern = %d{HH:mm:ss.SSS} | %-5.5p | %-20.20t | %-36.36c{1.} | %-3.3X{bundle.id} | %h{%m}%n

After starting again you now see the thread name in column three of the output and also additional logging when GroupItem.stateUpdated is called, similar to:

18:07:51.497 | ERROR | OH-items-3           | org.openhab.core.items.GroupItem     | 149 | Called stateUpdated for TestSwitch (Type=SwitchItem, State=OFF, Label=TestSwitch, Category=, Groups=[GroupSwitchTest]) with state 'OFF' (old state is 'ON')

Can you post the log when the issue occurs (including about ten lines before the missed update).

I see that several threads are calling stateUpdated and it might be this is a race condition where the first event is not yet fully processed (and the new state not set) and the next event already starts processing. In that case oldState would not be the newState of the previous run, but still oldState of the previous run.

@florian-h05
Copy link
Contributor Author

florian-h05 commented Apr 25, 2023

I‘ve just switched my Switch Item Balkon_Steckdose, which is member of Group gSteckdose, several times between on and off, this is the log from the switching with then lines above:

19:03:45.966 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1262 to 1279
19:03:46.019 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Available' changed from 1725 to 1708
19:03:48.804 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Doorbell_Image' changed from raw type (image/jpeg): 31286 bytes to raw type (image/jpeg): 31255 bytes
19:03:49.040 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1279 to 1283
19:03:49.111 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Available' changed from 1708 to 1704
19:03:49.967 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'KNX_DatumZeit' changed from 2023-04-25T19:02:49.000+0200 to 2023-04-25T19:03:49.000+0200
19:03:52.135 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1283 to 1288
19:03:52.164 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Storage_Available' changed from 21250 to 21249
19:03:52.213 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Available' changed from 1704 to 1699
19:03:53.834 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Doorbell_Image' changed from raw type (image/jpeg): 31255 bytes to raw type (image/jpeg): 31338 bytes
19:03:54.594 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'Balkon_Steckdose' received command ON
19:03:54.665 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Balkon_Steckdose' changed from OFF to ON
19:03:55.164 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'Balkon_Steckdose' received command OFF
19:03:55.241 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1288 to 1295
19:03:55.246 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Balkon_Steckdose' changed from ON to OFF

FYI, this is how it looks if it works (which it does in 1 of 100 tries currently):

19:04:54.689 | INFO  | OH-eventexecutor-1   | hab.event.GroupItemStateChangedEvent | 172 | Item 'gSteckdose' changed from OFF to ON through Balkon_Steckdose                                                 
19:04:54.703 | ERROR | OH-items-3           | org.openhab.core.items.GroupItem     | 149 | Called stateUpdated for Balkon_Steckdose (Type=SwitchItem, State=ON, Label=Balkon Steckdose, Category=poweroutlet,
 Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) with state 'OFF' (old state is 'NULL')

@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

Thanks. I added additional logging around the scheduling. Can you update (and restart) again? The additional logging is only for Balkon_Steckdose, otherwise your log would be totally unreadable.

@florian-h05
Copy link
Contributor Author

florian-h05 commented Apr 25, 2023

I now get the following logs:

20:00:15.402 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Balkon_Steckdose' changed from OFF to ON
20:00:15.480 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1236 to 1231
20:00:15.530 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Available' changed from 1752 to 1756
20:00:17.048 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Doorbell_Image' changed from raw type (image/jpeg): 30931 bytes to raw type (image/jpeg): 31000 bytes
20:00:18.556 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Used' changed from 1231 to 1233
20:00:18.614 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Memory_Available' changed from 1756 to 1754
20:00:19.332 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'ups_upsLoadCurrent' changed from 21 % to 20 %
20:00:19.344 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'ups_calculatedPower' changed from 88.2 to 84.0
20:00:19.348 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'ups_upsUnnormal' received command OFF
20:00:19.357 | INFO  | OH-eventexecutor-1   | penhab.event.ItemStatePredictedEvent | 172 | Item 'ups_upsUnnormal' predicted to become OFF
20:00:20.520 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'Balkon_Steckdose' received command OFF
20:00:20.592 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 18 for listener FF_Balcony (Type=GroupItem, Members=3, State=NULL, Label=Balkon, Category=balcony, Tags=[Terrace], Groups=[gFF])
20:00:20.597 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 19 for listener gSteckdose (Type=GroupItem, BaseType=SwitchItem, Members=1, State=OFF, Label=Steckdosen, Category=poweroutlet, Groups=[gHome])
20:00:20.600 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 20 for listener org.openhab.core.persistence.internal.PersistenceManagerImpl@41ba995d
20:00:20.610 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Balkon_Steckdose' changed from ON to OFF
20:00:21.302 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'Balkon_Steckdose' received command ON
20:00:21.383 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=ON, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'ON' with counter 21 for listener FF_Balcony (Type=GroupItem, Members=3, State=NULL, Label=Balkon, Category=balcony, Tags=[Terrace], Groups=[gFF])
20:00:21.386 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=ON, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'ON' with counter 22 for listener gSteckdose (Type=GroupItem, BaseType=SwitchItem, Members=1, State=OFF, Label=Steckdosen, Category=poweroutlet, Groups=[gHome])
20:00:21.389 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=ON, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'ON' with counter 23 for listener org.openhab.core.persistence.internal.PersistenceManagerImpl@41ba995d

@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

So it seems they are really scheduled but not executed. I don't understand how that can happen. According to the documentation a rejected execution should result in a RejectedExecutionException but that obviously does not happen. I have now changed to a scheduled thread pool, let's see if that helps.

@florian-h05
Copy link
Contributor Author

Hmm, very weird.

Should I now update the bundle again and restart?

@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

Yes please.

@florian-h05
Copy link
Contributor Author

florian-h05 commented Apr 25, 2023

Unfortunately this did not help. What‘s really interesting is, that it works for none of my Group:Switch:OR(ON,OFF), but works fine e.g. for my contacts group Group:Contacts:OR(OPEN, CLOSED).

EDIT: No, it does not work for my contacts group either.

EDIT: After setting the threadpool size of the items threadpool to 20, I got it working (at least for a short time):

21:40:15.734 | INFO  | OH-eventexecutor-1   | openhab.event.ItemCommandEvent       | 172 | Item 'Balkon_Steckdose' received command OFF                                                                      
21:40:15.815 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 6 for listener FF_Balcony (Type=GroupItem, Members=3, State=NULL, Label=Balkon, Category=balcony, Tags=[Terrace], Groups=[gFF])  
21:40:15.825 | ERROR | OH-items-6           | org.openhab.core.items.GenericItem   | 149 | Executing job '6'                                                                                                 
21:40:15.828 | ERROR | OH-items-6           | org.openhab.core.items.GroupItem     | 149 | Called stateUpdated for Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) with state 'OFF' (old state is 'NULL')                                                                                                                
21:40:15.825 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 7 for listener gSteckdose (Type=GroupItem, BaseType=SwitchItem, Members=1, State=ON, Label=Steckdosen, Category=poweroutlet, Groups=[gHome])                                                                                                                                                                                                  
21:40:15.840 | ERROR | OH-items-13          | org.openhab.core.items.GenericItem   | 149 | Executing job '7'                                                                                                 
21:40:15.846 | ERROR | OH-items-13          | org.openhab.core.items.GroupItem     | 149 | Called stateUpdated for Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) with state 'OFF' (old state is 'ON')                                                                                                                  
21:40:15.856 | ERROR | OH-items-6           | org.openhab.core.items.GroupItem     | 149 | Called stateUpdated for gSteckdose (Type=GroupItem, BaseType=SwitchItem, Members=1, State=OFF, Label=Steckdosen, Category=poweroutlet, Groups=[gHome]) with state 'OFF' (old state is 'NULL')                                                                                                                                  
21:40:15.857 | ERROR | OH-eventexecutor-1   | org.openhab.core.items.GenericItem   | 149 | Scheduled update from Balkon_Steckdose (Type=SwitchItem, State=OFF, Label=Balkon Steckdose, Category=poweroutlet, Tags=[PowerOutlet], Groups=[FF_Balcony, gSteckdose]) for 'OFF' with counter 8 for listener org.openhab.core.persistence.internal.PersistenceManagerImpl@60578b33                                             
21:40:15.871 | ERROR | OH-items-6           | org.openhab.core.items.GenericItem   | 149 | Executing job '8'                                                                                                 
21:40:15.894 | INFO  | OH-eventexecutor-1   | hab.event.GroupItemStateChangedEvent | 172 | Item 'gSteckdose' changed from ON to OFF through Balkon_Steckdose                                                 
21:40:15.914 | INFO  | OH-eventexecutor-1   | openhab.event.ItemStateChangedEvent  | 172 | Item 'Balkon_Steckdose' changed from ON to OFF

@florian-h05
Copy link
Contributor Author

I am currently experimenting a bit, and setting the threadpool size to 400 seems to migitate the problem, which also is type of proof that there is some problem with threading.

Reading a bit inside org.openhab.core.common.ThreadPoolManager to find out with which settings openHAB creates ThreadPoolExecutors:
For org.openhab.core.common.QueueingThreadPoolExecutor, the maximum pool size is set to the pool size defined in the config file or the default (which seams to be 5).
It seems that the tasks are rejected, however the RejectedExecutionException is not thrown, because inside the openHAB Core class QueueingThreadPoolExecutor, a handler for rejected executions is set.

@florian-h05
Copy link
Contributor Author

I have added some logging to the QueueingThreadPoolExecutor for the items thread pool, and without manually setting a high number for the thread pool size, for my 900+ Items the QueueingThreadPoolExecutor is basically queueing tasks the whole time because there are not enough threads for execution, and the queue thread seems to not be able to execute the queue in a reasonable amount of time. The general implementation with the queue thread must work, for MapDbPersistence it works fine.

Setting the thread pool size of the items thread pool high enough solves this problem.

@J-N-K
Copy link
Member

J-N-K commented Apr 25, 2023

How many tasks are scheduled for each update? And how many state updates do you see (per second)? I guess we should figure out why it takes so long to process the queue.

@florian-h05
Copy link
Contributor Author

florian-h05 commented Apr 25, 2023

For each update, at least one task for persistence is scheduled plus for each group the Item is a a member of. Most of my Items are at least member of one group, so per state update at least 2, usually 3 tasks scheduled.
The problem is, that there are nested groups.
If I control on Balkon_Steckdose, only the FF_Balcony and the gSteckdose groups and the Persistence listeners are informed by a task, but if I change my light, the gLicht, F2_Florian, F2_Florian_Licht groups are informed. On update of F2_Florian, gF2 is informed, and then gHome. Plus I‘d persistence in each step as well. Switching my light causes 35 tasks to be executed.

I cannot really say how many state updates are there per second, however my added logging tells me how many tasks are executed or queued per second: In peaks, there are over 30 messages per second. With a thread pool size of 200, there are regularly tasks queued which are executed short time later when there is less load.
But with the default thread pool size of 5 (if I got that right from the code), I think tasks are faster added to the queue than executed.

I have to admit that I have 900 Items and a lot of traffic, however I most aren‘t so talky (e.g. my KNX), the system info binding might be a candidate for many Item state updates in short time and very often.

@J-N-K
Copy link
Member

J-N-K commented Apr 26, 2023

I think persistence might indeed be the problem here. The state change listener receives, checks and stores the values from the state change event in the same thread. This might block threads for a longer time and exhaust the thread pool.

@florian-h05
Copy link
Contributor Author

I have verified this by removing the setting for the items thread pool size and removing all persistence services. Persistence is the problem here!
I can also see in the logs, that tasks are queued now and short time after the queued tasks are executed, then finally [re.common.QueueingThreadPoolExecutor] - Queue for thread pool 'items' fully processed - terminating queue thread.

@J-N-K
Copy link
Member

J-N-K commented Apr 26, 2023

That's a very tricky one. Originally my idea was to just schedule the events in the PersistenceManager but that is not possible. The reason is that we pass the Item to the persistence services and if we do that, the state may already have changed when the persistence service executes .getState.

@florian-h05
Copy link
Contributor Author

florian-h05 commented Apr 27, 2023

What if the jobs for the PersistenceManager listener were scheduled in its own thread pool?

@J-N-K
Copy link
Member

J-N-K commented Apr 27, 2023

I think the "real" issue is that she JavaDoc of PersistenceService.store

/**
     * Stores the current value of the given item.
     * <p>
     * Implementors should keep in mind that all registered {@link PersistenceService}s are called synchronously. Hence
     * long running operations should be processed asynchronously. E.g. <code>store</code> adds things to a queue which
     * is processed by some asynchronous workers (Quartz Job, Thread, etc.).
     *
     * @param item the item which state should be persisted.
     */

is not well implemented in the persistence services. I checked at least InfluxDB and there the store operation is executes synchronously.

@florian-h05
Copy link
Contributor Author

So the root problem is not the core architecture but the implementation in the persistence services, right?

FYI I am using InfluxDB and MapDB.

@J-N-K
Copy link
Member

J-N-K commented Apr 27, 2023

Can you try updating the influxdb bundle with https://janessa.me/esh/org.openhab.binding.persistence.influxdb-4.0.0-SNAPSHOT.jar? I think I found a good solution for that. It also enables batch updating which will be very useful for #3000.

@florian-h05
Copy link
Contributor Author

I wasn‘t able to use the bundle you provided, however I compiled your InfluxDB PR and deployed it, but the problem persists.

It seems that it is better now. With 40 threads for the items thread-pool is sometimes able to execute the queued tasks, however the problem persists.

@J-N-K
Copy link
Member

J-N-K commented Apr 27, 2023

If it needed 900 before that's at least a factor 20 better 😀

Was that with influxdb only or did you also re-enable mapdb?

@florian-h05
Copy link
Contributor Author

It needed 400 before, but this means still much better.
Even though 40 might be not enough for now to avoid having to wait to long for the execution of queued jobs.

MapDB was re-enabled.

@florian-h05
Copy link
Contributor Author

Good news!
I‘ve just tested it with only InfluxDB persistence installed and then it works fine without having to increase the thread pool size! The QueueingThreadPoolExecutor is only using the queue from time to time and is able to execute the queue very soon after creating it.

So it is verified that MapDB is causing trouble then…

@J-N-K
Copy link
Member

J-N-K commented May 4, 2023

Since it has been confirmed that this is an issue in adding, I'll close it here.

@J-N-K J-N-K closed this as completed May 4, 2023
@J-N-K J-N-K added external bug A problem or unintended behavior of an external library and removed bug An unexpected problem or unintended behavior of the Core labels May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external bug A problem or unintended behavior of an external library
Projects
None yet
Development

No branches or pull requests

2 participants