ubus call poe info status is randomly stuck on (cold/re)boot to 'unknown' for all ports #10

Comments
I'm going to abstain from speculating until we can get some debug logs.
Trying to redirect the output from the debug run: the laptop was connected to port 3 the whole time, one ssh session for the log and a second ssh session for the port tests (even ports p2-p24, odd ports p1-p23). The log is split into part1 / part2 because of the pastebin upload size limit.
Analysis

There's your problem. The MCU responds with a packet, and further down we become de-synchronized: our view of what we received is likely not what the MCU sent. There are probably a bunch of lost error messages about bad and mismatched packets; you might see them in the logs (see the checksum sketch after this comment).

I've seen this before when two instances of realtek-poe were running simultaneously. I doubt that's the case here, since we're more than 3000 packets in before this happens.

Next steps

Not much we can do but implement better error logging, and see what it turns up.
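For context on the 'bad checksum' talk above, this is a minimal sketch of the checksum rule, assuming the 12-byte frame layout exchanged with the MCU where the last byte is the sum of the first 11 bytes modulo 256. The check_frame helper and the example byte values are made up for illustration; they are not taken from the logs in this thread.

```sh
# Hypothetical helper: verify the trailing checksum of one captured frame.
# Assumes 12-byte frames whose last byte is the sum of the first 11 bytes
# modulo 256 (lower-case hex input expected).
check_frame() {
    [ $# -eq 12 ] || { echo "need 12 hex bytes" >&2; return 1; }
    sum=0 n=0
    while [ $n -lt 11 ]; do
        sum=$(( (sum + 0x$1) & 0xff ))
        shift
        n=$((n + 1))
    done
    want=$(printf '%02x' "$sum")
    if [ "$want" = "$1" ]; then
        echo "checksum OK"
    else
        echo "checksum BAD: expected $want, got $1"
    fi
}

# A well-formed frame passes, a corrupted last byte fails:
check_frame 20 01 00 00 00 00 00 00 00 00 00 21
check_frame 20 01 00 00 00 00 00 00 00 00 00 ff
```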
Capturing the real ... However, all previous logs were taken after disabling the realtek-poe service and starting from a cold boot, so no extra instance was running. This time I do have a log of the funny sticky disabled ports 4 & 7 (possibly the other GitHub issue, #9), taken after a reboot. So this time the output is after a reboot, not a cold boot. PoE power comes back on after the reboot even though no realtek-poe service is started (so there is no ubus output), but after starting realtek-poe -d this was the output:
Still on CI22, but when doing ~10 warm reboots and starting ... Maybe it's related to boot timing, or it's just a timing issue altogether, as mentioned in that issue: resending failed commands may be good enough.
Just tried CI29, and after the first warm reboot all ports show status unknown, although PoE is being delivered to (re)connected devices. Doing a ... However:
First cold boot with CI29 also shows all ports as unknown. It's still random, though: sometimes it initializes the status for all ports correctly and sometimes it does not. Any idea how to log daemon output from ...?
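One common way to watch a daemon's messages on OpenWrt is to follow the syslog ring buffer. This is a sketch of the general mechanism, assuming realtek-poe logs via syslog when started as a service; the filter pattern is a guess, not necessarily the answer given in this thread.

```sh
# Follow the log ring buffer and filter on the daemon name (pattern is an
# assumption; adjust if the messages are tagged differently).
logread -f -e realtek-poe
```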
A lot of lines with errors, which keep coming until I ...
Ugh! Error logging still sucks. Look higher up; there should be a different message that causes it all. I'm looking for stuff like: ...
First I had to increase the logd size in ...
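For reference, the in-RAM log buffer size is normally a UCI option on OpenWrt. A sketch of how it is typically raised; the 512 KiB value is only an example and not what was actually used here.

```sh
# Assumption: the buffer in question is logd's ring buffer, sized in KiB via
# the standard system option.
uci set system.@system[0].log_size=512
uci commit system
/etc/init.d/log restart
```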
Seems like upgrading that message to a warning worked a lot better in my head than in practice. I got the message back down in priority with CI#34. I'm still scratching my head over how to fix these 'bad checksum' packets.
Just tried CI34, but this package is broken: it won't start the init.d service, therefore no ...

EDIT1: ...
I'm sorry. I introduced a double free() when rebasing and didn't catch it. CI35 addresses that snafu.
Some early testing of the CI41 package already shows that PoE port 1 comes up by default ...

Although the service crashed, the PoE mechanism is still active and delivering PoE to all ports.

Although the status hangs on ...
Hey, @walterav1984. I'm sorry about the crashes in CI41. I had split the logging into a separate PR, and when I rebased the retry mechanism I accidentally introduced a use-after-free. The retry mechanism is still in "draft" because there are some issues with this solution. I have another solution in mind (#19) that I'm hoping to have ready for testing soon. No point in testing that yet, since the CI build doesn't include logging of error packets. To address your log and why the ports worked:

This is the "port power stats" command. If we're sending this, then most likely we've configured the ports, so I would expect the ports to work even though the status isn't updated.
@walterav1984, can you please give CI #44 a spin?
Tried CI44 and did around ~30 PoE device port swaps without a problem: device status searching/requesting/delivering power was all correct. However, already at the first reboot all ports were stuck at ... Stupidly enough I did not log the errors and went straight to CI45, which contains the CI44 fix if I'm not mistaken. After 2 fine reboots with CI45, on the 3rd reboot all ports are stuck again at status ...

After 2 fine reboots also with CI46, on the 3rd reboot all ports are stuck again at status ...

Doing a ...

EDIT: ...
CI#44 is the same as CI#45: when the pull request was merged, it triggered a new CI run, which became CI#45. I don't think the ... I also think this is more likely to occur on a cold boot. We're not sending a "global disable" command, so if ports have already been configured, there's less work for the MCU to do. This would be consistent with the hypothesis that the MCU is overloaded. Could this also be a completely random transmission failure? A slow optoisolator design? Possibly, but unlikely. Might be a good idea to read out the serial port error counters, if they exist.
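If the UART is driven by the common 8250/16550 driver, the kernel already keeps per-port error counters. A sketch of where they usually show up, assuming ttyS1 (line 1) is the MCU UART on this board:

```sh
# Assumption: the standard 8250 serial driver is in use; fe/pe/oe/brk are
# framing/parity/overrun/break error counts when present.
cat /proc/tty/driver/serial
# narrow it down to the second UART (line 1):
grep '^1:' /proc/tty/driver/serial
```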
I can test this in practice by reading the serial line, though I am wondering whether something more sensitive than a Saleae Logic clone would make sense here: perhaps when the system is busy, other data lines near UART1 are causing noise to spread. A similar idea is that the RTL838x might not be that well-constructed, or that this implementation of it (the GS1900-24HPv1 board) is not ideal for operation while the system is busy doing other things. Note that the SoC package can optionally switch UART1 from pins 116 and 117 to a separate SPI line (based on the GMII_INTF_SET register). I have no doubt that this is set correctly; I'm just making the point that it might not be the cleanest line implementation on the board, and the operation of other peripherals might be harming communication between the MCU and ttyS1. It's clear that delaying when realtek-poe starts also lowers the rate of problems; we can reproduce that. How is a different question, and how we fix it is yet another (given the existence of #15 anyhow).
If we're looking for signal integrity issues, we need an oscilloscope, not a logic analyzer. Does the signal reach the correct voltage level? Does it cross the isolation barrier cleanly? We don't see these issues on the vendor firmware, right? What do you think are the chances of this being an analog problem at a rate of 19200 baud?
Does this still happen with CI57?
Given we haven't seen any complaints about this in the past two weeks, I am closing this as resolved.
Is it still useful to test some of these late CI builds on the 22.03 RC, since ...?
Only if you see issues with the package in 22.03.
Just tested 22.03 with the realtek-poe package from snapshot, since I didn't find one yet in the 22.03 release. After the first cold boot, all ports are stuck on status 'unknown' again and PoE power output is not working for those configured ports. The usual ... Did patch CI57 already land in realtek-poe_382c60e7-1_mips_4kec.ipk, or is it still useful to test? A workaround for me would probably be a couple of seconds of sleep and a poe restart in '/etc/rc.local'.
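A sketch of that workaround, assuming the stock /etc/rc.local layout; the 10-second delay is an arbitrary guess rather than a measured value.

```sh
# /etc/rc.local -- runs at the end of boot; keep the final "exit 0".
# Let the boot-time serial traffic settle, then restart the poe service once,
# in the background so boot itself isn't delayed.
( sleep 10 && /etc/init.d/poe restart ) &

exit 0
```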
I see. I need to update the packages feed. CI57 should fix this in the meantime. Please make a note of it here if it doesn't.
Actually, CI 58 is the latest and greatest (realtek-poe v1.0).
Just tried CI58, which still manifests the ...

Is there a way to start the ...?
After the first cold boot this is the output: ...

Don't mind the poe config I used; I just enabled ports 1-8 plus 9 and 17, so nothing strange there...
Looks like I need to add a command to dump the state machine status. Meanwhile, you can edit ...
Did some tests in the past and recently with ...
Even simpler: edit ...
Did try that earlier with single/double quotes, backticks, or even wrapping it in a $() variable, but realtek-poe won't start from /etc/init.d/poe at all as soon as a parameter/argument is added to the program in that script. Starting it from /etc/rc.local is possible; however, that one runs later during boot, so it didn't trigger the unknown-port issue.
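For comparison, this is roughly how a procd-style init script passes a flag to its daemon. The contents below are an illustrative sketch, not the actual /etc/init.d/poe shipped by the package; the binary path /usr/bin/realtek-poe and the START value are assumptions.

```sh
#!/bin/sh /etc/rc.common
# Hypothetical procd init script; only start_service() matters for adding
# the -d flag to the daemon's command line.
USE_PROCD=1
START=80

start_service() {
    procd_open_instance
    # -d asks realtek-poe for debug output; the stdout/stderr params forward
    # whatever it prints into the system log.
    procd_set_param command /usr/bin/realtek-poe -d
    procd_set_param stdout 1
    procd_set_param stderr 1
    procd_set_param respawn
    procd_close_instance
}
```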
Does this not work? ...
🤦♂️ It does indeed work with debug output, see the pastebin. This is after the first reboot, with no PoE devices connected, only management on eth3.
Gah! pastebin is a minefield of advertisements and trackers.

TL;DR

This will take a bit to enginerd a code fix. A temporary workaround is to make realtek-poe start up later in the boot process.

Analysis

That's problematic because the actual packet the MCU sent is ... What's happening is an off-by-n error in interpreting packets from the stream. It's very likely the MCU was still transmitting something when realtek-poe started, and that threw us off. If we had a serial trace, we'd likely see the bootloader send commands, then realtek-poe start up while the MCU is still replying. The fix is a better mechanism to handle checksum errors. That needs to account for both extra bytes in the stream (the current situation) and missing bytes. The latter needs to discard the entire packet, which affects the reply handler as well (see the resync sketch below).
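To make the off-by-n idea concrete, here is a minimal resync sketch over a captured stream of hex bytes. It assumes the same 12-byte, additive-checksum framing as the earlier sketch; resync_stream and the byte values are illustrative only, not the actual fix being implemented in realtek-poe.

```sh
# Hypothetical resync sketch: emit only 12-byte frames whose trailing
# checksum matches; on a mismatch, drop a single byte and retry, so stray
# extra bytes in the stream eventually get skipped.
resync_stream() {
    while [ $# -ge 12 ]; do
        sum=0 i=0
        for b in "$@"; do
            [ $i -ge 11 ] && break
            sum=$(( (sum + 0x$b) & 0xff ))
            i=$((i + 1))
        done
        ck=$(printf '%s\n' "$@" | sed -n '12p')
        if [ "$(printf '%02x' "$sum")" = "$ck" ]; then
            n=0
            printf 'frame:'
            while [ $n -lt 12 ]; do printf ' %s' "$1"; shift; n=$((n + 1)); done
            printf '\n'
        else
            shift    # de-synchronized: discard one byte and try again
        fi
    done
}

# Example: one stray leading 0xff byte is skipped before a valid frame.
resync_stream ff 20 01 00 00 00 00 00 00 00 00 00 21
```

Handling the other half, missing bytes (discarding the whole packet and fixing up the reply handler), is the harder part and isn't shown here.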
🤦♂️ To quote myself... and repeat the past!
Here is a full log from a cold boot with unknown status: ...
Thank you! Actually, the full log is even more interesting. Initialization starts off and proceeds fine:
And then the MCU issues an error:
So far so good. That packet is something realtek-poe can handle. However, our problems start right after that:
Here, bytes start missing. I'm not sure about the ... The fix is likely what I already discussed. I have some ideas. Reopening the issue.
ZyXEL GS1900-24HPv1
realtek-poe CI22 version

Sometimes it shows correct realtime status info after a (cold/re)boot for ubus call poe info, and sometimes it's stuck in the 'unknown' state for all ports although they are actually 'Delivering power' fine, with PoE devices powered on. A /etc/init.d/poe restart is the only remedy to make the status correct again, with the downside of temporarily disabling PoE.