Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor at low intervals #1626

Open
naugehyde opened this issue Sep 17, 2023 · 46 comments

Comments

@naugehyde
Copy link

Previously reported that CAN0 randomly stops streaming data (via canboatjs) from a Pican-M hat on a RPI4b connected to a N2k network. [#1622]

Okay so after a LOT of investigation I've been able to isolate the problem and the bug is (mostly) recreatable.

The problem stems from the signalk-rpi-monitor interval. When the monitor plugin is set to report at an interval of say 5s (the default is 60s) CAN0 is likely to fail after 20min or so. The lower the interval the more likely the failure will occur in short order.

I haven't isolated the root cause of the problem (I'm guessing that it has something to do with nodejs's intermittent garbage collection cycle) but it's a fairly serious problem as the only way to get one's instruments back is to restart the Signalk server process.

I'd bet that even at the default 60s interval the CAN bus will, in time, fail (silently). At sea, this could be problematic to say the least.

To recreate the problem you'll need a CAN0 connection of some kind on an RPI 3/4b (only tested on 4b w/Pican-M) and signalk-rpi-monitor installed and its interval set to a sufficiently low value.

For the time being, boaters with signalk-rpi-monitor installed on a RPI with a CAN0 bus should, at the very least, take warning and definitely not set their rpi-monitor interval to a value less than the default.

As for a longer term solution, further investigation into canboatjs is warranted as the failure is not occurring at the system level (candumpjs/candump continue to function normally after signalk's CAN0 data connection fails).

@tkurki
Copy link
Member

tkurki commented Sep 17, 2023

Whoa. Thanks for investigating and sharing. Created sberl/signalk-rpi-monitor#9 with link here and posted link to this on Slack.

@tkurki
Copy link
Member

tkurki commented Sep 17, 2023

Maybe we close the other issue and edit the title of this one to mention CAN and signalk-rpi-monitor?

@naugehyde naugehyde changed the title CAN0 failure with RPI4b and Pican-M hat [UPDATE] CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor Sep 17, 2023
@naugehyde
Copy link
Author

I'll be interested to see if others can recreate the problem on different hardware and different canboat connections. I have a waveshare CAN hat rig that I'll try later today.

In the meantime, setting the @canboat/canboatjs option noDataReceivedTimeout to an absurdly large value (60s) to see if the problem is mitigated.

I suspect the problem is only triggered by signalk-rpi-monitor (which of necessity uses the spawn() function repeatedly) but may be the inevitable result of a garbage collecting program environment.

@naugehyde naugehyde changed the title CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor at low intervals Sep 17, 2023
@naugehyde
Copy link
Author

One stopgap solution against any plugin (not just rpi-monitor) that's doing whatever rpi-monitor is doing and causing CAN0 to go down is to set the "noDataReceivedTimeout" suboption for the connection in your settings.json.

e.g.:
"pipedProviders": [
{
"enabled": true,
"id": "can0",
"pipeElements": [
{
"options": {
"logging": true,
"subOptions": {
"noDataReceivedTimeout":2,
"interface": "can0",
"type": "canbus-canboatjs",
"uniqueNumber": 1159732,
"useCanName": true,
"filtersEnabled": false,
"filters": [],
"overrideTimestamp": true
},
"type": "NMEA2000"
},
"type": "providers/simple"
}
]
},

@mairas
Copy link
Contributor

mairas commented Sep 18, 2023

@naugehyde how is the can0 interface configured? In /etc/network/interfaces or /etc/network/interfaces.d? Could you share the configuration?

@mairas
Copy link
Contributor

mairas commented Sep 18, 2023

My theory: PICAN-M uses MCP2515 as the CAN controller (just like most other Pi devices out there). That IC has an RX queue of only 2 frames. Signalk-rpi-monitor spawns external processes to read the information. Even though this should be an async operation, I dunno, maybe this still causes some blocking in nodejs main loop. If this is the case, it could make the RX buffers sometimes overflow. I believe overflows cause the RX error counter to increase, and once REC goes above 127, the CAN controller enters an Error-Passive state, causing any transmissions to cease.

@naugehyde
Copy link
Author

naugehyde commented Sep 18, 2023 via email

@mairas
Copy link
Contributor

mairas commented Sep 18, 2023

@naugehyde could you please provide can0 configuration settings I requested above?

@mairas
Copy link
Contributor

mairas commented Sep 18, 2023

Note that IF the issue is due to the RX buffers overflowing, canboatjs has nothing to do with it and it can do nothing about it.

@naugehyde
Copy link
Author

naugehyde commented Sep 18, 2023 via email

@mairas
Copy link
Contributor

mairas commented Sep 18, 2023

I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller after 100 ms if the controller enters bus-off state. That might not apply in this case because bus-off kills both RX and TX, and you said RX still works. But it's a good in any case.

OK, and finally found:

If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):

ip -details -statistics link show can0

What does it report as state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in passive state; TX is disabled but RX still works. And BUS-OFF is completely dead.

With this, we can decisively narrow down the issue; either to the CAN controller & kernel driver or userspace.

@naugehyde
Copy link
Author

naugehyde commented Sep 18, 2023 via email

@naugehyde
Copy link
Author

naugehyde commented Sep 18, 2023 via email

@naugehyde
Copy link
Author

naugehyde commented Sep 18, 2023 via email

@mairas
Copy link
Contributor

mairas commented Sep 19, 2023

OK, interface is fine both before and after. And it's not about dropped packets because the dropped count didn't change, and error count remains zero. So it is a userspace issue!

@naugehyde
Copy link
Author

naugehyde commented Sep 19, 2023 via email

@tkurki
Copy link
Member

tkurki commented Sep 22, 2023

@naugehyde 👏 for investigating this! Let's concentrate communications here for the time being, as we dont' know what combination triggers the problem.

I think the best way forward would be to create as low level repro for this as we can, starting with just socketcan and spawn. Could you try something like this?

@naugehyde
Copy link
Author

Thanks for the code. Will put it to use soon.

If it's a resource issue owing to the overhead of frequent shell creation, Nodejs's garbage collection cycle could be running long enough for the CAN0 nodejs stream to programmatically timeout . If this is true (will test inside SK first) running the test outside of SK will change the running memory footprint so I may have to adjust the V8 heap size to invoke the timeout.

@mpvader
Copy link
Contributor

mpvader commented Sep 22, 2023

Hey all, possibly off topic, but did want to share here:

that Pican-M is using the MCP2515, which has only two RX buffers.

That puts quite the requirement on speedy processing by the main CPU.

When designing hardware on our end, and needing an external CAN controller (as opposed to one inside a microcontroller, which almost always has more RX buffers than only 2); we quite quickly discarded the MCP2515 as an option.

And worked on/with the MCP2518 or 2519 instead. Much more buffers. And in case of N2K, especially the existence of fastpackets adds to the requirement of having sufficient RX buffers.

Supported in Linux kernel 5.10 and up if I’m not mistaken.

in other words: on my boat, I’d try to keep it free from MCP2515. Missing packets leads to the weirdest of issues, very hard to diagnose.

@mairas
Copy link
Contributor

mairas commented Sep 22, 2023

I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge. My boat server, based on a Raspberry Pi CM4 and Waveshare's isolated CAN HAT (with MCP2515), at the moment has been running for 8 days straight. CAN RX packet counter is at 82292123 (627.7 MiB) and the dropped packets counter is at 5227. That's one dropped packet every 2 minutes. Not nothing, but the packet loss rate is 5227/82292123, or 0.006%. Given the repetitive nature of N2K traffic, that's not a huge deal. But this is something where YMMV a lot.

@naugehyde
Copy link
Author

I agree that the RX buffers on the MCP2515 may be insufficient for some scenarios. In a large harbor like mine (Marina Del Rey) the N2K stream is flooded with AIS position updates.) It doesn't explain why the canboatjs fails silently as explained earlier in the thread but a more robust CAN receiver could only help matters.

I would happily test an MCP2518 with Signalk, if I could find a board for less than $20 that easily interfaced with an RPi.

Do you have any suggestions, @mpvader? I found this product but it's not clear from the docs how to interface it with an RPi.

@mpvader
Copy link
Contributor

mpvader commented Sep 22, 2023

I would happily test an MCP2518 with Signalk, if I could find a board for less than $20 that easily interfaced with an RPi.
Do you have any suggestions, @mpvader?

Sorry, I wouldn't know. I checked a bit, and can't find any for that price. There are some, under EUR 100. It looks like the low cost makers have not switched over (yet).

I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge.

Yes I agree completely on the YMMV and I can see that it will work in most cases. It will depend on business of the canbus vs CPU idle time. Note that not all can controllers have proper implementation of error and drop counters. I don't remember the details around that though. And while at it, the adapter/protocol you really don't want for a complex protocol like NMEA2000 is SLCAN.

Btw. we're looking at also from a view of having the network 100% reliable for changing configuration, and so forth. Not just repetitive data.

@mpvader
Copy link
Contributor

mpvader commented Sep 22, 2023

To add: an implementation that we also tested extensively, and found to be very reliable, is the candlelight project, which uses the gsusb driver between the CAN-CPU and the Linux-CPU.

@sailoog
Copy link
Contributor

sailoog commented Sep 25, 2023

Hi, our MacArthur HAT v1.2 is about to be released as stable and it has the MCP2518 tested and working: https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1

All the prototypes are installed and working and I only have one on hand now, but we will soon go into production. Maybe we can leave you a prototype temporarily, we are in Barcelona and Washington, where are you?

@naugehyde
Copy link
Author

naugehyde commented Sep 25, 2023 via email

@astuder
Copy link
Contributor

astuder commented Sep 26, 2023

I'd happily test a prototype if I had one. I am in Los Angeles as it happens.

I can hook you up with a MacArthur HAT with the MCP2518. Drop me an email to adrian at adrianstuder dot com

@tkurki
Copy link
Member

tkurki commented Oct 5, 2023

@naugehyde any news on this? Have you had the chance to test the minimal repro we talked about above?

@naugehyde
Copy link
Author

@tkurki Currently testing with a MCP2518 equipped CAN hat courtesy @astuder. The MCP2518 has a much larger buffer space than the MCP2515 that's on the PICAN-M and the Waveshare CAN hat.

Just got the rig up and running today. Was not able to recreate with the MCP2518 hat but it's just day one and only had time for half an hour's testing.

Will see if I can recreate w/ the MCP2518 then get back to testing on a MCP2515 equipped hat.

It will be very interesting if I cannot recreate the bug on a MCP2518 equipped hat as the common denominator in my tests on two different boards (RPI 4b and a Le Potato) with two different CAN hats was the MCP2515 controller.

@naugehyde
Copy link
Author

@tkurki Should I test on 2.3.0 or 2.2.0? I note with interest that 2.3 has an upgraded canboatjs.

@tkurki
Copy link
Member

tkurki commented Oct 6, 2023

Latest makes most sense I guess.

@naugehyde
Copy link
Author

naugehyde commented Oct 12, 2023

@tkurki Tested the code you supplied (with modifications to increase the spawn rate) on an RPI-4b running Bullseye 64 bit and, alas, after about 12 minutes, the SOCKETCAN connection silently failed. It's worth noting that the CAN connection running on the same unit on a separate node process (with RPI-Monitor running at 30s intervals) at the same time did NOT fail.

I also tested CAN Connection (in SK) using a prototype MacArthur Hat that incorporates the MCP2518 chip (unlike the MCP2515 on the Pican-M hat it has CAN-FD capability and a robust buffer space) and that rig failed much like the Pican-M hat and a Waveshare CAN hat did in earlier tests.

Recommend that new releases provide a default of 2s to the noDataReceivedTimeout subOption for CAN connections (see: this)..

Still not sure why this is happening. Still not sure what (if anything) besides repeated spawn() calls will cause a CAN0 connection to silently fail.

@tkurki
Copy link
Member

tkurki commented Oct 12, 2023

@naugehyde just to be crystal clear:

What I am after is a reproducible bug report at https://github.com/sebi2k1/node-can. If I understand your comment correctly this has actually nothing to do with SK Server nor canboatjs (that uses socketcan).

Naturally you are very welcome to file it there yourself and link back here.

@tkurki
Copy link
Member

tkurki commented Oct 12, 2023

Oh forgot to add thanks for doing the leg work on this!

@naugehyde
Copy link
Author

naugehyde commented Oct 12, 2023

@tkurki my bad! Tested with SOCKETCAN not canboat. Used your code, just added additional intervals and removed the reconnect in the error code (never reached by the way).

OS: Linux piriot 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux

Node version: v18.17.0

NB: Tested overnight and wrote up comment first thing prior to any caffeine.

@naugehyde
Copy link
Author

@tkurki Just tested node 20. SOCKETCAN failed silently after about an 1/2 hour on node 20.6.1

NB: I am testing with AIS on in a busy harbor.

@naugehyde
Copy link
Author

@tkurki Check out: nodejs/node#14917

Wonder if the issue above can be reproduced in more recent node versions and if so, if it is related to the CAN failure we're seeing.

@naugehyde
Copy link
Author

@tkurki any luck recreating the issue on your end?

@humppafreak
Copy link

looks like i'm running into the same issue, or at least a very similar one:

  • rpi4, 8GB, running raspbian 11.7 bullseye 64bit
  • hat-stack consists of waveshare 2ch can hat, waveshare 2ch rs485 hat, waveshare GNSS hat (with the interrupt pin snipped off), and @mairas' sailor-hat v2.
  • signalk v. 2.2.0
  • signalk-rpi-monitor plugin is installed, but not enabled

at irregular intervals (anywhere from a few minutes to >24h), the can0 interface simply "goes silent" without any hint of an error in any logfile.
rx and tx packet counters in /proc/net/dev do not increase anymore, it's like someone simply pulled the plug on the interface.
in most cases, the can0 interface can be forced to work again by a simple ifconfig can0 down; sleep 5; ifconfig can0 up.
without the "sleep 5" in the restart commands, I get SIOCSIFFLAGS: Connection timed out on stdout and mcp251x spi0.0: MCP251x didn't enter in config mode in dmesg.

my current workaround is a simple shell script that monitors /proc/net/dev for changes in rx packet count, and does the ifconfig down/up dance as described above.

@naugehyde
Copy link
Author

naugehyde commented Oct 24, 2023 via email

@humppafreak
Copy link

humppafreak commented Oct 25, 2023

Ah, okay. In that case, I might have stumbled upon a completely different CAN issue... which, interestingly enough, seems to have disappeared completely since I appended trace_event=spi:* to my /boot/cmdline.txt in an attempt to debug what's going on on the SPI bus.

And to your other question: The Sailor-HAT v2 does not have an N2K interface (the v1 did..). The SH v2 is more or less just a "smart power supply" with a wide input voltage range (9-32V), a stabilized 5V output, and a few beefy supercaps to buy enough time for a clean shutdown on power failure / switch off.

@JonyIvy
Copy link
Contributor

JonyIvy commented Nov 1, 2023

I've had this happening several times last summer - no more N2K data shown in SK / KIP. I didn't investigate further, just rebooted the machine, so I can't tell which one of the errors I had.

CM4, Waveshare CM4 IO BASE B, Waveshare isolated dual CAN HAT (MCP2515).
signalk-rpi-monitor is running with 5s refresh.

Unfortunately I cannot test anything anytime soon, RPi sitting on my desk, boat far away.

@naugehyde
Copy link
Author

naugehyde commented Nov 1, 2023 via email

@JonyIvy
Copy link
Contributor

JonyIvy commented Nov 1, 2023

I already have noDataReceivedTimeout: 4 in my settings.json. So my problem might be the same as humppafreak's.

@humppafreak
Copy link

humppafreak commented Nov 2, 2023

I already have noDataReceivedTimeout: 4 in my settings.json. So my problem might be the same as humppafreak's.

in that case, you might want to try enabling SPI event tracing as described in my last comment.

@tkurki
Copy link
Member

tkurki commented Jan 22, 2024

@naugehyde I believe you had some new information about this?

@naugehyde
Copy link
Author

@tkurki I do. The testing process is very slow, (requires a NMEA network and the fault can take up to 8 hours to reproduce), so please forgive any lapses in communication on my part.

What I know now is: at least for my NMEA network the fault occurs when RPI-Monitor (or its equivalent) refresh is set for 5s 2.5s 1.25s, .625s etc. Remarkably, the fault does NOT occur at values inbetween the reported values. Ran the test at .411s for example which equated to maybe 50 deltas/s and at least after eight hours, CAN0 stayed up. I have no idea why this is the case. Very likely a timing issue of some kind. Very likely a problem deep inside nodejs.

The fault occurs in the latest build with the latest SK and RPI system software after considerably more elapsed time than earlier builds and systems, but remains catastrophic: CAN0 dies without a warning.

It's unlikely to occur in most setups and may not occur at all on other NMEA networks (I have no way of knowing for sure. I have take devices off the network lowering the number of deltas and the fault is still reproducible but I don't have access to other networks.)

The simple solution is to put a no data rec'd timeout into the settings.json as illustrated here (#1626 (comment)). As you know, a no data timeout will cause canboatjs to pause for the indicated amount of time in the event no data is rec'd and then restart the CAN0 connection. From my testing this will successfully keep the connection up.

It may be an obscure bug but it is real and it is catastrophic as the connection dies silently and a SK restart is required to repair.

I believe it's wise to have a NoDataReceivedTimeout default value >0 for CAN0 connections in the settings.json of the next SK version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants