-
-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor at low intervals #1626
Comments
Whoa. Thanks for investigating and sharing. Created sberl/signalk-rpi-monitor#9 with link here and posted link to this on Slack. |
Maybe we close the other issue and edit the title of this one to mention CAN and signalk-rpi-monitor? |
I'll be interested to see if others can recreate the problem on different hardware and different canboat connections. I have a waveshare CAN hat rig that I'll try later today. In the meantime, setting the @canboat/canboatjs option noDataReceivedTimeout to an absurdly large value (60s) to see if the problem is mitigated. I suspect the problem is only triggered by signalk-rpi-monitor (which of necessity uses the spawn() function repeatedly) but may be the inevitable result of a garbage collecting program environment. |
One stopgap solution against any plugin (not just rpi-monitor) that's doing whatever rpi-monitor is doing and causing CAN0 to go down is to set the "noDataReceivedTimeout" suboption for the connection in your settings.json. e.g.: |
@naugehyde how is the |
My theory: PICAN-M uses MCP2515 as the CAN controller (just like most other Pi devices out there). That IC has an RX queue of only 2 frames. Signalk-rpi-monitor spawns external processes to read the information. Even though this should be an async operation, I dunno, maybe this still causes some blocking in nodejs main loop. If this is the case, it could make the RX buffers sometimes overflow. I believe overflows cause the RX error counter to increase, and once REC goes above 127, the CAN controller enters an Error-Passive state, causing any transmissions to cease. |
Thanks, that's interesting.
FYI rpi-monitor spawns system shells and executes system calls (mpstat etc)
to get its data. It's possible one or more of the commands is blocking the
CAN IP device but I doubt it as CAN0 continues to function normally after
canboatjs has hiccupped (a candump open in another terminal will stream
continuously). It stands to reason that candump would reflect any blockage
and timeout/retry in its stream.
(As I noted earlier, setting a noDataReceivedTimeout value in the CAN0
suboptions causes canboatjs to retry after hiccupping and the problem is
minimized as far as SK is concerned.)
But yeah it remains a mystery why canboatjs is hiccupping at all. The
nodejs event loop should not be blocked by spawned system calls. If it is
the case that the CAN device buffer is overflowing then canboatjs needs to
handle it more gracefully. But that still doesn't answer the question of
what rpi-monitor is doing to cause the problem in the first place!
…On Mon, Sep 18, 2023 at 12:01 AM Matti Airas ***@***.***> wrote:
My theory: PICAN-M uses MCP2515 as the CAN controller (just like most
other Pi devices out there). That IC has an RX queue of only 2 frames.
Signalk-rpi-monitor spawns external processes to read the information. Even
though this should be an async operation, I dunno, maybe this still causes
some blocking in nodejs main loop. If this is the case, it could make the
RX buffers sometimes overflow. I believe overflows cause the RX error
counter to increase, and once REC goes above 127, the CAN controller enters
an Error-Passive state, causing any transmissions to cease.
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRGDNVY6FDKS4FZP2BLX27WTFANCNFSM6AAAAAA43WHRS4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@naugehyde could you please provide can0 configuration settings I requested above? |
Note that IF the issue is due to the RX buffers overflowing, canboatjs has nothing to do with it and it can do nothing about it. |
The config is the vanilla config prescribed by the manufacturer:
auto can0
iface can0 inet manual
pre-up /sbin/ip link set can0 type can bitrate 250000
up /sbin/ifconfig can0 up
down /sbin/ifconfig can0 down
NB: I've recreated the issue on a rig with a different mcp2515 can hat (
https://www.waveshare.com/rs485-can-hat.htm).
Not getting reports of buffer overflows from the driver-- as I stated
earlier, the HW continues to function normally (if candump/candumpjs are to
be believed). Also, it's only canboatjs that's timing out and failing. The
event loop continues to stream from other sources after canboatjs stops
reporting CAN0 data. And FWIW, canboatjs, when noDataReceivedTimeout is
set, will report receiving no data then retry and function normally after a
brief delay.
…On Mon, Sep 18, 2023 at 3:41 AM Matti Airas ***@***.***> wrote:
Note that IF the issue is due to the RX buffers overflowing, canboatjs has
nothing to do with it and it can do nothing about it.
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRB742EQEH57BJTHTQTX3AQN3ANCNFSM6AAAAAA43WHRS4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller after 100 ms if the controller enters bus-off state. That might not apply in this case because bus-off kills both RX and TX, and you said RX still works. But it's a good in any case. OK, and finally found: If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):
What does it report as state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in passive state; TX is disabled but RX still works. And BUS-OFF is completely dead. With this, we can decisively narrow down the issue; either to the CAN controller & kernel driver or userspace. |
Excellent! Thank you. Will give this a go ASAP and let you know.
…On Mon, Sep 18, 2023 at 7:29 AM Matti Airas ***@***.***> wrote:
I recommend adding "restart-ms 100" to the pre-up line. That makes the
kernel reset the CAN controller after 100 ms if the controller enters
bus-off state. That might not apply in this case because bus-off kills both
RX and TX, and you said RX still works. But it's a good in any case.
OK, and finally found:
If you can get TX to stop again (the error state you reported), try the
following command (before applying the above change):
ip -details -statistics link show can0
What does it report as state? Normally it should be ERROR-ACTIVE.
ERROR-PASSIVE means that the controller is in passive state; TX is disabled
but RX still works. And BUS-OFF is completely dead.
With this, we can decisively narrow down the issue; either to the CAN
controller & kernel driver or userspace.
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRB22QFJU7N2BMV2LILX3BLFDANCNFSM6AAAAAA43WHRS4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
FYI the bug is easily recreated by setting rpi-monitor's interval to 1s.
I'm not near my boat today and can only connect to it over 2G so it may
take me a day or so to get to testing. If you have access to a N2K network
atop an RPI with a mcp2515 controller, I'd love to know what you discover.
On Mon, Sep 18, 2023 at 8:04 AM Andrew Gerngross ***@***.***>
wrote:
… Excellent! Thank you. Will give this a go ASAP and let you know.
On Mon, Sep 18, 2023 at 7:29 AM Matti Airas ***@***.***>
wrote:
> I recommend adding "restart-ms 100" to the pre-up line. That makes the
> kernel reset the CAN controller after 100 ms if the controller enters
> bus-off state. That might not apply in this case because bus-off kills both
> RX and TX, and you said RX still works. But it's a good in any case.
>
> OK, and finally found:
>
> If you can get TX to stop again (the error state you reported), try the
> following command (before applying the above change):
>
> ip -details -statistics link show can0
>
> What does it report as state? Normally it should be ERROR-ACTIVE.
> ERROR-PASSIVE means that the controller is in passive state; TX is disabled
> but RX still works. And BUS-OFF is completely dead.
>
> With this, we can decisively narrow down the issue; either to the CAN
> controller & kernel driver or userspace.
>
> —
> Reply to this email directly, view it on GitHub
> <#1626 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ALTAIRB22QFJU7N2BMV2LILX3BLFDANCNFSM6AAAAAA43WHRS4>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
|
Before CAN0 failure:
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode
DEFAULT group default qlen 10
link/can promiscuity 0 minmtu 0 maxmtu 0
can state ERROR-ACTIVE restart-ms 0
bitrate 250000 sample-point 0.875
tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
clock 8000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped missed mcast
52248708 6531116 0 1233 0 0
TX: bytes packets errors dropped carrier collsns
14640232 1830069 0 0 0 0
After:
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode
DEFAULT group default qlen 10
link/can promiscuity 0 minmtu 0 maxmtu 0
can state ERROR-ACTIVE restart-ms 0
bitrate 250000 sample-point 0.875
tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
clock 8000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped missed mcast
52811163 6601425 0 1233 0 0
TX: bytes packets errors dropped carrier collsns
14767240 1845950 0 0 0 0
No relevant change that I can see.
|
OK, interface is fine both before and after. And it's not about dropped packets because the dropped count didn't change, and error count remains zero. So it is a userspace issue! |
Yes. Something is stalling the RX at the canboatjs level (socketcan
functions normally throughout) and causing a timeout that fails silently.
The offending plugin that triggered all of this (rpi-monitor) isn't making
any blocking calls that I can see (Node::spawn() for one is documented as
non-blocking), so I am mystified as to the root cause.
FWIW other users have reported canboatjs stream failures (
canboat/canboatjs#196 and
canboat/canboatjs#193) but I can't speak to the
details of their configurations.
My next test is to create a plugin that mimics rpi-monitor spawn'ed system
calls and see if there's an offending system call at the root of this or if
it is in fact Node::spawn() somehow messing things up.
In the meantime, SK users are advised to add the noDataReceivedTimeout
option to canboatjs as I detailed earlier in this thread (
#1626 (comment)
).
…On Tue, Sep 19, 2023 at 8:07 AM Matti Airas ***@***.***> wrote:
OK, interface is fine both before and after. And it's not about dropped
packets because the dropped count didn't change, and error count remains
zero. So it is a userspace issue!
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRG5TIFL7U7EMC2YTSDX3GYKRANCNFSM6AAAAAA43WHRS4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@naugehyde 👏 for investigating this! Let's concentrate communications here for the time being, as we dont' know what combination triggers the problem. I think the best way forward would be to create as low level repro for this as we can, starting with just |
Thanks for the code. Will put it to use soon. If it's a resource issue owing to the overhead of frequent shell creation, Nodejs's garbage collection cycle could be running long enough for the CAN0 nodejs stream to programmatically timeout . If this is true (will test inside SK first) running the test outside of SK will change the running memory footprint so I may have to adjust the V8 heap size to invoke the timeout. |
Hey all, possibly off topic, but did want to share here: that Pican-M is using the MCP2515, which has only two RX buffers. That puts quite the requirement on speedy processing by the main CPU. When designing hardware on our end, and needing an external CAN controller (as opposed to one inside a microcontroller, which almost always has more RX buffers than only 2); we quite quickly discarded the MCP2515 as an option. And worked on/with the MCP2518 or 2519 instead. Much more buffers. And in case of N2K, especially the existence of fastpackets adds to the requirement of having sufficient RX buffers. Supported in Linux kernel 5.10 and up if I’m not mistaken. in other words: on my boat, I’d try to keep it free from MCP2515. Missing packets leads to the weirdest of issues, very hard to diagnose. |
I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge. My boat server, based on a Raspberry Pi CM4 and Waveshare's isolated CAN HAT (with MCP2515), at the moment has been running for 8 days straight. CAN RX packet counter is at 82292123 (627.7 MiB) and the dropped packets counter is at 5227. That's one dropped packet every 2 minutes. Not nothing, but the packet loss rate is 5227/82292123, or 0.006%. Given the repetitive nature of N2K traffic, that's not a huge deal. But this is something where YMMV a lot. |
I agree that the RX buffers on the MCP2515 may be insufficient for some scenarios. In a large harbor like mine (Marina Del Rey) the N2K stream is flooded with AIS position updates.) It doesn't explain why the canboatjs fails silently as explained earlier in the thread but a more robust CAN receiver could only help matters. I would happily test an MCP2518 with Signalk, if I could find a board for less than $20 that easily interfaced with an RPi. Do you have any suggestions, @mpvader? I found this product but it's not clear from the docs how to interface it with an RPi. |
Sorry, I wouldn't know. I checked a bit, and can't find any for that price. There are some, under EUR 100. It looks like the low cost makers have not switched over (yet).
Yes I agree completely on the YMMV and I can see that it will work in most cases. It will depend on business of the canbus vs CPU idle time. Note that not all can controllers have proper implementation of error and drop counters. I don't remember the details around that though. And while at it, the adapter/protocol you really don't want for a complex protocol like NMEA2000 is SLCAN. Btw. we're looking at also from a view of having the network 100% reliable for changing configuration, and so forth. Not just repetitive data. |
To add: an implementation that we also tested extensively, and found to be very reliable, is the candlelight project, which uses the gsusb driver between the CAN-CPU and the Linux-CPU. |
Hi, our MacArthur HAT v1.2 is about to be released as stable and it has the MCP2518 tested and working: https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1 All the prototypes are installed and working and I only have one on hand now, but we will soon go into production. Maybe we can leave you a prototype temporarily, we are in Barcelona and Washington, where are you? |
I'd happily test a prototype if I had one.
I am in Los Angeles as it happens.
…On Mon, Sep 25, 2023 at 3:12 AM Sailoog ***@***.***> wrote:
Hi, our MacArthur HAT v1.2 is about to be released as stable and it has
the MCP2518 tested and working:
https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1
All the prototypes are installed and working and I only have one on hand
now, but we will soon go into production. Maybe we can leave you a
prototype temporarily, we are in Barcelona and Washington, where are you?
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRFZDVGLBHODRHVWW53X4FKKLANCNFSM6AAAAAA43WHRS4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I can hook you up with a MacArthur HAT with the MCP2518. Drop me an email to adrian at adrianstuder dot com |
@naugehyde any news on this? Have you had the chance to test the minimal repro we talked about above? |
@tkurki Currently testing with a MCP2518 equipped CAN hat courtesy @astuder. The MCP2518 has a much larger buffer space than the MCP2515 that's on the PICAN-M and the Waveshare CAN hat. Just got the rig up and running today. Was not able to recreate with the MCP2518 hat but it's just day one and only had time for half an hour's testing. Will see if I can recreate w/ the MCP2518 then get back to testing on a MCP2515 equipped hat. It will be very interesting if I cannot recreate the bug on a MCP2518 equipped hat as the common denominator in my tests on two different boards (RPI 4b and a Le Potato) with two different CAN hats was the MCP2515 controller. |
@tkurki Should I test on 2.3.0 or 2.2.0? I note with interest that 2.3 has an upgraded canboatjs. |
Latest makes most sense I guess. |
@tkurki Tested the code you supplied (with modifications to increase the spawn rate) on an RPI-4b running Bullseye 64 bit and, alas, after about 12 minutes, the SOCKETCAN connection silently failed. It's worth noting that the CAN connection running on the same unit on a separate node process (with RPI-Monitor running at 30s intervals) at the same time did NOT fail. I also tested CAN Connection (in SK) using a prototype MacArthur Hat that incorporates the MCP2518 chip (unlike the MCP2515 on the Pican-M hat it has CAN-FD capability and a robust buffer space) and that rig failed much like the Pican-M hat and a Waveshare CAN hat did in earlier tests. Recommend that new releases provide a default of 2s to the noDataReceivedTimeout subOption for CAN connections (see: this).. Still not sure why this is happening. Still not sure what (if anything) besides repeated spawn() calls will cause a CAN0 connection to silently fail. |
@naugehyde just to be crystal clear:
What I am after is a reproducible bug report at https://github.com/sebi2k1/node-can. If I understand your comment correctly this has actually nothing to do with SK Server nor canboatjs (that uses socketcan). Naturally you are very welcome to file it there yourself and link back here. |
Oh forgot to add thanks for doing the leg work on this! |
@tkurki my bad! Tested with SOCKETCAN not canboat. Used your code, just added additional intervals and removed the reconnect in the error code (never reached by the way). OS: Linux piriot 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux Node version: v18.17.0 NB: Tested overnight and wrote up comment first thing prior to any caffeine. |
@tkurki Just tested node 20. SOCKETCAN failed silently after about an 1/2 hour on node 20.6.1 NB: I am testing with AIS on in a busy harbor. |
@tkurki Check out: nodejs/node#14917 Wonder if the issue above can be reproduced in more recent node versions and if so, if it is related to the CAN failure we're seeing. |
@tkurki any luck recreating the issue on your end? |
looks like i'm running into the same issue, or at least a very similar one:
at irregular intervals (anywhere from a few minutes to >24h), the can0 interface simply "goes silent" without any hint of an error in any logfile. my current workaround is a simple shell script that monitors /proc/net/dev for changes in rx packet count, and does the ifconfig down/up dance as described above. |
Fascinating but not entirely parallel to the problem I isolated. Notably,
the issue I identified only affected the current Nodejs process and not the
underlying device. IOW, I could bring down the CAN0 connection in Signalk
with repeated spawn() calls while another process's CAN0 connection
continued unabated. Sounds like that would not be the case for you. If I
misunderstood, please let me know.
Also, your hat stack sounds like it has multiple CAN interfaces (Waveshare
can and Sailor Hat). Is that correct? If so, why? Just curious...
…On Tue, Oct 24, 2023 at 3:23 AM humppafreak ***@***.***> wrote:
looks like i'm running into the same issue, or at least a very similar one:
- rpi4, 8GB, running raspbian 11.7 bullseye 64bit
- hat-stack consists of waveshare 2ch can hat, waveshare 2ch rs485
hat, waveshare GNSS hat (with the interrupt pin snipped off), and
@mairas <https://github.com/mairas>' sailor-hat v2.
- signalk v. 2.2.0
- signalk-rpi-monitor plugin is installed, but *not enabled*
at irregular intervals (anywhere from a few minutes to >24h), the can0
interface simply "goes silent" without any hint of an error in any logfile.
rx and tx packet counters in /proc/net/dev do not increase anymore, it's
like someone simply pulled the plug on the interface.
in most cases, the can0 interface can be forced to work again by a simple ifconfig
can0 down; sleep 5; ifconfig can0 up.
without the "sleep 5" in the restart commands, I get SIOCSIFFLAGS:
Connection timed out on stdout and mcp251x spi0.0: MCP251x didn't enter
in config mode in dmesg.
my current workaround is a simple shell script that monitors /proc/net/dev
for changes in rx packet count, and does the ifconfig down/up dance as
described above.
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRGV27Y2OKEGE4MS573YA6JJVAVCNFSM6AAAAAA43WHRS6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZWHEZTINJUGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Ah, okay. In that case, I might have stumbled upon a completely different CAN issue... which, interestingly enough, seems to have disappeared completely since I appended And to your other question: The Sailor-HAT v2 does not have an N2K interface (the v1 did..). The SH v2 is more or less just a "smart power supply" with a wide input voltage range (9-32V), a stabilized 5V output, and a few beefy supercaps to buy enough time for a clean shutdown on power failure / switch off. |
I've had this happening several times last summer - no more N2K data shown in SK / KIP. I didn't investigate further, just rebooted the machine, so I can't tell which one of the errors I had. CM4, Waveshare CM4 IO BASE B, Waveshare isolated dual CAN HAT (MCP2515). Unfortunately I cannot test anything anytime soon, RPi sitting on my desk, boat far away. |
So it isn't just me!
Yeah the most reliable solution is earlier in this thread: just add a no
data received timeout option. See:
#1626 (comment)
If you ask me, a valid timeout value should be the default for all CAN
socket connections.
…On Wed, Nov 1, 2023, 11:46 AM JonyIvy ***@***.***> wrote:
I've had this happening several times last summer - no more N2K data shown
in SK / KIP. I didn't investigate further, just rebooted the machine, so I
can't tell which one of the errors I had.
CM4, Waveshare CM4 IO BASE B, Waveshare isolated dual CAN HAT (MCP2515).
signalk-rpi-monitor is running with 5s refresh.
Unfortunately I cannot test anything anytime soon, RPi sitting on my desk,
boat far away.
—
Reply to this email directly, view it on GitHub
<#1626 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALTAIRCUASQQT66DLCICJXLYCKKIZAVCNFSM6AAAAAA43WHRS6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBZGQ3TGNJSGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I already have noDataReceivedTimeout: 4 in my settings.json. So my problem might be the same as humppafreak's. |
in that case, you might want to try enabling SPI event tracing as described in my last comment. |
@naugehyde I believe you had some new information about this? |
@tkurki I do. The testing process is very slow, (requires a NMEA network and the fault can take up to 8 hours to reproduce), so please forgive any lapses in communication on my part. What I know now is: at least for my NMEA network the fault occurs when RPI-Monitor (or its equivalent) refresh is set for 5s 2.5s 1.25s, .625s etc. Remarkably, the fault does NOT occur at values inbetween the reported values. Ran the test at .411s for example which equated to maybe 50 deltas/s and at least after eight hours, CAN0 stayed up. I have no idea why this is the case. Very likely a timing issue of some kind. Very likely a problem deep inside nodejs. The fault occurs in the latest build with the latest SK and RPI system software after considerably more elapsed time than earlier builds and systems, but remains catastrophic: CAN0 dies without a warning. It's unlikely to occur in most setups and may not occur at all on other NMEA networks (I have no way of knowing for sure. I have take devices off the network lowering the number of deltas and the fault is still reproducible but I don't have access to other networks.) The simple solution is to put a no data rec'd timeout into the settings.json as illustrated here (#1626 (comment)). As you know, a no data timeout will cause canboatjs to pause for the indicated amount of time in the event no data is rec'd and then restart the CAN0 connection. From my testing this will successfully keep the connection up. It may be an obscure bug but it is real and it is catastrophic as the connection dies silently and a SK restart is required to repair. I believe it's wise to have a NoDataReceivedTimeout default value >0 for CAN0 connections in the settings.json of the next SK version. |
Previously reported that CAN0 randomly stops streaming data (via canboatjs) from a Pican-M hat on a RPI4b connected to a N2k network. [#1622]
Okay so after a LOT of investigation I've been able to isolate the problem and the bug is (mostly) recreatable.
The problem stems from the signalk-rpi-monitor interval. When the monitor plugin is set to report at an interval of say 5s (the default is 60s) CAN0 is likely to fail after 20min or so. The lower the interval the more likely the failure will occur in short order.
I haven't isolated the root cause of the problem (I'm guessing that it has something to do with nodejs's intermittent garbage collection cycle) but it's a fairly serious problem as the only way to get one's instruments back is to restart the Signalk server process.
I'd bet that even at the default 60s interval the CAN bus will, in time, fail (silently). At sea, this could be problematic to say the least.
To recreate the problem you'll need a CAN0 connection of some kind on an RPI 3/4b (only tested on 4b w/Pican-M) and signalk-rpi-monitor installed and its interval set to a sufficiently low value.
For the time being, boaters with signalk-rpi-monitor installed on a RPI with a CAN0 bus should, at the very least, take warning and definitely not set their rpi-monitor interval to a value less than the default.
As for a longer term solution, further investigation into canboatjs is warranted as the failure is not occurring at the system level (candumpjs/candump continue to function normally after signalk's CAN0 data connection fails).
The text was updated successfully, but these errors were encountered: