ChannelState#Received accounting can apparently fail #357

Open

rvagg opened this issue Jan 10, 2023 · 2 comments
rvagg commented Jan 10, 2023

Two related items:

Autoretrieve is seeing "successful" transfers that have zero bytes transferred; our event logs for them look like this: `{ "confirmed": true, "receivedCids": 1, "receivedSize": 0 }`.

These transfers are initiated because we don't have the block locally, and a new transfer shouldn't be able to start for the same CID. They also aren't marked as failed, because the block confirmer gives us a 👍 that the root block we wanted is now in our blockstore. So it would appear that the transfer happens, but DT doesn't update state properly and `channelState.Received()` ends up zero.
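For illustration, a minimal Go sketch (not autoretrieve's actual code) of the kind of post-transfer check that surfaces this inconsistency. `ChannelState.Received()` is the method named in the title; the `rootConfirmed` flag and the helper itself are assumptions standing in for the block-confirmer plumbing:

```go
package retrieval

import (
	"fmt"

	datatransfer "github.com/filecoin-project/go-data-transfer"
)

// checkCompletedChannel is a hypothetical helper: the block confirmer says the
// root block landed in our blockstore, yet DT accounted zero received bytes on
// the channel — exactly the { "confirmed": true, "receivedSize": 0 } case above.
func checkCompletedChannel(st datatransfer.ChannelState, rootConfirmed bool) error {
	if rootConfirmed && st.Received() == 0 {
		return fmt.Errorf("channel %v: root block confirmed but ChannelState.Received() == 0", st.ChannelID())
	}
	return nil
}
```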

h/t to @dirkmc: logs from sophia, apparently from autoretrieve, show what looks like a ~1m timeout cancellation even though the SP claims it's still sending data. We time out when no bytes are received, even if we're still chatting with the peer.

(Screenshots of the retrieval logs, taken 2023-01-10 at 11:07:20, 11:07:28 and 11:07:35 AM)

dirkmc commented Jan 10, 2023

> We time out when no bytes are received, even if we're still chatting with the peer.

In this case it looks from the logs like the last data was sent at 2023-01-10 10:18:41.744, and after a minute autoretrieve gives up waiting for more data and cancels the retrieval.
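As a rough sketch of that behaviour (assumed, not autoretrieve's actual implementation): a no-progress watchdog that resets whenever bytes arrive and cancels the retrieval after a minute of silence, regardless of other chatter from the peer. The names and the one-minute constant are illustrative:

```go
package retrieval

import (
	"context"
	"time"
)

// progressTimeout matches the ~1m cancellations seen in the logs (assumed value).
const progressTimeout = time.Minute

// watchProgress cancels the retrieval if no new bytes arrive within
// progressTimeout, even if the peer keeps sending non-data messages.
// dataReceived would be signalled from the data-transfer event handler.
func watchProgress(ctx context.Context, cancelRetrieval func(), dataReceived <-chan uint64) {
	timer := time.NewTimer(progressTimeout)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-dataReceived:
			// Bytes arrived: reset the no-progress timer.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(progressTimeout)
		case <-timer.C:
			// A full minute with no received bytes: give up and cancel,
			// even though the peer may still be "chatting".
			cancelRetrieval()
			return
		}
	}
}
```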

davidd8 assigned dirkmc and jacobheun, and unassigned dirkmc, on Feb 1, 2023
davidd8 commented Feb 1, 2023

Triaging this to the Boost team (@jacobheun) to prioritize in their backlog, since it may be causing the "1m timeout" errors seen in the autoretrieve dashboard: https://protocollabs.grafana.net/d/lDh_Fko4k/autoretrieve-estuary?orgId=1&refresh=5s&from=1675121579495&to=1675294379495&viewPanel=39. Given it's the top retrieval error at the moment, I'm elevating this issue's severity to P1.
