Inaccurate claimed DAG sizes #1427

Closed
alanshaw opened this issue Feb 21, 2022 · 5 comments


alanshaw commented Feb 21, 2022

DAG sizes for some upload types are incorrect (reported as smaller than actual).

cargo=> SELECT
  COUNT(*),
  PG_SIZE_PRETTY( SUM( size_actual - size_claimed ) ) AS delta,
  project,
  ds.details->>'upload_type' AS upload_type,
  MAX( ds.entry_created ) AS latest_such_upload
FROM dags d JOIN dag_sources ds USING ( cid_v1 ) JOIN sources USING ( srcid )
WHERE
  size_actual > 0
    AND
  size_claimed > 0
    AND
  size_actual - size_claimed > 1024*1024
GROUP BY project, upload_type;


 count  | delta  | project | upload_type |     latest_such_upload     
--------+--------+---------+-------------+----------------------------
    404 | 400 GB |       1 | Car         | 2022-01-16 04:38:33.466+00
 150331 | 18 TB  |       2 | Car         | 2022-01-17 10:09:57.625+00
 139165 | 303 GB |       2 | Nft         | 2022-01-17 10:16:03.843+00
      4 | 426 MB |       2 | Remote      | 2022-01-05 13:57:07.938+00
(4 rows)

Examples:

select pg_size_pretty( delta ) , * from (
  select (d.size_actual - ds.size_claimed) AS delta, *
    from dag_sources ds
    join dags d using ( cid_v1 )
  where d.size_actual > 0 and ds.size_claimed > 0 and d.size_actual != ds.size_claimed and ds.entry_created > NOW()-'5 day'::interval
  order by delta desc
) sq

examples.txt

alanshaw added the kind/bug and need/triage labels on Feb 21, 2022
dchoi27 added the P2 label and removed the need/triage label on Feb 28, 2022

dchoi27 commented Feb 28, 2022

Needs further investigation to understand its full priority


flea89 commented Mar 15, 2022

I haven't gone too deep yet, but I'll start sharing my initial investigation results (in web3.storage):

Data sample: 6484 CIDs

  • Most of the "problematic" CIDs are CBOR ones (5392 out of 6484)
  • For the PB ones, all 50 of the problematic CIDs I've checked are directories

Looking at the code, possible roots of the problem are:

  • for codec PB we rely on metadata (the link Tsize fields) to calculate size, which could be deliberately changed
  • in carStat we're calculating size for codecs PB and raw (with one block), but not CBOR (see the sketch below)
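
For context, this is roughly what the metadata-based calculation looks like: take the root block's byte length and add each link's claimed Tsize. A minimal sketch (not the actual carStat code), assuming @ipld/dag-pb:

// Sketch: "claimed" cumulative size of a dag-pb root, derived from metadata
// only. If the encoder wrote a wrong Tsize (e.g. 0), the claimed size is wrong too.
import * as dagPB from '@ipld/dag-pb'

function claimedCumulativeSize (rootBlockBytes) {
  const node = dagPB.decode(rootBlockBytes)
  return node.Links.reduce(
    (total, link) => total + (link.Tsize ?? 0),
    rootBlockBytes.byteLength
  )
}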

CBOR dags
Given the size calculation doesn't happen in .storage AFAICT, I wonder if size_claimed is populated wrongly for those CIDs in cargo? But I haven't had time to look there yet.

From a quick look, I suspect size_claimed stores the size of the first block rather than the whole DAG.

PB directories
I just quickly checked a couple of CIDs, and in this case we're actually reporting a bigger size in .storage.
e.g.
CID: bafybeiduwb4o2fsl2lbmuyigzhjdpluahrexjpd7edlilyl5wmz332vnyq
public.content.size = 715
cargo.dag.size_actual = 690

> ipfs dag stat /ipfs/bafybeiduwb4o2fsl2lbmuyigzhjdpluahrexjpd7edlilyl5wmz332vnyq 
> Size: 690, NumBlocks: 7

> ipfs files stat /ipfs/bafybeiduwb4o2fsl2lbmuyigzhjdpluahrexjpd7edlilyl5wmz332vnyq
> bafybeiduwb4o2fsl2lbmuyigzhjdpluahrexjpd7edlilyl5wmz332vnyq
> Size: 0
> CumulativeSize: 715
> ChildBlocks: 1
> Type: directory

I haven't yet checked why the two report different sizes (is it UnixFS headers or a bug?), but I'm sure you know, @alanshaw.

@alanshaw can you run a query in prod where you use the dag size from public.content, to make sure this is really a problem for .storage?

mbommerez assigned flea89 and Gozala and unassigned Gozala on Mar 15, 2022

flea89 commented Mar 16, 2022

Web3.storage findings

TL;DR

Relying on tsize is the issue for the observed CIDs.
In fact, the claimed size matches ipfs files stat (which relies on tsize) in all cases, see here.
I also ran those DAGs through web3.storage's carStat to triple-check that's the case.

Not sure there's a whole lot we can do apart from switching to traversing the whole DAG instead of relying on metadata.
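
For reference, traversing the whole DAG would mean summing the real byte length of every block rather than trusting Tsize. A minimal sketch (assuming the uploaded CAR contains the complete DAG exactly once, and using @ipld/car's CarBlockIterator):

// Sketch: actual DAG size computed from the CAR itself, no metadata involved.
import { CarBlockIterator } from '@ipld/car'

async function actualDagSize (carBytes) {
  let size = 0
  const blocks = await CarBlockIterator.fromBytes(carBytes)
  for await (const { bytes } of blocks) {
    size += bytes.byteLength
  }
  return size
}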

(I'll put together a different comment for nft.storage, and in particular for CBOR dags)

Investigation details

Query

select pg_size_pretty( delta ) as delta, * from (
  select (d.size_actual - ds.size_claimed) AS delta, *
    from cargo.dag_sources ds
    join cargo.dags d using ( cid_v1 )
    JOIN cargo.sources s USING ( srcid )
  where
    d.size_actual - ds.size_claimed > 0 and
    d.size_actual > 0 and
    ds.size_claimed > 0 and
    d.size_actual != ds.size_claimed and
    ds.entry_created > NOW()-'50 day'::interval and
    s.project = 1
  order by delta desc
  limit 100
) sq

Results:
67 rows

Observations

67 of 67 are dag-pb

CID bafybeicyifavcxymohtwpsiht7ule5hnb3o7cj6ulirbvar33vffhepupa

Cargo

Delta: 1477 kB
Size claimed: 50
Size actual: 1512315

Web3.storage

size: 50

> ipfs dag stat /ipfs/bafybeicyifavcxymohtwpsiht7ule5hnb3o7cj6ulirbvar33vffhepupa   
> Size: 1512315, NumBlocks: 10
> ipfs files stat /ipfs/bafybeicyifavcxymohtwpsiht7ule5hnb3o7cj6ulirbvar33vffhepupa   
> Size: 0
> CumulativeSize: 50
> ChildBlocks: 1
> Type: directory

Context

pbNodeBytes.byteLength => 50

Links
[
  {
    Hash: CID(bafybeiazdqznr4nng7l3kitckqa7ojsziseszdnuweov6eh7m6qfudtgg4),
    Name: 'tt',
    Tsize: 0
  }
]

CID bafybeid7dl444brakxtphxnhj5flad37wwntcweijzycd242uy6anqvadq

Cargo

Delta: 107 MB
Size claimed: 163
Size actual: 112393872

Web3.storage

size: 163

Sizes

> ipfs dag stat /ipfs/bafybeid7dl444brakxtphxnhj5flad37wwntcweijzycd242uy6anqvadq   
> Size: 112393872, NumBlocks: 433
> ipfs files stat /ipfs/bafybeid7dl444brakxtphxnhj5flad37wwntcweijzycd242uy6anqvadq   
> Size: 113
> CumulativeSize: 163
> ChildBlocks: 1
> Type: file

Context

pbNodeBytes.byteLength => 50

Links
[
  {
    Hash: CID(bafybeidbqd5hygsdomns34r6xjgucyh5nsfvaa3sqympo544qlk2rn23wm),
    Tsize: 113
  }
]

CID bafybeictly5566rw434m4p6gihiqryqghei23vz5v5w4nwfhf3qyonyrwu

Cargo

Delta: 5121 MB
Size claimed: 4224
Size actual: 5369697184

Web3.storage

size: 4224

> ipfs dag stat /ipfs/bafybeictly5566rw434m4p6gihiqryqghei23vz5v5w4nwfhf3qyonyrwu   

* I had to cancel this one because it is too big and I was running out of space

> ipfs files stat /ipfs/bafybeictly5566rw434m4p6gihiqryqghei23vz5v5w4nwfhf3qyonyrwu

alanshaw commented

This could be a bug in the JS UnixFS importer. It would be interesting to see if importing data with a problematic size into a go-ipfs node resulted in a DAG with similarly problematic Tsizes. My hunch is that they'd differ, i.e. we'd get a different DAG in go-ipfs with accurate Tsize... meaning that actually we're fine to rely on the size in the metadata (for 99% of data), provided we have a bug-free importer.

We currently use dagcargo to update dag size when we don't know how big the DAG is:
https://github.com/nftstorage/nft.storage/blob/main/packages/cron/src/jobs/dagcargo.js

We could/should set up a cron job to "fix up" the rows where the two sizes do not agree.
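
Something along these lines maybe (a rough sketch, not existing code; the cargo table/column names come from the queries above, while the public.content column and join key are assumptions that would need checking):

// Sketch of a "fix up" cron step: overwrite the .storage size with cargo's
// measured size_actual wherever the two disagree.
import pg from 'pg'

export async function fixUpDagSizes (connectionString) {
  const client = new pg.Client({ connectionString })
  await client.connect()
  try {
    const { rowCount } = await client.query(`
      UPDATE public.content c
         SET size = d.size_actual          -- column name assumed from this thread
        FROM cargo.dags d
       WHERE c.cid = d.cid_v1              -- assumed join key
         AND d.size_actual > 0
         AND c.size IS DISTINCT FROM d.size_actual
    `)
    console.log(`fixed up ${rowCount} dag sizes`)
  } finally {
    await client.end()
  }
}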


dchoi27 commented Apr 6, 2022

@Gozala does this problem go away with Uploads v2 (if we just use the size on disk in Elastic Provider)?
