
Add error handling; fix performance; fix cid decoding #4

Merged

MarshalX merged 3 commits into main from fix-perf on Feb 24, 2024

Conversation

@MarshalX (Owner) commented Feb 22, 2024

For benchmarks I am using a 20 MB CAR file with more than 58k dict keys, and the values are nested dicts too (simple structures, no deeper than about level 4).

A good perf improvement:

-fn _decode_dag_cbor(data: Vec<u8>) -> Result<Ipld> {
+fn _decode_dag_cbor(data: &[u8]) -> Result<Ipld> {

Update: will from_sequence_bound help?
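
For context, here is a minimal PyO3 sketch of why that one-line change matters (a sketch only, assuming PyO3 0.20-style bindings; payload_len_borrowed, payload_len_owned and the example module are hypothetical names, not this crate's API). Taking &[u8] lets PyO3 hand the function a borrowed view of the incoming Python bytes buffer, while Vec<u8> makes it copy the whole payload into a fresh allocation before decoding starts:

use pyo3::prelude::*;

// Hypothetical example, not part of this repo: the only difference between
// the two functions is how PyO3 extracts the bytes argument.
#[pyfunction]
fn payload_len_borrowed(data: &[u8]) -> usize {
    data.len()  // `data` points directly into the Python bytes buffer (no copy)
}

#[pyfunction]
fn payload_len_owned(data: Vec<u8>) -> usize {
    data.len()  // PyO3 already copied the whole payload into this Vec
}

#[pymodule]
fn example(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(payload_len_borrowed, m)?)?;
    m.add_function(wrap_pyfunction!(payload_len_owned, m)?)?;
    Ok(())
}

The diff above applies the same borrowed-slice idea to _decode_dag_cbor.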

@MarshalX (Owner, Author) commented Feb 22, 2024

Happy news: they are reworking the internal storage for Python objects and it will be released in the next minor version (0.21)! I am so excited about the perf boost: PyO3/pyo3#3382

@DavidBuchanan314 commented Feb 24, 2024

I think there are some correctness issues here. For my "CID decode torture test" (a big array of CIDs) it decodes to this:

[0, 1, [{}, 'bafkreic75tvwn76in44nsutynrwws3dzyln4eoo5j2i3izzj245cp62x5e', 'bafkreidlq2zhh7zu7tqz224aj37vup2xi6w2j2vcf4outqa6klo3pb23jm', 'bafkreiguonpdujs6c3xoap2zogfzwxidagoapwfwyupzbwr2mzxoye5lgu', 'bafkreicoa5aikyv63ofwbtqfyhpm7y5nc23semewpxqb6zalpzdstne7zy', 'bafkreiclej3xpvg5d7dby34ij5egihicwtisdu75gkglbc2vgh6kzwv7ri'], 'bafkreihpfujh3y33sqv2vudbixsuwddbtipsemt3f2547pwhr5kwjl7dtu', 'bafkreihh63abc53orw342mylkqlu7v3ppubbnnqshb5f77h3qhtpbemwqm', 'bafkreidzajuzxzbmrkhen653iuaxezix5bvsfrlkdcpxmjng3jeqqgzeke', 'bafkreibmmjbdftosef3rffg7xmyqvsqabig7nlelm23jnwio6bx5563eum', 'bafkreiazlapcpxt45uap6hhfbmqepz5fm7dwwhf25ov6l3yd67bqc65vw4', 'bafkreickitobknscasua72aosa4ukxgbmcbidaqp4kze6hssgow6nly52u', 'bafkreicpzavsnlwli7jindco7prvqfzsupt4xtdmf35tebrmbalqubpoxa', 'bafkreidlkhkddx25p4kbzpwozt3z5xz53bq4hnagt4frczq2h3x2zo5jda', 'bafkreib73orv6bg4rrdctbwjsk6pq5kumjlrcmdsvee4cyxx4ryolapcpa', 'bafkreiefe6ujdyrecnuvb7zsziqswrn4sp3j7o4adq5r5pw2yutxl6m6me', 'bafkreihgfh5glggxgj3i67dsnnfweeuf7hb3quydsafkseqbpw3wc7ml3m', 'bafkreifrp33ndhd2lmpoqo4qprmvkjw4whvqnw4ce7lfbvo5ucu7jtum3e', 'bafkreicfenka6fiezulradcigxufw7xp2smrcwapr377awm2r4udxzvz4m', 'bafkreicozfmz7qqd2f3kgaktnqxasgqzxscsownskw6wqgebbjbml7wrji', 'bafkreieuady3ehfve7l7upj6vo5jgvl2ddv6piwkjzdrz7s6jrnuzj7xm4', 'bafkreihvzi4posfb23vpojvyuqx3k5ody4prqzficqzqc6bn4e62fwjafm', 'bafkreidpjntbees7woqnv3gspgo723e4fgkcj7msb6ntbaiqula7xwhuim', 'bafkreidyl47mp2zs6mfzbtipz43fpu4iwx7uff7s7fyw75totnu4axo5be', 'bafkreictl6rq27rf3wfet4ktm54xgtwifbqqruiv3jielv37hnaylwhxsa', 'bafkreigcgvqgt2or46oksjbxqfj47o73jvcbnmpztva2ffal7w3gyuyz3m', 'bafkreifxuvuhhtlxd4wei3jwtnsjimfwlj2wxity76l6zan3n5k3fzzvne', 'bafkreic7trflbdfmorl6sei2gdsgmsjamb7kfqivufbt267jr2l6mqsezi', 'bafkreidhazy43f2aiflce3sqpfz7fk4dgdjqelfjnygjhpn5wmqmigw4v4']

The expected result is an array of 100000 CIDs and nothing else.

Edit: actually I think this issue applies to the main branch too, somehow
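
(For reference, and not part of this PR: the detail the torture test exercises is the DAG-CBOR encoding of CIDs, where each one is CBOR tag 42 wrapping a byte string whose first byte is the 0x00 multibase identity prefix, followed by the binary CID. Below is a minimal Rust sketch of that layout; wrap_cid_for_dag_cbor is a hypothetical helper and only handles typical CID lengths.)

// Hypothetical helper, not from this repo: lay out raw CID bytes the way
// DAG-CBOR expects them, i.e. CBOR tag 42 around a byte string whose first
// byte is the 0x00 multibase identity prefix.
fn wrap_cid_for_dag_cbor(cid: &[u8]) -> Vec<u8> {
    assert!((24..256).contains(&(cid.len() + 1)), "sketch only handles typical CID lengths");
    let mut out = Vec::with_capacity(cid.len() + 5);
    out.extend_from_slice(&[0xd8, 0x2a]);  // CBOR tag 42 ("IPLD content identifier")
    out.push(0x58);                        // major type 2 (byte string), 1-byte length follows
    out.push((cid.len() + 1) as u8);       // length includes the multibase prefix byte
    out.push(0x00);                        // identity multibase prefix required by DAG-CBOR
    out.extend_from_slice(cid);
    out
}

fn main() {
    // A 36-byte CIDv1: version 0x01, raw codec 0x55, sha2-256 multihash (0x12, 32-byte digest).
    let mut cid = vec![0x01u8, 0x55, 0x12, 0x20];
    cid.extend([0u8; 32]);
    let encoded = wrap_cid_for_dag_cbor(&cid);
    assert_eq!(&encoded[..4], &[0xd8u8, 0x2a, 0x58, 0x25][..]);
}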

@MarshalX (Owner, Author) commented Feb 24, 2024

@DavidBuchanan314 try the last commit; it looks fixed. But the segfault is still here on your nested lists and nested maps tests. I guess it's not on my side, but in the library that I use for decoding.

@MarshalX (Owner, Author) commented:

decode_car_faster (the slightly optimized method):

1: 46.16MB/s 2: 38.95MB/s 3: 55.14MB/s 4: 41.52MB/s 5: 40.74MB/s 6: 55.20MB/s 7: 58.91MB/s 8: 41.38MB/s 9: 41.39MB/s 10: 54.32MB/s 

Average: 47.37MB/s

decode_car (the old one from the main branch):

1: 17.89MB/s 2: 17.04MB/s 3: 20.18MB/s 4: 17.53MB/s 5: 19.63MB/s 6: 16.58MB/s 7: 19.66MB/s 8: 17.76MB/s 9: 17.66MB/s 10: 18.98MB/s 

Average: 18.29MB/s

With gc.collect() at the start of each run:

decode_car_faster (the slightly optimized method):

1: 56.94MB/s 2: 57.59MB/s 3: 58.66MB/s 4: 55.20MB/s 5: 58.84MB/s 6: 53.29MB/s 7: 55.51MB/s 8: 58.48MB/s 9: 58.83MB/s 10: 57.93MB/s 

Average: 57.13MB/s

decode_car (the old one from the main branch):

1: 19.24MB/s 2: 19.12MB/s 3: 19.86MB/s 4: 18.99MB/s 5: 19.44MB/s 6: 20.09MB/s 7: 20.39MB/s 8: 20.07MB/s 9: 20.16MB/s 10: 18.78MB/s 

Average: 19.61MB/s

@DavidBuchanan314 commented:

Could you elaborate on the gc.collect()? Where do you put it?

@DavidBuchanan314 commented:

Can confirm the CID decoding is fixed now 💯

@DavidBuchanan314 commented:

(By the way, the benchmarks in my dag-cbor-benchmark repo don't currently do whole-car parsing, I plan to add it though)

@MarshalX (Owner, Author) commented Feb 24, 2024

That's my bench.py:

import gc
import time

import libipld


if __name__ == '__main__':
    car = open('repo.car', 'rb').read()

    # UNCOMMENT THE METHOD YOU WANT TO TEST

    # method = libipld.decode_car
    method = libipld.decode_car_faster

    # warm-up call: the first invocation is noticeably faster, for reasons that are unclear
    res = method(car)

    n = 10
    speed_avg = 0

    for i in range(n):
        # gc.collect()  # optionally collect garbage before each timed run

        start = time.perf_counter_ns()
        res = method(car)  # keep the result in a variable so the Python objects are fully materialized (more realistic, and slower)
        duration = time.perf_counter_ns() - start

        car_speed = (len(car) / (1024 * 1024)) / (duration * 1e-9)
        speed_avg += car_speed

        print(f'{i + 1}: {car_speed:.2f}MB/s', end=' ')

    print(f'\n\nAverage: {speed_avg / n:.2f}MB/s')

@MarshalX changed the title from "Fix performance" to "Add error handling; fix performance; fix cid decoding" on Feb 24, 2024
@MarshalX marked this pull request as ready for review on February 24, 2024, 19:29
@MarshalX merged commit 37a1544 into main on Feb 24, 2024
15 checks passed
@MarshalX deleted the fix-perf branch on February 24, 2024, 19:37