Add error handling; fix performance; fix cid decoding #4
Conversation
Happy news: they are reworking the internal storage for Python objects and it will be released in the next minor version (0.21)! I am so excited about the perf boost: PyO3/pyo3#3382
I think there are some correctness issues here. For my "CID decode torture test" (a big array of CIDs) it decodes to this:

```
[0, 1, [{}, 'bafkreic75tvwn76in44nsutynrwws3dzyln4eoo5j2i3izzj245cp62x5e', 'bafkreidlq2zhh7zu7tqz224aj37vup2xi6w2j2vcf4outqa6klo3pb23jm', 'bafkreiguonpdujs6c3xoap2zogfzwxidagoapwfwyupzbwr2mzxoye5lgu', 'bafkreicoa5aikyv63ofwbtqfyhpm7y5nc23semewpxqb6zalpzdstne7zy', 'bafkreiclej3xpvg5d7dby34ij5egihicwtisdu75gkglbc2vgh6kzwv7ri'], 'bafkreihpfujh3y33sqv2vudbixsuwddbtipsemt3f2547pwhr5kwjl7dtu', 'bafkreihh63abc53orw342mylkqlu7v3ppubbnnqshb5f77h3qhtpbemwqm', 'bafkreidzajuzxzbmrkhen653iuaxezix5bvsfrlkdcpxmjng3jeqqgzeke', 'bafkreibmmjbdftosef3rffg7xmyqvsqabig7nlelm23jnwio6bx5563eum', 'bafkreiazlapcpxt45uap6hhfbmqepz5fm7dwwhf25ov6l3yd67bqc65vw4', 'bafkreickitobknscasua72aosa4ukxgbmcbidaqp4kze6hssgow6nly52u', 'bafkreicpzavsnlwli7jindco7prvqfzsupt4xtdmf35tebrmbalqubpoxa', 'bafkreidlkhkddx25p4kbzpwozt3z5xz53bq4hnagt4frczq2h3x2zo5jda', 'bafkreib73orv6bg4rrdctbwjsk6pq5kumjlrcmdsvee4cyxx4ryolapcpa', 'bafkreiefe6ujdyrecnuvb7zsziqswrn4sp3j7o4adq5r5pw2yutxl6m6me', 'bafkreihgfh5glggxgj3i67dsnnfweeuf7hb3quydsafkseqbpw3wc7ml3m', 'bafkreifrp33ndhd2lmpoqo4qprmvkjw4whvqnw4ce7lfbvo5ucu7jtum3e', 'bafkreicfenka6fiezulradcigxufw7xp2smrcwapr377awm2r4udxzvz4m', 'bafkreicozfmz7qqd2f3kgaktnqxasgqzxscsownskw6wqgebbjbml7wrji', 'bafkreieuady3ehfve7l7upj6vo5jgvl2ddv6piwkjzdrz7s6jrnuzj7xm4', 'bafkreihvzi4posfb23vpojvyuqx3k5ody4prqzficqzqc6bn4e62fwjafm', 'bafkreidpjntbees7woqnv3gspgo723e4fgkcj7msb6ntbaiqula7xwhuim', 'bafkreidyl47mp2zs6mfzbtipz43fpu4iwx7uff7s7fyw75totnu4axo5be', 'bafkreictl6rq27rf3wfet4ktm54xgtwifbqqruiv3jielv37hnaylwhxsa', 'bafkreigcgvqgt2or46oksjbxqfj47o73jvcbnmpztva2ffal7w3gyuyz3m', 'bafkreifxuvuhhtlxd4wei3jwtnsjimfwlj2wxity76l6zan3n5k3fzzvne', 'bafkreic7trflbdfmorl6sei2gdsgmsjamb7kfqivufbt267jr2l6mqsezi', 'bafkreidhazy43f2aiflce3sqpfz7fk4dgdjqelfjnygjhpn5wmqmigw4v4']
```

The expected result is an array of 100000 CIDs and nothing else.

Edit: actually I think this issue applies to the main branch too, somehow
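A rough sketch of this kind of round-trip check (not the actual torture test, which lives in DavidBuchanan314's repo; it assumes the third-party dag_cbor and multiformats packages for building the input, and that libipld.decode_dag_cbor returns CIDs as base32 strings):

```python
# Sketch only: build a large array of CIDs, encode it as DAG-CBOR with the
# dag_cbor package, and check that libipld decodes it back to exactly that
# array (compared here against the base32 string form of each CID).
import dag_cbor
import libipld
from multiformats import CID, multihash

N = 100_000
cids = [
    CID("base32", 1, "raw", multihash.digest(str(i).encode(), "sha2-256"))
    for i in range(N)
]
encoded = dag_cbor.encode(cids)

decoded = libipld.decode_dag_cbor(encoded)
assert isinstance(decoded, list) and len(decoded) == N
assert all(str(c) == s for c, s in zip(cids, decoded))
```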
@DavidBuchanan314 try the last commit, it looks fixed. But the segfault is still there on your nested lists and nested maps; I guess it's not on my side but in the library I use for decoding.
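For reference, a minimal way to reproduce that kind of deeply nested input is to build the DAG-CBOR bytes by hand, which avoids needing a recursive encoder (a sketch; the depth actually required to crash the decoder is a guess):

```python
# Sketch: hand-built DAG-CBOR for a deeply nested single-element list and a
# deeply nested single-key map, fed straight into the decoder.
import libipld

DEPTH = 100_000  # guess at a depth deep enough to exhaust a native stack

# 0x81 = array of length 1; innermost value 0x00 = unsigned int 0
nested_list = b"\x81" * DEPTH + b"\x00"
# 0xa1 = map of 1 pair, 0x61 0x6b = text string "k"; innermost value 0x00
nested_map = b"\xa1\x61\x6b" * DEPTH + b"\x00"

for payload in (nested_list, nested_map):
    libipld.decode_dag_cbor(payload)  # crashes here if decoding recurses on the native stack
```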
decode_car_faster (a bit optimized method):
1: 46.16MB/s 2: 38.95MB/s 3: 55.14MB/s 4: 41.52MB/s 5: 40.74MB/s 6: 55.20MB/s 7: 58.91MB/s 8: 41.38MB/s 9: 41.39MB/s 10: 54.32MB/s
Average: 47.37MB/s

decode_car (the old one from the main branch):
1: 17.89MB/s 2: 17.04MB/s 3: 20.18MB/s 4: 17.53MB/s 5: 19.63MB/s 6: 16.58MB/s 7: 19.66MB/s 8: 17.76MB/s 9: 17.66MB/s 10: 18.98MB/s
Average: 18.29MB/s

With gc.collect():

decode_car_faster (a bit optimized method):
1: 56.94MB/s 2: 57.59MB/s 3: 58.66MB/s 4: 55.20MB/s 5: 58.84MB/s 6: 53.29MB/s 7: 55.51MB/s 8: 58.48MB/s 9: 58.83MB/s 10: 57.93MB/s
Average: 57.13MB/s

decode_car (the old one from the main branch):
1: 19.24MB/s 2: 19.12MB/s 3: 19.86MB/s 4: 18.99MB/s 5: 19.44MB/s 6: 20.09MB/s 7: 20.39MB/s 8: 20.07MB/s 9: 20.16MB/s 10: 18.78MB/s
Average: 19.61MB/s
Could you elaborate on the gc.collect()? Where do you put it?
Can confirm the CID decoding is fixed now 💯
(By the way, the benchmarks in my dag-cbor-benchmark repo don't currently do whole-car parsing, I plan to add it though)
That's my script:

```python
import gc
import time

import libipld

if __name__ == '__main__':
    car = open('repo.car', 'rb').read()

    # uncomment the method you want to test
    # method = libipld.decode_car
    method = libipld.decode_car_faster

    # warm-up call: for some reason the very first call is much faster than the rest,
    # so it is excluded from the timed runs
    res = method(car)

    n = 10
    speed_avg = 0
    for i in range(n):
        # gc.collect()  # optional: collect garbage before each timed run
        start = time.perf_counter_ns()
        res = method(car)  # keep the result in a variable; more realistic (and a bit slower)
        duration = time.perf_counter_ns() - start

        car_speed = (len(car) / (1024 * 1024)) / (duration * 1e-9)
        speed_avg += car_speed
        print(f'{i + 1}: {car_speed:.2f}MB/s', end=' ')

    print(f'\n\nAverage: {speed_avg / n:.2f}MB/s')
```
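(The gc.collect() call sits at the top of the loop, before the timer starts, so whatever collection work it triggers happens outside the measured region; presumably that is where the higher numbers in the "with gc.collect()" runs come from.)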
For the benchmarks I am using a 20MB CAR file with more than 58k dict keys, where the values are some nested dicts too (simple structures tbh, no deeper than level 4 I guess).

Good perf improvement!

upd: will from_sequence_bound help?