
Feature/fix bigtiff #33

Merged Jun 28, 2022 (8 commits)

Conversation

jurekkow (Collaborator)

Changes overview

  • automatically save the image as BigTIFF when needed
  • freeze dependency versions to the ones used in production use cases
  • enhance the CI build so that it runs tests

BigTIFF Problem

In order to save a TIFF image bigger than 4 GB, you need to save it as BigTIFF.
The tifffile library automatically saves it as BigTIFF for you, but ONLY if compression is turned off.

import math
import numpy as np
import tifffile

square_bigtiff_size = int(math.sqrt(4 * 2 ** 30))  # 65536: raw size is exactly 4 GiB
bigtiff_shape = (square_bigtiff_size, square_bigtiff_size)
random_bigtiff_img = np.random.randint(0, 255, bigtiff_shape, dtype=np.uint8)

tifffile.imwrite("random_adobe_deflate.tif", random_bigtiff_img, compress="adobe_deflate")

### OUTPUT
Traceback (most recent call last):
  File "/home/jkowalski/src/apeer-ometiff-library/limits.py", line 10, in <module>
    tifffile.imwrite("random_adobe_deflate.tif", random_bigtiff_img, compress="adobe_deflate")
  File "/home/jkowalski/.venv/apeer-ometiff-library-36/lib/python3.6/site-packages/tifffile/tifffile.py", line 698, in imwrite
    return tif.save(data, shape, dtype, **kwargs)
  File "/home/jkowalski/.venv/apeer-ometiff-library-36/lib/python3.6/site-packages/tifffile/tifffile.py", line 1915, in save
    ifd.write(pack(offsetformat, offset))
  File "/home/jkowalski/.venv/apeer-ometiff-library-36/lib/python3.6/site-packages/tifffile/tifffile.py", line 1451, in pack
    return struct.pack(byteorder + fmt, *val)
struct.error: 'I' format requires 0 <= number <= 4294967295
tifffile.imwrite("random.big.tif", random_bigtiff_img)
$ ls -l random.big.tif
-rw-rw-r-- 1 jkowalski jkowalski 4294967712 cze 24 14:32 random.big.tif
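To verify which kind of file was actually written, one can check the TIFF magic number: bytes 2-3 of the header hold 42 for classic TIFF and 43 for BigTIFF. A minimal sketch (the helper name is mine, not part of tifffile):

import struct

def is_bigtiff(path):
    # Bytes 0-1: byte order ("II" = little-endian, "MM" = big-endian).
    # Bytes 2-3: magic number, 42 for classic TIFF, 43 for BigTIFF.
    with open(path, "rb") as f:
        header = f.read(4)
    byteorder = "<" if header[:2] == b"II" else ">"
    return struct.unpack(byteorder + "H", header[2:4])[0] == 43

print(is_bigtiff("random.big.tif"))  # expected: True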

Potential solutions

Option 1

Always save as BigTIFF.

Pros:

  • super easy implementation

Cons:

  • wasted space for the vast majority of produced images:
img = np.zeros((20000, 20000), dtype=np.uint8)
tifffile.imwrite("adobe_deflate.tif", img, compress="adobe_deflate")
tifffile.imwrite("adobe_deflate.big.tif", img, compress="adobe_deflate", bigtiff=True)

The BigTIFF file is almost 30 KB bigger:

$ ls -l adobe*
-rw-rw-r-- 1 jkowalski jkowalski 620439 Jun 24 13:42 adobe_deflate.big.tif
-rw-rw-r-- 1 jkowalski jkowalski 593655 Jun 24 13:42 adobe_deflate.tif

Option 2

Check whether the image is bigger than 4 GB, and only then save as BigTIFF (a sketch follows the example below).

Pros:

  • no wasted space for the vast majority of produced images

Cons:

  • dirty implementation, potentially requiring some copy-pasting from tifffile
  • still wasted space for images >4 GB before compression but <4 GB after compression:
zeros_bigtiff_img = np.zeros(bigtiff_shape, dtype=np.uint8)
tifffile.imwrite("zeros_adobe_deflate.big.tif", zeros_bigtiff_img, compress="adobe_deflate", bigtiff=True)
tifffile.imwrite("zeros_adobe_deflate.tif", zeros_bigtiff_img, compress="adobe_deflate")

$ ls -l zeros*
-rw-rw-r-- 1 jkowalski jkowalski 6291872 Jun 24 14:47 zeros_adobe_deflate.big.tif
-rw-rw-r-- 1 jkowalski jkowalski 6029616 Jun 24 14:49 zeros_adobe_deflate.tif
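A minimal sketch of what Option 2's check could look like (the wrapper name is an assumption, not code from this PR):

import numpy as np
import tifffile

MAX_CLASSIC_TIFF = 2 ** 32  # classic TIFF uses unsigned 32-bit offsets

def imwrite_auto_bigtiff(path, img, **kwargs):
    # Hypothetical wrapper: enable BigTIFF whenever the uncompressed data
    # alone could overflow the 32-bit offset space of classic TIFF.
    tifffile.imwrite(path, img, bigtiff=img.nbytes >= MAX_CLASSIC_TIFF, **kwargs)

Note that the decision is based on the uncompressed size, which is exactly the second con above.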

Option 3

Try to save as BigTIFF only on failure.

Pros:

  • no wasted space; BigTIFF is used only when actually needed
  • easy implementation

Cons:

  • wasted time due to the retry:
import struct
import time

t0 = time.time()
tifffile.imwrite("random_adobe_deflate.big.tif", random_bigtiff_img, compress="adobe_deflate", bigtiff=True)
print(f"imwrite directly with bigtiff=True took {time.time() - t0:2f} seconds")

t0 = time.time()
try:
    tifffile.imwrite("random_adobe_deflate.big.tif", random_bigtiff_img, compress="adobe_deflate")
except struct.error:
    print("imwrite failed, retrying with bigtiff=True")
    tifffile.imwrite("random_adobe_deflate.big.tif", random_bigtiff_img, compress="adobe_deflate", bigtiff=True)
print(f"imwrite with retry took {time.time() - t0:2f} seconds")

### OUTPUT:
imwrite directly with bigtiff=True took 70.509310 seconds
imwrite failed, retrying with bigtiff=True
imwrite with retry took 143.868945 seconds
  • struct.error is not very specific, so catching it could mask unrelated failures

Summary

I find Option 3 by far the most reasonable, since it only affects files that currently fail.

The time cost of this solution is acceptable, and we've never come across any other struct.error, since we work with compressed OME-TIFF files.

evhen14 commented Jun 27, 2022

Option 3 would have a negative impact on performance, especially in a cloud environment where data is written to disk over a network (Azure Blob, Azure Files, etc.). Writing the data only to see it fail, then compressing and writing again in BigTIFF format, would hurt performance for large files.

If we can detect the data size, we could decide whether to use BigTIFF and/or compression. For example: if the data size is >4 GB, apply compression; if the result of the compression is still >4 GB, use BigTIFF. In this case, Option 2 makes more sense.
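A minimal sketch of that two-stage rule, assuming the compressed size is estimated by deflating the raw bytes once ("adobe_deflate" is zlib deflate, so zlib gives a rough estimate; all helper names here are illustrative, not code from this PR):

import zlib
import numpy as np
import tifffile

FOUR_GB = 2 ** 32

def estimated_deflate_size(img, level=8, chunk=1 << 24):
    # Rough estimate: deflate a flat byte view of the image in chunks and
    # count the output. This ignores TIFF tile/strip framing, so treat the
    # result as approximate.
    comp = zlib.compressobj(level)
    raw = np.ascontiguousarray(img).view(np.uint8).reshape(-1)
    total = 0
    for start in range(0, raw.size, chunk):
        total += len(comp.compress(raw[start:start + chunk].tobytes()))
    return total + len(comp.flush())

def write_ometiff(path, img):
    # Data over 4 GB gets compressed; if even the compressed payload would
    # not fit into classic TIFF's 32-bit offsets, fall back to BigTIFF.
    if img.nbytes <= FOUR_GB:
        tifffile.imwrite(path, img)
    elif estimated_deflate_size(img) <= FOUR_GB:
        tifffile.imwrite(path, img, compress="adobe_deflate")
    else:
        tifffile.imwrite(path, img, compress="adobe_deflate", bigtiff=True)

The estimation costs one extra pass over the data, so whether it pays off depends on how often files land between the two thresholds.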

Another question: why do we freeze versions from 2020 and not 2022? Hopefully the newer versions contain bug fixes.

jurekkow (Collaborator, Author)

@evhen14 after discussion, Option 3 (storage optimized) was changed to Option 2 (time optimized).

jurekkow merged commit 8d2dc92 into apeer-micro:master on Jun 28, 2022