
S3 multipart upload is broken for segments larger than 8kb #67

Closed · levand opened this issue Mar 29, 2019 · 3 comments

Labels: bug (Something isn't working)

levand commented Mar 29, 2019

The root cause is a problem with the input-stream->byte-array function. The following unit test fragment (intended to be added to utils_test.clj) demonstrates the problem.

(testing "works with a 1mb array"
    (let [input (byte-array (int (Math/pow 2 20)))
          rng (Random.)
          _ (.nextBytes rng input)
          output (util/input-stream->byte-array (io/input-stream input))]
      (is (Arrays/equals ^bytes input ^bytes output))))

The root issue is line 78 of util.clj: (.mark is 0). That is not how marked input streams work; specifically, the readlimit arg is used incorrectly. Per the InputStream docs, "The readlimit arguments tells this input stream to allow that many bytes to be read before the mark position gets invalidated." So in this case, passing zero, the mark is invalidated immediately. This sometimes works by accident in some implementations, as long as the stream length doesn't exceed the internal buffer size, but as the example above demonstrates, some very common use cases can cause problems.
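For anyone who wants to see the readlimit behavior in isolation, here is a minimal REPL sketch (not from the library; a BufferedInputStream is used just to make the internal buffering visible): marking with readlimit 0 and then reading past one buffer's worth of data discards the mark, so the subsequent reset fails.

    ;; sketch: mark with readlimit 0, read past one buffer, then try to reset
    (let [in (java.io.BufferedInputStream.
               (java.io.ByteArrayInputStream. (byte-array 100000)) 8192)]
      (.mark in 0)                    ; mark is only guaranteed for 0 bytes
      (.read in (byte-array 50000))   ; reading past the 8kb buffer invalidates the mark
      (.reset in))                    ; typically throws IOException: Resetting to invalid mark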

The quickest fix is to wrap any provided input stream with a special implementation that, when .mark is called, creates a buffer to store arbitrary amounts of data so that it can always be replayed. Even implementations that support some degree of marking (such as a BufferedInputStream) cannot be relied upon to unwind arbitrary amounts of data: typically mark is limited to the buffer size.
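One crude but easy-to-verify variant of that idea (a sketch only, not the library's code) is to drain the stream into memory once and return a ByteArrayInputStream, which honors mark/reset for any readlimit. It trades memory for correctness, which is roughly the trade-off the fix ultimately chosen below makes anyway:

    ;; sketch: make any InputStream fully replayable by buffering it in memory
    (defn replayable [^java.io.InputStream in]
      (let [out (java.io.ByteArrayOutputStream.)
            buf (byte-array 8192)]
        (loop []
          (let [n (.read in buf)]
            (when (pos? n)
              (.write out buf 0 n)
              (recur))))
        (java.io.ByteArrayInputStream. (.toByteArray out))))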

Fundamentally, however, I think there is a mismatch in the way InputStreams are being used here. The marking technique is used to extend protocols such as sha-256 to generic InputStreams. In general, this is not a good match for the semantics of streams. Logically, one must consume a stream to create a checksum, and one must also consume it to actually send the data. Creating a programming model that relies on being able to consume the same stream twice is simply a violation of the intended semantics of streams.
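To make that concrete, here is a small sketch using nothing but the JDK's MessageDigest: once the stream has been consumed to compute the hash, a later read of the same stream sees only end-of-stream.

    ;; sketch: hashing a stream consumes it; a later read returns -1 (EOF)
    (let [in     (java.io.ByteArrayInputStream. (.getBytes "hello" "UTF-8"))
          digest (java.security.MessageDigest/getInstance "SHA-256")
          buf    (byte-array 8192)]
      (loop []
        (let [n (.read in buf)]
          (when (pos? n)
            (.update digest buf 0 n)
            (recur))))
      [(count (.digest digest))   ; => 32, the SHA-256 digest length in bytes
       (.read in)])               ; => -1, the stream is already exhausted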

My recommendation would be this: the AWS REST API cannot genuinely stream data anyway (see #14); it requires that the content length and checksums be known up front. So the API should match this semantic and require either a File handle or a byte buffer: something concrete and finite.

For situations where input streams of arbitrary length (potentially bigger than memory) actually are required, a wrapper could be provided that accepts an InputStream and initiates a multipart upload, buffering the input stream into chunks and uploading each chunk as a separate part. I think that's the closest thing one could find to a true streaming upload using the APIs that AWS defines.
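As a rough illustration of the buffering half of that idea (a sketch only; the CreateMultipartUpload/UploadPart/CompleteMultipartUpload calls themselves are omitted), an InputStream can be split lazily into bounded byte-array chunks:

    ;; sketch: lazily split an InputStream into byte-array chunks of at most chunk-size bytes
    ;; (note: S3 requires every part except the last to be at least 5 MB)
    (defn chunk-seq [^java.io.InputStream in chunk-size]
      (lazy-seq
        (let [buf (byte-array chunk-size)
              n   (loop [off 0]
                    (if (< off chunk-size)
                      (let [r (.read in buf off (- chunk-size off))]
                        (if (neg? r) off (recur (+ off r))))
                      off))]
          (when (pos? n)
            (cons (if (= n chunk-size) buf (java.util.Arrays/copyOf buf n))
                  (chunk-seq in chunk-size))))))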

However, that is a breaking API change so I totally understand if you would rather just wrap any provided InputStreams with one that supports a growing buffer.

dchelimsky added the bug label Mar 29, 2019
levand commented Mar 29, 2019

If possible, it would also be great to support java.nio.ByteBuffer as a valid type for :blob. This would enable direct memory mapped S3 uploads for the highest possible performance.
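For what it's worth, a minimal sketch of producing such a buffer with plain JDK NIO (nothing here is part of this library) would be to memory-map the source file:

    ;; sketch: memory-map a file read-only into a MappedByteBuffer
    (import '(java.nio.channels FileChannel FileChannel$MapMode)
            '(java.nio.file OpenOption Paths StandardOpenOption))

    (defn mmap-file [^String path]
      (with-open [ch (FileChannel/open (Paths/get path (make-array String 0))
                                       (into-array OpenOption [StandardOpenOption/READ]))]
        ;; the mapping remains valid after the channel is closed
        (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))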

When this library is running on AWS with a fast network link to S3, it is very likely that copying bytes around in the client is going to be the bottleneck for ultimate throughput to S3.

dchelimsky (Contributor) commented

Thanks @levand for such a comprehensive report. I almost added an exemplar label to pin on it :)

To get this working correctly (so you can use MultipartUpload as expected), we're going to leave the API as-is and convert the InputStream to a ByteBuffer earlier (so it's only read once).
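In case it helps anyone following along, that conversion amounts to something like the following sketch (some-input-stream is a placeholder, and the drain step could be the library's existing input-stream->byte-array once fixed):

    ;; sketch: drain the stream once, then wrap the bytes in a ByteBuffer
    (let [^bytes bs (util/input-stream->byte-array some-input-stream)]
      (java.nio.ByteBuffer/wrap bs))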

dchelimsky (Contributor) commented

Released in version 0.8.283.
