
S3 multipart upload is broken for segments larger than 8kb #67

Closed · levand opened this issue Mar 29, 2019 · 3 comments

Labels: bug (Something isn't working)

levand commented Mar 29, 2019

The root cause is a problem with the input-stream->byte-array function. The following unit test fragment (intended to be added to utils_test.clj) demonstrates the problem.

(testing "works with a 1mb array"
    (let [input (byte-array (int (Math/pow 2 20)))
          rng (Random.)
          _ (.nextBytes rng input)
          output (util/input-stream->byte-array (io/input-stream input))]
      (is (Arrays/equals ^bytes input ^bytes output))))

The root issue is line 78 of util.clj: (.mark is 0). That is not how marked input streams work; specifically, the readlimit arg is used incorrectly. Per the InputStream docs, "The readlimit arguments tells this input stream to allow that many bytes to be read before the mark position gets invalidated." So in this case, passing zero, the mark is invalidated immediately. This sometimes works by accident in some implementations, as long as the stream length doesn't exceed the internal buffer size, but as the example above demonstrates, some very common use cases can cause problems.
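For anyone who wants to see the readlimit behavior in isolation, here is a minimal REPL sketch (not from the library; a BufferedInputStream is used just to make the internal buffering visible): marking with readlimit 0 and then reading past one buffer's worth of data discards the mark, so the subsequent reset fails.

    ;; sketch: mark with readlimit 0, read past one buffer, then try to reset
    (let [in (java.io.BufferedInputStream.
               (java.io.ByteArrayInputStream. (byte-array 100000)) 8192)]
      (.mark in 0)                    ; mark is only guaranteed for 0 bytes
      (.read in (byte-array 50000))   ; reading past the 8kb buffer invalidates the mark
      (.reset in))                    ; typically throws IOException: Resetting to invalid mark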

The quickest fix is to wrap any provided input stream with a special implementation that, when .mark is called, creates a buffer to store arbitrary amounts of data so that it can always be replayed. Even implementations that support some degree of marking (such as a BufferedInputStream) cannot be relied upon to unwind arbitrary amounts of data: typically mark is limited to the buffer size.
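One crude but easy-to-verify variant of that idea (a sketch only, not the library's code) is to drain the stream into memory once and return a ByteArrayInputStream, which honors mark/reset for any readlimit. It trades memory for correctness, which is roughly the trade-off the fix ultimately chosen below makes anyway:

    ;; sketch: make any InputStream fully replayable by buffering it in memory
    (defn replayable [^java.io.InputStream in]
      (let [out (java.io.ByteArrayOutputStream.)
            buf (byte-array 8192)]
        (loop []
          (let [n (.read in buf)]
            (when (pos? n)
              (.write out buf 0 n)
              (recur))))
        (java.io.ByteArrayInputStream. (.toByteArray out))))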

Fundamentally, however, I think there is a mismatch in the way InputStreams are being used here. The marking technique is used to extend protocols such as sha-256 to generic InputStreams. In general, this is not a good match for the semantics of streams. Logically, one must consume a stream to create a checksum, and one must also consume it to actually send the data. Creating a programming model that relies on being able to consume the same stream twice is simply a violation of the intended semantics of streams.
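To make that concrete, here is a small sketch using nothing but the JDK's MessageDigest: once the stream has been consumed to compute the hash, a later read of the same stream sees only end-of-stream.

    ;; sketch: hashing a stream consumes it; a later read returns -1 (EOF)
    (let [in     (java.io.ByteArrayInputStream. (.getBytes "hello" "UTF-8"))
          digest (java.security.MessageDigest/getInstance "SHA-256")
          buf    (byte-array 8192)]
      (loop []
        (let [n (.read in buf)]
          (when (pos? n)
            (.update digest buf 0 n)
            (recur))))
      [(count (.digest digest))   ; => 32, the SHA-256 digest length in bytes
       (.read in)])               ; => -1, the stream is already exhausted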

My recommendation would be this: the AWS REST API cannot genuinely stream data anyway (see #14); it requires that the content length and checksums be known up front. So the API should match this semantic and require either a File handle or a byte buffer: something concrete and finite.

For situations where input streams of arbitrary length (potentially bigger than memory) actually are required, a wrapper could be provided that accepts an InputStream and initiates a multipart upload, buffering the input stream into chunks and uploading each chunk as a separate part. I think that's the closest thing one could find to a true streaming upload using the APIs that AWS defines.
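As a rough illustration of the buffering half of that idea (a sketch only; the CreateMultipartUpload/UploadPart/CompleteMultipartUpload calls themselves are omitted), an InputStream can be split lazily into bounded byte-array chunks:

    ;; sketch: lazily split an InputStream into byte-array chunks of at most chunk-size bytes
    ;; (note: S3 requires every part except the last to be at least 5 MB)
    (defn chunk-seq [^java.io.InputStream in chunk-size]
      (lazy-seq
        (let [buf (byte-array chunk-size)
              n   (loop [off 0]
                    (if (< off chunk-size)
                      (let [r (.read in buf off (- chunk-size off))]
                        (if (neg? r) off (recur (+ off r))))
                      off))]
          (when (pos? n)
            (cons (if (= n chunk-size) buf (java.util.Arrays/copyOf buf n))
                  (chunk-seq in chunk-size))))))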

However, that is a breaking API change so I totally understand if you would rather just wrap any provided InputStreams with one that supports a growing buffer.

dchelimsky added the bug label Mar 29, 2019
levand commented Mar 29, 2019

If possible, it would also be great to support java.nio.ByteBuffer as a valid type for :blob. This would enable direct memory mapped S3 uploads for the highest possible performance.
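For what it's worth, a minimal sketch of producing such a buffer with plain JDK NIO (nothing here is part of this library) would be to memory-map the source file:

    ;; sketch: memory-map a file read-only into a MappedByteBuffer
    (import '(java.nio.channels FileChannel FileChannel$MapMode)
            '(java.nio.file OpenOption Paths StandardOpenOption))

    (defn mmap-file [^String path]
      (with-open [ch (FileChannel/open (Paths/get path (make-array String 0))
                                       (into-array OpenOption [StandardOpenOption/READ]))]
        ;; the mapping remains valid after the channel is closed
        (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))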

When this library is running on AWS with a fast network link to S3, it is very likely that copying bytes around in the client is going to be the bottleneck for ultimate throughput to S3.

dchelimsky (Contributor) commented

Thanks @levand for such a comprehensive report. I almost added an exemplar label to pin on it :)

To get this working correctly (so you can use MultipartUpload as expected), we're going to leave the API as-is and convert the InputStream to a ByteBuffer earlier (so it's only read once).
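In case it helps anyone following along, that conversion amounts to something like the following sketch (some-input-stream is a placeholder, and the drain step could be the library's existing input-stream->byte-array once fixed):

    ;; sketch: drain the stream once, then wrap the bytes in a ByteBuffer
    (let [^bytes bs (util/input-stream->byte-array some-input-stream)]
      (java.nio.ByteBuffer/wrap bs))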

dchelimsky (Contributor) commented

Released in version 0.8.283.
