The root cause is a problem with the `input-stream->byte-array` function. The following unit test fragment (intended to be added to `utils_test.clj`) demonstrates the problem.
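Roughly along these lines (an illustrative sketch; the alias for the util namespace is a guess):

```clojure
;; Illustrative sketch; assumes the test ns requires clojure.test (deftest, is)
;; and aliases the util namespace as `util`.
(deftest input-stream->byte-array-is-replayable
  ;; the mark/reset technique should let the same stream be drained twice,
  ;; even when it is larger than BufferedInputStream's default 8 KB buffer
  (let [data (byte-array (* 1024 1024))
        _    (.nextBytes (java.util.Random. 42) data)
        in   (java.io.BufferedInputStream.
               (java.io.ByteArrayInputStream. data))]
    (is (java.util.Arrays/equals data (util/input-stream->byte-array in)))
    (is (java.util.Arrays/equals data (util/input-stream->byte-array in)))))
```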
The root issue is line 78 of `util.clj`: `(.mark is 0)`. That is not how marked input streams work. Specifically, the `readlimit` arg is used incorrectly. Per the docs, "The readlimit arguments tells this input stream to allow that many bytes to be read before the mark position gets invalidated." So in this case, passing zero, the mark is invalidated immediately. This sometimes works by accident in some implementations as long as the stream length doesn't exceed the buffer size, but as the example above demonstrates, some very common use cases can cause problems.
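For reference, the behavior is easy to reproduce against the JDK directly (standalone snippet, nothing library-specific):

```clojure
;; Plain JDK behaviour: with a readlimit of 0, BufferedInputStream drops the
;; mark as soon as its internal 8 KB buffer has to be refilled, and the
;; subsequent .reset throws.
(let [in  (java.io.BufferedInputStream.
            (java.io.ByteArrayInputStream. (byte-array 20000)))
      buf (byte-array 4096)]
  (.mark in 0)
  (while (not (neg? (.read in buf))))   ; drain past the default 8 KB buffer
  (.reset in))                          ; => java.io.IOException: Resetting to invalid mark
```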
The quickest fix is to wrap any provided input stream with a special implementation that, when `.mark` is called, creates a buffer to store arbitrary amounts of data so that it can always be replayed. Even implementations that support some degree of marking (such as a `BufferedInputStream`) cannot be relied upon to unwind arbitrary amounts of data: typically `mark` is limited to the buffer size.
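A simplified sketch of what I mean (my own, handling only the mark-at-the-start / drain / reset / drain-again pattern used here, and ignoring `readlimit` entirely):

```clojure
;; Simplified sketch, not library code: record every byte read from the wrapped
;; stream and replay the recording after .reset. Only marking at the start of
;; the stream is supported.
(defn replayable-input-stream ^java.io.InputStream [^java.io.InputStream in]
  (let [recorded (java.io.ByteArrayOutputStream.)
        replay   (atom nil)]
    (proxy [java.io.InputStream] []
      (markSupported [] true)
      (mark [_readlimit])                 ; nothing to do: we always record
      (reset []
        (reset! replay (java.io.ByteArrayInputStream. (.toByteArray recorded))))
      (close [] (.close in))
      (read
        ([]                               ; single-byte read via the 3-arg arity
         (let [b (byte-array 1)]
           (if (neg? (.read ^java.io.InputStream this b 0 1))
             -1
             (bit-and (aget b 0) 0xff))))
        ([^bytes b]
         (.read ^java.io.InputStream this b 0 (alength b)))
        ([^bytes b off len]
         (let [^java.io.ByteArrayInputStream r @replay]
           (if (and r (pos? (.available r)))
             (.read r b off len)          ; serve previously recorded bytes first
             (let [n (.read in b off len)]
               (when (pos? n) (.write recorded b off n))
               n))))))))
```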
Fundamentally, however, I think there is a mismatch in the way `InputStream`s are being used here. The marking technique is used to extend protocols such as `sha-256` to generic `InputStream`s. In general, this is not a good match for the semantics of streams. Logically, one must consume a stream to create a checksum. One must also consume a stream to actually use it. Creating a programming model that relies on being able to consume the same stream twice is simply a violation of the intended semantics of streams.
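Concretely (a standalone snippet, not library code):

```clojure
;; Computing a SHA-256 drains the stream, so a second consumer (the actual
;; request body) sees nothing.
(let [in     (java.io.ByteArrayInputStream. (.getBytes "hello"))
      digest (java.security.MessageDigest/getInstance "SHA-256")
      buf    (byte-array 8192)]
  (loop [n (.read in buf)]
    (when-not (neg? n)
      (.update digest buf 0 n)
      (recur (.read in buf))))
  {:sha-256-bytes (count (.digest digest))   ; 32: the checksum is computed...
   :next-read     (.read in buf)})           ; -1: ...but the stream is now empty
```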
My recommendation would be this: the AWS REST API cannot genuinely stream data anyway (see #14); it requires that the content length and checksums be known up front. So the API should match this semantic and require either a `File` handle or a byte buffer: something concrete and finite.
For situations where input streams of arbitrary length (potentially bigger than memory) actually are required, a wrapper could be provided that accepts an `InputStream` and initiates a multi-part upload, buffering the input stream into chunks and uploading each one separately. I think that's the closest thing one could find to a true streaming upload using the APIs that AWS defines.
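The chunking half of that is straightforward; a rough sketch, with no AWS calls and an arbitrary part size:

```clojure
;; Sketch of just the chunking step. S3 requires every part except the last to
;; be at least 5 MB, so a part size of e.g. 8 MB would be a reasonable choice.
(defn input-stream->parts
  "Returns a lazy seq of byte-array parts of `part-size` bytes (last may be smaller)."
  [^java.io.InputStream in part-size]
  (letfn [(read-part []
            (let [buf (byte-array part-size)]
              (loop [off 0]
                (let [n (.read in buf off (- part-size off))]
                  (cond
                    (neg? n)                (when (pos? off)
                                              (java.util.Arrays/copyOf buf off))
                    (= (+ off n) part-size) buf
                    :else                   (recur (+ off n)))))))]
    (take-while some? (repeatedly read-part))))
;; each part would then be sent with UploadPart inside a
;; CreateMultipartUpload / CompleteMultipartUpload pair
```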
However, that is a breaking API change, so I totally understand if you would rather just wrap any provided `InputStream`s with one that supports a growing buffer.
If possible, it would also be great to support `java.nio.ByteBuffer` as a valid type for `:blob`. This would enable direct memory-mapped S3 uploads for the highest possible performance.
When this library is running on AWS with a fast network link to S3, it is very likely that copying bytes around in the client is going to be the bottleneck for ultimate throughput to S3.
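For illustration (the file name here is made up), the kind of buffer I mean:

```clojure
;; Map a file directly into memory and hand the resulting read-only ByteBuffer
;; to the client, with no extra copy of the bytes on the JVM side.
(import '(java.nio.channels FileChannel FileChannel$MapMode)
        '(java.nio.file OpenOption StandardOpenOption))

(defn mmap-file ^java.nio.ByteBuffer [^String path]
  (with-open [ch (FileChannel/open (.toPath (java.io.File. path))
                                   (into-array OpenOption [StandardOpenOption/READ]))]
    ;; the mapping remains valid after the channel is closed
    (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))

;; (mmap-file "/var/data/big-object.bin") => a read-only MappedByteBuffer
```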
Thanks @levand for such a comprehensive report. I almost added an exemplar label to pin on it :)
To get this working correctly (so you can use `MultipartUpload` as expected), we're going to leave the API as-is, and convert the `InputStream` to a `ByteBuffer` earlier (so it's only read once).
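Something along these lines (a sketch of the idea, not the actual change):

```clojure
;; Drain the InputStream a single time, up front, into a ByteBuffer that
;; everything downstream (checksums, content length, the request body) can
;; read independently.
(defn input-stream->byte-buffer ^java.nio.ByteBuffer [^java.io.InputStream in]
  (let [out (java.io.ByteArrayOutputStream.)
        buf (byte-array 8192)]
    (loop [n (.read in buf)]
      (when-not (neg? n)
        (.write out buf 0 n)
        (recur (.read in buf))))
    (java.nio.ByteBuffer/wrap (.toByteArray out))))
```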