Buffer encoding #19

HoneyryderChuck · 2017-12-18T11:44:40Z

This should fix #18

…ou'll need binary for payload

… encoding

…force binary here

tarcieri · 2017-12-18T15:08:30Z

lib/http/form_data/composite_io.rb

@@ -29,14 +29,16 @@ def initialize(ios)
      #
      # @return [String, nil]
      def read(length = nil, outbuf = nil)
-        outbuf = outbuf.to_s.replace("")
+        outbuf = outbuf.to_s.clear


Though this is likely a bug in the previous version as well, #to_s may make a new object, at which point the purpose of passing outbuf is defeated.

Perhaps give outbuf a default value (e.g. outbuf = "".b instead of outbuf = nil as the argument definition), and raise if it's not a String?

#to_s will not create a new object if the object is a string. But the replace call might change its encoding. That was the reasoning behind it.

Hence "may make a new object". We never want to make a new object for the buffer... that defeats the point.

How about raising if the supplied buffer is not a String type with Encoding::BINARY?

We never want to make a new object for the buffer... that defeats the point.

ahh, I see what you mean.

How about raising if the supplied buffer is ...

I'm still not sure. The encoding requirement is usually not mandatory, and the buffer passed may be a specialized buffer. Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.

By the way, what's the reasoning behind allowing length = nil? The length argument is usually required for IO#read and File#read.

the buffer passed may be a specialized buffer

In what case would it make sense for it to be anything but a String? Ultimately it needs to thunk to a Ruby I/O function which also accepts a preallocated buffer, and those all need to be a String unless JRuby has some fancy behavior I don't know about where it lets you pass a Java ByteBuffer or something.

Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.

For these types of buffers, I think mandating Encoding::BINARY is the only right option. They should really be a ByteArray/ByteBuffer type, but Ruby doesn't have those, so we need the next best thing, which is a String with Encoding::BINARY.

For other encodings there are all sorts of pathological performance implications, and the contract at this point in the code is to handle arbitrary binary data that may not be whatever-encoding-was-passed-in.

Given this is a rarely used power-user feature, I think it should have some pretty exacting semantics, which are:

Avoid all allocations. The whole point of this API is to avoid allocations. This means no #to_s or anything else that might potentially allocate memory

String is probably the only argument type that makes sense

Encoding::BINARY (i.e. Encoding::ASCII_8BIT) is probably the only sensible encoding for the buffer String. If you think other encodings should be supported, do you really think the answer to the question "Will outbuf always be a valid string under encoding X?" is yes?

By the way, what's the reasoning behind allowing length = nil?

No idea, like the outbuf being nil I think that probably doesn't make sense.

I agree with this change, I'll try to answer the questions below.

ixti · 2017-12-20T21:17:25Z

@janko-m Please, chip in.

janko · 2017-12-20T22:34:03Z

I'll try to make some clarifications on the implementation and behaviour.

Though this is likely a bug in the previous version as well, #to_s may make a new object, at which point the purpose of passing outbuf is defeated.

The reason we're creating a new object is because when we're reading the chunks of internal @ios, we're reading them into the same @buffer string object that was allocated on CompositeIO.new, so no new string objects are created on this line:

current_io.read(length, @buffer)

Since Ruby's IO#read always returns a new string object when the outbuf is not passed in, we mimic that here as well by calling outbuf.to_s to create a new string in case outbuf wasn't passed in. That way when outbuf is not passed in we're allocating exactly 1 new string object per CompositeIO#read call, and when it's passed in it's 0, which is what we want.

Perhaps give outbuf a default value (e.g. outbuf = "".b instead of outbuf = nil as the argument definition), and raise if it's not a String?

Ruby's IO#read still allows explicitly passing the outbuf argument as nil (i.e. io.read(10, nil)), so a default argument wouldn't handle that. I recall experiencing that some library called IO#read that way. I know it's an edge case, but I think it makes sense to aim for 100% compatibility with IO#read.

#to_s will not create a new object if the object is a string. But the replace call might change its encoding. That was the reasoning behind it.

Good call, replace("") does in fact change the encoding on MRI 2.4.1:

[1] pry(main)> "".b.replace("").encoding
=> #<Encoding:UTF-8>

By the way, what's the reasoning behind allowing length = nil?
No idea, like the outbuf being nil I think that probably doesn't make sense.

Both length and outbuf are optional arguments in Ruby's IO#read, so that's why they're optional here as well. When length is not passed in, then the whole content should be returned. HTTP::Multipart#to_s (inherited from Multipart::FormData::Readable) calls HTTP::CompositeIO#read without any arguments. FYI, IO#readpartial and IO#read_nonblock require the length argument, those are the only ones I think.

HoneyryderChuck · 2017-12-21T06:53:16Z

@janko-m thank you for the clarification. I had the impression that this was to follow IO#read, but wasn't sure anymore if length was optional or not.

Besides the buffer issue, anything else you'd like to add to the PR? I'd consider it ready to go otherwise.

nightsurge · 2018-01-08T19:01:02Z

Gemfile

@@ -18,7 +18,7 @@ group :test do
 end

 group :doc do
-  gem "redcarpet"
+  gem "redcarpet", platform: :mri


Travis CI is blowing up on this line, because Rubocop for the project is requiring hash rockets. I am guessing that's a personal preference for the project, as I prefer Ruby 1.9+ colon syntax...

nightsurge · 2018-01-08T19:01:44Z

Can we get this CI issue resolved? I think my usage of the gem may depend on this change as well.

tarcieri · 2018-01-08T23:29:14Z

@nightsurge there's an open upstream issue re: keyword arguments httprb/http#447

ixti · 2018-01-09T20:52:04Z

@tarcieri as far as I thought kwargs issue was about changing options hashes to kwargs. But not about changing hash-rocket usage to new hash style. :)) I'm still hard follower of hash rockets stile as it keeps code much more consistent :D

ixti · 2018-01-09T20:52:50Z

Will review/merge this pr asap

nightsurge · 2018-01-17T19:26:23Z

@HoneyryderChuck I have some oddities in my code:

    10: def initialize(ios)
    11:   @index  = 0
    12:   @buffer = String.new
 => 13:   binding.pry
    14:   @ios    = ios.map do |io|
    15:     if io.is_a?(String)
    16:       StringIO.new(io)
    17:     elsif io.respond_to?(:read)
    18:       io
    19:     else
    20:       raise ArgumentError,
    21:         "#{io.inspect} is neither a String nor an IO object"
    22:     end
    23:   end
    24: end

[1] pry(#<HTTP::FormData::CompositeIO>)> @buffer.encoding
=> #<Encoding:ASCII-8BIT>
[2] pry(#<HTTP::FormData::CompositeIO>)> String.new.encoding
=> #<Encoding:ASCII-8BIT>

@buffer/String.new appears to be ASCII-8BIT, which is correct. Also, when it gets to the io handling, I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer. I'm just doing standard image uploads of various types (png/jpg).

def read(length = nil, outbuf = nil)
        outbuf = outbuf.to_s.replace("")
        outbuf.force_encoding(Encoding::BINARY)

        while current_io
          current_io.read(length, @buffer)
          outbuf << @buffer.force_encoding(Encoding::BINARY)

I'm using JRuby 9.1.15.0

HoneyryderChuck · 2018-01-17T21:05:48Z

I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer

I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should.

nightsurge · 2018-01-17T21:11:50Z

@HoneyryderChuck agreed. I wonder if I can setup a test script using io.read and a sample file to reproduce the issue. Is it detrimental to performance to use both force_encodings? I hate to have to monkeypatch the gem to keep it working for our enterprise app. But then again it is silly to ask a gem to sacrifice potential performance for a JRuby bug. Both scenarios are potentially just temporary.

HoneyryderChuck · 2018-01-17T21:26:52Z

@nightsurge I'd say that you could easily turn it into a reproducible script (in theory, at least), as you're just reading chunks into a buffer. I'd suggest you to present this in the jruby issue tracker, and they're usually quite responsive folks.

janko · 2018-01-21T15:33:48Z

I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer

I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should.

Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI.

require "securerandom"
string = ""
string.encoding # => #<Encoding:UTF-8>
string << SecureRandom.random_bytes
string # => ",\xCA\xBA\xEE<E\xB0\x12\xB0+\xD8\x1Ebh}\e"
string.encoding # => #<Encoding:ASCII-8BIT>

We can see that when we appended a binary string to an UTF-8 encoded string, its encoding automatically changed to binary as well.

Is it detrimental to performance to use both force_encodings?

I would be surprised if it is, because force_encoding doesn't change the bytes, it only changes how those bytes should be interpreted (the encoding).

If the goal is to have all string returned in binary encoding, which if you ask me is definitely a safe option, we should add .force_encoding(Encoding::BINARY) calls whenever we're appending to outbuf.

nightsurge · 2018-01-21T15:39:36Z

That sounds safest. I was definitely seeing the same behavior in the script I wrote for both MRI/C and Jruby.

…

On Jan 21, 2018 10:33 AM, "Janko Marohnić" ***@***.***> wrote: I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer <https://github.com/buffer> I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should. Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI. require "securerandom" string = "" string.encoding # => #<Encoding:UTF-8> string << SecureRandom.random_bytes string # => ",\xCA\xBA\xEE<E\xB0\x12\xB0+\xD8\x1Ebh}\e" string.encoding # => #<Encoding:ASCII-8BIT> We can see that when we appended a binary string to an UTF-8 encoded string, its encoding automatically changed to binary as well. Is it detrimental to performance to use both force_encodings? I would be surprised if it is, because force_encoding doesn't change the bytes, it only changes the instructions how those bytes should be interpreted (the encoding). If the goal is to have all string returned in binary encoding, which if you ask me is definitely a safe option, we should add .force_encoding(Encoding::BINARY) calls whenever we're appending to outbuf. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#19 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC7IGi1XQIqJqtaSVPZwEhP8ag_NFS9Gks5tM1jcgaJpZM4RFWXI> .

HoneyryderChuck · 2018-01-22T17:18:46Z

Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI.

@janko-m the discussed case (if I remember correctly) was where the binary chunk was being appended to the binary outbuf AND then switching the encoding to UTF-8. And this does seem like a bug, although it might also be just undefined and one has to ensure encoding itself.

I could also just easily employ the fix and add the .force_encoding calls, but I was under the impression that it had a non-trivial cost. But if my impression is wrong, I can just simply add it.

janko · 2018-01-29T00:48:01Z

the discussed case (if I remember correctly) was where the binary chunk was being appended to the binary outbuf AND then switching the encoding to UTF-8. And this does seem like a bug, although it might also be just undefined and one has to ensure encoding itself.

Oh, that definitely seems like a bug then. But we'd need the .force_encoding call in any case.

I could also just easily employ the fix and add the .force_encoding calls, but I was under the impression that it had a non-trivial cost. But if my impression is wrong, I can just simply add it.

We know that String#force_encoding changes only the encoding associated with the string, so I think we can safely assume that it's a very cheap operation.

I would like to see this merged, because that fact that currently String#length is used instead of String#bytesize means that this gem doesn't work correctly with files that have multi-byte UTF-8 characters and aren't opened in binary mode, on MRI as well as JRuby (sorry for having introduced that bug). Perhaps it would be also useful to add a test for that as part of this PR.

ixti · 2018-01-29T15:30:42Z

@HoneyryderChuck can you please add specs (if it's feasible)?

HoneyryderChuck · 2018-01-29T19:50:07Z

@ixti I'll try to come up with smth, + the force_encoding call. But can we get a reproducible script to present the bug case to the jruby guys? @nightsurge any inputs?

nightsurge · 2018-01-29T20:05:08Z

@HoneyryderChuck @ixti @janko-m I already posted to JRuby and they confirmed that since it was also a bug with MRI/C-Ruby, there was nothing they planned to do at this point and closed the issue:

(script is in that issue post)
jruby/jruby#4986

HoneyryderChuck · 2018-01-29T20:23:13Z

@ixti decided not to write any test, as composite_io module tests are following a pattern I find hard to add other ios into. Also, I wouldn't know how to test the specific case of the bytesize call. If you think this is important, I'd suggest you to change one of the ios strings to a different encoding, although that's up to you.

@nightsurge did you file a bug report in ruby tracker? I'd like to follow that.

janko · 2018-01-29T20:45:22Z

@HoneyryderChuck That's ok, since I wrote that bit I can add the tests after the PR is merged.

ixti · 2018-01-29T23:47:57Z

@HoneyryderChuck Thank you!

janko · 2018-03-04T11:25:47Z

@ixti @tarcieri Shall we make a new release with this fix?

ixti · 2018-03-04T18:09:06Z

@janko-m Oh, right!. Seems like so. Will handle this today! Thanks for the reminder.

ixti · 2018-03-05T02:18:15Z

@janko-m v2.1.0 is here :D

janko · 2018-03-05T06:04:05Z

@ixti Awesome, thank you!

@janko-m

## 2.1.0 (2018-03-05) * [#21](httprb/form_data#21) Rewind content at the end of `Readable#to_s`. [@janko-m][] * [#19](httprb/form_data#19) Fix buffer encoding. [@HoneyryderChuck][] ## 2.0.0 (2017-10-01) * [#17](httprb/form_data#17) Add CRLF character to end of multipart body. [@mhickman][] ## 2.0.0.pre2 (2017-05-11) * [#14](httprb/form_data#14) Enable streaming for urlencoded form data. [@janko-m][] ## 2.0.0.pre1 (2017-05-10) * [#12](httprb/form_data#12) Enable form data streaming. [@janko-m][]

HoneyryderChuck added 5 commits December 18, 2017 09:54

using buffer string in binary encoding, as String.new is utf-8, and y…

2a86ded

…ou'll need binary for payload

use String#bytesize, as #length returns the length size in its encoding

621e18e

use String#clear instead of String#replace, as the latter changes the…

3d97b83

… encoding

do not use rercarpet for jruby

9a1b911

jruby accusing something other than ASCII-8bit sometimes (US-ASCII), …

3bc3787

…force binary here

tarcieri reviewed Dec 18, 2017

View reviewed changes

janko approved these changes Dec 21, 2017

View reviewed changes

nightsurge reviewed Jan 8, 2018

View reviewed changes

hash rocketz

f08d80c

forcing the buffer to binary, as the buffer encoding might change

eeff06c

ixti merged commit 31eefbc into httprb:master Jan 29, 2018

janko mentioned this pull request Jan 30, 2018

Add specs for CompositeIO operating in binary mode #20

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Buffer encoding #19

Buffer encoding #19

HoneyryderChuck commented Dec 18, 2017

tarcieri Dec 18, 2017 •

edited

Loading

HoneyryderChuck Dec 18, 2017

tarcieri Dec 18, 2017

HoneyryderChuck Dec 19, 2017

tarcieri Dec 20, 2017 •

edited

Loading

janko Dec 20, 2017

ixti commented Dec 20, 2017 •

edited

Loading

janko commented Dec 20, 2017 •

edited

Loading

HoneyryderChuck commented Dec 21, 2017

nightsurge Jan 8, 2018

nightsurge commented Jan 8, 2018

tarcieri commented Jan 8, 2018

ixti commented Jan 9, 2018

ixti commented Jan 9, 2018

nightsurge commented Jan 17, 2018 •

edited

Loading

HoneyryderChuck commented Jan 17, 2018

nightsurge commented Jan 17, 2018

HoneyryderChuck commented Jan 17, 2018

janko commented Jan 21, 2018 •

edited

Loading

nightsurge commented Jan 21, 2018 via email

HoneyryderChuck commented Jan 22, 2018

janko commented Jan 29, 2018 •

edited

Loading

ixti commented Jan 29, 2018

HoneyryderChuck commented Jan 29, 2018

nightsurge commented Jan 29, 2018 •

edited

Loading

HoneyryderChuck commented Jan 29, 2018

janko commented Jan 29, 2018

ixti commented Jan 29, 2018

janko commented Mar 4, 2018

ixti commented Mar 4, 2018

ixti commented Mar 5, 2018

janko commented Mar 5, 2018

Buffer encoding #19

Buffer encoding #19

Conversation

HoneyryderChuck commented Dec 18, 2017

tarcieri Dec 18, 2017 • edited Loading

Choose a reason for hiding this comment

HoneyryderChuck Dec 18, 2017

Choose a reason for hiding this comment

tarcieri Dec 18, 2017

Choose a reason for hiding this comment

HoneyryderChuck Dec 19, 2017

Choose a reason for hiding this comment

tarcieri Dec 20, 2017 • edited Loading

Choose a reason for hiding this comment

janko Dec 20, 2017

Choose a reason for hiding this comment

ixti commented Dec 20, 2017 • edited Loading

janko commented Dec 20, 2017 • edited Loading

HoneyryderChuck commented Dec 21, 2017

nightsurge Jan 8, 2018

Choose a reason for hiding this comment

nightsurge commented Jan 8, 2018

tarcieri commented Jan 8, 2018

ixti commented Jan 9, 2018

ixti commented Jan 9, 2018

nightsurge commented Jan 17, 2018 • edited Loading

HoneyryderChuck commented Jan 17, 2018

nightsurge commented Jan 17, 2018

HoneyryderChuck commented Jan 17, 2018

janko commented Jan 21, 2018 • edited Loading

nightsurge commented Jan 21, 2018 via email

HoneyryderChuck commented Jan 22, 2018

janko commented Jan 29, 2018 • edited Loading

ixti commented Jan 29, 2018

HoneyryderChuck commented Jan 29, 2018

nightsurge commented Jan 29, 2018 • edited Loading

HoneyryderChuck commented Jan 29, 2018

janko commented Jan 29, 2018

ixti commented Jan 29, 2018

janko commented Mar 4, 2018

ixti commented Mar 4, 2018

ixti commented Mar 5, 2018

janko commented Mar 5, 2018

tarcieri Dec 18, 2017 •

edited

Loading

tarcieri Dec 20, 2017 •

edited

Loading

ixti commented Dec 20, 2017 •

edited

Loading

janko commented Dec 20, 2017 •

edited

Loading

nightsurge commented Jan 17, 2018 •

edited

Loading

janko commented Jan 21, 2018 •

edited

Loading

janko commented Jan 29, 2018 •

edited

Loading

nightsurge commented Jan 29, 2018 •

edited

Loading