incorrect line splitting in HttpRequestParser #97

tumb1er · 2014-07-03T06:00:38Z

We are receiving some HTTP requests that cause an Invalid Header error in aiohttp.
The cause is the splitlines call in
aiohttp.protocol.HttpRequestParser:

lines = raw_data.decode(
            'ascii', 'surrogateescape').splitlines(True)

For example, this code produces 4 lines instead of 3:

>>> rd = bytearray(b'line1\r\nline2_start_\x1c_line2_end\r\nline3\r\n')
>>> rd.decode('ascii', 'surrogateescape').splitlines(True)
['line1\r\n', 'line2_start_\x1c', '_line2_end\r\n', 'line3\r\n']

In my case it was an invalid User-Agent header sent by UCBrowser. Any thoughts on how to fix it?


popravich · 2014-07-03T08:11:00Z

We had the same issue with splitlines when parsing CSV.
In our case, calling splitlines(True) before decode() solved it. But this means the lines must be decoded after splitting (by looping through the whole list), or each line must be decoded at the place where it is used.
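
A minimal sketch of the split-before-decode approach described above, assuming the input is the raw header block as bytes. The helper name `split_then_decode` is hypothetical and not part of aiohttp.

```python
def split_then_decode(raw_data):
    # bytes.splitlines() only recognises \n, \r and \r\n as line
    # boundaries, so control characters such as \x1c stay inside
    # the line they belong to.
    return [
        line.decode('ascii', 'surrogateescape')
        for line in raw_data.splitlines(True)
    ]


rd = bytearray(b'line1\r\nline2_start_\x1c_line2_end\r\nline3\r\n')
lines = split_then_decode(rd)
print(len(lines))  # 3
```

Note that bytes.splitlines() still splits on a bare \r or \n, so this is not a strict CRLF-only split.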

fafhrd91 · 2014-07-06T16:36:35Z

The problem with splitting the lines first and decoding afterwards is parser performance degradation. I tested that in the early days of aiohttp development.

@asvetlov @popravich @kxepal ideas?

popravich · 2014-07-07T07:37:33Z

I've done a few simple performance tests and here are the results:

>>> raw = b'\r\n'.join([b'part1\x1c_part2\r\nline2\r\nline3'] * 50000)
>>> raw2 = b'\r\n'.join([b'some-not-very-short-header: and_its_verryyyyyy_looooooooong_value'+b'e'*100] * 10000)
>>> len(raw), len(raw2)
(1399998, 1669998)

# short lines
>>> %timeit raw.decode('ascii', 'surrogateescape').splitlines(True)
100 loops, best of 3: 14.5 ms per loop
>>> %timeit list(map(lambda b: b.decode('ascii', 'surrogateescape'), raw.splitlines(1)))
10 loops, best of 3: 81.2 ms per loop
>>> %timeit next(map(lambda b: b.decode('ascii', 'surrogateescape'), raw.splitlines(1)))
100 loops, best of 3: 7.68 ms per loop
>>> %timeit raw.decode('ascii', 'surrogateescape').split('\r\n')
100 loops, best of 3: 11.3 ms per loop

# longer lines
>>> %timeit raw2.decode('ascii', 'surrogateescape').splitlines(True)
100 loops, best of 3: 2.68 ms per loop
>>> %timeit list(map(lambda b: b.decode('ascii', 'surrogateescape'), raw2.splitlines(1)))
100 loops, best of 3: 7.97 ms per loop
>>> %timeit next(map(lambda b: b.decode('ascii', 'surrogateescape'), raw2.splitlines(1)))
100 loops, best of 3: 2.22 ms per loop
>>> %timeit raw2.decode('ascii', 'surrogateescape').split('\r\n')
100 loops, best of 3: 3.25 ms per loop

So maybe it makes sense to use the next(map(...)) pair?
What are your thoughts?

tumb1er · 2014-07-07T13:08:00Z

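next(map(...)) processes only the first line, while all the other variants process the whole header data. A correct test is:

def test_next():
    # Build the iterator once; calling map() inside the loop would restart it.
    lines = map(lambda b: b.decode('ascii', 'surrogateescape'),
                raw.splitlines(True))
    try:
        while True:
            next(lines)
    except StopIteration:
        pass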
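
For a like-for-like comparison, every variant has to consume all the lines. The following is a rough sketch using the standard timeit module instead of IPython's %timeit. The function names and the reduced input size are assumptions for illustration, and the timings will vary by machine.

```python
import timeit

# Smaller input than in the thread, so the sketch runs quickly.
raw = b'\r\n'.join([b'part1\x1c_part2\r\nline2\r\nline3'] * 1000)


def decode_then_splitlines():
    # Original behaviour: str.splitlines() also splits at \x1c.
    return raw.decode('ascii', 'surrogateescape').splitlines(True)


def split_then_decode():
    # Split the bytes first, then decode every line.
    return [b.decode('ascii', 'surrogateescape')
            for b in raw.splitlines(True)]


def decode_then_split():
    # Split only on CRLF; line terminators are dropped.
    return raw.decode('ascii', 'surrogateescape').split('\r\n')


for fn in (decode_then_splitlines, split_then_decode, decode_then_split):
    seconds = timeit.timeit(fn, number=100)
    print('{:<24} {:.4f} s'.format(fn.__name__, seconds))
```

The variants do not return the same number of lines on this input, which is the bug under discussion: only the first one splits at \x1c.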

asvetlov · 2014-07-07T14:10:58Z

@fafhrd91
The benchmark from @popravich is incorrect, but I like the idea of splitting the byte string first and decoding after that.

popravich · 2014-07-07T14:16:56Z

Sorry for the confusion.
Yes, the test isn't correct; the morning coffee didn't work :)

I meant using the iterator that map returns instead of a list of lines.
I will run some tests on HttpRequestParser with different line-splitting variants and come back later.

But I think that splitting the bytes and then decoding each line at the place where it's used might not hurt performance too much. Anyway, I will run some tests.

tumb1er · 2014-07-07T14:34:49Z

@popravich what about regex splitting? Maybe it would both fix the split logic and preserve line endings, without a performance penalty.
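
One way the regex idea could look, as a sketch only: the pattern and the helper name `split_crlf_keepends` are assumptions, not code from aiohttp, and whether it avoids a performance penalty would have to be measured.

```python
import re

# Match everything up to and including the next CRLF, or a final
# unterminated remainder. DOTALL lets '.' match bare \r, \n and \x1c,
# so only the two-character sequence \r\n ends a line.
LINE_RE = re.compile(r'.*?\r\n|.+', re.DOTALL)


def split_crlf_keepends(text):
    return LINE_RE.findall(text)


text = 'line1\r\nline2_start_\x1c_line2_end\r\nline3\r\n'
print(split_crlf_keepends(text))
# ['line1\r\n', 'line2_start_\x1c_line2_end\r\n', 'line3\r\n']
```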

fafhrd91 · 2014-07-07T20:50:01Z

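How often is the header encoding broken?
Can we do something like this (an optimistic approach):

   try:
       lines = raw.decode('ascii').splitlines(True)
   except UnicodeDecodeError:
       # fall back: split the bytes first, then decode with surrogateescape
       lines = [l.decode('ascii', 'surrogateescape')
                for l in raw.splitlines(True)]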

tumb1er · 2014-07-08T03:40:23Z

It doesn't raise a decode error; it just splits off extra lines at the \x1c character. The real exception is raised in parse_headers:

            try:
                name, value = line.split(':', 1)
            except ValueError:
                raise ValueError('Invalid header: {}'.format(line)) from None

Good idea, but hard to implement.

P.S. The exception happens for less than 0.1% of my requests.
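
To illustrate the point above: decoding succeeds, and the failure only shows up when a stray fragment without a colon reaches the header split. The header values here are made up for the example.

```python
# Hypothetical request headers; \x1c sits inside the User-Agent value.
raw = b'User-Agent: UCBrowser\x1cextra\r\nHost: example.com\r\n'

# \x1c is valid ASCII, so no UnicodeDecodeError is raised here.
lines = raw.decode('ascii').splitlines(True)

errors = []
for line in lines:
    try:
        name, value = line.split(':', 1)
    except ValueError:
        # The fragment after \x1c has no colon, so unpacking fails.
        errors.append(line)

print(lines)   # three "lines" instead of two
print(errors)  # ['extra\r\n']
```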

fafhrd91 · 2014-07-08T03:58:10Z

Ah! I think this is a bug in .splitlines().
.split('\r\n') works correctly; it just removes the '\r\n' from the strings.
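
A quick check of .split('\r\n') on the example from the original report. This is only an illustration, not the code from the actual fix.

```python
rd = bytearray(b'line1\r\nline2_start_\x1c_line2_end\r\nline3\r\n')
lines = rd.decode('ascii', 'surrogateescape').split('\r\n')
print(lines)
# ['line1', 'line2_start_\x1c_line2_end', 'line3', '']
```

Besides dropping the terminators, split() leaves a trailing empty string when the data ends with '\r\n', so code that relied on splitlines(True) may need adjusting.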

fafhrd91 · 2014-07-08T04:27:38Z

@tumb1er could you test the fix in a6a179c5ad1e011a73610588de3046487244bed1?

fafhrd91 · 2014-07-08T04:29:33Z

@asvetlov could you file a Python bug report for .splitlines()?

tumb1er · 2014-07-08T06:39:52Z

@fafhrd91, I've tested, it works now.

BTW, \x1c in ASCII is named "File separator, Information separator four" here http://donsnotes.com/tech/charsets/ascii.html
So it may not be a bug in splitlines :)

fafhrd91 · 2014-07-08T06:58:16Z

Maybe :)
The documentation should say so, then.

asvetlov · 2014-07-08T10:14:41Z

@fafhrd91 I don't clearly understand: what exactly is wrong with .splitlines()?

fafhrd91 · 2014-07-08T13:08:06Z

.splitlines() treats the characters \x1c, \x1d and \x1e as line endings.

popravich · 2014-07-08T13:22:36Z

Well, str.splitlines() treats those characters as line endings and bytes.splitlines() does not. That may be a problem.
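
The difference between the two methods can be checked directly. This sketch probes a handful of ASCII control characters; the helper names are made up for the example.

```python
candidates = ['\n', '\r', '\x0b', '\x0c', '\x1c', '\x1d', '\x1e']


def splits_str(ch):
    # True if str.splitlines() treats ch as a line boundary.
    return len(('a' + ch + 'b').splitlines()) == 2


def splits_bytes(ch):
    # True if bytes.splitlines() treats ch as a line boundary.
    return len(('a' + ch + 'b').encode('ascii').splitlines()) == 2


for ch in candidates:
    print(repr(ch), splits_str(ch), splits_bytes(ch))
```

All of these characters split a str, while only \n and \r split a bytes object.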

lock · 2019-10-29T22:01:55Z

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel there are important points made in this discussion,
please include those excerpts in the new issue.

fafhrd91 mentioned this issue Jul 7, 2014

Release 0.9.0 #104

Closed

fafhrd91 added a commit that referenced this issue Jul 8, 2014

seems .splitlines() has bug use .split() instead #97

a6a179c

fafhrd91 closed this as completed Jul 8, 2014

lock bot added the outdated label Oct 29, 2019

lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
