-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sqlite3.iterdump() incompatible with binary data #108590
Comments
I am not able to reproduce this on 3.11.5, could you provide an example database or the script/sql to create one that demonstrates the issue? |
Hi @CorvinM, thanks for taking the time to attempt to reproduce. My use case is fiddling around with existing settings in a Google Chrome profile. Your mileage may vary. But here's a simplified/synthesized test that should be reproducible...
Notice a single non-ASCII/UTF byte in the data VARCHAR column. Then do this to build the database: But then in Python... #! /usr/bin/env python3.11
"""https://github.com/python/cpython/issues/108590"""
import sqlite3
conn = sqlite3.connect('foo.db')
curs = conn.cursor()
curs.execute('SELECT * FROM foo')
for line in conn.iterdump():
pass It fails thusly:
|
I'm unable to reproduce this using the following: import sqlite3
with sqlite3.connect(":memory:") as cx:
cx.execute("CREATE TABLE foo (data VARCHAR)")
cx.execute("INSERT INTO foo VALUES(?)", ["a\x9f"])
for row in cx.iterdump():
print(row)
cx.close() |
The SQLite shell does not special case VARCHAR (take a look at |
Unable to reproduce on Debian or Ubuntu as well. |
BTW, @dotysan, I see you are using the connection context manager. Note that the connection context manager does not close the database, it only makes sure that the transaction in the |
I am able to reproduce using @dotysan's foo.sql (recreated from hex dump, my terminal/editor was giving problems trying to "fix" the invalid encoding). The sqlite3 CLI both accepts it and reproduces it in a dump so I do believe its a python sqlite bug (rather than a corrupt db). Usually there is not supposed to be invalid encoded characters in a VARCHAR/TEXT field but it seems sqlite takes the garbage in -> garbage out policy. Working on a patch for this. Heres an alternative hexdump of foo.sql that can be imported a bit easier (with
|
We can't use a hexdump for a regression test; please provide a Python only reproducer. |
@CorvinM, something like this should suffice: import sqlite3
SCRIPT = """
CREATE TABLE foo (data VARCHAR);
INSERT INTO foo VALUES('a\x9f');
"""
with sqlite3.connect(":memory:") as cx:
cx.executescript(SCRIPT)
for row in cx.iterdump():
print(row)
cx.close() |
Unfortunately I don't think I can make it that pretty unless there is a API function that lets us send a query of bytes instead of str that I'm not aware of. The encode() down the chain is causing a problem as its turning the To show the encode() problem: import sqlite3
SCRIPT = """
CREATE TABLE foo (data VARCHAR);
INSERT INTO foo VALUES('a\x9f');
"""
with sqlite3.connect("out.db") as cx:
cx.executescript(SCRIPT) $ sqlite3 out.db 'SELECT data from foo;' | xxd
00000000: 61c2 9f0a a... Regardless, this works to show the original issue (albeit nasty): import sqlite3
import gzip
"""
# Created from the following shell commands:
# hex dump required because of the invalid unicode 9f at offset 3a
# cant survive most terminals/editors or python encode()
xxd -r << EOF | sqlite3 foo.db
00000000: 4352 4541 5445 2054 4142 4c45 2066 6f6f CREATE TABLE foo
00000010: 2028 6461 7461 2056 4152 4348 4152 293b (data VARCHAR);
00000020: 0a49 4e53 4552 5420 494e 544f 2066 6f6f .INSERT INTO foo
00000030: 2056 414c 5545 5328 2761 9f27 293b VALUES('a.');
EOF
python -c 'import sqlite3; import gzip; print(gzip.compress(sqlite3.connect("foo.db").serialize()))'
"""
dbfile_gz = b"\x1f\x8b\x08\x00'7\xeed\x02\xff\xed\xd71\n\xc2@\x14\x84\xe1\xb7K\xb0\x13\x95\x14\xb6[j#\x88\x17p\r\x01\xc14\xc6`\xbf\xd1\x04\x04eA\xf6>\x9e\xc8\x0bY\xb9\xa266\x16v\xf2\x7f\xcc\x14\x0f\xde\x05f\xb3.\x0e\xa11\xad?\x9f\\03\xe9\x8bR27FD\xf4\xabo*6\xf9\xb8\xbf\xd12\xd9I\xf7\xf1\xdc\xbbJ\x0c\x00\x00\x00\x00\x00\xf8\xd5Tu\x86i\xaaV\xc1\xd5\xc7\xa6\xf5>Fgen\xab\xdcTvQ\xe4q\xe6{3\xda\xbb\xe0\xcc\xd6\x96\xd9\xd2\x96\xe3\xe76\xbfI\x0c\x00\x00\x00\x00\x00\xf8;\x89\xd2\x03w\xb9\x03\xef\xd0\xa6\xa3\x00 \x00\x00"
with sqlite3.connect(":memory:") as cx:
cx.deserialize(gzip.decompress(dbfile_gz))
for line in cx.iterdump():
print(line)
cx.close() |
Hm, yes I noticed this, @CorvinM. I can reproduce using the hexdump. |
It does not matter if the repro is ugly; we cannot rely on the SQLite shell in the CI. Having a zipped database and loading it with deserialize is a better option (however, that won't work in 3.11, since deserialize isn't implemented there). |
For context, I stumbled across this while attempting to dump my Google Chrome profile settings using Python. They are the source of the binary data read/written into VARCHAR columns. I.e. not written from Python. In particular The workaround in my case is to bypass the Python module with... with open(dump_path, 'w') as dump:
subprocess.run(['sqlite3', db_path, '.dump'], stdout=dump) |
...So I assume the rational for storing binary data in a sqlite3 VARCHAR is somewhere in the Chromium source..? |
Found a much better way to get around the encoding issue. import sqlite3
import gzip
SCRIPT = """
CREATE TABLE foo (data VARCHAR);
INSERT INTO foo VALUES (CAST(X'619f' AS VARCHAR));
"""
with sqlite3.connect(":memory:") as cx:
cx.executescript(SCRIPT)
for line in cx.iterdump():
print(line)
cx.close() |
You can drop |
BTW: Note that SQLite has the concept of type affinity. VARCHAR has the TEXT affinity. You can get the same result casting to TEXT. Also, the data type is optional in the table creation. |
SQLite will let you put pretty much any data type into any column. See SQLite quirks: https://sqlite.org/quirks.html. |
@CorvinM wrote:
Thanks; I'm curious how you intend to solve it :) |
…e sequences This also reverts 400a1ce.
Reverted per Serhiy's request.
* main: pythongh-108520: Fix bad fork detection in nested multiprocessing use case (python#108568) pythongh-108590: Revert pythongh-108657 (commit 400a1ce) (python#108686) pythongh-108494: Argument Clinic: Document how to generate code that uses the limited C API (python#108584) Document Python build requirements (python#108646) pythongh-101100: Fix Sphinx warnings in the Logging Cookbook (python#108678) Fix typo in multiprocessing docs (python#108666) pythongh-108669: unittest: Fix documentation for TestResult.collectedDurations (python#108670) pythongh-108590: Fix sqlite3.iterdump for invalid Unicode in TEXT columns (python#108657) Revert "pythongh-103224: Use the realpath of the Python executable in `test_venv` (pythonGH-103243)" (pythonGH-108667) pythongh-106320: Remove private _Py_ForgetReference() (python#108664) Mention Ellipsis pickling in the docs (python#103660) Revert "Use non alternate name for Kyiv (pythonGH-108533)" (pythonGH-108649) pythongh-108278: Deprecate passing the first param of sqlite3.Connection callback APIs by keyword (python#108632) pythongh-108455: peg_generator: install two stubs packages before running mypy (python#108637) pythongh-107801: Improve the accuracy of io.IOBase.seek docs (python#108268)
…on#108686) (cherry picked from commit 2a3926f) Reverted per Serhiy's request.
Surrogate-escape seems to be different, I cannot recreate the same DB dump from iterdump when using it (its turing the \x9f into \udc9f) but can with the original chr() version. Adding a test to #108695 for dump reproducibility to check this. |
Document how to handle table columns with invalid Unicode sequences.
@serhiy-storchaka, @CorvinM: As an alternative, I created #108699 in order to try to solve this in documentation only. |
Let see. It is a complex issue, like every encoding issue. The SQLite database can contain non UTF-8 sequences of bytes as a text. To work it around, there is a special mechanism: you can set
So what can we do?
(2) and (3) are complex tasks and may require separate discussions about details. For example, should |
+1
IMO, this is the preferred solution. See #108699 for a draft docs update.
I'm not sure this is a good idea. The sqlite3 module is a DB API (PEP-249) implementation, and the DB API dictates the param spec of
Using |
I suggest we focus on 1. for now, the docs update. Well, in my opinion, docs are as hard to get right as encoding issues, so feel free to chime in 😄 |
At least it works with older versions of Python. |
…those (#108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…andle those (pythonGH-108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…andle those (pythonGH-108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…handle those (GH-108699) (#111325) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…handle those (GH-108699) (#111324) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
I see lots of merged PRs for this issue. Can it be closed or is there more to be done? |
The sqlite3 docs now includes a section about encoding issues, so it should hopefully be easier to avoid issues like this. However, @serhiy-storchaka is not satisfied with how the docs turned out and has flagged that he wants to improve them further. Perhaps those improvements should be done in a separate issue, though. I'm fine with closing this. |
…andle those (python#108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…andle those (python#108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
Bug report
Checklist
and am confident this bug has not been reported before
CPython versions tested on:
3.11
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.11.5 (main, Aug 25 2023, 13:19:53) [GCC 9.4.0]
A clear and concise description of the bug:
Apologies if I'm misunderstanding. Please advice if I should post elsewhere. But shouldn't iterdump() properly detect VARCHAR columns with binary data and output X'' strings instead of throwing an error? This is what
sqlite3 .dump
does.The above will throw an error:
I tried enabling
conn.text_factory = bytes
as a workaround, but now get a different error.Linked PRs
The text was updated successfully, but these errors were encountered: