-
Notifications
You must be signed in to change notification settings - Fork 7.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve JSONEachRowRowInputFormat
by skipping all remaining fields when all required fields are read
#62210
Improve JSONEachRowRowInputFormat
by skipping all remaining fields when all required fields are read
#62210
Conversation
b4d8f3a
to
9b3bf4e
Compare
some test results select count(uid) from
file('test.json', 'JSONEachRow', 'uid Nullable(String), latency Nullable(Int32), _region Nullable(String), shard Nullable(Int32)') settings max_threads=2
|
d5ed872
to
4fd2fc8
Compare
4fd2fc8
to
673912f
Compare
673912f
to
cbedf8c
Compare
This is an automated comment for commit bd63a31 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is with this improvement we might parse invalid JSONs without any issues, e.g. when after the last necessary column there is an invalid JSON field.
As by default input_format_allow_errors_num
is 0
, I think this might mean that some currently not parsable JSONs become parsable.
I would at least introduce a setting to be able to disable this.
What about other formats, where we can have extra columns? Shouldn't the same logic be implemented for them too?
@@ -186,6 +187,11 @@ void JSONEachRowRowInputFormat::readJSONObject(MutableColumns & columns) | |||
JSONUtils::skipColon(*in); | |||
readField(column_index, columns); | |||
} | |||
if (seen_columns_count >= total_columns) | |||
{ | |||
skipToUnescapedNextLineOrEOF(*in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct, we need to skip not up to the next line or EOF but to the next }
symbol that is not inside quotes and track brackets balance (actually, we also need to do it in syncAfterError
), because we support any delimiter betweeen JSON objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
select * from format(JSONEachRow, 'a UInt32', '{"a" : 42}, {"a" : 43}')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also string fields can have \n
, and it will break this logic:
select * from format(JSONEachRow, 'a UInt32', '{"a" : 42, "b" : "\n"}, {"a" : 43}')
So yes, we need to skip up to the end of JSON object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create a PR with fix for syncAfterError
with proper data skipping
UPD: it's not so obvious as I though, because when some row is a broken JSON we cannot just skip all up to the end of JSON object as there might me no end (as it's broken). So here we just have the easies and best solution - skip up to new line delimiter as in most cases it will work.
But in case of valid JSON we definitely should skip up to the end of JSON object. The similar code how to do it fast can be found in readJSONObjectOrArrayPossiblyInvalid
function in ReadHelpers
Totally agree with @antaljanosbenjamin, it should be done under a setting. Also we have setting |
@Avogar @antaljanosbenjamin The way to skip unused fields has been changed. I think it would have the same behavior as before in this way. |
JSONEachRowRowInputFormat
by skipping to the line end when all fields are readJSONEachRowRowInputFormat
by skipping all fields when all fields are read
JSONEachRowRowInputFormat
by skipping all fields when all fields are readJSONEachRowRowInputFormat
by skipping all fields when all remaining fields are read
JSONEachRowRowInputFormat
by skipping all fields when all remaining fields are readJSONEachRowRowInputFormat
by skipping all remaining fields when all required fields are read
Hi @antaljanosbenjamin could you review this again? The latest code just skip all remaining fields without looking up the |
@@ -161,6 +162,12 @@ void JSONEachRowRowInputFormat::readJSONObject(MutableColumns & columns) | |||
for (size_t key_index = 0; advanceToNextKey(key_index); ++key_index) | |||
{ | |||
StringRef name_ref = readColumnName(*in); | |||
if (seen_columns_count >= total_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would skip the duplicated fields check in readField
in case we read all columns. Considering that I think we should introduce a new setting to enable that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from this tiny change, I am okay with this PR.
src/Core/SettingsChangesHistory.h
Outdated
@@ -89,6 +89,7 @@ static std::map<ClickHouseVersion, SettingsChangesHistory::SettingsChanges> sett | |||
{"ignore_drop_queries_probability", 0, 0, "Allow to ignore drop queries in server with specified probability for testing purposes"}, | |||
{"lightweight_deletes_sync", 2, 2, "The same as 'mutation_sync', but controls only execution of lightweight deletes"}, | |||
{"query_cache_system_table_handling", "save", "throw", "The query cache no longer caches results of queries against system tables"}, | |||
{"input_format_json_throw_on_duplicated_fields", false, true, "Throw an exception if there are duplicated fields in JSON object in JSON input formats. If disabled, the first field will be used"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{"input_format_json_throw_on_duplicated_fields", false, true, "Throw an exception if there are duplicated fields in JSON object in JSON input formats. If disabled, the first field will be used"}, | |
{"input_format_json_throw_on_duplicated_fields", true, true, "Throw an exception if there are duplicated fields in JSON object in JSON input formats. If disabled, the first field will be used"}, |
In my understanding previously we thrown an exception, so the previous value of this should true
, not false
.
0e3a3dc
to
d1b80d7
Compare
Actually, I am not sure about introducing setting |
As @Avogar you have much more experience in input formats and how to handle new optimizations, settings, etc, plus I also think this is a nice optimization, I am happy if we enable the optimization by default. You also made me think again about the setting. And I think you are right: we should have a setting to enable the optimization and not to disable throwing on repeated fields, because right now if the first two fields have the same name, we will still throw an exception regardless of the value of the new setting. Thanks for jumping in! @lgbo-ustc My review seems to be a bit bad on this PR, sorry for that. I am still learning about formats and might miss things like this. Please let us know what do you think. |
Indeed I want to enable this optimization at default. This optimization is inspired by a case we meet recently, In our actual production applications, json strings with invalid format or duplicated fields are rare. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the really nice work!
Merging this PR. |
6189cd4
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
...
When all required fields are read, skip all remaining fields directly which can save a lot of comparison.
Documentation entry for user-facing changes
Modify your CI run:
NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: