Replace row/column based `Location` with byte-offsets. #3931

MichaReiser · 2023-04-11T10:09:30Z

Summary

This PR changes ruff to use our own fork of RustPython, which replaces Location { row: u32, column: u32 } with TextSize (astral-sh/RustPython#4). The main motivation for this change is to ship the logical line rules. Enabling the logical line rules regresses performance by as much as 50% because the rules need to slice into the source string, which requires building and querying the LineIndex. Using byte offsets everywhere removes the need to build the LineIndex when lint rules inspect the source text; in exchange, the row and column information must be re-computed when rendering diagnostics. This is a favourable trade because most projects using ruff have very few diagnostics.

Notable Changes

`SourceCodeFile`

It is now necessary to always include the source code when passing Messages because the source text is necessary to re-compute the row and column positions for byte offsets (TextSize). Previously, the source text was only included when using --show-source. This results in a noticeable slowdown in projects with many (ten thousand) diagnostics.

`Locator`

The Locator now exposes methods to:

For a byte offset, compute its line's start and end positions. Useful to compute the indent or when replacing a full line.
For a text range, compute the offset of the first line and the last line in that range. Useful when replacing all lines in a range.

The computations are performed on demand without querying the LineIndex.
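A minimal sketch of how such on-demand line lookups can work without a line index. The function names `line_start`, `line_end`, and `indentation` are hypothetical and are not the actual `Locator` API; offsets are plain `usize` values standing in for `TextSize`.

```rust
/// Hypothetical helper: offset of the first byte of the line containing `offset`.
fn line_start(source: &str, offset: usize) -> usize {
    // Scan backwards for the previous line terminator; no line index is needed.
    source[..offset]
        .rfind(|c: char| c == '\n' || c == '\r')
        .map_or(0, |newline| newline + 1)
}

/// Hypothetical helper: offset just past the line's text, excluding the newline.
fn line_end(source: &str, offset: usize) -> usize {
    source[offset..]
        .find(|c: char| c == '\n' || c == '\r')
        .map_or(source.len(), |newline| offset + newline)
}

/// Leading whitespace of the line containing `offset`.
fn indentation(source: &str, offset: usize) -> &str {
    let line = &source[line_start(source, offset)..line_end(source, offset)];
    let trimmed = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
    &line[..line.len() - trimmed.len()]
}
```

Both scans are bounded by the length of a single line, which is why this is cheap compared to indexing the whole file.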

The Locator still has a lazily computed LineIndex because we have a few diagnostics that use a line number as part of their message.

`SourceCode`

SourceCode now provides methods to compute the SourceLocation (row and column information) given an offset.
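As a rough illustration of the offset-to-location conversion, the sketch below builds a list of line-start offsets and binary-searches it. The `SourceLocation` struct here is a stand-in defined for this example, one-indexed rows and columns are an assumption of the sketch, and only `\n` is treated as a line terminator.

```rust
/// Illustrative stand-in for a row/column position (one-indexed in this sketch).
#[derive(Debug, PartialEq)]
struct SourceLocation {
    row: usize,
    column: usize,
}

/// Byte offsets at which each line starts (a minimal line index, `\n` only).
fn line_starts(source: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, byte) in source.bytes().enumerate() {
        if byte == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// Convert a byte offset to a row/column pair by binary-searching the line starts.
fn source_location(source: &str, starts: &[usize], offset: usize) -> SourceLocation {
    // `partition_point` returns the number of line starts that are <= offset.
    let row = starts.partition_point(|&start| start <= offset);
    let line_start = starts[row - 1];
    // Count characters, not bytes, so a multi-byte character occupies one column.
    let column = source[line_start..offset].chars().count() + 1;
    SourceLocation { row, column }
}
```

The index only has to be built for files that actually produce diagnostics, which matches the trade-off described in the summary.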

`UniversalNewline`

The UniversalNewline iterator now returns Line items instead of &str. This is necessary because many lints need to know the offset of the nth line and summing the text.text_len() doesn't give you the right result because the text does not include the trailing newline character:

def f:
    pass

The text length of the first line is 6 bytes because the line does not include the trailing newline character, but the second line starts at offset 7.

The Line struct provides methods to get a line's start offset, end offset, range, and text. It also provides methods to get the text, end offset, and range including the trailing newline character.
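The idea can be sketched as follows. This `Line` type and the `lines_with_offsets` function are simplified stand-ins written for this example, not the real `UniversalNewline` iterator; the sketch collects into a `Vec` rather than iterating lazily.

```rust
/// Illustrative stand-in for a line that remembers where it starts.
#[derive(Debug, PartialEq)]
struct Line<'a> {
    /// Byte offset of the first character of the line.
    start: usize,
    /// Line text without the trailing newline.
    text: &'a str,
}

impl Line<'_> {
    /// Offset just past the text, excluding the newline.
    fn end(&self) -> usize {
        self.start + self.text.len()
    }
}

/// Split `source` into lines, tracking each line's start offset.
/// Recognizes `\n`, `\r\n`, and `\r` as line terminators.
fn lines_with_offsets(source: &str) -> Vec<Line<'_>> {
    let bytes = source.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        let newline_len = match bytes[i] {
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
            b'\n' | b'\r' => 1,
            _ => {
                i += 1;
                continue;
            }
        };
        lines.push(Line { start, text: &source[start..i] });
        i += newline_len;
        start = i;
    }
    // A final line without a trailing newline.
    if start < bytes.len() {
        lines.push(Line { start, text: &source[start..] });
    }
    lines
}
```

Because each line carries its own start offset, a rule never has to sum text lengths and guess the width of the newline sequence.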

Use `TextRange` for ranges

Consistently uses TextRange in place of (Location, Location) tuples and start: Location, end: Location pairs, because TextRange better communicates that the two offsets are related.

Replaces all references of Range with TextRange and deletes Range.

Use `TextSize` instead of `Location`

Replaces all references to Location with TextSize.

`Stylist`

This PR removes the lazy computations for indentation and quote because slicing into the source string is now cheap.

`Indexer`

The Indexer used to store the line numbers of commented lines, lines with continuations, and lines with multiline strings. This is no longer feasible because it would require computing the line numbers. The new implementation stores the line-start offset for continuation lines and the TextRange for comments and multiline strings.

Storing the TextRange instead of line numbers helped to fix a false-negative where a mixed spaces-tab indent at the start of a multiline string was not reported because the analysis incorrectly assumed that it is part of the multiline string.
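A simplified reading of that false negative, as a sketch: a line-based check suppresses anything on a line that contains part of a multiline string, while a range-based check suppresses only offsets that lie inside the string itself. Both functions are hypothetical and use `std::ops::Range<usize>` in place of `TextRange`.

```rust
use std::ops::Range;

/// Line-based check (simplified): suppress when the line holds part of a multiline string.
fn suppressed_by_line(string_lines: &[usize], line: usize) -> bool {
    string_lines.contains(&line)
}

/// Range-based check (simplified): suppress only when the offset is inside a string's byte range.
fn suppressed_by_range(string_ranges: &[Range<usize>], offset: usize) -> bool {
    string_ranges.iter().any(|range| range.contains(&offset))
}
```

For the source `"\t x = \"\"\"a\nb\"\"\"\n"`, the string occupies bytes 6..15 on lines 1 and 2. The tab at offset 0 sits on line 1, so the line-based check suppresses it, whereas the range-based check does not because offset 0 is outside 6..15.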

Noqa

This PR now stores the TextRange of the line for each noqa comment, sorted in ascending order by start position. Testing whether a diagnostic is suppressed requires a binary search on the ranges to test whether any range contains the diagnostic's start location.
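The lookup can be sketched with the standard library's `binary_search_by`. The function name `find_containing` is hypothetical, and half-open `Range<usize>` values stand in for `TextRange`; whether the real ranges treat their end as inclusive is not stated here, so that detail is an assumption of the sketch.

```rust
use std::cmp::Ordering;
use std::ops::Range;

/// Returns the index of the range containing `offset`, if any.
/// `ranges` must be sorted by start offset and must not overlap.
fn find_containing(ranges: &[Range<usize>], offset: usize) -> Option<usize> {
    ranges
        .binary_search_by(|range| {
            if range.end <= offset {
                // The whole range lies before the offset.
                Ordering::Less
            } else if range.start > offset {
                // The whole range lies after the offset.
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .ok()
}
```

The search is logarithmic in the number of noqa comments and needs no line numbers at all.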

This PR further replaces the mapping to suppress some syntaxes on other lines by a TextRange vector where every entry means that a position falling into that range should be remapped to the end of the range.
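A small sketch of that remapping, assuming the vector is sorted and non-overlapping. The function name `remap` is hypothetical, and `Range<usize>` again stands in for `TextRange`.

```rust
use std::ops::Range;

/// Remap `offset` to the end of the mapping range that contains it, if any.
/// `mappings` must be sorted by start offset and must not overlap.
fn remap(mappings: &[Range<usize>], offset: usize) -> usize {
    // Index of the first range that ends after `offset`.
    let index = mappings.partition_point(|range| range.end <= offset);
    match mappings.get(index) {
        Some(range) if range.start <= offset => range.end,
        _ => offset,
    }
}
```

An offset outside every range is returned unchanged, so callers can apply the remapping unconditionally before the noqa lookup.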

isort directives

Similar to noqa. It now stores the TextRanges instead of the line numbers for the areas where sorting is disabled. This PR now only stores the TextSize for split positions as this proves to be sufficient.

Benchmark

TLDR: 10% performance improvement for projects with few diagnostics. Identical performance or small regression for projects with many diagnostics. The new implementation with logical-lines enabled outperforms main with logical-lines disabled.

Micro Benchmarks

This PR improves the default-rules benchmark by 6-15% and the all-rules benchmark by 4-8%. More importantly, ruff with logical-lines enabled is as fast or even faster than main. This should allow us to ship logical lines without causing a runtime regression.

group                                      bytes                                  bytes-logical                          main                                   main-logical
-----                                      -----                                  -------------                          ----                                   ------------
linter/all-rules/large/dataset.py          1.00      8.6±0.08ms     4.7 MB/sec    1.03      8.9±0.17ms     4.6 MB/sec    1.04      8.9±0.12ms     4.6 MB/sec    1.10      9.4±0.01ms     4.3 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.0±0.07ms     8.3 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec    1.13      2.3±0.00ms     7.4 MB/sec
linter/all-rules/numpy/globals.py          1.00    220.7±4.63µs    13.4 MB/sec    1.04    228.8±2.74µs    12.9 MB/sec    1.08    238.8±1.17µs    12.4 MB/sec    1.16    256.8±1.25µs    11.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.5±0.04ms     7.3 MB/sec    1.05      3.7±0.10ms     6.9 MB/sec    1.07      3.7±0.02ms     6.8 MB/sec    1.14      4.0±0.02ms     6.4 MB/sec
linter/default-rules/large/dataset.py      1.00      4.3±0.07ms     9.5 MB/sec    1.07      4.6±0.03ms     8.8 MB/sec    1.08      4.6±0.05ms     8.8 MB/sec    1.18      5.1±0.08ms     8.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    870.1±5.69µs    19.1 MB/sec    1.12    971.9±3.08µs    17.1 MB/sec    1.15   1002.2±4.42µs    16.6 MB/sec    1.27   1108.6±3.25µs    15.0 MB/sec
linter/default-rules/numpy/globals.py      1.00     95.9±0.72µs    30.8 MB/sec    1.08    103.2±1.69µs    28.6 MB/sec    1.06    102.0±1.01µs    28.9 MB/sec    1.19    114.5±0.47µs    25.8 MB/sec
linter/default-rules/pydantic/types.py     1.00  1883.0±10.58µs    13.5 MB/sec    1.12      2.1±0.01ms    12.1 MB/sec    1.10      2.1±0.00ms    12.3 MB/sec    1.23      2.3±0.01ms    11.0 MB/sec

It's worth pointing out that the relative slowdown introduced by enabling the logical lines lint rules remains unchanged. I'm surprised by this because it doesn't show the improvement I expected from removing the LineIndex computation from the linting path.

group                                      main                                   main-logical
-----                                      ----                                   ------------
linter/all-rules/large/dataset.py          1.00      8.9±0.12ms     4.6 MB/sec    1.06      9.4±0.01ms     4.3 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.1±0.03ms     7.9 MB/sec    1.07      2.3±0.00ms     7.4 MB/sec
linter/all-rules/numpy/globals.py          1.00    238.8±1.17µs    12.4 MB/sec    1.08    256.8±1.25µs    11.5 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.7±0.02ms     6.8 MB/sec    1.07      4.0±0.02ms     6.4 MB/sec
linter/default-rules/large/dataset.py      1.00      4.6±0.05ms     8.8 MB/sec    1.09      5.1±0.08ms     8.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1002.2±4.42µs    16.6 MB/sec    1.11   1108.6±3.25µs    15.0 MB/sec
linter/default-rules/numpy/globals.py      1.00    102.0±1.01µs    28.9 MB/sec    1.12    114.5±0.47µs    25.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.1±0.00ms    12.3 MB/sec    1.12      2.3±0.01ms    11.0 MB/sec

group                                      bytes                                  bytes-logical
-----                                      -----                                  -------------
linter/all-rules/large/dataset.py          1.00      8.6±0.08ms     4.7 MB/sec    1.03      8.9±0.17ms     4.6 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.0±0.07ms     8.3 MB/sec    1.05      2.1±0.03ms     7.9 MB/sec
linter/all-rules/numpy/globals.py          1.00    220.7±4.63µs    13.4 MB/sec    1.04    228.8±2.74µs    12.9 MB/sec
linter/all-rules/pydantic/types.py         1.00      3.5±0.04ms     7.3 MB/sec    1.05      3.7±0.10ms     6.9 MB/sec
linter/default-rules/large/dataset.py      1.00      4.3±0.07ms     9.5 MB/sec    1.07      4.6±0.03ms     8.8 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    870.1±5.69µs    19.1 MB/sec    1.12    971.9±3.08µs    17.1 MB/sec
linter/default-rules/numpy/globals.py      1.00     95.9±0.72µs    30.8 MB/sec    1.08    103.2±1.69µs    28.6 MB/sec
linter/default-rules/pydantic/types.py     1.00  1883.0±10.58µs    13.5 MB/sec    1.12      2.1±0.01ms    12.1 MB/sec

CPython

This benchmark measures the worst-case performance: A project with many violations.

Performance regresses for loading cached results (except when using --show-source). This is expected because printing diagnostics now always requires storing the source text and computing the source locations adds some overhead as well.
Slight improvement for --no-cache as seen in the micro benchmarks
10% performance improvement when using silent (-s). This shows the potential of the refactor for projects with few or no diagnostics. Silent still pays the overhead for storing the source text for every diagnostic, but the implementation doesn't compute the LineIndex.

Benchmark results

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e`	44.4 ± 2.0	40.3	49.0	1.17 ± 0.08
`./ruff-main ./crates/ruff/resources/test/cpython/ -e`	37.9 ± 1.8	34.1	43.5	1.00
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache`	200.2 ± 4.0	194.5	210.1	5.28 ± 0.28
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache`	215.7 ± 6.5	203.5	228.0	5.69 ± 0.32
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --select=ALL`	401.6 ± 9.6	391.4	420.1	10.60 ± 0.57
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --select=ALL`	395.6 ± 8.8	383.8	412.5	10.44 ± 0.55
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL`	691.1 ± 9.6	677.6	709.3	18.24 ± 0.91
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL`	716.7 ± 11.3	700.7	736.3	18.91 ± 0.96
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --show-source`	85.7 ± 2.4	81.7	93.2	2.26 ± 0.13
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --show-source`	85.0 ± 2.0	81.9	90.4	2.24 ± 0.12
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --show-source`	245.7 ± 4.4	238.5	251.2	6.48 ± 0.33
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --show-source`	259.8 ± 5.1	251.3	272.6	6.86 ± 0.36
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --select=ALL --show-source`	1507.1 ± 18.6	1474.6	1531.3	39.77 ± 1.98
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --select=ALL --show-source`	1459.6 ± 23.5	1426.0	1500.0	38.52 ± 1.96
`./ruff-bytes ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL --show-source`	1770.1 ± 21.8	1747.0	1822.1	46.71 ± 2.32
`./ruff-main ./crates/ruff/resources/test/cpython/ -e --no-cache --select=ALL --show-source`	1799.7 ± 31.7	1762.7	1856.2	47.49 ± 2.44

Home Assistant

Best case benchmark: A project with very few diagnostics (10).

Performance is unchanged when running with caching
Performance improves by about 10% when caching is disabled.

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`../ruff/ruff-bytes . -e`	51.4 ± 2.0	44.6	55.7	1.00
`../ruff/ruff-main . -e`	51.4 ± 2.3	45.4	57.3	1.00 ± 0.06
`../ruff/ruff-bytes . -e --no-cache`	360.0 ± 3.0	355.4	364.0	7.01 ± 0.28
`../ruff/ruff-main . -e --no-cache`	387.4 ± 5.9	380.1	396.2	7.54 ± 0.32

Enabling logical lines introduces many new errors (2000), so this is no longer the best case. With caching disabled, the new implementation still outperforms the old one with logical-lines enabled and remains about 10% faster; with caching, the two are within noise of each other.

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`../ruff/ruff-bytes-logical . -e`	53.7 ± 2.3	49.1	59.8	1.03 ± 0.06
`../ruff/ruff-main-logical . -e`	52.1 ± 2.4	46.6	56.6	1.00
`../ruff/ruff-bytes-logical . -e --no-cache`	383.2 ± 2.7	378.9	387.6	7.35 ± 0.34
`../ruff/ruff-main-logical . -e --no-cache`	417.7 ± 6.3	411.8	433.6	8.01 ± 0.38

Test Plan

The ecosystem check shows no changes (after applying Fix (doc-)line-too-long start location #4006, JSON Emitter: Use one indexed column numbers for edits #4007, and Set non-empty range for indentation diagnostics #4005)
WASM playground
Test LSP
Test fixes across repositories
Test --add-noqa with airflow repository
Test --fix with airflow repository

Breaking Changes

This PR changes the column numbers of fixes in the JSON output to be one indexed to align the column numbers with the Diagnostic start and end columns. I can undo this change but I got it "for free" by using SourceLocation consistently.

I reverted the change in this PR and extracted it into #4007

evanrittenhouse · 2023-04-11T15:29:37Z

Just for my own curiosity, what's the context for this? Why's it necessary?

MichaReiser · 2023-04-12T07:18:13Z

> Just for my own curiosity, what's the context for this? Why's it necessary?

Strictly speaking, it isn't necessary from a functional point of view, but using byte offsets helps to improve performance and reduce memory consumption.

I started investigating switching to byte offsets because enabling the pycodestyle (logical line) rules #3689 results in a 20%-50% performance regression, even though I already improved the performance of the rules themselves. A key observation is that the default-rules benchmarks regress more than the all-rules benchmarks.

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     13.5±0.06ms     3.0 MB/sec    1.20     16.3±0.09ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.5±0.01ms     4.7 MB/sec    1.18      4.2±0.01ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.00    490.6±2.52µs     6.0 MB/sec    1.16    567.3±0.81µs     5.2 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.0±0.02ms     4.3 MB/sec    1.23      7.4±0.02ms     3.5 MB/sec
linter/default-rules/large/dataset.py      1.00      7.2±0.01ms     5.7 MB/sec    1.38      9.9±0.04ms     4.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1615.1±3.91µs    10.3 MB/sec    1.39      2.3±0.01ms     7.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    180.5±0.32µs    16.3 MB/sec    1.52    274.0±4.45µs    10.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.4±0.01ms     7.6 MB/sec    1.41      4.8±0.02ms     5.4 MB/sec

This is because the pycodestyle rules are the first rules in the default set that inspect the source code (trivia). The challenge with inspecting the source code is that you can't slice a string with a row/column location. This isn't possible: string[row=1, column=5..10]. Ruff works around this by building a LineIndex that maps row and column locations to byte offsets (somewhat expensive), and then uses that index (lookup is cheap) to retrieve the string locations.
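To make the difference concrete, here is a sketch of both slicing paths. The function names are hypothetical, rows are zero-indexed and columns are byte columns for simplicity, and only `\n` is treated as a line terminator; the real LineIndex is more involved.

```rust
use std::ops::Range;

/// Build a minimal line index: the byte offset at which each line starts.
/// This requires a full scan of the source.
fn build_line_index(source: &str) -> Vec<usize> {
    std::iter::once(0)
        .chain(source.match_indices('\n').map(|(i, _)| i + 1))
        .collect()
}

/// Slice by a zero-indexed row and a byte-column range; needs the index.
fn slice_by_row_column<'a>(
    source: &'a str,
    index: &[usize],
    row: usize,
    columns: Range<usize>,
) -> &'a str {
    let line_start = index[row];
    &source[line_start + columns.start..line_start + columns.end]
}

/// Slice by byte offsets; no index is needed at all.
fn slice_by_offsets(source: &str, range: Range<usize>) -> &str {
    &source[range]
}
```

With row/column locations, every file whose source text is inspected pays for `build_line_index`; with byte offsets, that cost disappears from the linting path.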

My goal with using byte-offsets is to remove the need to build a LineIndex from the linting phase. It will still be necessary to build the line index when we emit diagnostics, because, as a user, I strongly prefer row/column numbers over byte indices ;). Having to build a LineIndex for all files with diagnostics may result in a performance regression for projects where most files have diagnostics, but this is rare for projects using ruff that tend to have zero or only few diagnostics.

There are other, non-pycodestyle specific reasons why I want to adopt byte offsets:

Code size reduction: Comparing a byte offset is a single u32 comparison, compared to two u32 comparisons for a row/column based source location. I further made end_location mandatory on Located, which helps remove some unwrap code paths. The result for ruff_python_formatter is a small code size reduction: ~4.4 MB → 4.3 MB
Reduced memory consumption: A row/column based Location uses 8 bytes (4 bytes each for the row and the column). A byte offset uses only half of that (a single u32). This means we reduce the size of every LexResult, Located, and AST node by 8 bytes (-4 bytes for the start, and -4 bytes for the end locations).
Performance improvements: Writing and reading less data, and fewer CPU instructions should help to improve overall performance. A preliminary benchmark comparing the parser and visitor (counting all statements) performance shows a 10-15% performance improvement. Whether we're able to reap all these improvements in Ruff depends on how efficiently we can rewrite the logic that uses row information in the linting phase today (noqa, isort comments, commented lines, etc)

parser/numpy/globals.py time:   [65.752 µs 65.844 µs 65.973 µs]
                        thrpt:  [44.725 MiB/s 44.813 MiB/s 44.876 MiB/s]
                 change:
                        time:   [-5.8033% -5.5729% -5.3542%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6571% +5.9018% +6.1609%]
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  15 (15.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
parser/pydantic/types.py
                        time:   [1.4442 ms 1.4453 ms 1.4466 ms]
                        thrpt:  [17.630 MiB/s 17.646 MiB/s 17.659 MiB/s]
                 change:
                        time:   [-12.393% -12.225% -11.904%] (p = 0.00 < 0.05)
                        thrpt:  [+13.512% +13.927% +14.146%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
parser/numpy/ctypeslib.py
                        time:   [647.58 µs 650.18 µs 652.90 µs]
                        thrpt:  [25.503 MiB/s 25.610 MiB/s 25.713 MiB/s]
                 change:
                        time:   [-14.351% -14.154% -13.948%] (p = 0.00 < 0.05)
                        thrpt:  [+16.209% +16.488% +16.756%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  17 (17.00%) high mild
  1 (1.00%) high severe
parser/large/dataset.py time:   [3.5024 ms 3.5104 ms 3.5195 ms]
                        thrpt:  [11.559 MiB/s 11.589 MiB/s 11.616 MiB/s]
                 change:
                        time:   [-11.825% -11.603% -11.374%] (p = 0.00 < 0.05)
                        thrpt:  [+12.834% +13.126% +13.411%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
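The memory argument from the list above (8 bytes for a row/column Location versus 4 bytes for a byte offset) can be checked with `std::mem::size_of`. The structs below are stand-ins defined for this sketch, not the RustPython or ruff types.

```rust
/// Stand-in for the old row/column location.
#[allow(dead_code)]
struct Location {
    row: u32,
    column: u32,
}

/// Stand-in for a byte offset.
#[allow(dead_code)]
struct TextSize(u32);

/// A node that stores its start and end as row/column pairs.
#[allow(dead_code)]
struct LocatedOld {
    start: Location,
    end: Location,
}

/// The same node storing byte offsets instead.
#[allow(dead_code)]
struct LocatedNew {
    start: TextSize,
    end: TextSize,
}

/// Bytes saved per node by switching representations.
fn bytes_saved_per_node() -> usize {
    std::mem::size_of::<LocatedOld>() - std::mem::size_of::<LocatedNew>()
}
```

On common targets this yields a saving of 8 bytes per node, in line with the figure quoted above.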

Other compilers using byte offsets:

Zig: string interning
Roslyn: C#/VisualBasic compiler
Rust

MichaReiser · 2023-04-12T08:34:22Z

Current dependencies on/for this PR:

main
- PR Add parser benchmark #3990
  - PR Replace row/column based Location with byte-offsets. #3931 👈
    - PR Use memchr to speedup newline search on x86 #3985
    - PR perf(logical-lines): Various small perf improvements #4022
      - PR Use iter for logical lines #4120
    - PR Replace LexResult with Tok::Error #4121

This comment was auto-generated by Graphite.

evanrittenhouse · 2023-04-13T01:10:53Z

Thanks for the detailed explanation @MichaReiser! If you don't mind - where do the offsets come from? Column offset makes sense (obviously just offsetting from index 0), but how are rows represented? Or is a total byte-offset calculated from what is effectively row 0, column 0?

E: Never mind, just found locator.rs. Seems like we essentially treat the file as one giant string, then define rows as being delimited by \n/\r and then the columns as the offsets from that offset?

github-actions · 2023-04-14T19:04:09Z

PR Check Results

Ecosystem

ℹ️ ecosystem check detected changes. (+0, -16, 0 error(s))

airflow (+0, -7)

- airflow/api_connexion/endpoints/task_instance_endpoint.py:274:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/amazon/aws/secrets/systems_manager.py:200:16: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/docker/operators/docker.py:479:16: RET504 Unnecessary variable assignment before `return` statement
- airflow/providers/oracle/hooks/oracle.py:42:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/security/utils.py:83:12: RET504 Unnecessary variable assignment before `return` statement
- airflow/www/extensions/init_appbuilder.py:359:16: RET504 Unnecessary variable assignment before `return` statement
- tests/test_utils/gcp_system_helpers.py:65:12: RET504 Unnecessary variable assignment before `return` statement

bokeh (+0, -1)

- src/bokeh/core/property/datetime.py:165:16: RET504 Unnecessary variable assignment before `return` statement

zulip (+0, -8)

- zerver/data_import/rocketchat.py:141:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/message.py:186:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/narrow.py:891:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/lib/url_preview/oembed.py:50:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/models.py:184:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/basecamp/view.py:115:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/bitbucket2/view.py:436:12: RET504 Unnecessary variable assignment before `return` statement
- zerver/webhooks/zendesk/view.py:14:12: RET504 Unnecessary variable assignment before `return` statement

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.00     14.9±0.06ms     2.7 MB/sec    1.00     14.9±0.09ms     2.7 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.6±0.04ms     4.6 MB/sec    1.00      3.6±0.01ms     4.6 MB/sec
linter/all-rules/numpy/globals.py          1.00    380.3±1.31µs     7.8 MB/sec    1.00    378.7±1.49µs     7.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      6.2±0.01ms     4.1 MB/sec    1.00      6.2±0.01ms     4.1 MB/sec
linter/default-rules/large/dataset.py      1.00      7.6±0.01ms     5.3 MB/sec    1.00      7.6±0.03ms     5.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1596.6±2.95µs    10.4 MB/sec    1.00   1602.0±2.59µs    10.4 MB/sec
linter/default-rules/numpy/globals.py      1.00    174.3±0.29µs    16.9 MB/sec    1.00    174.0±0.63µs    17.0 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.4±0.01ms     7.5 MB/sec    1.00      3.4±0.01ms     7.5 MB/sec
parser/large/dataset.py                    1.01      5.9±0.00ms     6.8 MB/sec    1.00      5.9±0.01ms     6.9 MB/sec
parser/numpy/ctypeslib.py                  1.00   1144.2±2.10µs    14.6 MB/sec    1.00   1138.6±2.20µs    14.6 MB/sec
parser/numpy/globals.py                    1.00    116.8±0.22µs    25.3 MB/sec    1.00    117.1±0.19µs    25.2 MB/sec
parser/pydantic/types.py                   1.01      2.5±0.00ms    10.2 MB/sec    1.00      2.5±0.00ms    10.3 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
linter/all-rules/large/dataset.py          1.04     25.1±0.86ms  1657.1 KB/sec    1.00     24.3±0.86ms  1717.6 KB/sec
linter/all-rules/numpy/ctypeslib.py        1.02      6.2±0.44ms     2.7 MB/sec    1.00      6.1±0.30ms     2.7 MB/sec
linter/all-rules/numpy/globals.py          1.00   707.5±38.28µs     4.2 MB/sec    1.00   710.4±43.20µs     4.2 MB/sec
linter/all-rules/pydantic/types.py         1.00     10.1±0.47ms     2.5 MB/sec    1.02     10.3±0.40ms     2.5 MB/sec
linter/default-rules/large/dataset.py      1.00     12.3±0.87ms     3.3 MB/sec    1.00     12.2±0.46ms     3.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00      2.5±0.13ms     6.6 MB/sec    1.02      2.6±0.13ms     6.4 MB/sec
linter/default-rules/numpy/globals.py      1.00   300.2±17.16µs     9.8 MB/sec    1.02   305.2±19.67µs     9.7 MB/sec
linter/default-rules/pydantic/types.py     1.00      5.4±0.25ms     4.7 MB/sec    1.01      5.5±0.25ms     4.7 MB/sec
parser/large/dataset.py                    1.00      9.9±0.48ms     4.1 MB/sec    1.01      9.9±0.35ms     4.1 MB/sec
parser/numpy/ctypeslib.py                  1.00  1864.2±78.93µs     8.9 MB/sec    1.01  1881.3±62.66µs     8.9 MB/sec
parser/numpy/globals.py                    1.00    193.7±9.50µs    15.2 MB/sec    1.02   196.8±15.61µs    15.0 MB/sec
parser/pydantic/types.py                   1.00      4.1±0.17ms     6.2 MB/sec    1.01      4.1±0.18ms     6.2 MB/sec

MichaReiser · 2023-04-15T21:05:13Z

> Thanks for the detailed explanation @MichaReiser! If you don't mind - where do the offsets come from? Column offset makes sense (obviously just offsetting from index 0), but how are rows represented? Or is a total byte-offset calculated from what is effectively row 0, column 0?
>
> E: Never mind, just found locator.rs. Seems like we essentially treat the file as one giant string, then define rows as being delimited by \n/\r and then the columns as the offsets from that offset?

@evanrittenhouse, sorry for the late reply.

The RustPython Lexer generates the offsets. The old implementation counted the rows and columns (from the start of the row). The lexer increments the current row index and resets the column to zero for every new line character.

Byte offsets don't use row or columns. Instead, it's an offset from the beginning of the file. Think of the string as a byte array and the byte offset is the index into that array:

def f(): pass
x = 20

The position of the identifier f:

row/column representation: Location { row: 1, column: 4 }
byte offsets: TextSize::from(4)

The position of the = sign

row/column representation: Location { row: 2, column: 2 }
byte offsets: TextSize::from(16)
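The numbers above can be reproduced with a short sketch that walks the source the way the old lexer is described: increment the row and reset the column at every newline. The function name `row_column_at` is hypothetical; rows are one-indexed and columns zero-indexed to match the example.

```rust
/// Recompute the (row, column) pair for a byte offset by walking the source.
/// Rows are one-indexed and columns zero-indexed, as in the example above.
fn row_column_at(source: &str, offset: usize) -> (u32, u32) {
    let (mut row, mut column) = (1, 0);
    for c in source[..offset].chars() {
        if c == '\n' {
            // New line: bump the row and reset the column.
            row += 1;
            column = 0;
        } else {
            column += 1;
        }
    }
    (row, column)
}
```

The byte offset itself can be used to index the source directly, for example `&source[16..17]` yields `=` for the snippet above.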

MichaReiser · 2023-04-17T13:24:29Z

crates/ruff/src/rules/pycodestyle/snapshots/ruff__rules__pycodestyle__tests__W191_W19.py.snap

@@ -282,6 +282,16 @@ W19.py:133:1: W191 Indentation contains tabs
 137 | def test_keys(self):
    |

+W19.py:136:1: W191 Indentation contains tabs


My understanding is that this was a false negative: Indexer.strings incorrectly suppressed this violation because it is on a line with a string. It now gets correctly reported because we test whether the tab is inside a string range (rather than on a line that contains a string).

charliermarsh · 2023-04-19

I think you're right.

MichaReiser force-pushed the byte-offset-parser branch from 6ecd573 to 5c17126 Compare April 12, 2023 08:39

MichaReiser force-pushed the byte-offset-parser branch 2 times, most recently from a69f012 to 09cbc45 Compare April 14, 2023 17:28

This was referenced Apr 14, 2023

Fork RustPyton and change Location to TextSize RustPython/RustPython#4874

Closed

Replace row/column based Location with byte-offsets. astral-sh/RustPython#4

Merged

MichaReiser force-pushed the byte-offset-parser branch 2 times, most recently from 5161fdc to 9435ba5 Compare April 14, 2023 17:59

MichaReiser added the breaking Breaking API change label Apr 14, 2023

MichaReiser force-pushed the byte-offset-parser branch from 9435ba5 to 9c24e59 Compare April 14, 2023 18:10

MichaReiser force-pushed the byte-offset-parser branch from 1026903 to b2a19a9 Compare April 14, 2023 20:21

MichaReiser mentioned this pull request Apr 16, 2023

Use memchr to speedup newline search on x86 #3985

Merged

MichaReiser changed the base branch from main to add-parser-benchmark April 17, 2023 06:07

MichaReiser force-pushed the byte-offset-parser branch from 0e7a8fa to c477216 Compare April 17, 2023 06:07

MichaReiser mentioned this pull request Apr 17, 2023

Add parser benchmark #3990

Merged

MichaReiser force-pushed the byte-offset-parser branch from c477216 to 35c39a6 Compare April 17, 2023 06:50

MichaReiser force-pushed the byte-offset-parser branch from 1a1ea79 to 0963a3f Compare April 17, 2023 14:43

Base automatically changed from add-parser-benchmark to main April 17, 2023 14:44

MichaReiser added 12 commits April 26, 2023 11:20

Use Indent offsets

8b9c9f4

Lazy compute source file line index

ff55025

Delete PLC0999.py

bd38594

Cargo fmt

c6c540b

Fix add_rule script

f6a02e7

Fix column in playground

4ba63e4

Fix line too long column number, ...

1601f60

.. remove unnecessary `contains_line_break` calls, create non-empty range for `SyntaxErrors`

fix invalid escape sequence at end of file, sort messages by kind

bf1789b

Revert one-indexed columns for JSON ouptut

7b657bb

Undo visibility change

8c81e50

Address code-review feedback

bd586ef

Fix noqa handling

6ebe212

MichaReiser force-pushed the byte-offset-parser branch from 90fc963 to 5995306 Compare April 26, 2023 17:22

This was referenced Apr 26, 2023

Replace LexResult with Tok::Error #4121

Closed

Use iter for logical lines #4120

Closed

Fix byte offset errors

7893968

MichaReiser force-pushed the byte-offset-parser branch from 5995306 to 7893968 Compare April 26, 2023 17:24

Upgrade RustPython Parser

7336d55

MichaReiser enabled auto-merge (squash) April 26, 2023 18:06

MichaReiser merged commit cab65b2 into main Apr 26, 2023

MichaReiser deleted the byte-offset-parser branch April 26, 2023 18:11

MichaReiser mentioned this pull request Apr 26, 2023

Enable pycodestyle rules #3689

Merged

evanrittenhouse mentioned this pull request Apr 26, 2023

Make D410/D411 autofixes mutually exclusive #4110

Merged

Tenzer mentioned this pull request May 2, 2023

ruff 0.0.264 Homebrew/homebrew-core#129929

Merged

qarmin mentioned this pull request May 13, 2023

Ruff panics when fixing file and partially change its content #4406

Closed

charliermarsh mentioned this pull request May 23, 2023

Investigate performance regression for 100,000s of violations #4606

Closed

MichaReiser mentioned this pull request May 24, 2023

Improve Message sorting performance #4624

Merged
