Skip to content
This repository has been archived by the owner on May 11, 2023. It is now read-only.

Replace row/column based Location with byte-offsets. #4

Merged
merged 16 commits into from
Apr 26, 2023
Merged

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Apr 14, 2023

This PR replaces RustPython's Location { row: u32, column: u32 } with TextSize, a byte offset from the beginning of the file. The motivation of the change is that:

  • The size of Location is 8 bytes, whereas TextSize is only 4 bytes -> an 8 byte size reduction for every Lexer Item and 12 bytes for Located (This PR also changes Option<Location> to TextSize).
  • Benchmarks show a 10% performance improvement
  • Using a byte offset instead of a row/column-based location allows slicing into the source string without first converting from a row/column-based location to a byte offset.

Other notable changes

  • Located now stores a TextRange where the end_location is no longer optional (exposed via located.end()).
  • Changed the Item type of the Lexer from (Location, Tok, Location) to (Tok, TextRange)

See astral-sh/ruff#3931 for more details and the upstream work necessary in ruff to use byte offsets.

Micro Benchmarks

Created running cargo bench -p ruff_benchmark --bench parser in the ruff repository.

group                        bytes                                  main
-----                        -----                                  ----
parser/large/dataset.py      1.00      3.5±0.05ms    11.7 MB/sec    1.13      3.9±0.04ms    10.4 MB/sec
parser/numpy/ctypeslib.py    1.00    659.1±1.33µs    25.3 MB/sec    1.11   731.2±12.78µs    22.8 MB/sec
parser/numpy/globals.py      1.00     65.9±1.16µs    44.7 MB/sec    1.10     72.3±0.74µs    40.8 MB/sec
parser/pydantic/types.py     1.00  1486.2±12.91µs    17.2 MB/sec    1.07  1593.9±24.13µs    16.0 MB/sec

This change improves parsing and visiting performance by about 10%.

Downsides

The main downside of this change is that we now have to maintain our own fork of RustPython and that we would need to convert our AST to RustPython's AST if we ever decide to use RustPython to run e.g. tests or plugins.

Another downside is that this change is based on the assumption that it isn't necessary to compute row/column information for most parsed files, as is the case when linting a project that has adopted ruff.

I'm convinced that the speedups justify the change.

Target branch

This is a bit unclear to me. It could make sense to not merge to main or create a new upstream branch that keeps Location and is where we pull upstream changes in (or branch off when contributing upstream).

Cargo.toml Outdated Show resolved Hide resolved
@@ -0,0 +1,22 @@
[package]
name = "ruff_text_size"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to move ruff_text_size into this repository because:

  • I want to avoid git submodules in our main repository
  • ruff_text_size should rarely change

@MichaReiser MichaReiser force-pushed the byte-offset branch 10 times, most recently from 29ff5d8 to b16267a Compare April 17, 2023 11:10
@MichaReiser MichaReiser marked this pull request as ready for review April 17, 2023 15:56
.github/workflows/ci.yaml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
compiler/ast/asdl_rs.py Show resolved Hide resolved
compiler/ast/src/lib.rs Show resolved Hide resolved
common/Cargo.toml Show resolved Hide resolved
compiler/parser/python.lalrpop Outdated Show resolved Hide resolved
ruff_text_size/tests/serde.rs Show resolved Hide resolved
@charliermarsh
Copy link
Member

Yeah I think the biggest question for me here is the branching strategy... What I've done in the past is had main stay in-sync with RustPython, and then used a ruff branch in the fork (which Ruff depended on). At one point I tried to maintain a linear history via rebasing between the forked main and the ruff branch, but that runs the risk of deleting commits that Ruff depends on, since you're force-pushing. So then I started to manage the ruff branch via merges and it all just got really messy and confusing. That is to say: we should ignore what I've done in the past, because it didn't work well.

@MichaReiser
Copy link
Member Author

Yeah I think the biggest question for me here is the branching strategy... What I've done in the past is had main stay in-sync with RustPython, and then used a ruff branch in the fork (which Ruff depended on). At one point I tried to maintain a linear history via rebasing between the forked main and the ruff branch, but that runs the risk of deleting commits that Ruff depends on, since you're force-pushing. So then I started to manage the ruff branch via merges and it all just got really messy and confusing. That is to say: we should ignore what I've done in the past, because it didn't work well.

I don't have any experience with this but we could try to creat a new upstream branch that matches RustPython's main and we merge our changes into our main branch. We can then periodically pull in changes from upstream, and rebase main. Or does this create issues because it changes the commit revs and we use those in the Cargo.toml...

@MichaReiser
Copy link
Member Author

Using merge seems like the most straightforward approach from all the approaches listed in the friendly forks blog post

@MichaReiser
Copy link
Member Author

MichaReiser commented Apr 25, 2023

I haven't fully figured out how we want to keep our branch up to date, but rebasing merges seems promising, which is used by git-windows. It's currently hard to figure out the details without having a real merge to play through. That's why I want to merge this PR and look into documenting the merge strategy later.

@youknowone
Copy link

youknowone commented Apr 26, 2023

I'd like to share I am going to fully merge this changes into RustPython upstream.

2 motivations.

  • I know fragmentations with forks takes efforts. Looking back history of open source projects, once diff goes accumulated, the cost going to be increased by every update. I wish the entire rustpython-parser users can share the non-distracted efforts.
  • The goal of this patch is also valid for RustPython too. Errors are definitely rare than no-error.

One question is how much RustPython change is required to adapt this to upstream.
But I promise I will try my best to keep this integrated way, regardless how much effort it requires.

For now, I am waiting you to find the best design to enhance performance of ruff. Please let me know if the patch is stablized and you are satisfied.

@MichaReiser
Copy link
Member Author

MichaReiser commented Apr 26, 2023

I'd like to share I am going to fully merge this changes into RustPython upstream.

2 motivations.

* I know fragmentations with forks takes efforts. Looking back history of open source projects, once diff goes accumulated, the cost going to be increased by every update. I wish the entire rustpython-parser users can share the non-distracted efforts.

* The goal of this patch is also valid for RustPython too. Errors are definitely rare than no-error.

One question is how much RustPython change is required to adapt this to upstream. But I promise I will try my best to keep this integrated way, regardless how much effort it requires.

Oh wow, this is huge for us! I'm happy to give a hand with upstreaming these changes.

Something I noticed that requires change is that RustPython uses the BaseError's Display implementation to show errors to users. This will no longer work without storing the source code on the error and recomputing the line and column indices. It may be interesting to take a look at miette for representing errors with source locations.

For now, I am waiting you to find the best design to enhance performance of ruff. Please let me know if the patch is stablized and you are satisfied.

The design is finalized and I plan to merge this PR and the PR to ruff by the end of this week.

Copy link

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there more patches immediately mergable into rustpython?
Otherwise I will try to ship this patch on RustPython late this week and next week.

compiler/parser/python.lalrpop Outdated Show resolved Hide resolved
@@ -649,7 +649,7 @@ def write_ast_def(mod, typeinfo, f):
#![allow(clippy::derive_partial_eq_without_eq)]

pub use crate::constant::*;
pub use crate::Location;
pub use ruff_text_size::{TextSize, TextRange};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rustpython_parser::text_size if the only direct dependency is rustpython-parser?

Copy link
Member Author

@MichaReiser MichaReiser Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand the suggestion. Do you want to move ruff_text_size to the rustpython-parser or re-export the types for crates that only depend on rustpython_parser? Having the type in a separate crate does have advantages for us, for example, ruff_diagnostics no longer depends on rustpython_parser, which helps speed up our compile times.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure it is only used to replace Location or have more usage in ruff. Thank you!

@MichaReiser
Copy link
Member Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@MichaReiser MichaReiser merged commit c3147d2 into main Apr 26, 2023
@MichaReiser MichaReiser deleted the byte-offset branch April 26, 2023 18:00
@charliermarsh
Copy link
Member

Woohoo!

youknowone added a commit to youknowone/RustPython-parser that referenced this pull request May 9, 2023
- Split parser core and compiler core. Fix RustPython#14
- AST int type to `u32`
- Updated asdl_rs.py and update_asdl.sh fix RustPython#6
- Use `ruff_python_ast::SourceLocation` for Python source location. Deleted our own Location.
- Renamed ast::Located to ast::Attributed to distinguish terms for TextSize and SourceLocation
- `ast::<Node>`s for TextSize located ast. `ast::located::<Node>` for Python source located ast.
- And also strictly renaming `located` to refer only python location related interfaces.
- `SourceLocator` to convert locations.
- New `source-code` features of to disable python locations when unnecessary.
- Also including fully merging astral-sh/RustPython#4 closes RustPython#9
youknowone added a commit to youknowone/RustPython-parser that referenced this pull request May 9, 2023
- Split parser core and compiler core. Fix RustPython#14
- AST int type to `u32`
- Updated asdl_rs.py and update_asdl.sh fix RustPython#6
- Use `ruff_python_ast::SourceLocation` for Python source location. Deleted our own Location.
- Renamed ast::Located to ast::Attributed to distinguish terms for TextSize and SourceLocation
- `ast::<Node>`s for TextSize located ast. `ast::located::<Node>` for Python source located ast.
- And also strictly renaming `located` to refer only python location related interfaces.
- `SourceLocator` to convert locations.
- New `source-code` features of to disable python locations when unnecessary.
- Also including fully merging astral-sh/RustPython#4 closes RustPython#9
youknowone added a commit to youknowone/RustPython-parser that referenced this pull request May 9, 2023
- Split parser core and compiler core. Fix RustPython#14
- AST int type to `u32`
- Updated asdl_rs.py and update_asdl.sh fix RustPython#6
- Use `ruff_python_ast::SourceLocation` for Python source location. Deleted our own Location.
- Renamed ast::Located to ast::Attributed to distinguish terms for TextSize and SourceLocation
- `ast::<Node>`s for TextSize located ast. `ast::located::<Node>` for Python source located ast.
- And also strictly renaming `located` to refer only python location related interfaces.
- `SourceLocator` to convert locations.
- New `source-code` features of to disable python locations when unnecessary.
- Also including fully merging astral-sh/RustPython#4 closes RustPython#9
youknowone added a commit to youknowone/RustPython-parser that referenced this pull request May 9, 2023
- Split parser core and compiler core. Fix RustPython#14
- AST int type to `u32`
- Updated asdl_rs.py and update_asdl.sh fix RustPython#6
- Use `ruff_python_ast::SourceLocation` for Python source location. Deleted our own Location.
- Renamed ast::Located to ast::Attributed to distinguish terms for TextSize and SourceLocation
- `ast::<Node>`s for TextSize located ast. `ast::located::<Node>` for Python source located ast.
- And also strictly renaming `located` to refer only python location related interfaces.
- `SourceLocator` to convert locations.
- New `source-code` features of to disable python locations when unnecessary.
- Also including fully merging astral-sh/RustPython#4 closes RustPython#9
youknowone added a commit to youknowone/RustPython-parser that referenced this pull request May 10, 2023
- Split parser core and compiler core. Fix RustPython#14
- AST int type to `u32`
- Updated asdl_rs.py and update_asdl.sh fix RustPython#6
- Use `ruff_python_ast::SourceLocation` for Python source location. Deleted our own Location.
- Renamed ast::Located to ast::Attributed to distinguish terms for TextSize and SourceLocation
- `ast::<Node>`s for TextSize located ast. `ast::located::<Node>` for Python source located ast.
- And also strictly renaming `located` to refer only python location related interfaces.
- `SourceLocator` to convert locations.
- New `source-code` features of to disable python locations when unnecessary.
- Also including fully merging astral-sh/RustPython#4 closes RustPython#9
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants