[pydocstyle] docstrings encoding #3436

yuxqiu · 2023-03-10T12:29:32Z

I am currently working on PR #3408.

In this PR, I aim to make pydocstyle respect the line continuation character to reduce the number of false positives. But I soon realized that the line continuation character is just one part of a larger problem.

It turns out that to properly implement some pydocstyle rules, we should be able to unescape the string.

For example, pydocstyle reports no warning for the following code:

# pydocstyle 1.py --select D205
def no_problem():
    """Hello World.
    \nNo Problem.
    """

pydocstyle accepts this because it converts the \n to a real line break. So, I started implementing a method to get an unescaped character from the given string. It looks like this:

pub fn unescaped_docstring_char(chars: &mut Chars<'_>) -> Option<UnescapedDocStringChar> {
    let c = chars.next()?;
    // ...

    let res = match c {
        '\\' => {
            // must have at least one character after it
            // otherwise, it will be rejected by the parser
            let res = match chars.next().unwrap() {
                '\n' => None,
                '\\' => Some('\\'),
                // ...
                'u' => {
                   // problems!!!
                }

                c => Some(c),
            };
            res
        }
        _ => Some(c),
    };

    // ...
}

As you can see, there is a problem with the \u escapes. The Python documentation describes the escape as:

\uxxxx | Character with 16-bit hex value xxxx

In Python, this character can be a surrogate because "strings are immutable sequences of Unicode code points". However, in Rust, a char is a ‘Unicode scalar value’, which is any ‘Unicode code point’ other than a surrogate code point, which means that we have no way to store a surrogate in Rust by using char or String.
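A minimal sketch of the constraint, using only the standard library: `char::from_u32` returns `None` for any value in the surrogate range U+D800 to U+DFFF, so the `'u'` arm above has no `char` to return for an escape such as `\uDE01`. The helper name `decode_u16_escape` is hypothetical and is not part of the PR.

```rust
/// Decode the four hex digits of a `\uXXXX` escape into a `char`.
/// Returns `None` when the digits are malformed or name a surrogate,
/// because Rust's `char` cannot represent surrogate code points.
/// (Hypothetical helper for illustration only.)
pub fn decode_u16_escape(hex: &str) -> Option<char> {
    // Require exactly four ASCII hex digits (this also rejects a leading `+`).
    if hex.len() != 4 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let code = u32::from_str_radix(hex, 16).ok()?;
    // `char::from_u32` is `None` for 0xD800..=0xDFFF.
    char::from_u32(code)
}
```

With this helper, `decode_u16_escape("0041")` yields `Some('A')`, while `decode_u16_escape("DE01")` yields `None`.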

This leaves us with the question of how to properly unescape such characters, store them in a string and mimic the behaviour of pydocstyle. Based on my experimentation, pydocstyle can correctly identify errors in the docstrings in the presence of surrogates:

# pydocstyle test.py --ignore D100
def f():
    """Hello World\uDE01."""

I currently have several solutions in mind:

Replace \uxxxx with some placeholders (like \ufffd) if it's a surrogate.
Treat the unescaped string as a sequence of Unicode codepoints like Python. We need to consider how we deal with whitespaces and check string equality. This will probably break many existing rules.
Completely ignore these escapes. We distinguish blank lines based only on \n and whitespace based only on literal whitespace characters, and make sure line continuation works correctly. However, this will cause us to deviate from pydocstyle's behaviour.
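To make the first option concrete, here is a rough sketch that unescapes a small subset of Python escapes and substitutes U+FFFD whenever a `\uXXXX` escape names a surrogate. The function name `unescape_lossy` is hypothetical, the sketch assumes the input has already been accepted by the Python parser (so four hex digits always follow `\u`), and it deliberately ignores the other escapes Python supports (`\x`, `\U`, `\N{...}`, octal, and so on).

```rust
/// Unescape a small subset of Python string escapes, substituting U+FFFD
/// for `\uXXXX` escapes that name a surrogate.
/// Hypothetical helper; escapes not listed here are copied through unchanged.
pub fn unescape_lossy(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // Backslash-newline is a line continuation: emit nothing.
            Some('\n') => {}
            Some('\\') => out.push('\\'),
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('u') => {
                // Consume the (assumed) four hex digits after `\u`.
                let hex: String = chars.by_ref().take(4).collect();
                let ch = u32::from_str_radix(&hex, 16)
                    .ok()
                    .and_then(char::from_u32) // `None` for surrogates
                    .unwrap_or('\u{FFFD}');
                out.push(ch);
            }
            // Unhandled escape: keep the backslash and the character as-is.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Trailing backslash (the parser should already reject this).
            None => out.push('\\'),
        }
    }
    out
}
```

The trade-off is that every surrogate collapses to the same placeholder, so two docstrings that differ only in their surrogates would compare equal after unescaping.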

Any thoughts on this?

Answered by charliermarsh

charliermarsh · 2023-03-10T23:08:00Z

charliermarsh
Mar 10, 2023
Maintainer

If I recall correctly... we used to use the evaluated string body (i.e., the s in Expr::Constant { kind: Constant::Str(s), .. }), which would probably give you the behavior that you're seeing in pydocstyle, since that gets evaluated by the parser, and so (e.g.) continuations wouldn't be included as part of s. But I thought this led to other pydocstyle deviations, and so we moved to using the raw string. I can't exactly remember the details unfortunately. We could look through the changelog...

It might be the case that pydocstyle uses slightly different representations for different rules. E.g., if you look at the source, they sometimes do lines = ast.literal_eval(docstring).strip().split('\n') to get the body. I think this would be equivalent to using the Constant representation as described above (which we do have access to). Can you try looking at docstring.expr, which should always be a constant string, IIRC?
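As a rough illustration, the `strip().split('\n')` step could be approximated on the already-evaluated string as follows. The helper name `body_lines` is hypothetical, and Rust's `str::trim` and Python's `str.strip` may not agree on every whitespace character, so this is an approximation rather than an exact port.

```rust
/// Approximate `ast.literal_eval(docstring).strip().split('\n')` for a string
/// that the parser has already evaluated.
/// Hypothetical helper for illustration only.
pub fn body_lines(evaluated: &str) -> Vec<&str> {
    // Trim leading and trailing whitespace, then split on line feeds,
    // keeping empty lines so that blank-line rules can still see them.
    evaluated.trim().split('\n').collect()
}
```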

2 replies

yuxqiu Mar 10, 2023
Author

It might be the case that pydocstyle uses slightly different representations for different rules.

Yep, I realized that too. So, I think we may need to include both the evaluated string and the raw string in the DocString class.

Can you try looking at docstring.expr, which should always be a constant string, IIRC?

Sure, I'll look at that later. Thanks for your help!

yuxqiu Mar 10, 2023
Author

I gave it a try and it looks like I can get the evaluated string by using the following method:

// Print the evaluated docstring body when the expression is a string constant.
if let rustpython_parser::ast::ExprKind::Constant {
    value: rustpython_parser::ast::Constant::Str(s),
    ..
} = &docstring.expr.node
{
    println!("{s}");
}

When I try this with the docstring that contains surrogates, the parser replaces these surrogates with replacement characters. (When I print them, I get �.)

So, I will continue to work on this PR and hope to complete it as soon as possible. 🚀

yuxqiu · 2023-03-11T00:19:01Z

yuxqiu
Mar 11, 2023
Author

@charliermarsh A quick follow-up on this question. If we rely on the evaluated string from the parser, do you have any idea how we can implement autofix? I don't think we can get information about the position of the characters in the evaluated string.

2 replies

charliermarsh Mar 11, 2023
Maintainer

Yeah that's quite challenging. We may not be able to support it, if we want to support evaluated strings.

A separate question, that I didn't really answer, is whether we need to support evaluated strings. We could support just continuations more easily than supporting all characters.
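For comparison, a continuation-only approach might look like the sketch below. It is only an illustration: the name `strip_continuations` is hypothetical, and it assumes a non-raw docstring with `\n` line endings.

```rust
/// Remove backslash-newline line continuations and leave every other
/// character, including other escape sequences, untouched.
/// Hypothetical helper for illustration only.
pub fn strip_continuations(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // `\` directly before a newline: drop both characters.
            Some('\n') => {}
            // Any other escape, including `\\`, is copied through as a pair,
            // so an escaped backslash is never mistaken for a continuation.
            Some(next) => {
                out.push('\\');
                out.push(next);
            }
            // Trailing backslash: keep it unchanged.
            None => out.push('\\'),
        }
    }
    out
}
```

Consuming escapes in pairs matters: a naive `replace("\\\n", "")` would also strip the second backslash of an escaped backslash that happens to sit at the end of a line.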

yuxqiu Mar 11, 2023
Author

I'm considering (again) whether we should do our own evaluation.

Pros:

We can extract location information and potentially suggest more accurate auto fixes.

Cons:

It needs a little more time and space. It's a trade-off between absolute correctness and performance.

I personally think we can try it because most docstrings are not very long and most of the rules that need to use evaluated docstrings only focus on a small part of the string.
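One way to keep location information, sketched under assumptions: evaluate the raw text ourselves and record, for each evaluated character, the byte range it came from in the raw source. The `Unescaped` type and `unescape_with_offsets` function are hypothetical names, and only a handful of escapes are handled here.

```rust
/// One evaluated character plus the byte range it came from in the raw text.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct Unescaped {
    pub ch: char,
    pub start: usize, // byte offset of the first raw character
    pub end: usize,   // byte offset just past the last raw character
}

/// Evaluate a few escapes while remembering where each result came from.
pub fn unescape_with_offsets(raw: &str) -> Vec<Unescaped> {
    let mut out = Vec::new();
    let mut iter = raw.char_indices().peekable();
    while let Some((start, c)) = iter.next() {
        if c != '\\' {
            out.push(Unescaped { ch: c, start, end: start + c.len_utf8() });
            continue;
        }
        // Decide what the escape evaluates to without consuming it yet.
        let evaluated = match iter.peek().map(|&(_, next)| next) {
            Some('\n') => None, // line continuation: emits nothing
            Some('\\') => Some('\\'),
            Some('n') => Some('\n'),
            Some('t') => Some('\t'),
            _ => {
                // Unhandled escape or trailing backslash: keep the backslash
                // and let the next character be processed on its own.
                out.push(Unescaped { ch: '\\', start, end: start + 1 });
                continue;
            }
        };
        // Consume the character that was peeked above.
        if let Some((idx, next)) = iter.next() {
            if let Some(ch) = evaluated {
                out.push(Unescaped { ch, start, end: idx + next.len_utf8() });
            }
        }
    }
    out
}
```

A rule could then run its checks on the `ch` values and, when it finds a violation, use `start` and `end` to build a fix against the raw source. The cost is the extra allocation per docstring mentioned in the cons above.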
