Description of '\w' behavior is vague in `re` documentation #82747

snoopjedi · 2019-10-23T16:28:38Z

BPO	38566
Nosy	@serhiy-storchaka, @MojoVampire, @snoopjedi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-10-23.16:28:38.284>
labels = ['type-bug', 'docs']
title = "Description of '\\w' behavior is vague in `re` documentation"
updated_at = <Date 2019-10-24.08:02:46.475>
user = 'https://github.com/snoopjedi'

bugs.python.org fields:

activity = <Date 2019-10-24.08:02:46.475>
actor = 'xtreak'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation = <Date 2019-10-23.16:28:38.284>
creator = 'snoopjedi'
dependencies = []
files = []
hgrepos = []
issue_num = 38566
keywords = []
message_count = 3.0
messages = ['355239', '355250', '355257']
nosy_count = 4.0
nosy_names = ['docs@python', 'serhiy.storchaka', 'josh.r', 'snoopjedi']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue38566'
versions = []

snoopjedi · 2019-10-23T16:28:38Z

The documentation for the re library¹ describes the behavior of the specifier '\w' as matching "Unicode word characters," which is very vague. The closest thing I can find that corresponds to this language is the guidance offered in Unicode Technical Standard #18², which defines the class <word_character> to include all alphabetic and decimal codepoints, as well as U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER. This does not appear to be a correct description of re, however, as these zero-width characters are not counted when matching '\w', e.g.:

>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>

It seems from examining the CPython source³ that SRE treats '\w' as meaning any alphanumeric character OR U+005F LOW LINE (the underscore), which does not match any Unicode class definition I've been able to find.
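A minimal sketch of that reading, assuming that for str patterns the '\w' test reduces to str.isalnum() plus the underscore. The helper name is_word_char is made up for illustration, and the comparison covers only a sample of code points, not all of Unicode:

```python
import re

WORD = re.compile(r'\w')

def is_word_char(ch: str) -> bool:
    # Hypothetical helper: the assumed rule is "alphanumeric or underscore".
    return ch.isalnum() or ch == '_'

# Compare the regex against the assumed rule over the first 0x3000 code points.
mismatches = [
    hex(cp)
    for cp in range(0x3000)
    if bool(WORD.fullmatch(chr(cp))) != is_word_char(chr(cp))
]
print(mismatches)
```

An empty list means the two agree on that sample; it says nothing about code points outside it.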

Can anyone provide clarification on what part of Unicode this documentation is referring to? If there is some other definition, the documentation should be more specific about referring to it (and including a link would be preferred). If instead the documentation is incorrect, this language should be changed to describe the true meaning of \w.

¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125

MojoVampire · 2019-10-23T18:58:52Z

The definition of \w, historically, has corresponded to the set of characters that can occur in legal variable names in C (alphanumeric ASCII plus underscores, making it equivalent to [a-zA-Z0-9_] for ASCII regex). That's why, on top of the definitely wordy alphabetic characters, and the arguably wordy numerics, it includes the underscore, _.
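A small sketch of that equivalence, comparing '\w' under re.ASCII with the explicit class on a few hand-picked characters (the sample list is only illustrative):

```python
import re

ascii_w = re.compile(r'\w', re.ASCII)
explicit = re.compile(r'[a-zA-Z0-9_]')
unicode_w = re.compile(r'\w')

# 'é' is U+00E9 and the last digit is U+0663 ARABIC-INDIC DIGIT THREE.
samples = ['a', 'Z', '7', '_', '\u00e9', '\u0663', '-', ' ']

ascii_results = {ch: bool(ascii_w.fullmatch(ch)) for ch in samples}
explicit_results = {ch: bool(explicit.fullmatch(ch)) for ch in samples}
unicode_results = {ch: bool(unicode_w.fullmatch(ch)) for ch in samples}

print(ascii_results == explicit_results)
print(unicode_results['\u00e9'], ascii_results['\u00e9'])
```

Without the flag, the non-ASCII letter and digit in the sample also count as word characters.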

That definition predates Unicode entirely, and Python is just building on it by expanding the definition of "alphanumeric" to encompass all alphanumeric characters in Unicode.

We definitely can't remove underscores from the definition without breaking existing code which assumes a common subset of PCRE support (every regex flavor I know of includes underscores in \w). Adding the zero width characters seems of limited benefit (especially in the non-joiner case; if you're trying to pull out words, presumably you don't want to group letters across a non-joining boundary?). Basically, you're parsing "Unicode word characters" as "Unicode's definition of word characters", when it's really meant to mean "All word characters, not just ASCII".

You omitted the clarifying remarks from the documentation, though; the full description is:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

That's about as precise as I think we can make it (because technically, some of the things that count as "word characters" aren't actually part of an "alphabet" in the technical definition). If you think there is a clearer way of expressing it, please suggest a better phrasing, and this can be fixed as a documentation bug.

snoopjedi · 2019-10-23T19:25:34Z

Cheers for the additional context. My recommendation would be to change the language to avoid confusion with the consortium's formal specifications. Describing what SRE does should be fine:

Matches any alphanumeric Unicode character, as well as '_'. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

I think it'd also be nice for the term "alphanumeric Unicode character" to link to the documentation for str.isalnum(), which provides enough clarity for the user to work out exactly what Unicode category properties will end up qualifying as a match.
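As a rough illustration of how a reader could work this out, the sketch below lists the Unicode general category, the str.isalnum() result, and the '\w' result for a few hand-picked characters. The sample set is an arbitrary choice for illustration, not something taken from the documentation:

```python
import re
import unicodedata

# Letter, accented letter, ASCII digit, Arabic-Indic digit, superscript two,
# underscore, combining acute accent, zero width non-joiner, hyphen-minus.
samples = ['A', '\u00e9', '5', '\u0663', '\u00b2', '_', '\u0301', '\u200c', '-']

rows = []
for ch in samples:
    rows.append((
        f'U+{ord(ch):04X}',
        unicodedata.category(ch),
        ch.isalnum(),
        bool(re.fullmatch(r'\w', ch)),
    ))

for code, category, alnum, word in rows:
    print(f'{code}  {category}  isalnum={alnum!s:5}  \\w={word}')
```

In this sample the underscore is the only character where the two columns differ, and the combining mark and the zero-width non-joiner match neither.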

JelleZijlstra · 2022-04-30T20:45:23Z

Duplicate of #69929

snoopjedi mannequin assigned docspython Oct 23, 2019

snoopjedi mannequin added the docs Documentation in the Doc dir label Oct 23, 2019

tirkarthi added the type-bug An unexpected behavior, bug, or error label Oct 24, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

AlexWaygood added the topic-regex label Apr 19, 2022

slateny mentioned this issue Apr 28, 2022

gh-69929: Add more specific definition of \w #92015

Merged

JelleZijlstra marked this as a duplicate of #69929 Apr 30, 2022

JelleZijlstra closed this as completed Apr 30, 2022
