Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed quote to escape #294

Closed
wants to merge 1 commit into from
Closed

Changed quote to escape #294

wants to merge 1 commit into from

Conversation

XAMPPRocky
Copy link
Member

Escape is a much more familiar name to this type of operation.

The function will now also only allocate a String when there are actual meta characters to escape, rather than allocating on every call.

@BurntSushi
Copy link
Member

Note that this is a breaking change. The rename from quote to escape seems sensible.

I'm not sure about returning Cow. Is there a use case where the allocation actually matters?

@XAMPPRocky
Copy link
Member Author

I think it's better to ask what is the advantage of keeping it as a String? There is no reason why someone couldn't provide huge paragraphs of text to the function based on some data. Cow provides much greater flexibility over String for no overhead that I can think of.

@BurntSushi
Copy link
Member

There is no reason why someone couldn't provide huge paragraphs of text to the function based on some data.

I'm skeptical. When would this happen where the extra allocation matters?

I think it's better to ask what is the advantage of keeping it as a String?

Because it's arguably simpler.

@XAMPPRocky
Copy link
Member Author

XAMPPRocky commented Nov 7, 2016

I'm skeptical. When would this happen where the extra allocation matters?

Well, what about a web service that has make frequent regular expressions on request? The current implementation essentially doubles the memory on each call, with a lot of users that could lead up to a lot of allocations.

Because it's arguably simpler.

I think this makes the function a lot smarter, and provides a lot control for only the addition of a single if statement. For example with Cow you as the user of this function can now know that there were meta characters within the String with a simple pattern match.

@BurntSushi
Copy link
Member

Well, what about a web service that has make frequent regular expressions on request? The current implementation essentially doubles the memory on each call, with a lot of users that could lead up to a lot of allocations.

The memory requirements of the extra string are dwarfed by at least a few orders of magnitude by the memory requirements of the regular expression itself. Moreover, the regex (and corresponding string) are probably dropped once the request has been serviced.

I think this makes the function a lot smarter, and provides a lot control for only the addition of a single if statement. For example with Cow you as the user of this function can now know that there were meta characters within the String with a simple pattern match.

I'm not sure what the use for this is. If there was one, then I imagine it'd be useful to expose a has_meta_character function.

@XAMPPRocky
Copy link
Member Author

XAMPPRocky commented Nov 7, 2016

The memory requirements of the extra string are dwarfed by at least a few orders of magnitude by the memory requirements of the regular expression itself.

Isn't this a point for it's inclusion? If the String is 1MB there is no need to allocate another 1MB in addition to the regular expression if it contains no changes.

@BurntSushi
Copy link
Member

BurntSushi commented Nov 7, 2016

Isn't this a point for it's inclusion? If the String is 1MB there is no need to allocate another 1MB in addition to the regular expression if it contains no changes.

If the string is 1MB then the compiled regular expression will be much larger. If each character in the regex roughly corresponds to a single instruction (which is a pretty conservative estimate), then you're looking at a minimum overhead of 24 bytes per character. Multiply this by some small constant, since multiple FSMs are typically compiled. The size of the string is insignificant.

In fact, Regex::new takes an &str and converts that to a String internally, so there's an extra copy right there too. But it's OK, because its relative cost is tiny.

@XAMPPRocky
Copy link
Member Author

If the string is 1MB then the compile regular expression will be much larger. If each character in the regex roughly corresponds to a single instruction (which is a pretty conservative estimate), then you're looking at a minimum overhead of 24 bytes per character. Multiply this by some small constant, since multiple FSMs are typically compiled. The size of the string is insignificant.

I agree that in relative terms the regex will always be greater, and I also think that doesn't mean that extra memory should be wasted when in absolutes. This isn't a strong case but there could rare edge cases where the extra memory would cause an OOM or stack overflow. I still the best case for it's inclusion is that providing Cow is a much better API decision.

@BurntSushi
Copy link
Member

BurntSushi commented Nov 7, 2016

I remain unconvinced. I'm not a fan of complicating an API for super rare edge cases. Even the case you've provided is unrealistic: creating a 1MB long regex doesn't sound like a use case we should be considering in the API design of this library.

I just actually tried it:

extern crate regex;

use std::iter::repeat;

use regex::Regex;

fn main() {
    let size = 1<<20;
    let _ = Regex::new(&repeat('a').take(size).collect::<String>()).unwrap();
}

This program takes 0.67s and uses over 200 MB of memory. The extra 1MB is utterly insignificant. If you've built a public facing service that tries to compile 1MB regexes, then you've got a lot more problems than whether quote returns a Cow or not.

@XAMPPRocky
Copy link
Member Author

XAMPPRocky commented Nov 8, 2016

You see I think this where we differ. I see the change to Cow as a simplification. Let's say someone did want zero unnecessary allocations for some reason. Currently they can't do that, and if has_meta_character was made public the person would have to use Cow anyway. It wouldn't cause any change in it's normal use.

let escaped = if regex::has_meta_character(string) {
    Cow::Owned(regex::escape(string))
} else {
    Cow::Borrowed(string)
};

vs

let escaped = regex::escape(string);

@BurntSushi
Copy link
Member

What are the use cases for wanting this function to not allocate?

On Nov 8, 2016 4:20 AM, "Aaron Power" [email protected] wrote:

You see I think this where we differ. I see the change to Cow as a
simplification. Let's say someone did want zero unnecessary allocations for
some reason. Currently they can't do that, and if has_meta_character was
made public the person would have to use Cow anyway. It wouldn't cause
any change in it's normal use.

let escaped = if regex::has_meta_character(string) {
Cow::Owned(regex::escape(string))
} else {
Cow::Borrowed(string)
}

vs

let escaped = regex::escape(string);


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#294 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAb34mkmTSgz8mNTC5Ivk7_atplC5imLks5q8D7XgaJpZM4Krwwf
.

@XAMPPRocky
Copy link
Member Author

Well what about in the case where there is a lot allocations, say for example in tokei I escape every unique multi line start in a language. A lot of languages have at least 2 or 3 and tokei currently has 87 languages, that is a lot of time spent allocating and deallocating before I even create the regex.

@BurntSushi
Copy link
Member

BurntSushi commented Nov 15, 2016

@Aaronepower ... but you still have to create the regex. As I showed above, creating 87*3 regexes is going to take orders of magnitude more time than escaping 87*3 regex pattern strings.

@XAMPPRocky
Copy link
Member Author

Sorry, I misexplained it would be allocating 271 Strings to then create 87 Regexes. Also this function isn't guaranteed to be used immediately with creating a regex.

@BurntSushi
Copy link
Member

Sorry, I misexplained it would be allocating 271 Strings to then create 87 Regexes.

I don't think that actually changes anything. We're talking about orders of magnitude differences here.

Also this function isn't guaranteed to be used immediately with creating a regex.

Agreed, but that is clearly the intended use case.


I am personally growing weary of this debate. I feel we have reached an impasse and I don't think there's a compelling argument in favor of returning Cow here. I would suggest removing that change from this PR and just sticking with the rename from quote to escape.

One thing I would be in favor of is exposing a function (possibly in regex-syntax) that would return true if and only if the given character was a meta character. This would permit folks who care about the extra allocation to implement their own escape without complicating the API for everyone else. (I don't think this should be in this PR though.)

@XAMPPRocky
Copy link
Member Author

Updated PR

@XAMPPRocky XAMPPRocky changed the title Changed quote to escape, and changed return type to Cow Changed quote to escape Nov 15, 2016
@BurntSushi
Copy link
Member

Thanks! It looks like the build is failing. I think you might have forgotten to update src/re_unicode.rs?

@XAMPPRocky
Copy link
Member Author

XAMPPRocky commented Nov 15, 2016

My bad! Updated the PR.

@BurntSushi
Copy link
Member

@Aaronepower Yeah I'm not sure why the nightly build is failing, but it does indeed look unrelated. I'd say just leave this for now and I'll straighten it out when I merge this PR (which I'll do as part of prep for the 0.2 release, which I hope to be imminent).

@BurntSushi BurntSushi closed this Dec 29, 2016
@BurntSushi BurntSushi reopened this Dec 29, 2016
@BurntSushi
Copy link
Member

I've rolled this into #310. Thanks again!

@BurntSushi BurntSushi closed this Dec 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants