Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce splice_buffer! abstraction for IOBuffer in the REPL #6429

Merged
merged 1 commit into from
Apr 9, 2014

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Apr 5, 2014

Removes all text between from and to, keeping the position stable with the text. Behaves just like splice! for arrays, except that it accepts 0-indexed positions and keeps the cursor position stable with the text. Minor behavior change to kill_line -- when killing just a newline, the newline char itself goes into the buffer.

Takes the abstraction from #6427 that @carlobaldassi found useful.

@mbauman mbauman changed the title Introduce cut abstraction for IOBuffer Introduce cut abstraction for IOBuffer in the REPL Apr 5, 2014
@kmsquire
Copy link
Member

kmsquire commented Apr 5, 2014

For consistency, it might be nice to have this be called splice (and match the splice interface).

@mbauman
Copy link
Member Author

mbauman commented Apr 5, 2014

Ah, yes, that's exactly what this functionality is. The name didn't feel quite right to me, either. In that case, would it make sense to define it as a true method of splice! in iobuffer.jl? It's not really a typical "IO" method, but it'd really simplify a lot of these editing commands.

@carlobaldassi
Copy link
Member

Thanks!
I'd also prefer another name, either a less generic one (to clarify it's internal code, even though there is no cut in Base at the moment) or splice! as @kmsquire suggested (but that requires extending the interface; and then it could go in iobuffer.jl I suppose).

@kmsquire
Copy link
Member

kmsquire commented Apr 5, 2014

After thinking about it, I'm more ambivalent about naming it splice!, in that I'm not sure we really want/need that as part of IOBuffer. But the behavior does kind of match. Whatever works for you.

@mbauman
Copy link
Member Author

mbauman commented Apr 5, 2014

Agreed. I'm currently mocking up a splice_buffer! function in the REPL module. I think it makes sense to echo the behavior and interface of splice!, but it really doesn't make sense to export it.

@carlobaldassi
Copy link
Member

Sounds like a good plan.

@mbauman mbauman changed the title Introduce cut abstraction for IOBuffer in the REPL Introduce splice_buffer! abstraction for IOBuffer in the REPL Apr 7, 2014
@mbauman
Copy link
Member Author

mbauman commented Apr 7, 2014

@carlobaldassi - Finally got a chance to get back to this today. Best part about this PR is that it now completely removes all memmove calls and makes all of the editing methods much simpler. It still is playing with the internal data structures of IOBuffer, but it's not nearly as egregious now. Thanks for the suggestion to use splice!, Kevin.

@StefanKarpinski
Copy link
Member

It would be really nice to separate the IOBuffer guts from how one programs the REPL. This is not an efficiency critical API so I think that copying strings around would be perfectly acceptable. No amount of human mashing on the keyboard is going to tax a modern computer.

@carlobaldassi
Copy link
Member

Looks good to me, and significantly cleaner. I'll leave it around for others to review as well (cc @loladiro, @nolta, ...).
@StefanKarpinski: this seems a step in that direction, in that it further isolates code where IOBuffer data structures are directly accessed instead of using abstractions like position. After this change, there are still a few places where this happens now in the REPL code. It probably wouldn't be too hard to abstract those away in an extra layer between IOBuffer and REPL code, if that's what you had in mind.

@StefanKarpinski
Copy link
Member

I basically had in mind that all line editing states would simply be represented by (UTF8String,Int) pairs where the string is the content of the line and the integer is the index into that where the cursor is. Each hook would simply be a function combining the input key with such a state and producing a new (UTF8String,Int) pair. This would be less efficient but much simpler to write and understand and since efficiency is essentially irrelevant for REPL code (we would be hard pressed to be as laggy as readline), this would be a good tradeoff.

@kmsquire
Copy link
Member

kmsquire commented Apr 8, 2014

This sounds like a place where RopeStrings might actually be useful?

On Tuesday, April 8, 2014, Stefan Karpinski [email protected]
wrote:

I basically had in mind that all line editing states would simply be
represented by (UTF8String,Int) pairs where the string is the content of
the line and the integer is the index into that where the cursor is. Each
hook would simply be a function combining the input key with such a state
and producing a new (UTF8String,Int) pair. This would be less efficient
but much simpler to write and understand and since efficiency is
essentially irrelevant for REPL code (we would be hard pressed to be as
laggy as readline), this would be a good tradeoff.


Reply to this email directly or view it on GitHubhttps://github.com//pull/6429#issuecomment-39867014
.

@StefanKarpinski
Copy link
Member

Eh, doesn't really matter. You can type at the REPL until you're blue in the face and the CPU wouldn't even break a sweat.

On Apr 8, 2014, at 2:51 PM, Kevin Squire [email protected] wrote:

This sounds like a place where RopeStrings might actually be useful?

On Tuesday, April 8, 2014, Stefan Karpinski [email protected]
wrote:

I basically had in mind that all line editing states would simply be
represented by (UTF8String,Int) pairs where the string is the content of
the line and the integer is the index into that where the cursor is. Each
hook would simply be a function combining the input key with such a state
and producing a new (UTF8String,Int) pair. This would be less efficient
but much simpler to write and understand and since efficiency is
essentially irrelevant for REPL code (we would be hard pressed to be as
laggy as readline), this would be a good tradeoff.


Reply to this email directly or view it on GitHubhttps://github.com//pull/6429#issuecomment-39867014
.


Reply to this email directly or view it on GitHub.

@carlobaldassi
Copy link
Member

I basically had in mind that all line editing states would simply be represented by (UTF8String,Int) pairs where the string is the content of the line and the integer is the index into that where the cursor is

I see. I actually wrote a stub for this using an InputLine type with those two fields, without even seriously trying to make it correct, just to see how it would come along. It would surely be a little better in terms of code, but not that much after all. Basically, you end up writing a lot of small methods for that type, and most of those are duplicates of what's already in place using IOBuffer. Plus, you still need to be extra careful with indexing, since many of the subtleties which are in IOBuffer manipulations arise from the fact that the cursor position is between characters, so it ranges from 0 to length(line), and you need to take into account multibyte chars on top of that.
I'm not saying it's not worth doing it, just that it wouldn't provide as much magical simplification as one would hope for. Or maybe I'm just saying that I wasn't able to design a good interface for that abstraction ;)
Also, it's not a small task, as it requires some extensive changes spread out in various files and functions, and what's worse we don't have tests for the REPL. I suggest that before doing that we should have tests.

Anyway, I don't think this discussion should block this PR, which simplifies and rationalizes a number of things, no? @mbauman, I can't merge right now because of conflicts, but if you rebase I think we can merge it.

@StefanKarpinski
Copy link
Member

I certainly don't think it will simplify any of the complex logic involving Unicode characters and cursor position but that's not the point – the point is that's the only thing the code will have to worry about. It doesn't have to know anything at all about IOBuffers or anything else involved with I/O. It completely separates line editing logic from I/O logic – and that's a good thing.

Behaves just like `splice!` for arrays, except that it accepts 0-indexed positions and keeps the cursor position stable with the text. Minor behavior change to kill_line -- when killing just a newline, that goes into the buffer.
@mbauman
Copy link
Member Author

mbauman commented Apr 9, 2014

Rebased and ready to go again.

I like the idea of eventually moving to UTF8Strings, too. While it doesn't simplify the logic around unicode, it does allow us to abstract it and put it where it belongs — in utf8.jl. It'd be great to use next… and why not start playing with a reverse iteration protocol and define prev there, too. Many of the movement-by-word functions could punt to search and rsearch, too.

Positions would be consistent (1-indexed), and we could still use a function like splice (albeit returning a new string instead of modifying it).

@carlobaldassi
Copy link
Member

@StefanKarpinski: I agree, that's why I originally thought about an abstraction layer between the REPL code and the IOBuffer code. I still think that could be a good idea as an intermediate step: 1) isolate the code which deals with the IOBuffer internals 2) create a new simpler type which implements the exact same interface 3) possibly simplify code, which at this point can be done in small steps, testing everything carefully.

@carlobaldassi
Copy link
Member

Travis failure is unrelated (arpack).

carlobaldassi added a commit that referenced this pull request Apr 9, 2014
Introduce `splice_buffer!` abstraction for IOBuffer in the REPL
@carlobaldassi carlobaldassi merged commit 37508ca into JuliaLang:master Apr 9, 2014
@mbauman mbauman deleted the cut-iobuffer branch April 9, 2014 22:46
@carlobaldassi
Copy link
Member

Positions would be consistent (1-indexed)

Well, but positions are actually consistent already, just with the notion that the cursor is between characters rather than on top of one char. Basically, it's just the difference between position(buf) and buf.ptr: using one representation or the other simplifies some code and complicates some other code, and in any case you still need to special-case operations at the extremes. Using next is not that much different than using read, testing for the end of the string is not different than using eof, etc. (But again, don't take me wrong, I'm all in favour of doing this, I'm just reporting on some quick experiment I did).

@Keno
Copy link
Member

Keno commented Apr 9, 2014

While I'm fine with changing it to UTF8Strings with a separate index, I agree with @carlobaldassi that it would most likely not change the code very much at all and is therefore probably not all that big a priority. The nice thing about IOBuffers is that you can use mutation which is slightly more natural than creating new immutable strings. In any case, the real trick to make UTF8 work correctly is the line refresh algorithm anyway (which I think is pretty much correct at this point, but probably needs to be changed for windowed prompts (within the terminal e.g. ncurses style), since it assumes that newline wrap.

@StefanKarpinski
Copy link
Member

Here's another idea – make all the REPL logic work with UTF32Strings instead of UTF8Strings. That gets rid of all the problems of dealing with variable width encoding. You just have to transcode on input and output. Bam, solved.

@Keno
Copy link
Member

Keno commented Apr 17, 2014

Close, but no cigar. Currently we don't handle this, but consider

t⃗ = [1,2,3]

That's two unicode characters.

@StefanKarpinski
Copy link
Member

What if you NFC normalize it?

@mbauman
Copy link
Member Author

mbauman commented Apr 17, 2014

No dice. There are lots of these non-spacing, combining marks, and only a handful of combinations with them have 1-character equivalents.

As a more concrete example, check out the grave accent. It can be normalized and combine with vowels but not consonants:

julia> s = "à" # Bah, looks like Github or my browser is normalizing this to one-char, so copying it won't reproduce this.
"à"

julia> length(s), strwidth(s)
(2,1)

julia> length(normalize_string(s,:NFC))
1

julia> s = "b̀" # but it's not possible to normalize this to one char
"b̀"

julia> length(s), strwidth(s)
(2,1)

julia> length(normalize_string(s,:NFC))
2

julia> length(normalize_string(s,:NFKC))
2

julia> length(normalize_string(s,:NFKD))
2

julia> length(normalize_string(s,:NFD))
2

@StefanKarpinski
Copy link
Member

I knew it couldn't possibly be that easy. Seems like UTF-32 isn't really much better than UTF-8. I wonder if it makes sense to have another layer of abstraction over normal strings where the unit of data is the code point, but it's not even clear what the correct unit of is at the higher level. This is essentially the same issue as the reverse problem (#6165).

@stevengj
Copy link
Member

stevengj commented Dec 4, 2014

What you want is graphemes, I think. utf8proc/libmojibake will break up a string into graphemes for you.

(Probably we should expose this as an iterator: for g in graphemes(s) where g iterates over substrings.)

@stevengj
Copy link
Member

stevengj commented Dec 7, 2014

@StefanKarpinski, see #9261 for a grapheme iterator that adds precisely this layer of abstraction.

@StefanKarpinski
Copy link
Member

Awesome. This is great to have.

@stevengj
Copy link
Member

Now that we have a graphemes iterator and (at a lower level) a UTF8Proc.isgraphemebreak(Char,Char) function to detect extended grapheme breaks, couldn't the REPL logic be updated to work with graphemes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants