Split regex behavior

FlyingWombat · January 3, 2020, 6:13pm

For split (S), I originally thought it was just the inverse of select (s),
where the selections would be everything that doesn’t match the regex.
( - selected char, + secondary cursor, ^ primary cursor)

foo,bar
--+ --^
~
split:,

But I don’t get why this happens, where some matches of the regex end up selected:

abbba baaab
+--+ -+ ++^
~
split:a

I would have thought that split:a on abbba baaab would select the inverse of the regex (only the b’s and spaces), like this:

abbba baaab
 --+ -+   ^
~
split:a

Feature or Bug? Thoughts?

prion · January 4, 2020, 12:13am

The behavior is more obvious if you type something after performing the split

abbba baaab
# split: a
# i_
_a_bbba_ ba_a_a_b

As you can see it splits the current selection into smaller sub-sections (meaning it retains the current anchor & cursor) by ending the current selection when it meets the regex.

Position 0: start selection
Position 1: matches regex, end selection. Cursor must go on the a character as it is a 0-width selection.
Position 2,3,4: continue selection.
Position 5: matches regex so stop selection…

I’m just guessing here, but the weird cursor behavior where it sometimes ends up on an “a” seems to be due to having a “0 width selection” between adjacent a’s. If you split on a+ you get something close to what you expected.
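
A rough way to see the same thing outside the editor, assuming Python’s re.split is a fair analogy for S (it is not Kakoune’s implementation, just a sketch of the idea): adjacent matches leave empty pieces between them, and a+ collapses them.

```python
import re

text = "abbba baaab"

# Splitting on a single "a": adjacent matches leave empty pieces between them.
pieces_single = re.split(r"a", text)
print(pieces_single)  # ['', 'bbb', ' b', '', '', 'b']

# Splitting on "a+": a run of a's is one match, so no empty pieces in the middle.
pieces_run = re.split(r"a+", text)
print(pieces_run)  # ['', 'bbb', ' b', 'b']
```

The six pieces in the first list seem to line up with the six cursors in the original diagram, with each empty piece corresponding to a cursor sitting on an a.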

If you want to achieve an inverse selection, you can either do this, which selects anything at least one character long in which no character is a (written [^a]+):

abbba baaab
# select: [^a]+
abbba baaab
 --+ -+   ^
# a*
abbb*a b*aaab*
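
The same select can be sketched in Python, assuming re.finditer stands in for s (an analogy only, not how Kakoune does it). Spans are half-open (start, end) offsets.

```python
import re

text = "abbba baaab"

# Each match of [^a]+ is a maximal run of characters that are not "a".
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"[^a]+", text)]
print(spans)  # [(1, 4, 'bbb'), (5, 7, ' b'), (10, 11, 'b')]
```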

You can also do a zero-width negative lookahead, but then I think you have to do a merge afterwards, unless someone else knows how to do it all at once:

abbba baaab
# select: (?!a)
abbba baaab
 +++ ++   ^
# <a-_>     (a.k.a. Alt-Underscore)
abbba baaab
 --+ -+   ^
# a*
abbb*a b*aaab*
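
A minimal Python sketch of the lookahead-then-merge idea, under the assumption that the merge step simply joins selections that touch each other (the merging loop here is my own stand-in, not Kakoune code):

```python
import re

text = "abbba baaab"

# Zero-width matches of (?!a): every offset whose next character is not "a".
# The end-of-text offset also matches, so it is filtered out here.
offsets = [m.start() for m in re.finditer(r"(?!a)", text) if m.start() < len(text)]

# Merge consecutive offsets into half-open (start, end) ranges.
merged = []
for pos in offsets:
    if merged and merged[-1][1] == pos:
        merged[-1] = (merged[-1][0], pos + 1)
    else:
        merged.append((pos, pos + 1))
print(merged)  # [(1, 4), (5, 7), (10, 11)]
```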

As an aside, if you do an append after the split: a rather than an insert, then you almost get the behaviour of the inverse selection, but you also end up with a cursor at the end of your selection, which you probably don’t want.

FlyingWombat · January 6, 2020, 11:17pm

I think you’re right about it being 0-width selections between matches.
I haven’t looked at the code, but now that you mention it, I’d guess it just places a cursor before each match, and an anchor after.

((?!expr).)+ gives an inverted match, for simple expressions, but chokes on more complicated ones. For example: an expression to select brackets with text \[.*?\] does not work with the negative lookahead, since “Quantifiers cannot be used in lookarounds”.
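
For the simple case, here is a quick check in Python that the lookahead form selects the same runs as [^a]+. This assumes Python’s re engine, which behaves differently from Kakoune’s (for one, it accepts quantifiers inside lookarounds), so it only illustrates the pattern itself.

```python
import re

text = "abbba baaab"

# "Any character, as long as an 'a' does not start here", repeated.
inverted = [m.span() for m in re.finditer(r"(?:(?!a).)+", text)]
# The plain negated character class, for comparison.
negated = [m.span() for m in re.finditer(r"[^a]+", text)]

print(inverted)  # [(1, 4), (5, 7), (10, 11)]
print(inverted == negated)  # True
```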

The annoying thing is, I could see use-cases for either side: where sometimes you’d want cursors in between adjacent matches (current behavior), and sometimes you wouldn’t (inverted select).

I’d be interested to know which the intention was, for the code design.

FlyingWombat · January 7, 2020, 2:06am

Oh, I just realized you were right again: using the + quantifier (almost) gives the “inverse match” behavior I was expecting.
Just surround the expression with a capturing group and add the + quantifier: (expr)+
A more complex example:

[]foo[bar][]asdf[qwer]
+ --+       ---+
~
split:(\[.*?\])+

There’s still a pesky cursor at the beginning, when there’s a match starting at pos 0. And again, there are times you might want that, and times you might not.
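
The leading empty piece shows up in a Python analogy too, assuming re.split behaves roughly like S here (a sketch, not the editor’s code). I use a non-capturing group because re.split would otherwise put the captured text into the result.

```python
import re

text = "[]foo[bar][]asdf[qwer]"

# Non-capturing group: with a capturing group, re.split would also return
# the captured text in the result list.
pieces = re.split(r"(?:\[.*?\])+", text)
print(pieces)  # ['', 'foo', 'asdf', '']
```

The first '' corresponds to the cursor at the beginning. Python also reports a trailing '' after the last match, which the diagram above does not show, so the analogy is not exact at the end of the selection.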