Kakoune quoting in POSIX shell

Screwtapello · December 5, 2024, 2:25pm

Kakoune is pretty flexible about integrating with other tools; as long as a language can read from stdin and write to stdout, you can probably use it to write a Kakoune plugin. Because a lot of plugins need to send commands to Kakoune to execute, Kakoune uses a somewhat unusual quoting scheme designed to be easy to implement in any language:

1. Replace all instances of an apostrophe with two apostrophes.
2. Wrap the whole thing in a pair of apostrophes.

For example, the :info command takes exactly one argument, the string to display. If you want to display the name of They Might Be Giants’ first single, you have to make it a single string and escape the apostrophes, like this:

:info 'Don''t Let''s Start'
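
As a minimal sketch of those two steps, sed can do both the doubling and the wrapping. The name kakquote_sed is made up for illustration and is not part of Kakoune:

```sh
# kakquote_sed is a hypothetical helper name, used only for this sketch.
# It prints its first argument quoted for Kakoune, followed by a newline.
kakquote_sed() {
    # 1st expression: double every apostrophe.
    # 2nd and 3rd: add an apostrophe before the first line and after the last.
    printf '%s\n' "$1" | sed -e "s/'/''/g" -e "1s/^/'/" -e "\$s/\$/'/"
}

kakquote_sed "Don't Let's Start"
# prints: 'Don''t Let''s Start'
```

This forks a sed process on every call, so it is likely to be the slow option for short strings.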

However, although Kakoune is designed to integrate with most languages, the only language guaranteed to be present is POSIX shell. And although Kakoune’s quoting system is designed to be easy to implement in most languages, POSIX shell is surprisingly deficient at string processing, especially for a language that is all about banging strings together. So an interesting question is, what’s the best way to do Kakoune-quoting in pure POSIX shell? Or since “best” is difficult to define, at least what’s the fastest?

I originally started looking into this in issue 3340, where I came up with a few different approaches — a short one based on sed, a longer one based on the shell slicing operators, and some variants like “is backslash-quoting in a shell-script slower or faster than double-quote-quoting”. I also came up with a benchmarking harness that executed each implementation continuously for 5 seconds, and counted the number of iterations (which means that you can run a fast implementation a statistically significant number of times, without a slow implementation making the tests take forever).

The fastest implementation I was able to come up with was this one, which I called single_builtin_quoter_no_backslashes_ntmp: “single” means it only quotes a single value, not each argument individually; “builtin_quoter” means it uses only POSIX shell constructs; “no_backslashes” means it uses double-quotes to protect apostrophes; and I forget what “ntmp” means.

single_builtin_quoter_no_backslashes_ntmp() {
    text="$*"
    printf "'"
    while true; do
        case "$text" in
            *"'"*)
                printf "%s''" "${text%%"'"*}"
                text=${text#*"'"}
                ;;
            *)
                printf "%s' " "$text"
                break
                ;;
        esac
    done
}

However, yesterday somebody named “arachsys” on GitHub showed up with more implementations, the best of which according to my benchmark is this:

arachsys_single_builtin_quoter_2() {
    set -- "$1" ""
    while [ "${1#*\'}" != "$1" ]; do
        set -- "${1#*\'}" "$2${1%%\'*}''"
    done
    printf "'%s' \n" "$2$1"
}

Rather than a case statement, it compares the string to a sliced version of itself, which I assumed would make for some kind of O(n²) behaviour as it worked through the string, but at least in this microbenchmark it’s really fast.
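
To make the slicing concrete, here is a small sketch of the two POSIX parameter expansions that loop relies on, applied once to the example string. The variable names s, head and rest are just for illustration:

```sh
s="Don't Let's Start"

# ${s%%\'*} removes the longest suffix matching "'*",
# leaving everything before the first apostrophe.
head=${s%%\'*}

# ${s#*\'} removes the shortest prefix matching "*'",
# leaving everything after the first apostrophe.
rest=${s#*\'}

# If the string had no apostrophe, ${s#*\'} would leave it unchanged,
# which is exactly the comparison the loop uses as its exit test.
if [ "$rest" != "$s" ]; then
    printf '%s\n' "head=[$head] rest=[$rest]"
fi
# prints: head=[Don] rest=[t Let's Start]
```

Each pass copies both the remaining tail and the growing accumulator, which is presumably where the quadratic worry comes from on long inputs.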

They also pointed out another implementation approach: if you’re using a machine where /bin/sh is more capable than basic POSIX shell, such as bash or zsh, you might have a built-in search-and-replace operation:

arachsys_single_bash_quoter() {
    set -- "$*"
    printf "'%s' \n" "${1//\'/\'\'}"
}

That is a lot shorter, and does all the hard work in C, so it should be really fast. But let’s benchmark it, just to check. I ran the benchmarks on my ancient netbook, because performance matters more there than on the latest Ryzen desktop. I’m running up-to-date Debian Testing, with dash 0.5.12-9 and bash 5.2.32-1+b2. Using the testing framework I linked above, and choosing the best-of-three trials for each implementation:

Implementation	dash (iterations in 5 s)	bash (iterations in 5 s)
single_builtin_quoter_no_backslashes_ntmp	78,804	18,498
arachsys_single_builtin_quoter_2	89,146	25,242
arachsys_single_bash_quoter	—	41,545

If you know your script is going to run in bash, you should use the bash-specific quoting mechanism: it’s simple and compact, and in bash it’s roughly 1.6 times as fast as the next-best implementation. If you don’t know you’ll be running in bash, you should consider arachsys’ plain POSIX implementation; it’s currently the fastest POSIX quoting implementation I know of.

ChrisW · December 5, 2024, 4:47pm

Ah, I guess that would be me… and I was gratuitously cheating because I’ve worked on various shells and remember where the skeletons are usually hidden.

Once you get beyond the painfully expensive exec-out-to-sed versions, the $(fn x) construct to interpolate the result of a function into a string is often by far the most expensive thing you’ll do, so solutions that can avoid that will inevitably be vastly cheaper than ones that do.

For example, taking the bash/zsh parameter expansion version, all this looks really quick on my aging NUC:

$ x="Don't Let's Start"
$ time for i in {1..100000}; do printf '%s\n' "${x//\'/\'\'}"; done >/dev/null
real	0m0.645s
user	0m0.602s
sys	0m0.043s

$ time for i in {1..100000}; do printf '%s\n' "$x"; done >/dev/null
real	0m0.429s
user	0m0.393s
sys	0m0.036s

$ time for i in {1..100000}; do echo "$x"; done >/dev/null
real	0m0.376s
user	0m0.325s
sys	0m0.051s

So printing quoted is about 6.5us whereas printing unquoted is about 4us. (echo vs printf shaves a bit off too, as you can see, but echo breaks when strings begin with a dash. It’s more noise than signal at this level of speed, and you definitely don’t want to wrap the call in much or you’ll just measure the wrapper instead.)

If you wrap it in a shell function:

$ quote() { printf '%s\n' "${1//\'/\'\'}"; }
$ time for i in {1..100000}; do quote "$x"; done >/dev/null
real	0m1.068s
user	0m1.011s
sys	0m0.057s

it’s still only about 10us. But using that function in something like echo "echo $(quote "$x")" could be a hundred times slower because of fork() overhead:

$ time for i in {1..10000}; do echo "echo '${x//\'/\'\'}'"; done >/dev/null
real	0m0.097s
user	0m0.085s
sys	0m0.012s

$ time for i in {1..10000}; do echo "echo '$(quote "$x")'"; done >/dev/null
real	0m11.582s
user	0m3.475s
sys	0m9.051s
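
A quick way to see that $(...) really does run the function in a separate subshell environment (typically a forked process, which is where the cost above comes from) is to have the function change a variable. This is only an illustrative sketch; counter and bump are made-up names:

```sh
counter=0
bump() {
    counter=$((counter + 1))
    printf '%s\n' "$counter"
}

bump >/dev/null   # direct call: runs in the current shell, counter is now 1
out=$(bump)       # command substitution: the subshell's copy becomes 2...
printf '%s\n' "out=$out counter=$counter"   # ...but the parent's counter is still 1
# prints: out=2 counter=1
```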

The reason good shells have printf -v is that this gives you a way to abstract with functions but not pay the subshell price:

$ quotev() { printf -v "$1" '%s' "${2//\'/\'\'}"; }
$ time for i in {1..10000}; do quotev xq "$x"; echo "echo '$xq'"; done >/dev/null
real	0m0.158s
user	0m0.153s
sys	0m0.005s

Sadly POSIX fails to specify printf -v. It doesn’t require that printf be a builtin at all. Strictly speaking it doesn’t even provide local so you’re pretty much stuck using positional parameters like I do here if you want to write a clean utility function that doesn’t trample on its caller’s namespace. Joy!
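
To illustrate the namespace problem, here is a sketch with two made-up helpers that do the same trivial job. Only the first one overwrites a variable its caller was using:

```sh
# Uses an ordinary variable as scratch space. Without `local`, which POSIX
# does not specify, this overwrites the caller's $text.
after_colon_global() {
    text=${1#*:}
    printf '%s\n' "$text"
}

# Uses the function's own positional parameters as scratch space instead.
# The caller's positional parameters come back when the function returns.
after_colon_positional() {
    set -- "${1#*:}"
    printf '%s\n' "$1"
}

text="caller's value"
after_colon_positional "key:value" >/dev/null
kept=$text          # still "caller's value"
after_colon_global "key:value" >/dev/null
clobbered=$text     # now "value"
```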

ChrisW · December 5, 2024, 6:11pm

PS Although you wrapped my parameter expansion in a function that starts with set -- "$*" (arachsys_single_bash_quoter above), if you want to concatenate arguments like that before quoting, you can just

quote() {
  printf "'%s'\n" "${*//\'/\'\'}"
}

The set -- "$*" won’t be crazily expensive, but I bet it’s over 10% of the cost of the whole function when the other operations are as cheap as these.

For me this function runs in around 11us on your Don't Let's Start string whereas my POSIX one

quotep() {
  set -- "$*" ""
  while [ "${1#*\'}" != "$1" ]; do
    set -- "${1#*\'}" "$2${1%%\'*}''"
  done
  printf "'%s'\n" "$2$1"
}

takes around 40us. I suspect that my linking against musl rather than glibc will skew the costs of some string operations vs others, so YMMV.

And of course, none of this matters if it ends up in a $(...) construct anyway.

Screwtapello · December 6, 2024, 8:55am

Ah, thanks. I think of $@ and $* as being “more special” than regular variable expansions, so I wasn’t sure whether I could use $* in a substring operation. Glad to know I can!

With that change, I get 46,247 iterations rather than 41,545, so yeah, about 10%.

Yeah, the grail for this kind of thing would be a function that quoted each argument individually rather than squishing them all together with $*. Then instead of:

printf 'info %s\n' "$(quote_single "Don't Let's Start")"

…you could do the much simpler:

quote_multi info "Don't Let's Start"

…but I don’t think I’ve seen an implementation that’s appreciably simpler than just “calling quote_single in a loop”.

ChrisW · December 6, 2024, 12:37pm

Sometimes half the skill with shell scripting is breaking things up in the right way so you can join them cleanly and efficiently with the constructs and idioms available.

When writing plugins to be strictly POSIX-shell compliant, quotep is fast on its own, at least on short strings like filenames - presumably what we’re quoting 90% of the time, i.e. the right thing to optimise. But somehow it’s a bit awkward. It isn’t the right ‘shape’ to be used efficiently inside scripts in the way the bash parameter expansion "'${x//\'/\'\'}'" is.

In practice, I imagine I’d probably use one of these two variants instead:

quoten() {
  set -- "$*" ""
  while [ "${1#*\'}" != "$1" ]; do
    set -- "${1#*\'}" "$2${1%%\'*}''"
  done
  printf " '%s'" "$2$1"
}

quotev() {
  # "$2" rather than "${*:2}": slicing $* is a bash/ksh extension, not POSIX
  set -- "$1" "$2" ""
  while [ "${2#*\'}" != "$2" ]; do
    set -- "$1" "${2#*\'}" "$3${2%%\'*}''"
  done
  eval "$1=\"'\$3\$2'\""
}

quoten is just quotep without the newline, like echo -n, so you can build up commands like this, without interpolating a subshell

{ echo -n "frobnicate %val{setting}"
  quoten "$filename1"
  quoten "$filename2"
  echo
} > "$kak_command_fifo"

[Edit: oops, even echo -n isn’t guaranteed in the Austin Group’s weird backwater, although any shell that matters will implement it, a bit like local. That first line needs to be a printf for strict compliance.]
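
For strict compliance, the same block can be sketched with printf in place of both echo calls. Here the output goes to stdout rather than "$kak_command_fifo" so the snippet can be run outside Kakoune; build_command and the two sample filenames are made up for illustration:

```sh
quoten() {
  set -- "$*" ""
  while [ "${1#*\'}" != "$1" ]; do
    set -- "${1#*\'}" "$2${1%%\'*}''"
  done
  printf " '%s'" "$2$1"
}

# Sample values, assumed only for this sketch.
filename1="it's here.txt"
filename2="plain.txt"

build_command() {
  printf '%s' "frobnicate %val{setting}"   # replaces echo -n
  quoten "$filename1"
  quoten "$filename2"
  printf '\n'                              # replaces the bare echo
}

build_command
# prints: frobnicate %val{setting} 'it''s here.txt' 'plain.txt'
```

Inside a plugin you would redirect build_command to "$kak_command_fifo" instead of letting it print.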

quotev is a pure-POSIX version of the printf -v mechanism. It writes the quoted output into a variable named by its first argument. The eval here is safe: we’re interpolating the literal strings '$2' and '$3' into the command to be evaluated, not their values, which would be catastrophic. Only the variable name in $1 is interpolated by value, so it has to be a name you control. eval needs a bit of care but if we’re doing POSIX shell, we’ve already eschewed all safety improvements in shell scripting since the 1980s.
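
As a sketch of that point, the variant below (quotev_checked, a name made up here) quotes a single value, rejects anything that does not look like a plain variable name, and is then fed a value full of shell metacharacters. The guard is an added assumption about what a cautious script might want, not part of the function above:

```sh
quotev_checked() {
  # Refuse names that are empty, start with a non-letter, or contain
  # anything other than letters, digits and underscores.
  case $1 in
    '' | [!A-Za-z_]* | *[!A-Za-z0-9_]*) return 1 ;;
  esac
  set -- "$1" "$2" ""
  while [ "${2#*\'}" != "$2" ]; do
    set -- "$1" "${2#*\'}" "$3${2%%\'*}''"
  done
  # eval receives the text  name="'$3$2'"  and expands $3 and $2 itself,
  # so the contents of those parameters are never re-parsed as shell code.
  eval "$1=\"'\$3\$2'\""
}

# A value containing an apostrophe, a command substitution, backticks
# and double quotes; none of it should be executed.
nasty='it'"'"'s $(echo unsafe) `date` "x"'
quotev_checked result "$nasty"
printf '%s\n' "$result"
```

The printed value should be the original text with its apostrophe doubled and the whole thing wrapped in apostrophes, with the $(...) and backticks left untouched.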

This lets you do

quotev filename1 "$filename1"
quotev filename2 "$filename2"
echo "frobnicate %val{setting} $filename1 $filename2" \
  > "$kak_command_fifo"

I’d probably use quotev myself because it’s a bit more flexible when you want to use the quoted strings efficiently in the arguments of commands rather than just echo them, but it’d depend on the script I was writing it in.

These won’t show an advantage in your direct benchmark. I expect the eval in quotev will be a bit slower than a printf '%s'. However, when you start interpolating quotep to use the result, the performance is dominated by the cost of the subshell:

$ time for i in {1..10000}; do quotep "$x"; done >/dev/null
real	0m0.446s
$ time for i in {1..10000}; do y=$(quotep "$x"); echo "echo $y"; done >/dev/null
real	0m12.526s
$ time for i in {1..10000}; do echo "echo $(quotep "$x")"; done >/dev/null
real	0m12.619s

whereas quoten and quotev can be used efficiently:

$ time for i in {1..10000}; do quotev y "$x"; echo "echo $y"; done >/dev/null
real	0m0.530s
$ time for i in {1..10000}; do echo -n echo; quoten "$x"; echo; done >/dev/null
real	0m0.482s

alexherbo2 · December 6, 2024, 12:41pm

Which is more expensive:

quote_posix_args evaluate-commands -client "$kak_client" -verbatim edit -- "$some_file"

and

quoted_file=$(quote_posix_arg "$some_file")
quoted_client=$(quote_posix_arg "$kak_client")
echo "evaluate-commands -client $quoted_client -verbatim edit -- $quoted_file"

EDIT: That is, given an implementation of the function that takes multiple arguments vs. one that takes a single argument.

As soon as we need a block or nesting, I find a function that takes a single arg to be more flexible (I rarely need to pass the rest of the arguments). I generally quote what I need and do a single echo at the end with the quoted values.

ChrisW · December 6, 2024, 12:48pm

Oh, assuming the first option is a looping map function, so

quote_posix_args foo bar baz

prints

'foo' 'bar' 'baz'

that one will be massively faster, even if that multi arg function has a loop and a shift in it. The cost of those $() subshells will dominate anything else you might do, unless you shell out to a non-builtin.
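
A minimal sketch of such a looping function in plain POSIX shell, built on the same slicing loop as quotep. The names kakquote_one and quote_posix_args_sketch are made up for illustration, and this is one possible shape rather than anyone's actual plugin code:

```sh
# Quote one value and print it without a trailing newline.
kakquote_one() {
  set -- "$1" ""
  while [ "${1#*\'}" != "$1" ]; do
    set -- "${1#*\'}" "$2${1%%\'*}''"
  done
  printf "'%s'" "$2$1"
}

# Quote every argument individually, using only a loop and a shift.
quote_posix_args_sketch() {
  while [ "$#" -gt 0 ]; do
    kakquote_one "$1"
    shift
    # Separate arguments with a space and end the command with a newline.
    if [ "$#" -gt 0 ]; then printf ' '; else printf '\n'; fi
  done
}

quote_posix_args_sketch foo bar baz
# prints: 'foo' 'bar' 'baz'
quote_posix_args_sketch info "Don't Let's Start"
# prints: 'info' 'Don''t Let''s Start'
```

Everything here runs as builtins and function calls in the current shell, so no $(...) subshell is involved as long as the output is written straight to its destination.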

I think the only catch with a multi-arg quoter is where you want to quote some arguments and not others, e.g. if you want to include %val{foo} in your command without quoting it to a literal '%val{foo}'.

alexherbo2 · December 6, 2024, 12:52pm

Yeah, I’m pretty happy with the possibility to literally use %val{foo} from shell to kakoune evaluation so we don’t have to worry about quoting at all.

So in the quote_posix_args example, even though there are 7 args to quote, it’s still less costly than explicitly quoting 2 values in the other example, due to the use of $(...)?

ChrisW · December 6, 2024, 1:04pm

Screwtapello:

you could do the much simpler:
quote_multi info "Don't Let's Start"
…but I don’t think I’ve seen an implementation that’s appreciably simpler than just “calling quote_single in a loop”.

Sorry, I can’t do better than a loop in pure POSIX shell either I’m afraid. At least that loop would be fast. I think in practice, once you’ve got rid of the subshell cost, you probably then want to optimise for brevity and clarity rather than raw speed? Unfortunately, kak passes %sh{} bodies as giant command-line arguments to sh -c rather than over pipes to sh $fifo or just stdin pipe to sh. (I do keep wondering about having a hack at fixing this.)

With bash parameter expansions, you can do this to kak-quote every argument in a loop-free way:

  $ set -- "fo'o" "ba'a" "ba'z"
  $ set -- "${@//\'/\'\'}"
  $ set -- "${@/#/\'}"
  $ set -- "${@/%/\'}"
  $ echo "$*"
  'fo''o' 'ba''a' 'ba''z'

This will be very fast too. Parameter expansions aren’t composable, so you can’t combine those three sets into a single replacement as far as I know.

ChrisW · December 6, 2024, 1:08pm

Yes, much cheaper. The subshell is high tens or low hundreds of times the cost of the quoting function.

Edit: …unless you’re quoting big strings of course. All the above discussion assumes they’re reasonable filenames that might contain the odd naughty character, not large blocks of text. quotep and friends will be outperformed by non-quadratic streaming versions once you’re much beyond a typical 30-40 character pathname, although bash parameter expansions will probably stay fast up to a few kB.

sed is a pretty good choice for quoting big texts, although not interpolating them into commands at all so they don’t need quoting would be better!