Non-printable characters in Kakoune

Hi! I just stumbled into a bit of a nasty situation with some non-printable Unicode characters (in this case: \u200B “Zero-width space”). Is there a way to render these characters with a stand-in so it is possible to see that they are there?

Once I realized what was happening, I was able to select the region of text and pipe it to tr:

tr -cd '[:print:]'
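
A caveat on that tr call: newline is not in the [:print:] class, so tr -cd '[:print:]' also deletes line breaks, and implementations that work byte by byte may mangle other non-ASCII text. If only the zero-width space needs to go, a narrower filter might look like the sketch below. The strip_zwsp name is made up for illustration, and it assumes the text is UTF-8, where U+200B is encoded as the bytes E2 80 8B.

```shell
# strip_zwsp: hypothetical helper name, not from the thread.
# Reads stdin, writes stdin minus every U+200B (zero-width space).
# Assumes UTF-8 input, where U+200B is the byte sequence E2 80 8B.
strip_zwsp() {
    # Build the three bytes with octal escapes, which POSIX printf supports.
    zwsp=$(printf '\342\200\213')
    # Delete only that sequence; newlines and other characters are kept.
    sed "s/$zwsp//g"
}

printf 'foo\342\200\213bar\n' | strip_zwsp
# prints: foobar
```

The sed command inside the function should also work as the command given to | on a selection in Kakoune.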

I use something like this:

add-highlighter shared/non_ascii_characters regex '[^\x00-\x7f]+' '0:DiagnosticError'

Then, with the cursor on the character, I run this:

define-command show_character_info %{
  evaluate-commands -draft %{
    execute-keys ',;'
    evaluate-commands -client %val{client} -verbatim echo -markup %sh{
      printf '{Information}"%s" (U+%04x) Dec %d Hex %02x\n' "$kak_selection" "$kak_cursor_char_value" "$kak_cursor_char_value" "$kak_cursor_char_value"
    }
  }
}

alias global char show_character_info
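
To make the output of that command concrete, here is the same printf format run outside Kakoune. This is only a sketch: format_char_info is a hypothetical wrapper, and it assumes $kak_cursor_char_value expands to the code point as a decimal number (which would be 8203 for the zero-width space).

```shell
# format_char_info: hypothetical wrapper around the printf format used above.
# $1 stands in for $kak_selection, $2 for $kak_cursor_char_value
# (assumed here to be the code point in decimal).
format_char_info() {
    printf '"%s" (U+%04x) Dec %d Hex %02x\n' "$1" "$2" "$2" "$2"
}

format_char_info 'A' 65
# prints: "A" (U+0041) Dec 65 Hex 41

# The zero-width space: the quoted character is invisible in the output,
# but the code point fields still identify it as U+200b.
format_char_info "$(printf '\342\200\213')" 8203
```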

There’s no built-in way to render zero-width characters in an editable way. You can get a clue that something is up, because the cursor will disappear when it’s on a zero-width character, but if you’re not moving character-by-character you might not ever see it.

Relevant issues include:

I combined those two test-cases into a single file, which renders in Kakoune like this:

image

Here’s a plugin which highlights every character in the buffer that isn’t printable ASCII, when you run show-nonascii-enable:

declare-option -hidden range-specs show_nonascii_ranges
set-face global NonAsciiChar @value

define-command show-nonascii-update \
    -docstring "Update the locations of non-ASCII characters" \
%{
    # Reset the existing highlights
    set-option window show_nonascii_ranges %val{timestamp}

    # We need to get data out of the current buffer,
    # so we'll need to save and restore some registers
    evaluate-commands -save-regs cd| %{

        # We're going to search for non-ASCII characters
        # and switch buffers, so do this in draft mode
        # to avoid messing up the user's state.
        evaluate-commands -draft %{
            # Select all the characters that are not:
            # - newline
            # - the space character
            # - other printable ASCII
            execute-keys '%s[^\n -~]<ret>'

            # 's' makes the *last* selection in the buffer the primary,
            # but that means it'll be listed first in %val{selections_desc}
            # So we have to cycle the selections around to the beginning
            # to keep everything aligned.
            execute-keys )

            # Save the locations (descs) to "d
            set-register d %val{selections_desc}

            # Save the actual characters to "c
            set-register c %val{selections}

            # Let's build up the range-specs value in a temporary buffer.
            edit -scratch

            # Paste the descs
            execute-keys '"d<a-P>'

            # Add the "range-specs" formatting syntax,
            # and put each entry on a separate line
            # so we can select it later with 'x'.
            # We excluded newlines from the list of chars to select,
            # so we can safely use lines as a delimiter.
            # We leave the last character before the newline selected,
            # so we can paste after it.
            execute-keys a|{NonAsciiChar}<ret><esc>h

            # Paste the characters beside the descs,
            # leaving them selected.
            execute-keys '"cp'

            # Replace them with hexadecimal equivalents.
            # I wish there were a more efficient way to do this,
            # but most buffers probably don't have
            # too many non-ASCII characters.
            execute-keys -itersel \
            '|printf [U+%04X] "$kak_cursor_char_value"<ret>'

            # Copy the whole thing back into "c,
            # so we can get it out of this scratch buffer
            execute-keys xH"cy
            delete-buffer
        }

        # Add it to the range-specs option so we can see it.
        set-option -add window show_nonascii_ranges %reg{c}
    }
}

define-command show-nonascii-enable \
    -docstring "Add a highlighter that shows non-ASCII characters" \
%{
    add-highlighter window/show-nonascii replace-ranges show_nonascii_ranges
    show-nonascii-update
}

define-command show-nonascii-disable \
    -docstring "Remove the highlighter that shows non-ASCII characters" \
%{
    remove-highlighter window/show-nonascii
}

The result looks like this:

image

The plugin isn’t perfect; it doesn’t detect new non-ASCII characters added after the highlighter was added, and it formats characters as Unicode code-points even though the buffer might not be Unicode at all. It would also be nice to have a way to jump to a non-ASCII character (if any) instead of requiring you to browse through the entire buffer.

It’s nice to be able to see zero-width characters and copy or delete them, though.
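
On the wish for a way to find such characters without browsing the whole buffer: outside the editor, a rough equivalent is to list the line numbers that contain anything other than printable ASCII. The sketch below is an assumption-laden stand-in, not part of the plugin. The nonascii_lines name is invented, and it relies on grep treating the input as plain bytes under LC_ALL=C, which may vary between grep implementations.

```shell
# nonascii_lines: hypothetical helper, not part of the plugin above.
# Reads stdin and prints the number of every line that contains a byte
# outside the printable ASCII range (space through tilde).
# Like the plugin's [^\n -~] pattern, this also flags tabs.
nonascii_lines() {
    # LC_ALL=C makes the bracket range a plain byte range.
    LC_ALL=C grep -n '[^ -~]' | cut -d: -f1
}

printf 'ok\nfoo\342\200\213bar\nfine\n' | nonascii_lines
# prints: 2
```

With a line number in hand, typing it followed by g in Kakoune's normal mode (for example 2g) should move the cursor to that line.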


Wow! Thank you for your detailed responses. I’ll have to play around with these solutions and see what works best. If I make something myself, I’ll report back.

Thanks again!