Hi! I just stumbled into a bit of a nasty situation with some non-printable Unicode characters (in this case: \u200B “Zero-width space”). Is there a way to render these characters with a stand-in so it is possible to see that they are there?
Once I realized what was happening, I was able to select the region of text and pipe it to tr.
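For reference, the same cleanup can be done with a small filter script. This is only a sketch, not the tr invocation mentioned above (which isn't shown here): it assumes the text is UTF-8 and that U+200B is the only character to drop, and the names `strip_chars` and `filter_stream` are made up for illustration.

```python
import sys

ZERO_WIDTH_SPACE = "\u200b"


def strip_chars(text, unwanted=(ZERO_WIDTH_SPACE,)):
    """Return `text` with every character listed in `unwanted` removed."""
    # str.translate deletes any character whose code point maps to None.
    table = {ord(ch): None for ch in unwanted}
    return text.translate(table)


def filter_stream(src, dst):
    """Read UTF-8 bytes from `src`, write the cleaned UTF-8 bytes to `dst`."""
    # Decoding explicitly keeps the result independent of the locale.
    text = src.read().decode("utf-8")
    dst.write(strip_chars(text).encode("utf-8"))


# To use it as a stdin/stdout filter, call:
# filter_stream(sys.stdin.buffer, sys.stdout.buffer)
```

With that call uncommented, the script reads text on standard input and writes the cleaned text to standard output, so it can be used anywhere a selection is piped through an external command.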
There’s no built-in way to render zero-width characters in an editable way. You can get a clue that something is up, because the cursor will disappear when it’s on a zero-width character, but if you’re not moving character-by-character you might not ever see it.
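Outside the editor, a short script can at least make such characters visible. A minimal sketch in Python, assuming the input is already decoded text: every character that isn't a newline, a space, or printable ASCII is replaced with a `[U+XXXX]` stand-in. The function name `show_hidden` is hypothetical.

```python
import re

# Anything that is not a newline, a space, or another printable ASCII character.
# Note that this also matches tabs and other control characters.
SUSPECT = re.compile(r"[^\n -~]")


def show_hidden(text):
    """Replace each suspect character with a visible [U+XXXX] stand-in."""
    return SUSPECT.sub(lambda m: "[U+%04X]" % ord(m.group()), text)
```

For example, `show_hidden("a\u200bb")` returns `a[U+200B]b`, while plain ASCII text passes through unchanged.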
Relevant issues include:
I combined those two test-cases into a single file, which renders in Kakoune like this:
Here’s a plugin which highlights every character in the buffer that isn’t printable ASCII, when you run show-nonascii-enable:
declare-option -hidden range-specs show_nonascii_ranges
set-face global NonAsciiChar @value

define-command show-nonascii-update \
    -docstring "Update the locations of non-ASCII characters" \
%{
    # Reset the existing highlights
    set-option window show_nonascii_ranges %val{timestamp}
    # We need to get data out of the current buffer,
    # so we'll need to save and restore some registers
    evaluate-commands -save-regs cd| %{
        # We're going to search for non-ASCII characters
        # and switch buffers, so do this in draft mode
        # to avoid messing up the user's state.
        evaluate-commands -draft %{
            # Select all the characters that are not:
            # - newline
            # - the space character
            # - other printable ASCII
            execute-keys '%s[^\n -~]<ret>'
            # 's' makes the *last* selection in the buffer the primary,
            # but that means it'll be listed first in %val{selections_desc}
            # So we have to cycle the selections around to the beginning
            # to keep everything aligned.
            execute-keys )
            # Save the locations (descs) to "d
            set-register d %val{selections_desc}
            # Save the actual characters to "c
            set-register c %val{selections}
            # Let's build up the range-specs value in a temporary buffer.
            edit -scratch
            # Paste the descs
            execute-keys '"d<a-P>'
            # Add the "range-specs" formatting syntax,
            # and put each entry on a separate line
            # so we can select it later with 'x'.
            # We excluded newlines from the list of chars to select,
            # so we can safely use lines as a delimiter.
            # We leave the last character before the newline selected,
            # so we can paste after it.
            execute-keys a|{NonAsciiChar}<ret><esc>h
            # Paste the characters beside the descs,
            # leaving them selected.
            execute-keys '"cp'
            # Replace them with hexadecimal equivalents.
            # I wish there were a more efficient way to do this,
            # but most buffers probably don't have
            # too many non-ASCII characters.
            execute-keys -itersel \
                '|printf [U+%04X] "$kak_cursor_char_value"<ret>'
            # Copy the whole thing back into "c,
            # so we can get it out of this scratch buffer
            execute-keys xH"cy
            delete-buffer
        }
        # Add it to the range-specs option so we can see it.
        set-option -add window show_nonascii_ranges %reg{c}
    }
}

define-command show-nonascii-enable \
    -docstring "Add a highlighter that shows non-ASCII characters" \
%{
    add-highlighter window/show-nonascii replace-ranges show_nonascii_ranges
    show-nonascii-update
}

define-command show-nonascii-disable \
    -docstring "Remove the highlighter that shows non-ASCII characters" \
%{
    remove-highlighter window/show-nonascii
}
The result looks like this:
The plugin isn’t perfect; it doesn’t detect new non-ASCII characters added after the highlighter was added, and it formats characters as Unicode code-points even though the buffer might not be Unicode at all. It would also be nice to have a way to jump to a non-ASCII character (if any) instead of requiring you to browse through the entire buffer.
It’s nice to be able to see zero-width characters and copy or delete them, though.
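Until there is a way to jump straight to such a character, listing where they are can be approximated outside the editor. The sketch below uses the same character class as the plugin's regex and reports a line, a column, and a code point for each hit. It assumes the text has already been decoded, and `find_suspects` is a hypothetical name. Columns here count characters, which may differ from the byte-based columns an editor reports.

```python
import re

# Same character class as the plugin: not a newline, not a space,
# and not any other printable ASCII character.
SUSPECT = re.compile(r"[^\n -~]")


def find_suspects(text):
    """Return a list of (line, column, code_point) tuples, all 1-based.

    Columns count characters, not bytes.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in SUSPECT.finditer(line):
            code_point = "U+%04X" % ord(match.group())
            hits.append((line_no, match.start() + 1, code_point))
    return hits
```

The first entry of the returned list gives the line and column to jump to by hand; an empty list means the text contains nothing outside newline, space, and printable ASCII.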
Wow! Thank you for your detailed responses. I’ll have to play around with these solutions and see what works best. If I make something myself, I’ll report back.