I’ve been poking at my half-finished GTK UI for Kakoune, and I just hit a crash that surprised me. Basically, if I’m editing a Python file (with kak-lsp and the python language server) and I paste a Unicode character into my document (I’ve been using ＠, the full-width @, but others also work), the character appears, and then the screen updates and my GTK UI crashes because Kakoune sent it invalid UTF-8.
Specifically, the invalid fragment looks like this:
Note that “\xef\xbc\xa0” is the UTF-8 encoding of ＠ (U+FF20).
If I had to guess, I’d say that the Python language server is detecting an invalid identifier and reporting it in character coördinates, while something later (kak-lsp? Kakoune itself?) is interpreting those as byte coördinates, and trying to apply different syntax-highlighting to different bytes of a UTF-8-encoded stream.
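To make that guess concrete, here is a minimal Python sketch of how a character offset misread as a byte offset can cut a UTF-8 sequence in half. The buffer line, the offset, and the `is_valid_utf8` helper are all made up for illustration; this is not taken from kak-lsp or Kakoune.

```python
# Hypothetical buffer line ending in U+FF20 FULLWIDTH COMMERCIAL AT.
line = "x = \uff20"
encoded = line.encode("utf-8")  # b"x = \xef\xbc\xa0"

# Assumed diagnostic: the bad character is at 0-based character offset 4.
char_offset = 4


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Slicing by characters first keeps the three-byte sequence intact.
by_chars = line[char_offset:char_offset + 1].encode("utf-8")

# Using the same number as a byte offset grabs only the lead byte.
by_bytes = encoded[char_offset:char_offset + 1]

print(by_chars, is_valid_utf8(by_chars))  # b'\xef\xbc\xa0' True
print(by_bytes, is_valid_utf8(by_bytes))  # b'\xef' False
```

If something styles one "character" by slicing bytes like the second case, each piece it emits is an invalid UTF-8 fragment, which would match the kind of crash described above.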
Has anybody else ever come across behaviour like this? Does anybody happen to know which software is at fault?
Posting kak-lsp logs might help to investigate its role in the malformed message.
In general, some notes about what happens on kak-lsp side:
All strings are handled as UTF-8
The position taken from the language server and passed to Kakoune is only offset to account for the difference in indexing base; no content analysis is performed:
LSP ranges are 0-based, but Kakoune’s are 1-based.
LSP ranges are exclusive, but Kakoune’s are inclusive.
Also from LSP spec: If you want to specify a range that contains a line including the line ending character(s) then use an end position denoting the start of the next line.
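As a rough illustration of the offset rules above, the following Python sketch converts a 0-based, end-exclusive range into a 1-based, end-inclusive one. It is not kak-lsp's actual code: `Position`, `lsp_range_to_kakoune`, and the `lines` argument are hypothetical names, and it assumes both sides count columns in the same unit, which is exactly the assumption in question in this thread.

```python
from typing import NamedTuple, Sequence, Tuple


class Position(NamedTuple):
    line: int
    column: int


def lsp_range_to_kakoune(
    start: Position, end: Position, lines: Sequence[str]
) -> Tuple[Position, Position]:
    """Convert a 0-based, end-exclusive range to 1-based, end-inclusive.

    `lines` holds the buffer lines without their trailing newline.
    """
    kak_start = Position(start.line + 1, start.column + 1)
    if end.column == 0 and end.line > start.line:
        # The range ends at the start of the next line, so it covers the
        # previous line including its newline (one column past the text).
        prev = end.line - 1
        kak_end = Position(prev + 1, len(lines[prev]) + 1)
    else:
        # Exclusive 0-based column N is the same cell as inclusive 1-based N.
        kak_end = Position(end.line + 1, end.column)
    if kak_end < kak_start:
        # An empty range has no inclusive equivalent; this sketch simply
        # collapses it onto the start position.
        kak_end = kak_start
    return kak_start, kak_end


print(lsp_range_to_kakoune(Position(0, 0), Position(0, 3), ["foo"]))
print(lsp_range_to_kakoune(Position(0, 0), Position(1, 0), ["foo", ""]))
```

The first call covers columns 1 to 3 of line 1; the second, which ends at the start of the next line, extends to column 4, the newline after "foo".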
I had some time to look into this more deeply today. I found a way to reproduce the weirdness in kak -n, so I filed a bug. While researching, I discovered that the language server protocol counts coordinates in UTF-16 code-units while Kakoune seems to count in whole codepoints, so I filed that bug too.
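For anyone who wants to see the unit mismatch in numbers, here is a small Python sketch comparing the three ways of counting. The helper names `measure` and `utf16_offset_to_codepoint_offset` are invented for this example, and the conversion is only one possible approach, not what kak-lsp or Kakoune actually does.

```python
def measure(text: str) -> dict:
    """Count the same text in codepoints, UTF-16 code units, and UTF-8 bytes."""
    return {
        "codepoints": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


def utf16_offset_to_codepoint_offset(line: str, utf16_offset: int) -> int:
    """Map a 0-based UTF-16 code-unit offset to a 0-based codepoint offset."""
    units = 0
    for index, char in enumerate(line):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane need a surrogate
        # pair, i.e. two UTF-16 code units.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(line)


print(measure("\u2603"))      # U+2603 SNOWMAN
print(measure("\U0001F600"))  # U+1F600, outside the BMP

# In "a" + U+1F600 + "b", the "b" sits at UTF-16 offset 3 but codepoint offset 2.
print(utf16_offset_to_codepoint_offset("a\U0001F600b", 3))
```

The snowman counts as one codepoint and one UTF-16 code unit but three UTF-8 bytes, so for it the UTF-16 versus codepoint difference does not matter; it only shows up for characters outside the Basic Multilingual Plane.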
Regardless, if I create a file called foo.py containing a single U+2603 SNOWMAN and open it with python-language-server installed, the kak-lsp logs say (in part):
…which (if I understand the LSP spec correctly) means it’s highlighting the newline after the snowman rather than the snowman itself, which makes very little sense to me. I tried to do the same experiment with VS Code, but couldn’t figure out how to actually make the IDE use the language server, so gave up before filing a bug against the python-language-server project.