Python, kak-lsp and the JSON UI: something's weird

I’ve been poking at my half-finished GTK UI for Kakoune, and I just hit a crash that surprised me. Basically, if I’m editing a Python file (with kak-lsp and the Python language server) and I paste a Unicode character (I’ve been using ＠, the full-width @, but ☃ also works) into my document, the character appears, and then the screen updates and my GTK UI crashes because Kakoune sent it invalid UTF-8.

Specifically, the invalid fragment looks like this:

{ "face": { "fg": "default", "bg": "default", "attributes": [] }, "contents": "\xef" },
{ "face": { "fg": "red", "bg": "default", "attributes": [] }, "contents": "\xbc\xa0" },

Note that “\xef\xbc\xa0” is the UTF-8 encoding of ＠ (U+FF20), so the single character has been split across two atoms with different faces.

If I had to guess, I’d say that the Python language server is detecting an invalid identifier and reporting it in character coördinates, while something later (kak-lsp? Kakoune itself?) is interpreting those as byte coördinates, and trying to apply different syntax-highlighting to different bytes of a UTF-8-encoded stream.
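To make the failure mode concrete, here is a minimal Python sketch of what happens when a character offset is applied as a byte offset. The helper names `split_at` and `is_valid_utf8` are made up for illustration; this is not code from Kakoune or kak-lsp, just a demonstration that cutting the three-byte sequence after its first byte yields two fragments that are each invalid UTF-8.

```python
# U+FF20 FULLWIDTH COMMERCIAL AT, the character from the crash report.
text = "\uff20"
data = text.encode("utf-8")  # three bytes: EF BC A0


def split_at(raw: bytes, offset: int) -> tuple[bytes, bytes]:
    """Split a byte string at a byte offset (hypothetical helper)."""
    return raw[:offset], raw[offset:]


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Offset 1 is a sensible *character* boundary (just after the character),
# but as a *byte* offset it lands in the middle of the encoded sequence.
head, tail = split_at(data, 1)

print(head, is_valid_utf8(head))
print(tail, is_valid_utf8(tail))
```

The two fragments match the two "contents" strings in the JSON above, which is at least consistent with the guess that a character column is being used as a byte column somewhere along the way.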

Has anybody else ever come across behaviour like this? Does anybody happen to know which software is at fault?

Posting the kak-lsp logs might help to investigate its role in the malformed message.

In general, some notes about what happens on the kak-lsp side:

  1. All strings are handled as UTF-8
  2. Positions received from the language server are passed to Kakoune with only an offset applied to account for the difference in indexing base; no content analysis is performed:

LSP ranges are 0-based, but Kakoune’s are 1-based.
LSP ranges are exclusive, but Kakoune’s are inclusive.
Also from LSP spec: If you want to specify a range that contains a line including the line ending character(s) then use an end position denoting the start of the next line.
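The three rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not kak-lsp's actual code; the function name `lsp_range_to_kakoune` and the `lines` parameter are hypothetical, and columns are counted in Python string indices (codepoints), which is itself an assumption about units.

```python
def lsp_range_to_kakoune(start, end, lines):
    """Convert an LSP range to a Kakoune-style range (illustrative sketch).

    start, end: (line, character) pairs, 0-based, with an exclusive end.
    lines: the buffer as a list of strings, each including its line ending.
    Returns two (line, column) pairs, 1-based, with an inclusive end.
    """
    start_line, start_char = start
    end_line, end_char = end

    # 0-based to 1-based: add one to both line and column.
    kak_start = (start_line + 1, start_char + 1)

    if end_char == 0 and end_line > 0:
        # An exclusive end at the start of a line means the range covers
        # the line ending of the previous line, so the inclusive end is
        # the last character of that previous line.
        previous = lines[end_line - 1]
        kak_end = (end_line, len(previous))
    else:
        # Exclusive 0-based column N is the same position as
        # inclusive 1-based column N, so only the line shifts.
        kak_end = (end_line + 1, end_char)

    return kak_start, kak_end


# A range covering characters 1 and 2 of the first line.
print(lsp_range_to_kakoune((0, 1), (0, 3), ["ab cd\n"]))  # ((1, 2), (1, 3))

# A range covering a whole one-character line, including its newline.
print(lsp_range_to_kakoune((0, 0), (1, 0), ["x\n"]))  # ((1, 1), (1, 2))
```

Note that an empty LSP range (start equal to end) comes out with an end column one less than the start column, so a real implementation would need to decide how to handle that case.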

I had some time to look into this more deeply today. I found a way to reproduce the weirdness in kak -n, so I filed a bug. While researching, I discovered that the language server protocol counts coordinates in UTF-16 code-units while Kakoune seems to count in whole codepoints, so I filed that bug too.
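For anyone following along, the three ways of counting can be compared with a short Python sketch. The helper name `lengths` is made up for illustration; it relies on the fact that Python's `len()` on a `str` counts codepoints.

```python
def lengths(text: str) -> dict[str, int]:
    """Measure a string in the three units discussed in this thread."""
    return {
        "codepoints": len(text),
        # Each UTF-16 code unit is two bytes; "-le" avoids a byte-order mark.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


for label, char in [
    ("U+FF20 full-width @", "\uff20"),
    ("U+2603 snowman", "\u2603"),
    ("U+1F600 (outside the BMP)", "\U0001f600"),
]:
    print(label, lengths(char))
```

Both characters from this thread are in the Basic Multilingual Plane, so they are one codepoint and one UTF-16 code unit each, but three UTF-8 bytes. The UTF-16 versus codepoint difference only shows up for characters outside the BMP, such as the third example, which takes two UTF-16 code units and four UTF-8 bytes.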

Regardless, if I create a file called foo.py containing a single U+2603 ☃ SNOWMAN and open it with python-language-server installed, the kak-lsp logs say (in part):

"diagnostics": [
    {
        "source": "pyflakes",
        "range": {
            "start": {"line": 0, "character": 1},
            "end": {"line": 0, "character": 3}
        },
        "message": "invalid character in identifier",
        "severity": 1
    }
]

…which (if I understand the LSP spec correctly) means it’s highlighting the newline after the snowman rather than the snowman itself, which makes very little sense to me. I tried to do the same experiment with VS Code, but couldn’t figure out how to actually make the IDE use the language server, so gave up before filing a bug against the python-language-server project.
