As you may already know, position handling in LSP is a bit of a problem in both clients and servers because of unnecessary inconsistency introduced by the spec. See https://github.com/Microsoft/language-server-protocol/issues/376 for more details on the topic.
When talking about kak-lsp in particular, the way it handles `Position.character` is simply to convert between 0-based (LSP) and 1-based (Kakoune) indexing, and between exclusive (LSP) and inclusive (Kakoune) ranges. It doesn't care what `Position.character` actually is: a byte, a code unit, or a code point in UTF-8, UTF-16, or anything else. But because Kakoune itself treats the column as a byte offset in most places, kak-lsp effectively works with UTF-8 code units, i.e. plain byte offsets.
This works well with language servers that violate the protocol in the same way (e.g. `bingo`), but it leads to problems like https://github.com/ul/kak-lsp/issues/191 or https://github.com/ul/kak-lsp/issues/98 when a language server conforms to the protocol, or violates it in a different way (e.g. RLS, which uses UTF-8 code points), as soon as a line contains characters outside the Basic Latin set.
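As a minimal sketch of that arithmetic (hypothetical function names, not the actual kak-lsp API): converting a range start only shifts the base, while for a range end the base shift and the inclusive-to-exclusive shift cancel out, and none of it accounts for the byte width of characters:

```rust
// Illustrative sketch only; the real kak-lsp code uses its own types.

/// Range start: 1-based Kakoune column to 0-based LSP character.
fn kak_start_to_lsp(column: u32) -> u32 {
    column - 1
}

/// Range end: 1-based inclusive (Kakoune) to 0-based exclusive (LSP).
/// Subtracting 1 for the base and adding 1 for exclusivity cancel out,
/// but the result is still a byte offset: multi-byte characters are
/// exactly where this naive arithmetic disagrees with the spec.
fn kak_end_to_lsp(column: u32) -> u32 {
    column
}

fn main() {
    // Selecting the first two bytes of a line in Kakoune: columns 1..2.
    // The equivalent LSP range is characters 0..2 (end exclusive).
    assert_eq!(kak_start_to_lsp(1), 0);
    assert_eq!(kak_end_to_lsp(2), 2);
    println!("ok");
}
```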
In the https://github.com/ul/kak-lsp/tree/better-character-offset-handling branch I made a few steps toward solving that problem:
- A copy of the latest version of the buffer (which was already sent to kak-lsp in `didOpen` and `didChange` notifications) is now stored for further analysis.
- When a position or range arrives from the Kakoune side, or is about to be sent to Kakoune, it is converted from/to a byte offset based on the buffer content interpreted as UTF-8.
- The default conversion mode treats LSP `Position.character` as an offset in code points. This means kak-lsp should now work with spec-conforming servers within the entire Basic Multilingual Plane, and even outside it with language servers which violate the spec by using UTF-8 code points (e.g. RLS).
- For language servers which use UTF-8 code units (e.g. `pyls`) there is a new option: `offset_encoding = "utf-8"` (like this: https://github.com/ul/kak-lsp/blob/01420842ed9a65501afcae551028bc401f842a9c/kak-lsp.toml#L55). Why just "utf-8" when a UTF-8 offset could be expressed both ways (code units and code points)? Because it is inspired by the clangd convention: https://clangd.github.io/extensions.html#utf-8-offsets However, please note that the `offset_encoding` option only influences kak-lsp behaviour and is not sent to the language server!
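The default (code point) mode boils down to converting between byte offsets and code-point offsets within a line of the stored buffer copy. A sketch of that conversion (illustrative helper names, not the actual kak-lsp code), assuming the stored buffer is valid UTF-8:

```rust
// Illustrative sketch of byte-offset <-> code-point-offset conversion
// within a single line, as described above; not the actual kak-lsp code.

/// Number of code points before the given byte offset in `line`.
fn byte_to_char_offset(line: &str, byte: usize) -> usize {
    line[..byte].chars().count()
}

/// Byte offset of the code point at `char_offset` in `line`;
/// falls back to the end of the line if the offset is past it.
fn char_to_byte_offset(line: &str, char_offset: usize) -> usize {
    line.char_indices()
        .nth(char_offset)
        .map(|(byte, _)| byte)
        .unwrap_or_else(|| line.len())
}

fn main() {
    let line = "héllo"; // 'é' is 2 bytes in UTF-8
    assert_eq!(byte_to_char_offset(line, 3), 2); // "hé" = 3 bytes, 2 chars
    assert_eq!(char_to_byte_offset(line, 2), 3); // 'l' starts at byte 3
    println!("ok");
}
```

Note that this is only equivalent to the spec's UTF-16 code units within the BMP: characters outside it take two UTF-16 code units but a single code point, which is why the post above limits the guarantee for spec-conforming servers to the Basic Multilingual Plane.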
This branch is not ready to merge into master yet: I want to write a bit more docs and create a few unit tests, and a few places still need a decision about how gracefully to handle errors. But it is functionally complete, and I'd like to ask you the favour of trying it with your typical workflow if you are keen. There are several things you can do:
- If you like Rust and want to spend time reading kak-lsp code, I’d appreciate a code review.
- If your projects don’t use characters outside of Basic Latin, please check that kak-lsp works for you as before.
- If your projects do use characters outside of Basic Latin, kak-lsp should now handle them properly within the Basic Multilingual Plane! Beware that your language server might require adding `offset_encoding = "utf-8"` to the config if it uses bytes to encode offsets.
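For reference, a config entry might look like the following. The `filetypes`/`roots`/`command` values are just placeholders in the style of the linked kak-lsp.toml; the relevant line is `offset_encoding`:

```toml
[language.python]
filetypes = ["python"]
roots = ["requirements.txt", "setup.py", ".git"]
command = "pyls"
# pyls counts offsets in UTF-8 code units, i.e. bytes
offset_encoding = "utf-8"
```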
Thanks for the help!