As you may already know, position handling in LSP is a bit of a problem in both clients and servers because of unnecessary inconsistency introduced by the spec. See https://github.com/Microsoft/language-server-protocol/issues/376 for more details on the topic.
When talking about kak-lsp in particular, the way it handles `Position.character` is simply to convert between 0-based (LSP) and 1-based (Kakoune) indexing, and between exclusive (LSP) and inclusive (Kakoune) ranges. It doesn't care what `Position.character` actually is: a byte, a code unit, or a code point in UTF-8, UTF-16, or anything else. But because Kakoune itself treats the column as a byte offset in most places, kak-lsp effectively works with UTF-8 code units, i.e. plain byte offsets.
This works well with language servers that violate the protocol in the same way (e.g. `bingo`), but it leads to problems like https://github.com/ul/kak-lsp/issues/191 or https://github.com/ul/kak-lsp/issues/98 when a language server conforms to the protocol, or violates it in a different way (e.g. RLS, which uses UTF-8 code points), as soon as a line contains characters outside the Basic Latin set.
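As a minimal sketch of that arithmetic (hypothetical function names, not the actual kak-lsp API): converting a range start only shifts the base, while for a range end the base shift and the inclusive-to-exclusive shift cancel out, and none of it accounts for the byte width of characters:

```rust
// Illustrative sketch only; the real kak-lsp code uses its own types.

/// Range start: 1-based Kakoune column to 0-based LSP character.
fn kak_start_to_lsp(column: u32) -> u32 {
    column - 1
}

/// Range end: 1-based inclusive (Kakoune) to 0-based exclusive (LSP).
/// Subtracting 1 for the base and adding 1 for exclusivity cancel out,
/// but the result is still a byte offset: multi-byte characters are
/// exactly where this naive arithmetic disagrees with the spec.
fn kak_end_to_lsp(column: u32) -> u32 {
    column
}

fn main() {
    // Selecting the first two bytes of a line in Kakoune: columns 1..2.
    // The equivalent LSP range is characters 0..2 (end exclusive).
    assert_eq!(kak_start_to_lsp(1), 0);
    assert_eq!(kak_end_to_lsp(2), 2);
    println!("ok");
}
```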
In the https://github.com/ul/kak-lsp/tree/better-character-offset-handling branch I made a few steps toward solving that problem:
- A copy of the latest version of the buffer (which was already sent to kak-lsp in `didOpen` and `didChange` notifications) is now stored for further analysis.
- When a position or range arrives from the Kakoune side, or is about to be sent to Kakoune, it is converted from/to a byte offset based on the buffer content interpreted as UTF-8.
- The default conversion mode treats LSP `Position.character` as an offset in code points. This means kak-lsp should now work with spec-conforming servers within the entire Basic Multilingual Plane, and even outside it with language servers which violate the spec by using UTF-8 code points (e.g. RLS).
- For language servers which use UTF-8 code units (e.g. `pyls`) there is a new option: `offset_encoding = "utf-8"` (like this: https://github.com/ul/kak-lsp/blob/01420842ed9a65501afcae551028bc401f842a9c/kak-lsp.toml#L55). Why just "utf-8" when a UTF-8 offset could be expressed both ways (code units and code points)? Because it is inspired by the clangd convention: https://clangd.github.io/extensions.html#utf-8-offsets However, please note that the `offset_encoding` option only influences kak-lsp behaviour and is not sent to the language server!
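The default (code point) mode boils down to converting between byte offsets and code-point offsets within a line of the stored buffer copy. A sketch of that conversion (illustrative helper names, not the actual kak-lsp code), assuming the stored buffer is valid UTF-8:

```rust
// Illustrative sketch of byte-offset <-> code-point-offset conversion
// within a single line, as described above; not the actual kak-lsp code.

/// Number of code points before the given byte offset in `line`.
fn byte_to_char_offset(line: &str, byte: usize) -> usize {
    line[..byte].chars().count()
}

/// Byte offset of the code point at `char_offset` in `line`;
/// falls back to the end of the line if the offset is past it.
fn char_to_byte_offset(line: &str, char_offset: usize) -> usize {
    line.char_indices()
        .nth(char_offset)
        .map(|(byte, _)| byte)
        .unwrap_or_else(|| line.len())
}

fn main() {
    let line = "héllo"; // 'é' is 2 bytes in UTF-8
    assert_eq!(byte_to_char_offset(line, 3), 2); // "hé" = 3 bytes, 2 chars
    assert_eq!(char_to_byte_offset(line, 2), 3); // 'l' starts at byte 3
    println!("ok");
}
```

Note that this is only equivalent to the spec's UTF-16 code units within the BMP: characters outside it take two UTF-16 code units but a single code point, which is why the post above limits the guarantee for spec-conforming servers to the Basic Multilingual Plane.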
This branch is not ready to merge into master yet: I want to write a bit more docs and create a few unit tests, and a few places still need a decision about how gracefully to handle errors. But it is functionally complete, and I'd like to ask you the favour of trying it with your typical workflow if you are keen. There are several things you can do:
- If you like Rust and want to spend time reading kak-lsp code, I’d appreciate a code review.
- If your projects don’t use characters outside of Basic Latin, please check that kak-lsp works for you as before.
- If your projects do use characters outside of Basic Latin, kak-lsp should now handle them properly within the Basic Multilingual Plane! Beware that your language server might require adding `offset_encoding = "utf-8"` to the config if it uses bytes to encode offsets.
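For reference, a config entry might look like the following. The `filetypes`/`roots`/`command` values are just placeholders in the style of the linked kak-lsp.toml; the relevant line is `offset_encoding`:

```toml
[language.python]
filetypes = ["python"]
roots = ["requirements.txt", "setup.py", ".git"]
command = "pyls"
# pyls counts offsets in UTF-8 code units, i.e. bytes
offset_encoding = "utf-8"
```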
Thanks for the help!