Searching For: Non-exact match finding (mostly similar)

robertmeta · January 25, 2021, 3:39pm

I am hunting around for something I presumed existed, but haven’t been able to find. A console utility for finding duplicate lines that are non-exact matches and scoring the matches. There are a ton of algorithms for this – but I couldn’t find a pre-baked utility.

Do I have to write it?

Example:

more += Found .* abyssal rune of Zot
more += .* abyssal rune of Zot

These two lines are mostly similar, with a Levenshtein distance of 6. I would like to surface possible dups / same-meaning lines by shortest Levenshtein distance (or something better; it has been years since I dug into this stuff).
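
For reference, a minimal sketch of that idea in Python, not an existing utility: it computes plain Levenshtein distance and lists every pair of lines within a threshold, closest first. The names `levenshtein` and `near_duplicates` and the `max_distance` default are made up for illustration, and the all-pairs loop is quadratic, so it only suits smallish files.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def near_duplicates(lines, max_distance=8):
    """Return (distance, line_a, line_b) tuples, closest pairs first.

    Exact duplicates (distance 0) are skipped on purpose; the
    max_distance default is an arbitrary illustrative threshold.
    """
    pairs = []
    for a, b in combinations(lines, 2):
        d = levenshtein(a, b)
        if 0 < d <= max_distance:
            pairs.append((d, a, b))
    return sorted(pairs)
```

Fed the two lines above, `levenshtein` returns 6, since the only difference is the six characters of `Found ` (including the trailing space).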

Does nothing already do this?!

robertmeta · January 25, 2021, 3:42pm

Damerau–Levenshtein distance seems like a better fit: it also counts a swap of two adjacent characters as a single edit.
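
A rough sketch of what that changes, assuming the restricted variant (often called optimal string alignment) is good enough. It is not the unrestricted Damerau–Levenshtein algorithm, and the function name `osa_distance` is just illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Same as Levenshtein, plus a swap of two adjacent characters
    counts as one edit. Each substring may be edited at most once,
    which is what separates it from the unrestricted variant.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition: "al" <-> "la"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this, `osa_distance("abyssal", "abyssla")` is 1, where plain Levenshtein gives 2. For the two example lines nothing changes (still 6), because no characters are swapped there.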

danr · January 26, 2021, 8:07am

I had a search, and the closest I found was this Python lib, which it is perhaps easy to write a CLI around: GitHub - seatgeek/fuzzywuzzy: Fuzzy String Matching in Python

Edit: and this GitHub - dedupeio/dedupe: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

brianhicks · January 27, 2021, 3:05pm

I wrote a program that does this a while back:

I use it to pop open a fuzzy file finder that finds files with similar paths. Useful for jumping to sibling or test files quickly!

In this case, you’d invoke it like: similar-sort 'more += Found .* abyssal rune of Zot' < sourcefile. If you need to remove the original line, you’d just grep -v it back out of the output. It could be faster, but it’s also nice and composable.
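
For anyone who wants the same filter shape without installing anything, here is a hedged stand-in in Python. It is not how similar-sort is implemented; it swaps in the standard library's `difflib.SequenceMatcher` ratio as the similarity score, and the function name `sort_by_similarity` is made up for this sketch.

```python
import sys
from difflib import SequenceMatcher


def sort_by_similarity(query, lines):
    """Return lines ordered from most to least similar to query.

    Similarity is difflib's ratio (1.0 means identical), used here
    as a stand-in for an edit-distance score.
    """
    def score(line):
        return SequenceMatcher(None, query, line).ratio()
    return sorted(lines, key=score, reverse=True)


# Act as a stdin filter only when exactly one query argument is given.
if __name__ == "__main__" and len(sys.argv) == 2:
    candidates = [line.rstrip("\n") for line in sys.stdin]
    for line in sort_by_similarity(sys.argv[1], candidates):
        print(line)
```

Saved under a hypothetical name like fuzzy_sort.py, it would be run as: python fuzzy_sort.py 'more += Found .* abyssal rune of Zot' < sourcefile, and piped through grep -v in the same way.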

robertmeta · January 28, 2021, 4:02pm

@brianhicks nice, this got me on exactly the right path.