Wednesday, December 16, 2009

Emacs: Better Ido Flex-Matching

I've had this blog for a long time ... perhaps I should actually post things.  Most will be about Emacs; there are plenty of beginner-level Emacs blogs out there, so I'll lean towards the intermediate/advanced.

Better Ido Flex-Matching

First up: Better Ido flex-matching.  Ido lets you switch buffers, choose files, and so on by presenting you with a list of choices and narrowing the list down as you type.  Flex-matching does "fuzzy" matching of your input to the list of items; i.e. you can just type pieces of what you want, and the pieces don't even need to be next to each other.

This is great, but it has some quirks.  Suppose my list of items was ("chops" "scone" "close" "confide") and I typed "co".  Ido throws out "chops" and "close" -- even though they have 'c' followed by 'o' -- and only presents "scone" and "confide" since they have "co" in them.  Now I add 's' so the input is "cos" and Ido discards the previous, seemingly only, matches and switches to "chops" and "close".  It presents them in that order as well, even though I consider "close" to be a better match since the 'o' and 's' are next to each other.

I found this behavior inconsistent and surprising, so I decided to change the algorithm.  If you want to try my version but aren't interested in how it works, skip the next section.

The Algorithm

The first thing to do is see if someone else has solved this problem.  Trying different searches for "string matching", you get a lot of hits on finding substrings, which isn't what I want, and a lot for Levenshtein Distance.  The Levenshtein algorithm is close (there's even an elisp implementation, and similarly fuzzy-match), but it's more for things like spelling suggestions.  It will always match,  giving you a measure of how close the strings are.  They don't even have to have any of the same characters in them.  There are a couple academic papers also, but they had various problems as well.  I know what I want, I'll just write it myself!

We need to calculate a correlation value between the input and each item; that is, how closely they match.  A correlation value of 'nil' will mean no match, and zero or higher will mean a match with higher values meaning better matches.  The correlation value will be increased by one for every pair of adjacent letters in the input that are adjacent in the item (only the first time counts).  A special case of adding one is if the first letters in both match, as if the beginning of the string could be considered a character.  For example, if the input is "cos" and the item is "close", it will get one point for both starting with 'c', and one point for "os" being adjacent in both for a total of two.  If the item was "chops", it would only get one point (starts with 'c' and has 'o' and 's', but none adjacent).

A hash table is created from the input, with each character being the key and a descending list of the character's positions in the input as the value.  For example, if the input is "aba", the table would have a -> (2 0), b -> (1).  Why they need to be in descending order will be explained later.  Since the same input is used for each item, the hash table is only created once.

For each item, we start with an empty (all nil value) vector with a length equal to that of the input.  We then go through the characters of the item in order, using each to index into the hash table and retrieve the character's corresponding locations in the input string.  We look in the vector at these locations, and if the preceding vector location is non-nil (or we are at the beginning of the vector) we fill in the vector location with a cons cell.  This is why they must be in descending order, otherwise you would incorrectly fill in locations where there were double letters.  The car of the cell is the index of the item where the character was found, and the cdr is the current correlation that we will carry along.  If the car of the preceding vector location (the previous character location in the item) is the current item character location minus one, we add one to the correlation.  When all vector locations have been filled in, we have a match and we save the correlation.  As more matches happen we keep the highest correlation value.

I know I'm kind of skimming over things without a lot of explanation, but this post is already getting too detailed.

The Implementation

Here's the implementation, woven into Ido with a variable to switch to the regular flex-matching if you want:

(defun my-ido-fuzzy-match (str items)
  "Better ido fuzzy matching"
  (let ((str-len (length str)))
    (if (= str-len 0)
        (reverse items)
      (let ((char-lookup (make-hash-table :test 'equal)))
        ;; Make hash table of all characters with their corresponding indexes
        (let ((chars (split-string (if ido-case-fold (downcase str) str) "" t))
              (idx 0)
              elt)
          (dolist (char chars)
            (setq elt (gethash char char-lookup))
            (if elt
                (push idx elt) ;; It's important that the indexes are in descending order
              (setq elt (list idx)))
            (puthash char elt char-lookup)
            (setq idx (1+ idx))))
        ;; Go through all the items
        (let (corr matches)
          (dolist (item items)
            (setq corr (my-ido-match-get-correlation str-len char-lookup (ido-name item)))
            (when corr
              (push (cons item corr) matches)))
          ;; Sort matches and return
          (mapcar 'car (if ido-rotate
                           matches
                         (sort matches (lambda (x y) (> (cdr x) (cdr y)))))))))))

(defun my-ido-match-get-correlation (str-len char-lookup item)
  "Get the correlation for this item"
  (let ((partial-matches (make-vector str-len nil))
        (chars (split-string (if ido-case-fold (downcase item) item) "" t))
        (char-idx 0)
        elt-idxs corr prev-partial-match curr-partial-match)
    (dolist (char chars)
      (setq elt-idxs (gethash char char-lookup))
      (when elt-idxs
        (dolist (elt-idx elt-idxs)
          ;; Current and previous partial matches
          (setq curr-partial-match (aref partial-matches elt-idx))
          (setq prev-partial-match (and (> elt-idx 0)
                                        (aref partial-matches (1- elt-idx))))
          ;; Create a new partial match if necessary
          (when (and (not curr-partial-match)
                     (or prev-partial-match (= elt-idx 0)))
            (setq curr-partial-match
                  (aset partial-matches elt-idx
                        (cons char-idx (if (and (= elt-idx 0) (= char-idx 0)) 1 0)))))
          ;; Set (match-position . correlation)
          (when curr-partial-match
            (setcar curr-partial-match char-idx)
            (when prev-partial-match
              (setcdr curr-partial-match
                      (if (= char-idx (1+ (car prev-partial-match)))
                          (1+ (cdr prev-partial-match))
                        (cdr prev-partial-match))))
            ;; Update final correlation
            (when (= elt-idx (1- str-len))
              (if corr
                  (setq corr (max corr (cdr curr-partial-match)))
                (setq corr (cdr curr-partial-match)))))))
      (setq char-idx (1+ char-idx)))
    corr))

(defvar my-ido-use-fuzzy-match t
  "*Use my-ido-fuzzy-match for ido matching")

(defadvice ido-set-matches-1 (around my-ido-set-matches-1 activate)
  "Choose between the regular ido-set-matches-1 and my-ido-fuzzy-match"
  (if my-ido-use-fuzzy-match
      (setq ad-return-value (my-ido-fuzzy-match ido-text (ad-get-arg 0)))
    ad-do-it))

5 comments:

Leo said...

Can you discuss your improvement with ido's maintainer?

Best,

Leo

Unknown said...

Awesome post, please don't stop posting. I've watched planet emacsen for a long time now and sadly most of the posts are not concerned with having a better IDE so thank you for posting this.

Alexey Romanov said...

Thank you, I was just looking for such an algorithm!

Anonymous said...

For java I think I would prefer to add a simple camel case matching, for example FBB should limit down to FooBarBaz.java. Seems more useful for the size of projects I work with.

dendy said...

thanks for sharing!