Home Developer & Utility Tools Guide

How diff algorithms work: Myers, patience, and why some diffs read better

Two implementations of the same algorithm, fed the same files, can produce diffs that look completely different. Here is why.

Take a refactor: you moved a function from the top of the file to the bottom, renamed a variable inside it, and tweaked one line. Paste the two versions into the diff tool that ships with your editor. Paste the same two versions into git diff --patience on the command line. You get two diffs that don't even look like they're describing the same change.

The algorithms picked different "common ground" between the files. Both are correct in the sense that applying their edit script to the old file gives you the new one. Only one of them is readable. The difference between a diff that tells you what happened and a diff that fights you is the algorithm, not the input.

The diff problem

Diff is, at its heart, the longest common subsequence (LCS) problem applied to lines (or tokens, depending on the granularity). Find the longest sequence of lines that appears, in order, in both files. Everything not in that sequence is either an addition (in the new file but not the old) or a deletion (in the old but not the new). The naive LCS algorithm is O(n·m) in time and space, which for two files of 10,000 lines is 100 million cells of dynamic-programming table — slow but tractable.

For files that share most of their content, that's wasteful. Almost every modern diff algorithm is a refinement that exploits similarity to avoid building the whole table.

Myers's O(ND) algorithm — what every modern diff inherits

Eugene Myers published the canonical algorithm in 1986. The insight: instead of computing LCS directly, search for the shortest edit script between the two files. The time complexity is O(ND), where N is the total length of both files and D is the number of differences. For similar files (small D), it's blindingly fast. For completely different files (large D), it degenerates to roughly O(N²).

Almost every command-line diff and every IDE diff uses Myers or a Myers variant under the hood. diff on Linux, the diff inside VS Code, the diff GitHub shows on pull requests — all of them, by default, are some flavour of Myers.

Why Myers sometimes produces ugly diffs

Myers minimises the number of edit operations. That's not the same as producing the most readable diff. Source code is full of short, repeated lines: closing braces, blank lines, single-word imports, lone else: lines. Myers happily matches a closing brace from the top of one file to a closing brace at the bottom of the other, because that match is syntactically valid and saves an edit operation. The result: a diff that "matches" closing braces across moved functions and shows the actual code movement as massive add/remove blocks.

You can spot the symptom: a diff where a function looks like it was deleted at the top and rewritten at the bottom, when what actually happened was a move. Reviewing this diff means manually reconstructing the move in your head — the thing the diff was supposed to do for you.

Patience diff

Bram Cohen — yes, the BitTorrent one — invented patience diff in 2008 for the Bazaar version-control system. Git adopted it shortly after, as git diff --patience. Named after the card game patience, where you build sequences from unique anchor cards.

The trick: find lines that appear exactly once in each file and once in the other. Those become anchors. The diff between two anchors is computed recursively, but the overall structure is pinned to those uniquely-appearing lines.

For source code, patience is dramatically better than Myers. Functions almost always have unique signature lines — def parse_input(raw) appears in one place in each file, not in twenty. Those signatures become anchors. Moves and rewrites become legible instead of getting hidden in line-soup matching.

Histogram diff

Available in git since version 2.11 (2016), and widely treated as the most readable algorithm of the three — but it's opt-in. Git's default is still Myers; you switch to histogram via git config diff.algorithm histogram or per-command with --histogram. Histogram is a refinement of patience: instead of "unique in both files", consider lines by frequency, and anchor on the rarest matching lines first. Slightly faster than patience and produces nearly identical output. Most diff tools that say they implement patience actually implement histogram.

Why diff readability matters

A diff isn't just a record of what changed. It's how reviewers form a mental model of the change. A noisy diff hides bugs — the eye glazes over a 200-line patch full of false matches, and a subtle off-by-one slips through. A clean diff makes bugs obvious. The reviewer reads exactly the lines that changed; the structure around them is unchanged context they don't have to re-evaluate.

Patience and histogram add maybe 5 to 10% to diff runtime in exchange for diffs that read like human-written summaries of changes. For a code review, that's a great trade. For a daily-driver diff tool, the runtime cost is invisible.

How to switch your tools to a better algorithm

Git lets you pick the algorithm per-invocation (git diff --histogram) or set a default for the whole repo:

  • git config --global diff.algorithm histogram — sets histogram as the default for every git diff and git log -p.
  • git config --global merge.conflictStyle zdiff3 — pairs nicely; merge conflict markers include the common ancestor, which makes resolution easier.

GitHub's web diff appears to use histogram or a similar refinement (the cleaner pull-request view is one tell), so reviewing on the web is usually fine. Local git diff is the one most people forget to switch, and it's the one that matters for the actual day-to-day work of writing code.

Line vs token vs character diff

Everything so far is line-level. Most diff tools can also operate on smaller units:

  • Token diff splits each line into words or identifiers and diffs at that granularity. Useful for prose, where a long line with one word changed shouldn't show as "line removed, line added".
  • Character diff goes further still. Useful when comparing strings inside log lines, but usually overkill for code.

Most line-diff tools fall back to a word-level highlight inside a changed line, so you get the best of both: line-level structure for the overall change, word-level detail for what actually moved inside each line.

When to use the text-diff-checker tool here

The text diff checker on this site is for the ad-hoc comparison case — two paragraphs of email, two versions of a config snippet, two log lines you want to verify are identical. When firing up diff or putting two files in a git repo isn't worth the friction, paste them in and scan the highlights.

Related reading

The JSON formatter has a built-in semantic diff for structured data — a different problem, since reordered object keys aren't really differences. The JSON parse errors guide is the companion piece for when the diff isn't your problem and the JSON is.