Handling Files that Do Not Fit in Memory

511 Handling Files that Do Not Fit in Memory Contents|Index|Previous|Next

Handling Files that Do Not Fit in Memory

diff operates by reading both files into memory. This method fails if the files are too large, and diff should have a fallback.

One way to do this is to scan the files sequentially to compute hash codes of the lines and put the lines in equivalence classes based only on hash code. Then compare the files normally. This does produce some false matches.

Then scan the two files sequentially again, checking each match to see whether it is real. When a match is not real, mark both the "matching" lines as changed. Then build an edit script as usual.

The output routines would have to be changed to scan the files sequentially looking for the text to print.