48

I have two files which look identical to me (including trailing whitespace and newlines), but diff still says they differ. Even when I do a diff -y side-by-side comparison, the lines look exactly the same. The output from diff is the entire contents of both files.

Any idea what's causing it?

DewinDell
  • Try to compare unprintable characters. The simplest way to view them is sed -n l filename. If that doesn't help, add a data example and the diff output here. – rush Aug 17 '12 at 13:18
  • Ahh yes, thank you: the lines in one file end with $ and in the other with \r$ – DewinDell Aug 17 '12 at 13:33
  • A quick fix is to use dos2unix on both files (or the one you suspect to be from a Windows machine). – chembrad Jun 02 '15 at 18:43
  • As a complement to existing answers: the file command will hint at the file content, distinguishing things like ASCII text, with CRLF line terminators from plain ASCII text. – Stéphane Gourichon Dec 15 '15 at 16:44
  • I know that I'm late to the party, and this specific question has been answered (i.e., DewinDell's problem has been solved), but anybody who has a problem like this should do an ls -l (or stat) on both files and compare the sizes (and include that information in any question). That's a minimal, obvious first step toward diagnosing the situation. – G-Man Says 'Reinstate Monica' Mar 03 '20 at 07:58

7 Answers

34

Odd .. can you try cmp? You may want to use the '-b' option too.

cmp man page - Compare two files byte by byte.
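If the only difference is the line endings, cmp -b will point straight at the first mismatching byte. A rough sketch of what the output can look like (the byte offset and values depend entirely on your files):

cmp -b file1 file2
file1 file2 differ: byte 12, line 1 is 15 ^M 12 ^J

Here 15 and 12 are the octal values of CR and LF, which cmp -b also renders as ^M and ^J.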

This is one of the nice things about Unix/Linux .. so many tools :)

Levon
33

Try:

diff file1 file2 | cat -t

The -t option will cause cat to show any special characters clearly - e.g. ^M for CR, ^I for tab.

From the man page (OS X):

 -t      Display non-printing characters (see the -v option), and display tab characters as `^I'.

 -v      Display non-printing characters so they are visible. Control characters print as `^X' for control-X; the delete character (octal 0177) prints as `^?'. Non-ASCII characters (with the high bit set) are printed as `M-' (for meta) followed by the character for the low 7 bits.
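As a sketch of what this looks like in the DOS-vs-UNIX case (the file contents here are hypothetical), the stray carriage returns show up as ^M at the end of the lines coming from the CRLF file:

diff file1 file2 | cat -t
1c1
< hello world^M
---
> hello world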

JosephH
18

Might the differences be caused by DOS vs. UNIX line endings, or something similar?

What if you hexdump them? This might show the differences more obviously, e.g.:

hexdump -C file1 > file1.hex
hexdump -C file2 > file2.hex
diff file1.hex file2.hex
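For example, with two toy files containing only the text hi, one CRLF-terminated and one LF-terminated (column spacing approximate), the diff of the hex dumps makes the extra 0d byte obvious:

1,2c1,2
< 00000000  68 69 0d 0a                                       |hi..|
< 00000004
---
> 00000000  68 69 0a                                          |hi.|
> 00000003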
mrb
  • Well, the two hex dumps are different: every time there's a 0d 0a in one file, the other one just has 0a – DewinDell Aug 17 '12 at 13:29
  • In one, you have DOS line endings (CRLF) and in the other, UNIX line endings (LF). That's why they look different to diff but not when you look at them visually. Look at https://en.wikipedia.org/wiki/Newline#Conversion_utilities – mrb Aug 17 '12 at 13:32
  • Got it! Thanks a lot. Levon's suggestion of using cmp shows the difference more clearly though :) – DewinDell Aug 17 '12 at 13:39
6

My first guess, which turns out to be confirmed, is that the files use different line endings. It could be some other difference in whitespace, such as the presence of trailing whitespace (but you typically wouldn't get that on many lines) or different indentation (tabs vs. spaces). Use a command that prints out whitespace and control characters in a visible form, such as:

diff <(cat -A file1) <(cat -A file2)
diff <(sed -n l file1) <(sed -n l file2)
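For instance (with hypothetical one-line files), the second command would show a CRLF-terminated line against an LF-terminated one like this, with \r marking the carriage return and $ marking the end of each line:

1c1
< hello world\r$
---
> hello world$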

You can confirm that the differences only have to do with line endings by normalizing them first. You may have a dos2unix utility; if not, remove the extra CR (^M, \r, \015) character explicitly:

diff <(tr -d '\r' <file1) <(tr -d '\r' <file2)

or, if file1 is the one with DOS line endings:

tr -d '\r' <file1 | diff - file2
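Since diff exits with status 0 when it finds no differences, you can turn that last command into a quick check; the trailing message here is just illustrative:

tr -d '\r' <file1 | diff -q - file2 && echo 'only the line endings differ'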
4

The other answers are thorough, but they focus on making the otherwise invisible differences visible. There is another option: ignoring those differences when they don't matter. In some cases it simply isn't useful to be told about them.

The diff command has some useful options for this:

--strip-trailing-cr
    strip trailing carriage return on input

-B, --ignore-blank-lines
    ignore changes where lines are all blank

-Z, --ignore-trailing-space
    ignore white space at line end

Personally, I have found --strip-trailing-cr useful, especially when using the -r (i.e. --recursive) option on large projects, or when Git's core.autocrlf is not false (i.e. is either true or input).
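A minimal sketch of that combination (the directory names are placeholders):

diff -r --strip-trailing-cr projectA/ projectB/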

For more information on these and other options, see the diff man page (man diff).

Note: using these options affects performance, especially with huge files/directories. In one of my own cases, it increased the run time from 0.321s to 0.422s.

1

For anyone on Windows, you can do this with fc, which can do a binary comparison:

fc /B file1 file2
  • Hi Johan, and welcome to the UNIX & Linux Stack Exchange! Our target systems here are UNIX/Linux systems; you might find Windows-centric answers more on-topic at SuperUser or Server Fault. Thank you! – Jeff Schaller Nov 18 '20 at 17:01
0

In side-by-side view add --suppress-common-lines to the options.

All the other answers and comments here are good to know, but they are not sufficient on their own. The original question is explicitly about a side-by-side comparison. Even two files produced with cp will be listed in full in side-by-side mode, all issues with line endings, spaces or special characters aside. You will always need --suppress-common-lines to get the desired result.
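For example, with the two files from the question, the side-by-side invocation becomes (output omitted here):

diff -y --suppress-common-lines file1 file2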

This may not be obvious to non-native English speakers, as 'common' can be read as 'normal' rather than 'shared'. Perhaps it would be clearer if the option were called 'suppress-equal-lines' or similar. I was also surprised that there is no short, one-letter option for such a 'common' :) task.

Frnk