1

I have two large files consisting mainly of numbers in matrix form, and I'd like to use diff (or a similar command) to compare these files and determine which numbers are different.

Unfortunately, quite a lot of these numbers differ only by sign, and I'm not interested in those differences. I only care when two numbers differ in magnitude (e.g. I want 0.523 vs. 0.623, but NOT 0.523 vs. -0.523).

Is it possible to make diff ignore the sign and only print numbers that are different in magnitude?

EDIT: Some input examples, as requested:

File 1:

21   -0.0081318   0.0000000   0.0000000   0.0000000  -0.0138079
22    0.0000000   0.0000000   0.0000000   0.1156119   0.0000000
23    0.0000000   0.0047536   0.0000000   0.0000000   0.0000000

File 2:

21   -0.0081318   0.0000000   0.0000000   0.0000000   0.0032533
22    0.0000000   0.0000000   0.0000000  -0.0250637   0.0000000
23    0.0000000  -0.0047536   0.0000000   0.0000000   0.0000000

Assuming my files are formatted mostly like this (except much, MUCH longer), I want to print out the differences, but ignore them when they're only in sign. For example, I don't care about 0.0047536 vs. -0.0047536, but I do want to print 0.1156119 vs. -0.0250637.

johnymm
  • 2
    What does "mainly in matrix form" mean? An input sample might help, as well as an output sample. – RudiC Oct 03 '18 at 21:20
  • Could you just delete / eliminate all minus signs from the input? With tr, or sed? – RudiC Oct 03 '18 at 21:21
  • 2
    Please at least provide examples of input and desired output. Do you only want to know whether they differ, or the difference, or both values, or some combination thereof? How should it be displayed? Also, I'd suggest using either a mathematics suite (e.g. GNU Octave) or a suitable programming language (e.g. Fortran) for this task, especially if the files are large. BTW: diff compares whole files or strings and does so line by line, not word by word, so it is not the right tool here. – FelixJN Oct 03 '18 at 21:39
  • @RudiC Could you show me how I could eliminate all minuses from the input and still leave the format of the text intact ? – johnymm Oct 03 '18 at 23:27
  • Do you know the matrix sizes? Do they vary? What should the output look like? – FelixJN Oct 03 '18 at 23:36
  • @Fiximan All files I'm trying to compare are very large (over 10000 lines) but should be identically formatted; i.e. all the matrices printed in each file are exactly the same size and shape. Only the individual numbers may differ. I modified my original post to include an example (I hope that helps). – johnymm Oct 03 '18 at 23:40
  • please tell me what's the problem with my answer; do you want the lines printed in the output as they were before normalization? –  Oct 04 '18 at 00:15

3 Answers

3

Provided your shell supports "process substitution" (as recent versions of bash do), try:

diff <(tr '-' ' ' <file1) <(tr '-' ' ' <file2)
1,2c1,2
< 21    0.0081318   0.0000000   0.0000000   0.0000000   0.0138079
< 22    0.0000000   0.0000000   0.0000000   0.1156119   0.0000000
---
> 21    0.0081318   0.0000000   0.0000000   0.0000000   0.0032533
> 22    0.0000000   0.0000000   0.0000000   0.0250637   0.0000000
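
One caveat with the command above: tr replaces every '-' in the file, including any inside exponents (such as 1.5e-03) or text. If that matters for your data, a slightly more selective filter could look like the sketch below. The function name strip_sign is made up for illustration, and the sketch assumes a sed that accepts -E (GNU and BSD sed both do). It blanks out only a minus sign that begins a field, so column widths stay the same.

```shell
# Hypothetical helper: replace a minus sign with a space only when it starts
# a numeric field (preceded by line start or whitespace, followed by a digit
# or a dot). Exponent signs such as the one in 1.5e-03 are left untouched.
strip_sign() {
  sed -E 's/(^|[[:space:]])-([0-9.])/\1 \2/g' "$@"
}

# Possible usage with the two files from the question (needs bash for <(...)):
#   diff <(strip_sign file1) <(strip_sign file2)
```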
RudiC
  • This also works. Thanks! I can't believe it was this easy haha. I should be more familiar with basic Unix commands. – johnymm Oct 04 '18 at 14:45
1
$ xdiff(){ diff -bu <($1 "$2") <($1 "$3"); }
$ xdiff 'sed s/-\([.0-9]\)/\1/g' file1 file2

You can also apply other normalizations to the data. For instance, to treat all of 0.01, .01, -.0100, -.01e as the same:

$ norm(){ awk '{for(i=1;i<=NF;i++){$i=$i<0?-$i:+$i};print}' "$@"; }
$ xdiff norm file1 file2
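
A note on precision with the awk approach: when awk writes a modified numeric field back out, it formats non-integer values with its default conversion format, "%.6g", so a value like 0.1156119 from the question would come out as 0.115612 and a difference in the seventh significant digit could go unnoticed. The sketch below is one way around that, under the assumption that 15 significant digits are enough for your data; the name norm_precise is invented here. Because awk rebuilds the line with single spaces, it still relies on diff -b as in xdiff.

```shell
# Hypothetical variant of norm: same absolute-value normalization, but with
# CONVFMT and OFMT raised so values with more than six significant digits
# are not rounded when awk converts them back to text.
norm_precise() {
  awk -v CONVFMT='%.15g' -v OFMT='%.15g' '{
    for (i = 1; i <= NF; i++)
      $i = ($i < 0) ? -$i : +$i   # drop the sign, force a numeric value
    print
  }' "$@"
}

# Possible usage with the xdiff function defined above:
#   xdiff norm_precise file1 file2
```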
  • This approach seems to delete the minuses, but also displaces the entire line to the left - so diff still displays the line. Is it possible to replace the minus with an empty space, so the format remains the same? – johnymm Oct 04 '18 at 00:52
  • 1
    the -b option given to diff should tell it to treat any amount of whitespace as the same, so it shouldn't display the line. but yes, you can replace - by a space by adding a space before the \1 (eg sed s/../\ \1/g). –  Oct 04 '18 at 00:58
  • I'm not on my computer right now but I'll try it as soon as I can and let you know if it works. – johnymm Oct 04 '18 at 01:00
  • When I add a space before "\1" it gives me the error:
    sed: -e expression #1, char 14: unterminated `s' command
    
    – johnymm Oct 04 '18 at 04:47
  • yes, it will be split with IFS. it'll work if you put it in a function though: norm(){ sed 's/-/ /g' "$@"; }; xdiff norm file1 file2. rather than extend that minimal xdiff example, you could use the script from my answer here like this: x 'diff -bu' "sed 's/-/ /g'" file1 file2 –  Oct 04 '18 at 04:57
  • Ok, on a first glance that seems to have worked. Thanks so much!! I'll verify it in the morning to make sure it's the correct output, but it looks ok as far as I can tell. – johnymm Oct 04 '18 at 05:08
0

To print the lines from file2 whose fields differ in absolute value from the corresponding fields on the same line of file1 (lines are matched by their first field), save the lines from file1 in memory, then compare the absolute values:

function abs(x) {
  return x < 0 ? -x : x;
}

# file 1
NR == FNR {
  file1[$1]=$2" "$3" "$4" "$5" "$6
}

# file 2
NR != FNR {
  split(file1[$1], file1fields);
  if ( abs($2) - abs(file1fields[1]) ||
       abs($3) - abs(file1fields[2]) ||
       abs($4) - abs(file1fields[3]) ||
       abs($5) - abs(file1fields[4]) ||
       abs($6) - abs(file1fields[5]) )
        print;
}

Save that in a file, then run awk -f /path/to/that/file file1 file2
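
If the number of columns varies, or you want to see exactly which numbers differ, the same idea can be generalized. The following is only a sketch under a few assumptions: both files have the same number of lines in the same order, fields are separated by whitespace, and any non-numeric field counts as 0 (so differences in text labels are not reported). The name magdiff is made up for this example. It reads the two files in lockstep, so neither file has to be held in memory.

```shell
# Hypothetical helper: report every field whose magnitude differs between
# two identically shaped files, line by line.
magdiff() {
  awk -v other="$2" '
    # Absolute value; the unary plus forces a numeric result.
    function abs(x) { return x < 0 ? -x : +x }
    {
      # Fetch the matching line of the second file; stop if it runs out.
      if ((getline line < other) <= 0) exit
      split(line, f)
      for (i = 1; i <= NF; i++)
        if (abs($i) != abs(f[i]))
          printf "line %d, column %d: %s vs. %s\n", FNR, i, $i, f[i]
    }
  ' "$1"
}

# Possible usage:
#   magdiff file1 file2
```

With the sample input from the question, this should print "line 1, column 6: -0.0138079 vs. 0.0032533" and "line 2, column 5: 0.1156119 vs. -0.0250637", and nothing for the line that differs only in sign.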

Jeff Schaller
  • I'm not sure what you mean by "save the lines from file1 in memory" – johnymm Oct 04 '18 at 04:51
  • @johnymm it means file1 is completely saved to RAM, then file2 is read (more or less) line by line and the corresponding lines are compared. If file1 exceeds your PC's memory, you will have to split the files beforehand. – FelixJN Oct 04 '18 at 10:46