
I want to be able to go through a file and find lines that are different in the fields I expect to be stable, ignoring differences in the fields I know will be different.

I want to ignore columns 2, 5 and 6 - changes in columns 1, 3 and 4 are the ones worth reporting.

To give an example:

(screenshot of sample rows; the data is reproduced in non-image form below)

I would not want the first two lines reported as containing a change, but I would want the second pair of lines to be reported.

The file was already sorted with

sort -k1,1 -k3,3n -k4,4n

Any suggestions? (Apologies for the question formatting, I'm new)

The data in non-image form, with additional lines before and after:

NZ_CP020102     B4U62_RS00130   26852   28543   DNA polymerase III subunit gamma/tau    NCIB3610a
NZ_CP020102     TESTGENOMECL_26 26852   28543   DNA polymerase III subunit gamma/tau    TESTGENOME
NZ_CP020102     B4U62_RS00135   28567   28890   YbaB/EbfC family nucleoid-associated protein    NCIB3610a
NZ_CP020102     TESTGENOMECL_27 28567   28890   YbaB/EbfC family nucleoid-associated protein    TESTGENOME
NZ_CP020102     B4U62_RS00140   28905   29501   recombination protein RecR      NCIB3610a
NZ_CP020102     TESTGENOMECL_28 28905   29501   recombination protein RecR      TESTGENOME
NZ_CP020102     B4U62_RS00145   29519   29743   DUF2508 domain-containing protein       NCIB3610a
NZ_CP020102     TESTGENOMECL_29 29519   29743   DUF2508 domain-containing protein       TESTGENOME
NZ_CP020102     B4U62_RS00150   29810   30073   sigma-K factor-processing regulatory protein BofA       NCIB3610a
NZ_CP020102     TESTGENOMECL_30 29810   30073   sigma-K factor-processing regulatory protein BofA       TESTGENOME
NZ_CP020102     B4U62_RS00155   30317   31869   16S ribosomal RNA       NCIB3610a
NZ_CP020102     TESTGENOMECL_31 30317   31870   16S ribosomal RNA       TESTGENOME
NZ_CP020102     B4U62_RS00160   31969   32045   tRNA-Ile        NCIB3610a
NZ_CP020102     TESTGENOMECL_32 31969   32045   tRNA-Ile        TESTGENOME

The only two lines that should be returned as relevantly different are the two 16S ribosomal RNA lines, due to the difference in column 4.

For the most part, the lines are paired, but there can be omissions:

NZ_CP020102     B4U62_RS00085   20006   20596   pyridoxal 5'-phosphate synthase glutaminase subunit PdxT        NCIB3610a
NZ_CP020102     TESTGENOMECL_17 20006   20596   pyridoxal 5'-phosphate synthase glutaminase subunit PdxT        TESTGENOME
NZ_CP020102     TESTGENOMECL_4554       20704   20925   hypothetical protein    TESTGENOME
NZ_CP020102     B4U62_RS00090   20918   22195   serine--tRNA ligase     NCIB3610a
NZ_CP020102     TESTGENOMECL_18 20918   22195   serine--tRNA ligase     TESTGENOME

In this case, the unpaired line is the one I would be interested in.

Essentially, I'm looking to suppress some columns, run a diff, but then have the diff output include the suppressed columns.
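
Roughly, the workflow I have in mind looks like this (fileA.tsv and fileB.tsv are placeholders for the two underlying sources that are currently interleaved in my one file):

# Compare only the columns I expect to be stable (1, 3 and 4)...
diff <(cut -f1,3,4 fileA.tsv) <(cut -f1,3,4 fileB.tsv)
# ...but then the output only contains the cut-down lines; I would want the
# differing lines reported with all six original columns intact.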

  • The sample output is ... which lines, and why? – Jeff Schaller Nov 16 '18 at 19:24
  • The two lines with 16s ribosomal RNA should be returned, since they differ in column 4. – davidrh Nov 16 '18 at 19:26
  • so is it fair to say that you want to report lines that differ from an existing triplet when two of the three columns in question are the same and a 3rd column is different? – Jeff Schaller Nov 16 '18 at 19:29
  • Are there pairs of lines, each pair characterized by a certain value in column 2? – sudodus Nov 16 '18 at 19:32
  • Not exactly. It's more like two files that are guaranteed to differ in columns 2,5,6 are expected to be the same in columns 1,3,4, and my file contains alternating lines from these two files. I want diff-like functionality to detect any and all changes to only those columns. Complicating this is that lines can be missing, so I can't just do pairwise comparisons of every two lines. – davidrh Nov 16 '18 at 19:33
  • How many lines are there? Would it be 'few enough' to work in a spreadsheet program, for example Open Office Calc? – sudodus Nov 16 '18 at 19:47
  • 9216 in my current file, but I have no reason to suspect other data won't be significantly larger. Probably not more than an order of magnitude larger though. – davidrh Nov 16 '18 at 19:53
  • Open Office Calc, and the newer Libre Office Calc, have max number of lines (alias rows) = 1,048,576 = 2^20. So it should work for you, and you can do the work interactively in a graphical user interface. This might be a good alternative. – sudodus Nov 16 '18 at 19:59
  • I'm not sure what you are suggesting I do with Open Office? Also, I do intend this to be part of a larger pipeline, so that's not an ideal solution – davidrh Nov 16 '18 at 20:02
  • I see. Well, many people have experience of doing rather advanced things in spreadsheets, but if you don't know how to do it, and you want it to be part of a larger program system, I guess it would be better to use command line tools or do it in a programming language. Maybe you can keep track of the lines (and bring back the suppressed lines) by adding another column with the line number. – sudodus Nov 16 '18 at 20:20

1 Answer


Premise: this is more a proof of concept than a real solution, positioning itself somewhere between inelegant and utterly ugly.

Your requirements are not completely clear to me, so here are my assumptions:

  • You want to single out all and only the rows that have a unique combination of a subset of their columns (1, 3 and 4). If you are interested in something different - say, only in pairs of rows (no more and no fewer than two) with identical values for two columns and distinct values for the third column - please update your question to make it clear.
  • You want to print out the whole content of the selected rows (all the six columns).
  • Fields in your rows are tab-separated - not space-separated. (It would be a bit odd otherwise, since your fifth column seems to include white space characters). This method will have to be adapted to different field separators.

In this code (replace the two occurrences of your_file with the actual file name):

grep -f \
<(sort -k 1b,1b -k 3n,3n -k 4n,4n your_file |
  nl -n rz -w 9 |
  cut -f 1,2,4,5 |
  uniq -f 1 -c |
  grep '^[[:space:]]*1[[:space:]]' |
  sed 's/\(^[[:space:]]*1[[:space:]]*\)\(.*\)/^\2/' |
  cut -f 1) \
<(sort -k 1b,1b -k 3n,3n -k 4n,4n your_file |
  nl -n rz -w 9) |
  sed 's/\(^[0-9]\+[[:space:]]*\)\(.*\)/\2/'
  • Your file is sorted by columns 1, 3 and 4 (probably not necessary since you said in your question that you already did it).
  • The sorted rows are numbered: leading numbers of fixed 9-character width, zero-padded.
  • cut extracts only the three fields you are interested in comparing, plus the first one (the row number).
  • uniq counts duplicates for each resulting line, ignoring the first field (the row number), and adds the count to the beginning of each line.
  • grep selects only the lines whose count equals 1.
  • sed removes the count from the beginning of each resulting line.
  • cut extracts only the first field, i.e. the row number (this is easier than dropping the previous sed and trying cut -f 2).
  • grep uses the resulting set of line numbers to filter the initial set of numbered rows.
  • The final sed removes the leading line number from the filtered row set.

Using the 14 rows in your question as input, it gives:

NZ_CP020102     B4U62_RS00155   30317   31869   16S ribosomal RNA       NCIB3610a
NZ_CP020102     TESTGENOMECL_31 30317   31870   16S ribosomal RNA       TESTGENOME
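
For reference, with these 14 rows the pattern list built by the first process substitution should contain just the zero-padded numbers of those two rows in the sorted file, anchored to the start of the line:

^000000011
^000000012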

Notes:
You might want to use export LC_ALL=C before running this code to avoid locale-dependent sorting issues.
Tested on Linux with bash and GNU tools.
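
As a side note, the same selection (full rows whose combination of columns 1, 3 and 4 occurs exactly once) can be sketched more compactly with a two-pass awk; this is only a rough alternative, again assuming tab-separated fields, and it prints the matching lines in their original order rather than sorted:

# First pass: count every (column 1, column 3, column 4) key.
# Second pass: print the full lines whose key occurs exactly once.
awk -F'\t' 'NR==FNR { count[$1 FS $3 FS $4]++; next }
            count[$1 FS $3 FS $4] == 1' your_file your_file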

fra-san
  • I'll see if I can get this to work, thanks! (I must say I'm surprised that it isn't a function of diff to only look at certain fields but to output the entire line. You'd think you could ask it to "diff on columns 2 and 5".) – davidrh Nov 19 '18 at 16:36
  • Such a function might exist among the common *nix tools, I'm just not aware of it. Please note that, while you are asking for a diff-like solution, you provided a single sample file and your title spells "Find unique lines". Other users here could possibly need a different question to provide a diff-like answer. – fra-san Nov 19 '18 at 17:38
  • True, but in that case, I'd just create two files using grep "TESTGENOME" and grep "NCIB3610a" - they just currently happen to be one file, but it's essentially two files combined into one – davidrh Nov 19 '18 at 18:36
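
For what it's worth, that split-then-compare idea could be sketched along these lines (illustrative only; combined.tsv is a placeholder, and it assumes the genome label is the last tab-separated field with nothing after it):

# Split the combined file by the genome label in column 6.
grep 'NCIB3610a$'  combined.tsv > ref.tsv
grep 'TESTGENOME$' combined.tsv > test.tsv
# Report full lines whose (column 1, 3, 4) key appears in only one of the halves;
# the awk runs twice with the file order swapped to cover both directions.
awk -F'\t' 'NR==FNR { seen[$1 FS $3 FS $4]=1; next } !($1 FS $3 FS $4 in seen)' ref.tsv test.tsv
awk -F'\t' 'NR==FNR { seen[$1 FS $3 FS $4]=1; next } !($1 FS $3 FS $4 in seen)' test.tsv ref.tsv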