3

I'm trying to do a simple grep and grep -v so that I get the lines from a.txt that exist in b.txt but not in c.txt.

Example of 3 files

a.txt:

a
b
c
d
e

up.txt:

a.up
b.up
c.up

dw.txt:

a.dw
b.dw

Desired output:

c

I wrote the code below, but grep treats the output of $(sed ...) one line at a time and not as a whole:

sed 's/.up//' /tmp/b.txt | grep -f /tmp/a.txt | grep -vf $(sed 's/.dw//' /tmp/c.txt)
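For reference, a process-substitution variant avoids the word splitting of $(…): -f expects a pattern *file*, which <(…) provides. This is a sketch assuming bash; the anchored sed patterns and the -x/-F flags are adjustments, and the sample files are recreated inline:

```shell
# Sketch (bash assumed): give grep -f a process substitution, not a
# command substitution; -x matches whole lines, -F treats patterns as fixed strings.
printf '%s\n' a b c d e      > /tmp/a.txt
printf '%s\n' a.up b.up c.up > /tmp/b.txt
printf '%s\n' a.dw b.dw      > /tmp/c.txt
sed 's/\.up$//' /tmp/b.txt | grep -xFf /tmp/a.txt | grep -vxFf <(sed 's/\.dw$//' /tmp/c.txt)
# prints: c
```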
slm
  • 369,824
Nir
  • 1,335
    have you tried diff? (kdiff3 is a visual tool, not for streaming/scripts; there are other diff tools, including diff). A 3-way diff tool will do best. – ctrl-alt-delor Jul 08 '18 at 15:52

8 Answers

4

With a single GNU awk command:

awk -F'.' \
'{
     if (ARGIND == 1) a[$1];
     else if (ARGIND == 2 && $1 in a) comm[$1];
     else if (ARGIND == 3){
         delete a;
         if ($1 in comm) delete comm[$1]
     }
 }
 END{ for (i in comm) print i }' a.txt b.txt c.txt

The output:

c

  • -F'.' - treat . as field separator
  • ARGIND - the index in ARGV (the array of command-line arguments) of the file currently being processed
  • comm - array of items common to the first two files (a.txt and b.txt)
  • good one, is ARGIND common for all awk or specific to gawk, etc? you could also reverse the order of files to avoid deleting from array and having to use END... if (ARGIND == 1) c[$1]; else if (ARGIND == 2 && !($1 in c)) b[$1]; else if (ARGIND == 3 && $1 in b){ print } – Sundeep Jul 08 '18 at 14:16
  • @Sundeep, it's GNU extension – RomanPerekhrest Jul 08 '18 at 15:23
3

Assuming the files are all sorted and that we're using a shell that understands process substitutions (like bash):

$ join -t . -v 1 -o 0 <( join -t . a.txt b.txt ) c.txt
c

or, for other shells,

$ join -t . a.txt b.txt | join -t . -v 1 -o 0 - c.txt
c

This uses join twice to perform relational joins between the files. The data is interpreted as dot-delimited fields (with -t .).

The join between a.txt and b.txt is straightforward and produces

a.up
b.up
c.up

These are all the lines from the two files whose first dot-delimited field occurs in both files. The output consists of the join field (a, b, c) followed by the other fields from both files (only b.txt has any further data).

The second join is a bit more special. With -v 1 we ask to see the entries in the first file (the intermediate result above) that can't be paired with any line in the second file, c.txt. Additionally, we only ask to see the join field itself (-o 0). Without the -o flag, we would get c.up as the result.


If the files are not sorted, then each occurrence of a filename file in the commands could be replaced by <( sort file ).
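For example, with hypothetical unsorted copies of the sample files (the _unsorted.txt names are made up for this sketch; bash assumed):

```shell
# Unsorted copies of the sample data; sort each inline before joining.
printf '%s\n' d b e a c      > a_unsorted.txt
printf '%s\n' c.up a.up b.up > up_unsorted.txt
printf '%s\n' b.dw a.dw      > dw_unsorted.txt
join -t . -v 1 -o 0 \
    <( join -t . <(sort a_unsorted.txt) <(sort up_unsorted.txt) ) \
    <(sort dw_unsorted.txt)
# prints: c
```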

Kusalananda
  • 333,661
2

comm

Assuming the files are sorted, and duplicate lines removed:

comm -12 a.txt <(cut -d. -f1 b.txt) | comm -23 - <(cut -d. -f1 c.txt)

This is written for Ubuntu, using Bash and GNU utilities, but hopefully it works on other systems too.

Explanation

  • comm -12 Print the lines that both files share (read man comm for details)
  • <(...) Process substitution - Use a command in place of an input file
  • cut -d. -f1 For each line, remove everything after the first dot
  • comm -23 Print the lines that are unique to the first file
  • - Read from stdin instead of a file
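Put together with the sample data (recreated here under the b.txt/c.txt names this answer uses; bash assumed), the pipeline yields:

```shell
# Recreate the sample files, then intersect with up-names and subtract dw-names.
printf '%s\n' a b c d e      > a.txt
printf '%s\n' a.up b.up c.up > b.txt
printf '%s\n' a.dw b.dw      > c.txt
comm -12 a.txt <(cut -d. -f1 b.txt) | comm -23 - <(cut -d. -f1 c.txt)
# prints: c
```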
wjandrea
  • 658
2

If the files given are sorted and there are no internal duplicates, use this:

$ comm -12 a.txt <(sed 's/\.[^.]*$//' up.txt) | comm -23 - <(sed 's/\.[^.]*$//' dw.txt)

This works in shells that have process substitution (<(…)). For other shells, read below.


What you describe in this sentence:

get the lines from a.txt that exists in b.txt and not in c.txt

could be reduced to the set operations:

( a intersect b ) complement c

There are several ways to execute set operations on files; many are listed in this answer.

I like the way the command comm can perform most of these operations.
But the files you present are not a clean set to use: the extensions need to be removed first. The generic way to remove an extension with sed is:

$ sed 's/\.[^.]*$//' file
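A quick check of that pattern on a hypothetical demo file (demo.txt is made up here) shows it removes only the last extension:

```shell
# The regex deletes the final dot and everything after it, once per line.
printf '%s\n' a.up b.tar.gz c > demo.txt
sed 's/\.[^.]*$//' demo.txt
# a
# b.tar
# c
```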

So, the two clean files will be created with:

$ sed 's/\.[^.]*$//' up.txt > up.txt.clean
$ sed 's/\.[^.]*$//' dw.txt > dw.txt.clean

With those two files, a one-liner solution is:

$ comm -12 a.txt up.txt.clean | comm -23 - dw.txt.clean
c

Or, doing (up.txt complement dw.txt) intersect a.txt:

$ comm -23 up.txt.clean dw.txt.clean | comm -12 - a.txt
c

Both commands could be implemented directly from original files in some shells with:

$ comm -12 a.txt <(sed 's/\.[^.]*$//' up.txt) | comm -23 - <(sed 's/\.[^.]*$//' dw.txt)

If process substitution is not available, it is possible to use a single temporary file as follows:

$ sed 's/\.[^.]*$//' up.txt | comm -12 a.txt - >result1.txt
$ sed 's/\.[^.]*$//' dw.txt | comm -23 result1.txt -
c
$ rm result1.txt
0
$ grep -f a.txt <(cut -d '.' -f 1 up.txt) > common.txt
$ grep -vf <(cut -d '.' -f 1 dw.txt) common.txt

The first command compares the first field of up.txt against a.txt and writes the matching words to common.txt. The second compares dw.txt with common.txt and prints the inverted match, i.e. c.
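A self-contained run of the two commands (bash assumed; the sample files are recreated, and common.txt is the intermediate):

```shell
printf '%s\n' a b c d e      > a.txt
printf '%s\n' a.up b.up c.up > up.txt
printf '%s\n' a.dw b.dw      > dw.txt
grep -f a.txt <(cut -d '.' -f 1 up.txt) > common.txt   # common.txt: a b c
grep -vf <(cut -d '.' -f 1 dw.txt) common.txt          # prints: c
```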

slm
  • 369,824
Nisha
  • 37
0
perl -F\\. -lane '
   $h{@ARGV}{$F[0]}++,next if @ARGV;
   print if exists $h{2}{$_} && !exists $h{1}{$_};
' up.txt dw.txt a.txt

Build up the hash %h with top-level keys 2 and 1: 2 refers to the first argument (up.txt), 1 to the second (dw.txt). For the data given, the hash structure would look something like this (the order could differ):

%h = (
   1 => { a => 1, b => 1, },
   2 => { a => 1, b => 1, c => 1, },
);

As can be seen, there are two mini-hashes inside the main hash %h (also referred to as a hash-of-hashes, or HoH). So when the time comes to read the third argument (a.txt), we print a record if it is present as a key in the mini-hash under 2 AND absent from the mini-hash under 1.

0

A variation of Roman's answer, made simpler:

gawk -F. 'ARGIND==1{ seen[$1]; next } 
         ARGIND==2{ delete seen[$1]; next }
         ($1 in seen)
' fileUP fileDW fileA
  • ARGIND==1{ seen[$1]; next } stores the first column of fileUP in an associative array named seen.
  • ARGIND==2{ delete seen[$1]; next } deletes the entries that also exist in fileDW.
  • ($1 in seen) prints the lines of fileA whose first field remains in seen.
αғsнιη
  • 41,407
-1

Here's another alternative, similar to yours, using grep, sort & uniq, and sed.

$ sed 's/\.\(dw\|up\)//' up.txt dw.txt | grep -xFf a.txt | sort | uniq -u
c

This works by producing a list of matches for each of the files up.txt and dw.txt, using a.txt as an input file to grep. That produces output like this:

$ sed 's/\.\(dw\|up\)//' up.txt dw.txt | grep -xFf a.txt
a
b
c
a
b

The key details here are that we're:

  • using sed to chop off any trailing extensions from the 2 files up.txt and dw.txt
  • With extensions deleted, we then use grep to filter any corresponding matches from a.txt
  • The matching that we tell grep to perform is exact, -x
  • The -F tells grep to treat the patterns in a.txt as fixed strings

With the above output in hand, you can simply run it through sort and then use uniq -u to get only the lines that do not repeat.
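The final stage in isolation: uniq -u keeps only the lines that occur exactly once in its sorted input, so the doubly-matched a and b drop out.

```shell
printf '%s\n' a b c a b | sort | uniq -u
# prints: c
```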


slm
  • 369,824