
I have two files as below. I am trying to figure out a way to create a master file containing all entries from File1 and the second column of File2, adding 0 wherever an entry is missing from File2. All these files are tab-separated. I tried the join command, but could not get it to work.

  • Example of File1
    orange
    banana
    berry
    cherry
    strawberry
    
  • Example of File2:
    orange   1   
    banana   2   
    cherry   1   
    
  • Desired output
    Output    Value
    orange     1
    banana     2
    berry      0
    cherry     1
    strawberry 0
    

What I tried:

join File1 File2 | less
  • Welcome to the site. Please edit your post to include some more details: (1) How are the columns delimited (space/tab/...)? (2) Do you want to have the "Header" line in the output? (3) Which order should the lines in the output be? That of File1 or that of File2 (if they can be different)? – AdminBee Dec 03 '21 at 15:18
  • Yes, I added the information you suggested. Thanks – biostat newbee Dec 03 '21 at 15:27

4 Answers

$ join -a 1 -e 0 -o 0,2.2 <(sort File1) <(sort File2)
banana 2
berry 0
cherry 1
orange 1
strawberry 0

This uses join to perform a relational JOIN operation between the files. join requires both inputs to be sorted, which is why we sort each of them in a process substitution (you could obviously pre-sort the data if you wish). The command lists all lines from the first input file (-a 1) and replaces missing fields with 0 (-e 0). The output fields are the join field (the first field of each file by default, written as 0 in the argument to the -o option) and the second field of the second file (2.2).

Pro: Fast (especially if the data is already sorted) and memory-efficient.
Con: Re-orders the data.
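Since the files are tab-separated, you may additionally want to keep TAB as the delimiter and prepend the header line from the desired output. A minimal sketch building on the command above (it assumes File2 uses a single TAB between its columns):

```shell
# Emit the header, then join on TAB so the output stays tab-separated
printf 'Output\tValue\n'
join -t "$(printf '\t')" -a 1 -e 0 -o 0,2.2 <(sort File1) <(sort File2)
```

With -t, the same character is also used as the output separator, so the result remains tab-separated.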


To preserve the order of the original File1, you may use awk instead:

$ awk 'NR == FNR { key[$1] = $2; next } { $2 = ($1 in key) ? key[$1] : 0 }; 1' File2 File1
orange 1
banana 2
berry 0
cherry 1
strawberry 0

This reads the 1st column of File2 as keys in the key associative array, and the 2nd column as their associated values.

When File1 is being read (NR is no longer equal to FNR), we set the 2nd column to either the value from the key array, if there is a key corresponding to the 1st column, or to 0 if there is no such key.

You may shorten the code somewhat by abusing the fact that an uninitialized value is zero in arithmetic contexts:

$ awk 'NR == FNR { key[$1] = $2; next } { $2 = 0+key[$1] }; 1' File2 File1
orange 1
banana 2
berry 0
cherry 1
strawberry 0
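The shortened form relies on plain awk semantics: an array element that was never assigned evaluates to 0 in arithmetic context, which you can check in isolation:

```shell
# Referencing an unassigned array element in arithmetic yields 0
awk 'BEGIN { print 0 + missing["nope"] }'
# prints 0
```

(Note that merely referencing key[$1] also creates the element; that is harmless here since File2 has already been read completely.)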

Pro: Output is ordered according to File1.
Con: Data from File2 is stored in memory (only really matters if reading huge number of lines).

Kusalananda
  • This is a very good example of how to use join command. Thanks THEY for this clever answer –  Dec 04 '21 at 11:06

I don't know any command that can do that, but it can be scripted:

tab=$(printf '\t')
while IFS= read -r l1; do
  # Anchor on "key<TAB>" so that e.g. "berry" does not also match "berrytart"
  grep "^${l1}${tab}" File2 || printf '%s\t0\n' "$l1"
done < File1
Syco

It is possible using awk, but it requires buffering File2 (so if you have huge files you may hit a limit, although usually that is unlikely).

awk 'BEGIN{FS=OFS="\t"}
     !mainfile{val[$1]=$2;next}
     {if ($1 in val) {$2=val[$1]} else {$2=0}} 1' File2 mainfile=1 File1

It works as follows:

  • Before processing anything, we set the input and output field separator to TAB.
  • While processing the first input file (File2 in this case), indicated by the variable mainfile still being uninitialized, we simply record the "value" for each of the "fruits" in an associative array val. After that, we immediately skip to the next input line (bypassing the part that should only apply while processing File1).
  • Before processing the next file, awk will first evaluate the mainfile=1 statement to set mainfile to, well, 1.
  • Once mainfile is set, the first rule is ignored, and we only process the second rule. Here, we check if there is a "value" mapped to the entry in column 1. If so, we use that value for column 2, otherwise we set column 2 to 0.
  • The seemingly stray 1 outside of the rule blocks instructs awk to print the current line, including any modifications made.
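The mainfile=1 operand between the file names is standard awk behavior: any var=value among the file operands is evaluated when awk reaches that position in the argument list. A small, self-contained demonstration (file names f1 and f2 are made up here):

```shell
# flag is unset while the first file is read, and 1 for the second;
# flag+0 forces numeric context so the unset value prints as 0.
printf 'a\n' > f1
printf 'b\n' > f2
awk '{ print FILENAME, flag+0 }' f1 flag=1 f2
# prints:
# f1 0
# f2 1
```

Unlike the common NR == FNR test, this stays correct even when the first file is empty.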
AdminBee

Sounds like something you can hack together very quickly in Python:

#!/usr/bin/env python3
"""
Overly grande solution to the problem posed in the question.
"""
import sys

# Check whether the user specified two arguments to the script
if not len(sys.argv) == 3:
    sys.stderr.write(f"Usage: {sys.argv[0]} File1 File2\n")
    sys.exit(-1)

# Empty dictionary
dictionary = {}

# Fill the dictionary with zero-entries from the first file
with open(sys.argv[1]) as file:
    for line in file:
        dictionary[line.strip()] = 0

# Replace the zero entries when found in second file
with open(sys.argv[2]) as file:
    for line in file:
        entry, value = line.split()
        dictionary[entry] = int(value)

# Print the result table
print("Output\tValue")
for key, value in dictionary.items():
    # We're using the format string syntax to give the count
    # field a constant length of 8, so that things are nicely
    # right-aligned.
    print(f"{key}\t{value:8}")

  • I tried this method and its giving me duplicate row entries Output Value orange 0 banana 0 berry 0 cherry 0 strawberry 0 orange 1 banana 2 cherry 1 – biostat newbee Dec 03 '21 at 17:46
  • aaaah yes, my bad, the lines of the first file are read such that every entry contains the "newline" character at the end of the line. Fixing that by using strip... (there should be a line break after every 0-count entry, is that right?) – Marcus Müller Dec 03 '21 at 18:05
  • Thank you. that worked perfectly – biostat newbee Dec 06 '21 at 21:31
  • you're welcome! If an answer helps you, you should generally feel encouraged to upvote it! – Marcus Müller Dec 06 '21 at 23:22