Merging two Unix files

Question

I have 2 pipe delimited files say file1 and file2. Where file1 has say 33 columns and file2 may have 34/35/36 columns. So let's not consider how many columns we have.
What I want to do is compare the values in file1 & file2 (from column 1 till column 32). If all values are same then take the values from file2 and append for all same records in file1.
Say for 1st records in file2 have 5 matches in file1 the take value "|84569|21.5|1" and append it to all matches on file1 (see file3 for expected results). Similarly, for 2nd record in file2 we have 5 matches in file1, so take the value "|0" and append it to all matched records in file1. Same goes with 3rd record from file2. There are 3 matches so take value "|21457879|12.4" and append it on all 3 matched rows in file1

If you are thinking how are we selecting from where to take values from file2 for appending in file1 then we should take it from column no 34. Though the start position is fixed but end position is not. Like if you in example "a" the we have taken values from col 34/35/36 but for "b" we have just col 34. however for "c" we have values in col 34/35.

I do not know how to format the data in my examples below. So giving it as it is.

file1

a|a1|a2|a3|a4|...|a32|acb@sma.com
a|a1|a2|a3|a4|...|a32|acd@sm.com$1553:2015-02-14 
a|a1|a2|a3|a4|...|a32|axwer@xi.com30:2015-03-01 
a|a1|a2|a3|a4|...|a32|acbw@ma.com$121:2015-01-31 
a|a1|a2|a3|a4|...|a32|art@ma.com$293:2015-02-28 
b|b1|b2|b3|b4|...|b32|asmi@g.in$542:2013:05:24 
b|b1|b2|b3|b4|...|b32|kasmi@g.in$542:2013:05:24 
b|b1|b2|b3|b4|...|b32|asmi@g.in14:2013:05:24 
b|b1|b2|b3|b4|...|b32|asmi@g.in$542:2013:05:24 
b|b1|b2|b3|b4|...|b32|asmi@g.in232:2014:05:24 
c|c1|c2|c3|c4|...|c32|Asce@ita.in 
c|c1|c2|c3|c4|...|c32|$200:2011:12:06 
c|c1|c2|c3|c4|...|c32|kst@gre.in$214:2001:01:31

file2

a|a1|a2|a3|a4|...|a32|acb@sma.com|84569|21.5|1 
b|b1|b2|b3|b4|...|b32|asmi@g.in$542:2013:05:24|0 
c|c1|c2|c3|c4|...|c32|Asce@ita.in|21457879|12.4

Expected File: File3

a|a1|a2|a3|a4|...|a32|acb@sma.com|84569|21.5|1 
a|a1|a2|a3|a4|...|a32|acd@sm.com$1553:2015-02-14|84569|21.5|1 
a|a1|a2|a3|a4|...|a32|axwer@xi.com30:2015-03-01|84569|21.5|1 
a|a1|a2|a3|a4|...|a32|acbw@ma.com$121:2015-01-31|84569|21.5|1 
a|a1|a2|a3|a4|...|a32|art@ma.com$293:2015-02-28|84569|21.5|1 
b|b1|b2|b3|b4|...|b32|asmi@g.in$542:2013:05:24|0 
b|b1|b2|b3|b4|...|b32|kasmi@g.in$542:2013:05:24|0 
b|b1|b2|b3|b4|...|b32|asmi@g.in14:2013:05:24|0 
b|b1|b2|b3|b4|...|b32|asmi@g.in$542:2013:05:24|0 
b|b1|b2|b3|b4|...|b32|asmi@g.in232:2014:05:24|0 
c|c1|c2|c3|c4|...|c32|Asce@ita.in|21457879|12.4 
c|c1|c2|c3|c4|...|c32|$200:2011:12:06|21457879|12.4 
c|c1|c2|c3|c4|...|c32|kst@gre.in$214:2001:01:31|21457879|12.4

How many records are in file2? If it's not too huge the whole file could be stored as an associative array in awk, with columns1-32 as the keys and subsequent columns as the data. — PM 2Ring, Apr 24 '15 at 10:56
It will help your problem description to know that the name of the database operation that you are trying to perform is a join. It will help potential answerers to know that the SUS defines a join command. But if you find yourself doing database tasks the hard way, like this, then really you would do very well to use a database for database tasks. The join command has its limitations. select * from table1 natural join table2 is a simple start here, although a left outer join would be better I think. — JdeBP, Apr 24 '15 at 12:03
Good call, @JdeBP. It is a job for a database. OTOH, if file2 is relatively small compared to RAM, then a script in awk, Python or Perl may be adequate. IMHO. — PM 2Ring, Apr 24 '15 at 12:25
@JdeBP I need to do it thorugh UNIX, Using DB is not an option for me and believe me, I would have loved to do it in DB. Hence I came here for help as i am relatively unkown to Linux/Unix scripting — mdx, Apr 27 '15 at 06:32
Unix and database are not mutually exclusive. Indeed, there are sections of this very site devoted to [tag:mysql], [tag:postgresql], [tag:mariadb], and (given the likes of http://unix.stackexchange.com/questions/187474/ , I hope) [tag:mongodb]. Those are not all of the database options that one has, moreover. — JdeBP, Apr 27 '15 at 07:23
Based on the examples I would conclude that the expected output for lines in both input files are equal to the contents of the second file. And thus that the expected output is the contents of file 2 with all the missing lines of file 1 added to it. Is that assumption correct? — Bram, Oct 12 '15 at 10:26

glenn jackman · Answer 1 · 2015-04-27T10:15:09.080

1

In this answer, you need to specify how many fields make up the "key". Apparently in your real data it's 32, but in your sample data, the first 7 fields is the key:

awk -F'|'  -v nKeys=7 '
    NR==FNR {
        suff = ""
        for (i=nKeys+2; i<=NF; i++) suff = suff FS $i
        NF = nKeys
        suffixes[$0]=suff
        next
    } 
    {
        printf "%s", $0
        NF = nKeys
        print line suffixes[$0]
    }
' file2 file1

We can calculate the number of key fields:

awk -v nKeys=$(( $(head -1 file1 | tr '|' '\n' | wc -l) - 1 )) ...

But we can probably safely hard-code the number.

edited Apr 27 '15 at 10:15

answered Apr 24 '15 at 13:37

glenn jackman

85,964

@ glenn Thanks for the help. When i tried running tthe above script it fails by saying: Error details- cmd line:10 (FILENAME=file1 FNR=1) fatal:not enough arguments to satisfy format string `%s, $0' ^ ran out for this one – mdx Apr 27 '15 at 06:26
Oops, missing quote, should be printf "%s", $0 – glenn jackman Apr 27 '15 at 10:14
@mdx - if you click on the words "accept the answer" in my comment you should see a self-explanatory pic – don_crissti May 08 '15 at 06:42

Merging two Unix files

1 Answers1