2

I have large list of data like so

FILE1 (4600 rows)

Genome  Gene    Boolean
E15-12  VFG000923   1
E15-13  VFG000924   1
E15-14  VFG000926   1
E15-15  VFG000928   1
E15-16  VFG000930   1
E15-17  VFG000932   1
E15-18  VFG000933   0
E15-19  VFG001448   0
E15-24  VFG013465   1

I want to sub the info in col2 to look like:

FILE2 (180rows)

VFG000923|fepA
VFG000924|fepB
VFG000926|fepD
VFG000928|fepG
VFG000930|entF
VFG000932|entE
VFG000933|entB
VFG001448|kpsD
VFG001450|kpsM
VFG044165|entS

Out

Genome   Gene         Boolean
E15-12  VFG000923|fepA  1
E15-13  VFG000924|fepB  1
E15-14  VFG000926|fepD  1
E15-15  VFG000928|fepG  1
E15-16  VFG000930|entF  1
E15-17  VFG000932|entE  1
E15-18  VFG000933|entB  0
E15-19  VFG001448|kpsD  0
E15-20  VFG001450|kpsM  1

Using code by @val0x00ff (see comments)

Genome  Gene    Boolean

E15-14 VFG000923|fepA 1

E15-14 VFG000924|fepB 0

E15-14 VFG000926|fepD 1

E15-14 VFG000928|fepG 0

Is there a way to do this with sed or awk?

AudileF
  • 185
  • 3
  • 11
  • 1
    Is file2 already created or you want to create it? Assuming later, if col1 of file2 coming from col2 of file1, where is col2 of file2 coming from? Why number of rows less in file2, – Utsav May 03 '17 at 11:11
  • file2 already exist. file1 contains several genomes with different matches for the contents of file2 – AudileF May 03 '17 at 11:29
  • Then what is your expected output based on file1 and file2 content you gave? – Utsav May 03 '17 at 11:31
  • Yes, please [edit] your question and show us your desired output. You just say "I want to sub the info in col2 to look like:" and while I guess "sub" means "substitute", I have no way of guessing what the info is supposed to look like. Do you mean the 2nd column of file1 should be the line from file2 matching the gene name? – terdon May 03 '17 at 11:46
  • 1
    Also, when giving example files, please make sure the examples actually make sense. None of the gene names you show in file1 are present in file2! – terdon May 03 '17 at 11:55
  • 1
    it's a hacky way.. but too bothered to break my head with awk: paste FILE1 <(sed '1i\\' FILE2) |awk 'NR>1{$2=$NF;$NF=""}1' – Valentin Bajrami May 03 '17 at 14:07
  • @val0x00ff This works but in inserts a newline between the rows. Is there a way to stop that? – AudileF May 03 '17 at 14:35
  • @AudileF can you show what it prints? It doesn't print newlines on my end. – Valentin Bajrami May 03 '17 at 14:48

3 Answers3

2

With sed:

sed '/|/{H;d;};G;s/\([A-Z0-9]*\)\(.*\n\)\1\(|[^[:cntrl:]]*\)/\1\3\2\1\3/;P;d' FILE2 FILE1

should do the trick. It's an adaption of this answer. Detailled explanation is there.

Philippos
  • 13,453
1

This should work:

$ awk 'NR==FNR{k=sub(/\|.*/,$1); a[k]=$1; next} ($2 in a){$2=a[$2]}1' file2 file
Genome Gene Boolean
E15_14 VFG000923|fepA 1
E15_14 VFG000924|fepB 1
E15_14 VFG000926|fepD 0
E15_14 VFG000928|fepG 1
E15_14 VFG000930|entF 0
E15_14 VFG000932|entE 0
E15_14 VFG000933|entB 1
E15_14 VFG001448|kpsD 1
E15_14 VFG001450|kpsM 1
E15_14 VFG044165|entS 0

Or, a little easier to read:

awk 'NR==FNR{
        k=sub(/\|.*/,$1); 
        a[k]=$1; 
        next
    } 
    ($2 in a){
        $2=a[$2]
    }1' file2 file

Explanation

  • NR==FNR{ } : NR is the current input line number and FNR is the line number of the current file. When reading more than one file, the two will be equal only while reading the first file.
  • k=sub(/\|.*/,$1); : remove the part after the | from the line (this only happens for the 1st file because of the NR==FNR as explained above).
  • a[k]=$1; : save the first field of the first file as a value in the array a whose key is the gene name (the 1st field with everything after the | removed).
  • next : skip to the next line. This ensures we don't execute the next block while reading the first file.
  • ($2 in a) : if the second field exists as a key in the array a (this will only be run for the second file).
  • $2=a[$2] : set the second field to whatever was stored in a for $2.
  • 1 : this is awk shorthand for "print the current line". it works because the default action when something evaluates to true in awk is to print the current line. Since 1 is always true, that will print.
terdon
  • 242,166
  • Thanks for the help, but I'm afraid it didn't work for me. – AudileF May 03 '17 at 12:50
  • 2
    @AudileF please don't just say "it didn't work", that is completely meaningless. You need to tell me how it failed. Did it print nothing? The wrong thing? Did it give an error? Please edit your question and show how it failed. – terdon May 03 '17 at 12:53
  • No error messages. It just essentially copied what was in FILE1. – AudileF May 03 '17 at 13:34
  • @AudileF then your files aren't exactly as you've shown us. I suggest you come into /dev/chat and ping me (@terdon) and we can debug it if you like. – terdon May 03 '17 at 14:13
  • @terdon Perhaps CRLF line endings? This would also be consistent with the comments to the question. – Satō Katsura May 03 '17 at 14:51
  • @SatoKatsura hmm, I don't think so. I just tried by making all \n into \r\n in the example files and it worked as expected. Probably some other strangeness. – terdon May 03 '17 at 14:59
  • @AudileF Are you sure you gave FILE2 as the 1st argument and FILE1 as the 2nd argument? That's the key here. –  May 04 '17 at 07:07
  • @terdon I would really want OPs on S.E. to also provide a real-sized data input on pastebin site or some other which essentially serves the role of a scratchpad /tmp web site –  May 04 '17 at 07:09
  • @RakeshSharma Yes – AudileF May 04 '17 at 08:45
  • @RakeshSharma it is usually better to have the data right here instead of in an external site. – terdon May 04 '17 at 08:51
  • @AudileF there is something strange about your input file(s) please either post an exact extract of the file in your question or on pastebin.com as suuggested. Or just ping me in /dev/chat and we can try to sort it out. – terdon May 04 '17 at 08:52
1
perl -lne '
   @ARGV and %h=(%h, /(.*)\|/ => $_),next;
   !@ARGV and !$a++ and print,next;
   print s//$h{$1}/r if exists $h{(/\h\K(\S+)(?=\h)/)[0]};
' FILE2 FILE1

Explanation

  • When reading in the smaller file (FILE2) we populate the hash %h which is keyed on data before | and value is the whole line.
  • When reading in the larger file (FILE1) we first print it's first line using: !@ARGV and !$a++ meaning @ARGV has been emptied out and we are seeing the variable $a for the first time.
  • On the remaining lines we see whether the 2nd field which is identified via the regex /\h\S+\h/ to see whether this key exists in the hash %h. When this is found to be true, we replace that field with the value corresponding to that key.