
I need to run a search-and-replace on a large ASCII text file. A short excerpt from the input file:

gene_id "MSTRG.1";
gene_id "MSTRG.1";
gene_id "MSTRG.2";
gene_id "MSTRG.3";

Each MSTRG ID should be replaced with the corresponding ID from a template file:

MSTRG.1 AT1G01030
MSTRG.2 AT1G01010
MSTRG.3 AT1G01035

A simple while loop iterates over each line of the template and makes the replacements:

while read bef aft
do
  echo "Searching for $bef"
  echo "Replacing with $aft"
  sed "s/$bef/$aft/g" input > output
done < template

The entries from MSTRG.2 onward are replaced correctly, but MSTRG.1 remains unchanged. The output looks as follows:

gene_id "MSTRG.1";
gene_id "MSTRG.1";
gene_id "AT1G01010";
gene_id "AT1G01035"; 

Update

Here is what I did in the end.

while read bef aft
do
  sed -i "s/$bef/$aft/g" input
done < template
cryptic0
  • I'm slightly surprised that MSTRG.2 is being replaced in the final output file. You are clobbering the file in each iteration, so I'd expect that only the last replacement would be visible. Just do the redirection to output after the loop instead of in each iteration (just like you're already redirecting the loop's input from template). I'm not posting this as an answer since I'm not sure why you say (or how it's possible) that MSTRG.2 is being replaced. – Kusalananda Oct 15 '19 at 16:04
  • Not sure I understand. How do I redirect the output after the loop is finished? – cryptic0 Oct 15 '19 at 16:11
  • while ...; do ...; done <template >output. And redirect the two echo calls to standard error to avoid getting them in the output: echo ... >&2. – Kusalananda Oct 15 '19 at 16:26
  • You don't need the read loop at all; you can do it in one pass by adapting the scheme given in the linked answer. – Philippos Oct 15 '19 at 19:46
  • The answer you linked to is far more complicated than a simple while loop. – cryptic0 Oct 15 '19 at 19:47

3 Answers

3

Your issue is that you are clobbering the output file in each iteration of the loop, leaving only the most recent changes in output and none of the earlier ones.
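The clobbering can be reproduced on a small scale. The following is only a sketch: it uses a scratch directory from mktemp and a three-line input (both assumptions made for illustration), with the loop from the question otherwise unchanged apart from read -r:

```shell
# Scratch directory so that no real files are touched (illustrative setup).
tmp=$(mktemp -d)
printf '%s\n' 'gene_id "MSTRG.1";' 'gene_id "MSTRG.2";' 'gene_id "MSTRG.3";' > "$tmp/input"
printf '%s\n' 'MSTRG.1 AT1G01030' 'MSTRG.2 AT1G01010' 'MSTRG.3 AT1G01035' > "$tmp/template"

while read -r bef aft
do
  # Every pass reads the unmodified input again and truncates output with ">",
  # so the result of the previous pass is thrown away.
  sed "s/$bef/$aft/g" "$tmp/input" > "$tmp/output"
done < "$tmp/template"

cat "$tmp/output"
```

With this setup, only the replacement from the last line of template should be left in output; the MSTRG.1 and MSTRG.2 lines come through unchanged.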

Instead, you can easily transform your template file into a series of sed commands:

$ awk '{ printf("s/%s/%s/g\n", $1, $2) }' template
s/MSTRG.1/AT1G01030/g
s/MSTRG.2/AT1G01010/g
s/MSTRG.3/AT1G01035/g

... and then apply these to your file:

$ awk '{ printf("s/%s/%s/g\n", $1, $2) }' template | sed -f - input
gene_id "AT1G01030";
gene_id "AT1G01030";
gene_id "AT1G01010";
gene_id "AT1G01035";

Some implementations of sed don't recognise - as meaning standard input. To use this approach with such a sed, replace -f - with -f /dev/stdin.

Or, you could just do it all in awk:

$ awk 'FNR == NR { pat[$1] = $2; next } { for (p in pat) gsub(p, pat[p]); print }' template input
gene_id "AT1G01030";
gene_id "AT1G01030";
gene_id "AT1G01010";
gene_id "AT1G01035";

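Note that all of the above variations use what is in the first column of template as a regular expression, meaning that the . (dot) will match any character. The patterns are also unanchored, so MSTRG.1 would match the beginning of a longer ID such as MSTRG.10.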
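If the IDs need to be matched literally, one possible variation is to escape the dots while generating the sed script and to include the surrounding double quotes in the pattern. This is a sketch under assumptions: the IDs MSTRG.10, MSTRGx1 and AT1G01040 are made up for illustration, the only special character in the IDs is assumed to be the dot, and the generated script is written to a scratch file (script.sed, a hypothetical name) rather than piped:

```shell
# Scratch directory with illustrative data (MSTRG.10, MSTRGx1 and AT1G01040 are made up).
tmp=$(mktemp -d)
printf '%s\n' 'gene_id "MSTRG.1";' 'gene_id "MSTRG.10";' 'gene_id "MSTRGx1";' > "$tmp/input"
printf '%s\n' 'MSTRG.1 AT1G01030' 'MSTRG.10 AT1G01040' > "$tmp/template"

# Turn every "." in the search ID into "\." and anchor the match on the
# surrounding double quotes, producing e.g.  s/"MSTRG\.1"/"AT1G01030"/g
awk '{ gsub(/\./, "\\.", $1); printf("s/\"%s\"/\"%s\"/g\n", $1, $2) }' "$tmp/template" > "$tmp/script.sed"

# Apply all substitutions in a single pass over the input.
result=$(sed -f "$tmp/script.sed" "$tmp/input")
printf '%s\n' "$result"
```

Here MSTRG.1 and MSTRG.10 are each replaced by their own ID, and the MSTRGx1 line is left alone because the dot no longer matches an arbitrary character.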

Kusalananda
1

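Instead of overwriting the output file in each loop iteration, you could copy the input file to the output file and operate on that copy.

With sed's -i option, changes are written in place to the same file, so previous replacements don't get lost. Note that -i is a non-standard extension: GNU sed accepts it as written below, while BSD/macOS sed requires an explicit (possibly empty) backup suffix, as in -i ''.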

cp input output
while read bef aft
do
  echo "Searching for $bef"
  echo "Replacing with $aft"
  sed -i "s/$bef/$aft/g" output
done < template
Freddy
1
#!/usr/bin/perl -i

use strict;

# The %re hash holds the regexp searches and replacement strings.
my %re = ();

my $tfile = shift;
open(TEMPLATE, "<", $tfile) || die "couldn't open $tfile for read: $!\n";
while(<TEMPLATE>) {
   chomp;
   my ($search,$replace) = split;
   $re{qr/$search/} = $replace;
};
close(TEMPLATE);

while (<>) {
  foreach my $s (keys %re) {
    s/$s/$re{$s}/g;
  };
  print;
}

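This reads the template file and builds up an associative array (aka "hash") called %re that maps each search pattern to its replacement string.

Then it loops over every remaining filename on the command line (e.g. input) and performs all of those search and replace operations on every line of input. It wraps each pattern in qr// with the aim of pre-compiling the regular expressions. Note, however, that Perl hash keys are always plain strings, so the compiled pattern is stringified when it is used as a key of %re and has to be compiled again inside the loop; as written, the speedup is therefore small at best. Keeping the qr// objects in an array (or as hash values) would preserve the pre-compilation, which matters most when template has many lines.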

The -i on the #!/usr/bin/perl -i line causes perl to make in-place edits on the input files, rather than just print the changes to stdout. Change this to, e.g., -i.bak if you want it to keep a backup copy of the files before they were changed.

Save as, e.g., cryptic0.pl, make it executable with chmod +x cryptic0.pl and run it like this:

$ ./cryptic0.pl template input

The script will not produce any output on the terminal. Instead, it will edit the input file(s).

For example, your input file will be changed to:

$ cat input
gene_id "AT1G01030";
gene_id "AT1G01030";
gene_id "AT1G01010";
gene_id "AT1G01035";

BTW, this script will change all matches on all lines to their appropriate replacement string. If you are certain that there can only be one match on any given line, you can speed it up by changing this line:

s/$s/$re{$s}/g;

to this:

s/$s/$re{$s}/ && last;

This causes the script to skip out of the foreach loop to the print statement, and then move on to the next input line as soon as it has had one successful search & replace.


BTW, see "Why is using a shell loop to process text considered bad practice?" for why it's not a good idea to do text processing with sh loops. Use awk or perl or sed or anything else instead of sh or bash.

cas