Replacing multiple matched strings using columns in guide file

Question

I have 2 files (FileA & FileB),

FileA:

s12 >g01
s16 >g02
s48 >g03
s52 >g04
s80 >g05
s81 >g06
s87 >g07
s91 >g08
s92 >g09
s93 >g10
s94 >g11
s96 >g12
s97 >g13
s98 >g14
s99 >g15
s100 >g16

FileB:

s12:1148.1652412 [PCC6803]
ABCDEFGHIJKLMNOPQRST
s16:1235.1653193 [PCC6803]
UVWXYZABCDEFGHIJKLMN
s48:5877.1652308 [PCC6803]
OPQRSTUVWXYZABCDEFGH
.
.
.

I want to edit FileB so that every column 1 string from FileA that occurs in FileB is replaced by the matching column 2 string from FileA.

Desired output:

>g01 [PCC6803]
ABCDEFGHIJKLMNOPQRST
>g02 [PCC6803]
UVWXYZABCDEFGHIJKLMN
>g03 [PCC6803]
OPQRSTUVWXYZABCDEFGH

I need to apply the same edit to around 20 files that are in the same format as FileB.

Is there a command that can do this kind of editing, ideally a one-liner in a Linux terminal that handles all the files at once? Thanks in advance!

Update: I have tried the example from "Replace multiple strings with different set of mapped strings", but it doesn't work:

replacements=(
        s12:\>g01
        s16:\>g02
        s48:\>g03
        s52:\>g04
        s80:\>g05
        s81:\>g06
        s87:\>g07
        s91:\>g08
        s92:\>g09
        s93:\>g10
        s94:\>g11
        s96:\>g12
        s97:\>g13
        s98:\>g14
        s99:\>g15
        s100:\>g16
)

for row in "${replacement[@]}"; do
        original="$(echo $row | cut -d: -f1)";
        new="$(echo $row | cut -d: -f2)";
        sed -i -e "s/${original}/${new}/g" FileB;
done

Will the strings from fileA always be at the beginning of the line and followed by a : (e.g. s12:) in fileB? — terdon, Jun 10 '19 at 08:08
@terdon yup and the lines including s* in fileB are following the sequence like in fileA as well. — web, Jun 10 '19 at 08:16
By the way, seeing your input data, you might be interested in our sister site, [bioinformatics.se]. — terdon, Jun 10 '19 at 08:33

score 3 · Answer 1 · answered Jun 10 '19 at 08:25

$ awk 'FNR==NR { id[$1]=$2; next } { split($1,a,":"); if (a[1] in id) $1=id[a[1]]; print }' fileA fileB
>g01 [PCC6803]
ABCDEFGHIJKLMNOPQRST
>g02 [PCC6803]
UVWXYZABCDEFGHIJKLMN
>g03 [PCC6803]
OPQRSTUVWXYZABCDEFGH

The first block will only be triggered while reading from the first file (fileA). It reads the mappings for the s* strings to the >g* strings into the associative array id with the s* strings as keys.

The second block will only be triggered while reading from the second file (fileB). It will split the first field of each line on : into a temporary array a. If the first element of the split result is a key in the id array, the whole first field is replaced by the value for that key. The possibly modified line is then printed.

FNR is the line number (really the record number, but records are lines by default) of the current file, while NR is the overall line number. If FNR==NR we are therefore reading from the first file.
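
To see the two counters side by side, here is a minimal sketch that prints FILENAME, FNR and NR for every line read. The temporary directory and the shortened two-line files are made up for the demo, not taken from the question.

```shell
#!/bin/sh
# Demo data only: shortened, made-up copies of fileA and fileB.
tmp=$(mktemp -d)
printf 's12 >g01\ns16 >g02\n' > "$tmp/fileA"
printf 's12:1148.1652412 [PCC6803]\nABCDEFGHIJKLMNOPQRST\n' > "$tmp/fileB"

# FNR restarts at 1 for each file; NR keeps counting across files.
counters=$(cd "$tmp" && awk '{
    print FILENAME, "FNR=" FNR, "NR=" NR, (FNR == NR ? "first file" : "later file")
}' fileA fileB)
printf '%s\n' "$counters"
# prints:
# fileA FNR=1 NR=1 first file
# fileA FNR=2 NR=2 first file
# fileB FNR=1 NR=3 later file
# fileB FNR=2 NR=4 later file

rm -r "$tmp"
```

FNR and NR agree only while the first file is being read. One caveat: if the first file is empty, FNR==NR is also true while reading the second file, so the idiom assumes fileA has at least one line.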

Rakesh Sharma · Answer 2 · 2019-06-11T08:22:11.577

One method is to use sed to generate the s/// commands from the contents of fileA and then apply them to the contents of fileB.

$ sed -Ee 's/(.*) (>.*)/s|^\1:\\S+|\2|;t/' fileA | sed -Ef - fileB

Output:

>g01 [PCC6803]
ABCDEFGHIJKLMNOPQRST
>g02 [PCC6803]
UVWXYZABCDEFGHIJKLMN
>g03 [PCC6803]
OPQRSTUVWXYZABCDEFGH

Explanation:

Let's look at the problem from the other end, namely editing fileB. What would a sed command that edits the first line of fileB look like?

Something along these lines: s/^s12:\S+/>g01/, and then you're done with this line. So you append a bare t command (a branch with no label), which tells sed to skip the remaining commands once a substitution has succeeded on the line.
The same applies to the remaining lines.
The sed commands therefore have to be built from fileA, which lists every search-and-replace mapping to perform.
The task is to transform fileA into valid sed s/// commands such that applying them to fileB gives the desired result.
This is what the first sed command does: s/(.*) (>.*)/s|^\1:\\S+|\2|;t/
The first portion, s/(.*) (>.*)/, is the left-hand side of the substitute command: a regex that captures the two fields of a line of fileA, e.g. s12 >g01, so \1 holds s12 and \2 holds >g01. The unstated assumption is that each line contains exactly two fields separated by one space, with no leading spaces, and that the second field begins with a greater-than symbol >.
So the fileA line s12 >g01 is transformed to s|^s12:\S+|>g01|;t by the right-hand side of the sed command. The transformed lines are then applied to fileB by the second sed (-f - reads the script from standard input), and we get our results.
For an easier understanding, remove the pipe and the second sed and look at what the first sed command generates; it will start to become clear. HTH.
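
To make the generated script concrete, the following sketch runs only the first sed stage on a shortened, made-up copy of fileA (two lines in a temporary directory) and keeps the result in a variable. It assumes GNU sed, as the original command does (\S in the generated script is a GNU extension).

```shell
#!/bin/sh
# Demo data only: a two-line stand-in for fileA.
tmp=$(mktemp -d)
printf 's12 >g01\ns16 >g02\n' > "$tmp/fileA"

# First stage only: turn each "key >value" line into an s command plus t.
script=$(sed -Ee 's/(.*) (>.*)/s|^\1:\\S+|\2|;t/' "$tmp/fileA")
printf '%s\n' "$script"
# prints:
# s|^s12:\S+|>g01|;t
# s|^s16:\S+|>g02|;t

rm -r "$tmp"
```

Each generated line is an s command anchored at the start of the line, followed by t. With no label, t jumps to the end of the script once a substitution has succeeded on the current line, so the remaining generated commands are skipped for that line.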

Hi, do you mind to explain the second column in between //? (i.e: /s|^\1:\S+|\2|;t/) — web, Jun 11 '19 at 04:42

score 2 · Answer 3 · answered Jun 10 '19 at 08:29

Your sed command is almost right. You have defined an array called replacements, but in your for loop you use replacement. That's why it isn't working. Also, you want to replace everything up to the first space, not just the matched string, so s/$original/$new/ alone is not enough. This one should do what you want:

replacements=(
        s12:\>g01
        s16:\>g02
        s48:\>g03
        s52:\>g04
        s80:\>g05
        s81:\>g06
        s87:\>g07
        s91:\>g08
        s92:\>g09
        s93:\>g10
        s94:\>g11
        s96:\>g12
        s97:\>g13
        s98:\>g14
        s99:\>g15
        s100:\>g16
)

for row in "${replacements[@]}"; do
        original="$(echo $row | cut -d: -f1)";
        new="$(echo $row | cut -d: -f2)";
        sed -i -e "s/^${original}:[^ ]*/${new}/g" FileB;
done

Now this isn't a very efficient way of doing this since you need to process the entire fileB for each replacement. A faster way could be:

$ awk 'NR==FNR{a[$1]=$2; next}{split($1, b, /:/); if(b[1] in a){$1=a[b[1]]}}1;' FileA FileB
>g01 [PCC6803]
ABCDEFGHIJKLMNOPQRST
>g02 [PCC6803]
UVWXYZABCDEFGHIJKLMN
>g03 [PCC6803]
OPQRSTUVWXYZABCDEFGH

And to make the change for multiple file names:

awk 'NR==FNR{
        a[$1]=$2; 
        next
     }
     {
        split($1, b, /:/); 
        if(b[1] in a){
            $1=a[b[1]]
        }; 
        print > (FILENAME ".fixed")
    }' FileA FileB FileC FileD ... FileN

That will create FileB.fixed, FileC.fixed, FileD.fixed etc., up to FileN.fixed. If you're satisfied it worked, you can then rename these back to the original file names (assuming you have perl-rename, which is the default on Ubuntu and Debian):

rename 's/\.fixed$//' *.fixed

Or, if you don't have perl-rename:

for f in *fixed; do mv -- "$f" "${f%%.fixed}"; done
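
If you would rather overwrite each file directly instead of renaming afterwards, a per-file loop is one option. This is only a sketch: the sample1.txt and sample2.txt names, the sample*.txt glob and the temporary directory are hypothetical stand-ins for your 20 files. It also rereads FileA once per file, which should be fine for a small guide file.

```shell
#!/bin/sh
# Demo data only: hypothetical stand-ins for FileA and two FileB-style files.
tmp=$(mktemp -d)
printf 's12 >g01\ns16 >g02\n' > "$tmp/FileA"
printf 's12:1148.1652412 [PCC6803]\nABCDEFGHIJKLMNOPQRST\n' > "$tmp/sample1.txt"
printf 's16:1235.1653193 [PCC6803]\nUVWXYZABCDEFGHIJKLMN\n' > "$tmp/sample2.txt"

for f in "$tmp"/sample*.txt; do
    # Write to a temporary file first; replace the original only if awk succeeded.
    awk 'NR==FNR { a[$1]=$2; next }
         { split($1, b, ":"); if (b[1] in a) $1 = a[b[1]]; print }' \
        "$tmp/FileA" "$f" > "$f.tmp" &&
    mv -- "$f.tmp" "$f"
done

first=$(head -n 1 "$tmp/sample1.txt")
second=$(head -n 1 "$tmp/sample2.txt")
printf '%s\n%s\n' "$first" "$second"
# prints:
# >g01 [PCC6803]
# >g02 [PCC6803]

rm -r "$tmp"
```

Because this replaces the originals, try it on copies first.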

score 0 · seshoumara · Answer 4 · 2019-06-10T15:00:38.843

It can be done using only one GNU sed call. Instead of FileB, you can give as many files as you have in the format of FileB, but FileA must be given first. To be safe, the command makes a backup of each input file (with a .bk suffix). If you're happy with the modified files, you can delete the backups afterwards.

sed -ri.bk '1{x;s:^:cat /dev/fd/3:e;x};/:/{G;s/^([^:]+)\S+(\s+)([^\n]+).*\1\s+(>[^\n]+).*/\4\2\3/}' 3< FileA FileB

Thanks to @Stéphane Chazelas for giving me the idea to use a custom file descriptor, to get around the problem that the hold space is discarded on each new file when using -i.
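
The 1{...} block in the command above has to run once per input file, which works because -i implies -s. A small sketch of that behaviour, assuming GNU sed (the file names and contents are made up):

```shell
#!/bin/sh
# Demo data only: two tiny files in a temporary directory.
tmp=$(mktemp -d)
printf 'a1\na2\n' > "$tmp/one"
printf 'b1\nb2\n' > "$tmp/two"

# Default: all inputs form one continuous stream, so line 1 occurs only once.
joined=$(sed -n '1p' "$tmp/one" "$tmp/two")
# With -s (implied by -i): line numbers restart per file, so address 1 matches in each.
separate=$(sed -s -n '1p' "$tmp/one" "$tmp/two")

printf 'joined: %s\n' "$joined"
printf 'separate: %s\n' "$separate"
# prints:
# joined: a1
# separate: a1
# b1

rm -r "$tmp"
```

With -s, address 1 matches the first line of every file, so the guide file is read into the hold space again for each file being edited.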

edited Jun 10 '19 at 15:00

answered Jun 10 '19 at 12:17


-i implies -s which means the hold space doesn't survive the change of file. – Stéphane Chazelas Jun 10 '19 at 13:05
Alternatively you could slurp in the content of FileB upon processing the first line of each other file (something like h;z;s/^/cat FileA/e;x). Here it's more a job for awk or perl though. – Stéphane Chazelas Jun 10 '19 at 13:10
@StéphaneChazelas Thank you for the help. I knew -i implies -s, but I could swear nowhere in the online manual did it say that the hold space is discarded. Makes sense though, but it eliminates a good trick. If one really wanted to have a clean hold space on each file, 1{x;z;x} would have done it. Too bad! Now I can do like you said, but I hate to have a parameter hard-coded in the script, or I could write the hold space to a temporary file and read it back, but I might run into access permission problems as the files might not be in CWD. I'll delete my answer in 30 min. – seshoumara Jun 10 '19 at 14:18
If by parameter you mean the name of the file, you can always use cat < "$FILE" and pass the name of the file via the environment ($FILE), or use cat /dev/fd/3 and 3< FileA (though that'd only work on Linux if there's more than one file to process). – Stéphane Chazelas Jun 10 '19 at 14:32
@StéphaneChazelas I updated my answer based on your last comment, using a FD. I won't delete my submission anymore, since it still looks similar to how I intended it in the first place and now it works. Learned a lot from you today. – seshoumara Jun 10 '19 at 15:06
The "FileA must be given first" statement would need to be adapted. It's not that it has to be given first now, it's just that it is the target of the redirection. If more than one file is given, then it will only work on Linux and if FileA is a regular file. On other systems opening /dev/fd/3 works like a dup(3), so after FileA has been read fully, the second open(/dev/fd/3) will just get you another dup of fd 3 which is now at the end of the file. – Stéphane Chazelas Jun 10 '19 at 18:42
@StéphaneChazelas I still kept FileA first, so that the user won't worry about where to put 3<. From his perspective that piece is part of the command. Also, I don't understand the rest of your comment, there is only one file redirected, FileA, not more. But good to know otherwise. – seshoumara Jun 10 '19 at 19:39
