How to find the last occurrence of a string in column 1 and replace the corresponding value in column 3?

Question

I have three columns in a file as such:

apple1        10109283      20012983
apple1        10983102      10293809
apple1        10293893      2349823049
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490

I would like to find the last occurrence of the string in column 1 (in this case line 3 or 6) and replace the corresponding number in column 3 with a different number. Example (replace line 3, column 3 with 444444444):

apple1        10109283      20012983
apple1        10983102      10293809
apple1        10293893      444444444
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490

So far I tried using sed but it didn't work:

sed '$s/apple1*$/444444444/'

Kusalananda · Answer 1 · 2019-07-31T22:46:17.450

tac file |
awk -v string='apple1' -v replace='444444444' '
    !flag && $1 == string { $3 = replace; flag = 1 }
                          { print }' |
tac

This pipeline first reverses the order of the lines in the data using tac from GNU coreutils. The last line whose 1st column is a particular string is easier to find that way, because it becomes the first such line in the reversed data.

The awk command simply compares the first column to the given string, and if we haven't yet made a replacement (!flag is non-zero), we modify the third column as soon as we find the string in the 1st column. When doing so, we also set flag to one so that no further replacements are made.

The rest of the awk program simply prints the current line (including the modified one).

At the end of the pipeline, we reverse the order of the lines again with tac.

The output of this, given the data in the question, is

apple1        10109283      20012983
apple1        10983102      10293809
apple1 10293893 444444444
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490

The columns on the modified line are a bit different from those of the other lines due to the modification of the 3rd column. To make it look nicer, you may pass the result through an additional column -t stage at the end of the pipeline. If you do, the output would look like

apple1   10109283   20012983
apple1   10983102   10293809
apple1   10293893   444444444
apple10  109283019  109238901
apple10  192879234  234082034
apple10  234908443  3450983490

with multiple spaces between the columns.
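Where tac is not available, the same idea can probably be expressed by reading the file twice with awk. The following is only a sketch under that assumption, not part of the original answer; the temporary file and the variable names last and result are made up for the illustration.

```shell
# Sample data from the question, written to a temporary file
# (the file name is arbitrary and only used for this illustration).
file=$(mktemp)
cat >"$file" <<'EOF'
apple1        10109283      20012983
apple1        10983102      10293809
apple1        10293893      2349823049
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490
EOF

# Pass 1 (NR == FNR): remember the line number of the last line whose
# first field equals the string.
# Pass 2: replace the third field on that line only and print every line.
result=$(awk -v string='apple1' -v replace='444444444' '
    NR == FNR   { if ($1 == string) last = FNR; next }
    FNR == last { $3 = replace }
                { print }' "$file" "$file")

printf '%s\n' "$result"
rm -f "$file"
```

As with the tac pipeline, the modified line is rebuilt with single spaces between its fields, so column -t may still be useful afterwards.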

With sed, it's not as easy as just replacing the 3rd column on the first line where the string occurs in the 1st column (assuming we reverse the lines of the data as in the above pipeline). We must also not replace the 3rd column on any subsequent lines, even if the 1st column matches our string.

This is a sed editing script that does it correctly (there may be any number of variations of this that may work):

/^apple1\>/ ! {
        p
        d
}

s/[[:digit:]]*$/444444444/

:loop
n
$ ! b loop

The first part takes care of printing lines at the start of the input that do not match apple1 in the first column. The \> in the expression matches the end of the word apple1 so that we don't accidentally match apple10 or apple12 or whatever other similar string may occur. The p (print) and d (delete + continue with next line from the top of the script) within { ... } are executed for each line at the start of the input that does not match the expression.
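The effect of \> can be checked in isolation with a small test. This is a minimal sketch that assumes GNU sed (the \> word boundary is an extension that other sed implementations may not support); the three input lines are invented for the illustration.

```shell
# Only the line whose first word is exactly apple1 is selected;
# apple10 and apple12 are skipped because \> requires the word to end there.
matched=$(printf 'apple1 a\napple10 b\napple12 c\n' | sed -n '/^apple1\>/p')
printf '%s\n' "$matched"
```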

The s command (substitute) is executed for the first line of input that does match apple1 at the start of the line. It simply substitutes the string of digits at the end of the line with our 4s.

Then comes a section labelled loop that takes care of passing through the rest of the data unmodified by printing the current line and reading the next line with n (n does both printing and reading). The "current line" will have the modification done by the s command on the first trip through this loop.

The very last line branches back to the loop label if we're not yet at the last line of input.

Example run:

$ tac file | sed -f script.sed | tac
apple1        10109283      20012983
apple1        10983102      10293809
apple1        10293893      444444444
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490

Philippos · Accepted Answer · 2019-08-01T06:39:08.493

Pure sed solution without piping and tac

For a case like this, the line-by-line approach of sed does not help. It is better to process the whole buffer at once, as the -z option of GNU sed does (you seem to be using Linux and GNU sed; for a portable replacement see this Q&A).

Now you can take advantage of the greedy nature of .*: The pattern .*apple1 will match everything up to and including the last occurrence of apple1, because all other occurrences are eaten by .*.
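The greedy behaviour can be seen on a single line first. This is a small sketch with an invented input string and the placeholder replacement LAST, neither of which comes from the original answer.

```shell
# .* grabs as much as it can, so only the final apple1 is left for the
# rest of the pattern and only that one is replaced.
greedy=$(echo 'apple1 apple1 apple1' | sed 's/\(.*\)apple1/\1LAST/')
printf '%s\n' "$greedy"
# prints: apple1 apple1 LAST
```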

Then just add the next fields (\s+ for the column separator, [0-9]+ for the second column and another \s+, all GNU extended regular expressions) and surround it with () so you can reuse it in the replacement as \1. Then add the third column outside the () to get it replaced and it comes out as

sed -zE 's/(.*\napple1\s+[0-9]+\s+)[0-9]+/\1444444444/'

That's it.

Note for non-GNU sed users: A portable solution would be less handy:

sed -E 'H;1h;$!d;x;s/(.*\napple1[[:space:]]+[0-9]+[[:space:]]+)[0-9]+/\1444444444/'

edited Aug 01 '19 at 06:39

answered Aug 01 '19 at 06:00

Thanks for this straightforward answer. As a followup, how do I automate this? I have a second file (reference file) with two columns, that has the string in column 1 (apple1, apple10, etc), and the number that needs to replace the existing number in the first file (4444444, etc.). So far I have the following but doesn't work: – sf1 Aug 01 '19 at 15:05
while IFS= read -r k v ; do sed -zE 's/(.*\n"$k"\s+[0-9]+\s+)[0-9]+/\1"$v"/' file1.txt done < file2.txt – sf1 Aug 01 '19 at 15:05
The double quotes inside the single quotes are treated literally and the variables will not get expanded. Put double quotes around the whole script instead of single quotes so you can use variables inside. – Philippos Aug 01 '19 at 20:35
Thank you! I changed to the double quotes and it gives me a series of errors that looks like: sed: -e expression #1, char 11: unterminated `s' command – sf1 Aug 01 '19 at 22:05
Did you remove the inner double quotes? So it would be sed -zE "s/(.*\n$k\s+[0-9]+\s+)[0-9]+/\1$v/"? I can't test it by myself right now. – Philippos Aug 02 '19 at 07:43
Oh sorry, that was the issue. After removing the inner double quotes it runs without errors, but the output still doesn't have any of the numbers changed. – sf1 Aug 02 '19 at 15:06
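
The follow-up is left unresolved in the comments. Two things in the loop likely contribute: with IFS= set to empty, read does not split the line, so the whole line ends up in k and v stays empty; and sed without -i only prints to standard output, starting again from the unmodified file1.txt on every iteration. One possible way to do all replacements in a single pass is sketched below with tac and awk instead of a sed loop. It assumes the reference file has two whitespace-separated columns; the file names follow the comment above, the temporary directory is made up, and the value 555555555 is a hypothetical replacement for apple10.

```shell
dir=$(mktemp -d)
cat >"$dir/file1.txt" <<'EOF'
apple1        10109283      20012983
apple1        10983102      10293809
apple1        10293893      2349823049
apple10       109283019     109238901
apple10       192879234     234082034
apple10       234908443     3450983490
EOF
# Hypothetical reference file: key in column 1, new value in column 2.
cat >"$dir/file2.txt" <<'EOF'
apple1   444444444
apple10  555555555
EOF

# tac reverses file1.txt so the last occurrence of each key comes first.
# While awk reads file2.txt (NR == FNR) it stores the new values by key.
# The first time a key shows up in the reversed data, its third field is
# replaced and the key is deleted, so the remaining lines stay untouched.
updated=$(tac "$dir/file1.txt" |
    awk 'NR == FNR { new[$1] = $2; next }
         $1 in new { $3 = new[$1]; delete new[$1] }
                   { print }' "$dir/file2.txt" - |
    tac)

printf '%s\n' "$updated"
rm -rf "$dir"
```

The result is printed rather than written back; redirecting it to a new file and then moving that file over file1.txt would be one way to make the change permanent.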

score 0 · Answer 3 · answered Aug 01 '19 at 08:10

I tried the command below and it worked fine:

for i in `awk '{print $1}' file1 | awk '{if(!seen[$1]++)print }'`; do
    j=`awk -v i="$i" '$1 == i {print $0}' file1 | awk '{print NR}' | sed -n '$p'`
    awk -v i="$i" '$1 == i {print $0}' file1 | awk -v i="$i" -v j="$j" 'NR==j{$3="444444444"}1'
done
