Grep repeating patterns with a loop

Question

I have two files:

file1:

ABA
FFR
HHI
HAB

file2:

ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC

Each line in file1 is a pattern that repeats at the beginning of the corresponding line in file2. I would like to get the part of each line in file2 that comes after the repeating pattern from file1.

desired output:

TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

I tried to use this loop:

while read -r line
do
grep -v "$line{1,}"   file2.txt
done < file1.txt

But I got this output:

ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC
ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC
ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC

what if the pattern that repeats in the beginning is found somewhere else? — pLumo, Jul 21 '22 at 09:47
Is it intentional that file1 contains exactly the same number of lines as file2 and in the same order? Do the patterns in file1 always consist of exactly 3 characters? — Bodo, Jul 21 '22 at 09:49
File1 and file2 have the same number of lines in the same order, and the pattern is always 3 characters. — Lia, Jul 21 '22 at 09:56
"Yes it is possible that the pattern is repeated later" doesn't tell us what to do if that happens. Remove the matching strings? Leave them alone? Something else? — Ed Morton, Jul 22 '22 at 13:31
I need everything after the repeating pattern including any of the patterns if they appear somewhere later in the given line. — Lia, Jul 22 '22 at 14:17

ilkkachu · Accepted Answer · 2022-07-21T10:14:36.800

With e.g. ABA in the variable, grep -v "$line{1,}" would give grep the pattern ABA{1,}, meaning it'd look for a single A, a single B and then at least one A. The last repetition doesn't matter, though, as there's nothing after that, so even a single ABA would match that.

Well, except that by default, grep uses basic regular expressions (BRE), where the counted repetition must be written with backslashes, as \{n,m\}. In extended regular expressions (ERE), {1,} would mean one or more repeats (and so would +); but in BRE, it's just four literal characters (and + is also a regular character).
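
A quick way to see the difference is to count matching lines for the same text under each syntax. This is only a small sketch; the sample string and the variable names are made up for illustration and are not taken from file2.

```shell
# Illustrative sample line (an assumption, not from the question).
sample='ABAABAXYZ'

# BRE (default): unescaped braces are ordinary characters, so this looks
# for the literal text "ABA{1,}" and finds nothing.
bre_literal=$(printf '%s\n' "$sample" | grep -c 'ABA{1,}' || true)

# BRE with backslashed braces: a real interval expression.
bre_interval=$(printf '%s\n' "$sample" | grep -c 'ABA\{1,\}' || true)

# ERE (-E): plain braces form the interval expression.
ere_interval=$(printf '%s\n' "$sample" | grep -E -c 'ABA{1,}' || true)

echo "$bre_literal $bre_interval $ere_interval"
```

The three numbers are the match counts for the literal-brace BRE, the backslashed BRE and the ERE form, in that order; only the first one should come out as 0.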

But grep prints full lines that match, or with -v, don't match; it doesn't remove parts of the line. (Except with grep -o where it only prints the matching part, but I don't think that would work with -v.) Also, with that loop, grep would look at all the lines for each pattern, which is why you get the contents of file2 repeated multiple times.

We'd need a loop that reads one line from each input on each iteration. It could be done in shell, but it would be slow. Something like AWK would be better, e.g.:

$ awk '{getline pat < "file1"; sub("^(" pat ")*", ""); print}' file2
TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

The AWK program implicitly loops over the lines of file2 (and other files given on the command line), and here, we explicitly read one line from file1 each iteration. Then "^(" pat ")*" constructs a pattern like ^(ABA)*, which is matched against the current line, and substituted with the empty string.
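
For comparison, the shell loop mentioned above could look roughly like the sketch below. It reads one line from each file per iteration through two extra file descriptors and strips the leading repeats with parameter expansion, so the pattern is treated as a literal string rather than a regex. The function name strip_leading_repeats and the throwaway demo files are assumptions for illustration only.

```shell
# Hypothetical helper: $1 = pattern file, $2 = file with the lines to trim.
strip_leading_repeats() {
    local pat text
    # fd 3 feeds the patterns, fd 4 feeds the lines to be trimmed.
    while IFS= read -r pat <&3 && IFS= read -r text <&4; do
        # Drop the prefix for as long as the line still starts with it.
        while [ -n "$pat" ] && [ "${text#"$pat"}" != "$text" ]; do
            text=${text#"$pat"}
        done
        printf '%s\n' "$text"
    done 3<"$1" 4<"$2"
}

# Demo with small made-up files.
tmp=$(mktemp -d)
printf '%s\n' ABA HAB > "$tmp/file1"
printf '%s\n' ABAABATRC HABHABGTHABA > "$tmp/file2"
strip_leading_repeats "$tmp/file1" "$tmp/file2"
```

On real data this is likely to be much slower than the AWK one-liner, since the shell does the string work line by line; it is shown only to make the "one line from each input per iteration" idea concrete.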

This would not remove any instances of the pattern from further in the line, and e.g. ABAABAFOOABABAR would turn into FOOABABAR. If you want to remove those too, change it to gsub("(" pat ")*", "");.
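
To make the difference concrete, here is a small sketch that runs both variants on that example string. Passing the pattern with -v and the variable names are my own choices for the demo.

```shell
pat='ABA'
input='ABAABAFOOABABAR'

# Anchored sub(): only the leading run of the pattern is removed.
leading_only=$(printf '%s\n' "$input" |
    awk -v pat="$pat" '{ sub("^(" pat ")*", ""); print }')

# Unanchored gsub(): every run of the pattern is removed, wherever it occurs.
everywhere=$(printf '%s\n' "$input" |
    awk -v pat="$pat" '{ gsub("(" pat ")*", ""); print }')

printf '%s\n' "$leading_only" "$everywhere"
```

The first line printed is FOOABABAR and the second is FOOBAR, since the ABA inside ABABAR is also removed by gsub.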

I seem to recall having used grep -vo at least once, and I'm pretty sure it did work as expected. — Kevin, Jul 22 '22 at 01:34
@Kevin, well, I'm not even sure what to expect from the combination, really. -v says to print lines without matches, and -o says to print the matching parts. Their intersection is empty. GNU grep seems to print nothing, but return an exit status based on -v; Busybox either printed nothing or segfaulted; and the BSD grep on my mac printed the non-matching lines. So, they basically did the same as -vq, or just the plain -v. — ilkkachu, Jul 22 '22 at 08:50
IIRC it prints "the part of the line that didn't match," at least on GNU. — Kevin, Jul 22 '22 at 17:47

Bodo · Answer 2 · 2022-07-21T10:07:34.943

A solution using awk which removes the repeated pattern given on each line of file1 from the corresponding line of file2:

awk 'NR==FNR { pattern[NR]="^(" $0 ")*"; next } { sub(pattern[FNR], ""); print }' file1 file2

Explanation:

NR==FNR condition that matches the first file only.
pattern[NR]="^(" $0 ")*"; construct a pattern from the string and add it to an array using the current line number as the index. ABA -> ^(ABA)* = any number of repeated string ABA at the beginning of the line.
next skip all further processing. This results in the following action being applied to the second (and following) file(s) only.
sub(pattern[FNR], "") substitute the pattern for the current line number with an empty string
print print the (modified) line

A possible solution using awk which will remove every pattern in file1 from every line in file2:

awk 'NR==FNR { pattern[count++]="^(" $0 ")*"; next } { for(i = 0; i < count; i++) sub(pattern[i], ""); print }' file1 file2

Explanation:

NR==FNR condition that matches the first file only.
pattern[count++]="^(" $0 ")*"; construct a pattern from the string and append it to an array. ABA -> ^(ABA)* = any number of repeated string ABA at the beginning of the line. count will be the number of lines after processing file1
next skip all further processing. This results in the following action being applied to the second (and following) file(s) only.
for(i = 0; i < count; i++) loop over all patterns
sub(pattern[i], "") substitute the pattern by an empty string
print print the (modified) line
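
One thing that may be worth checking with this variant: each pattern is tried once, in file1 order, and only at the start of the line, so the result can depend on the order of the patterns. The sketch below uses made-up file names and data to show this.

```shell
tmp=$(mktemp -d)
printf '%s\n' ABA HAB > "$tmp/patterns"            # tried in this order
printf '%s\n' ABAHABXYZ HABABAXYZ > "$tmp/data"

result=$(awk 'NR==FNR { pattern[++count] = "^(" $0 ")*"; next }
     { for (i = 1; i <= count; i++) sub(pattern[i], ""); print }' \
     "$tmp/patterns" "$tmp/data")
printf '%s\n' "$result"
```

The first line is reduced to XYZ (ABA is stripped, then HAB), but the second one only to ABAXYZ: ABA is tried before HAB has been removed, and it is not tried again afterwards.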

AWK fields, strings, and generated arrays start at 1, not 0, so you should try to get into the habit of having your user-defined arrays start at 1 too so you don't get tripped up by the difference some day: awk 'NR==FNR { pattern[++count]="^(" $0 ")*"; next } { for(i = 1; i <= count; i++) sub(pattern[i], ""); print }' file1 file2. Starting at 1 actually makes some other things easier too, e.g. the very common way of printing the contents of an array a[] with n elements all on 1 line for (i=1;i<=n;i++) printf "%s%s", a[i], (i<n?OFS:ORS) — Ed Morton, Jul 22 '22 at 13:44

score 1 · Answer 3 · answered Jul 21 '22 at 10:38

Following your approach of a while read-bash-loop, sed could do the trick as follows:

#!/bin/bash
i=0
while IFS= read -r pat ; do
    ((i++))
    # on line $i only: strip the leading repeats, then print that line
    sed -n "${i}s/^\($pat\)\{1,\}//;${i}p" file2
done < file1

I am a bit confused by your interpretation of "repeating pattern": I would assume it has to be present at least twice, in which case \{2,\} would seem more fitting to me.

FelixJN

Sorry for the confusion, previous steps determine the minimum repeat number in file2. I left it to 1 because this number is not decided yet. – Lia Jul 21 '22 at 11:43
@Lia in that case, the number can even be a variable: \{${n},\}, with n set beforehand, is perfectly valid syntax. – FelixJN Jul 21 '22 at 18:55
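
A minimal sketch of the variable minimum suggested in the comment above, assuming n is set by some earlier step. The demo files, the value of n and the out variable are made up for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' ABA FFR > "$tmp/file1"
printf '%s\n' ABAABAXY FFRZZ > "$tmp/file2"

n=2   # hypothetical minimum repeat count
i=0
out=$(
    while IFS= read -r pat; do
        i=$((i + 1))
        # Only line $i is edited and printed; \{n,\} needs at least n repeats.
        sed -n "${i}s/^\($pat\)\{${n},\}//;${i}p" "$tmp/file2"
    done < "$tmp/file1"
)
printf '%s\n' "$out"
```

With n=2 the first line becomes XY, while FFRZZ is printed unchanged because FFR occurs only once at its start.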

score 1 · Answer 4 · answered Jul 22 '22 at 13:33

$ paste file1 file2 | awk '{sub("^("$1")*","",$2); print $2}'
TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

The above assumes neither file contains spaces and file1 doesn't contain regexp metacharacters.
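
If file1 might contain regexp metacharacters, one possible workaround is to compare prefixes as plain strings instead of building a regex. This is only a sketch under the same "no tabs or spaces in the data" assumption; the sample patterns and the rest variable are invented for the demo.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'A.A' 'A+' > "$tmp/file1"            # patterns with regex metacharacters
printf '%s\n' 'A.AA.AABAXYZ' 'A+A+AAA' > "$tmp/file2"

literal_out=$(paste "$tmp/file1" "$tmp/file2" | awk -F'\t' '{
    n = length($1)
    rest = $2
    # Compare prefixes as plain strings, so ".", "+", "*" etc. are not special.
    while (n > 0 && substr(rest, 1, n) == $1)
        rest = substr(rest, n + 1)
    print rest
}')
printf '%s\n' "$literal_out"
```

This prints ABAXYZ and AAA. The regex version would treat A.A as "A, any character, A" and strip the following ABA as well.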

Kramer · Answer 5 · 2022-07-21T10:42:02.200

There you go mate:

challenge.sh

#!/bin/bash
readarray -t searchStrs < file1.txt
linesInFile=$((${#searchStrs[@]} - 1))
line=0
while [ ${line} -le ${linesInFile} ]
do
        srchStr=$(echo ${searchStrs[$line]})
        result=$(grep -E "^${srchStr}" file2.txt | sed "s@${srchStr}@@g")
        line=$((${line} + 1))
        echo ${result}
done

./challenge.sh
TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC

cat file2.txt
ABAABAABAABAABAABAABAABAABATRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABGTHFOOLLLHHHUUJCIICXXTKCIABAGGC

edited Jul 21 '22 at 10:42

answered Jul 21 '22 at 10:28

but the AWK solution above is just pure magic :) I love awk but my challenge was to not use it xD – Kramer Jul 21 '22 at 10:45
Please read why-is-using-a-shell-loop-to-process-text-considered-bad-practice. – Ed Morton Jul 22 '22 at 13:35
You should also copy/paste your script into http://shellcheck.net and fix the issues it tells you about. – Ed Morton Jul 22 '22 at 13:35

Bodo · Answer 6 · 2022-07-22T17:43:11.790

This solution uses sed and nl instead of awk. It assumes that the character # is never part of the patterns in file1 and that no pattern starts with a tab character. It will replace the patterns from file1 only in the corresponding line of file2. (cat -n file1 can be used as an alternative to nl file1.)

It uses a single nl process and two sed processes, regardless of the number of lines in file1 and file2.

sed -f <(nl file1|sed 's/ *\([0-9]*\)\t\(.*\)/\1s#^\\(\2\\)*##/') file2

Step-by-step execution instead of explanation:

$ nl file1
     1  ABA
     2  FFR
     3  HHI
     4  HAB

$ nl file1|sed 's/ *\([0-9]*\)\t\(.*\)/\1s#^\\(\2\\)*##/'
1s#^\(ABA\)*##
2s#^\(FFR\)*##
3s#^\(HHI\)*##
4s#^\(HAB\)*##

$ sed -f <(nl file1|sed 's/ *\([0-9]*\)\t\(.*\)/\1s#^\\(\2\\)*##/') file2
TRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
DEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
GTHFOOLLLHHHUUJCIICXXTKCIABAGGC
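
If process substitution or \t in the sed pattern is not available in your environment, the same kind of script could be generated with awk and written to a temporary file first. This is a sketch with made-up demo files and a hypothetical script name (script.sed), not part of the solution above.

```shell
tmp=$(mktemp -d)
printf '%s\n' ABA FFR > "$tmp/file1"
printf '%s\n' ABAABATRC FFRFFRFHJ > "$tmp/file2"

# Emit one sed command per pattern: <line number>s#^\(<pattern>\)*##
awk '{ printf "%ds#^\\(%s\\)*##\n", NR, $0 }' "$tmp/file1" > "$tmp/script.sed"

cat "$tmp/script.sed"
sed_out=$(sed -f "$tmp/script.sed" "$tmp/file2")
printf '%s\n' "$sed_out"
```

The generated script has the same shape as the one shown above (1s#^\(ABA\)*## and so on), and the same restriction applies: # must not occur in the patterns.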

To show that the script will only replace the corresponding pattern, I use an additional file3 which contains combinations of patterns at the beginning of the lines:

ABAABAABAABAABAABAABAABAABAHABHABHABTRCFUJIKHRTHVFHJJHVHJJKKHGCC
FFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFFRFHJKGHKKBVDTHJNJ
HHIHHIHHIHHIHHIFFRDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
HABHABHABHABHABHABHABHABABAABAGTHFOOLLLHHHUUJCIICXXTKCIABAGGC

$ sed -f <(nl file1|sed 's/ *\([0-9]*\)\t\(.*\)/\1s#^\\(\2\\)*##/') file3
HABHABHABTRCFUJIKHRTHVFHJJHVHJJKKHGCC
FHJKGHKKBVDTHJNJ
FFRDEDRJFKOLGCUOUUKJGLNJKKKKJKJKJGGHHBCFDII
ABAABAGTHFOOLLLHHHUUJCIICXXTKCIABAGGC
