Flexible Pattern Matching

Question

I have a file that looks like this:

FILE1:

In the left part (before / ) I have the time in seconds, and in the right part I have a value for a parameter at that respective second.

Now I have another file that looks like this:

FILE2:

Now, I want to remove from the second file the common values from FILE1. For example I should remove from FILE2:

0 28
7200 11

and keep the rest of the lines as they are.

I was thinking to use a for loop in a bash script for each line in FILE1 and then search for that in FILE2, but I can't manage to recognize the pattern. If I try to use substr from awk it won't work because the time does not have the same digits (0 has 1 digit and 7200 has 4).

To read FILE1 I am doing something like this:

IFS=$'\n' read -d '' -r -a X < ./FILE1.csv

And to write the for loop I am doing something like this:

for x in "${X[@]}"
do
    gawk -i inplace -v var=${x} '{...}' FILE2.csv
done

I am also thinking to transform FILE1 in something like this:

to have basically 2 columns, but using the for as I used it above and var, won't work if I have more than 1 column. I think this second approach is better, but I have no idea how to make it process each column individually.

EDIT:

How would I do that if FILE1 is:

and FILE2 is:

0 14 2 19
0 15 157 67
0 20 28 57
0 28 25 67
7200 11 88 14
7200 14 34 247
7200 15 364 14

please update your question to add these https://unix.stackexchange.com/questions/698451/flexible-pattern-matching/698461?noredirect=1#comment1319983_698453 into your question so people don't misinterpret your question for answering — αғsнιη, Apr 11 '22 at 23:50

αғsнιη · Accepted Answer · 2022-04-09T04:49:06.750

4

Using awk:

awk 'NR==FNR { sec[$1, $2]; next } !($1, $2) in sec' FS='/' file1 FS=' ' file2
0 14
0 15
0 20
7200 14
7200 15

The FS (Field Seperator) before each input file defines that file's field seperator.

edited Apr 09 '22 at 04:49

answered Apr 08 '22 at 10:35

αғsнιη

41,407

Why are you using SUBSEP instead of just , here? You're defining the array and using ,, so it seems that SUBSEP only makes this more confusing and longer to type. What am I missing? – terdon Apr 08 '22 at 11:52
@terdon comma character outside of an array definition in awk is just a comma character but in array it substitute with the value of the SUBSEP (default in most awk implementations is \034). within array I could use sec[$1 SUBSEP $2] too but I preferred to use , as its short term and since outside of an array cannot use ($1,$2 in sec) so $1 SUBSEP $2 should be used. – αғsнιη Apr 08 '22 at 12:07
BTW, I could use quoted comma or any other characters instead both in array sec[$1","$2] and comparison $1","$2 parts, but over this I preferred to use that way using SUBSEP since that way it's more like hard coding – αғsнιη Apr 08 '22 at 12:17
That's just it, you could also have used nothing: sec[$1$2] and then !($1$2 in sec). Anyway, your call, it just seems strange to me to be introducing an obscure, internal variable which seems to only make the code more confusing. Maybe it's just me, but I had to look up SUBSEP to understand what this was doing. – terdon Apr 08 '22 at 12:19
3

@terdon that way it will gives you false positive output, for example file1 have 1/22 comparison with file2 have 12 2, those are not the same but then it will be deleted wrongly from file2. – αғsнιη Apr 08 '22 at 12:21
1

Ah-ha! Yes indeed, there's what I was missing! – terdon Apr 08 '22 at 12:38
1

You can use ($1,$2) in sec instead of $1 SUBSEP $2 in sec but consider awk '{key=$1 RS $2} NR==FNR { sec[key]; next } !(key in sec)' instead so you don't have to define the key construction twice. Use SUBSEP instead of RS as the separator if you prefer. – Ed Morton Apr 08 '22 at 16:37
Regarding I could use quoted comma or any other characters just be careful not to choose a character that could appear in the input. I'd look at the builtin variables first. RS is a good choice if it's a string and you're processing 1 record at a time as you're guaranteed that won't be present in the input. SUBSEP is obviously designed for the job too. FS is good as it can't exist in a field, OFS less so as it CAN appear in a field but is useful if printing the array indices later – Ed Morton Apr 08 '22 at 16:55
@EdMorton I was tried that and I see in my history logs that I didn't get it to working becasue I didn't use FS so no output I was getting and so I ended up using SUPSEP (but meanwhile I have added FS at that time for it). But why cannot we use ($1,$2) elsewhere for example in key=($1,$2)? ----- thank you in advance as always for seeing my answers and correcting/teaching my faults! and yes about "I could use quoted comma or any other characters" I know it but my 5min grace period to edit my comment was passed : ) – αғsнιη Apr 09 '22 at 04:54
1

@αғsнιη You're welcome. I couldn't give you chapter and verse, sorry, but in my mind at least you can only use , as an array subscript separator when it's clear you're using it in the context of an array and key=(1,2) doesn't have an array anywhere so it's ambiguous if you're trying to use the comma as part of a subscript, range, interval, arg list or something else while arr[1,2] is clearly an array context since [] is an array operator and so is (1,2) in arr since in is an array operator. – Ed Morton Apr 09 '22 at 12:23
I mis-read the code. But if i'm understanding it correctly, using NR==FNR makes it hard to do a 3+ file join, since that condition is only true on file 1 (unless one force resets NR upon file2) – RARE Kpop Manifesto Apr 12 '22 at 07:10
@RAREKpopManifesto there is no 3 files in subject to the question, it only come with 2 files. in case of 3 files or more I would use control variables to work in any awk or ARGIND for gnu awk or some other implementations and it also depend what is the requirement. here there are only two input files. additionally NR==FNR is not always a good to go with it for example if the first file was empty then it will attempt to read second file into the array wrongly and so variable control is much better. I'm using all of these in different answers anyway https://unix.stackexchange.com/a/698678/72456 – αғsнιη Apr 12 '22 at 08:36

Kusalananda · Answer 2 · 2022-04-08T15:00:33.290

Answering the added question at the end:

$ join -v 2 <(sed 's/ /:/' file1) <(sed 's/ /:/' file2) | sed 's/:/ /'
0 14 2 19
0 15 157 67
0 20 28 57
7200 14 34 247
7200 15 364 14

As with the other join variation further down in this answer (which provides the answer to the original question), this makes sure that the join key is a single string with no whitespace, and then picks out the lines from the second file for which the join key does not match any entries in the first file.

This makes the same assumption about the files having to be sorted in the same way. As join only ever keeps two lines in memory at once, we still have the same benefit over grep and any other solution that needs to keep all the entries from one file in memory.

Using the original file1 and file2 from your question, transforming the first file with tr on the fly to the same format as the second file, and using the reformatted data as a set of lines to remove from the second file.

$ grep -v -x -F -f <(tr '/' ' ' <file1) file2
0 14
0 15
0 20
7200 14
7200 15

The grep utility is used here to filter out (remove, exclude) lines from file2 that are the same as the transformed lines from file1.

The -x option forces a full line match (not substring match as usual), and -F makes grep use the patterns as fixed strings rather than regular expressions. The -f option tells the utility to read the patterns from the named file (the process substitution), and -v inverts the usual sense of the matching so that lines that does not match are outputted.

Also relevant to some of the text in your question:

Why is using a shell loop to process text considered bad practice?

A more efficient way of doing this is to use join. This may be a good idea if your file1 is large. On large inputs, this is expected to be considerably faster than using grep.

The below assumes that your two files are both sorted the same way, and transforms the second file into the same format as the first file (with spaces replaced by slashes) to generate lines without whitespace. We do the transformation in this way as join use whitespace as delimiters by default, and we need to consider the whole line, not just the first space-delimited field.

$ join -v 2 file1 <(tr ' ' '/' <file2) | tr '/' ' ' 
0 14
0 15
0 20
7200 14
7200 15

This performs a relational JOIN operation between the two datasets and returns the lines that are not matched in the second input to join (the transformed second file). Since we want to have space-delimited data as the final result, we replace the slashes with spaces at the end.

This will never hold more than two lines of data in memory at any one time, whereas the grep variation would need to keep the entirety of the first file in memory, and also would need to test each line of that file against every line of the second file.

Thank you! I also have another question. If in FILE2 I have more than 2 columns ( let's say another 2 parameters) and these columns are not part of FILE1 (FILE1 just has 2 columns), but I still want to eliminate the values in FILE1 from FILE2, based just on the first 2 columns, how would I do that? Are grep or join still a valid option? — IceCode, Apr 08 '22 at 10:25
@user See updated answer. I'm sorry for the delay; my day job takes precedence over things here. — Kusalananda, Apr 08 '22 at 15:01

score -1 · Answer 3 · answered Apr 09 '22 at 07:25

-1

I will solve this problem by using a shell loop.

cat FILE2 | tr " " / | \
while read i;do
  cat -n FILE1| grep -w "$i" | awk '{print $1}' | \
  while read j;do
    sed -i "${j}d" FILE1
  done
done

answered Apr 09 '22 at 07:25

赵宝磊

134

see Why is using a shell loop to process text considered bad practice? – αғsнιη Apr 09 '22 at 08:10

RARE Kpop Manifesto · Answer 4 · 2022-04-11T09:04:06.423

Here's a solution that doesn't require fudging with SUBSEP, looping through the fields, having the files being pre-sorted, or have a pre-set number of columns/fields :

 mawk -v \_=testfile_001.txt -F/ '
 BEGIN { 
    while(getline<_) {
          __[$!(NF=NF)] 
    }
    _*=close(_)*(FS="^$") } _^($_ in __)' testfile_002.txt
0 14
0 15
0 20
7200 14
7200 15

just realized setting FS="^$" for 2nd file is much faster since we're doing line-wide matching, so splitting fields is a waste of time.

Tested and proven working on gawk 5.1.1 (including flags -c/-P), mawk 1.3.4, mawk 1.9.9.6, and macos nawk

-- The 4Chan Teller

Flexible Pattern Matching

4 Answers4