The text file contains top-selling songs. It is structured as follows:

Single,Artist,Record label,Released,Chart,Traditional sales peak,

some example lines:

Imagine,John Lennon,Apple,Oct-75,1,1714351
Uptown Funk,Mark Ronson featuring Bruno Mars,RCA,Dec-14,1,1647310
Wonderwall,Oasis,Creation,Oct-95,2,1502270

I tried to find songs that didn't reach number 1 (5th field), e.g., Wonderwall. I am unsure how to target only the fifth field. My thinking was to use cat top50.txt | grep -vE "^[^*,*,*,*,[1],]". However, this did not work, and I am not sure why.

I also want to find the songs with sales of 2 million, but I don't think I can do this until I figure out how to make grep target a certain field.

1 Answer

Grep is the wrong tool for this. You should use a tool that is designed to deal with fields, like awk. For example, to get all lines whose 5th field is greater than 1:

$ awk -F, '$5 > 1' file
Wonderwall,Oasis,Creation,Oct-95,2,1502270

Or whose 6th field is at least two million:

awk -F, '$6 >= 2000000' file
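One caveat, in case the real file still contains the header line shown in the question: awk treats a non-numeric field such as Chart as a string, so the header can slip through a comparison like $5 > 1. A minimal sketch that skips it with NR > 1 follows. The temporary file and the 1.7 million threshold are assumptions chosen only so that the three sample rows produce a match.

```shell
# Sample data copied from the question, header line included.
# The temporary file stands in for the real file (an assumption of this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Single,Artist,Record label,Released,Chart,Traditional sales peak,
Imagine,John Lennon,Apple,Oct-75,1,1714351
Uptown Funk,Mark Ronson featuring Bruno Mars,RCA,Dec-14,1,1647310
Wonderwall,Oasis,Creation,Oct-95,2,1502270
EOF

# NR > 1 skips the header; $5 > 1 keeps songs that peaked below number 1.
not_number_one=$(awk -F, 'NR > 1 && $5 > 1 { print $1 }' "$sample")
printf '%s\n' "$not_number_one"

# Same idea for the sales field; 1700000 is an arbitrary threshold for the sample.
big_sellers=$(awk -F, 'NR > 1 && $6 >= 1700000 { print $1 ": " $6 }' "$sample")
printf '%s\n' "$big_sellers"

rm -f "$sample"
```

Printing only $1 is optional; leaving out the { print ... } action prints the whole line, as in the commands above.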

It is not possible to do such things with grep since that doesn't let you compare values. The best you can do is some horrible hack like this to get those lines with 1 as the 5th field:

$ grep -E '([^,]+,){4}1,' file
Imagine,John Lennon,Apple,Oct-75,1,1714351
Uptown Funk,Mark Ronson featuring Bruno Mars,RCA,Dec-14,1,1647310

And reverse the match to get those which were not number 1:

$ grep -vE '([^,]+,){4}1,' file
Wonderwall,Oasis,Creation,Oct-95,2,1502270

That means "find exactly 4 repetitions of one or more non-comma characters ([^,]+) followed by a comma, and then a 1 with a comma after it".
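If a field can be empty, [^,]+ will not count it, and the pattern is not tied to the start of the line. A slightly more defensive variant, sketched here, anchors with ^ and uses [^,]* instead. The third row (an empty record label) is made up purely for illustration and does not come from the real file.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
Imagine,John Lennon,Apple,Oct-75,1,1714351
Wonderwall,Oasis,Creation,Oct-95,2,1502270
Made Up Song,Nobody,,Jan-00,1,1000000
EOF

# [^,]+ needs at least one character per field, so the row with the
# empty label is not recognised as a number 1 and survives -v.
loose=$(grep -vE '([^,]+,){4}1,' "$sample")
printf 'loose:\n%s\n' "$loose"

# ^ pins the count to the first field and [^,]* accepts empty fields.
strict=$(grep -vE '^([^,]*,){4}1,' "$sample")
printf 'strict:\n%s\n' "$strict"

rm -f "$sample"
```

On this sample the loose pattern keeps both Wonderwall and the made-up row, while the strict one keeps only Wonderwall.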

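Your attempt was looking for something completely different. In regular expressions, [ ] denote a character class. So [abc] means "one of a, or b, or c" and [^abc] means "one of anything except a, b, or c". Inside a class, * and , are just literal characters, and the first ] after the opening [^ ends the class. So [^*,*,*,*,[1] is the same as [^*,[1] and will match any single character that is not a *, a comma, a [ or a 1, and the ,] left over at the end of your pattern is a literal comma followed by a literal ]. None of your example lines starts with such a sequence, so nothing matched and grep -v printed every line. I think you were trying to do something like this: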

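$ grep -vP '^.*?,.*?,.*?,.*?,1,' file
Wonderwall,Oasis,Creation,Oct-95,2,1502270

The * is a quantifier: it means "0 or more of the previous". So it makes no sense by itself. To match any character 0 or more times, you would use .* and not * alone. Next, a single .* would first try to match all the way to the end of the line. This is called "greedy matching". For non-greedy matching, which tries the shortest possible match first, you add a ? after the quantifier, which is why I used .*? above. Non-greedy quantifiers are a Perl-compatible regex (PCRE) feature, so they need grep -P (available in GNU grep when it is built with PCRE support); with -E, .*? just behaves like the greedy .*. Also note that even a non-greedy .*? can still match across commas when that is the only way to make the whole pattern match, so [^,]* remains the more reliable way to stay within a single field.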
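To see greedy and non-greedy matching side by side, here is a small sketch. It assumes GNU grep with -P (PCRE) support, and the row a,b,c,d,e,1,x is invented so that the 1 sits in the 6th field and not the 5th.

```shell
line='a,b,c,d,e,1,x'   # hypothetical row: the 1 is in field 6

# Greedy: .* grabs as much as possible, up to the last comma.
greedy=$(printf '%s\n' "$line" | grep -oE '^.*,')
# Non-greedy (PCRE only): .*? stops at the first comma.
lazy=$(printf '%s\n' "$line" | grep -oP '^.*?,')
printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"

# A non-greedy .*? may still cross a comma if the match would otherwise
# fail, so this counts the line as a match although field 5 is "e".
lazy_hits=$(printf '%s\n' "$line" | grep -cP '^.*?,.*?,.*?,.*?,1,' || true)
# [^,]* can never cross a comma, so the same line is not matched here.
class_hits=$(printf '%s\n' "$line" | grep -cE '^([^,]*,){4}1,' || true)
printf 'lazy pattern matches: %s, [^,]* pattern matches: %s\n' "$lazy_hits" "$class_hits"
```

The || true only keeps the script going when grep -c counts zero matches, since grep then exits with status 1.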

terdon
  • Thanks for the great explanation. I see that grep is a pretty bad fit for this; these were just some extracurricular tasks that limited me to using grep. I was trying to figure out how to use the different wildcards, and your explanation helped a lot.

    For the 2 million part using grep, could I require the field to start with [2-9] followed by six [0-9]?

    – ZerosOnesTwos Nov 15 '22 at 19:59
  • I used grep -E "^([^,]*,){5}[2-9][0-9][0-9][0-9][0-9][0-9][0-9]" and it seems to work for minimum 2 million sales – ZerosOnesTwos Nov 15 '22 at 20:03
  • Do you have any advice on how I could find the year that appears the most in the text file using grep? – ZerosOnesTwos Nov 15 '22 at 20:44
  • @ZerosOnesTwos my only advice is not to use grep. It simply is the wrong tool for the job. You can't do this sort of calculation in grep alone. Look into tools like awk. – terdon Nov 16 '22 at 10:46
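As a rough sketch of the awk route suggested in the last comment, the most frequent year could be tallied as shown here. It assumes the Released field always looks like Mon-YY, as in the sample lines. The last row is a made-up entry added only so that one year clearly wins, and ties are not handled.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
Single,Artist,Record label,Released,Chart,Traditional sales peak,
Imagine,John Lennon,Apple,Oct-75,1,1714351
Uptown Funk,Mark Ronson featuring Bruno Mars,RCA,Dec-14,1,1647310
Wonderwall,Oasis,Creation,Oct-95,2,1502270
Example Song,Example Artist,Example Label,Mar-95,3,1000000
EOF

most_common=$(awk -F, '
    NR > 1 {                      # skip the header line
        split($4, parts, "-")     # "Oct-95" -> parts[1]="Oct", parts[2]="95"
        count[parts[2]]++         # tally each two-digit year
    }
    END {
        for (y in count)
            if (count[y] > best) { best = count[y]; year = y }
        print year, best          # the winning year and how often it occurs
    }' "$sample")
printf '%s\n' "$most_common"

rm -f "$sample"
```

On this sample the result is year 95 with a count of 2.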