Awk : extract the actual value of a RegExp pattern match

Question

In the following awk snippet, file holds a file name with its full Linux path, which may include a directory named backup-YYMMDD, where YYMMDD is a date.

I would like to assign YYMMDD to isDate[file], that is isDate[file]=YYMMDD.

How can I do that?

for (file in files) {
    if ( file ~ /(^|\/)(library|labs data|current)(\/|$)/ ) {
        isKeep[file]
    }
    else if ( file ~ /(^|\/)(backup-[0-2][0-9][0-1][0-9][0-3][0-9])(\/|$)/ ) {
        isDate[file]
    }
    else {
        isDelete[file]
    }
}

AdminBee · Accepted Answer · 2019-12-03T10:42:00.713


GNU awk's match function accepts an optional third argument, an array, which lets you extract the actual text matched by the parenthesized parts of a pattern (this three-argument form is a gawk extension). Thus, you could use

match(file,"^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$",pats);
isDate[file]=pats[1]

in the else if .... part of your program. The (array) variable pats will then be filled with all (...)-enclosed sub-expressions in your RegExp which are found in the string, starting with index 1 (pats[0] would be the actual value of the entire expression). Since we only have one sub-expression thus grouped (the backup-YYMMDD part), pats[1] contains what you are looking for.

Alternatively, you could try directly

...
   else if (match(file,"^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$",pats)==1) {
      isDate[file]=pats[1]
   }
...

Note that this approach, of course, relies on there being only one path component containing the backup-YYMMDD pattern.
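
For awk implementations without the three-argument match, a similar result can probably be had with the standard two-argument form, which sets RSTART and RLENGTH. The following is only a minimal sketch, not part of the original answer: the function name extract_backup_date and the sample path are hypothetical, and the path is padded with slashes so that a single pattern covers a component at the start, middle, or end.

```shell
# Hypothetical sample path, used only for illustration.
path='/srv/data/backup-191203/report.txt'

# Hypothetical helper: prints YYMMDD if the path has a backup-YYMMDD component.
extract_backup_date() {
    printf '%s\n' "$1" | awk '{
        # Pad with slashes so the component is always delimited by "/".
        s = "/" $0 "/"
        # Two-argument match() is POSIX; it sets RSTART and RLENGTH.
        if (match(s, /\/backup-[0-2][0-9][0-1][0-9][0-3][0-9]\//)) {
            # "/backup-" is 8 characters long, so the date starts at RSTART+8.
            print substr(s, RSTART + 8, 6)
        }
    }'
}

extract_backup_date "$path"   # prints 191203
```

Inside the original loop, the same substr expression could be assigned to isDate[file] instead of being printed.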

Edit (on a note by the OP, @macxpat)

I used string constants ("^[[:print:]] ... $") for specifying the regular expression in this answer. However, as noted in the GNU Awk User's Guide, it is cleaner and more efficient to specify them as regular expression constants. Thus, better use

match(file,/^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$/,pats)

in the above examples!

edited Dec 03 '19 at 10:42

answered Dec 02 '19 at 08:57


Thank you! This exactly answers my question. I appreciate also your clear description: "a command which allows you to extract the actual value of string components characterized by a pattern". I have modified the title of my question in accordance, so it may be more useful to others. I was not able to make a successful search because I couldn't even formulate in plain words what I wanted to do! – macxpat Dec 03 '19 at 01:29
I have slightly modified your code to this ... else if (match(file, /(^|\/)backup-([0-2][0-9][0-1][0-9][0-3][0-9])(\/|$)/, pats)>0) { isDate[file]=pats[2] } .... I have two questions: 1) what is the meaning of [[:print:]] in your code? (couldn't find an answer) ; 2) Is there a particular reason for your use of a string constant ("…") instead of regexp constant (/…/)? – macxpat Dec 03 '19 at 01:32

You're welcome! Concerning question (1): As you probably know, [ ... ] defines a "character list", i.e. a list of characters accepted at this position of the RegExp. The [:print:] is a "character class" and interpreted as "any printable character" (as opposed to control characters). Thus, [[:print:]] is a character list containing (only) all printable characters. There are similar constructs for digits, alphanumeric etc., but note that these are all POSIX extensions. As for question (2): it's actually a bad habit probably taken over from C printf; using RegExp constants is cleaner. – AdminBee Dec 03 '19 at 07:37

As for character classes, the [[:digit:]] is particularly useful as replacement for [0-9] because the interpretation of that kind of specification depends on the locale-specific character sort order and may not always mean 0123456789! – AdminBee Dec 03 '19 at 10:15
Thanks for pointing out the possible differences between [0-9] and [[:digit:]] for some locales (found this post interesting too), and for the detailed explanation on [[:print:]] (finally found info on it and its friends on Wikipedia's RegExp page). It's indeed more secure to parse the path with it. BTW, paths in this database contain only one directory of the type backup-YYMMDD, so the match is safe. – macxpat Dec 03 '19 at 16:14
The function match normally returns the position of the pattern in the string matched by the regular expression. But I've noticed that if the RegExp contains ^[[:print:]]* in front of the pattern (like in your code), match always returns 1. Do you know why? Is it the reason why you used it? – macxpat Dec 05 '19 at 12:35

The reason for the behaviour is that I anchored the RegExp at the beginning of the line using the ^ symbol, so if the regular expression matches at all, it must by definition match at position 1 in the string. Notice that match returns the position where the entire RegExp occurs, not only the ( ... )-grouped sub-expression. In the check whether a line matches at all, I could as well have said if (match( ... )!=0) to exclude non-matching lines; forcing it to be equal to 1 can serve as a (not really necessary) consistency check ... – AdminBee Dec 05 '19 at 12:44

Btw, please notice that "the comment section is not for extended discussions". If you have further need for clarification, opening up a [chat] may be the better means (although I personally think your questions are of interest to a wider audience). – AdminBee Dec 05 '19 at 12:49
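
The return-value behaviour discussed in the comments above can be checked with a small experiment. This is a sketch under assumptions: the function name match_positions and the sample path are hypothetical, and .* stands in for [[:print:]]* so that it also runs on awk implementations lacking POSIX character classes.

```shell
# Hypothetical sample path, used only for illustration.
path='/srv/data/backup-191203/report.txt'

# Hypothetical helper: prints the return value of match() for an
# unanchored and an anchored version of the pattern.
match_positions() {
    printf '%s\n' "$1" | awk '{
        # Unanchored: match() returns where the pattern itself starts.
        unanchored = match($0, /backup-[0-9][0-9][0-9][0-9][0-9][0-9]/)
        # Anchored with a leading ^.* : the whole match starts at position 1.
        anchored = match($0, /^.*backup-[0-9][0-9][0-9][0-9][0-9][0-9].*$/)
        print unanchored, anchored
    }'
}

match_positions "$path"   # prints 11 1
```

For a path without a backup-YYMMDD part, both values are 0, which is why testing the result against 0 is enough to detect a match.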
