GNU awk's match function accepts an optional third argument: an array that receives the parts of the string matched by the parenthesized sub-expressions of the pattern (this is a gawk extension). Thus, you could use
match(file,"^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$",pats);
isDate[file]=pats[1]
in the else if .... part of your program. The array variable pats will then be filled with the text matched by each (...)-enclosed sub-expression of your RegExp, starting at index 1 (pats[0] holds the text matched by the entire expression). Since only one sub-expression is grouped this way (the backup-YYMMDD part), pats[1] contains what you are looking for.
Alternatively, you could test the return value of match directly in the condition:
...
else if (match(file,"^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$",pats)==1) {
isDate[file]=pats[1]
}
...
Note that this approach, of course, relies on there being only one path component containing the backup-YYMMDD pattern.
Edit (on a note by the OP, @macxpat)
I used string constants ("^[[:print:]] ... $") to specify the regular expression in this answer. However, as noted in the GNU Awk User's Guide, it is cleaner and more efficient to specify it as a regular expression constant. Thus, it is better to use
match(file,/^[[:print:]]*(backup-[0-2][0-9][0-1][0-9][0-3][0-9])[[:print:]]*$/,pats)
in the above examples!
... else if (match(file, /(^|\/)backup-([0-2][0-9][0-1][0-9][0-3][0-9])(\/|$)/, pats)>0) { isDate[file]=pats[2] } ...
I have two questions: 1) What is the meaning of [[:print:]] in your code? (I couldn't find an answer.) 2) Is there a particular reason for your use of a string constant ("…") instead of a regexp constant (/…/)? – macxpat Dec 03 '19 at 01:32

[ ... ]
defines a "character list", i.e. a list of characters accepted at this position of the RegExp. [:print:] is a "character class" and is interpreted as "any printable character" (as opposed to control characters). Thus, [[:print:]] is a character list containing (only) all printable characters. There are similar constructs for digits, alphanumeric characters etc., but note that these are all POSIX extensions. As for question (2): it's actually a bad habit, probably carried over from C's printf; using RegExp constants is cleaner. – AdminBee Dec 03 '19 at 07:37

[[:digit:]]
is particularly useful as a replacement for [0-9], because the interpretation of that kind of specification depends on the locale-specific character sort order and may not always mean 0123456789! – AdminBee Dec 03 '19 at 10:15

[0-9]
and [[:digit:]] for some locales (found this post interesting too), and for the detailed explanation on [[:print:]] (finally found info on it and its friends on Wikipedia's RegExp page). It's indeed more secure to parse the path with it. BTW, paths in this database contain only one directory of the type backup-YYMMDD, so the match is safe. – macxpat Dec 03 '19 at 16:14

match
normally returns the position in the string at which the text matched by the regular expression begins. But I've noticed that if the RegExp contains ^[[:print:]]* in front of the pattern (like in your code), match always returns 1. Do you know why? Is that the reason why you used it? – macxpat Dec 05 '19 at 12:35

^
symbol, so if the regular expression matches at all, it must by definition match at position 1 in the string. Notice that match returns the position where the entire RegExp matches, not only the ( ... )-grouped sub-expression. To check whether a line matches at all, I could just as well have written if (match( ... )!=0) to exclude non-matching lines; forcing the result to equal 1 can serve as a (not really necessary) consistency check ... – AdminBee Dec 05 '19 at 12:44