Confusion with Linux find regex

Question

I'm getting quite confused with Linux find command's regular expression usage.

I'm aware that there is an option regextype, but without that, according to the current man page, it is supposed to use Emacs regular expressions. This page seems to say that character classes are supported ("this is a POSIX feature"), but my experiments seem to show that nothing like [[:ascii:]] or [[:digit:]] or [[:alnum:]] ever works, quite apart from the fact that these are truly archaic ways of handling character classes. Instead you seem to have to use [a-zA-Z] which, apart from anything else, is useless for Unicode characters.

So I turned to regextype: I find that you get a list of possible settings by going find -regextype help. This gives:

find: Unknown regular expression type ‘help’; valid types are ‘findutils-default’, ‘awk’, ‘egrep’, ‘ed’, ‘emacs’, ‘gnu-awk’, ‘grep’, ‘posix-awk’, ‘posix-basic’, ‘posix-egrep’, ‘posix-extended’, ‘posix-minimal-basic’, ‘sed’.

... so I assumed that by including -regextype posix-basic, for example, I'd be able to run something like this:

find . -maxdepth 1 -regextype posix-basic -regex .*\d.*

This produces results, but not the ones I was hoping for: all the files and folders in the current directory with the lower-case letter "d" in their names! I was expecting all names with at least one digit.

I've looked at quite a lot of Linux find regex questions here on Stack Exchange, but I don't think I've seen a single one where "modern" character class handling is demonstrated. Is any of the regextype options able to handle something like this:

find . -maxdepth 1 -regextype ??? -regex '.*\d{3}\s+.*'

where I mean "contains three digits followed by one or more empty space characters". I.e. something like regex rules from a normal language like Java, Python, Javascript, etc...?

later, following comments

Here's an exercise: make a directory and put a few files into it with random names. Then add files with the following names: 'ctb117b', 'ctb117c', 'trt117a'.

I then want to isolate the '117' files. There may be files called 'xxx0009333qqq'. So using a modern regex engine I'd go like this, for example (allowing for the preceding ./):

find . -regex './\w{3}\d\{3}.*'

Using these more venerable Linux regex rules, what do I put that works?

find . -regextype posix-basic -regex '.*[[:digit:]]{3}.*'

produces nothing. Nor does '.*[[:digit:]]+.*', for example. If anyone's sufficiently interested, please show me something which works for you (lists the above files).

\d is a Perl-like expression, also supported by some GNU tools as a short way of writing what would be written as [[:digit:]] in a POSIX expression. Same for \s ([[:space:]]). The {3} modifier is a POSIX extended expression modifier. — Kusalananda, Jan 07 '20 at 20:34
When using *, be sure to quote it. find . -maxdepth 1 -regextype posix-basic -regex .*\d.* probably gave you unexpected results because *\d.* matched something in the current directory, so the shell expanded it before find ever saw your regex. — Andy Dalton, Jan 07 '20 at 20:35
[[:digit:]] works with all posix-* -regextype, and also with other (egrep, etc). — , Jan 07 '20 at 20:36
You could use -name '*[[:digit:]][[:digit:]][[:digit:]][[:blank:]]*' — Kusalananda, Jan 07 '20 at 20:36
@AndyDalton I just tried that command with first single quotes and then double quotes. Both returned the same (wrong) results. — mike rodent, Jan 07 '20 at 20:38
I am having no luck whatsoever with [[:digit:]]... in particular, surely [[:digit:]]+ is meant to mean "one or more digit(s)", and [[:digit:]]{3} is meant to mean "precisely three digits". This functionality is NOT working for me. I'm going to add an "exercise" at the end of my question for anyone sufficiently interested. — mike rodent, Jan 07 '20 at 20:44
Where does the find command that you are using come from? Did you build it yourself from sources? — Kusalananda, Jan 07 '20 at 20:46
@Kusalananda No! This is a standard Linux Mint (18.3) OS. I have not fiddled with the BASH in any way. It is version 4.3.48. — mike rodent, Jan 07 '20 at 20:52
use posix-extended instead of posix-basic. Neither + nor the {3} repetition works in BRE (basic regular expressions). — , Jan 07 '20 at 20:52
But it seems that you're confused about what you're looking for: the ^\w{3}\d{3}.*$ PCRE-like regex example you're giving (I've removed the extra backslash) will also match the xxx0009333qqq, not just the ctb117b, etc. — , Jan 07 '20 at 21:02
@mosvy Thanks, that's a useful piece of info. Yes, you're right about the example "other file", apologies. I'm looking for 3 non-digits, followed by 3 digits, followed by anything. So the "other file" was a bad example. — mike rodent, Jan 07 '20 at 21:02
Which find? Run find --version. Linux does not have a find; Linux is just a kernel, used in several operating systems, e.g. GNU/Linux. — ctrl-alt-delor, Jan 07 '20 at 22:42
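
The two points raised in the comments above (quoting the pattern, and basic versus extended interval syntax) can be checked with a small sketch like the following. It assumes GNU find, since -regextype is a GNU extension, and a scratch directory from mktemp; the file name plainname is made up for illustration, the others come from the exercise in the question.

```shell
#!/bin/sh
# Sketch only: assumes GNU find and a throwaway directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch ctb117b ctb117c trt117a plainname

# Unquoted: the shell removes the backslash before any command sees the word.
unquoted=$(printf '%s' .*\d.*)
# Quoted: the pattern is passed through unchanged.
quoted=$(printf '%s' '.*\d.*')

# posix-basic (BRE): interval braces must be backslash-escaped.
bre=$(find . -maxdepth 1 -regextype posix-basic \
  -regex '.*[[:digit:]]\{3\}.*' | sort)
# posix-extended (ERE): braces act as interval operators as written.
ere=$(find . -maxdepth 1 -regextype posix-extended \
  -regex '.*[[:digit:]]{3}.*' | sort)
# posix-basic with unescaped braces: "{3}" is treated as literal text.
bre_literal=$(find . -maxdepth 1 -regextype posix-basic \
  -regex '.*[[:digit:]]{3}.*')

printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"
printf 'BRE:\n%s\nERE:\n%s\n' "$bre" "$ere"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$tmp"
```

With these file names, the escaped BRE form and the ERE form should list the same three '117' files, while the unescaped BRE form lists nothing, which matches what the question reports.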

score 1 · Accepted Answer · answered Jan 07 '20 at 20:42

I would recommend using this :

find . -maxdepth 1 -regextype posix-extended -regex '.*[[:digit:]]{3}\s+.*'
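
As a rough way to try this out, the sketch below builds a scratch directory and runs the same kind of pattern against it. It assumes GNU find; the file names are invented for illustration, and [[:space:]] is used in place of \s because it is the POSIX bracket-expression spelling and does not rely on a GNU extension.

```shell
#!/bin/sh
# Sketch only: assumes GNU find; file names are hypothetical test data.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch 'abc123 def' 'abc12 def' ctb117b 'no digits'

# Three digits followed by one or more whitespace characters,
# anywhere in the path that find tests (e.g. "./abc123 def").
matches=$(find . -maxdepth 1 -regextype posix-extended \
  -regex '.*[[:digit:]]{3}[[:space:]]+.*')

printf '%s\n' "$matches"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$tmp"
```

Only the name with three digits directly before a space should be listed here; 'abc12 def' has too few digits and ctb117b has no whitespace.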

Philippe


Thanks. I tried it. Firstly it produced no results, but that wouldn't be surprising, as there are no spaces in the names of the files (see suggested exercise). If I drop the \s+ things get listed. But explain this: if I put '.*[[:digit:]]{80}.*' it returns exactly the same files! It completely ignores the "number of matches" directive! – mike rodent Jan 07 '20 at 20:59
In your post, you said "contains three digits followed by one or more empty space characters". Are you sure you get same files when you put '.*[[:digit:]]{80}.*' and -regextype posix-extended ? – Philippe Jan 07 '20 at 21:02
Sorry, no... I've got that wrong. The {80} directive is working as expected. Apologies. By the way, are you saying that backslash character classes work with posix-extended? So should "\w" work? I find that it does... but './\w{3}\d{3}.*' doesn't... I must find the right documentation for posix-extended obviously! – mike rodent Jan 07 '20 at 21:07
\w does not work with posix-extended – Philippe Jan 07 '20 at 21:09
It seems to: find . -maxdepth 1 -regextype posix-extended -regex './\w{3}[[:digit:]]{3}.*' ... lists the right files. From the source: https://www.gnu.org/software/findutils/manual/html_node/find_html/posix_002dextended-regular-expression-syntax.html ... "\w" is one character class listed there. – mike rodent Jan 07 '20 at 21:11
Yes, you are right, posix-extended does accept \w – Philippe Jan 07 '20 at 21:14
You've really clarified things for me... thanks. – mike rodent Jan 07 '20 at 21:15
Nitpick: \w, \d, \s etc. are not POSIX extended regular expression things, but GNU's implementation of POSIX extended regular expressions includes them (for convenience). They are originally from Perl. – Kusalananda Jan 07 '20 at 21:23
@Kusalananda YES! Thanks for that. I took the trouble of examining that page... "\w" is said to be "matches a character within a word", which is not the same as "matches a letter"! Also that page does not list the "\s" matcher. All quite confusing. PS isn't it time someone upgraded find and other commands so they can use a more modern, and hopefully uniform, regex set of rules? – mike rodent Jan 07 '20 at 21:26
@mikerodent That's a sloppy way of saying "matches a word character", i.e. it's the same as [[:alnum:]_] (alphanumerics and underscore). That's what \w matches in Perl. – Kusalananda Jan 07 '20 at 21:29
Strange, https://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html#boost_regex.syntax.basic_extended.escapes_matching_a_specific_character says \w is included – Philippe Jan 07 '20 at 21:31
@mikerodent Also read the section above that. – Kusalananda Jan 07 '20 at 21:32
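
The equivalence between \w and [[:alnum:]_] discussed in these comments can be checked with a sketch along these lines. It assumes GNU find, whose regex engine accepts \w as an extension; the names a_b117x and a-b117x are made up to probe the underscore and a non-word character, and the rest come from the question.

```shell
#!/bin/sh
# Sketch only: assumes GNU find; \w is a GNU extension, not POSIX ERE.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch ctb117b ctb117c trt117a a_b117x a-b117x xxx0009333qqq

# Three word characters, then three digits, then anything.
with_w=$(find . -maxdepth 1 -regextype posix-extended \
  -regex '\./\w{3}[[:digit:]]{3}.*' | sort)
# The same pattern with \w spelled as a POSIX bracket expression.
with_class=$(find . -maxdepth 1 -regextype posix-extended \
  -regex '\./[[:alnum:]_]{3}[[:digit:]]{3}.*' | sort)

printf 'with \\w:\n%s\n' "$with_w"
printf 'with [[:alnum:]_]:\n%s\n' "$with_class"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$tmp"
```

Both commands should list the same names. Note that xxx0009333qqq is included as well, as pointed out in the comments on the question, while a-b117x is not, because the hyphen is not a word character.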
