On the filesystem I have files like: PREFIX_GROUPNAME_OTHERNAMES[.txt|.*]

e.g.:

A_ABC_A.txt
A_ABC_B.txt
A_ABC_C.txt
A_XYZ_A.txt
A_XYZ_B.txt
A_XYZ_C.txt

For some further tasks I want to get the group names.

$ # the command I'm looking for
result:
> ABC XYZ

I know the name structure, not the group names.

Idea (but this seems to be very expensive on large listings):

  • scan all files
  • split the names, create lists by group name
  • return groups

find and awk, maybe tr, seem to be what I'm looking for while searching for a solution here.

EDIT:

This gives a non-unique list:

find ./ -iname '*.txt' | xargs -n 1 | cut -d '_' -f 2
> ABC
> ABC
> ABC
> XYZ
> XYZ
> XYZ
f b

3 Answers

The following will only use shell string manipulation and the standard tool sort, in order to avoid parsing the output of ls or find, which is strongly discouraged:

for f in *.*; do gr=${f#*_};gr=${gr%_*}; printf "%s\n" "$gr"; done | sort -u

In your case, it should output exactly

ABC
XYZ

To explain:

  • We iterate over all file names that match *.* (should be a "minimally comprehensive" pattern to catch all filenames you stated)
  • Via shell string manipulation, we first remove everything up to the first _ and in a second step everything starting with the last _.
  • We output the result via printf (as pointed out by Stéphane Chazelas, it is unlikely that your shell is lacking that command)

The resulting output will not yet be unique. In order to remove duplicates, we pipe the output through sort -u.
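
For illustration, here is what the two expansion steps do to the first of your example names:

f='A_ABC_A.txt'
gr=${f#*_}     # strip everything up to and including the first '_'  -> 'ABC_A.txt'
gr=${gr%_*}    # strip everything from the last '_' onwards          -> 'ABC'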

Note that if - as you state - you have a lot of files matching this pattern, your for loop argument list may exceed the internal limits of your shell. Also, while this method avoids many pitfalls associated with special characters in filenames, the use of printf and sort means it will fail if the filenames contain newlines (which are valid characters for filenames on many filesystems).

AdminBee

With zsh:

typeset -U groups=( **/*_*_*.*(Ne['REPLY=${${(s[_])REPLY:t}[2]}']) )
  • typeset -U groups=(...): define groups as an array with Unique members
  • **/*_*_*.*: file names with at least one . and at least two _s before the rightmost ., at or below the current working directory
  • (Ne['code']): glob qualifiers to further qualify the glob
  • N: Nullglob: expand to nothing if there's no match
  • e['code']: transform each glob expansion¹ (in $REPLY in the code)
  • $REPLY:t: the tail (basename) of the file.
  • ${(s[_])var}: splits on _ (and then we take the second with [2]).
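
To check the result, you can print the array one element per line, for example:

print -rl -- $groups

With the example files from the question below the current directory, this should output ABC and XYZ.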

With bash (the GNU shell), GNU find and GNU awk, you can do something similar with:

readarray -td '' groups < <(
  LC_ALL=C find . -name '.?*' -prune -o \
    -name '*_*_*.*' -printf '%f\0' |
    gawk -v RS='\0' -v ORS='\0' -F _ '!seen[$2]++ {print $2}'
)
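
Likewise, to see what ended up in the bash array (the order follows find's traversal order, so it is not sorted), something like this works:

printf '%s\n' "${groups[@]}"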

Those make no assumption as to what characters or non-characters may be found between those first two _ characters.

Both skip hidden files and files in hidden directories. To include them, add the D glob qualifier in zsh or remove the -name '.?*' -prune -o in find.

If there's a large list of files, the find-based one will be more memory friendly as it doesn't store the whole list in memory. You can take a similar approach in zsh with:

typeset -A seen=()
: **/*_*_*.*(Ne['! seen[${${(s[_])REPLY:t}[2]}]='])
groups=( ${(k)seen} )

¹ the exit status of that code also determines whether the file is selected or not, but here the code always returns true

  • Thank you! But I cannot use zsh on the system where the command will be used. I do use zsh locally, though, and I still wonder how crazy syntax can be sometimes :-) – f b Nov 05 '21 at 13:49

While the answers were coming in, I also found a solution myself, similar to what @AdminBee mentioned:

On huge result lists from the file system, you may decide to go with find and xargs if you cannot limit the search pattern (e.g. '*.txt').

for f in ./some/path/*.txt; do gr=${f#*_};gr=${gr%_*}; echo "$gr"; done | sort -u
> ABC
> XYZ

find ./ -iname '*.txt' | xargs -n 1 | cut -d '_' -f 2 | sort -u
> ABC
> XYZ
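
If matching files can sit in subdirectories whose names themselves contain _, a safer variant (just a suggestion, relying on GNU find's -printf as in the answer above) is to pass only the basenames to cut:

find ./ -iname '*.txt' -printf '%f\n' | cut -d '_' -f 2 | sort -u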

f b
  • What's the point of xargs -n 1 (short for xargs -n 1 echo)? Is that so you get an empty line if there's no matching file? Or is it so that the \b / \n / \0xxx get expanded by echo? – Stéphane Chazelas Nov 05 '21 at 18:34
  • Hi! Sorry, I don't know or remember the details anymore; I'm not using the shell very often. But ages ago I ended up with crashes of bash or (a|da)sh because the file list result broke with a buffer overflow. xargs solved this, and since then, if I have 1T.. 10T++ files I just use it that way. Though: now we have 64 bit and tons of CPU and RAM, where most of us don't even think about problems in that way, or they don't come up anymore because of such power. – f b Nov 06 '21 at 21:51