With zsh:
typeset -U groups=( **/*_*_*.*(Ne['REPLY=${${(s[_])REPLY:t}[2]}']) )
typeset -U groups=(...): define groups as an array with Unique members
**/*_*_*.*: file names with at least one . and at least two _s before the rightmost ., at or below the current working directory
(Ne['code']): glob qualifiers to further qualify the glob
N: Nullglob: expand to nothing if there's no match
e['code']: transform each glob expansion¹ (in $REPLY in the code)
$REPLY:t: the tail (basename) of the file.
${(s[_])var}: splits on _ (and then we take the second with [2]).
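To make the tail-then-split step concrete, here is a minimal sketch of the same extraction written with plain parameter expansion. It is not the code from the answer: second_field and the sample paths are made-up names for illustration, and it assumes the name contains at least two _ (which the glob already guarantees). It is written for bash, and should also work in zsh since it only uses standard expansion operators.

```shell
# second_field is a hypothetical helper name, used only for this illustration.
# Prints the text between the first and second "_" of a path's basename.
second_field() {
  local name rest
  name=${1##*/}     # tail: strip everything up to the last /
  rest=${name#*_}   # drop the first field and the _ that follows it
  printf '%s\n' "${rest%%_*}"   # keep what precedes the next _
}

second_field 'photos/2024/trip_paris_001.jpg'   # prints: paris
second_field './a_b c_d.tar.gz'                 # prints: b c
```

The second call shows that spaces in the field are left alone, since nothing here relies on word splitting.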
With bash (the GNU shell), GNU find and GNU awk, you can do something similar with:
readarray -td '' groups < <(
  LC_ALL=C find . -name '.?*' -prune -o \
    -name '*_*_*.*' -printf '%f\0' |
    gawk -v RS='\0' -v ORS='\0' -F _ '!seen[$2]++ {print $2}'
)
Those make no assumption as to what characters or non-characters may be found between those first two _ characters.
Both skip hidden files and files in hidden directories. To include them, add the D glob qualifier in zsh or remove the -name '.?*' -prune -o in find.
If there's a large list of files, the find-based one will be more memory friendly as it doesn't store the whole list in memory. You can take a similar approach in zsh with:
typeset -A seen=()
: **/*_*_*.*(Ne['! seen[${${(s[_])REPLY:t}[2]}]='])
groups=( ${(k)seen} )
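The same seen-table idea can also be sketched in bash alone, for cases where gawk is not installed. This is an illustrative variant rather than part of the answer above: unique_groups and its directory argument are hypothetical names, and it assumes bash 4.0 or newer (associative arrays) and a find that supports -print0. Only the unique keys are kept in memory, and first-seen order is preserved.

```shell
# Sketch: collect the unique second "_"-separated field of matching file names.
# unique_groups and its directory argument are hypothetical, for illustration.
# Assumes bash 4.0+ (associative arrays) and a find that supports -print0.
unique_groups() {
  local dir=${1:-.} file name rest key
  local -A seen=()
  groups=()                 # result array, set in the caller's scope
  while IFS= read -r -d '' file; do
    name=${file##*/}        # tail (basename)
    rest=${name#*_}         # drop the first field and its _
    key=${rest%%_*}         # text between the first two _
    # Prefix the subscript with "x": bash rejects an empty associative key.
    if [[ -z ${seen["x$key"]} ]]; then
      seen["x$key"]=1
      groups+=("$key")
    fi
  done < <(LC_ALL=C find "$dir" -name '.?*' -prune -o \
             -name '*_*_*.*' -print0)
}
```

Calling unique_groups . fills the groups array for the current directory; as with the find command earlier, hidden files and directories are pruned.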
¹ the exit status of that code also determines whether the file is selected or not, but here the code always returns true