3

I have a large number of files in a folder with a specific naming system. It looks somewhat like this:

my_file_A_a.txt
my_file_A_d.txt
my_file_A_f.txt
my_file_A_t.txt
my_file_B_r.txt
my_file_B_x.txt
my_file_C_f.txt
my_file_D_f.txt
my_file_D_g.txt
my_file_E_r.txt

I would like a command line, or a series of commands (can use temp files, I have write access), that would return something like:

A: 4
B: 2
C: 1
D: 2
E: 1

It could be done with a lot of ls -1 *A* | wc -l commands, but it would take a long time as there are a few hundred "groups" to count.

Also, each group name is unique. There is an A group, a B group, but no AB group.

Whitehot
  • @Whitehot your example seems to summarize the 3rd part of the filenames (A,B,C,D,E), not the 4th part. – Jeff Schaller Sep 29 '21 at 19:25
  • @JeffSchaller the file names are all quite long, so I figured I would save some characters and shorten them a bit. The logic behind the command line(s) should still be the same – Whitehot Sep 30 '21 at 14:23
  • The length isn't a particular issue, although potential answers might have tried to hard-code the position of the grouping, but it would help if you made an [edit] to clarify how you can distinguish the grouping key. – Jeff Schaller Sep 30 '21 at 14:29

6 Answers

6

Assuming that your filenames are "well-behaved", i.e. they don't contain newlines, the following combination of ls and awk would work:

ls -d my_file* | awk -F'_' 'NF==4{count[$3]++} END{for (i in count) printf "%s: %d\n", i, count[i]}'

This pipes the output of the ls command, which lists all files matching my_file*, into an awk program. The awk program uses _ as the field separator and, for each filename, increments an entry in the array count indexed by the 3rd field, i.e. by the group name.

At the end, it prints an overview of how often each group occurred.

Notice

  • A minimal safeguard against completely malformed filenames is built in by requiring exactly 4 underscore-separated fields. This assumes that _ cannot occur inside the a, d, f, ... part of the filenames in your example.
  • The output will not necessarily be sorted according to the category names. The sorting order will depend on how awk traverses the array indices in the for (i in count) loop. If sorting is desired, you can add a further pipe to sort. Alternatively, if you use GNU Awk, you can add a configuration setting via
    BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}
    
    before the NF==4{...} rule. This ensures arrays are traversed according to the array index, sorted in lexicographical (ASCII) order (a complete sketch follows this list).
  • This works only under the limitation stated at the beginning, and because your filename structure is rather simple. In general, parsing the output of ls is discouraged.
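
For reference, here is a complete sketch of the sorted variant. It assumes GNU awk (gawk), since PROCINFO["sorted_in"] is a gawk extension; on other awks, pipe the output through sort instead:

ls -d my_file* | awk -F'_' '
    BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"}   # gawk only: traverse indices in sorted order
    NF==4 {count[$3]++}                            # tally the 3rd underscore-separated field
    END   {for (i in count) printf "%s: %d\n", i, count[i]}'
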
AdminBee
4
for f in my_file_*_*.txt
do
    f="${f#my_file_}"
    printf "%s\n" "${f%%_*.txt}"
done |
sort |
uniq -c

The for loop reformats each filename f by stripping off the leading my_file_ and the trailing _whatever.txt; the pipeline then sorts that output and uses uniq -c to count the number of occurrences of each unique value.
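
Note that uniq -c prints the count first (e.g. "4 A") rather than the "A: 4" format shown in the question. If the exact format matters, one possible extra awk stage (a sketch, not part of the original answer) reshapes it:

for f in my_file_*_*.txt
do
    f="${f#my_file_}"
    printf "%s\n" "${f%%_*.txt}"
done |
sort |
uniq -c |
awk '{ printf "%s: %d\n", $2, $1 }'   # swap "count group" into "group: count"
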

Jim L.
  • This one seems to work the fastest for me, and it's quite elegant in its execution I find. I'll accept it tomorrow if nobody posts anything faster. – Whitehot Sep 30 '21 at 14:52
  • Can this be modified to sort files by a prefix instead? E.g. A_1245.txt, B_43525.txt... – Seano Nov 03 '23 at 18:18
  • @Seano Sure, just don't strip it off. The first line after the do statement strips off the prefix. If you wish to sort without any stripping of a leading prefix, then omit (or modify) that statement to suit your needs. If you want to sort numerically instead of lexically, read the sort man page for your OS, and likely add the -n flag to the sort command. – Jim L. Nov 04 '23 at 03:29
  • Thanks. Is there a way to only print out files whose prefixes match a pattern? E.g. All 3-digit integers like 340_dghdfgh.txt, 902_erstgtrg.txt and not anything else? – Seano Nov 04 '23 at 13:28
  • @seano That's a large part of what this thread is about! If it's a truly simple pattern, you might be able to accomplish that in the file globbing: for f in [0-9][0-9][0-9]_*. For complex pattern matching, grep and regular expressions are your friend. Often, the best way to learn is to experiment. There's nothing in this command stream that's going to damage any data. – Jim L. Nov 04 '23 at 23:01
3

I would approach it with a loop over a wildcard, then extract the field from the filename with bash's regular-expression matching inside its [[ ... ]] conditional construct.

unset collect
declare -A collect
for f in ./*_*_*_*.txt
do 
  [[ $f =~ [^_]+_+[^_]+_+([^_]+)_+[^_]+\.txt ]] &&
  ((collect["${BASH_REMATCH[1]}"]++))
done

for group in "${!collect[@]}"
do
  printf '%s: %d\n' "$group" "${collect["$group"]}"
done

The only parenthesized group in the regex is the 3rd underscore-delimited field; once it's captured (in BASH_REMATCH[1]), we increment the corresponding counter in the associative array collect.
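
To illustrate the capture on one of the example filenames (a sketch of a shell session):

$ [[ my_file_A_a.txt =~ [^_]+_+[^_]+_+([^_]+)_+[^_]+\.txt ]] && echo "${BASH_REMATCH[1]}"
A
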

Jeff Schaller
2

Using Raku (formerly known as Perl_6)

raku -e '.say for dir.split("_")[2,5,8...*].Bag.pairs.sort;'

Sample Input (current directory listing):

my_file_A_a.txt
my_file_A_d.txt
my_file_A_f.txt
my_file_A_t.txt
my_file_B_r.txt
my_file_B_x.txt
my_file_C_f.txt
my_file_D_f.txt
my_file_D_g.txt
my_file_E_r.txt

Sample Output:

A => 4
B => 2
C => 1
D => 2
E => 1

As a brief explanation, the current directory dir() listing is obtained and split on _ underscore. [File names are assumed not to start/end with _ underscore]. The elements obtained are thus:

raku -e 'dir.split("_").raku.say;'

("my", "file", "A", "a.txt my", "file", "A", "d.txt my", "file", "A", "f.txt my", "file", "A", "t.txt my", "file", "B", "r.txt my", "file", "B", "x.txt my", "file", "C", "f.txt my", "file", "D", "f.txt my", "file", "D", "g.txt my", "file", "E", "r.txt").Seq

After that, Raku has a fairly robust mechanism for generating/understanding sequences: simply typing in [2,5,8...*] lets you pull out the letters A,B,C,D,E (every third element, numbering starts from 0). Then Bag tallies the letters, pairs turns the tally into key/value pairs, and sort orders them.

(If you're sure you have no blank spaces in your file names, you could add a second call to split(" ") after the first one. Then the elements you would pull out would be [2,6,10...*] ).

NOTE 1: If you have extraneous file names that don't fit the pattern listed by the OP (and are mucking up your counts), then you could change the dir call to something like dir(test => / [ <-[_]>+ _ ] ** 3 /) which subsets filenames on a regex where one-or-more non-underscores are followed by an underscore, repeated three times.

NOTE 2: If you want two columns of output (no => in-between), simply change .say to .put. Or if you prefer a more 'Raku-ish' output, try using .raku.say, which returns the following:

:A(4)
:B(2)
:C(1)
:D(2)
:E(1)

https://docs.raku.org/routine/dir
https://docs.raku.org/type/Bag
https://raku.org

jubilatious1
1

A filename containing four underscore-delimited fields and ending with the string .txt is matched by the extended globbing pattern +([!_])_+([!_])_+([!_])_+([!_]).txt. Each +([!_]) matches one or more non-underscore characters, just like [^_]+ would do as an extended regular expression.

We can extract the third field from this by removing the initial two fields and the last field along with the .txt suffix string.

#!/bin/bash

shopt -s extglob nullglob

names=( +([!_])_+([!_])_+([!_])_+([!_]).txt )
names=( "${names[@]#+([!_])_+([!_])_}" )
names=( "${names[@]%_+([!_]).txt}" )

printf '%s\n' "${names[@]}" | sort | uniq -c

The script only assumes that the third field in the filename does not contain embedded newlines.

Testing this on the example filenames in the question:

$ ls
list              my_file_A_f.txt   my_file_B_x.txt   my_file_D_g.txt
my_file_A_a.txt   my_file_A_t.txt   my_file_C_f.txt   my_file_E_r.txt
my_file_A_d.txt   my_file_B_r.txt   my_file_D_f.txt   script
$ ./script
   4 A
   2 B
   1 C
   2 D
   1 E

You could filter this through a simple awk script to get it into whatever format you wish.

$ ./script | awk '{ printf "%s: %d\n", $2, $1 }'
A: 4
B: 2
C: 1
D: 2
E: 1

If your names are well-behaved, meaning there are no embedded newline characters at all in any of them, then you may simplify the script somewhat and use cut instead.

#!/bin/bash

shopt -s extglob nullglob

printf '%s\n' +([!_])_+([!_])_+([!_])_+([!_]).txt | cut -d _ -f 3 | sort | uniq -c

Kusalananda
0

sort, sed and uniq are enough:

ls | grep my_file | sed "s/.*_.*_\(.*\)_.*txt/\1/" | sort | uniq -c | sed "s/[^0-9]*\([0-9]*\) \(.*\)/\2: \1/"

Another one-liner, using just three variables:

count=0; chchange="dummy"; ls | sed -n "s/.*my_file.*_\(.*\)_.*txt/\1/p" | sort | cat - <(echo end) | while read a; do if [ "$a" == "$chchange" ]; then ((count++)); else if [ "$chchange" != "dummy" ]; then echo "$chchange $count"; fi; count=1; chchange=$a; fi; done

A sentinel line (end) is appended to the sort output so that the loop also prints the final group.
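
For readability, here is the same logic expanded over multiple lines (a reformatted sketch of the one-liner above, same behavior):

count=0
chchange="dummy"
ls | sed -n "s/.*my_file.*_\(.*\)_.*txt/\1/p" | sort | cat - <(echo end) |
while read a; do
    if [ "$a" == "$chchange" ]; then
        ((count++))                     # same group as before: keep counting
    else
        if [ "$chchange" != "dummy" ]; then
            echo "$chchange $count"     # group changed: print the finished group
        fi
        count=1                         # start the new group's count
        chchange=$a
    fi
done
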

K-attila-