
I have a directory on an Ubuntu machine containing a large number of files, which fall into groups that share the same prefix. I need the list of distinct prefixes that occur in the file names, as shown below. For the listing:

pj6_ex_18_i535_tr_92.pdf
pj6_ex_18_i535_tr_95.pdf
...
pj6_ex_14_i535_tr_96.pdf
pj6_ex_14_i535_tr_97.pdf
pj6_ex_14_i535_tr_98.pdf
....
pj1_ex_24_i535_tr_91.pdf
pj1_ex_24_i535_tr_92.pdf
pj1_ex_24_i535_tr_93.pdf
...
pj3_ex_16_i535_tr_23.pdf
pj3_ex_16_i535_tr_22.pdf

I need to get the following list. I imagine this is possible with awk, but I don't know how.

pj6_ex_18_
pj6_ex_14_i535_
pj1_ex_24_i535_
pj3_ex_16_i535_

How can I do this?

terdon
  • Why is pj6_ex_18_ cut after 3 underscores but the others after 4? Is that a typo? Also, awk is not a tool for renaming, so don't think of awk for this kind of purpose; it's a text-processing tool. – αғsнιη May 05 '21 at 14:03
  • This is almost a duplicate of https://unix.stackexchange.com/q/645966/7696 but with different data - similar enough that I could recycle my answer from there with a few small changes. – cas May 05 '21 at 14:20
  • Please define what a "prefix" is. – glenn jackman May 05 '21 at 15:02

1 Answer

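$ perl -lne '
    # Reduce each file name to its prefix: drop "_tr" and everything after it.
    s/_tr.*/_/;
    # Record the prefix unless it starts with one that was already seen.
    unless (defined($prefixes) && m/^(?:$prefixes)/) {
      $prefixes{$_}++;
      # Rebuild the alternation of all known prefixes, regex-quoted.
      $prefixes=join("|", map +( "\Q$_\E" ), keys %prefixes);
    };

    END { print join("\n", sort keys %prefixes) }' <(sort input.txt)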
pj1_ex_24_i535_
pj3_ex_16_i535_
pj6_ex_14_i535_
pj6_ex_18_i535_

Or, even shorter, keeping track of only the last prefix printed rather than every unique prefix (this relies on the input being sorted):

$ perl -lne '
    next if (defined($last) && m/^\Q$last\E/);
    s/_tr.*/_/;
    $last=$_;
    print' <(sort input.txt)
pj1_ex_24_i535_
pj3_ex_16_i535_
pj6_ex_14_i535_
pj6_ex_18_i535_

In both versions, \Q and \E quote any regex metacharacters in the interpolated prefix (in the first version when the alternation is built, in the second directly in the m// match). For example, if a prefix contains something like .*, it is matched as a literal . followed by a literal *, not as "zero or more of any character".

cas