2

I am looking for a way, in a Linux shell (preferably bash), to find duplicate files based on the first few letters of their filenames.

Where this would be useful:

I build mod packs for Minecraft. As of 1.14.4, Forge no longer raises an error if a pack contains duplicate mods in different versions; it simply stops the older versions from running. A script to help find these duplicates would be very advantageous.

Example listing:

minecolonies-0.13.312-beta-universal.jar   
minecolonies-0.13.386-alpha-universal.jar 

By quickly identifying the dupes, I can keep the client pack small.

More information as requested

There is no specific format. However, as you can see, there are at least two prevailing formats. Further, there is no standard in the community about which characters to use or avoid. Some use spaces (ick), some use [] (also ick), some use underscores (more ick), and some use dashes (preferred, but what can you do).

https://gist.github.com/be3cc9a77150194476b2000cb8ee16e5 contains a sample list of mod filenames. It has been cleaned, so there are no dupes in it.

https://gist.github.com/b0ac1e03145e893e880da45cf08ebd7a contains a sample where I deliberately made duplicates. It is an exaggeration of what happens from time to time.

Deeper Explanation

I realize this might be resource heavy to do.

I would like to arbitrarily specify a slice range (start to finish) of each filename to sample, find duplicates based on that slice, and then highlight the duplicates. I don't need the script to actually delete them.
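For illustration, here is a minimal sketch of that idea, assuming bash and awk; START and LEN are hypothetical parameters naming the slice, and only the second and later occurrences of a slice are flagged:

#!/bin/bash
# Flag names whose slice (characters START..START+LEN-1) was already
# seen, comparing case-insensitively. Adjust START and LEN to taste.
START=1 LEN=12
for f in *.jar; do
    printf '%s\t%s\n' "${f:START-1:LEN}" "$f"
done | awk -F'\t' 'seen[tolower($1)]++ { print "possible dupe:", $2 }'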

Extra Credit

The script would present a menu of the files that it suspects match the duplication criterion, allowing for easy deletion or renaming.
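A bare-bones version of that menu could be built on bash's built-in select; this is only a hypothetical sketch (deletion only, no renaming):

#!/bin/bash
# Offer every jar (in a real script, only the suspected dupes) in a
# numbered menu; choosing "quit" leaves the loop. Note that select
# expands *.jar once, so the menu is not refreshed after a deletion.
select f in *.jar quit; do
    [ "$f" = quit ] && break
    [ -n "$f" ] && rm -i -- "$f"   # rm -i asks before each removal
done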

  • 3
    What defines the end of the part tested for duplication? A dash? The 1st or 2nd number of a version? Something else? And what do you want to do after that? Keep the first, the last? (Add this information to the question.) – thanasisp Oct 29 '20 at 16:55
  • 3
    Can you add information on how the filename is structured? Is it always <package name>-<version>-<string>-<string>.jar, and is a match of the <package name> part sufficient for the match? – AdminBee Oct 29 '20 at 16:56
  • @thanasisp extra information added. – Kreezxil Oct 29 '20 at 17:23
  • @AdminBee extra information plus data sample added. – Kreezxil Oct 29 '20 at 17:23
  • Do you have any link to a list including duplicates? It is still a bit fuzzy; you'd get better answers if you could provide such a list. – thanasisp Oct 29 '20 at 17:46
  • If you don't want any automation, then probably just view the files sorted by name with ls -1 and you see suspected duplicates. – thanasisp Oct 29 '20 at 18:05
  • @thanasisp added an over-exaggerated example of what I'm talking about. – Kreezxil Oct 29 '20 at 18:16
  • @thanasisp I didn't request "no automation" – Kreezxil Oct 29 '20 at 18:16
  • 1
    Yes, you want "interaction"; by "automation" I mean the script deciding and keeping/deleting files without any further intervention. – thanasisp Oct 29 '20 at 18:26
  • 1
    What @thanasisp said. You're not asking for a script, you're asking for an interactive program that allows you to make decisions interactively. The wide range of options, particularly the files starting with [1.16. which have nothing to do with each other, doesn't really make this an easy task. – tink Oct 29 '20 at 19:17
  • 1
    create a sort table with some sort criteria: convert strings into lowercase, split file names into words, delete everything that isn't a word (like numbers, dots), delete some keywords (like jar, alpha, beta, build), sort the words for each row, and concatenate the whole row into a single sort key. Now dupes should look identical (like minecoloniesuniversal). But if you want craftingtweaks to match crafttweaker it does get a little more complex; there exists agrep for this (sketched after these comments). – alecxs Oct 29 '20 at 22:21
  • @alecxs thank you, that's a very useful grep. I think I might now have a script that can do what I want. I'll post a solution if no one else beats me to it. – Kreezxil Oct 29 '20 at 22:51
  • 1
    If you accepted a solution that was posted, you don't have to add it to the question; if you came up with a different solution, you should post it as an answer instead of making it part of the question. – Benjamin W. Nov 01 '20 at 19:34
  • 1
    Thank you @BenjaminW. I have adjusted the question as you suggested and posted my script as an answer. I still wanted thanasisp to have the credit, as the awk script given is the core of my script. – Kreezxil Nov 02 '20 at 12:48
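A rough sketch of the normalization approach alecxs describes above, assuming GNU coreutils and awk; the keyword list and the "suspected dupe" labeling are illustrative only:

#!/bin/bash
# Build a normalized key per file: lowercase, split on anything that is
# not a letter, drop noise words, then sort and join the remaining
# words. Names that produce the same key are suspected duplicates.
for f in *.jar; do
    key=$(printf '%s\n' "${f%.jar}" | tr '[:upper:]' '[:lower:]' |
          tr -cs '[:alpha:]' '\n' |
          grep -vxE '(jar|alpha|beta|build|universal)?' |
          sort | tr -d '\n')
    printf '%s\t%s\n' "$key" "$f"
done | sort | awk -F'\t' 'seen[$1]++ { print "suspected dupe:", $2 }'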

2 Answers

2

Filter possible duplicates

You could use a script like the following to filter these files for possible duplicates. It moves into a new directory all files that match at least one other file, case-insensitively, on the part of the name before the first dash, underscore, or space. cd into your jars directory to run it.

#!/bin/bash
mkdir -p possible_dups

# first pass counts each (lowercased) prefix, second pass prints the
# names whose prefix was seen more than once
awk -F'[-_ ]' ' NR==FNR {seen[tolower($1)]++; next} seen[tolower($1)] > 1 ' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar) |
xargs -r -d'\n' mv -t possible_dups/ --

Note: -r is a GNU extension that avoids running mv once with no file arguments when no possible duplicates are found. The -d'\n' option, also a GNU extension, makes xargs split its input on newlines, so spaces and other common characters in filenames are handled by the above command, but newlines in filenames are not.

You can edit the field separator assignment, -F'[-_ ]', to add or remove the characters that define the end of the part we test for duplication. Right now it means "dash or underscore or space". It's generally good to catch more than the real duplication cases, as I probably do here.
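For example, here is what that key extraction does to two illustrative filenames (hypothetical input):

$ printf '%s\n' 'minecolonies-0.13.312-beta-universal.jar' \
>     'MineColonies_0.13.386-alpha.jar' | awk -F'[-_ ]' '{ print tolower($1) }'
minecolonies
minecolonies

Both names reduce to the same key, so both would be reported as possible duplicates.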

Now you can inspect these files. You could also skip the filtering and do the next step directly on all files, if you feel their number is not very large.


Visual inspection of possible duplicates

I suggest using a visual shell for this task, like mc, the Midnight Commander. You can easily install mc with the package management tool of your Linux distribution.

Invoke mc in the directory holding these files, or navigate there from within it. In an X terminal you also get mouse support, but there are handy shortcuts for everything.

For example, following the menu Left -> Sorting... and unticking "case sensitive" will give you the sorted view you want.

Navigate over the files using the arrows; you can select many of them with Insert, and then copy (F5), move (F6) or delete (F8) the highlighted selections. Here is a screenshot of how it looks on your test data, filtered:

[screenshot: mc listing the filtered test data, sorted case-insensitively]

thanasisp
  • I use ranger. Btw, your script is giving me errors.

    mv: missing file operand

    – Kreezxil Oct 30 '20 at 22:08
  • I updated the answer to fix that, thanks. This error appeared because there were no file arguments for mv, so nothing happened; for example, when I run the script a second time, nothing is printed by the awk command. I added the -r parameter to xargs, a GNU extension, which means run nothing if the input from the pipe is empty. – thanasisp Oct 31 '20 at 04:22
  • I see ranger screenshots and it looks very good to me, it's better to use whatever you are familiar with. – thanasisp Oct 31 '20 at 04:25
1

We have a solution. I have accepted the answer that allowed me to easily accomplish my goal: a bash-driven program that doesn't involve a visual shell like mc or ranger.

#!/bin/bash

declare -a names

xIFS="${IFS}"
IFS="^M"

while true; do
    # two passes over the same file list: count each lowercased prefix,
    # then print the names whose prefix occurs more than once
    awk -F'[-_ ]' ' NR==FNR {seen[tolower($1)]++; next} seen[tolower($1)] > 1 ' <(printf "%s\n" *.jar) <(printf "%s\n" *.jar) > tmp.dat

    IDX=0
    names=()

    readarray names < tmp.dat

    size=${#names[@]}

    clear
    printf '\nPossible Dupes\n'

    for (( i=0; i<size; i++ )); do
            printf '%s\t%s' "$i" "${names[i]}"
    done

    printf '\nWhich dupe would you like to delete?\nEnter # to delete or q to quit\n'
    read n

    if [ "$n" = 'q' ]; then
            exit
    fi

    if [ "$n" -lt 0 ] || [ "$n" -ge "$size" ]; then
            read -p "Invalid Option: press [ENTER] to try again" dummyvar
            continue
    fi

    # strip the trailing newline that readarray keeps on each name
    IFS='^M'
    read -ra TARGET <<< "${names[$n]}"
    unset IFS

    # now remove the filename, sans any carriage returns,
    # from the filesystem
    # 12/18/2020
    rm "${TARGET[*]}"
    echo "removed ${TARGET[0]}" >> rm.log

done

IFS="${xIFS}"

This works well for me, as it doesn't involve reading through hundreds of filenames for duplicates by eye, and it will loop around until I'm happy with the outcome. It also saves my actions to a log file.

Generally speaking, the mod duplicates I encounter are few and far between, but when they occur they are bothersome. This script greatly improves that situation for me.

If you can make the script more intelligent or user-friendly, go for it; I'd like to see it.

EDITED: 11/5/20

  • reworded my thoughts
  • been using the script for several days now, very useful
  • what it allows me to do is upload my client pack, then upload everything minus the client-only mods to the server, then use this script to quickly clean the server's mods/ folder. So now my pack maintenance is even faster!
  • updated the script to use IFS and to cleanup the output in the menu

EDITED: 12/18/2020

  • one minor change makes the script behave correctly in even more situations.