
I have folders with many files whose names all start with a date - 20200403 for example. I want to find which folders contain multiple files with the same date, i.e. whose first 8 characters are identical. The dates only matter within each specific folder, not across folders.

The file metadata will not necessarily match the date in the file name, so I can't use that as a way to find them.

AKDub
  • 11

2 Answers

3

You could consider passing a (suitably sorted) list of filenames through uniq -d, which prints only repeated adjacent lines. Assuming your shell and uniq have the same ideas about collation order, for example:

printf -- "%s\n" * | cut -c1-8 | uniq -d

If the result is non-empty, then there must be duplicates. Wrapping this in a find command:

find . -type d -exec sh -c '
  cd "$1" && test -n "$(printf -- "%s\n" * | cut -c1-8 | uniq -d)"
' find-sh {} \; -print

So given

$ tree .
.
├── subdir1
│   └── 20200403foo
├── subdir2
│   ├── 20200403bar
│   └── 20200403foo
├── subdir3
│   └── 20200403foo
├── subdir4
│   ├── 20200403bar
│   └── 20200403foo
└── subdir5
    └── 20200403foo

5 directories, 7 files

then

$ find . -type d -exec sh -c 'cd "$1" && test -n "$(printf -- "%s\n" * | cut -c1-8 | uniq -d)"' find-sh {} \; -print
./subdir4
./subdir2

If you need to handle filenames containing newlines, and your cut and uniq support null delimiters (as the GNU versions do with -z), you can change the pipeline to the following - note the cut range grows to 1-10 because the ./ prefix adds two characters in front of the 8-character date:

printf "./%s\0" * | cut -zc1-10 | uniq -zd
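A quick way to try the NUL-safe variant end to end - a sketch with a throwaway sample tree (the directory names a and b are invented for this demo; GNU cut and uniq are required for the -z options):

```shell
# Build a disposable tree: a/ holds two files sharing a date, b/ only one.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
touch "$tmp/a/20200403foo" "$tmp/a/20200403bar"   # two files, same date
touch "$tmp/b/20200403foo"                        # one file only

# Run the NUL-safe check inside the same find wrapper as above.
find "$tmp" -type d -exec sh -c '
  cd "$1" && test -n "$(printf "./%s\0" * | cut -zc1-10 | uniq -zd)"
' find-sh {} \; -print
```

On this sample tree only the a directory should be printed, since it is the only one holding two files with the same date prefix.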
steeldriver
  • 81,074
0

A portable solution that also correctly handles arbitrary file paths, including those containing newline characters, at the price of some degree of inelegance and slowness:

find /path/to/dir -type d \( -exec sh -c '
  cd "$1"
  printf "%s/" [0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]* \
    | awk -v RS="/" "seen[substr(\$0,1,8)]++ { exit 1 }"
  ' mysh {} \; -o -print \)

find is used to recursively search for directories in /path/to/dir. In each found directory, it executes a script that pipes the file names matching a pattern roughly resembling a date (the yyyymmdd format is assumed), each one terminated by a /, into an awk instance. awk reads /-separated records and exits with a status of 1 as soon as a duplicate eight-character prefix (starting from the first character) is found in the input; that non-zero status makes the -exec test fail, so the -o alternative takes over and the directory name is -printed.
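The early-exit behaviour of that awk test can be seen in isolation. A minimal sketch feeding it two hand-made /-terminated name lists (the file names are invented):

```shell
# No repeated date prefix: awk never hits the exit, so the status is 0.
printf '20200403foo/20200404bar/' \
  | awk -v RS='/' 'seen[substr($0,1,8)]++ { exit 1 }'
echo "$?"    # prints 0

# Repeated date prefix: awk exits 1 on the second record.
printf '20200403foo/20200403bar/' \
  | awk -v RS='/' 'seen[substr($0,1,8)]++ { exit 1 }'
echo "$?"    # prints 1
```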

A faster alternative based on GNU tools:

find /path/to/dir -type f -name '[0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]*' \
  -print0 | awk -v FS='/' -v OFS='/' -v RS='\0' '
  { file=substr($NF,1,8); $NF=""; dir=$0 }
  seen[dir file]++ { dupl[dir] }
  END { for (d in dupl) print d }'

Here, only regular files whose names (roughly) start with a date are recursively searched for in /path/to/dir. The found file paths are piped to awk as a NUL-separated stream of records (which is why GNU awk is needed, for RS='\0'). For each record, only the first eight characters of the last component (the file name) are kept, and the resulting path is counted in an associative array. When a duplicate is found, the directory part (i.e. the path with the file name component removed) is remembered and printed at the end.
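The field manipulation at the heart of that awk program can be tried on its own. A minimal sketch with invented paths and, for simplicity, newline-separated records in place of the NUL separator used above:

```shell
# dir1 holds two files with the same date prefix, dir2 only one.
# Setting $NF="" drops the file name, and OFS='/' rebuilds $0 as the
# directory part (with a trailing slash left by the removed component).
printf 'dir1/20200403foo\ndir1/20200403bar\ndir2/20200403foo\n' \
  | awk -v FS='/' -v OFS='/' '
    { file = substr($NF, 1, 8); $NF = ""; dir = $0 }
    seen[dir file]++ { dupl[dir] }
    END { for (d in dupl) print d }'
# prints "dir1/"
```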

fra-san
  • 10,205