
I have folders with many files whose names all start with a date - 20200403 for example. I want to find which folders contain multiple files with the same date, i.e. whose first 8 characters are identical. The dates only matter within each specific folder, not across folders.

The file metadata will not necessarily match the date in the file name, so I can't use that as a way to find them.

AKDub
  • 11

2 Answers

3

You could consider passing a (suitably sorted) list of filenames through uniq -d, which prints only repeated adjacent lines. Assuming your shell and uniq have the same ideas about collation order, for example:

printf -- "%s\n" * | cut -c1-8 | uniq -d

If the result is non-empty, then there must be duplicates. Wrapping this in a find command:

find . -type d -exec sh -c '
  cd "$1" && test -n "$(printf -- "%s\n" * | cut -c1-8 | uniq -d)"
' find-sh {} \; -print

So given

$ tree .
.
├── subdir1
│   └── 20200403foo
├── subdir2
│   ├── 20200403bar
│   └── 20200403foo
├── subdir3
│   └── 20200403foo
├── subdir4
│   ├── 20200403bar
│   └── 20200403foo
└── subdir5
    └── 20200403foo

5 directories, 7 files

then

$ find . -type d -exec sh -c 'cd "$1" && test -n "$(printf -- "%s\n" * | cut -c1-8 | uniq -d)"' find-sh {} \; -print
./subdir4
./subdir2

If you need to handle filenames containing newlines, and your cut and uniq support null delimiters (as the GNU versions do with -z), you can change the pipeline to the following - note the cut range grows to 1-10 because the ./ prefix adds two characters in front of the 8-character date:

printf "./%s\0" * | cut -zc1-10 | uniq -zd
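A quick way to try the NUL-safe variant end to end - a sketch with a throwaway sample tree (the directory names a and b are invented for this demo; GNU cut and uniq are required for the -z options):

```shell
# Build a disposable tree: a/ holds two files sharing a date, b/ only one.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
touch "$tmp/a/20200403foo" "$tmp/a/20200403bar"   # two files, same date
touch "$tmp/b/20200403foo"                        # one file only

# Run the NUL-safe check inside the same find wrapper as above.
find "$tmp" -type d -exec sh -c '
  cd "$1" && test -n "$(printf "./%s\0" * | cut -zc1-10 | uniq -zd)"
' find-sh {} \; -print
```

On this sample tree only the a directory should be printed, since it is the only one holding two files with the same date prefix.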
steeldriver
  • 81,074
0

A portable solution that also correctly handles arbitrary file paths, including those containing newline characters, at the price of some degree of inelegance and slowness:

find /path/to/dir -type d \( -exec sh -c '
  cd "$1"
  printf "%s/" [0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]* \
    | awk -v RS="/" "seen[substr(\$0,1,8)]++ { exit 1 }"
  ' mysh {} \; -o -print \)

find is used to recursively search for directories in /path/to/dir. In each found directory, it executes a script that pipes the file names matching a pattern roughly resembling a date (the yyyymmdd format is assumed), each one terminated by a /, into an awk instance. awk reads /-separated records and exits with a status of 1 as soon as a duplicate eight-character prefix (starting from the first character) is found in the input; that non-zero status makes the -exec test fail, so the -o alternative takes over and the directory name is -printed.
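The early-exit behaviour of that awk test can be seen in isolation. A minimal sketch feeding it two hand-made /-terminated name lists (the file names are invented):

```shell
# No repeated date prefix: awk never hits the exit, so the status is 0.
printf '20200403foo/20200404bar/' \
  | awk -v RS='/' 'seen[substr($0,1,8)]++ { exit 1 }'
echo "$?"    # prints 0

# Repeated date prefix: awk exits 1 on the second record.
printf '20200403foo/20200403bar/' \
  | awk -v RS='/' 'seen[substr($0,1,8)]++ { exit 1 }'
echo "$?"    # prints 1
```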

A faster alternative based on GNU tools:

find /path/to/dir -type f -name '[0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]*' \
  -print0 | awk -v FS='/' -v OFS='/' -v RS='\0' '
  { file=substr($NF,1,8); $NF=""; dir=$0 }
  seen[dir file]++ { dupl[dir] }
  END { for (d in dupl) print d }'

Here, only regular files whose names (roughly) start with a date are recursively searched for in /path/to/dir. The found file paths are piped to awk as a NUL-separated stream of records (which is why GNU awk is needed, for RS='\0'). For each record, only the first eight characters of the last component (the file name) are kept, and the resulting path is counted in an associative array. When a duplicate is found, the directory part (i.e. the path with the file name component removed) is remembered and printed at the end.
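The field manipulation at the heart of that awk program can be tried on its own. A minimal sketch with invented paths and, for simplicity, newline-separated records in place of the NUL separator used above:

```shell
# dir1 holds two files with the same date prefix, dir2 only one.
# Setting $NF="" drops the file name, and OFS='/' rebuilds $0 as the
# directory part (with a trailing slash left by the removed component).
printf 'dir1/20200403foo\ndir1/20200403bar\ndir2/20200403foo\n' \
  | awk -v FS='/' -v OFS='/' '
    { file = substr($NF, 1, 8); $NF = ""; dir = $0 }
    seen[dir file]++ { dupl[dir] }
    END { for (d in dupl) print d }'
# prints "dir1/"
```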

fra-san
  • 10,205