A portable solution that also correctly handles arbitrary file paths, including those containing newline characters, at the price of some inelegance and slowness:
find /path/to/dir -type d \( -exec sh -c '
  cd "$1"
  printf "%s/" [0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]* \
    | awk -v RS="/" "seen[substr(\$0,1,8)]++ { exit 1 }"
' mysh {} \; -o -print \)
find is used to recursively search for directories in /path/to/dir and, in each directory found, to execute a short shell script. The script pipes the names of the files matching a pattern that roughly resembles a date (the yyyymmdd format is assumed), each name terminated by a / (the one character, besides NUL, that cannot appear in a file name), to an awk instance that reads /-separated records and exits with a status of 1 as soon as two records share the same first eight characters, i.e. the same date. A non-zero exit status makes the -exec test false, so find falls through to the -o alternative and -prints the directory name.
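As a quick, purely illustrative check, the sketch below builds a hypothetical test tree (the /tmp/datedirs path and the file names are made up for this example) in which only the dup directory contains two files sharing the same eight-digit prefix, then runs the command above against it:

# hypothetical test data: "dup" holds two files sharing the 20240101 prefix,
# "nodup" holds files with distinct date prefixes
mkdir -p /tmp/datedirs/dup /tmp/datedirs/nodup
touch /tmp/datedirs/dup/20240101_report.txt /tmp/datedirs/dup/20240101_notes.txt
touch /tmp/datedirs/nodup/20240101_report.txt /tmp/datedirs/nodup/20240102_report.txt

find /tmp/datedirs -type d \( -exec sh -c '
  cd "$1"
  printf "%s/" [0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]* \
    | awk -v RS="/" "seen[substr(\$0,1,8)]++ { exit 1 }"
' mysh {} \; -o -print \)

Only /tmp/datedirs/dup should be printed: /tmp/datedirs itself and nodup yield no duplicate eight-character prefix, so their scripts exit with status 0 and the -print branch is never reached for them.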
A faster alternative based on GNU tools:
find /path/to/dir -type f -name '[0123456789][0123456789][0123456789][0123456789][01][0123456789][0123][0123456789]*' \
  -print0 | awk -v FS='/' -v OFS='/' -v RS='\0' '
    { file=substr($NF,1,8); $NF=""; dir=$0 }
    seen[dir file]++ { dupl[dir] }
    END { for (d in dupl) print d }'
Here, only regular files whose names (roughly) start with a date are recursively searched for in /path/to/dir. The paths of the files found are piped to awk as a NUL-separated stream of records. For each record, only the first eight characters of the last /-separated field (the file name) are kept, and that prefix, together with the path with the file name blanked out, is used as a key into an associative array. Whenever a key occurs a second time, its directory part (i.e. the path with the file name component removed) is recorded, and all the directories collected this way are printed once the whole stream has been read.
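To make the field manipulation concrete, here is a small, purely illustrative sketch (assuming GNU awk for the NUL record separator) that feeds a single NUL-terminated path, borrowed from the hypothetical test tree above, through the same assignments and prints what ends up in dir and file:

printf '/tmp/datedirs/dup/20240101_report.txt\0' \
  | awk -v FS='/' -v OFS='/' -v RS='\0' '
    { file = substr($NF,1,8); $NF = ""; dir = $0
      print "dir=" dir " file=" file }'
# prints: dir=/tmp/datedirs/dup/ file=20240101

Note that blanking the last field leaves a trailing / in dir, so, run against the test tree above, the GNU pipeline would report the duplicate-holding directory as /tmp/datedirs/dup/ rather than /tmp/datedirs/dup.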