
I have a large project for which I'm trying to find directories that don't contain a *_out.csv file. I have looked at other similar answers and I think I am almost there.

The problem I'm running into is that I only want to look in directories below analysis/, but I also want to skip a few specific directories that are also below analysis/.

I have set up a small example problem:

$ tree
.
├── case1
│   ├── analysis
│   │   ├── test1
│   │   │   ├── gold
│   │   │   └── test1_out.csv
│   │   └── test2
│   └── doc
└── case2
    ├── analysis
    │   ├── test3
    │   │   └── gold
    │   └── test4
    │       └── test4_out.csv
    └── doc

12 directories, 2 files

I don't want to look in directories titled */doc/* or */gold/*. My current command is:

find . -type d -not -name "doc" -not -name "gold" '!' -exec test -e "{}/*_out.csv" ';' -print

Which results in:

.
./case1
./case1/analysis
./case1/analysis/test1
./case1/analysis/test2
./case2
./case2/analysis
./case2/analysis/test3
./case2/analysis/test4

My ideal output would look like

./case1/analysis/test2
./case2/analysis/test3

So as you can see, my current find command excludes the doc and gold directories, but it excludes neither the directories that contain a *_out.csv file nor the directories that are not below analysis/.

Jeff Schaller
dylanjm

2 Answers

1

So you want to look under directories of the form */analysis, excluding certain subdirectories.

Instead of searching everything under ., search only under */analysis.

To exclude a subdirectory, use -prune. This is an action that tells find not to traverse that subdirectory recursively.
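As a minimal illustration of -prune on its own, the sketch below rebuilds the example tree from the question in a scratch directory and lists the directories under */analysis while skipping gold. The variable names (demo, pruned) and the use of mktemp -d are assumptions made for the illustration, not part of the original commands.

```shell
#!/bin/sh
# Rebuild the question's example tree in a scratch directory (hypothetical setup).
demo=$(mktemp -d) || exit 1
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/case1/analysis/test1/gold" "$demo/case1/analysis/test2" "$demo/case1/doc" \
         "$demo/case2/analysis/test3/gold" "$demo/case2/analysis/test4" "$demo/case2/doc"
touch "$demo/case1/analysis/test1/test1_out.csv" "$demo/case2/analysis/test4/test4_out.csv"
cd "$demo" || exit 1

# "-name gold -prune" matches gold and tells find not to descend into it.
# "-o" then applies "-type d -print" to everything that was not pruned,
# so gold itself is not printed either.
pruned=$(find ./*/analysis -name gold -prune -o -type d -print | LC_ALL=C sort)
printf '%s\n' "$pruned"
```

The output is sorted only to make it reproducible; find itself reports entries in directory order.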

Finally, to test whether a file exists that matches a pattern, you need to invoke a shell. You're invoking test directly from find, but test doesn't do pattern matching, so it's only testing the existence of a file whose name contains a literal * character. Invoke sh, passing it the name of the directory as an argument: -exec sh -c '…' {} \;. In the sh code, expand a wildcard to generate the list of matching files, and check if there is at least one existing file.

find ./*/analysis -name "doc" -prune -o -name "gold" -prune -o \
     -type d \! -exec sh -c 'set -- "$0"/*_out.csv; test -e "$1"' {} ';' -print

(I assume that there are no dangling symbolic links whose names end with _out.csv.)
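To see why the shell is needed, and what the command prints on the question's example tree, here is a minimal sketch. It rebuilds that tree in a scratch directory; the variable names and the use of mktemp -d are assumptions for the illustration. Note that the starting points ./case1/analysis and ./case2/analysis contain no *_out.csv file themselves, so the command as written also lists them. The sketch adds -mindepth 1, which is not in POSIX but is supported by GNU and BSD find, to leave the starting points out.

```shell
#!/bin/sh
# Rebuild the question's example tree in a scratch directory (hypothetical setup).
demo=$(mktemp -d) || exit 1
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/case1/analysis/test1/gold" "$demo/case1/analysis/test2" "$demo/case1/doc" \
         "$demo/case2/analysis/test3/gold" "$demo/case2/analysis/test4" "$demo/case2/doc"
touch "$demo/case1/analysis/test1/test1_out.csv" "$demo/case2/analysis/test4/test4_out.csv"
cd "$demo" || exit 1

# test does no pattern matching: this checks for a file literally named "*_out.csv".
if test -e "./case1/analysis/test1/*_out.csv"; then literal_result=found; else literal_result=missing; fi

# Let the shell expand the pattern first, then test the first resulting word.
set -- ./case1/analysis/test1/*_out.csv
if test -e "$1"; then glob_result=found; else glob_result=missing; fi
echo "literal: $literal_result, globbed: $glob_result"

# The full command, with -mindepth 1 added so the analysis directories
# themselves are not reported.
missing_dirs=$(find ./*/analysis -mindepth 1 -name doc -prune -o -name gold -prune -o \
    -type d ! -exec sh -c 'set -- "$0"/*_out.csv; test -e "$1"' {} ';' -print | LC_ALL=C sort)
printf '%s\n' "$missing_dirs"
```

If the pattern matches nothing, the shell leaves it unexpanded, so "$1" is the literal pattern and test -e fails, which is exactly the case the negated -exec is meant to catch.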

1

Your task is a duplicate of this question. The same strategy will work:

  1. Find all your *_out.csv files, strip the file name from each path to leave the containing directory, and de-duplicate the list.

  2. Find all the directories that you hope would have *_out.csv files, and remove the entries of the list from step 1 from that list.

This script does that, printing a label before each list:

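echo "csv files exist in:"
# Strip the trailing /filename from each path, leaving the directory.
find . -type f -name \*_out.csv | sed -e 's/\/[^\/]*$//' |
    sort -u | tee csv-dirs.txt

echo
echo "dirs we hope would have csv's:"
# Keep directories below analysis/, dropping doc, gold and anything under them.
find . -type d | grep '/analysis/' | grep -Ev '/(doc|gold)(/.*)?$' |
    tee all-dirs.txt

echo
echo "all dirs less the ones that do have csv's:"
# -F: fixed strings, -x: whole-line match, -f: read patterns from the file.
grep -vxFf csv-dirs.txt all-dirs.txt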

Condensed a little, that could be just:

$ find . -type f -name \*_out.csv |
    sed -e 's/\/[^\/]*$//' | sort -u > csv-dirs.txt
$ find . -type d | grep '/analysis/' |
    grep -Ev '/(doc|gold)(/.*)?$' | grep -vxFf csv-dirs.txt
./case1/analysis/test2
./case2/analysis/test3
Jim L.