2

I am using the while loop below to read a file.

while read file
do
    FileFound="`find $DataDir -name $file -print 2>/dev/null`"
    if [ -n "$FileFound" ]; then
        echo $FileFound >> ${runDir}/st_$Region
    else
        echo $file >> ${APP_HOME}/${Region}_filesnotfound_$date.txt
    fi
done < ${Region}_${date}.txt

This while loop reads a file name and searches $DataDir for a match. If a match is found, the full path is appended to one file; if not, the name is appended to a different file. However, this script is taking two days to process 8000 records. Is there a way to optimize it?

5 Answers

2

If you're on a modern Linux desktop, you probably have a file-indexing tool like mlocate already installed and indexing files in the background. If so, you can just use that:

while read file
do
    locate "$file" >> "${runDir}/st_$Region" || echo "$file" >> "${APP_HOME}/${Region}_filesnotfound_$date.txt"
done<"${Region}_${date}.txt"

If the files you're looking for are updated frequently, you might first manually force the database to update with updatedb or whatever is appropriate for your version of locate.
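
If the same names may also exist outside $DataDir, a variation along these lines (a sketch, assuming an mlocate- or plocate-style locate that supports -b for base-name matching) keeps only the hits under the data directory; note that locate does a substring match on the name unless the pattern is anchored:

while read -r file
do
    # -b matches against the base name only; the grep keeps paths under $DataDir
    locate -b -- "$file" | grep -F "$DataDir/" >> "${runDir}/st_$Region" \
        || echo "$file" >> "${APP_HOME}/${Region}_filesnotfound_$date.txt"
done < "${Region}_${date}.txt"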

1

This script reports only one occurrence of a particular file, so if two files with the same name exist in different directories, only one of them will be reported. It has not been tested.

declare -A arr              # associative array: base name -> full path
tmp1=$$tmp1                 # temp files carry the PID in their names

while read -r file
do
    base=$(basename "$file")
    echo "$base" >> "$tmp1"
    arr["$base"]="$file"
done < <(find "$DataDir")

sort -u "$tmp1" -o "$tmp1"
tmp2=$$tmp2
sort -u "${Region}_${date}.txt" > "$tmp2"

# names present in both lists: print the full path recorded for them
join "$tmp1" "$tmp2" | while read -r file
do
    echo "${arr[$file]}" >> "${runDir}/st_$Region"
done

# names requested but not found anywhere under $DataDir
comm -13 "$tmp1" "$tmp2" | while read -r file
do
    echo "$file" >> "${APP_HOME}/${Region}_filesnotfound_$date.txt"
done

rm "$tmp1" "$tmp2"
  • Hmm, a find command which -execs a shell function is unlikely to work. Also some typos: $$variable, find --exec. – xhienne Jan 10 '17 at 03:16
  • @xhienne I corrected the script. I use $$tmp to create a tmp file with the PID in its name. –  Jan 10 '17 at 13:46
1

With xargs + find

One solution is to use xargs to build insanely long find commands that will search for thousands of files at once:

sed -e 's/^/-o -name /' "${Region}_${date}.txt" \
| xargs find "$DataDir" -false \
> "${runDir}/st_$Region"

The first sed command turns each filename into the expression -o -name filename, which xargs will append to the find command. Then xargs executes the find command(s) it has built. The result is stored directly in the st_$Region file.
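
For instance, with a list containing just two hypothetical names, a.txt and b.log, the command that xargs ends up running is roughly:

find "$DataDir" -false -o -name a.txt -o -name b.log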

Fine. But how are we going to build ${Region}_filesnotfound_$date.txt, the list of files that were not found? Simply by comparing the full original list with the list of files that were found:

comm -3 \
    <(sort -u "${Region}_${date}.txt") \
    <(xargs -L1 basename < "${runDir}/st_$Region" | sort -u) \
    > "${Region}_filesnotfound_$date.txt"

comm -3 suppresses the lines common to both files. The two "files" are actually process substitutions: the second one is the result of applying basename to each file that was found. Both lists are sorted.
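
To illustrate with hypothetical names: if a.txt, b.log and c.csv were requested but only a.txt and c.csv were found, comm -3 leaves just the missing name:

# both inputs must be sorted
comm -3 <(printf 'a.txt\nb.log\nc.csv\n') <(printf 'a.txt\nc.csv\n')
# prints: b.log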

With find + grep

Another solution is to grep the filenames from the output of find. grep offers the possibility (via the -f option) to search for a series of patterns stored in a file. We have a series of filenames in a file. Let's make it a pattern list and feed it to grep:

find "$DataDir" \
| grep -f <(sed 's|.*|/&$|' "${Region}_${date}.txt") \
> "${runDir}/st_$Region"

The sed command is mandatory: it anchors the filename to search at the end of the path.
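
For a hypothetical name like a.txt, the resulting pattern is /a.txt$, which can only match the last component of a path:

echo 'a.txt' | sed 's|.*|/&$|'
# prints: /a.txt$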

As for the list of missing files, it would be built the same way as the other solution.

The problem with this solution is that filenames may contain characters that may be interpreted by grep: ., *, [, etc. We would have to escape them with sed (I leave it as an exercise to the reader). That's why the first solution is to be preferred IMHO.
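
One possible sketch of that escaping step (untested here): backslash-escape the basic-RE metacharacters before adding the anchor, so that grep treats the filenames literally:

find "$DataDir" \
| grep -f <(sed -e 's/[].[^$*\\]/\\&/g' -e 's|.*|/&$|' "${Region}_${date}.txt") \
> "${runDir}/st_$Region"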

Finally, note that I have used some bashisms here (e.g. process substitutions <(...)). Don't expect any of my solutions to be POSIX compliant.

xhienne
0

For each iteration, you're crawling the whole directory tree. You'd want to run find only once. With GNU tools:

find "$DataDir" -print0 |
  FOUND=${runDir}/st_$Region \
  NOTFOUND=${APP_HOME}/${Region}_filesnotfound_$date.txt \
  awk -F/ '
    ARGIND == 1 {files[$0]; notfound[$0]; next}
    $NF in files {print > ENVIRON["FOUND"]; unset notfound[$0]}
    END {
      for (f in notfound) print f > ENVIRON["NOTFOUND"]
    }'  "${Region}_${date}.txt" RS='\0' -
-1

The slow part of this script is the find that searches the whole of your $DataDir for each name. By moving much of that work outside the loop you should be able to achieve a significant time saving:

ftmp=$(mktemp -t)
# Strip the directories so the base names can be matched line-for-line
find "$DataDir" 2>/dev/null | sed 's|.*/||' >"$ftmp"

while IFS= read -r file
do
    if grep -Fx -q "$file" "$ftmp"    # No RE patterns. Match full line
    then
        echo "$file" >>"$runDir/st_$Region"
    else
        echo "$file" >>"${APP_HOME}/${Region}_filesnotfound_$date.txt"
    fi
done <"${Region}_${date}.txt"

rm -f "$ftmp"

If your list of files in ${Region}_${date}.txt is really large, you might get further savings by passing the entire file to grep and then using comm to identify the unmatched entries from the full list and the set of matches. The downside here is that, because comm requires sorted lists, the output lists will also end up sorted:

fdata=$(mktemp -t)
fmatch=$(mktemp -t)
# Again keep only the base names so they can be matched against the list
find "$DataDir" 2>/dev/null | sed 's|.*/||' >"$fdata"

# No RE patterns. Match full line
grep -Fx -f "${Region}_${date}.txt" "$fdata" |
    tee -a "$runDir/st_$Region" |
    sort >"$fmatch"

# Pick out the filenames that didn't match
sort "${Region}_${date}.txt" |
    comm -23 - "$fmatch" >>"${APP_HOME}/${Region}_filesnotfound_$date.txt"

rm -f "$fdata" "$fmatch"
Chris Davies