I'm converting .doc files to .txt using catdoc on Linux. To give the output file the same name as the input file, I'm replacing the .doc extension with .txt using parameter expansion. But many of the files end in .DOC. How can I make the .doc in ${filename%.doc}.txt case-insensitive while keeping the capitalization of the file name itself? I can't use ${filename%.*}.txt because some files have dots in their names.

My current code:

find "${COMPANYPATH}" -iname '*.doc' | while read -r file; do
    echo "${file}"
    filename=$(basename "${file}")
    path="${file%/*}/"
    mkdir -p "${OUTPUTPATH}/DOC/${path#$COMPANYPATH/}"
    catdoc "${file}" >> "${OUTPUTPATH}/DOC/${path#$COMPANYPATH}${filename%.doc}.txt"
done

Input

/home/user/test/2218-0/test.doc
/home/user/test/2218-0/Test2.DOC

Expected output

/home/user/output/test/DOC/2218-0/test.txt
/home/user/output/test/DOC/2218-0/Test2.txt

There are no duplicated files.

terdon

2 Answers

I don't think you can make the pattern match in ${filename%.doc} case-insensitive in Bash. You could do it in zsh with ${filename%(#i).doc} (requires setopt extendedglob), or in ksh93 with ${filename%~(i:.doc)}. Bash's nocasematch option doesn't help here; it applies to case and [[ .. ]] constructs, not to the % suffix-removal expansion.

In any POSIX shell, there's always the workaround of explicitly listing both the uppercase and lowercase characters with ${filename%.[dD][oO][cC]}. Or just remove the dot and the last three characters with ${filename%.???}, knowing that find only gives you names with the right extension.

Then again, ${filename%.*} removes only the shortest matching suffix, i.e. everything from the last dot onward, so dots elsewhere in the file name shouldn't be a problem. (%% would remove the longest match.)
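
To make the difference between % and %% concrete, here is a small sketch that works in any POSIX shell. The file name is just an illustrative value:

```shell
filename=foo.bar.DoC

# % removes the shortest suffix matching .* (from the last dot onward)
echo "${filename%.*}"     # prints foo.bar

# %% removes the longest suffix matching .* (from the first dot onward)
echo "${filename%%.*}"    # prints foo
```

So for stripping an extension you want the single %, which leaves any earlier dots in the name alone.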

zsh:

% setopt extendedglob
% filename=foo.bar.DoC
% echo ${filename%.(#i)doc}.txt
foo.bar.txt

sh/Bash:

$ filename=foo.bar.DoC
$ echo "${filename%.[dD][oO][cC]}.txt"
foo.bar.txt
$ echo "${filename%.*}.txt"
foo.bar.txt
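
If you use the bracket workaround in more than one place, it may be handy to wrap it in a small function. This is only a sketch: txt_name is a made-up helper name, not part of catdoc or any standard tool.

```shell
#!/bin/sh
# txt_name: hypothetical helper for illustration only.
# Strips a trailing .doc in any letter case, then appends .txt.
txt_name() {
    printf '%s.txt\n' "${1%.[dD][oO][cC]}"
}

txt_name "test.doc"        # prints test.txt
txt_name "Test2.DOC"       # prints Test2.txt
txt_name "report.v2.DoC"   # prints report.v2.txt
txt_name "notes.docx"      # prints notes.docx.txt (no .doc suffix, so nothing is stripped)
```

The last call shows that the pattern only removes an exact .doc suffix; anything else is left untouched before .txt is appended.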
ilkkachu

You don't. Just remove the extension entirely instead:

find "${COMPANYPATH}" -iname '*.doc' | while read -r file; do
    echo "${file}"
    filename=$(basename "${file}")
    name="${filename%.*}"
    path="${file%/*}"
    noComPath="${path#$COMPANYPATH/}"
    mkdir -p "${OUTPUTPATH}/DOC/$noComPath"
    catdoc "${file}" >> "${OUTPUTPATH}/DOC/$noComPath/$name.txt"
done

The expansion ${var%.*} removes everything from the last . to the end of the value, so name ends up holding the file name without its extension. If there are several dots, only the last one and whatever follows it are removed:

$ foo=file.foo.bar.DoC
$ echo "${foo%.*}"
file.foo.bar
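
One thing to watch for, shown in the sketch below with an invented path: if you apply %.* to a full path rather than to the bare file name, the last dot may sit in a directory name. With find -iname '*.doc' the base name always contains a dot, so this can't bite here, but it is a reason to strip the directory part first.

```shell
# Illustrative path: the file has no extension, but a directory contains a dot.
path=/home/user/v1.2/README

echo "${path%.*}"      # prints /home/user/v1 (the cut lands in the directory name)

base=${path##*/}       # strip the directory part first
echo "${base%.*}"      # prints README (no dot in the name, so nothing is removed)
```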

And here is a more robust version that can deal with arbitrary file names (yours would fail if a file name contained a newline character, for instance):

LC_ALL=C find "${COMPANYPATH}" -iname '*.doc' -type f -print0 |
  while IFS= read -r -d '' file; do
    printf>&2 'Processing "%s"\n' "${file}"
    basename="${file##*/}"
    dirname="${file%/*}"
    rootname="${basename%.*}"
    targetdir=${OUTPUTPATH}/DOC/${dirname#"${COMPANYPATH}/"}
    mkdir -p -- "${targetdir}" &&
      catdoc -- "${file}" >> "${targetdir}/${rootname}.txt"
  done
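
To check the path handling without running catdoc, the same expansions can be put into a dry-run function. This is a sketch only: target_for is a made-up name, and the values of COMPANYPATH and OUTPUTPATH are assumptions inferred from the example paths in the question.

```shell
#!/bin/bash
# Assumed values, inferred from the question's input and expected output.
COMPANYPATH=/home/user/test
OUTPUTPATH=/home/user/output/test

# target_for: hypothetical helper that prints where the .txt file would go.
target_for() {
    local file=$1
    local basename=${file##*/}      # file name without the directory
    local dirname=${file%/*}        # directory without the file name
    local rootname=${basename%.*}   # file name without its last extension
    printf '%s\n' "${OUTPUTPATH}/DOC/${dirname#"${COMPANYPATH}/"}/${rootname}.txt"
}

target_for /home/user/test/2218-0/test.doc    # prints /home/user/output/test/DOC/2218-0/test.txt
target_for /home/user/test/2218-0/Test2.DOC   # prints /home/user/output/test/DOC/2218-0/Test2.txt
```

Both results match the expected output listed in the question, with the capitalization of the file name preserved.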
terdon
  • Thank you for your recommendation of a more robust version. I decided to mark @ilkkachu's answer as the accepted one because it fits the initial question better. – unixcandles Mar 30 '24 at 20:59
  • @unixcandles Of course, it is a better answer that focuses on exactly what you need! In any case, there's never any need to justify which answer you choose; it is entirely up to you :). – terdon Mar 30 '24 at 21:35