
I have a script called get_numbers.sh, which I want to use to extract data from .pdf files labelled sequentially by date, using pdfgrep.

Let me simplify my problem to what I believe are its essentials:

The .pdfs are named file-07-01.pdf, file-07-02.pdf, ..., file-07-31.pdf, where the numbers correspond to the month and day of the data in the file.

I can enter a shell command like pdfgrep -i "Total Number:" file-07-{01..12}.pdf

and I get exactly what I want, the appropriate text from each file for the dates 07-01 to 07-12.

I want to make a script for this, where all I have to do is enter the start and end dates as well as the month. This was my first go:

#!/usr/bin/bash
if [ "$#" -eq 3 ]
then
    START_DAY=$1 
    END_DAY=$2
    MONTH=$3
else
    echo "INCORRECT USAGE:"
    exit 1
fi   


pdfgrep -i "Total Number:" file-$MONTH-{$START_DAY..$END_DAY}.pdf

But doing bash get_numbers.sh 01 12 07 gives me the error message

pdfgrep: Could not open file-07-{01..12}.pdf

Looking around, I've realized that you have to be careful when doing this, because the script will interpret file-$MONTH-{$START_DAY..$END_DAY}.pdf as a literal string rather than a glob. I've tried assigning the expression to a variable and wrapping it in double quotes, but neither changes the result. What am I missing?
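
The behaviour can be reproduced without pdfgrep. The following is a minimal sketch, assuming bash 4 or later; the variable names and values are hypothetical stand-ins for the script's arguments. It compares a brace expression with literal bounds against one whose bounds come from variables.

```shell
#!/usr/bin/env bash
# Hypothetical values standing in for the script's arguments.
start_day=01
end_day=03

# Literal bounds: brace expansion generates the sequence.
literal=$(echo file-07-{01..03}.pdf)

# Variable bounds: bash performs brace expansion before parameter expansion.
# At that point the braces contain the text "$start_day..$end_day", which is
# not a valid sequence, so they are left alone; the variables are substituted
# only afterwards.
with_vars=$(echo file-07-{$start_day..$end_day}.pdf)

echo "$literal"
echo "$with_vars"
```

The first line printed lists the three file names. The second is file-07-{01..03}.pdf, the same kind of unexpanded string that pdfgrep complained about.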

nonreligious
  • It's the passing of arguments into the brace expansion that's the issue here; see for example "How can I use $variable in a shell brace expansion of a sequence?". Personally I'd use seq rather than the accepted answer based on eval. – steeldriver Aug 09 '21 at 16:41
  • @steeldriver Ah, that did the trick - thank you. Switched to a for loop with seq like one of the later answers/examples, because eval was doing something funny with the search pattern. You learn something every day! – nonreligious Aug 09 '21 at 17:27
  • fwiw not sure if you realize that you can still pass all the filenames to a single pdfgrep if you assemble them into an array: for m in $(seq -w "$1" "$2"); do files+=("file-$3-$m.pdf"); done then pdfgrep -i "Total Number:" "${files[@]}" – steeldriver Aug 09 '21 at 17:33
  • I see, yes I was doing the pdfgrep inside the loop - this does seem to be noticeably faster. Thanks again! – nonreligious Aug 09 '21 at 18:04
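
Put together, the approach from the comments might look like the sketch below. It assumes bash and a seq that supports -w (GNU coreutils does); the function names build_file_list and get_numbers are hypothetical, and argument validation is left out.

```shell
#!/usr/bin/env bash
# Fill the global array "files" with file-MONTH-DAY.pdf for each day in the
# range. Arguments: start day, end day, month (as in the question).
build_file_list() {
    local start_day=$1 end_day=$2 month=$3 day
    files=()
    # seq -w pads every number to the same width, so "01 12" gives 01 ... 12.
    for day in $(seq -w "$start_day" "$end_day"); do
        files+=("file-$month-$day.pdf")
    done
}

# Run pdfgrep once over the whole list instead of once per file.
get_numbers() {
    build_file_list "$@"
    pdfgrep -i "Total Number:" "${files[@]}"
}
```

Calling get_numbers 01 12 07 would then search file-07-01.pdf through file-07-12.pdf with a single pdfgrep invocation.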

1 Answer


Switch from bash to zsh, which has a decimal range glob operator. (Unlike bash, zsh can also use variables in {$start..$end}, but that is brace expansion, not a glob operator.)

#!/usr/bin/zsh -
if
  (( $# == 3 )) &&
    start_day=$1 end_day=$2 month=$3 &&
    [[ $start_day = <1-31> && $end_day = <1-31> && $month = <1-12> ]] &&
    (( end_day >= start_day ))
then
  pattern="file-<$month-$month>-<$start_day-$end_day>.pdf"
  exec pdfgrep -i "Total Number:" $~pattern(n) 
else
  print -u2 Incorrect usage
  exit 1
fi

<1-5> matches any sequence of decimal digits that represents an integer from 1 to 5, including 2, 02, 002 and so on. That's why the pattern also uses <$month-$month>: <2-2> matches both 2 and 02.
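
For readers who stay with bash, which has no <x-y> operator, the argument check can be approximated with a regular expression plus arithmetic. This is only a sketch: in_range is a hypothetical helper, and it validates a string rather than acting as a glob.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 is a decimal integer between $2 and $3.
# Leading zeros are allowed, so 2, 02 and 002 all count as 2.
in_range() {
    local value=$1 low=$2 high=$3
    # Reject anything that is not purely decimal digits.
    [[ $value =~ ^[0-9]+$ ]] || return 1
    # 10# forces base 10; without it, 08 and 09 would be read as invalid octal.
    (( 10#$value >= low && 10#$value <= high ))
}
```

With this helper, in_range "$start_day" 1 31 plays roughly the role of [[ $start_day = <1-31> ]] in the zsh script.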

The (n) glob qualifier makes the glob expansion sort numerically, so that, for instance, file-12-2.pdf comes before both file-12-03.pdf and file-12-13.pdf.
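
The effect of numeric ordering can be illustrated outside zsh with sort. The sketch below assumes names of the form file-MM-DD.pdf; sort_by_day is a hypothetical helper and not part of the zsh script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort names of the form file-MM-DD.pdf by the day field.
sort_by_day() {
    # -t- splits each line on "-"; -k3,3n compares the third field by its
    # leading number, so "2.pdf" (2) sorts before "03.pdf" (3) and "13.pdf" (13).
    sort -t- -k3,3n
}

printf '%s\n' file-12-13.pdf file-12-2.pdf file-12-03.pdf | sort_by_day
```

This prints file-12-2.pdf, file-12-03.pdf and file-12-13.pdf in that order, whereas a plain lexical sort would put file-12-03.pdf first.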