
I created my first shell script a few days ago, and while testing it on a few files during development, it worked flawlessly. Now in practice, however, I have over 12,000 files to edit with it, and it is going VERY slowly. Is it possible to make it faster? I tried to shorten this part:

grep -rl "${id[$j]}" ../usage --exclude-dir="*/.git*" --exclude=*.{png,jpg,pdf} --include=*.dita | xargs sed -i "s/_[0-9]\+\"/_$apps.$title\"/g";

grep -rl "${id[$j]}" ../usage --exclude-dir="*/.git*" --exclude=*.{png,jpg,pdf} --include=*.dita | xargs sed -i "s/_[0-9]\+\//_$apps.$title\//g";

But I wasn't able to make the two work chained together with operators:

grep -rl "${id[$j]}" ../usage --exclude-dir="*/.git*" --exclude=*.{png,jpg,pdf} --include=*.dita | xargs sed -i "s/_[0-9]\+\"/_$apps.$title\"/g" | xargs sed -i "s/_[0-9]\+\//_$apps.$title\//g";

I also tried the && operator; it works on files containing both cases, but I need the second sed to run even if the first one fails.
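One way to avoid piping `xargs` into `xargs` is to give a single `sed` both expressions with `-e`: every matched file gets both substitutions attempted, so the second runs even where the first matched nothing. A minimal sketch, assuming GNU sed; the sample file name and the values for `$apps`/`$title` are made up:

```shell
# Made-up sample values standing in for $apps and $title.
apps="myapp"; title="intro"
printf 'ref_123"\nref_456/\n' > sample.dita
# One sed process, two expressions; each expression is tried on every line,
# independently of whether the other one matched.
sed -i -e "s/_[0-9]\+\"/_$apps.$title\"/g" \
       -e "s/_[0-9]\+\//_$apps.$title\//g" sample.dita
cat sample.dita
```

In the script this would collapse each pair of `grep … | xargs sed` pipelines into one, halving the number of scans over `../usage`.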

I would appreciate your suggestions. Here is my script:

len_1=($(find . -name "*.dita" -not -path "*/.git*"))
len=${#len_1[@]}
echo -e "${CYAN}Found $len objects for modifying...${OUTPUT}"
#echo $len

for ((i=0; i<len; i++)); do
    id=($(grep -Po 'id="\K[^"]+' ${len_1[$i]}))
    echo -e "${CYAN}Modifying ${len_1[$i]}${OUTPUT}"
    apps=$(grep -Po 'appname="\K[^"]+' ${len_1[$i]}) && title=$(grep -Po '<title>\K.*?(?=</title>)' ${len_1[$i]} | head -1) && sed -i "s/_[0-9]\+/_$apps.$title/g" ${len_1[$i]} && sed -i "s/id=\"[0-9]\+\"\+/id=\"$apps.$title\"/g" ${len_1[$i]};

    if [ ${#id[@]} -gt 0 ]
    then
        for ((j=0; j<${#id[@]}; j++)); do
            echo -e "${RED}Searching for ${id[$j]}...${OUTPUT}"
            grep -rl "${id[$j]}" ../usage --exclude-dir="*/.git*" --exclude=*.{png,jpg,pdf} --include=*.dita | xargs sed -i "s/_[0-9]\+\"/_$apps.$title\"/g" ;
            grep -rl "${id[$j]}" ../usage --exclude-dir="*/.git*" --exclude=*.{png,jpg,pdf} --include=*.dita | xargs sed -i "s/_[0-9]\+\//_$apps.$title\//g";
        done
    else
        echo -e "${RED}Didn't find IDs...${OUTPUT}";
    fi
done
revaljilji
  • How many files are there in the ../usage and . directories, excluding files in .git? Note that you will be reading the files in ../usage twice for each file in . so if there are 6,000 files in each and there is one line matching id=".*" in each you will be reading 72,000,000 files. You can also be launching a very large number of sed processes. Cutting this number in half is good, but 36,000,000 is still a lot! Can you edit the question to show some typical input files and desired output? – icarus Sep 02 '19 at 10:27
  • if you want it not to be abysmally slow, don't do it in a shell loop. use awk or perl or python (or almost anything except shell) for the entire job. See Why is using a shell loop to process text considered bad practice? – cas Sep 02 '19 at 11:23
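Building on the comments about the number of passes: instead of re-scanning `../usage` once per ID, the IDs can be joined into a single extended-regex alternation so each file is read only once. A rough sketch with made-up IDs and a sample file (assumes `grep -E`):

```shell
# Made-up sample data: two IDs that would otherwise need two grep passes.
printf 'see id_123 and ref_456 here\n' > usage_sample.dita
ids="id_123 ref_456"
# Join the IDs into one alternation: id_123|ref_456
pattern=$(printf '%s|' $ids); pattern=${pattern%|}
# A single scan now finds files matching any of the IDs.
grep -lE "$pattern" usage_sample.dita
```

In the real script, `$pattern` would be built from the `id` array and passed to one `grep -rlE` over `../usage` instead of looping `grep -rl` per element.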

1 Answer


What about matching the " or / and capturing it?

sed -i "s/_[0-9]\+\([\"\/]\)/_$apps.$title\1/g"

or, more readably, as

sed -i "s=_[0-9]\+\([\"/]\)=_$apps.$title\1=g"
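A self-contained demo of that combined substitution, assuming GNU sed (the file name and the `$apps`/`$title` values below are hypothetical):

```shell
# Made-up sample values standing in for $apps and $title.
apps="myapp"; title="intro"
printf 'id_123"\npath_456/\n' > demo.dita
# Match the trailing " or /, capture it, and put it back with \1,
# so one expression covers both cases.
sed -i "s=_[0-9]\+\([\"/]\)=_$apps.$title\1=g" demo.dita
cat demo.dita
```

Using `=` as the delimiter avoids having to escape the `/` inside the bracket expression.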
choroba