I am trying to process a large set of files, appending specific lines to the file "test_result.txt". I achieved it, not very elegantly, with the following code.

for i in *merged; do
        while read -r lo; do
                if [[ $lo == *"ID"* ]]; then
                echo $lo >> test_result.txt
                fi
                if [[ $lo == *"Instance"* ]]; then
                echo $lo >> test_result.txt
                fi
                if [[ $lo == *"NOT"* ]]; then
                echo $lo >> test_result.txt
                fi
                if [[ $lo == *"AI"* ]]; then
                echo $lo >> test_result.txt
                fi
                if [[ $lo == *"Sitting"* ]]; then
                echo $lo >> test_result.txt
                fi
        done < $i
done

However, I am trying to size it down using an array, which has so far resulted in quite an unsuccessful attempt.

KEYWORDS=("ID" "Instance" "NOT" "AI" "Sitting" )
KEY_COUNT=0

for i in *merged; do
        while read -r lo; do
                if [[$lo == ${KEYWORDS[@]} ]]; then
                echo $lo >> ~/Desktop/test_result.txt && KEY_COUNT="`expr $KEY_COUNT + 1`"
                fi
        done < $i
done
Rui F Ribeiro
madArch

2 Answers

It looks like you want to get all the lines that contain at least one of a set of words, from a set of files.

Assuming that you don't have many thousands of files, you could do that with a single grep command:

grep -wE '(ID|Instance|NOT|AI|Sitting)' ./*merged >outputfile

This would extract the lines matching any of the words listed in the pattern from the files whose names match *merged.

The -w option to grep ensures that the given strings are not matched as substrings (e.g. NOT will not be matched in NOTICE). The -E option enables the alternation with | in the pattern.

Add the -h option to the command if you don't want the names of the files containing matching lines in the output.
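For example, with two hypothetical files a.merged and b.merged that each contain one matching line, grep prefixes every hit with the name of the file it came from when more than one file is searched, and -h suppresses that prefix:

$ grep -wE '(ID|Instance|NOT|AI|Sitting)' ./*merged
./a.merged:Instance 42 started
./b.merged:NOT ready
$ grep -hwE '(ID|Instance|NOT|AI|Sitting)' ./*merged
Instance 42 started
NOT ready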

If you do have many thousands of files, the above command may fail because the expanded command line becomes too long. In that case, you may want to do something like

for file in ./*merged; do
    grep -wE '(ID|Instance|NOT|AI|Sitting)' "$file"
done >outputfile

which would run the grep command once on each file, or,

find . -maxdepth 1 -type f -name '*merged' \
    -exec grep -wE '(ID|Instance|NOT|AI|Sitting)' {} + >outputfile

which would invoke grep as few times as possible, passing as many files as possible to each invocation.

Kusalananda

Adding an array doesn't particularly help: you would still need to loop over the elements of the array (see How do I test if an item is in a bash array?):

while read -r lo; do
    for keyword in "${KEYWORDS[@]}"; do
        if [[ $lo == *"$keyword"* ]]; then
            echo "$lo" >> ~/Desktop/test_result.txt && KEY_COUNT=$((KEY_COUNT + 1))
        fi
    done
done < "$i"

It might be better to use a case statement (the grouped pattern below uses bash's extended globbing, which must be enabled with shopt -s extglob):

shopt -s extglob    # enable extended globbing for the @( ) pattern group

while read -r lo; do
    case $lo in
    *@(ID|Instance|NOT|AI|Sitting)*)
        echo "$lo" >> ~/Desktop/test_result.txt && KEY_COUNT=$((KEY_COUNT + 1))
        ;;
    esac
done < "$i"

(I assume you do further processing of these lines within the loop. If not, grep or awk could do this more efficiently.)
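If no per-line processing is needed, a minimal awk sketch could replace the whole loop; note that it matches the words as substrings (like the original [[ $lo == *...* ]] tests, unlike grep -w) and keeps a KEY_COUNT-style tally. The output file name is taken from the question:

awk '/ID|Instance|NOT|AI|Sitting/ {
    print >> "test_result.txt"    # append each matching line
    count++                       # running tally, like KEY_COUNT
}
END { print count+0, "matching lines" }' ./*merged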

muru