
I have a tricky, complex program that pre-processes text before sending it to machine learning software.

To make a long story short:

The bash script enters a folder containing thousands of text files, opens each one with cat, cleans it by deleting superfluous lines and then, before sending the files to the machine learning process, writes a CSV to disk with some information for later human checking.

It is very important to keep each line number next to its content, because the order in which words appear is key for the ML process.

So my approach is to add a line number to every line this way (a one-liner with many piped commands):

for file in *.txt
do

cat -v "$file" | nl -nrz -w4 -s$'\t' | .......

Then I get rid of undesired lines this way (sample):

 ...... | sed '/^$/d'| grep -vEi 'unsettling|aforementioned|ruled' 

and finally keep two lines (each matching line plus the one after it) for further processing this way:

........ | grep -A 1 -Ei 'university|institute|trust|college'
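As a minimal sketch of the steps above (numbering, dropping unwanted lines, keeping each match plus the following line), the snippet below runs them on a made-up four-line sample. The sample text and the function name `number_and_filter` are illustrative assumptions, not part of the original script.

```bash
#!/usr/bin/env bash
# Made-up sample standing in for the content of one text file.
sample=$'first line\nthe university of Oslo\nan unsettling remark\nis near the station'

# Hypothetical helper: number first, filter afterwards, so every
# surviving line still carries the number it had in the input.
number_and_filter() {
    nl -nrz -w4 -s$'\t' |
        sed '/^$/d' |
        grep -vEi 'unsettling|aforementioned|ruled' |
        grep -A 1 -Ei 'university|institute|trust|college'
}

result=$(printf '%s\n' "$sample" | number_and_filter)
printf '%s\n' "$result"
```

Line 0003 is removed by the first grep, so the second grep pairs 0002 with 0004: the numbers refer to the input file, not to the filtered stream. Note that nl by default numbers only non-empty lines; nl -ba would count blank lines as well.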

The output is something like this (sampling two files):

file 1.txt
0098  university of Goteborg is downtown and is one of the
0099  most beautiful building you can visit

0123  the institute of Oslo for advanced investigation
0124  is near the central station and keeps

0234  most important college of Munich
0235  and the most acclaimed teachers are

file 2.txt
0023  there is no trust or confidence
0024  in the counselor to accomplish the new

0182  usually the college is visited
0183  every term for the president but

[EDITED] I missed this step, which I had put on the wrong line. Sorry.

Then, the text is stacked into "paragraphs" this way:

tr '\n\r' ' '| grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}'

[END EDIT]

This output is saved in the variable CLEANED_TXT and fed into a while loop, this way:

while read everyline; do 

    if [[ -n "${everyline// }" ]];then

            echo "$file;$linenumber;$everyline" >> output.csv
    fi  

    done <<< "$CLEANED_TXT"

done  # for every text file

FINAL DESIRED OUTPUT

file 1.txt;0098;university of Goteborg
file 1.txt;0123;the institute of Oslo
file 1.txt;0234;college of Munich

My issue is that the line number is lost at this last step because of the grep just before the loop. Keep in mind that I need the original line number; re-numbering inside the loop is not an option.
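A minimal reproduction of the problem, assuming two numbered lines taken from the sample output above: once newlines are turned into spaces and grep -o prints only the matched text, everything before the keyword, including the number, is discarded. The variable names are illustrative.

```bash
#!/usr/bin/env bash
# Two numbered lines as produced by the earlier pipeline stages.
numbered=$'0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit'

# Same last step as in the script: join lines, then print only the
# keyword plus up to 25 following characters.
flattened=$(printf '%s\n' "$numbered" |
    tr '\n\r' ' ' |
    grep -Eio 'university.{0,25}|college.{0,25}')

# Brackets make the boundaries of the value visible.
printf '[%s]\n' "$flattened"
```

The printed value starts at "university", so the 0098 prefix is no longer available to the loop that follows.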

I'm stuck. Any help would be much appreciated.

Regards

jomaweb

1 Answer

UPDATE2: get rid of the whole tr ... | grep line (it just makes a mess) and replace your while loop with:

# read puts the first field (the number) in linenumber, the rest in everyline
while read -r linenumber everyline; do
        everyline=$(echo "$everyline" | grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}')
        if [[ -n "$everyline" ]]; then
            echo "$file;$linenumber;$everyline" >> output.csv
        fi
done <<< "$CLEANED_TXT"

That will populate $linenumber with the correct value and put the matched words in the right place:

file1.txt;0098;university of Goteborg is downtown
file1.txt;0234;college of Munich
file2.txt;0182;college is visited
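For a version that can be run on its own, here is a small sketch of the same loop wrapped in a function and fed with sample data. The function name `extract_matches` and the sample lines are assumptions made for illustration; the no-op `.{0,0}` prefixes are left out of the pattern.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the loop: reads numbered lines on stdin,
# prints "file;number;match" for every line containing a keyword.
extract_matches() {
    local file=$1 linenumber everyline match
    # Word splitting puts the leading number in linenumber and the
    # remainder of the line in everyline.
    while read -r linenumber everyline; do
        # grep exits non-zero when nothing matches; that is expected.
        match=$(printf '%s\n' "$everyline" |
            grep -Eio 'university.{0,25}|college.{0,25}') || true
        if [[ -n "$match" ]]; then
            printf '%s;%s;%s\n' "$file" "$linenumber" "$match"
        fi
    done
}

# Sample input shaped like the output of grep -A 1, including the
# "--" separator grep prints between groups.
cleaned_txt=$'0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit\n--\n0234\tmost important college of Munich\n0235\tand the most acclaimed teachers are'

csv=$(extract_matches "file 1.txt" <<< "$cleaned_txt")
printf '%s\n' "$csv"
```

The "--" separator lines and the context lines without a keyword produce no match and are skipped, so only the 0098 and 0234 rows are printed.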

Note, however, that the whole thing is a mess and should really be rewritten in Perl, awk or similar.
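As a rough idea of what an awk version could look like (a sketch under assumptions, not a drop-in replacement for the full script): awk already tracks the line number of each input file in FNR, so nothing has to be carried through a pipeline. The function name `extract_with_awk` and the sample file are made up, the keyword lists are copied from the question, and only the first keyword on each line is reported.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one awk pass over the given files.
extract_with_awk() {
    awk '
    {
        # Work on a lower-case copy for case-insensitive matching.
        line = tolower($0)
        # Drop lines containing unwanted words.
        if (line ~ /unsettling|aforementioned|ruled/) next
        # match() sets RSTART and RLENGTH for the first keyword found.
        if (match(line, /university|college/)) {
            # Keyword plus up to 25 following characters, original case.
            printf "%s;%04d;%s\n", FILENAME, FNR, substr($0, RSTART, RLENGTH + 25)
        }
    }' "$@"
}

# Demo on a throwaway file with made-up content.
tmpdir=$(mktemp -d)
printf '%s\n' 'intro' \
    'the University of Goteborg is downtown and is one of the' \
    'most beautiful building you can visit' > "$tmpdir/sample.txt"
awk_csv=$(cd "$tmpdir" && extract_with_awk sample.txt)
rm -rf "$tmpdir"
printf '%s\n' "$awk_csv"
```

Because FNR restarts at 1 for every file, the helper can be called once with all files (for example `extract_with_awk *.txt >> output.csv`) and each row still carries the line number from its own file.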

Matija Nalis