Keeping line numbers when processing text through grep

Question

I have a tricky, complex program that pre-processes text before sending it to machine learning software.

To make a long story short:

The bash script enters a folder where thousands of text files are waiting, opens each one with cat, cleans it up by deleting superfluous lines and then, before sending the files to the machine learning process, writes a CSV to disk with some information for later human checking.

It's very important to keep each line number next to its content, because the order in which words appear is key for the ML process.

So my approach is to add a line number to every line this way (a one-liner with many piped commands):

for file in *.txt
do

cat -v "$file" | nl -nrz -w4 -s$'\t' | .......
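
As a quick check of what this numbering step emits, here is a minimal sketch on made-up input. One thing worth knowing: nl numbers only non-empty lines by default, so if the files contain blank lines and the true file line number is needed, adding -ba (number all lines) is probably what is wanted. The -ba flag and the sample text are my additions for illustration, not part of the pipeline above.

```shell
# A literal tab, built portably (same separator as -s$'\t' in bash).
tab=$(printf '\t')

# Hypothetical three-line input with a blank line in the middle.
sample='first line

third line'

# Default body numbering: the blank line is not counted.
default_numbering=$(printf '%s\n' "$sample" | nl -nrz -w4 -s"$tab")

# With -ba every line is counted, so numbers match the file's real line numbers.
all_numbering=$(printf '%s\n' "$sample" | nl -ba -nrz -w4 -s"$tab")

printf '%s\n' "$default_numbering" | tail -n 1   # numbered 0002
printf '%s\n' "$all_numbering" | tail -n 1       # numbered 0003
```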

Then I get rid of undesired lines this way (sample):

 ...... | sed '/^$/d'| grep -vEi 'unsettling|aforementioned|ruled'

and finally keep two lines for further processing this way:

........ | grep -A 1 -Ei 'university|institute|trust|college'

The output is something like this (sampling two files):

file 1.txt
0098  university of Goteborg is downtown and is one of the
0099  most beautiful building you can visit

0123  the institute of Oslo for advanced investigation
0124  is near the central station and keeps

0234  most important college of Munich
0235  and the most acclaimed teachers are

file 2.txt
0023  there is no trust or confidence
0024  in the counselor to accomplish the new

0182  usually the college is visited
0183  every term for the president but

[EDITED] I missed this step, which was in the wrong place. Sorry.

Then, the text is stacked into "paragraphs" this way:

tr '\n\r' ' '| grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}'

[END EDIT]

This output is saved in the variable "CLEANED_TXT" and fed into a while loop, this way:

while read everyline; do 

    if [[ -n "${everyline// }" ]];then

            echo "$file;$linenumber;$everyline" >> output.csv
    fi  

    done <<< "$CLEANED_TXT"

done  # for every text file

FINAL DESIRED OUTPUT

file 1.txt;0098;university of Goteborg
file 1.txt;0123;the institute of Oslo
file 1.txt;0234;college of Munich

My issue is that the line number is lost at this last step because of the grep just before the loop. Keep in mind that I need the original line number; re-numbering inside the loop is not an option.
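
To make the cause concrete, here is a minimal sketch on two made-up numbered lines. It assumes the loss comes from the combination of tr (which folds everything onto one line) and grep -o (which prints only the matched text); the variable names are illustrative only.

```shell
tab=$(printf '\t')

# Two numbered lines as they look after the nl step (hypothetical sample).
numbered="0098${tab}university of Goteborg is downtown
0099${tab}most beautiful building you can visit"

# tr folds all lines into one, then grep -o prints only the matched text,
# so the leading number never reaches the output.
lost=$(printf '%s\n' "$numbered" | tr '\n\r' ' ' | grep -Eio 'university.{0,25}')

# Matching line by line without -o keeps the whole line, number included.
kept=$(printf '%s\n' "$numbered" | grep -Ei 'university')

printf 'lost: %s\n' "$lost"
printf 'kept: %s\n' "$kept"
```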

I'm stuck. Any help would be much appreciated.

Regards

I don't understand why you would want to loop over lines externally to grep. That aside, have you considered using awk instead? e.g. awk '{if (match($0, "university|college")) print $1, substr($0,RSTART,RLENGTH+25)}' — steeldriver, Aug 18 '16 at 18:07
I'm not sure I understand your problem, but does grep -n help at all? — Jeff Schaller, Aug 18 '16 at 18:46
One thing you should definitely read is, Why is using a shell loop to process text considered bad practice? — Wildcard, Aug 18 '16 at 20:03
Hi Jeff, the problem with grep -n is that it generates new line numbering instead of preserving the original one given in the first code line — jomaweb, Aug 19 '16 at 06:58
@Wildcard if a while loop is not used, either no line is processed or just the first one — jomaweb, Aug 19 '16 at 08:14
@ikkachu yep. Trying to keep different issues on different pages. This for extracting data, the other for inserting data. — jomaweb, Aug 19 '16 at 08:22

Matija Nalis · Accepted Answer · 2016-08-19T10:59:19.613

UPDATE2: get rid of the whole tr ... | grep line (it's just making a mess) and replace your while loop with:

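while read -r linenumber everyline; do
        # keep only the keyword plus up to 25 characters after it
        everyline=$(printf '%s\n' "$everyline" | grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}')
        if [[ -n "$everyline" ]]; then
            echo "$file;$linenumber;$everyline" >> output.csv
        fi
done <<< "$CLEANED_TXT"   # same input as the loop in the question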

That will populate $linenumber with the correct value and put the matched words in the right place:

file1.txt;0098;university of Goteborg is downtown
file1.txt;0234;college of Munich
file1.txt;0182;college is visited
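
The reason this works is that read puts the first whitespace-separated field into the first variable and whatever is left of the line into the last one. A minimal sketch with one made-up line (the name numbered_line is just for illustration):

```shell
tab=$(printf '\t')
numbered_line="0098${tab}university of Goteborg is downtown"

# read splits on IFS (space, tab, newline by default): the first field
# goes to linenumber, the remainder of the line goes to everyline.
read -r linenumber everyline <<EOF
$numbered_line
EOF

echo "linenumber=$linenumber"
echo "everyline=$everyline"
```

Because the number is now carried in its own variable, whatever is done to everyline afterwards cannot lose it.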

Note, however, that the whole thing is a mess and should really be rewritten in perl, awk or similar.
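
As a rough idea of what such a rewrite could look like, here is a minimal awk sketch, not a drop-in replacement. It assumes the exclusion list and the university/college keywords from the question, uses awk's FNR as the line number (which counts every line of the file, blank ones included), and skips the grep -A 1 step entirely. The sample file and the workdir and result names are hypothetical.

```shell
# Build a small throwaway sample file (hypothetical content).
workdir=$(mktemp -d)
cat > "$workdir/file1.txt" <<'EOF'
nothing to see here

the University of Goteborg is downtown and is one of the
this aforementioned college is skipped
most important college of Munich
EOF

# FNR is awk's per-file line counter, so no separate nl step is needed.
# Lines are lower-cased only for matching; the output keeps the original text.
result=$(cd "$workdir" && awk '
  {
    lower = tolower($0)
    if (lower ~ /unsettling|aforementioned|ruled/) next   # exclusion list
    if (match(lower, /university|college/))               # keyword + 25 chars
      printf "%s;%04d;%s\n", FILENAME, FNR, substr($0, RSTART, RLENGTH + 25)
  }' file1.txt)

printf '%s\n' "$result"
rm -r "$workdir"
```

Passing *.txt instead of file1.txt would handle all files in one awk run, since FILENAME and FNR are updated per file.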

edited Aug 19 '16 at 10:59

answered Aug 18 '16 at 19:28


Sorry, missed a line in the wrong place. Edited. – jomaweb Aug 19 '16 at 07:06
@jomaweb can you update the question with your desired output? as shown, even your tr step gives me broken data... – Matija Nalis Aug 19 '16 at 07:22
updated question with desired output – jomaweb Aug 19 '16 at 08:13
