I have a fairly complex program that pre-processes text before sending it to machine learning software.
To make a long story short:
The bash script enters a folder containing thousands of text files, opens each one with cat, cleans it by deleting superfluous lines and then, before sending the files to the machine learning process, writes a CSV to disk with some information for later human checking.
It's very important to keep each line number beside its content, because the order in which words appear is key for the ML process.
So my approach is to add the line number to every line this way (a one-liner with many piped commands):
for file in *.txt
do
cat -v "$file" | nl -nrz -w4 -s$'\t' | .......
Then I get rid of undesired lines this way (sample):
...... | sed '/^$/d'| grep -vEi 'unsettling|aforementioned|ruled'
and finally keep two lines for further processing this way:
........ | grep -A 1 -Ei 'university|institute|trust|college'
The output is something like this (sampling two files):
file 1.txt
0098 university of Goteborg is downtown and is one of the
0099 most beautiful building you can visit
0123 the institute of Oslo for advanced investigation
0124 is near the central station and keeps
0234 most important college of Munich
0235 and the most acclaimed teachers are
file 2.txt
0023 there is no trust or confidence
0024 in the counselor to accomplish the new
0182 usually the college is visited
0183 every term for the president but
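For reference, a minimal sketch of the numbering and filtering steps on a few made-up lines (the real files are not shown here). The sample text and the variable names `sample`, `numbered` and `filtered` are placeholders. One assumption differs from the pipeline above: `-ba` is added because nl numbers only non-empty lines by default, which would shift the numbers away from the original positions whenever a file contains blank lines.

```shell
# Made-up sample input standing in for one text file.
sample=$(printf '%s\n' \
  'intro line' \
  '' \
  'university of Goteborg is downtown' \
  'most beautiful building' \
  'an unsettling remark' \
  'the end')

# A literal tab, built portably (equivalent to $'\t' in bash).
tab=$(printf '\t')

# -ba (an addition, not in the original pipeline) numbers blank lines too,
# so each number is the line's position in the original file.
numbered=$(printf '%s\n' "$sample" | nl -ba -nrz -w4 -s"$tab")

# Drop unwanted lines, then keep each keyword line plus the line after it.
filtered=$(printf '%s\n' "$numbered" \
  | grep -vEi 'unsettling|aforementioned|ruled' \
  | grep -A 1 -Ei 'university|institute|trust|college')

printf '%s\n' "$filtered"
```

At this point every surviving line still starts with its number, so nothing has been lost yet.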
[EDITED] I missed this step, which was in the wrong place. Sorry.
Then, the text is stacked into "paragraphs" this way:
tr '\n\r' ' '| grep -Eio '.{0,0}university.{0,25}|.{0,0}college.{0,25}'
[END EDIT]
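To show where the number goes missing, here is a small sketch of this stacking step run on two made-up numbered lines. The input and the variable names `numbered` and `stacked` are placeholders. The `.{0,0}` prefix from the command above is left out because it matches zero characters and does not change the result.

```shell
# Two numbered lines in the style of the sample output (number, tab, text).
numbered=$(printf '0098\tuniversity of Goteborg is downtown\n0099\tmost beautiful building\n')

# tr joins the lines into one; grep -o then prints only the matched text,
# which starts at the keyword. Everything before the keyword, including
# the number added by nl, is not part of the match and is dropped.
stacked=$(printf '%s\n' "$numbered" \
  | tr '\n\r' ' ' \
  | grep -Eio 'university.{0,25}|college.{0,25}')

printf '%s\n' "$stacked"
```

The printed text begins at "university", and the leading 0098 is gone, so the loop that follows has no number left to read.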
This output is saved in the variable "CLEANED_TXT" and fed into a while loop, this way:
while read everyline; do
if [[ -n "${everyline// }" ]];then
echo "$file;$linenumber;$everyline" >> output.csv
fi
done <<< "$CLEANED_TXT"
done # for every text file
FINAL DESIRED OUTPUT
file 1.txt;0098;university of Goteborg
file 1.txt;0123;the institute of Oslo
file 1.txt;0234;college of Munich
My issue is that the line number is lost at this last step because of the grep just before the loop. Keep in mind that I need the original line number; renumbering inside the loop is not an option.
I'm stuck. Any help would be much appreciated.
Regards
awk '{if (match($0, "university|college")) print $1, substr($0,RSTART,RLENGTH+25)}'
– steeldriver Aug 18 '16 at 18:07
grep -n help at all? – Jeff Schaller Aug 18 '16 at 18:46
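One possible reading of the awk suggestion in the comments, as a rough sketch rather than a drop-in replacement for the whole pipeline. It assumes the number and the text are separated by a tab, as `nl -s$'\t'` produces, and it works on the numbered lines before they are joined by tr. The input lines, the file name and the variable names `numbered`, `file` and `csv` are made up for illustration. Unlike the grep -A 1 and tr combination, it only looks at one line at a time, so text that continues on the next line is not included.

```shell
# Made-up numbered input, as nl would produce it (number, tab, text).
numbered=$(printf '0098\tuniversity of Goteborg is downtown and is one of the\n0099\tmost beautiful building you can visit\n0234\tmost important college of Munich\n')
file='file 1.txt'   # placeholder file name

csv=$(printf '%s\n' "$numbered" \
  | awk -F'\t' -v f="$file" -v OFS=';' '
      # match() sets RSTART and RLENGTH for the first keyword found in the
      # text field; tolower() makes the search case-insensitive.
      match(tolower($2), /university|institute|trust|college/) {
        # $1 is the original number from nl; keep the keyword plus up to
        # 25 following characters, as in the grep -o pattern.
        print f, $1, substr($2, RSTART, RLENGTH + 25)
      }')

printf '%s\n' "$csv"
```

Because the number travels in the first field of each record, it does not need to be recomputed inside a loop; the result could be appended to output.csv directly.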