
I'm reading a bunch of text files to extract some patterns. I also need the line number of each match, but the line number must be removed before the final grep (it can be saved for further processing, perhaps in a variable).

I'll explain my procedure by splitting the (long one-liner) command into steps:

  1. Read the file with cat and do some cleaning (remove strange characters and line feeds using sed and tr). This is just one example of the many piped cleaning tasks:

     cat file | sed 's/,/ /g' | sed '/^$/d'
    
  2. Add a line number and a tab with the nl command, followed by more processing and cleaning:

     nl -nrz -w4 -s$'\t' | tr '\n\r' ' '
    
  3. Extract the final desired pattern to a CSV file:

     grep -Eio '.{0,0}account number.{0,15}' >> account_list.csv
    

The issue is that I need the line number from step 2 in the very same CSV (as another column, in either order) using the SAME ONE-LINE COMMAND (no while or other loop allowed), but I have had no success so far.

[EDITED for better understanding] Keep in mind that the line number I need is the original one, prior to file cleaning. The cleaning process sometimes deletes whole paragraphs. Imagine a file with a thousand lines; after processing I get one hundred, so any line numbering added afterwards is wrong. [end edit]
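One way to keep the original numbering, sketched under assumptions rather than taken from the real program: run nl before any cleaning step, with -ba so that blank lines are counted too, and let the final extraction carry the number along. The function name extract_accounts and the two cleaning steps are placeholders, and the match is case-sensitive, unlike grep -i.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original pipeline): number every raw
# line first, clean afterwards, then print "line-number; account-number".
extract_accounts() {
    tab=$(printf '\t')    # portable replacement for the bash-only $'\t'
    # -ba numbers blank lines too, so the counter matches the raw file.
    nl -ba -nrz -w4 -s"$tab" "$1" |
        sed 's/,/ /g' |                 # example cleaning step from step 1
        sed "/^[0-9]*${tab}\$/d" |      # stand-in for sed '/^$/d' on numbered lines
        sed -n 's/^\([0-9]*\).*account number \([0-9][0-9]*\).*/\1; \2/p'
}
```

The body is a single pipeline with no loop, so it can be pasted as one line; it would be called as extract_accounts file >> account_list.csv. It assumes at most one account number per line.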

Sample input after some processing:

0123 the first account number 2345356432 must be used
0345 take it just for billing purposes, not any other.
0657 Meanwhile the second account number 8623525534
0987 user is blocked until the issue is solved with

The desired output would be:

 2345356432; 0123
 8623525534; 0657

or

0123; 2345356432
0657; 8623525534

Any hint would be much appreciated.

jomaweb
  • It would be much better if you presented a sample of the input and the desired output – Costas Aug 18 '16 at 09:08
  • added sample input and desired output Costas – jomaweb Aug 18 '16 at 09:15
  • What is the reason to do pre-cleaning if you need just account number(s)? – Costas Aug 18 '16 at 10:17
  • sed -rn 's/ .*account number (\S{,15}) *.*/; \1/p' but what if the string account … is split across 2 lines? Which number will you need? – Costas Aug 18 '16 at 10:23
  • Hi Costas, I'm trying to simplify a lot to make this issue suitable to be asked here. The program is very big and this is a tiny (tricky but tiny) part of it. The real purpose of the program is preparing text for Machine Learning algorithms, and a lot of processing is done prior to getting the account numbers. In addition, the account number is just an (easy) example; the real pattern I need to extract is a very complex one. – jomaweb Aug 18 '16 at 10:26
  • I mean that you can receive account number prior to any cleaning (by grep script for example), then do what you want else. – Costas Aug 18 '16 at 10:30
  • yes, I get your point Costas. But this would break the program flow as it is working now. Too many complex parts to change. I try to keep the changes to a minimum. By the way your SED command worked like a charm. Thanks – jomaweb Aug 18 '16 at 10:38
  • You take data from file and store accounts in account_list.csv. What kind of "break the program flow" do you mean? – Costas Aug 18 '16 at 10:59

4 Answers

3

Using GNU awk (the three-argument form of match is a gawk extension) on the original input file, prior to cleaning:

awk '/account number [[:digit:]]+/ { match($0, "account number ([[:digit:]]+)", a); print NR ";" substr($0, a[1, "start"], a[1, "length"]); }' input

This extracts the account number and prints the line number at the start of the line:

1;2345356432
3;8623525534

If you want to extract the pre-processed line number from the cleaned-up file instead:

awk '/account number [[:digit:]]+/ { match($0, "account number ([[:digit:]]+)", a); print $1 ";" substr($0, a[1, "start"], a[1, "length"]); }' input

Splitting this up a little:

  • /account number [[:digit:]]+/ ensures we only process lines matching "account number" followed by a number;
  • match($0, "account number ([[:digit:]]+)", a) looks for the pattern again and stores the positions and lengths of the matched groups (([[:digit:]]+), the number) in array a;
  • print NR ";" substr($0, a[1, "start"], a[1, "length"]) prints the record number (i.e. the line number; use FNR if you want to process multiple files), followed by ;, followed by the substring corresponding to the first group: a[1, "start"] gives its starting index, a[1, "length"] its length (this was filled in by match).

All this assumes there's at most one account number per line.

The second variant prints $1 instead of NR, i.e. the first field in the file, which is the pre-processed line number.
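If GNU awk is not available, the second variant can probably be approximated with the two-argument match of POSIX awk, which sets RSTART and RLENGTH instead of filling an array. This is a swapped-in technique rather than the command above; the function name first_field_and_account is made up for the example, and [0-9] is used instead of [[:digit:]] because some older awk builds lack POSIX character classes.

```shell
#!/bin/sh
# Hypothetical helper: read numbered, cleaned-up lines on stdin and print
# "line-number;account-number" using only POSIX awk features.
first_field_and_account() {
    awk 'match($0, /account number [0-9]+/) {
        acct = substr($0, RSTART, RLENGTH)   # e.g. "account number 2345356432"
        sub(/^account number /, "", acct)    # keep only the digits
        print $1 ";" acct                    # $1 is the pre-processed line number
    }'
}
```

As with the gawk version, this assumes at most one account number per line; it would sit at the end of the cleaning pipeline, e.g. ... | first_field_and_account >> account_list.csv.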

Stephen Kitt
0

If your grep version supports Perl-compatible regular expressions (the -P option), you can use look-behind:

grep -Pnio "(?<=account number.)([0-9]{0,15})" text
guido
  • guido, your command adds NEW line numbers. Keep in mind that the line numbers are already in the file prior to all processing, and some of that processing deletes lines. What I'm feeding to grep is a subset of the original file. The line number I need is the original one, not the one added by the final grep. Imagine a file with a thousand lines; after processing I get one hundred. Your grep resets the numbering to run from 1 to 100 – jomaweb Aug 18 '16 at 10:07
  • @jomaweb How we can see the above from your example? – Costas Aug 18 '16 at 10:14
  • edited the input text to explain the requirements better – jomaweb Aug 18 '16 at 10:28
  • I'd offer to use -z option + [[:space:]] class symbols instead just – Costas Aug 18 '16 at 10:35
0

Given your input and output, an awk script seems much simpler:

gawk '/account number/ {
    nr=gensub(/.*account\s*number\s*([0-9]+).*/, "\\1", "g")
    print FNR "; " nr
}'

Of course, you may need to adapt the account number extraction and output format to your liking. But you get the idea. (Requires GNU awk due to the use of the gensub function.)

0

I would be tempted to use Perl for this, something like this should work:

perl -ne 'print "$1; $2\n" if /^(\d+).*account number (\d+)/' input

On lines that start with some digits (^\d+) and contain the string "account number" followed by some digits, print the 1st and 2nd capture groups (the parts in parentheses, here the two numbers). If you want to print Perl's idea of the line number, use $. instead of $1.

ilkkachu