Parse text file add semicolons before and after numbers; join all other words that are before/ between numbers with hyphens

Question

I have a text file with accounts like:

Input Sample

Paid 100 15/02/2022 3000
recd 50 15/02/2022 nelur trip 3050
PAID 80 25/03/2022 Adjusted towards trip 3130
14 PAID 50  26/03/2022 Given to Nate Cash (padma ac) 3180

I want to ignore numbers at start or join them to first word, then first word(s) join them with a dash/hyphen there is a number put a ; then a date put a ; then words so put a ; and join them then a number so put a ; remove white space except new lines

Can be a script too, perl/ php, bash, awk, sed anything that I can easily install and run on my linux laptop.

Output sample

Paid;100;15/02/2022;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180

I want to import it to a spreadsheet so any other way is fine too. There around 300 lines.

What did I try before? Using spreadsheet to clean up the data that I got from an image and then a image to text convertor.

Also tried with notepad++ and column select. i briedly tried to learn AWK. I know Java but this would take too long on that. Few things I learnt after getting the answers:

leave line numbers if you have them helps to know which rows are missing if you use awk. AS the output is given without any warning of lines skipped. might be a setting to bypass that, not sure
always ask for accounts in csv, spreadsheet or PDF and dont accept images!

I got the input by running a image to text on images I got from a friend on our accounts

We encourage questioners to show what they have tried so far to solve the problem themselves. — Cyrus, Oct 31 '23 at 23:05
Those are dashes, not underscores. Why is it 14-PAID, not 14;PAID, how to handle mixed strings like abc123 or 123abct? — FelixJN, Nov 01 '23 at 09:47
You don't say/show how you want ;s in the input handled - if they're copied as-is to the output they'll mess up your spreadsheet import. — Ed Morton, Nov 01 '23 at 14:24
Your data file indicates an optional 4th column. If you want your parsed output to be correctly imported into a spreadsheet program you'll have to include empty separators where appropriate, e.g. the first line should be parsed into: PAID;100;15/02/2022;;3000. Best Regards. — jubilatious1, Nov 01 '23 at 22:54

choroba · Answer 1 · 2023-11-01T09:24:06.537

3

perl -pe 's/(?:^[0-9]+|[^0-9])\K (?=[^0-9])/-/g;
          s/ /;/g
' -- file

-p reads the input line by line and prints each line after processing;
The first substitution uses look-around assertions: it says replace a space by a dash if it was preceded by something that was a number at the beginning of the string or that wasn't a number and was followed by something that wasn't a number;
The second substitution replaces the remaining spaces by semicolons.

If the two spaces after 50 on the last line aren't a typo, you need to change the second substitution to s/ +/;/g to replace all the adjacent spaces to a single semicolon.

edited Nov 01 '23 at 09:24

answered Oct 31 '23 at 22:31

choroba

47,233

Will give it try – tgkprog Oct 31 '23 at 23:50
Had some issues. with the awk too but that was easier to debug. Few things i learned was should ahve kept the left most column which was line numbers - helped tally – tgkprog Nov 01 '23 at 21:02
Could you please explain the -- before file? Thx. – jubilatious1 Nov 01 '23 at 23:22
1

@jubilatious1, the -- helps if the file name starts with -. It's still not enough to handle arbitrary file paths unfortunately. – Stéphane Chazelas Nov 02 '23 at 06:51

Ed Morton · Accepted Answer · 2023-11-01T22:08:39.610

Since you already have an answer...

Using GNU awk for the 3rd arg to match():

awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
    }
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180

The above assumes your input format is:

<anything1> <integer1> <date>[ <anything2>] <integer2>
  (.*\S)     ([0-9]+)     ^    ( (.*\S))?    ([0-9]+)
                          |
              (([0-9]{2}\/){2}[0-9]{4})

where there can be any non-newline white space between fields, the anything fields inside <...> always end with a non-space but can contain spaces or any other characters, cannot have a date string preceded by an integer in anything1, [ <anything2>] is optional, and you want the output given that to be:

<anything1>;<integer1>;<date>;<anything2>;<integer2>

That means that Paid;100;15/02/2022;3000 in the expected output should actually be Paid;100;15/02/2022;;3000 so every output line has the same number of fields for importing into a spreadsheet as the OP says they want to do.

I'm using a single \S at the start of the regexp and then using substr() to get the head anything string instead of using (.*\S) to populate an array element as I originally intended, and would work with the posted sample input in the question, as I realised that would fail given input like:

Paid 100 15/02/2022 foo 100 15/02/2022 3000

where there are 2 dates in the line and the strings around the 2nd one match the rest of the regexp, e.g. given this modified input with the problematic line at the bottom:

$ cat file
Paid 100 15/02/2022 3000
recd 50 15/02/2022 nelur trip 3050
PAID 80 25/03/2022 Adjusted towards trip 3130
14 PAID 50  26/03/2022 Given to Nate Cash (padma ac) 3180
Paid 100 15/02/2022 foo 100 15/02/2022 3000

note the undesirable last line of output if we used (.*\S) at the start of the regexp:

$ awk -v OFS=';' '
    match($0,/(.*\S)\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = a[1] OFS a[2] OFS a[3] OFS a[6] OFS a[7]
        gsub(/\s+/,"-")
        print
    }
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180
Paid-100-15/02/2022-foo;100;15/02/2022;;3000

vs the correct output using the suggested script:

$ awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
    }
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180
Paid;100;15/02/2022;foo-100-15/02/2022;3000

EDIT based on feedback in comments - to log any lines that do not match the regexp:

awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
        next
    }
    { print > "/dev/stderr" }
' file 2>unmatched.log

Swell I was trying thing with awk but was no where near. And your right would be good to have 4 fields std — tgkprog, Oct 31 '23 at 23:50
I updated it to make it a bit more robust to handle input other than the example in your question. — Ed Morton, Nov 01 '23 at 14:26
Thank you a few rows it ignored, is there a way to tell awk to report how many rows it did not process or give it an error stream/file to report to? i made the first column unique by joining a column with an id number after the word using spreadsheet. Then compared the col1 unparsed and parsed by awk to know which rows were not processed, corrected the data oin those (just 3 rows) and it worked fine. Next time should leave col1 as line numbers. just FYI. Your awk command was super helpful. Thank you :-) — tgkprog, Nov 01 '23 at 21:09
You're welcome. I added a version that will output the failed rows to stderr and write that to a file. If you need any additional changes then please ask a new question. — Ed Morton, Nov 01 '23 at 22:09
+1 for catching/correcting output so that "...every output line has the same number of fields...". :-) — jubilatious1, Nov 01 '23 at 23:07

score 1 · Answer 3 · answered Nov 01 '23 at 15:15

1

sed -E 's,([^0-9 ]) +([^0-9 ]),\1-\2,g
        s,^([0-9]+) +,\1-,g
        s, +,;,g
'

answered Nov 01 '23 at 15:15

jthill

2,710

jubilatious1 · Answer 4 · 2023-11-02T10:47:57.187

Using Raku (formerly known as Perl_6)

First see if you can get all lines to parse:

~$ raku -ne '.match(/ ^ \d* \s? <.alpha>+ \s \d+ \s [ \d**2 \/ \d**2 \/ \d**4 ] \s [[<.graph>+]+ % " " \s]? \d+ $ /).say;'  file

Above is an approach in Raku, a Unicode-ready member of the Perl-family of programming languages. Because you're starting with a complex Regex problem, it's best to see that all lines parse correctly from /^ … $/ beginning to end. (You'll probably want to run something like raku -pe 's:g/\s+/ /;' to normalize whitespace first). The code above will return Nil for lines that don't parse correctly, and here's the result for the OP's whitespace-normalized data:

Sample Input:

Paid 100 15/02/2022 3000
recd 50 15/02/2022 nelur trip 3050
PAID 80 25/03/2022 Adjusted towards trip 3130
14 PAID 50 26/03/2022 Given to Nate Cash (padma ac) 3180

Parsed Lines:

｢Paid 100 15/02/2022 3000｣
｢recd 50 15/02/2022 nelur trip 3050｣
｢PAID 80 25/03/2022 Adjusted towards trip 3130｣
｢14 PAID 50 26/03/2022 Given to Nate Cash (padma ac) 3180｣

Now that you see all lines are parsed from beginning-to-end, you can capture matching elements. Denote captures with parentheses, noting that Raku captures start from $0:

~$ raku -ne '.match(/ ^ (\d* \s? <.alpha>+) \s (\d+) \s ( \d**2 \/ \d**2 \/ \d**4 ) \s [(<.graph>+)+ % " " \s]? (\d+) $ /).say;'  file
｢Paid 100 15/02/2022 3000｣
 0 => ｢Paid｣
 1 => ｢100｣
 2 => ｢15/02/2022｣
 4 => ｢3000｣
｢recd 50 15/02/2022 nelur trip 3050｣
 0 => ｢recd｣
 1 => ｢50｣
 2 => ｢15/02/2022｣
 3 => ｢nelur｣
 3 => ｢trip｣
 4 => ｢3050｣
｢PAID 80 25/03/2022 Adjusted towards trip 3130｣
 0 => ｢PAID｣
 1 => ｢80｣
 2 => ｢25/03/2022｣
 3 => ｢Adjusted｣
 3 => ｢towards｣
 3 => ｢trip｣
 4 => ｢3130｣
｢14 PAID 50 26/03/2022 Given to Nate Cash (padma ac) 3180｣
 0 => ｢14 PAID｣
 1 => ｢50｣
 2 => ｢26/03/2022｣
 3 => ｢Given｣
 3 => ｢to｣
 3 => ｢Nate｣
 3 => ｢Cash｣
 3 => ｢(padma｣
 3 => ｢ac)｣
 4 => ｢3180｣

Final answer -- Use either Raku's .subst( /…/, {…} ) or Raku's s/// substitution commands/idioms. Note: both these substitution idioms take "Callable" code in within curlie braces in the replacement, greatly increasing the power of each command. So you can .translate remaining whitespace characters into - hyphens, then join on ; semicolons:

~$ raku -ne '.subst(/ ^ (\d* \s? <.alpha>+) \s (\d+) \s ( \d**2 \/ \d**2 \/ \d**4 ) \s [(<.graph>+)+ % " " \s]? (\d+) $/,  \
             {join ";", $0.trans(" " => "-"), $1, $2, $3.trans(" " => "-") // "", $4}).put;'  file
#OR:
~$ raku -pe 's/ ^ (\d* \s? <.alpha>+) \s (\d+) \s ( \d2 / \d2 / \d**4 ) \s [(<.graph>+)+ % " " \s]? (\d+) $  

              /{join ";", $0.trans(" " => "-"), $1, $2, $3.trans(" " => "-") // "", $4}/;'  file

Sample Output (from either solution immediately above):

Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180

Bonus points: You can normalize case by using something like {$0.uc} to convert the first column all to PAID, RECD, etc.

Simplifying assumptions: You can simplify the Regex to use \d for digit and \D for non-digit, but you risk not parsing lines with mixed non-digit/digit "words", like UK postcodes or Airline confirmation codes (especially in Column 4).

https://docs.raku.org/routine/subst#Callable
https://docs.raku.org/language/regexes
https://raku.org

Parse text file add semicolons before and after numbers; join all other words that are before/ between numbers with hyphens

4 Answers4