Since you already have an answer...
Using GNU awk for the 3rd arg to match() and the \s/\S shorthands:
awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
    }
}
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180
The above assumes your input format is:
<anything1> <integer1> <date>[ <anything2>] <integer2>
  (.*\S)     ([0-9]+)    ^     ( (.*\S))?    ([0-9]+)
                         |
             (([0-9]{2}\/){2}[0-9]{4})
where there can be any non-newline white space between fields; the anything fields inside <...> always end with a non-space but can contain spaces or any other characters; anything1 cannot contain a date string preceded by an integer; [ <anything2>] is optional; and you want the output given that to be:
<anything1>;<integer1>;<date>;<anything2>;<integer2>
That means that Paid;100;15/02/2022;3000 in the expected output should actually be Paid;100;15/02/2022;;3000 so every output line has the same number of fields for importing into a spreadsheet, as the OP says they want to do.
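One quick way to check that every line really does have the same number of fields before importing is to count the ;-separated fields per line. This is only a minimal sketch: the three sample lines are fed in with printf purely for illustration (the last one is the 4-field form from the question's expected output), and it uses plain POSIX awk, so it doesn't need GNU awk:

```shell
# Print the line number, field count, and content of any line that
# does not have exactly 5 ;-separated fields. No output means every
# line is consistent.
bad=$(printf '%s\n' \
        'Paid;100;15/02/2022;;3000' \
        'recd;50;15/02/2022;nelur-trip;3050' \
        'Paid;100;15/02/2022;3000' |
    awk -F';' 'NF != 5 { print "line " NR ": " NF " fields: " $0 }')
printf '%s\n' "$bad"
```

With real data you'd pipe the output of the conversion script into the same awk command instead of the printf.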
I'm using a single \S at the start of the regexp and then using substr() to get the head anything string, instead of using (.*\S) to populate an array element as I originally intended. The (.*\S) version would work with the posted sample input in the question, but I realised it would fail given input like:
Paid 100 15/02/2022 foo 100 15/02/2022 3000
where there are 2 dates in the line and the strings around the 2nd one match the rest of the regexp, e.g. given this modified input with the problematic line at the bottom:
$ cat file
Paid 100 15/02/2022 3000
recd 50 15/02/2022 nelur trip 3050
PAID 80 25/03/2022 Adjusted towards trip 3130
14 PAID 50 26/03/2022 Given to Nate Cash (padma ac) 3180
Paid 100 15/02/2022 foo 100 15/02/2022 3000
note the undesirable last line of output if we used (.*\S) at the start of the regexp:
$ awk -v OFS=';' '
    match($0,/(.*\S)\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = a[1] OFS a[2] OFS a[3] OFS a[6] OFS a[7]
        gsub(/\s+/,"-")
        print
}
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180
Paid-100-15/02/2022-foo;100;15/02/2022;;3000
vs the correct output using the suggested script:
$ awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
}
' file
Paid;100;15/02/2022;;3000
recd;50;15/02/2022;nelur-trip;3050
PAID;80;25/03/2022;Adjusted-towards-trip;3130
14-PAID;50;26/03/2022;Given-to-Nate-Cash-(padma-ac);3180
Paid;100;15/02/2022;foo-100-15/02/2022;3000
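To see what substr($0,1,RSTART) is doing in the suggested script: match() sets RSTART to the position where the match starts, and since the regexp starts with a single non-space, that position is the last character of the head anything string. Here is a minimal sketch of just that step, using a cut-down, portable regexp of my own for illustration ([^ \t] and [ \t]+ standing in for \S and \s+, and the date shortened to its first "NN/"), not the full regexp from the script:

```shell
# Illustration only: a shortened, non-GNU regexp that stops after the
# first two digits and slash of the date.
line='14 PAID 50 26/03/2022 Given to Nate 3180'
head=$(printf '%s\n' "$line" |
    awk 'match($0, /[^ \t][ \t]+[0-9]+[ \t]+[0-9][0-9]\//) {
        # RSTART points at the single non-space that starts the match,
        # so everything up to and including it is the head string
        print RSTART ": " substr($0, 1, RSTART)
    }')
printf '%s\n' "$head"
```

That prints 7: 14 PAID, i.e. the match starts at the D of PAID, which is the leftmost place where a non-space is followed by an integer and then a date.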
EDIT based on feedback in comments - to log any lines that do not match the regexp:
awk -v OFS=';' '
    match($0,/\S\s+([0-9]+)\s+(([0-9]{2}\/){2}[0-9]{4})(\s+(.*\S))? ([0-9]+)$/,a) {
        $0 = substr($0,1,RSTART) OFS a[1] OFS a[2] OFS a[5] OFS a[6]
        gsub(/\s+/,"-")
        print
        next
    }
    { print > "/dev/stderr" }
' file 2>unmatched.log
14-PAID, not 14;PAID, how to handle mixed strings like abc123 or 123abct? – FelixJN Nov 01 '23 at 09:47
;s in the input handled - if they're copied as-is to the output they'll mess up your spreadsheet import. – Ed Morton Nov 01 '23 at 14:24
PAID;100;15/02/2022;;3000. Best Regards. – jubilatious1 Nov 01 '23 at 22:54