Implementing an extended regexp to add a variable number of leading zeros based on position in a string

Question

I am having trouble getting my sed syntax down to add a varying number of leading zeros to a numeric organizational scheme. The strings I am operating on appear like

1.1.1.1,Some Text Here

leveraging the sed syntax

sed -r ":r;s/\b[0-9]{1,$((1))}\b/0&/g;tr"

I am able to elicit the response

01.01.01.01,Some Text Here

However, what I am looking for is a way to zero-fill to 2 digits in fields 2 and 3 and to 3 digits in field 4, so that all items have a standard length of the form [0-9].[0-9]{2}.[0-9]{2}.[0-9]{3}:

1.01.01.001,Some Text Here

For the life of me I cannot even figure out how to modify the boundary so that it snaps only to numerals following a period. I think it has something to do with the use of \b, which I understand matches zero characters at a word boundary, but I do not understand why my attempts to add a period to the match fail as follows:

sed -r ":r;s/\.\b[0-9]{1,$((1))}\b/0&/g;tr"
sed -r ":r;s/\b\.[0-9]{1,$((1))}\b/0&/g;tr"
Both cause the statement to hang

sed -r ":r;s/\b[0-9]\.{1,$((1))}\b/0&/g;tr"
sed -r ":r;s/\b[0-9]{1,$((1))}\.\b/0&/g;tr"
sed -r ":r;s/\b[0-9]{1,$((1))}\b\./0&/g;tr"
cause the statement to output:

01.01.01.1,Some Text Here
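
A possible explanation for the hang, sketched under the assumption of GNU sed (`\b` is a GNU extension, and `one_pass` is a hypothetical helper name): when the period is part of the match, `0&` puts the zero in front of the period rather than in front of the digit. The final `.1` therefore still matches after every pass, and the `:r ... tr` loop never ends. Running the substitution without the loop shows the string growing:

```shell
# One pass of the first hanging variant, with the ":r ... tr" loop removed.
one_pass() {
    sed -E 's/\.\b[0-9]{1,1}\b/0&/g'
}

# First pass: each ".1" becomes "0.1", so the zero lands before the period.
printf '%s\n' '1.1.1.1,Some Text Here' | one_pass
# Second pass: the last ".1" is still there, so it matches again (and always will).
printf '%s\n' '1.1.1.1,Some Text Here' | one_pass | one_pass
```

The first command prints 10.10.10.1,Some Text Here and the second prints 10.10.100.1,Some Text Here; with the loop in place, the third field simply keeps gaining zeros.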

Additionally, I expect that I will have additional problems if the statement contains text like:

1.1.1.1,Some Number 1 Here

It is a foregone conclusion that I need to really learn sed and all of its complexities. I am working on that, but expect that this particular statement will continue to cause me trouble for a while. Any help would be greatly appreciated.

EDIT: I've figured out a way... This statement seems to do what I am looking for, but there has got to be a more elegant way to do this.

sed -r ':r;s/\b[0-9]{1,1}\.\b/0&/;tr;:i;s/\b[0-9]{1,2},\b/0&/;ti;s/.//'

Also, syntactically this will cause problems if a similar number format appears in the text... similar to:

1.1.1.1,Some Text Referring to Document XXX Heading 1.2.3

In which case it will result in:

1.01.01.001,Some Text Referring to Document XXX Heading 01.02.03

Solved: Thank you all for your help here. I initially solved the problem with the answer I accepted below. I've since moved the solution into Python as part of a larger solution, using the sort key below:

def getPaddedKey(line):
    keyparts = line[0].split(".")
    keyparts = map(lambda x: x.rjust(5, '0'), keyparts)
    return '.'.join(keyparts)

s=sorted(reader, key=getPaddedKey)
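
If sorting is the end goal, as in the Python snippet above, padding may not be needed at all. A minimal sketch, assuming a POSIX `sort` and that every line starts with four dot-separated numbers (`sort_keys` is a hypothetical helper name):

```shell
# Sort on the four dot-separated fields numerically instead of padding them.
# LC_ALL=C ensures the comma after field 4 is never treated as a numeric separator.
sort_keys() {
    LC_ALL=C sort -t. -k1,1n -k2,2n -k3,3n -k4,4n
}

printf '%s\n' '1.10.1.1,b' '1.2.1.10,d' '1.2.1.9,c' '1.2.1.1,a' | sort_keys
```

This prints the lines in the order 1.2.1.1, 1.2.1.9, 1.2.1.10, 1.10.1.1, which is the order a padded key would give.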

This seems to do what I am looking for:
sed -r ':r;s/\b[0-9]{1,1}\.\b/0&/;tr;:i;s/\b[0-9]{1,2},\b/0&/;ti;s/.//'

However, I'd love to know if there is a more elegant approach. — daijizai, Jul 18 '17 at 18:39
Strangely, reversing the string, applying trailing zeros, and then reversing the result may achieve your aim more easily. — Chris Davies, Jul 18 '17 at 18:51
I'll have to take a look at that. It sounds quite obtuse... (o_O ) — daijizai, Jul 18 '17 at 19:51
Using printf (or a printf call within Awk) may be more straightforward. — Wildcard, Jul 18 '17 at 21:51
this is definitely something that will be easier to implement, read, understand, and modify in the future in a language like awk or perl (or anything else that has printf and easy field-splitting). — cas, Jul 19 '17 at 00:15
@Wildcard and cas have it right. A tool like awk is much easier for such things. If you set FS='.' in awk, then your fields will be split for you without any other code. Then you can look at them individually without worrying that what you need to do to one of them will affect another. sed is a great tool, but it's really hard to debug and, often, really hard to read your code later if you're doing anything even moderately involved. — Joe, Jul 22 '17 at 03:13
@Joe, thanks, but incidentally, I don't even agree that Sed is hard to debug. It just has a different purpose than Awk. Awk is for delimited field data; Sed is for whole lines. Here's an example of the trouble you can have if you use Awk when you should use Sed. — Wildcard, Jul 22 '17 at 03:40
@Wildcard - point well taken. Can you point me to something about debugging sed? Usually, I resort to prolonged staring punctuated by expletives. ;) Barring that, I sometimes break a sed statement into smaller pieces and try to get each one to work before combining them again. I recently read a great tutorial https://github.com/learnbyexample/Command-line-text-processing/blob/master/gnu_sed.md and I was sure some of the examples were wrong until I applied prolonged staring. — Joe, Jul 22 '17 at 04:19
@Joe, I really can't. "Prolonged staring punctuated by expletives" - I laughed out loud. :) Just general coding best practices. Break it up in pieces, make it more explicit, etc. For large Sed programs (if you have any), be very clear about what you know about the hold space and pattern space at each point in the flow. And use the tool/language suited for the task at hand. Best of luck. :) — Wildcard, Jul 24 '17 at 21:04

score 9 · Answer 1 · answered Jul 18 '17 at 19:31

bash can handle this. It'll be a lot slower than perl though:

echo "1.1.1.1,Some Text Here" | 
while IFS=., read -r a b c d text; do
    printf "%d.%02d.%02d.%03d,%s\n" "$a" "$b" "$c" "$d" "$text"
done

1.01.01.001,Some Text Here

answered Jul 18 '17 at 19:31

glenn jackman

Or Awk. But +1 for using printf, the sensible tool. (Awk has printf also and is better designed than bash for text processing.) Also see Why is using a shell loop to process text considered bad practice? – Wildcard Jul 18 '17 at 21:51
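
The Awk variant suggested in the comments could look roughly like this. It is only a sketch: `pad_key` is a hypothetical name, and it assumes the key is everything before the first comma and has exactly four numeric fields.

```shell
# Pad the dotted key before the first comma to the widths 1.2.2.3.
pad_key() {
    awk '{
        i = index($0, ",")                 # position of the first comma
        if (i == 0) { print; next }        # no comma: pass the line through unchanged
        key  = substr($0, 1, i - 1)        # e.g. "1.1.1.1"
        rest = substr($0, i)               # comma and text, left untouched
        split(key, f, ".")                 # f[1]..f[4] hold the four numbers
        printf "%d.%02d.%02d.%03d%s\n", f[1], f[2], f[3], f[4], rest
    }'
}

printf '%s\n' '1.1.1.1,Some Text Referring to Document XXX Heading 1.2.3' | pad_key
```

Because the rest of the line is passed through as a `%s` argument, backslashes and later numbers such as 1.2.3 are left alone. Awk also reads a field such as 08 as decimal 8, whereas bash's `printf %d` rejects it as an invalid octal number.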

Chris Davies · Answer 2 · 2017-07-18T20:06:14.067

5

You haven't specifically asked for a perl solution but here's one anyway. Personally I think this is a little easier to read, especially when broken into several lines.

First here is the one-liner:

(
    echo '1.2.3.4,Some Text Here'
    echo '1.01.01.1,Some Text Here'
    echo '1.1.1.1,Some Number 1 Here'
    echo '1.1.1.1,Some Text Referring to Document XXX Heading 1.2.3'
    echo '1.2.3.4,Some \n \s \text'
) |
perl -ne '($ip, $text) = split(/,/, $_, 2); $ip = sprintf("%1d.%02d.%02d.%03d", split(/\./, $ip)); print "$ip,$text"'

Its results:

1.02.03.004,Some Text Here
1.01.01.001,Some Text Here
1.01.01.001,Some Number 1 Here
1.01.01.001,Some Text Referring to Document XXX Heading 1.2.3
1.02.03.004,Some \n \s \text

And here is the perl script broken out and commented (the -n flag puts an implicit while read; do ... done loop around the code):

($ip, $text) = split(/,/, $_, 2);                # Split line into two parts by comma
@octets = split(/\./, $ip);                      # Split IP address into octets by dots
$ip = sprintf("%1d.%02d.%02d.%03d", @octets);    # Apply the formatting
print "$ip,$text"                                # Output the two parts

edited Jul 18 '17 at 20:06

answered Jul 18 '17 at 18:43

Chris Davies

Ironically, I was just about to give up in sed and move to awk when you posted this. It seems to fit the bill. I'll check it and get back. – daijizai Jul 18 '17 at 18:48
@daijizai awk would work too - same principle using printf – Chris Davies Jul 18 '17 at 18:49
The only thing this fails at I couldn't have anticipated, but is significant. It seems to strip backslash from the text portion. – daijizai Jul 18 '17 at 19:50
@daijizai not here it doesn't. How are you feeding it text with a backslash? I've added a backslashed example for you – Chris Davies Jul 18 '17 at 20:03
In my use with my internal dataset there are rows with the text column containing strings like SOME\Text\Might\Be\Here\4Realz. When this dataset was passed to the perl statement it resulted in a response like SOMETextMightBeHere4Realz – daijizai Jul 20 '17 at 04:32
Hi @daijizai. If you look at my demo data you can see text with backslashes. How did you feed your data to the perl script? – Chris Davies Jul 20 '17 at 06:52
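
The thread never establishes where the backslashes were lost. One common cause, offered here purely as an assumption, is feeding lines through a shell `read` without `-r`, which treats backslashes as escape characters. A small sketch of the difference (the variable names are illustrative):

```shell
line='1.1.1.1,SOME\Text\Might\Be\Here'

# With -r the line is read verbatim.
with_r=$(printf '%s\n' "$line" | { IFS= read -r x; printf '%s' "$x"; })

# Without -r each backslash is consumed as an escape character.
without_r=$(printf '%s\n' "$line" | { IFS= read x; printf '%s' "$x"; })

printf '%s\n' "$with_r" "$without_r"
```

The first value keeps every backslash; the second comes out as 1.1.1.1,SOMETextMightBeHere, which resembles the symptom described above.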

MiniMax · Accepted Answer · 2017-07-18T21:43:22.210

Usage: leading_zero.sh input.txt

#!/bin/bash

sed -r '
    s/\.([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{1,3},)/.0\1.0\2.00\3/
    s/\.0*([0-9]{2})\.0*([0-9]{2})\.0*([0-9]{3})/.\1.\2.\3/
' "$1"

Explanation:

The first substitution adds a fixed number of zeros to each number: one zero to the 2nd and 3rd numbers and two zeros to the 4th. It doesn't matter how many digits are already there.
The second substitution removes the surplus zeros, leaving only the required number of digits. The 2nd and 3rd numbers should contain exactly 2 digits and the 4th exactly 3; those digits are kept and any extra leading zeros are removed.
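
To see the two steps separately, the substitutions can be run one at a time. This is a sketch using the same expressions as above, with `-E` in place of `-r`; `step1` and `step2` are hypothetical helper names.

```shell
# Step 1: over-pad. One zero before fields 2 and 3, two zeros before field 4.
step1() {
    sed -E 's/\.([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{1,3},)/.0\1.0\2.00\3/'
}

# Step 2: trim. Keep the last 2 digits of fields 2 and 3 and the last 3 of field 4.
step2() {
    sed -E 's/\.0*([0-9]{2})\.0*([0-9]{2})\.0*([0-9]{3})/.\1.\2.\3/'
}

printf '%s\n' '1.11.1.11,Some Text Here' | step1
printf '%s\n' '1.11.1.11,Some Text Here' | step1 | step2
```

The first command prints the over-padded form 1.011.01.0011,Some Text Here and the second prints the final 1.11.01.011,Some Text Here. Only the first match on each line is rewritten, so a number such as 1.2.3 later in the text is not touched.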

input.txt

1.1.1.1,Some Text Here
1.1.1.1,Some Text Here
1.11.1.11,Some Text Referring to Document XXX Heading 1.2.3
1.1.1.1,Some Text Here
1.1.11.111,Some Text Referring to Document XXX Heading 1.2.3
1.11.1.1,Some Text Here

output.txt

1.01.01.001,Some Text Here
1.01.01.001,Some Text Here
1.11.01.011,Some Text Referring to Document XXX Heading 1.2.3
1.01.01.001,Some Text Here
1.01.11.111,Some Text Referring to Document XXX Heading 1.2.3
1.11.01.001,Some Text Here

While in the end I just ended up scripting this in Python for expediency, this is the best answer to my question as written given that the perl previously submitted removed backslashes (at least) from the output. This 1. is a sed solution, and 2. produces the proper output without molestation of the text. Marking as answer. Thanks! :-) — daijizai, Jul 18 '17 at 21:52
@daijizai as I have already demonstrated, the perl version does not remove backslashes. — Chris Davies, Jul 19 '17 at 22:24

score 3 · Answer 4 · 2017-07-18T17:36:43.920

Here's one possible approach:
sed -E 's/([0-9]*\.)/0\1/g;s/.//;s/([0-9]*,)/00\1/'

Examples

echo "1.11.111.1111,Some Text Here" | sed -E 's/([0-9]*\.)/0\1/g;s/.//;s/([0-9]*,)/00\1/'
1.011.0111.001111,Some Text Here

It also works with this string:

echo "1.1.1.1,Some Number 1 Here" | sed -E 's/([0-9]\.)/0\1/g;s/.//;s/([0-9],)/00\1/'
1.01.01.001,Some Number 1 Here

...and this string:

echo "1.2.2101.7191,Some Text Here" | sed -E 's/([0-9]*\.)/0\1/g;s/.//;s/([0-9]*,)/00\1/'
1.02.02101.007191,Some Text Here

edited Jul 18 '17 at 17:36

answered Jul 18 '17 at 17:16

Unfortunately this breaks down as the numerals climb. For instance:
```
1.1.11.111,Some Text Here
```
Became:
```
1.1.101.11001,Some Text Here
```
– daijizai Jul 18 '17 at 17:29
@daijizai Please see my edit. Would this meet the requirement? – Jul 18 '17 at 17:38
Unfortunately not, but I think that might be my fault. The zero-fill needs to be up to two digits in fields 2 and 3 and up to 3 digits in field 4. Essentially [0-9].[0-9]{2}.[0-9]{2}.[0-9]{3},Some Text Here – daijizai Jul 18 '17 at 18:11

score 2 · Answer 5 · 2017-07-20T05:01:02.987

perl -pe '/^\d/g && s/\G(?:(\.\K\d+(?=\.))|\.\K\d+(?=,))/sprintf "%0".($1?2:3)."d",$&/ge'

Explanation:

The method used here is to look at the neighborhood of each number and act on that. The 2nd and 3rd numbers have a dot on both sides, whilst the 4th number has a dot on its left and a comma on its right.

The $1 is set when the regex takes the path of 2nd or 3rd nums and accordingly the precision padding is 2. OTOH, for the 4th num, the padding is 3.

% cat file.txt

1.00.3.4,Some Text Here
1.01.01.1,Some Text Here
1.0.01.1,Some Number 1 Here
1.1.1.1,Some Text Referring to Document XXX Heading 1.2.3.4
1.2.3.4,Some \n \s \text

Results:

1.00.03.004,Some Text Here
1.01.01.001,Some Text Here
1.00.01.001,Some Number 1 Here
1.01.01.001,Some Text Referring to Document XXX Heading 1.2.3.4
1.02.03.004,Some \n \s \text
