1

I have a huge text file with about 70k lines in it. My objective is to read this file, match a pattern ("Count"), and add or replace its value with an iterated number.

What I'm doing is:

  1. Reading file.
  2. Grep for the pattern Count.
  3. If it matches, delete the pattern.
  4. Insert the desired pattern (Count = $i) on that line.
  5. Increment variable i.

Here's the code:

line_count=0
i=0
while read line
do
        line_count=$((line_count+1))
        if echo "$line" | grep -q "Count"
        then
                sed -i "$line_count d" /tmp/rand_file1
                sed -i "$line_count i Count = $i" /tmp/rand_file1
                i=$((i+1))
        fi
done </tmp/rand_file1

The above technique takes about 25min to complete. Is there a way to reduce this time as I will be working with larger data files?

Below are the input file and the expected output:

Input file

Count
Name = Sarah
ID = 113
PhNo =

Count
Name = John
ID = 787
PhNo =

Count = 123
Name = Mike
ID = 445
PhNo =

Count Now
Name = Max
ID = 673
PhNo =

Expected output file

Count = 1
Name = Sarah
ID = 113
PhNo =

Count = 2
Name = John
ID = 787
PhNo =

Count = 3
Name = Mike
ID = 445
PhNo =

Count = 4
Name = Max
ID = 673
PhNo =
agc
  • 7,223
manu ks
  • 13

6 Answers

3

Parsing a text file in the shell is very slow and extremely error prone. You are running grep once per line in the input file, and sed twice for every line that contains Count. Avoid doing this.

As far as I can see, this may be replaced by

awk '$1 == "Count" { printf("Count = %d\n", ++i); next } { print }' rand_file1 >rand_file1.new

This outputs the Count = lines with the correct increment when it hits a line whose first field is exactly Count, and passes all other lines on as-is.
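To see the effect on a small case, here is a minimal sketch that runs the same awk program on a shortened copy of the question's input. The temporary file and the function name renumber are made up for illustration and are not part of the original command.

```shell
# Hypothetical sample file, modelled on the input shown in the question.
sample=$(mktemp)
cat >"$sample" <<'EOF'
Count
Name = Sarah

Count = 123
Name = Mike

Count Now
Name = Max
EOF

# renumber FILE: rewrite every line whose first whitespace-separated
# field is exactly "Count"; all other lines are printed unchanged.
renumber() {
    awk '$1 == "Count" { printf("Count = %d\n", ++i); next } { print }' "$1"
}

renumber "$sample"
rm -f "$sample"
```

The three Count lines come out as Count = 1, Count = 2 and Count = 3, whatever followed the word Count in the input. A line starting with a longer word such as Counter would be left alone, because the comparison is on the whole first field.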

Alternatively,

awk '$1 == "Count" { $0 = sprintf("Count = %d", ++i) } { print }' rand_file1 >rand_file1.new

which modifies the $0 value (the input line) and prints all lines with a single print.

This last variation may be shortened into

awk '$1 == "Count" { $0 = sprintf("Count = %d", ++i) } 1' rand_file1 >rand_file1.new
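All of these commands write the result to rand_file1.new and leave the input untouched. If the goal is to end up with the original file updated, one possible sketch is to write to a temporary file and move it into place only when awk succeeds. The helper name renumber_in_place is hypothetical, and note that the replaced file gets the temporary file's permissions and ownership, which may differ from the original's.

```shell
# renumber_in_place FILE: hypothetical helper that renumbers the Count
# lines of FILE and replaces FILE only if awk finished without error.
renumber_in_place() {
    file=$1
    # Create the temporary file next to the target so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if awk '$1 == "Count" { $0 = sprintf("Count = %d", ++i) } 1' "$file" >"$tmp"
    then
        mv -- "$tmp" "$file"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

It would be called as renumber_in_place /tmp/rand_file1.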

See also "Why is using a shell loop to process text considered bad practice?".

Kusalananda
  • 333,661
  • The last {print} could be replaced by a simple 1. Instead of using sprintf you could simply do: $0="Count = " ++i. The whole script would be: awk '/^Count/{$0="Count = " ++i}1' –  Apr 17 '18 at 16:54
2

Short awk approach:

awk '$1 == "Count"{ $0 = "Count = "++i }1' file

The output:

Count = 1
Name = Sarah
ID = 113
PhNo =

Count = 2
Name = John
ID = 787
PhNo =
2

The obligatory perl answer:

perl -pe 's{^Count\b.*}{"Count = " . ++$i}e'
  • I think that RE match should be ^Count\b.* – Chris Davies Apr 16 '18 at 08:46
  • Think we need to change this solution based on the new edited input file. – manu ks Apr 16 '18 at 08:56
  • If there is leading whitespace before Count, this will fail. Asking Perl to split each line into whitespace-separated fields automatically may be more helpful: perl -pae 'if($F[0]=~"Count"){$_="Count = ".++$c." \n"}' infile –  Apr 17 '18 at 19:32
1

Replacing lines that start with Count with Count = followed by the occurrence number

Assuming Count is the first word on the line:

awk -v c=1 'sub(/^Count.*/, "Count = " c) {c++}; {print}' /tmp/rand_file1

Assuming Count is the first word on the line but may be preceded by whitespace (the leading whitespace is not preserved):

awk -v c=1 'sub(/^[[:blank:]]*Count.*/, "Count = " c) {c++}; {print}' /tmp/rand_file1
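Note that /^Count.*/ also matches lines that merely begin with those five letters, such as Counter = 5. If that matters for your data, a stricter variant could require Count to be followed by a blank or the end of the line. This is only a sketch under that assumption: the function name renumber_word is made up, and [ \t] is used in place of [[:blank:]] because some older awk implementations do not support POSIX character classes.

```shell
# renumber_word [FILE...]: hypothetical stricter variant. A line is replaced
# only when Count is the whole first word, optionally preceded by blanks.
renumber_word() {
    awk -v c=1 '
        # sub() returns 1 only when it actually replaced something,
        # so the counter advances only on real matches.
        sub(/^[ \t]*Count([ \t].*)?$/, "Count = " c) { c++ }
        { print }
    ' "$@"
}

# Example run on made-up input lines.
printf 'Count\n  Count Now\nCounter = 5\nName = Count Olaf\nCount = 123\n' |
    renumber_word
```

Here the Counter line and the Name = Count Olaf line pass through unchanged, while the other three lines are numbered 1 to 3.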
Bharat
  • 814
  • Can I do a wildcard match also? Like if the line has "Count = 1234", how do I include this pattern also? – manu ks Apr 16 '18 at 07:48
  • Also after every 14th match, the counter is getting reset to 0!!? – manu ks Apr 16 '18 at 07:51
  • strange, let me try it.. – Bharat Apr 16 '18 at 08:16
  • for me it seems to be working fine , can you paste more.. – Bharat Apr 16 '18 at 08:21
  • Looks like its working. I typed the code instead of copy paste. But how to do a wildcard match. I saw that there are lines with "Count = 1234", "Count now" etc.. they are getting replaced as "Count = 1 = 1234", "Count = 2 = now". – manu ks Apr 16 '18 at 08:22
  • Please add this to sample input in question, will update it... – Bharat Apr 16 '18 at 08:31
  • Updated assuming whatever is there before or after Count keyword , needs to be replaced with Count = its count.. – Bharat Apr 16 '18 at 08:53
  • 1
    Note that it would increase c on a line like Name = Count Olaf. Since there can only be one substitution, you can replace the gsub with sub. Here you could do awk -v c=1 'sub(/^Count.*/, "Count = " c) {c++}; {print}' – Stéphane Chazelas Apr 16 '18 at 09:19
  • yes, updated .... – Bharat Apr 16 '18 at 09:21
  • @isaac Its not giving expected output... – Bharat Apr 17 '18 at 12:10
  • @Bharat Yes, sorry. It should have been: awk '/^Count/{sub(/^Count.*/,"Count = " ++c)}1'. But there is a better/shorter solution: awk '/^Count/{$0="Count = " ++c}1'. –  Apr 17 '18 at 16:40
  • @Bharat If you want to allow for leading whitespace, use internal awk ability of breaking lines into fields: awk '$1~/^Count/{$0="Count = " ++c}1' (assuming that a first field starting with Count is what is wanted). –  Apr 17 '18 at 17:20
  • @isaac, awk '{sub(/^Count.*/,"Count = " ++c)}1' wouldn't work as c in there is incremented regardless of whether the substitution is made or not. – Stéphane Chazelas Apr 17 '18 at 20:32
  • @StéphaneChazelas Late to this party. The command has been corrected in my following comment(s). Read them. –  Apr 17 '18 at 20:35
1

Using sed, with seq piped in for iteration:

t='Count'
seq -f "$t = %g" 70000 | sed -i -e "/^$t/R /dev/stdin" -e "/^$t/d" /tmp/rand_file1

Notes:

  • sed's Read command won't work in braces {}, so two -es are needed.
  • The 70000 could be any large enough number. When sed stops, so does seq, so the higher values won't even be created.
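A small way to try this out without touching the real file is to drop -i so the result goes to standard output. The sketch below assumes GNU sed (the R command and its handling of /dev/stdin are GNU extensions) and uses a made-up function name, renumber_sed.

```shell
# renumber_sed FILE: hypothetical wrapper around the command above,
# printing to stdout instead of editing FILE in place. Assumes GNU sed.
renumber_sed() {
    t='Count'
    # seq supplies "Count = 1", "Count = 2", ... on stdin; for each matching
    # line sed queues the next one for output, then deletes the original line.
    seq -f "$t = %g" 70000 | sed -e "/^$t/R /dev/stdin" -e "/^$t/d" "$1"
}

demo=$(mktemp)
printf 'Count\nName = Sarah\nCount = 123\nName = Mike\n' >"$demo"
renumber_sed "$demo"
rm -f "$demo"
```

One thing to keep in mind for much larger files: %g prints six significant digits, so a format such as %.0f may be a safer choice once the counter can reach a million.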
agc
  • 7,223
0

Pending sample input files, I think this should work:

gawk '($1=="Count"){print "Count = " (++i); next;} 1' /tmp/rand_file1

Short explanation:

  • on lines having Count as their first field: print a new Count statement with the incremented number. ++i starts at 1, whereas i++ would start at 0. In this case, also skip the rest of the processing and continue with the next input line.

  • on all lines (1): do the default action, which is to print the input line.

This should be faster since it touches every input line only once. In your existing solution, every match for Count makes sed -i rewrite the entire file.
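To check the difference on your own machine, a rough sketch along these lines may help. It builds a synthetic 70,000-line file and times a single awk pass over it. The record contents, the helper name make_records and the temporary file are all invented for illustration, and the timing has only one-second resolution.

```shell
# make_records N: print N hypothetical 5-line records in the question's layout.
make_records() {
    awk -v n="$1" 'BEGIN {
        for (r = 1; r <= n; r++)
            printf "Count\nName = user%d\nID = %d\nPhNo =\n\n", r, r
    }'
}

big=$(mktemp)
make_records 14000 >"$big"          # 14000 records x 5 lines = 70000 lines

start=$(date +%s)
awk '$1 == "Count" { printf("Count = %d\n", ++i); next } 1' "$big" >/dev/null
end=$(date +%s)
echo "single awk pass: $((end - start)) second(s)"

rm -f "$big"
```

Running the original while-read loop against a copy of the same generated file gives a like-for-like comparison.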