Number lines according to their content

Question

I would like to number lines according to their content: the first line gets number 1, the second line gets number 2 if it's identical to the first and number 1 if it's different, and so on. For example:

asdf
asdf
asdf
asdf
dfg
dfg
dfg
qwert
qwert
er
qwert
er
asdf

Should result in:

1   asdf
2   asdf
3   asdf
4   asdf
1   dfg
2   dfg
3   dfg
1   qwert
2   qwert
1   er
3   qwert
2   er
5   asdf

Incremental? You are resetting the counter every time there is a new item. Or is it a counter that should resume if the same token is encountered again? – Matteo Sep 05 '12 at 14:23
The question is underspecified. Please look at the comments on @JohnCC's answer, and update the question to clarify the ambiguity. – jw013 Sep 05 '12 at 16:09

score 4 · Answer 1 · answered Sep 05 '12 at 15:04

Even simpler with awk:

awk '{ print ++c[$0],$0 }' < test

Here test is the file that contains the data. I made a couple of assumptions that are not clear from the question. First, I assume the file is already sorted. If not, then:

sort < test | awk '{ print ++c[$0],$0 }'

Also, I assume that the whole line is significant, and not just the first word when a line contains more than one. If you just want to work on the first word, then:

awk '{ print ++c[$1],$0 }' < test
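
For readers who do not use awk, the ++c[$0] idiom above can be sketched in Python. This is only an illustration and not part of the original answer: the function name number_by_content is made up here, and the sample list is just the tail end of the question's input.

```python
from collections import defaultdict


def number_by_content(lines):
    """Pair each line with how many times that exact line has been seen so far."""
    seen = defaultdict(int)      # plays the role of awk's associative array c
    numbered = []
    for line in lines:
        seen[line] += 1          # equivalent to ++c[$0]
        numbered.append((seen[line], line))
    return numbered


if __name__ == "__main__":
    # Hypothetical sample: the last few distinct runs from the question's input.
    sample = ["qwert", "qwert", "er", "qwert", "er"]
    for count, line in number_by_content(sample):
        print(count, line)
```

Because each distinct line has its own counter, the count for qwert resumes at 3 after the intervening er, as in the question's expected output.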

answered Sep 05 '12 at 15:04 by JohnCC

But, if asdf occurs again, it will continue numbering, do I understand that correctly? But this was also not clear from the question. I like your approach. – Bernhard Sep 05 '12 at 15:18
Yes, correct. That was why I asked about sorting, since as you say, the question is not very clear. – JohnCC Sep 05 '12 at 15:20

score 1 · Answer 2 · answered Sep 05 '12 at 14:59

You could do this with awk:

number.awk

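BEGIN { OFS = "\t" }              # separate count and text with a tab

last == $1 { cnt += 1 }           # same first field as the previous line: keep counting
last != $1 { cnt  = 1 }           # first field changed: restart at 1

{ print cnt, $1; last = $1 }      # print count and first field, then remember the field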

Run like this:

awk -f number.awk infile
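
The script above restarts the count whenever the value changes, so a value that reappears later starts again at 1 rather than resuming. A rough Python sketch of that behaviour, for comparison only (the name number_runs is hypothetical, and it compares whole lines where number.awk compares only the first field):

```python
def number_runs(lines):
    """Number each line by its position within its run of identical consecutive lines."""
    numbered = []
    previous = None   # no previous line yet
    count = 0
    for line in lines:
        # Continue the run if the line repeats, otherwise restart at 1.
        count = count + 1 if line == previous else 1
        numbered.append((count, line))
        previous = line
    return numbered


if __name__ == "__main__":
    for count, line in number_runs(["qwert", "qwert", "er", "qwert", "er"]):
        print(count, line)
```

On this input the third qwert is numbered 1, whereas the question's expected output numbers it 3, so this approach fits only if the counter is meant to reset on every change.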

answered Sep 05 '12 at 14:59 by Thor

score 0 · Answer 3 · edited Sep 05 '12 at 22:26

You can iterate over the input and use a counter:

#!/bin/sh

counter=1
old=""

while IFS= read -r line ; do
    # check if the line is different from the previous one
    if [ "$line" != "$old" ] ; then
        counter=1
    fi
    old="$line"
    printf '%s\t%s\n' "$counter" "$line"
    counter=$((counter+1))
done

You can run the script with:

$ sh scriptname.sh < inputfile

edited Sep 05 '12 at 22:26 by Gilles 'SO- stop being evil'
answered Sep 05 '12 at 14:32 by Matteo

Thanks, how do I launch this from a tab delimited file? – martijn Sep 05 '12 at 14:57
Edited with tab delimited output – Matteo Sep 05 '12 at 15:21
@Gilles, thanks for the edit. Just a question: why is printf better than echo? – Matteo Sep 06 '12 at 05:16
@Matteo With echo -e, if an input line contains a backslash, it would have been interpreted as an escape character. The -r option to read is for the same reason, and IFS= is to retain leading and trailing whitespace, see Why is while IFS= read used so often, instead of IFS=; while read..? – Gilles 'SO- stop being evil' Sep 06 '12 at 13:28

score 0 · Answer 4 · answered May 17 '13 at 08:26

If you need something that works regardless of whether the input is clustered (i.e. all occurrences of X follow one another), you need a separate counter for each distinct X. For example, you can use the following as a filter or with a command-line parameter, writing to stdout:

#!/usr/bin/env python
import sys, collections
c = collections.Counter()
for line in sys.stdin if len(sys.argv) == 1 else open(sys.argv[1]):
    c[line] += 1
    sys.stdout.write("%s\t%s" % (c[line], line))
