Remove duplicates by adding numerical suffix

Question

How do I append a numerical suffix to lines to remove duplicates?

Pseudo code:

if currLine.startsWith("tag:")
  x = numFutureLinesMatching(currLine)
  if (x > 0)
    currLine = currLine + ${x:01}

Input file

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Desired output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Notes:

The final duplicate must remain unchanged.
The earlier duplicates can have any suffix to be unique, so I chose a countdown.
Awk seems to be a good choice, but any common scripted language will work.

A countdown takes more work to implement and more time to run than a count-up: you don't know how many of each tag you have until you've read the input once, so a countdown needs two passes over the input to determine the starting count(s). If the direction doesn't matter to you, implement a count-up, not a countdown. – Ed Morton, Sep 11 '23 at 14:15

FelixJN · Accepted Answer · 2023-09-10T22:46:05.760

awk arrays accept arbitrary strings as indices, even a whole record ("line").

Match the lines that start with tag: and count each distinct line in an array. The first occurrence is left unchanged, so the suffix is the running count minus one:

awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1'

To make it a countdown, use tac twice:

tac infile | 
awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1' |
tac

edited Sep 10 '23 at 22:46

answered Sep 10 '23 at 22:32

Perfect! This solution worked better in my outer script where I had to pipe to and from this script. – Steven Sep 11 '23 at 02:38

You could alternatively write it as awk '{print $0 (/^tag:/ && cnt[$0]++ ? sprintf("-%02d",cnt[$0]-1) : "")}' file using the shorthand /^tag:/ for $0 ~ /^tag:/ and avoiding assigning to $0 which could cause awk to unnecessarily re-split the record into fields. – Ed Morton Sep 11 '23 at 17:12
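
For anyone who wants to try the one-liner from the comment above without a file, here is a minimal sketch that wraps it in a shell function and feeds it a few lines on stdin. The function name suffix_countup and the shortened sample tags are my own placeholders, not part of the original comment.

```shell
#!/bin/sh
# suffix_countup: hypothetical wrapper name; reads stdin, writes stdout.
suffix_countup() {
    # cnt[$0]++ is 0 (false) the first time a tag line is seen, so the first
    # occurrence is printed as-is; later ones get the count minus one appended.
    awk '{ print $0 (/^tag:/ && cnt[$0]++ ? sprintf("-%02d", cnt[$0] - 1) : "") }'
}

printf '%s\n' tag:A val:1 tag:A val:2 tag:B tag:A | suffix_countup
```

This is a count-up, so it is the first occurrence that stays unchanged; wrap it in tac (or tail -r) on both sides to leave the last one unchanged instead.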

Gilles Quénot · Answer 2 · 2023-09-10T22:35:10.443

Here we go, exactly as required:

awk '
    NR==FNR{
        if (/^tag:/) {
            a[$1]++
        }
        next
    }
    {
        c=--a[$1]
        if (c>0) {
            printf "%s-%.2d\n", $1, c
        } else {
            print
        }
    }
' file file

With explanations:

awk '
    # first block: runs only while reading the first copy of the file
    NR==FNR{                           # true only for the first file argument
        if (/^tag:/)                   # if the line starts with tag:
            a[$1]++                    # count occurrences, keyed by column 1
        next                           # skip the second block during this pass
    }
    # second block: runs while reading the second copy of the file
    {
        c=--a[$1]                      # c = identical lines still to come
        if (c>0) {                     # an identical tag line appears later
            printf "%s-%.2d\n", $1, c  # %s = string, %.2d = integer zero-padded to 2 digits
        } else {
            print                      # otherwise print the line unchanged
        }
    }
' file file

Output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500
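
Because this reads the file twice (file file), it cannot take its input from a pipe. A minimal sketch of the same countdown that reads stdin once and buffers it in memory could look like the following. The function name suffix_countdown is a placeholder of mine, it keys on the whole line ($0) rather than $1, and it assumes the input fits in memory.

```shell
#!/bin/sh
# suffix_countdown: hypothetical name; buffers all of stdin, then prints it.
suffix_countdown() {
    awk '
        { line[NR] = $0 }                 # remember every line in input order
        /^tag:/ { left[$0]++ }            # count occurrences of each tag line
        END {
            for (i = 1; i <= NR; i++) {
                s = line[i]
                # after the decrement, left[s] is the number of identical
                # tag lines still to come; 0 means this is the last one
                if ((s ~ /^tag:/) && (--left[s] > 0))
                    s = sprintf("%s-%02d", s, left[s])
                print s
            }
        }
    '
}

# usage: some_command | suffix_countdown | other_command
```

The trade-off is memory for a second read: the END block plays the role of the second pass.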

score 0 · Answer 3 · answered Sep 10 '23 at 22:57

With perl:

#!/usr/bin/perl
use strict; use warnings;
use feature qw/say/;
my (%h, $c);
while (<>) {
    chomp;
    if (/^tag:/) {
        $c = sprintf "%.2d", ++$h{$_};
        if ($c>1) {
            say $_ . "-" . $c;
        } else {
            say;
        }
    } else {
        say $_;
    }
}

Usage:

./script file

Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-02
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-02
val:1000
tag:20230901-FAT-03
val:1500

Gilles Quénot

The output file isn't quite right here. Appreciate the effort though. – Steven Sep 11 '23 at 02:41
What is not quite right? Be precise. Removed duplicated as requested – Gilles Quénot Sep 11 '23 at 02:55
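
One way to make the objection precise is to check a result against the two requirements in the question: every tag: line must be unique, and the final occurrence of each duplicate must be unchanged. The sketch below is my own, not from either commenter. The helper name check_output is hypothetical, and it assumes the result has the same number of lines, in the same order, as the original.

```shell
#!/bin/sh
# check_output ORIGINAL RESULT: hypothetical helper; exits non-zero on a violation.
check_output() {
    awk '
        # first file: remember each line and where each tag line occurs last
        NR == FNR { orig[FNR] = $0; if (/^tag:/) last[$0] = FNR; next }
        # second file: inspect every tag line of the result
        /^tag:/ {
            if (seen[$0]++) {
                print "duplicate tag at line " FNR ": " $0
                bad = 1
            }
            if (last[orig[FNR]] == FNR && $0 != orig[FNR]) {
                print "final occurrence changed at line " FNR ": " $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1" "$2"
}

# usage: check_output file result.txt
```

Run against the Perl output above, this should flag the last LAX and FAT lines, since they received -02 and -03 instead of staying unchanged.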

jubilatious1 · Answer 4 · 2023-09-19T17:43:48.597

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %hash; put /^tag\:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;'   file

Above is the Raku version of an excellent awk answer posted by @EdMorton in a comment.

Call Raku at the command line with the -ne flags, which run the code once per input line without autoprinting. Before the per-line code runs, the BEGIN phaser declares %hash. The put ... statement then runs for every line. If the line matches /^tag\:/ (that is, it starts with tag:), the line is used as a key in %hash and its value is incremented with ++.

The && condition is the test part of Raku's "Test ?? True !! False" ternary operator. Because ++ is a post-increment, the first occurrence of a line evaluates as false. If true, the line $_ is output with its count minus one appended (the count is looked up with %hash{$_}). If false, the line is output unchanged.

Sample Input:

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Sample Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-01
val:1000
tag:20230901-FAT-02
val:1500

The code above implements a count-up suffix, leaving the earliest tag: lines unchanged. To implement a countdown suffix that leaves the final tag: lines unchanged, use tac twice as described in the accepted answer by @FelixJN. Below is that approach on macOS, which provides tail -r instead of tac:

~$ tail -r  Steve_suffix.txt | raku -ne 'BEGIN my %hash; put /^tag\:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;' | tail -r
tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500
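
If the same script has to run on both Linux and macOS, one option is a small wrapper that picks whichever line-reversing tool is installed. This is only a sketch: the function name reverse_lines is my own, and it assumes that a system without tac has a tail that supports -r (as the BSD and macOS versions do).

```shell
#!/bin/sh
# reverse_lines: hypothetical helper; prints stdin with the line order reversed.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac          # GNU coreutils
    else
        tail -r      # assumption: BSD/macOS tail
    fi
}

# usage: reverse_lines < file | raku -ne '...' | reverse_lines
```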

https://unix.stackexchange.com/a/114043
https://docs.raku.org/language/operators#infix_??_!!
https://raku.org