How do I append a numerical suffix to lines to remove duplicates?

Pseudo code:

if currLine.startsWith("tag:")
  x = numFutureLinesMatching(currLine)
  if (x > 0)
    currLine = currLine + "-" + zeroPad(x, 2)

Input file

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Desired output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Notes:

  1. The final duplicate must remain unchanged.
  2. The earlier duplicates can have any suffix to be unique, so I chose a countdown.
  3. Awk seems to be a good choice, but any common scripted language will work.
Steven
  • A countdown takes more work to implement and more time to run than a count-up: you don't know how many of each tag you have until you've read the whole input once, so a countdown needs two passes over the input to determine the starting count(s). If the direction doesn't matter to you, implement a count-up, not a countdown. – Ed Morton Sep 11 '23 at 14:15

4 Answers

awk arrays are associative, so any string can serve as an index, even a whole record ("line").

Match lines starting with tag: and count each distinct line. The first occurrence stays unchanged, so the suffix is the count minus one:

awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1'
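
If the ternary is hard to follow, the same count-up logic can be spelled out with an explicit if. This is only a sketch of the one-liner above; the function name suffix_countup and the array name seen are made up for illustration.

```shell
# Hypothetical wrapper: reads the named files, or stdin if none are given.
suffix_countup() {
  awk '
    /^tag:/ {
      # seen[$0] holds how many identical tag lines came before this one;
      # the post-increment returns that number and then counts this line
      if (seen[$0]++ > 0)
        $0 = sprintf("%s-%02d", $0, seen[$0] - 1)
    }
    { print }   # print every line, modified or not
  ' "$@"
}
```

It can be called as suffix_countup infile, or placed in the middle of a pipeline.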

To make it a countdown, use tac twice:

tac infile | 
awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1' |
tac
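
Where tac is not available, a possible alternative is to buffer the input inside awk and walk it backwards in the END block. This is a sketch rather than a tested replacement for the pipeline above: the names suffix_countdown, line and later are invented here, and it assumes the whole input fits in memory.

```shell
# Hypothetical wrapper: reads the named files, or stdin if none are given.
suffix_countdown() {
  awk '
    { line[NR] = $0 }                 # buffer every input line
    END {
      # walk backwards so the last duplicate is met first and stays unchanged
      for (i = NR; i >= 1; i--) {
        key = line[i]
        # later[key] counts identical tag lines that appear further down
        if (key ~ /^tag:/ && later[key]++)
          line[i] = sprintf("%s-%02d", key, later[key] - 1)
      }
      for (i = 1; i <= NR; i++)       # print in the original order
        print line[i]
    }
  ' "$@"
}
```

Because it reads its input only once, it also works on piped input: some_command | suffix_countdown.
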
FelixJN
  • Perfect! This solution worked better in my outer script where I had to pipe to and from this script. – Steven Sep 11 '23 at 02:38
  • You could alternatively write it as awk '{print $0 (/^tag:/ && cnt[$0]++ ? sprintf("-%02d",cnt[$0]-1) : "")}' file, using the shorthand /^tag:/ for $0 ~ /^tag:/ and avoiding the assignment to $0, which could cause awk to unnecessarily re-split the record into fields. – Ed Morton Sep 11 '23 at 17:12

Here we go, exactly as required:

awk '
    NR==FNR{
        if (/^tag:/) {
            a[$1]++
        }
        next
    }
    {
        c=--a[$1]
        if (c>0) {
            printf "%s-%.2d\n", $1, c
        } else {
            print
        }
    }
' file file

With explanations:

awk '
    # first block: runs while reading the first copy of the file
    NR==FNR{                           # true only for the first file
        if (/^tag:/)                   # if the line starts with tag:
            a[$1]++                    # increment array a with key as column 1
        next                           # stop processing this line
    }
    # second block: runs while reading the second copy of the file
    {
        c=--a[$1]                      # c = decrement array a with key as column 1
        if (c>0) {                     # later duplicates remain, so add a suffix
            printf "%s-%.2d\n", $1, c  # %s = string, %.2d = integer zero-padded to 2 digits
        } else {
            print                      # else, print current line
        }
    }
' file file

Output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

With Perl:

#!/usr/bin/perl

use strict; use warnings; use feature qw/say/;

my (%h, $c);
while (<>) {
    chomp;
    if (/^tag:/) {
        # count occurrences of this exact tag line, zero-padded to 2 digits
        $c = sprintf "%.2d", ++$h{$_};
        if ($c > 1) {
            say $_ . "-" . $c;
        } else {
            say;
        }
    } else {
        say $_;
    }
}

Usage:

./script file

Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-02
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-02
val:1000
tag:20230901-FAT-03
val:1500

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %hash; put /^tag\:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;'   file

Above is the Raku version of an excellent awk answer posted by @EdMorton in a comment.

Call Raku at the command line with the -ne (non-autoprinting, linewise) flags. The BEGIN phaser declares %hash once, before any input is read. The put statement then runs for each input line. If the line starts with tag: (the regex /^tag\:/), the line is used as a key in %hash and its value is incremented with ++.

This && conditional is the test of Raku's "Test ?? True !! False" ternary operator. If True, the line $_ is output with its count minus one appended as a zero-padded suffix (the count is looked up with %hash{$_}). If False, the line is output unchanged.

Sample Input:

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Sample Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-01
val:1000
tag:20230901-FAT-02
val:1500

The code above implements a count-up suffix, leaving the earliest tag: lines unchanged. To implement a count-down suffix that leaves the final tag: lines unchanged, use tac twice as shown in the accepted answer by @FelixJN. Below is the same approach on macOS, which uses tail -r instead of tac:

~$ tail -r  Steve_suffix.txt | raku -ne 'BEGIN my %hash; put /^tag\:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;' | tail -r
tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

https://unix.stackexchange.com/a/114043
https://docs.raku.org/language/operators#infix_??_!!
https://raku.org

jubilatious1