9

This question/answer has some good solutions for deleting identical lines in a file, but they won't work in my case since the otherwise-duplicate lines have a timestamp.

Is it possible to tell awk to ignore the first 26 characters of a line in determining duplicates?

Example:

[Fri Oct 31 20:27:05 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:10 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:13 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:16 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:21 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:22 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:23 2014] The Brown Cow Jumped Over The Moon
[Fri Oct 31 20:27:24 2014] The Brown Cow Jumped Over The Moon

Would become

[Fri Oct 31 20:27:24 2014] The Brown Cow Jumped Over The Moon

(keeping the most recent timestamp)

a coder
  • 3,253
  • 4
    Yes. If you were to post some example input and output, then this might amount to a question. – jasonwryan Nov 03 '14 at 16:21
  • 3
    When asking this type of question, you need to include your input and your desired output. We can't help if we have to guess. – terdon Nov 03 '14 at 16:24
  • 1
    "yes" or "no" seems to be an acceptable answer, what are you going to do with that knowledge? In case of no, extend awk? – Anthon Nov 03 '14 at 16:32
  • 1
    Wow. 80,000 rep claim this was an unusable question (I would not call it a good one) but not a single close vote? – Hauke Laging Nov 03 '14 at 16:45
  • 6
    @HaukeLaging it seems reasonable to give the OP the chance to react to our comments. They have now done so and the question is greatly improved. – terdon Nov 03 '14 at 17:39

5 Answers

15

You can just use uniq with its -f option:

uniq -f 4 input.txt

From man uniq:

  -f, --skip-fields=N
       avoid comparing the first N fields

Note that this will keep the first line of each group of duplicates:

[Fri Oct 31 20:27:05 2014] The Brown Cow Jumped Over The Moon

If that is a problem, you can do:

tac input.txt | uniq -f 4

or if you don't have tac but your tail supports -r:

tail -r input.txt | uniq -f 4
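
For example, with the question's sample saved as input.txt, the reversed pipeline keeps only the most recent entry:

tac input.txt | uniq -f 4

prints

[Fri Oct 31 20:27:24 2014] The Brown Cow Jumped Over The Moon

With several distinct messages the result comes out in reverse chronological order; pipe through tac once more at the end if the original order matters.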
Whymarrh
  • 175
Anthon
  • 79,293
4
awk '!seen[substr($0,27)]++' file
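
For readers unfamiliar with the !seen[...]++ idiom, here is a functionally equivalent expanded form (same logic, just spelled out):

awk '{
    key = substr($0, 27)   # everything after the 26-character timestamp prefix
    if (!seen[key]++)      # true only the first time this key is encountered
        print              # so only the first (oldest) matching line is printed
}' file

If you want the most recent line for each message instead, you could reverse the input with tac before this command and reverse the output again afterwards, as in the uniq answer above.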
Hauke Laging
  • 90,279
  • This solution does not cover the timestamp part as that was not part of the question when this answer was written. – Hauke Laging Nov 03 '14 at 17:18
  • 2
    This is exactly why many of us work to close these until the Q's have been fully fleshed out. Otherwise these Q's are wasting your time and the OP's. – slm Nov 03 '14 at 18:30
3

Try this one:

awk -F ']' '{a[$2]=$1}END{for(i in a){print a[i]"]"i}}'
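
Here -F ']' makes $1 the timestamp part and $2 the message text, a[$2]=$1 lets later timestamps overwrite earlier ones for the same message, and the END block reassembles each surviving line. Assuming the sample input is in input.txt,

awk -F ']' '{a[$2]=$1}END{for(i in a){print a[i]"]"i}}' input.txt

prints

[Fri Oct 31 20:27:24 2014] The Brown Cow Jumped Over The Moon

Note that with more than one distinct message, for (i in a) iterates in an unspecified order, so the output order may differ from the input.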
jimmij
  • 47,140
0

A perl solution:

perl -F']' -anle '$h{$F[1]} = $_; END{print $h{$_} for keys %h}' file
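
This works much like the awk answer above: -F ']' together with -a autosplits each line on ] into @F, $F[1] is the message text, and the last line stored for each message wins. On the question's sample data it prints:

[Fri Oct 31 20:27:24 2014] The Brown Cow Jumped Over The Moon

As with the awk version, keys %h returns the keys in no particular order.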
cuonglm
  • 153,898
0

One can use the power of Vim:

:g/part of duplicate string/d

Very easy. If you have a couple more files (such as gzipped rotated logs), Vim will open them without any preliminary decompression on your side, and you can repeat the last command by pressing : followed by the up-arrow key, just like recalling the last command in a terminal. A rough sketch of that workflow is shown below.
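
As a sketch (the log file names here are only placeholders):

vim app.log app.log.1.gz app.log.2.gz
:g/part of duplicate string/d
:n

Vim opens the gzipped logs transparently via its standard gzip plugin, :g/.../d deletes every matching line in the current buffer, and :n moves on to the next file in the argument list, where : plus the up arrow recalls the previous command.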