2

Is there any way to use uniq (or similar) to filter/remove sets of repeating lines from log-type output? I am debugging an MPI code where multiple processors often print exactly the same output. uniq works great when the output is one line, but the code frequently generates multiple lines. Here's an example:

calling config()
calling config()
calling config()
running main loop
  time=0
running main loop
  time=0
running main loop
  time=0
output from Rank 0

With uniq (without options), this gets filtered to:

calling config()
running main loop
  time=0
running main loop
  time=0
running main loop
  time=0
output from Rank 0

Is there an easy way to filter n-line blocks? I've read and reread the man page but can't find anything obvious. Thanks!

UPDATE: I'd like the output to have duplicated blocks condensed down to a single entry, so in the case of the example above:

calling config()
running main loop
  time=0
output from Rank 0
rheo
  • 31
  • 4

2 Answers

5
$ awk '!a[$0]++' file
calling config()
running main loop
  time=0
output from Rank 0
Hauke Laging
  • 90,279
  • This worked great! Short and sweet and I was able to pipe the output from the program into it to watch realtime logs. Thanks! – rheo May 08 '20 at 03:44
  • That will omit any line that was ever mentioned before, not just repeated blocks. – Simon Richter May 08 '20 at 09:53
  • @SimonRichter - that's definitely undesirable! The idea is to work like uniq does, but with multiple lines. As soon as something new appears, the "history" should be reset. – rheo May 08 '20 at 15:31
  • The difficulty here is to identify "blocks" -- can they be defined as "consecutive lines where all but the first is indented"? – Simon Richter May 08 '20 at 15:43
  • @SimonRichter - the indent isn't really important, just multiple identical lines. So if the pattern was B,C,D,A,B,C,D,B,C,D,B,C,D,E (on separate lines), the result would be B,C,D,A,B,C,D,E. – rheo May 08 '20 at 16:18
  • @rheo, what would be the desired output for B,C,D,A,B,C,D,B,C,B,C,D,E, i.e. with the third D missing? Would that (retroactively) mean that the second D should have been output since B,C,D obviously no longer belong together? – Simon Richter May 08 '20 at 18:11
  • @SimonRichter -- I suppose in that case it would be B,C,D,A,B,C,D,B,C,D,E; the repeating string there is now [B,C]. The reduction doesn't need to be done recursively to remove the resulting repetition. The original use case here was to reduce the identical output from multiple threads/MPI-ranks to make the output less noisy. If the output is slightly different, it's probably important, and therefore shouldn't be filtered in that case. – rheo May 12 '20 at 20:32
  • @rheo, I'd probably go for two processing passes then, to avoid having to go backwards in time. Pass 1 would identify groups of lines that always occur together, but not generate output, to allow later input lines to change the assessment; pass 2 would then generate an automaton from the gathered information and apply it. If the output is somewhat stable over different runs, that might even be reusable (and the automaton could have error states that tell us that it needs to be rebuilt). – Simon Richter May 13 '20 at 10:11
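
For the behaviour described in the comments above (collapse a block only when the same block repeats immediately after itself), here is a minimal sketch rather than a definitive solution. It assumes the whole log fits in memory, so it is not a streaming filter, and it tries the shortest block length first. The function name uniq_blocks and the optional block-length cap (default 50) are hypothetical names chosen for illustration; neither comes from uniq nor from either answer.

```shell
#!/usr/bin/env bash
# uniq_blocks: collapse immediately repeated n-line blocks read from stdin.
# $1 is an optional cap on the block length in lines (assumed default: 50).
uniq_blocks() {
  awk -v max="${1:-50}" '
    { line[NR] = $0 }                      # buffer the whole input
    END {
      i = 1
      while (i <= NR) {
        skipped = 0
        # Try block lengths 1..max that still leave room for a second copy.
        for (n = 1; n <= max && i + 2*n - 1 <= NR; n++) {
          same = 1
          for (k = 0; k < n; k++)
            # Appending "" forces a string comparison, not a numeric one.
            if ((line[i+k] "") != (line[i+n+k] "")) { same = 0; break }
          # The next n lines repeat this block: skip this copy, keep the later one.
          if (same) { i += n; skipped = 1; break }
        }
        if (!skipped) { print line[i]; i++ }
      }
    }'
}
```

With the pattern from the comments, `printf '%s\n' B C D A B C D B C D B C D E | uniq_blocks` prints B, C, D, A, B, C, D, E on separate lines. The reduction is not applied recursively, and any output that differs even slightly is left untouched.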
3

From the uniq man page:

Note: 'uniq' does not detect repeated lines unless they are adjacent.

But you can do it with a short bash script like this:

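#!/usr/bin/env bash
set -euo pipefail

declare -r file=${1:?Please enter a filename to treat as first parameter}

# Note: this prints each distinct line once, at its first occurrence, so any
# line seen anywhere earlier in the file is dropped, not only adjacent repeats.
linenum=0
# Read line by line so blank lines and leading whitespace are preserved.
while IFS= read -r line || [[ -n ${line} ]]; do
  linenum=$((linenum + 1))
  # Count exact whole-line matches (-x, -F) among the lines read so far.
  freq=$(sed -n "1,${linenum}p" "${file}" | grep -cxF -e "${line}")
  if [[ ${freq} -eq 1 ]]; then
    printf '%s\n' "${line}"
  fi
done < "${file}"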

In your case, this produces:

calling config()
running main loop
  time=0
output from Rank 0
Uggla
  • 61
  • 3