
There is a file with a number of repetitive entries whose leading fields are identical, which looks something like this:

123 abc nhjk
123 abc cftr
123 abc xdrt
123 def nhjk
123 def cftr
123 def xdrt

If the combination of columns Field1 and Field2 matches, then only the first occurrence where they both match needs to be retained. So, since 123 and abc of the first line match 123 and abc of the second line, only the first line is to be retained. On further comparison, since the match also holds between the first and third lines, again only the first line is to be retained.

However, for the first and fourth lines, 123 and 123 match but abc and def do not, so both lines are to be retained.

So the final output should look like this:

123 abc nhjk
123 def nhjk
– Cartwig

5 Answers


One way to do it is with the -u flag to sort, although this may end up not preserving the original file order:

sort -k1,1 -k2,2 -u file

If you need the de-duplication done with the file order preserved:

awk '!a[$1, $2]++' file
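
For the question's sample data, both commands produce the same two lines (the awk version is the one that also preserves input order in general). A quick check:

    $ awk '!a[$1, $2]++' file
    123 abc nhjk
    123 def nhjk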
– iruvar

Awesome answers from RobertL and 1_CR.

If you prefer a shell script approach that will be more flexible, you can try the following script:

#!/bin/sh

rm -f output.txt    # -f so the first run doesn't fail when output.txt doesn't exist yet
touch output.txt
while read -r line
do
    field1=$(echo "$line" | cut -d" " -f1)
    field2=$(echo "$line" | cut -d" " -f2)
    lookup="$field1 $field2"
    # keep the line only if its first two fields haven't been written out already
    if [ -z "$(grep "^$lookup " output.txt)" ]
    then
        echo "$line" >> output.txt
    fi
done < input.txt
cat output.txt
exit 0
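
With the sample data saved as input.txt (the script's filenames are hardcoded), a run looks like this (dedup.sh is just a placeholder name for the script):

    $ sh dedup.sh
    123 abc nhjk
    123 def nhjk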

Obviously it can be shortened a lot, but I wanted to make every step very clear.

Enjoy.

EDIT:

Following the link posted by @RobertL, and after testing several options, I have to agree that this is a huge improvement over my script. I would use:

#!/bin/sh

sort -k1,2 -u "$@" |
while read line
do
     echo "$line"
done
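
Since the script now passes "$@" through to sort, it works on named files as well as on standard input, unlike the hardcoded input.txt above. A quick sketch (process.sh is a placeholder name):

    $ sh process.sh input.txt
    123 abc nhjk
    123 def nhjk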

My only question about this is for RobertL: why use

sort -k1,1 -k2,2 -u

instead of

sort -k1,2 -u

According to my own testing, your sort is efficient:

$ cat robertL.sh
#!/bin/sh

sort -k1,1 -k2,2 -u "$@" |
while read line
do
     echo "$line"
done

$ time ./robertL.sh < input.txt

123 abc nhjk
123 def nhjk

real    0m0.022s
user    0m0.014s
sys     0m0.009s

But the other one is twice as fast:

$ cat process_v2.sh
#!/bin/sh

sort -k1,2 -u "$@" |
while read line
do
     echo "$line"
done

$ time ./process_v2.sh < input.txt

123 abc nhjk
123 def nhjk

real    0m0.012s
user    0m0.006s
sys     0m0.009s

So, in conclusion, RobertL's approach is highly recommended, but always take everything here as an example, not as an absolute truth or a final solution to your question. I think the best thing here is to find guidance through the answers.

  • I take issue with hardcoded filenames, but even aside from that, see here: http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice/169765#169765 – Wildcard Oct 24 '15 at 17:54
  • Thanks @Wildcard, I'll take a look at it and get as much as I can. Indeed I'm still a beginner when it comes to shell scripting, but I'm doing my best to learn something every day. – Jesus A. Sanchez Oct 24 '15 at 18:03
  • Thank you Jesus, the if [[ -z $(grep "$lookup" output.txt) ]] bit helped me with my problem! – Cartwig Oct 25 '15 at 04:25
  • Warning! Very inefficient! I advise against using if [[ -z $(grep "$lookup" output.txt) ]]. It is very, very inefficient: it's a linear search which takes longer each time a record is added. It works great on small data sets, but upon scaling up you will have a noticeable performance hit. If you need to process the records one by one, you should read the output of one of @1_CR's solutions. I'm updating my original answer with an example. – RobertL Oct 25 '15 at 15:56
  • I take issue with exit 0. Why do you want to hide error status? – RobertL Oct 25 '15 at 17:21
  • I take issue with using [[ ... ]] because [ ... ] accomplishes the same result in this case and is also a shell built-in (/bin/sh, dash). [[ is a non-universal shell extension which lowers portability and simplicity, with no advantage in this case. – RobertL Oct 25 '15 at 18:39
  • I really don't want to say it, but this script exemplifies how not to approach this problem. I hope that @jesus-a-sanchez would re-examine, analyze, and improve this answer. – RobertL Oct 25 '15 at 18:43
  • I did @RobertL, could you please explain a little the meaning of "-k1,1 -k2,2" and why not use "-k1,2"?? Thanks a lot, I'm trying to learn here too :) – Jesus A. Sanchez Oct 26 '15 at 04:01
  • *Kudos on your edits @jesus-a-sanchez!* Actually, we should ask @1_CR about the -k options, because I used their code for the test. However, I think the -k1,1 notation means to stop the compare at the end of field 1, instead of stopping at the end of the line. I don't think that accounts for the difference in speed. You may want to use a larger data set, like thousands of lines. Also, alternately run both tests several times to smooth out time differences from various other system activities such as buffering and other processes. – RobertL Oct 26 '15 at 04:13
  • Thank you for the comments and suggestions! I have managed to improve my code based on your guidance. – Cartwig Oct 28 '15 at 02:51

If you need to intensely process each record of the output, you can create a filter which reads each line of the output. Do not process the records inside the sort/unique algorithm.

The original script takes about 1 second to process every 100 records. The script which reads the output of the sort took less than 3/10 of a second to process over 380,000 records. It would have taken the original script over one hour to process that amount of data.

One Hour compared to 3/10 of one second!

Also notice that the original script spent most of its time in system time (forking processes, doing I/O, etc.), another very bad sign of performance issues.

Execution of the original script:

    $ wc -l input.txt 
    1536 input.txt
    $ time ./jesus.sh
    rm: cannot remove ‘output.txt’: No such file or directory
    123 abc nhjk
    123 def nhjk

    real    0m16.997s              #<<<---------
    user    0m3.546s
    sys 0m16.329s                  #<<<---------

Execution of the new example script, where only a small fraction of the runtime is spent in OS system code:

    $ time ./RobertL.sh < input.txt
    123 abc nhjk
    123 def nhjk        

    real    0m0.011s               #<<<---------
    user    0m0.004s
    sys 0m0.007s                   #<<<---------

Now we run the new script on a huge dataset that we know would take the original script over 1 hour to complete:

    $ wc -l data388440.txt 
    388440 data388440.txt
    $ time ./RobertL.sh < data388440.txt 
    123 abc nhjk
    123 def nhjk        

    real    0m0.282s               #<<<---------
    user    0m0.728s
    sys 0m0.032s                   #<<<---------

The new example script:

    $ cat RobertL.sh
    #!/bin/sh

    sort -k1,1 -k2,2 -u "$@" |
    while read line
    do
         echo "$line"
    done
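
The echo in the loop is only a stand-in: the point is that any heavier per-record work belongs there, downstream of the sort. A sketch of what that might look like (the field splitting and the printf format are purely illustrative):

    sort -k1,1 -k2,2 -u "$@" |
    while read -r field1 field2 rest
    do
         # real per-record processing goes here instead of a bare echo
         printf 'key=%s %s value=%s\n' "$field1" "$field2" "$rest"
    done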

The original script, modified to run without installing ksh:

    $ cat jesus.sh
    #!/bin/bash
    #!/bin/sh  # does not accept [[ ... ]]
    #!/bin/ksh # not installed on ubuntu by default

    rm output.txt
    touch output.txt
    while read line
    do
        field1=$( echo $line | cut -d" " -f1)
        field2=$( echo $line | cut -d" " -f2)
        lookup="$field1 $field2"
        if  [[ -z $(grep "$lookup" output.txt) ]]
        then
            echo $line >> output.txt
        fi
    done < input.txt
    cat output.txt
    exit 0

The input data was created by repeating the original 6 lines of sample data, so the data consisted almost entirely of duplicate records.
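
One way to generate such a file is a simple loop (a sketch; sample6.txt is a hypothetical file holding the question's six sample lines, and 64740 repetitions × 6 lines = 388,440 lines):

    $ for i in $(seq 64740); do cat sample6.txt; done > data388440.txt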

– RobertL
  • Thank you Robert for such an extensive breakdown! I've been part of this community for only a few days and am overwhelmed by the guidance here.

    The initial code took about 15 seconds to process about 15000 lines of data. The improved code only takes 0.4 seconds!!

    – Cartwig Oct 28 '15 at 06:53

How about this sed one-liner:

sed -n '${;p;q;};N;/^ *\([^ ][^ ]*  *[^ ][^ ]*\)\( .*\)*\n *\1/{;s/\n.*//;h;G;D;};P;D' inputfile

This was a nice tricky challenge; thanks! :)

At a high level, what this does is iterate through the inputfile comparing two lines at a time. If the lines match up to the first two words, the second line of the two is discarded and the next line from the file is taken to compare to the first line. If the lines don't match, the first is printed and the second retained for comparison with later lines. When the end of the file is reached, the line currently "held for comparison" is printed.
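
On the question's sample input it produces the expected two lines (a quick sketch of the run):

    $ sed -n '${;p;q;};N;/^ *\([^ ][^ ]*  *[^ ][^ ]*\)\( .*\)*\n *\1/{;s/\n.*//;h;G;D;};P;D' inputfile
    123 abc nhjk
    123 def nhjk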

Blow-by-blow explanation:

-n  doN't print lines by default; only print when explicitly told to.

${;p;q;};   if on the la$t line then Print the line and Quit.
N;  append a newline followed by the Next line of the file to the pattern space
/^ *\([^ ][^ ]*  *[^ ][^ ]*\)\( .*\)*\n *\1/    A very tricky regex:
    match any leading spaces, followed by a nonspace sequence, space or
    multiple spaces, nonspace sequence, then optionally a space followed
    by anything, then a newline, then any leading spaces, then the matched
    two words from earlier again.
{;  if that regex matched the pattern space, execute the following.
s/\n.*//;   delete the first newline and everything after it
h;  copy the pattern space contents to the Hold space
G;  append (Get) a newline followed by the hold space contents to the pattern space
D;  delete everything in the pattern space up to the first newline, then start from the beginning of this sequence (with the ${ block)
};  end of block.  Skip to here if the tricky regex didn't match.
P;  Print everything in the pattern space up to the first newline.
D   Delete the pattern space up to the first newline.

Note that the above is very portable, deliberately so. Just for a challenge I wanted it to run without ? or + being available (they are not part of POSIX basic regular expressions), which makes the regex much more finicky.

In addition, the logic flow doesn't include any branches, even though branches are POSIX compatible and universally available. Why did I do this? Because not all implementations of sed allow labels to be specified in a one-liner; they require a \ and a newline after the label. GNU sed allows labels in an actual one-liner; BSD sed, for example, does not.

The following two one-liners are each an exact equivalent using GNU sed; the only difference is that they are more robust, handling tabs as well as spaces:

sed -n ':k;${;p;q;};N;/^\s*\(\S\+\s\+\S\+\)\(\s.*\)\?\n\s*\1/{;s/\n.*//;bk;};P;D' inputfile
sed -n ':k;${;p;q;};N;s/^\(\s*\(\S\+\s\+\S\+\)\(\s.*\)\?\)\n\s*\2.*$/\1/;tk;P;D' inputfile

I mostly did this for fun. :) I think 1_CR's answer is the best, and of course it's simplest by far.

If your requirements get a little more tricky than they currently are and his approach won't work, the best tool is probably awk. But I haven't learned awk yet and I have learned sed. :)

– Wildcard

If the lines to be removed are all consecutive and the keys are the same length, then you could use uniq's --check-chars option; here the first two fields and their separators occupy exactly the first 8 characters:

$ uniq --check-chars=8 <<EOF
123 abc nhjk
123 abc cftr
123 abc xdrt
123 def nhjk
123 def cftr
123 def xdrt
EOF
123 abc nhjk
123 def nhjk
$
– RobertL
  • Thank you RobertL. The keys are of different lengths, I should have mentioned it earlier. Sorry about that. – Cartwig Oct 25 '15 at 04:26
  • No problem. uniq is a simple, useful, and often overlooked utility. Obviously @1_CR's solutions are the most accurate, readable, brief, and efficient possible in the current state of the art! Seems the tie breaker would be whether you want sorted or original order. Good luck to you. – RobertL Oct 25 '15 at 15:41