1

How can I delete any line that consists of 2 or more words run together (not separated by a space) from a text file?

The file also contains the "single versions" of those words.

For example, for:

alpha
beta
gama
alphabeta
zeta
gamabeta

The output should be:

alpha
beta
gama
zeta

Edit: note that my file contains 1.5M lines

lina32
  • 47
  • 3

5 Answers

2

This prints all words in the file that are not combinations of any two words in the file:

$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file
alpha
beta
gama
zeta

For those who prefer their commands split over multiple lines:

awk '
    {
    one[NR]=$1
    }
END{
    for (i=1;i<=length(one);i++)
        for (j=1;j<=length(one);j++)
            two[one[i] one[j]]
    for (i=1;i<=length(one);i++)
        if (!(one[i] in two))
            print one[i]
 }' file

Another example

Let's consider a file with similar words but with the combinations sometimes appearing before the individual words:

$ cat file2
alphabeta
alpha
gammaalpha
beta
gamma

Running our same command still produces the correct result:

$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file2
alpha
beta
gamma

How it works

  • one[NR]=$1

    This creates an array one with keys being the line numbers, NR, and values being the word on that line.

  • END{...}

    The commands in curly braces are performed after we have finished reading in the file. These commands consist of two loops. This first loop is:

     for (i=1;i<=length(one);i++)
          for (j=1;j<=length(one);j++)
              two[one[i] one[j]]
    

    This creates array two with keys made from every combination of two words in the file.

    The second loop is:

      for (i=1;i<=length(one);i++)
          if (!(one[i] in two))
              print one[i]
    

    This loop prints out every word in the file that does not appear as a key in array two.
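
To see what the first loop builds, here is a minimal sketch that prints the keys of two instead of storing them. The function name list_pairs is made up for this illustration, and it uses NR rather than length(one) as the loop bound, which is equivalent here because one holds exactly one entry per input line.

```shell
# list_pairs (hypothetical name): read one word per line on stdin and
# print every ordered concatenation of two words, i.e. the keys of "two".
list_pairs() {
    awk '{one[NR]=$1}
    END{
        for (i=1;i<=NR;i++)
            for (j=1;j<=NR;j++)
                print one[i] one[j]
    }'
}

# Two input words give four concatenations.
printf '%s\n' alpha beta | list_pairs
```

For the two-word input this prints alphaalpha, alphabeta, betaalpha and betabeta. In general n input lines yield n*n keys, which is why memory use grows so quickly with file size.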

Shorter Simpler Version

This version uses shorter code and prints out the same words. The disadvantage is that the words are not guaranteed to be in the same order as in the input file:

$ awk '{one[$1]} END{for (w1 in one) for (w2 in one) two[w1 w2]; for (w in one) if (!(w in two)) print w}' file
gama
zeta
alpha
beta

More memory-efficient approach

For large files, the above methods could easily overflow memory. In these cases, consider:

$ sort -u file | awk '{one[$1]} END{for (w1 in one) for (w2 in one) print w1 w2}' >doubles
$ grep -vxFf doubles file
alpha
beta
gama
zeta

This uses sort -u to remove any duplicated words from file and then creates a file of possible double words called doubles. Then, grep is used to print the lines in file which are not in doubles.
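
As a possible variation, the final grep can be replaced by comm working on sorted lists. The sketch below is only an illustration under a few assumptions: the helper name remove_doubles and the temporary file names are invented here, everything is sorted under LC_ALL=C so that sort and comm agree on ordering, and the result comes out in sorted order rather than in the order of the input file.

```shell
# remove_doubles (hypothetical name): print the words of "$1" that are
# not the concatenation of two words from the same file.
remove_doubles() {
    tmpdir=$(mktemp -d) || return 1
    # Unique single words, sorted byte-wise so comm can consume them.
    LC_ALL=C sort -u "$1" > "$tmpdir/singles"
    # Every ordered pair of words, concatenated and sorted the same way.
    awk '{one[$1]} END{for (w1 in one) for (w2 in one) print w1 w2}' \
        "$tmpdir/singles" | LC_ALL=C sort -u > "$tmpdir/doubles"
    # -23 keeps only the lines that appear in singles but not in doubles.
    LC_ALL=C comm -23 "$tmpdir/singles" "$tmpdir/doubles"
    rm -rf "$tmpdir"
}
```

Calling remove_doubles file on the sample input prints alpha, beta, gama and zeta. This does not reduce the number of pairs that must be generated: for 1.5M unique lines that is still 1.5M × 1.5M = 2.25 × 10^12 concatenations, so it mainly helps when the number of unique words is moderate.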

John1024
  • 74,655
  • Thank you. The best answer is Mr. John1024's, but it only works well on very small files. My file contains 1.5M lines, so the Ubuntu terminal killed the shell after 10 minutes.

    Please, what is the best solution for this problem?

    – lina32 Jul 13 '20 at 12:29
  • 1
    @renola32 That's a large number of lines. Most people have a vocabulary of well less than 10,000 words. How many unique lines in your file? – John1024 Jul 13 '20 at 21:05
  • @renola32 I just added a new method to the end of the answer which should be more memory efficient. – John1024 Jul 14 '20 at 03:02
  • Mr. John1024, thank you. All of the lines are unique. What about saving after every 10 deletions and then looping: 10 deletions, save, loop again? – lina32 Jul 14 '20 at 05:52
  • @renola32 So every line is unique? That's depressing. If so, then I think you are right that a more complex approach is needed. – John1024 Jul 14 '20 at 06:02
  • @renola32 What do the words look like? Are they all lower case letters? Are they all the same length? Or, what is the range of character lengths for the words? – John1024 Jul 14 '20 at 06:07
  • Yes, lower case. The length ranges from 4 to 20, and the characters are all letters. Regards – lina32 Jul 14 '20 at 06:12
  • Instead of grep -vFf doubles (you're missing -x btw), it would be more efficient to use comm (on sorted lists) – Stéphane Chazelas Jul 15 '20 at 10:55
2
<file awk 'NF {print length "\t" $0}' | sort -k1n,1 | cut -f2- |
awk 'NR==1 {min=length}
(l=length) >= 2*min {
  delete k; # clear k array
  k[1];
  while (length(k))
    for (i in k) {
      for (j=l-i+1; j>=min; --j)
        if (substr($0,i,j) in seen) {
          if (i+j-1==l)
            next;
          k[i+j];
        }
      delete k[i];
    }
}
!seen[$0]++'

Lines that are made up entirely of previously seen lines will not be printed.

Works by checking substrings for presence of already seen strings.

It's required that the input file is sorted by line length from shortest to longest. awk | sort | cut does that.

The next awk program starts by noting the length of the shortest line (stored as min). Any line whose length is less than 2*min does not need to have its substrings checked. Instead, it can be added to seen array hash & printed (!seen[$0]++ is used as a condition to print non-duplicates, more info: How does awk '!a[$0]++' work?). min can also be used as a cutoff length when checking substrings.

When scanning lines for substrings, any new possible starting positions must be noted. This is done using an array k to store these offsets. The substrings are scanned and checked for existence as keys of the seen array. When a seen string is found:

  • If substring is at end of line, go to next line of input. The line is not printed or added to seen array.
  • Otherwise, add the next starting position to k and continue scanning for further substrings.
  • Keep trying as long as new starting positions are found (while (length(k))).
  • If the above loop did not advance to next line, the line is added to seen array hash (and printed if not already seen).
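
The length-sorting stage can be tried on its own. The following is a small sketch of just that first pipeline; the function name sort_by_length is invented for this example, and LC_ALL=C is added to sort here only to make the order of equal-length lines predictable.

```shell
# sort_by_length (hypothetical name): print the non-empty lines of "$1"
# ordered from shortest to longest.
sort_by_length() {
    # Prefix each line with its length, sort numerically on that prefix,
    # then strip the prefix again.
    awk 'NF {print length "\t" $0}' "$1" | LC_ALL=C sort -k1n,1 | cut -f2-
}
```

On the sample file from the question, sort_by_length file lists the 4-letter words first, then alpha, then gamabeta, then alphabeta, so every possible component of a line has already been read by the time the line itself is checked.
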
guest
  • 2,134
2

With reasonably short files and assuming the lines don't contain ERE operators:

$ LC_ALL=C grep -vxE "($(paste -sd '|' file)){2,}" file
alpha
beta
gama
zeta

Returns the lines that don't contain sequences of 2 or more of any of the lines in file.

It works by constructing a grep command like:

LC_ALL=C grep -vxE '(alpha|beta|gama|alphabeta|zeta|gamabeta){2,}' file

For larger files, you'll run into the limit on the length of arguments+environment (or of a single argument on Linux). That could be worked around by using -f - to pass the regexp via stdin rather than arguments, but even then, you'd run into limits on the size of the regexps.
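
A rough sketch of that -f - workaround could look like the following. It assumes a grep that accepts - as the pattern file (GNU grep does), and the function name is made up for this illustration. The regexp is streamed through a pipe, so it never appears on a command line.

```shell
# filter_with_stdin_pattern (hypothetical name): same filter as above,
# but the regexp reaches grep on stdin instead of as an argument.
filter_with_stdin_pattern() {
    {
        printf '('
        # Join all lines with "|" and drop the trailing newline paste adds.
        paste -sd '|' "$1" | tr -d '\n'
        printf '){2,}\n'
    } | LC_ALL=C grep -vxEf - "$1"
}
```

With the sample file, filter_with_stdin_pattern file prints alpha, beta, gama and zeta. It only removes the "Argument list too long" failure; the regexp is just as large as before, so the size limits mentioned above still apply.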

Using perl instead of grep, I'm able to process larger inputs:

perl -le '
  chomp (@words = <>);
  $re = "^(" . join("|", map {qr{\Q$_\E}} @words) . "){2,}\\z";
  for (@words) {print unless m/$re/}' file

(which also avoids the other limitation mentioned above).

In any case, that's going to take a long time as each word needs to be compared with every other word (possibly more than once).

  • Nice use of paste and {2,}. +1. In case you missed it, if you look at the comments under my answer, you'll see that the OP explains that his files are not "reasonably short." In fact, they are quite long (1.5 million lines, no duplicates) and, consequently, none of the current answers work for him. – John1024 Jul 14 '20 at 21:46
  • thank you, error msg: Argument list too long. – lina32 Jul 15 '20 at 08:31
  • @renola32, see edit for a perl variant. – Stéphane Chazelas Jul 15 '20 at 10:50
0
  1. Sort the file
  2. Skip lines while they match the one before

In awk:

sort file.txt | awk '!pattern || !($0~pattern) { pattern = $0; print }'

In perl:

$ sort file.txt | perl -wnlE 'BEGIN {my $pattern};if(!$pattern || !/$pattern/) {$pattern=$_; say }'
andreoss
  • 98
  • 5
  • 17
-1
awk '{for (i in a) if (index($0,i)) next; print; a[$0]}' file
  • 1
    That fails if the repeated word comes first. If your first line is alphabeta for example. – terdon Jul 13 '20 at 08:47
  • 1
    Apart from that, you may want to edit your answer to explain how it achieves what the OP is requesting ... – AdminBee Jul 13 '20 at 09:14