How can I delete any line containing 2 or more words (which aren't separated by space) from text file?

Question

The file also contains the "single versions" of those words.

For example, given:

alpha
beta
gama
alphabeta
zeta
gamabeta

The output should be:

alpha
beta
gama
zeta

Edit: note that my file contains 1.5M lines

alphafoobetabar satisfies your requirement of "containing 2 or more words (which aren't separated by space)" - should it be printed or not? — Ed Morton, Jul 13 '20 at 11:57

John1024 · Answer 1 · 2020-07-15T19:17:42.737

This prints all words in the file that are not combinations of any two words in the file:

$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file
alpha
beta
gama
zeta

For those who prefer their commands split over multiple lines:

awk '
    {
        one[NR]=$1
    }
    END{
        for (i=1;i<=length(one);i++)
            for (j=1;j<=length(one);j++)
                two[one[i] one[j]]
        for (i=1;i<=length(one);i++)
            if (!(one[i] in two))
                print one[i]
    }' file

Another example

Let's consider a file with similar words but with the combinations sometimes appearing before the individual words:

$ cat file2
alphabeta
alpha
gammaalpha
beta
gamma

Running our same command still produces the correct result:

$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file2
alpha
beta
gamma

How it works

one[NR]=$1

This creates an array one whose keys are the line numbers, NR, and whose values are the words on those lines.

END{...}

The commands in curly braces are performed after we have finished reading in the file. They consist of two loops. The first loop is:
```
 for (i=1;i<=length(one);i++)
      for (j=1;j<=length(one);j++)
          two[one[i] one[j]]
```
This creates array two with keys made from every combination of two words in the file.

The second loop is:
```
  for (i=1;i<=length(one);i++)
      if (!(one[i] in two))
          print one[i]
```
This loop prints out every word in the file that does not appear as a key in array two.
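
To make the cost of the first loop concrete, here is a small sketch that prints the keys that would end up in two for a two-word input. The function name all_pairs is made up for illustration, and it uses a counter n instead of length(one); otherwise it mirrors the loop above.

```shell
# all_pairs is a hypothetical helper name, not part of the answer above.
# It reads words on stdin and prints every ordered pair, concatenated.
all_pairs() {
    awk '{ one[++n] = $1 }
         END {
             for (i = 1; i <= n; i++)
                 for (j = 1; j <= n; j++)
                     print one[i] one[j]
         }'
}

printf '%s\n' alpha beta | all_pairs
# alphaalpha
# alphabeta
# betaalpha
# betabeta
```

Two words give four keys; in general n words give n*n keys. For a file of 1.5M unique lines that is about 2.25 × 10^12 keys, which is why this method only suits small inputs.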

Shorter Simpler Version

This version uses shorter code and prints out the same words. The disadvantage is that the words are not guaranteed to be in the same order as in the input file:

$ awk '{one[$1]} END{for (w1 in one) for (w2 in one) two[w1 w2]; for (w in one) if (!(w in two)) print w}' file
gama
zeta
alpha
beta

More memory-efficient approach

For large files, the above methods could easily overflow memory. In these cases, consider:

$ sort -u file | awk '{one[$1]} END{for (w1 in one) for (w2 in one) print w1 w2}' >doubles
$ grep -vxFf doubles file
alpha
beta
gama
zeta

This uses sort -u to remove any duplicated words from file and then creates a file, called doubles, of every possible double word. Then grep prints the lines of file that are not in doubles.
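
A possible variant of the same idea replaces the grep step with comm on sorted lists. This is only a sketch: the function name not_doubles and the temporary file layout are assumptions made for illustration, and the output comes out in sorted order rather than input order.

```shell
# not_doubles is a hypothetical helper name.
# Prints the lines of file "$1" that are not the concatenation of two
# lines of "$1". Everything is sorted in the C locale so comm sees
# consistently ordered input.
not_doubles() {
    tmp=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$1" > "$tmp/words"
    awk '{ w[++n] = $0 }
         END {
             for (i = 1; i <= n; i++)
                 for (j = 1; j <= n; j++)
                     print w[i] w[j]
         }' "$tmp/words" | LC_ALL=C sort -u > "$tmp/doubles"
    # -23 keeps only the lines that appear in words but not in doubles.
    LC_ALL=C comm -23 "$tmp/words" "$tmp/doubles"
    rm -rf "$tmp"
}

sample=$(mktemp)
printf '%s\n' alpha beta gama alphabeta zeta gamabeta > "$sample"
not_doubles "$sample"
rm -f "$sample"
# alpha
# beta
# gama
# zeta
```

Note that doubles still holds one line per pair of words, so its size grows with the square of the number of unique lines. This moves the problem from memory to disk and sort time; it does not remove it.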

Thank you. The best answer is Mr. John1024's, but it only works well on very small files. My file contains 1.5M lines, so the Ubuntu terminal killed the shell after 10 minutes. What is the best solution for this problem? — lina32, Jul 13 '20 at 12:29
@renola32 That's a large number of lines. Most people have a vocabulary of well less than 10,000 words. How many unique lines in your file? — John1024, Jul 13 '20 at 21:05
@renola32 I just added a new method to the end of the answer which should be more memory efficient. — John1024, Jul 14 '20 at 03:02
Mr. John1024, thank you. All of the lines are unique. What about working in batches: delete 10, save, loop; delete 10 more, save, loop again? — lina32, Jul 14 '20 at 05:52
@renola32 So every line is unique? That's depressing. If so, then I think you are right that a more complex approach is needed. — John1024, Jul 14 '20 at 06:02
@renola32 What do the words look like? Are they all lower case letters? Are they all the same length? Or, what is the range of character lengths for the words? — John1024, Jul 14 '20 at 06:07
Yes, lower case. The length ranges from 4 to 20 and every character is a letter. Regards — lina32, Jul 14 '20 at 06:12
Instead of grep -vFf doubles (you're missing -x btw), it would be more efficient to use comm (on sorted lists) — Stéphane Chazelas, Jul 15 '20 at 10:55

guest · Answer 2 · 2020-07-17T07:46:04.733

<file awk 'NF {print length "\t" $0}' | sort -k1n,1 | cut -f2- |
awk 'NR==1 {min=length}
(l=length) >= 2*min {
  delete k; # clear k array
  k[1];
  while (length(k))
    for (i in k) {
      for (j=l-i+1; j>=min; --j)
        if (substr($0,i,j) in seen) {
          if (i+j-1==l)
            next;
          k[i+j];
        }
      delete k[i];
    }
}
!seen[$0]++'

Lines that are made up entirely of previously seen lines are not printed.

It works by checking each line's substrings against the lines already seen.

The input must be sorted by line length, from shortest to longest; the awk | sort | cut pipeline does that.

The second awk program starts by noting the length of the shortest line (stored as min). Any line shorter than 2*min cannot be a combination, so its substrings need not be checked; it is simply added to the seen array and printed (!seen[$0]++ is used as the condition that prints non-duplicates; more info: How does awk '!a[$0]++' work?). min also serves as the minimum length when checking substrings.

When scanning a line for substrings, every new possible starting position must be noted. This is done with an array k that stores these offsets. Each substring is checked for existence as a key of the seen array. When a seen string is found:

If the substring reaches the end of the line, go to the next line of input. The line is neither printed nor added to the seen array.
Otherwise, add the next starting position to k and continue scanning for further substrings.

Keep trying as long as new starting positions are found (while (length(k))).

If the above loop did not advance to the next line, the line is added to the seen array (and printed if not already seen).
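
The same check can also be written as a small dynamic-programming loop, which some readers may find easier to follow. This is a restatement for illustration, not the code from this answer: the name segment_filter and the array ok are made up, it does not use the min shortcut, and it clears the array with split("", ok) instead of delete.

```shell
# segment_filter is a hypothetical helper name.
# Prints the lines of file "$1", shortest first, dropping every line that
# can be written as a chain of lines already seen.
segment_filter() {
    awk 'NF { print length($0) "\t" $0 }' "$1" |
    LC_ALL=C sort -k1,1n | cut -f2- |
    awk '{
        l = length($0)
        split("", ok)      # clear the array
        ok[0] = 1          # the empty prefix is always reachable
        # ok[p] is set when the first p characters are a chain of seen lines
        for (p = 1; p <= l; p++)
            for (q = 0; q < p; q++)
                if ((q in ok) && (substr($0, q + 1, p - q) in seen)) {
                    ok[p] = 1
                    break
                }
        if (l in ok) next  # the whole line is a chain (or a duplicate)
        seen[$0]
        print
    }'
}

sample=$(mktemp)
printf '%s\n' alpha beta gama alphabeta zeta gamabeta > "$sample"
segment_filter "$sample"
rm -f "$sample"
# beta
# gama
# zeta
# alpha
```

As with the original pipeline, the output is ordered by line length rather than by position in the input. Each line costs at most about l*l/2 substring lookups, so with lines of 4 to 20 characters the work per line stays small regardless of how many lines the file has.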

That works and is clever. The use of hash tables makes it quite efficient. That would deserve an explanation though. — Stéphane Chazelas, Jul 15 '20 at 17:48

Stéphane Chazelas · Answer 3 · 2020-07-15T10:46:43.570

With reasonably short files and assuming the lines don't contain ERE operators:

$ LC_ALL=C grep -vxE "($(paste -sd '|' file)){2,}" file
alpha
beta
gama
zeta

Returns the lines that do not consist of a sequence of 2 or more of the lines in file (-x requires the whole line to match).

It works by constructing a grep command like:

LC_ALL=C grep -vxE '(alpha|beta|gama|alphabeta|zeta|gamabeta){2,}' file

For larger files, you'll run into the limit on the length of arguments+environment (or of a single argument on Linux). That could be worked around by using -f - to pass the regexp via stdin rather than as an argument, but even then, you'd run into limits on the size of the regexps.
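
The -f - workaround might look like the sketch below. It assumes a grep that accepts -f - for reading patterns from stdin (GNU grep does), and the function name grep_no_chains is made up for illustration. It avoids the argument-length limit only; the regexp size limits still apply.

```shell
# grep_no_chains is a hypothetical helper name.
# Builds the pattern (line1|line2|...){2,} on stdin instead of on the
# command line, then filters file "$1" with it.
grep_no_chains() {
    {
        printf '('
        paste -sd '|' "$1" | tr -d '\n'
        printf '){2,}\n'
    } | LC_ALL=C grep -vxE -f - "$1"
}

sample=$(mktemp)
printf '%s\n' alpha beta gama alphabeta zeta gamabeta > "$sample"
grep_no_chains "$sample"
rm -f "$sample"
# alpha
# beta
# gama
# zeta
```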

Using perl instead of grep, I'm able to process larger inputs:

perl -le '
  chomp (@words = <>);
  $re = "^(" . join("|", map {qr{\Q$_\E}} @words) . "){2,}\\z";
  for (@words) {print unless m/$re/}' file

(which also avoids the other limitation mentioned above).

In any case, that's going to take a long time as each word needs to be compared with every other word (possibly more than once).

edited Jul 15 '20 at 10:46

answered Jul 14 '20 at 08:12

Nice use of paste and {2,}. +1. In case you missed it, if you look at the comments under my answer, you'll see that the OP explains that his files are not "reasonably short." In fact, they are quite long (1.5 million lines, no duplicates) and, consequently, none of the current answers work for him. – John1024 Jul 14 '20 at 21:46
thank you, error msg: Argument list too long. – lina32 Jul 15 '20 at 08:31
@renola32, see edit for a perl variant. – Stéphane Chazelas Jul 15 '20 at 10:50

score 0 · Answer 4 · answered Jul 13 '20 at 04:23

Sort the file.
Skip each line that matches the last printed line (used as a pattern).

In awk:

sort file.txt | awk '!pattern || !($0~pattern) { pattern = $0; print }'

In perl:

$ sort file.txt | perl -wnlE 'BEGIN {my $pattern};if(!$pattern || !/$pattern/) {$pattern=$_; say }'
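
A quick way to see where this approach is safe and where it is not is to feed the awk version a word that merely starts with another word. The wrapper name skip_matching is made up for illustration; the awk program is the one shown above.

```shell
# skip_matching is a hypothetical wrapper around the awk one-liner above.
skip_matching() {
    LC_ALL=C sort | awk '!pattern || !($0~pattern) { pattern = $0; print }'
}

printf '%s\n' andreoss and | skip_matching
# and
```

Here andreoss is dropped although it is not a combination of two words in the input, so the method relies on no single word containing another.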

andreoss

This approach won't work in general: consider a file containing "and" and "andreoss" – Fox Jul 13 '20 at 15:31
@Fox I assumed that 'single versions' of words are always present and unambiguous. – andreoss Jul 13 '20 at 15:36

score -1 · Answer 5 · answered Jul 13 '20 at 04:22

awk '{for (i in a) if (index($0,i)) next; print; a[$0]}' file
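
The one-liner depends on the order of the input, which a small experiment makes visible. The wrapper name first_seen_filter is made up for illustration; the awk program is unchanged.

```shell
# first_seen_filter is a hypothetical wrapper around the one-liner above.
# A line is dropped only if it contains a line that was printed earlier.
first_seen_filter() {
    awk '{for (i in a) if (index($0,i)) next; print; a[$0]}'
}

# Single words first: the combination is removed.
printf '%s\n' alpha beta alphabeta | first_seen_filter
# alpha
# beta

# Combination first: nothing has been seen yet, so it is printed.
printf '%s\n' alphabeta alpha beta | first_seen_filter
# alphabeta
# alpha
# beta
```

Because index() looks for any earlier line as a substring, the filter also drops a line that contains just one earlier word, not only lines built from two or more.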

That fails if the repeated word comes first. If your first line is alphabeta for example. – terdon Jul 13 '20 at 08:47
Apart from that, you may want to edit your answer to explain how it achieves what the OP is requesting ... – AdminBee Jul 13 '20 at 09:14
