This prints every word in the file that is not the concatenation of two words from the file:
$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file
alpha
beta
gama
zeta
For those who prefer their commands split over multiple lines:
awk '
    {
        one[NR]=$1
    }
    END{
        for (i=1;i<=length(one);i++)
            for (j=1;j<=length(one);j++)
                two[one[i] one[j]]
        for (i=1;i<=length(one);i++)
            if (!(one[i] in two))
                print one[i]
    }' file
Another example
Let's consider a file with similar words but with the combinations sometimes appearing before the individual words:
$ cat file2
alphabeta
alpha
gammaalpha
beta
gamma
Running the same command still produces the correct result:
$ awk '{one[NR]=$1} END{for (i=1;i<=length(one);i++) for (j=1;j<=length(one);j++) two[one[i] one[j]]; for (i=1;i<=length(one);i++) if (!(one[i] in two)) print one[i]}' file2
alpha
beta
gamma
How it works
one[NR]=$1
This creates an array one with keys being the line numbers, NR, and values being the word on that line.
END{...}
The commands in curly braces are performed after we have finished reading in the file. These commands consist of two loops. The first, a nested loop, is:
for (i=1;i<=length(one);i++)
for (j=1;j<=length(one);j++)
two[one[i] one[j]]
This creates array two with keys made from every ordered pair of words in the file, including each word paired with itself.
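As a small illustration of why the bare expression two[...] is enough: in awk, merely referencing an array element creates that key, while the in operator tests for a key without creating one. The function name demo_keys and the words used are made up for this sketch:

```shell
# Sketch only: demo_keys and its sample words are hypothetical.
demo_keys() {
    awk 'BEGIN {
        two["alpha" "beta"]    # a bare reference creates the key "alphabeta"
        if ("alphabeta" in two) print "alphabeta is a key"
        if (!("gamma" in two))  print "gamma is not a key"
    }'
}

demo_keys
```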
The second loop is:
for (i=1;i<=length(one);i++)
if (!(one[i] in two))
print one[i]
This loop prints out every word in the file that does not appear as a key in array two.
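The same logic can be sketched with the loop bound taken from NR instead of length(one), which may help on older awk implementations that do not accept an array argument to length. The function name singles and the contents of sample.txt are assumptions made for illustration; the original input file is not shown here:

```shell
# Hypothetical sample input; not the file used in the answer.
printf '%s\n' alpha beta gama zeta alphabeta betagama >sample.txt

# singles FILE: print each word that is not a concatenation of two words in FILE.
singles() {
    awk '
    { one[NR] = $1 }                  # remember each word by its line number
    END {
        n = NR                        # number of lines read
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++)
                two[one[i] one[j]]    # create a key for every ordered pair
        for (i = 1; i <= n; i++)
            if (!(one[i] in two))
                print one[i]          # input order is preserved
    }' "$1"
}

singles sample.txt
```

With this sample input, alphabeta and betagama are suppressed because each is a key in two, and the remaining words come out in input order.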
Shorter, simpler version
This version uses shorter code and prints out the same words. The disadvantage is that the words are not guaranteed to be in the same order as in the input file, because awk's for (w in one) visits keys in an unspecified order:
$ awk '{one[$1]} END{for (w1 in one) for (w2 in one) two[w1 w2]; for (w in one) if (!(w in two)) print w}' file
gama
zeta
alpha
beta
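If a reproducible order matters more than the original input order, one option is to pipe the result through sort. This is only a sketch: the function name singles_sorted and the file sample2.txt are hypothetical, and LC_ALL=C is set so that the sort order does not depend on the locale:

```shell
# Hypothetical sample input; not the file used in the answer.
printf '%s\n' zeta alpha alphabeta beta >sample2.txt

# singles_sorted FILE: the short version, with its output sorted bytewise.
singles_sorted() {
    awk '{one[$1]}
         END{
             for (w1 in one) for (w2 in one) two[w1 w2]
             for (w in one) if (!(w in two)) print w
         }' "$1" | LC_ALL=C sort
}

singles_sorted sample2.txt
```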
More memory-efficient approach
For large files, the above methods could easily exhaust memory, because array two holds one key for every ordered pair of words. In these cases, consider:
$ sort -u file | awk '{one[$1]} END{for (w1 in one) for (w2 in one) print w1 w2}' >doubles
$ grep -vxFf doubles file
alpha
beta
gama
zeta
This uses sort -u to remove any duplicated words from file and then creates a file of possible double words called doubles. Then, grep is used to print lines in file which are not in doubles.
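The two steps can be wrapped so that the intermediate file lives in a temporary directory and is removed afterwards. This is a sketch under assumptions: the function name singles_lowmem and the file sample3.txt are hypothetical, and mktemp -d is assumed to be available (it is common but not required by POSIX). For grep, -F treats each pattern as a fixed string, -x requires a whole-line match, -f reads the patterns from a file, and -v inverts the match:

```shell
# singles_lowmem FILE: print lines of FILE that are not a concatenation of two of its words.
singles_lowmem() {
    input=$1
    tmpdir=$(mktemp -d) || return 1
    # Write every ordered pair of the de-duplicated words, one per line.
    sort -u "$input" |
        awk '{one[$1]} END{for (w1 in one) for (w2 in one) print w1 w2}' >"$tmpdir/doubles"
    # Keep only the input lines that match no line of doubles exactly.
    grep -vxFf "$tmpdir/doubles" "$input"
    rm -rf "$tmpdir"
}

# Hypothetical sample input; not the file used in the answer.
printf '%s\n' alpha beta alphabeta gama zeta >sample3.txt
singles_lowmem sample3.txt
```

Because grep reads the original file rather than the de-duplicated list, a word that occurs several times in the input is printed once per occurrence.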
alphabeta is two attached words? – jsotola Jul 13 '20 at 03:26
background. – Quasímodo Jul 13 '20 at 10:49
alphafoobetabar satisfies your requirement of "containing 2 or more words (which aren't separated by space)" - should it be printed or not? – Ed Morton Jul 13 '20 at 11:57