8

I have a list of names like so:

dog_bone
dog_collar
dragon
cool_dragon
lion
lion_trainer
dog

I need to extract the names that appear inside other names, like so:

dragon
lion
dog

I looked through the uniq man page, but it seems to compare entire lines, not substrings. Is there a way to do this with a bash function?

derobert

5 Answers

5
file=/the/file.txt
while IFS= read -r string; do
  grep -Fe "$string" < "$file" | grep -qvxFe "$string" &&
    printf '%s\n' "$string"
done < "$file"

That runs one read, two grep and sometimes one printf command per line of the file, so it is not going to be very efficient.

You can do the whole thing in one awk invocation:

awk '{l[NR]=$0}
     END {
       for (i=1; i<=NR; i++)
         for (j=1; j<=NR; j++)
           if (j!=i && index(l[j], l[i])) {
             print l[i]
             break
           }
     }' < "$file"

though that means the whole file is stored in memory.
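
Since the question asks for a bash function, the awk program above can be wrapped in one. This is only a sketch: the name contained_names is made up for illustration, and the awk logic is the same nested loop, so the whole input is still held in memory.

```bash
#!/usr/bin/env bash

# contained_names [FILE...]  (hypothetical name, chosen for this sketch)
# Prints every input line that occurs as a substring of some other line.
# With no FILE argument, awk reads standard input.
contained_names() {
  awk '{ l[NR] = $0 }
       END {
         for (i = 1; i <= NR; i++)
           for (j = 1; j <= NR; j++)
             if (j != i && index(l[j], l[i])) {
               print l[i]
               break
             }
       }' "$@"
}

# Usage with the sample names from the question:
printf '%s\n' dog_bone dog_collar dragon cool_dragon lion lion_trainer dog |
  contained_names
```

With the sample names this prints dragon, lion and dog, in the order they appear in the input.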

5

bash

names=(
  dog_bone
  dog_collar
  dragon
  cool_dragon
  lion
  lion_trainer
  dog
)

declare -A contained                 # an associative array
for (( i=0; i < ${#names[@]}; i++ )); do 
    for (( j=0; j < ${#names[@]}; j++ )); do 
        if (( i != j )) && [[ ${names[i]} == *"${names[j]}"* ]]; then
            contained["${names[j]}"]=1
        fi 
    done
done
printf "%s\n" "${!contained[@]}"    # print the array keys (their order is unspecified)

Output:

dog
dragon
lion
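
The double quotes around ${names[j]} inside the [[ ... ]] pattern are what make this a literal substring test. A small sketch of the difference, using two helper functions whose names are invented for this illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helpers, named for this illustration only.

# Literal test: the quoted "$2" is matched character for character.
contains_literal() { [[ $1 == *"$2"* ]]; }

# Pattern test: the unquoted $2 is interpreted as a glob pattern.
contains_glob() { [[ $1 == *$2* ]]; }

contains_literal 'dog_bone' 'dog' && echo 'dog is in dog_bone'
contains_literal 'dragon' 'd*n'   || echo 'd*n is not literally in dragon'
contains_glob    'dragon' 'd*n'   && echo 'd*n matches dragon as a glob'
```

For names made only of letters and underscores the two behave the same; the quoting matters once a name contains characters such as *, ? or [.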
glenn jackman
3

Here's a Perl approach. This also needs to load the file into memory:

perl -le '@f=<>; chomp(@f);
          foreach $l1 (@f){
              foreach $l2 (@f){
                  next if $l1 eq $l2;
                  # index() compares literally, so regex metacharacters in a name are harmless
                  $k{$l1}++ if index($l2, $l1) >= 0;
              }
          } print join "\n", keys %k' file
terdon
3

Here is a bash version 4.x solution:

#!/bin/bash

declare -A output
readarray -t input < '/path/to/file'   # -t strips the trailing newline from each name

for i in "${input[@]}"; do
  for j in "${input[@]}"; do
    [[ $j = "$i" ]] && continue
    if [ -z "${i##*"$j"*}" ]; then
      if [[ ! ${output[$j]} ]]; then
        printf "%s\n" "$j"
        output[$j]=1
      fi
    fi
  done
done
cuonglm
3

A hacky way to do what you want. I'm not sure whether all your examples will include an underscore, but if they do you can key off of that: use the lines of the file that have no underscore as a list of fixed strings for grep (via the -F and -f switches), then use sort | uniq -d to list the substrings that are present more than once within the file.

Example

$ grep -oFf <(grep -v _ file.txt) file.txt |
    LC_ALL=C sort | LC_ALL=C uniq -d    
dog
dragon
lion

The above works as follows.

  1. <(grep -v _ file.txt) will produce a list of the contents of file.txt, omitting the lines that contain an underscore (_).

    $ grep -v _ file.txt 
    dragon
    lion
    dog
    
  2. grep -oFf <(..) file.txt will use the results of #1 as a list of fixed strings that grep will look for within the file file.txt.

    $ grep -oFf <(grep -v _ file.txt) file.txt
    dog
    dog
    dragon
    dragon
    lion
    lion
    dog
    
  3. The results of this command are then run through the sort & uniq -d commands which will list the entries that occur more than once amongst the results that grep -oFf has produced.

NOTE: To understand why LC_ALL=C is needed for the sort and uniq calls, take a look at @Stephane's fine answer here: What does "LC_ALL=C" do?
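
Because the pattern list comes from grep -v _, only names without an underscore are ever candidates. A minimal sketch of that limit, wrapping the same pipeline in a function; the function name and the four sample names are invented for this illustration:

```bash
#!/usr/bin/env bash

# underscore_free_substrings FILE  (hypothetical name, chosen for this sketch)
# Same pipeline as above: lines of FILE without "_" become fixed-string
# patterns, and any pattern matched more than once in FILE is printed.
underscore_free_substrings() {
  grep -oFf <(grep -v _ "$1") "$1" | LC_ALL=C sort | LC_ALL=C uniq -d
}

# Illustration file (made up): a_b is contained in a_b_c, foo in foo_bar.
sample=$(mktemp)
printf '%s\n' a_b a_b_c foo foo_bar > "$sample"
underscore_free_substrings "$sample"
rm -f "$sample"
```

Only foo is reported here. a_b is never used as a pattern because it contains an underscore itself, so its occurrence inside a_b_c goes unnoticed.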

slm
  • That's wrong as it is equivalent to grep -v _ file.txt. Using LC_ALL=C sort | LC_ALL=C uniq -d instead of sort -u would work – Stéphane Chazelas May 07 '14 at 19:23
  • @StephaneChazelas - thanks for the feedback. Can you explain what's wrong? I don't understand what your suggestion is going to change. – slm May 07 '14 at 19:55
  • grep -of <(grep -v _ file.txt) file.txt will always return the lines that don't contain underscores because they match themselves (you're also missing some -F, but that's another issue). – Stéphane Chazelas May 07 '14 at 22:01
  • @StephaneChazelas - OK I finally understand what LC_ALL=C is doing in all your examples now. I finally stumbled across your A to that Q, funny I'd never seen that one until today. Thanks! – slm May 08 '14 at 02:00
  • Your answer assumes that one wants to consider whether foo is within foo_bar, but not whether a_b is within a_b_c. It also won't work if there's a foo, and foobar. – Stéphane Chazelas May 08 '14 at 06:30
  • @StephaneChazelas - that's the trouble with hacky solutions. It will work for most situations that contain underscores as I stated in the first paragraph. Do you know off hand if you can word boundary the fixed strings via grep's -F option? That could be of potential use in making this better, but this type of solution is something you can use in a pinch, but is not meant to be exhaustive or highly robust. – slm May 08 '14 at 12:01