I'm trying to write a script that goes through each line of one file and, if that line does not occur anywhere within any line of a second text file, removes it from the first file.

An example of the input and the desired output of this script would be:

Example input, file 1 (the groups file):

hello
hi hello
hi
great
interesting

File 2:

this is a hi you see
this is great don't ya think
sometimes hello is a good expansion of its more commonly used shortening hi
interesting how brilliant coding can be just wish i could get the hang of it

Example script output - file 1 changed to:

hello
hi
great
interesting

So it has removed "hi hello", because that line is not present anywhere in the second file.

Here is the script. It seems to work up to the point of setting the variables.

# Take the first line from stability.contigs.groups
head -n1 ~/test_folder/stability.contigs.groups > ~/test_folder/ErrorFix.txt
# Remove the last 5 characters
sed -i -r '$ s/.{5}$//' ~/test_folder/ErrorFix.txt

# Look for the string from ErrorFix.txt in stability.trim.contigs.fasta; if it is not found, delete the line containing that string from stability.contigs.groups
STRING=$(cat ~/test_folder/MothurErrorFix.txt)
FILE=~/test_folder/stability.trim.contigs.fasta
if [ ! -z $(grep "$STRING" "$FILE") ]
    then
        perl -e 's/.*\$VAR\s*\n//' ~/test_folder/stability.contigs.groups
fi
Rui F Ribeiro
Giles
  • Please read http://stackoverflow.com/help/formatting – garethTheRed May 21 '16 at 18:27
  • Yes, but the line from file 1 can appear anywhere within any portion of text in file 2, so the script needs to take that into account before deleting the line from file 1 when it is not there. – Giles May 21 '16 at 19:24
  • Sorry, I was trying to keep it simple, short and sweet. I can provide a better example if it helps. I also apologise if I occasionally make no sense; I've just come off a 48-hour straight work stint and am feeling a little woozy. – Giles May 21 '16 at 19:30
  • I get that, and I should have known better really; even in the short time I've been doing this I've seen it happen a few times in these forums when looking through other people's questions. I think I can sometimes stare at something for so long trying to figure it out that, when I go to explain the problem, I overlook important factors in the explanation. – Giles May 21 '16 at 19:45

2 Answers

If you have GNU grep you could run:

grep -oFf file1 file2 | sort | uniq | grep -xFf - file1

Remove the last grep if you don't need to preserve the order of the lines in file1.
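
Note that the pipeline only prints the surviving lines; it does not change file1. Here is a minimal sketch of writing the result back, assuming GNU grep and mktemp are available (the function name filter_groups is made up for illustration). It uses -x on the final grep so that only whole lines of file1 that equal a matched string are kept; without it, a line such as "hi hello" would be kept simply because it contains "hi".

```shell
#!/bin/sh
# filter_groups FILE1 FILE2 (hypothetical helper name)
# Keep only the lines of FILE1 that occur as a substring somewhere in FILE2,
# then overwrite FILE1 with the result.
filter_groups() {
    tmp=$(mktemp) || return 1
    # 1st grep: print every piece of FILE2 that matches a line of FILE1 (-o),
    #           treating the lines of FILE1 as fixed strings (-F).
    # sort -u:  de-duplicate the matched strings.
    # 2nd grep: keep the lines of FILE1 that are exactly (-x) one of those strings,
    #           in their original order.
    grep -oFf "$1" "$2" | sort -u | grep -xFf - "$1" > "$tmp"
    # Write back through cat so FILE1 keeps its permissions.
    cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

It would be called as filter_groups file1 file2. Working on a copy of file1 first is advisable, since the original content is overwritten.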
If you don't have access to GNU grep, you could use awk:

awk 'NR==FNR{z[$0]++;next};{for (l in z){if (index($0, l)) y[l]++}}
END{for (i in y) print i}' file1 file2
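
The for (i in y) loop prints in whatever order awk traverses the array, so the order of file1 is not necessarily preserved. Here is a sketch of an order-preserving variant of the same idea, wrapped in a shell function (the name keep_matching is hypothetical). It assumes file1 is not empty, since NR==FNR would otherwise also be true while reading file2.

```shell
#!/bin/sh
# keep_matching FILE1 FILE2 (hypothetical helper name)
# Print the lines of FILE1 that occur as a substring of at least one line
# of FILE2, in the order they appear in FILE1.
keep_matching() {
    awk '
        # First file: remember every line together with its position.
        NR == FNR { order[++n] = $0; next }
        # Second file: mark each remembered line found inside the current line.
        {
            for (i = 1; i <= n; i++)
                if (!(i in seen) && index($0, order[i]))
                    seen[i] = 1
        }
        # Print the marked lines in their original order.
        END { for (i = 1; i <= n; i++) if (i in seen) print order[i] }
    ' "$1" "$2"
}
```

Like the one-liner above, it writes to standard output, so redirect to a temporary file and move that over file1 if you want to replace it.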
don_crissti
  • GNU grep worked perfectly, thank you. The awk solution had a few problems, which seemed to be associated with grep and some numbers that were in the text file (the file is huge; I thought it was all normal text, but I was wrong). – Giles May 21 '16 at 20:28

Go for don_crissti's (accepted) answer if you have GNU grep. If you don't (e.g. on a standard Mac OS X, where that won't work), you could instead save this snippet as a bash script, e.g. myconvert.sh:

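#!/bin/bash
# Usage: ./myconvert.sh file1 file2
while IFS='' read -r line || [[ -n "$line" ]]; do
    # -F: treat the line as a fixed string; -q: quiet, only the exit status is used
    if ! grep -Fq -- "$line" "$2"
    then
        # Escape regex metacharacters in the line, then delete it from file1 (BSD sed syntax: -i '')
        sed -i '' "/$(printf '%s\n' "$line" | sed -e 's/[]\/$*.^|[]/\\&/g')/d" "$1"
    fi
done < "$1"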

and call it with the two files as arguments:

./myconvert.sh file1 file2

However, please note don_crissti's knowledgeable comments below regarding the usage of while/read and the obvious performance drawbacks of invoking sed.

Mischa
  • Avoid reading a file with while..read. Also, if file1 had 10,000 lines your script would edit the file in-place 10,000 times (via sed), not to mention that your sed command could (potentially) remove other lines too, since you're not escaping any special characters that may be present in file1 (the most common of them being the dot). – don_crissti May 21 '16 at 20:28
  • With a modern shell like bash or zsh you could read the lines of file1 into an array and, for each element, use grep -q and print the element if it succeeds: c=0; readarray -t arr <file1; while [ $c -lt ${#arr[@]} ]; do grep -qF "${arr[$c]}" file2 && printf %s\\n "${arr[$c]}"; ((c++)); done Now, if you modify this to print something else that you then pipe to sed as a script, you can edit the file in-place with a single sed invocation. – don_crissti May 21 '16 at 22:31
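
The single-sed idea from the last comment could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses a plain read loop instead of readarray so that it also runs on bash 3.2 (the version shipped with OS X), it builds the sed script from line numbers so that nothing from file1 needs escaping, and the function name prune_file is hypothetical. grep still runs once per line of file1; only sed runs once.

```shell
#!/bin/sh
# prune_file FILE1 FILE2 (hypothetical helper name)
# Delete from FILE1 every line that does not occur as a substring in FILE2,
# using one sed run driven by a generated script of "Nd" commands.
prune_file() {
    groups=$1
    text=$2
    script=$(mktemp) || return 1
    tmp=$(mktemp) || { rm -f "$script"; return 1; }
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # Line not found in FILE2: emit a sed command deleting line number n.
        grep -qF -- "$line" "$text" || printf '%dd\n' "$n"
    done < "$groups" > "$script"
    # Apply all deletions in one pass, then write the result back to FILE1.
    sed -f "$script" "$groups" > "$tmp" && cat "$tmp" > "$groups"
    status=$?
    rm -f "$script" "$tmp"
    return "$status"
}
```

It would be called as prune_file file1 file2. The variables are not declared local, to stay within plain sh, so they remain set in the calling shell afterwards.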