
I use the following command to match some IDs in file 1 and retrieve data stored in referencefile.

while read -r line; do
    awk -v pattern="$line" -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' referencefile;
done <file1 >output

I have 50 files similar to file1 stored in a directory and want to run the above command on each of them, saving the outputs as separate files. Is there a way to achieve this in a single command, e.g. with a nested loop?

reference file

>LD200FFFFFFFFFFFFFFFFFFFFSSSSSSSSS
 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
 SSSSSSSSSSSSSSS
>LD400HHHHHHHHHHHHHHHHHHHHHHHHHHHHH
 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>LD311DDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>LD500TTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>LD100KKKKKKKKKKKKKKKKKKKKKKKKKKKKK

example file 1

LD100
LD200
LD311

expected output1.txt

>LD100KKKKKKKKKKKKKKKKKKKKKKKKKKKKK
>LD200FFFFFFFFFFFFFFFFFFFFSSSSSSSSS
 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
 SSSSSSSSSSSSSSS
>LD311DDDDDDDDDDDDDDDDDDDDDDDDDDDDD

example file 2

LD500
LD400

expected output2.txt

>LD500TTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>LD400HHHHHHHHHHHHHHHHHHHHHHHHHHHHH
 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Arora
  • 79
  • yes, but I need to extract all the lines following the ID from the reference file. The number of characters (lines) can vary from ID to ID. Can I use grep for an unknown number of lines? – Arora Jun 02 '20 at 10:46
  • 1
    Right, I missed the significance of RS=">" here. grep can't do that. – ilkkachu Jun 02 '20 at 11:40

3 Answers


As I understand it, you are using a script for this rather than looking for a single command line. How about changing your script to something like this:

#!/bin/bash
Directory="$1"
ls "$Directory" | while read -r FileName
do
  while read -r line
  do
   awk -v pattern="$line" -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' referencefile;
  done < "$Directory"/"$FileName" > OutputDirectory/"$FileName".out
done

This script should be called like this:

<script> <directory with input files>

Some notes on the usage:

  • The OutputDirectory must exist; either create it beforehand or edit the script to take it as a parameter.
  • The <directory with input files> should contain only the input files and no subdirectories; otherwise you will receive error messages.

Caveat

The script relies on parsing the output of ls. This keeps the script simple enough to illustrate the method, but it is generally not recommended practice, since special characters in filenames can lead to unwanted behaviour. It will work in simple setups where the input file names are not too exotic: spaces in names are fine, but e.g. a newline in a name will cause an error, and such a file will not be processed.
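If you want to avoid parsing ls altogether, a glob does the same job with only a small change. A sketch (the function name process_dir is my choice; referencefile and OutputDirectory are still hardcoded, as in the script above):

```shell
#!/bin/bash
# Glob-based variant: iterate over the directory entries directly
# instead of parsing ls, so any filename (even one containing a
# newline) is handled safely.
process_dir() {
    local Directory="$1" FilePath FileName line
    for FilePath in "$Directory"/*; do
        FileName=${FilePath##*/}   # strip the directory part
        while read -r line; do
            awk -v pattern="$line" -v RS=">" \
                '$0 ~ pattern { printf(">%s", $0); }' referencefile
        done < "$FilePath" > OutputDirectory/"$FileName".out
    done
}
```

Call it as `process_dir <directory with input files>` from the directory that contains referencefile and OutputDirectory.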

AdminBee
  • 22,803
Gienek
  • 41
  • I believe my files must have newline characters. I get this: -bash: syntax error near unexpected token `newline' – Arora Jun 02 '20 at 12:47
  • @Arora: I do not think that this error is caused by 'bad filenames'. If you have some file with newline in the name, the error would be '... No such file or directory'. I have changed the script slightly, and edited it on Linux, to avoid any extra non-visible characters. Maybe you could check in your editor, that all gets copied correctly. – Gienek Jun 02 '20 at 13:40
  • when I call the script, it says script: cannot open PATH/dir: is a directory. Am I using it wrong? Sorry, I'm very new to this – Arora Jun 02 '20 at 14:49
  • @Arora: Hmm... Let me describe directory structure, I used to test the script. It may help you with calling. Directory ‘/home/Gienek/Test1’ contains: ‘referencefile’ - The reference file in the question. ‘script.sh’ - The script. ‘T1’ - Directory with files 'file 1' and 'file 2'. OutputDirectory - Directory for output. Executing: $ cd /home/Gienek/Test1 Then: $ ./script.sh T1 ‘OutputDirectory’ and ‘referencefile’ are hardcoded in the script. So it is important, which is the current directory, when executing the script. – Gienek Jun 02 '20 at 20:02
  • Thank you very much. It worked perfectly – Arora Jun 03 '20 at 09:28

Well, in general you could do:

for f in file*; do
    while read ...; do
        some commands...
    done < "$f"
done > output

or just

cat file* | while read ...; do
    some commands...
done > output
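Applied to the command in the question, the first skeleton might look like this (a sketch; I wrapped it in a function, and the .out suffix is my choice):

```shell
# Nested-loop instance of the skeleton above: for each input file,
# look up every ID in the reference file and write one output file.
extract_ids() {
    local ref="$1" f line
    shift
    for f in "$@"; do
        while read -r line; do
            awk -v pattern="$line" -v RS=">" \
                '$0 ~ pattern { printf(">%s", $0); }' "$ref"
        done < "$f" > "$f.out"
    done
}
```

For example, `extract_ids referencefile file*` produces file1.out, file2.out, and so on.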

If you wanted just the lines with matches, grep could do this more directly: grep -f reads the patterns from a file and prints any matching lines.

for patternfile in file*; do
    grep -f "$patternfile" referencefile
done
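To mirror the question's separate output files, redirect inside the loop. A self-contained demo with sample data shaped like the question's (the file names and the .out suffix are my choices):

```shell
# Sample data: a reference file and two pattern files.
printf '%s\n' '>LD100KKK' '>LD200FFF' '>LD311DDD' > referencefile
printf '%s\n' LD100 LD311 > file1
printf '%s\n' LD200 > file2

# One grep run per pattern file, each redirected to its own output.
# Note: this prints only the matching lines themselves, not any
# continuation lines of a multi-line record.
for patternfile in file1 file2; do
    grep -f "$patternfile" referencefile > "$patternfile.out"
done
```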
ilkkachu
  • 138,973

You could wrap the call to xargs + grep in a for loop. Note that the order of the output may not match the order of the IDs in file1, because grep prints matches in the order they appear in the reference file.

for f in file*; do
  paste -sd'|' <"$f" |
    xargs -r -I{} grep -Pzo '(?m:(?:^[>](?:'{}')\D.*\n)(?:[^>].*\n)*)' referencefile |
    tr -d '\0' > "$f.out"
done