Linux - split a big file into small files according to specific character and how many it repeats

Question

I have a really big file that looks like this:

>name1
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
>name2
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
>name
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
>name4
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT
ACGTACGTACGT

It is a fasta file. It has about 3183 lines that start with > (3183 names), followed by random number of lines of ACGTs. I want to split it into smaller files of 250 >s followed by their number of lines of ACGTs. And if the last file does not have 250 >s that is fine. I would still like to keep it. So far I tried split, which I don't think is appropriate here since it splits the file into one > in each small file. I also tried awk:

awk -F'>' 'NR==1{f=0;c=1}NR>1{
c++
if($((c%250))==0) {
fn="file"c".fasta";
print > fn}
}' kmer_subtraction/kmercollection.fasta

I am not sure if this works because I cannot see my file. Could you please help me with this? Thank you!

This question is welcome and 100% on topic here, but based on the subject matter, you might also be interested in our sister site: [bioinformatics.se]. — terdon, Dec 12 '22 at 12:03

score 0 · Answer 1 · answered Dec 12 '22 at 11:52

0

I have been carrying around a couple of simple, powerful awk scripts written by a colleague years ago, that translate between fasta and "table" format. By "table" format I mean that the fasta file is converted to one entry per line, with the sequence identifier as the first tab-delimited field and the sequence as the second.

You can find the scripts here.

Using those, it is very easy to do what you want since table format is perfect for splitting:

FastaToTbl file.fa > file.tbl
split -l 250 file.tbl file.subseq

That will take the input fasta file file.fa and create N files of 250 sequences, one sequence per line, named file.subseqaa, file.subseqab, ..., file.subseqaz. Now, you just need to change these back to fasta asgain:

for file in file.subseq*; do
    TblToFasta "$file" > "$file.fa"
    rm  "$file"
done
rm file.tbl

You now have your individual fasta files and the intermediate table format files have been removed.

You can also just write your only little script to do it in one go:

gawk -v n=1 '{ if(/^>/){k++; if(k % 250 == 0){n++; }} print > "file."n".fa"; }' file.fa

Note that this might complain about too many open files if you are not using GNU awk.

Finally, your script wasn't working because you were mixing up shell and awk syntax. The $((c%250)) is shell syntax, in awk you just want if( c % 250 == 0) as I have used above.

answered Dec 12 '22 at 11:52

terdon

242,166

You can change portability your gawk to awk with explicit close() – Gilles Quénot Dec 12 '22 at 12:00
@GillesQuenot I know, but since these are only ~12 files, I really doubt it's worth it. – terdon Dec 12 '22 at 12:02
@terdon that depends which awk version you're using on which platform, just google awk too many output files 10 which will mostly give you results about awk failures with less than 10 output files. Some of the results are about old awk on Solaris but some are about macOS and idk what other awks/platforms limits might be. In my mind at least, just based on questions I've seen over the years, if you have less than 10 open files you're probably fine but if you have 10 or more then YMMV. – Ed Morton Dec 12 '22 at 15:51
In addition to the potential "too many open files" error print > "file."n".fa" is undefined behavior per POSIX and so will produce a syntax error in some awks, it needs to be print > ("file."n".fa") to work in all awks. – Ed Morton Dec 12 '22 at 16:00
Just as another "too many open files" limit example, here's a question where nawk failed with "too many open files" at 20 - https://www.unix.com/unix-for-advanced-and-expert-users/42985-nawk-file-limits.html. I see several questions saying they hit that limit with nawk. Like I say, idk what that limit is for every awk on every platform but best I can tell if you keep it under 10 you'll be fine (I used to think under about 15 was fine till I started seeing all the questions hitting the limit at 10). – Ed Morton Dec 12 '22 at 16:10
Thank you so much! – msening Dec 12 '22 at 16:35

Ed Morton · Accepted Answer · 2022-12-12T16:03:35.420

0

Using any awk:

awk '
    />/ { if ( (++c % 250) == 1 ) { close(fn); fn="file"(++n)".fasta" } }
    { print > fn }
' file.fasta

edited Dec 12 '22 at 16:03

answered Dec 12 '22 at 15:57

Ed Morton

31,617

1

Thank you it works! – msening Dec 12 '22 at 16:34

Linux - split a big file into small files according to specific character and how many it repeats

2 Answers2