1

I have a file which contains 3940 sequences and looks like

>TCONS_00000066 gene=XLOC_000030
CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAG
CGCTACGGTCCACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTT
CCTCATCGAGCAGCAGCAACTCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACA
CGGCAACAACACTTTATACTTGTGTATAATTTCTCTTCGTTTCTGAGTTCATGGCTATCT
TTGTCTCTCTTATCTTCTCCCTTTTGCTATCTCTATATTTGTGATTGCCATGGAAATACA
>TCONS_00006042 gene=XLOC_003448
GCCACTAGCCAGCCCAGCCAGGGGAAGGGGAGGAGCTGCAAGCCCAACCCCCTGCTCAAC
CCTAAATTGCTTCCGCCGATCGGTGAGAGCTCCGATGCCTTCTTCTTCTTCTTCTTCCTC
CCCCTCTACCTGTTCCTTCTCCGAGATAACTGCAACATTTTCAGCACTTTTTCTGGCCAT
TCTCAAGTCCCCAGCCCAGGGACTAGAGTGTTACTATGGCTAGAGCAAATGAGATGGTCA
GGGCAGACTCAAGGATGATGGTTGTCTTTAGTGCCCTGGCATCTAAATCAGGGCCACTGA
CATTTGAAGACTCGCTCAGATTTGTCAAGAAAGTGAAGGCTTGTAACTACATGTTGTATT

I want the sequences which are longer than 200 characters in another file

Anthon
  • 79,293

6 Answers6

2

You could use sed or awk or grep.

awk 'length($0)>200' file > newfile

OR

grep '^.\{201\}' file > newfile
Avinash Raj
  • 3,703
2

With awk you need to set > as a record separator first:

awk 'BEGIN{RS=">";ORS=""}length($0)>200{print ">"$0}' input > output

Another option with pcregrep:

pcregrep -M '^>[^>]{201,}' input > output

Or to count only DNA sequences, not characters in header:

pcregrep -M '^>[^>]*\n[^>]{201,}' input > output
jimmij
  • 47,140
1

Python (split.py):

import sys

# call with the file as parameter

base = 0
line = ''
with open(sys.argv[-1]) as fp:
    with open('shorter', 'w') as fps:
        with open('longer', 'w') as fpl:
            for x in fp:
                if line and x.startswith('>'):
                    print len(line), base
                    if (len(line) - base) >= 200:
                        fpl.write(line)
                    else:
                        fps.write(line)
                    line = x
                    base = len(x)  # lenght of the ">..." line
                    continue
                if x.startswith('>'):  # very first one
                    base = len(x)
                line += x
            if line:
                if len(line) >= 200:
                    fpl.write(line)
                else:
                    fps.write(line)
                line = ""

call with python split.py inputfile and then mv shorter inputfile (after checking that the files are OK)

Anthon
  • 79,293
  • File "filterlength.py", line 2 with open('shorter', 'w') as fps: ^ IndentationError: unexpected indent – user106326 Mar 14 '15 at 09:23
  • @user106326 There was a space too many at the beginning of each line – Anthon Mar 14 '15 at 09:24
  • Traceback (most recent call last): File "split.py", line 8, in fps.stdout.write(x) AttributeError: 'file' object has no attribute 'stdout' – user106326 Mar 14 '15 at 11:09
  • @user106326 I think the update better does what you need. it takes the length of the accumulated sequence (keeping into account the >...... which doesn't count against the 200) – Anthon Mar 14 '15 at 11:50
  • the shorter file is same as input while longer file is blank – user106326 Mar 14 '15 at 11:52
  • i want sequences longer than 200 . why move shoter into input?? – user106326 Mar 14 '15 at 12:00
  • Copy and paste your example to inputfile, makes sure that if you do wc inputfile you actually have 13 lines. That input gives two sequences in longer. Does your input file have newlines breaking up the sequences? – Anthon Mar 14 '15 at 12:02
  • @user106326 you asked "want [..] in another file" not "want [...] in another file as well" because of that you indicate you want to remove from the input. If that is not what you want update your question (I asked that 3 hours ago in a comment to your question). – Anthon Mar 14 '15 at 12:06
1

You can do:

f=0
for arg in -e -v
do  f=$((f+1))
    tr '>\n' '\n>' <infile |
    grep "$arg" '>.\{200\}'|
    tr '>\n' '\n>' >"./outfile$f"
done

Which will write all sequences 200 chars or longer to outfile1 and all those shorter to outfile2.

It translates all > to \newlines and vice-versa - so your first example above becomes:

TCONS_00000066 gene=XLOC_000030>CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAG>CGCTACGGTCCACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTT>CCTCATCGAGCAGCAGCAACTCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACA>CGGCAACAACACTTTATACTTGTGTATAATTTCTCTTCGTTTCTGAGTTCATGGCTATCT>TTGTCTCTCTTATCTTCTCCCTTTTGCTATCTCTATATTTGTGATTGCCATGGAAATACA>

It then matches (or doesn't) 200 chars from the first occurring > for each input line. For the first iteration of the for loop - the one which matches lines with 200 chars or more - it reverses the translation for the lines which grep matches and writes the results to outfile1. The second iteration gets shorter sequences and writes them to outfile2.

You should note that this count will include any newlines that were in the data.

Here's another way that doesn't have that problem:

sed -n '/^>/!{H;$!d
};   x;s/\n[ACGT]\{20\}/&/4p
'    <infile >outfile
mikeserv
  • 58,310
  • -bash: syntax error near unexpected token `do' – user106326 Mar 14 '15 at 11:46
  • @user106326 - you should try it again - you must have typed something wrong. There was no syntax error - though the iterator $f was staying at 0 because it was behind a pipe. Still, that would only get the outfile0 file written twice is all. – mikeserv Mar 14 '15 at 13:56
1
cat file | while read -r line; do
  if [ ${#line} -gt 200 ]; then
    echo "${line}"
  fi
done

EDIT Question has been updated: not the length of one line but a set of lines is wanted.

In the following script I echo >TCONS otherwise the script will skip the last hit.

multiline=""
(cat input; echo ">TCONS string for last token") | while read line; do
        if [[ "$(echo "${line}"| cut -c1-6)" = ">TCONS" ]]; then
                if [ ${#multiline} -gt 200 ]; then
                        echo "${multiline}"
                fi
                multiline=""
        else
                multiline="${multiline}${line}"
        fi
done
Walter A
  • 736
0

I already gave you my FastaToTbl and TblToFasta scripts in another answer. Using them, you could do

FastaToTbl sequences.fa | awk 'length($2)>200' | TblToFasta > newfile.fa

FastaToTbl

#!/usr/bin/awk -f
{
        if (substr($1,1,1)==">")
        if (NR>1)
                    printf "\n%s\t", substr($0,2,length($0)-1)
        else 
            printf "%s\t", substr($0,2,length($0)-1)
        else 
                printf "%s", $0
}END{printf "\n"}

TblToFasta

#! /usr/bin/awk -f
{
  sequence=$NF

  ls = length(sequence)
  is = 1
  fld  = 1
  while (fld < NF)
  {
     if (fld == 1){printf ">"}
     printf "%s " , $fld

     if (fld == NF-1)
      {
        printf "\n"
      }
      fld = fld+1
  }
  while (is <= ls)
  {
    printf "%s\n", substr(sequence,is,60)
    is=is+60
  }
}
terdon
  • 242,166