filter out length

Question

I have a file which contains 3940 sequences and looks like

>TCONS_00000066 gene=XLOC_000030
CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAG
CGCTACGGTCCACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTT
CCTCATCGAGCAGCAGCAACTCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACA
CGGCAACAACACTTTATACTTGTGTATAATTTCTCTTCGTTTCTGAGTTCATGGCTATCT
TTGTCTCTCTTATCTTCTCCCTTTTGCTATCTCTATATTTGTGATTGCCATGGAAATACA
>TCONS_00006042 gene=XLOC_003448
GCCACTAGCCAGCCCAGCCAGGGGAAGGGGAGGAGCTGCAAGCCCAACCCCCTGCTCAAC
CCTAAATTGCTTCCGCCGATCGGTGAGAGCTCCGATGCCTTCTTCTTCTTCTTCTTCCTC
CCCCTCTACCTGTTCCTTCTCCGAGATAACTGCAACATTTTCAGCACTTTTTCTGGCCAT
TCTCAAGTCCCCAGCCCAGGGACTAGAGTGTTACTATGGCTAGAGCAAATGAGATGGTCA
GGGCAGACTCAAGGATGATGGTTGTCTTTAGTGCCCTGGCATCTAAATCAGGGCCACTGA
CATTTGAAGACTCGCTCAGATTTGTCAAGAAAGTGAAGGCTTGTAACTACATGTTGTATT

I want the sequences which are longer than 200 characters in another file

"in another file" meaning you don't want them left in the input? Or do you want them in another file as well as in this file? — Anthon, Mar 14 '15 at 09:04
sequence > 200, do you mean linelength or all the data (except newlines?) between 2 >TCON tags? — Walter A, Mar 14 '15 at 09:52
yes all data. not line length. all data between 2 >TCONS tags — user106326, Mar 14 '15 at 11:54
@WALTER A: count all data between 2>TCON tags and filter out those greater than 200 in length — user106326, Mar 16 '15 at 03:47

score 2 · Answer 1 · answered Mar 14 '15 at 08:57

2

You could use sed or awk or grep.

awk 'length($0)>200' file > newfile

OR

grep '^.\{201\}' file > newfile

answered Mar 14 '15 at 08:57

Avinash Raj

3,703

i think he wants to copy the lines which has the length more than 200 to another file. – Avinash Raj Mar 14 '15 at 09:02
not working. gerating blank files – user106326 Mar 14 '15 at 09:07
is there any output produced after running awk 'length($0)>200' file command? – Avinash Raj Mar 14 '15 at 09:08
not single lines. staring from ">TCONS" but reading from GGATG. it will read from ATA* of every ">TCONS*" and print those which are >=200 in lenght to other file – user106326 Mar 14 '15 at 09:09
no awk 'length($0)>200' file produced a blank file – user106326 Mar 14 '15 at 09:10
it reads from the top and print only the lines which has the characters of minimum 200. no awk 'length($0)>200' file produced a blank file, because no one line on your input file has min 200 chars. – Avinash Raj Mar 14 '15 at 09:11
@user106326 how you calculate 200 chars? block by block? – Avinash Raj Mar 14 '15 at 09:13

jimmij · Answer 2 · 2015-03-14T09:56:14.273

2

With awk you need to set > as a record separator first:

awk 'BEGIN{RS=">";ORS=""}length($0)>200{print ">"$0}' input > output

Another option with pcregrep:

pcregrep -M '^>[^>]{201,}' input > output

Or to count only DNA sequences, not characters in header:

pcregrep -M '^>[^>]*\n[^>]{201,}' input > output

edited Mar 14 '15 at 09:56

answered Mar 14 '15 at 09:17

jimmij

47,140

again blank file generated – user106326 Mar 14 '15 at 09:25
@user106326 Impossible, is your file exactly in the format you show here? Anyway see the edit and try with pcregrep. – jimmij Mar 14 '15 at 09:29
oh sorry. it worked but its still not showing the required amount i wanted to see. from a previous perl script i got 3817 count. here i got 3873 – user106326 Mar 14 '15 at 09:36
@user106326 I see, the problem is that it counts characters in whole blocks, while it should only gen sequence, excluding header. Try pcregrep -M '^>[^>]*\n[^>]{201,}' file – jimmij Mar 14 '15 at 09:46
now its correct. i got 3817....thanks a lot – user106326 Mar 14 '15 at 10:08
again 3889 counts..:( – user106326 Mar 14 '15 at 10:40
@user106326 correct, not correct... what is the problem? – jimmij Mar 14 '15 at 11:05

Anthon · Answer 3 · 2015-03-14T11:49:51.017

1

Python (split.py):

import sys

# call with the file as parameter

base = 0
line = ''
with open(sys.argv[-1]) as fp:
    with open('shorter', 'w') as fps:
        with open('longer', 'w') as fpl:
            for x in fp:
                if line and x.startswith('>'):
                    print len(line), base
                    if (len(line) - base) >= 200:
                        fpl.write(line)
                    else:
                        fps.write(line)
                    line = x
                    base = len(x)  # lenght of the ">..." line
                    continue
                if x.startswith('>'):  # very first one
                    base = len(x)
                line += x
            if line:
                if len(line) >= 200:
                    fpl.write(line)
                else:
                    fps.write(line)
                line = ""

call with python split.py inputfile and then mv shorter inputfile (after checking that the files are OK)

edited Mar 14 '15 at 11:49

answered Mar 14 '15 at 08:58

Anthon

79,293

File "filterlength.py", line 2 with open('shorter', 'w') as fps: ^ IndentationError: unexpected indent – user106326 Mar 14 '15 at 09:23
@user106326 There was a space too many at the beginning of each line – Anthon Mar 14 '15 at 09:24
Traceback (most recent call last): File "split.py", line 8, in fps.stdout.write(x) AttributeError: 'file' object has no attribute 'stdout' – user106326 Mar 14 '15 at 11:09
@user106326 I think the update better does what you need. it takes the length of the accumulated sequence (keeping into account the >...... which doesn't count against the 200) – Anthon Mar 14 '15 at 11:50
the shorter file is same as input while longer file is blank – user106326 Mar 14 '15 at 11:52
i want sequences longer than 200 . why move shoter into input?? – user106326 Mar 14 '15 at 12:00
Copy and paste your example to inputfile, makes sure that if you do wc inputfile you actually have 13 lines. That input gives two sequences in longer. Does your input file have newlines breaking up the sequences? – Anthon Mar 14 '15 at 12:02
@user106326 you asked "want [..] in another file" not "want [...] in another file as well" because of that you indicate you want to remove from the input. If that is not what you want update your question (I asked that 3 hours ago in a comment to your question). – Anthon Mar 14 '15 at 12:06

mikeserv · Answer 4 · 2015-03-14T13:54:05.500

You can do:

f=0
for arg in -e -v
do  f=$((f+1))
    tr '>\n' '\n>' <infile |
    grep "$arg" '>.\{200\}'|
    tr '>\n' '\n>' >"./outfile$f"
done

Which will write all sequences 200 chars or longer to outfile1 and all those shorter to outfile2.

It translates all > to \newlines and vice-versa - so your first example above becomes:

TCONS_00000066 gene=XLOC_000030>CCGCCGGCTGCTGCGCGCACCGACTTGTCACCACCCCAGCACGTCCTCCACGTATACAAG>CGCTACGGTCCACCGCGGCAGCGTCGACGTCCTTGTCCGCAAACATGGTGGTGGCAGCTT>CCTCATCGAGCAGCAGCAACTCATCCTCGAGGGGAAGGGCCCAGAGCTTCTAATCCTACA>CGGCAACAACACTTTATACTTGTGTATAATTTCTCTTCGTTTCTGAGTTCATGGCTATCT>TTGTCTCTCTTATCTTCTCCCTTTTGCTATCTCTATATTTGTGATTGCCATGGAAATACA>

It then matches (or doesn't) 200 chars from the first occurring > for each input line. For the first iteration of the for loop - the one which matches lines with 200 chars or more - it reverses the translation for the lines which grep matches and writes the results to outfile1. The second iteration gets shorter sequences and writes them to outfile2.

You should note that this count will include any newlines that were in the data.

Here's another way that doesn't have that problem:

sed -n '/^>/!{H;$!d
};   x;s/\n[ACGT]\{20\}/&/4p
'    <infile >outfile

@user106326 - you should try it again - you must have typed something wrong. There was no syntax error - though the iterator $f was staying at 0 because it was behind a pipe. Still, that would only get the outfile0 file written twice is all. — mikeserv, Mar 14 '15 at 13:56

Walter A · Answer 5 · 2015-03-14T15:02:39.700

cat file | while read -r line; do
  if [ ${#line} -gt 200 ]; then
    echo "${line}"
  fi
done

EDIT Question has been updated: not the length of one line but a set of lines is wanted.

In the following script I echo >TCONS otherwise the script will skip the last hit.

multiline=""
(cat input; echo ">TCONS string for last token") | while read line; do
        if [[ "$(echo "${line}"| cut -c1-6)" = ">TCONS" ]]; then
                if [ ${#multiline} -gt 200 ]; then
                        echo "${multiline}"
                fi
                multiline=""
        else
                multiline="${multiline}${line}"
        fi
done

score 0 · Answer 6 · edited Apr 13 '17 at 12:37

I already gave you my FastaToTbl and TblToFasta scripts in another answer. Using them, you could do

FastaToTbl sequences.fa | awk 'length($2)>200' | TblToFasta > newfile.fa

FastaToTbl

#!/usr/bin/awk -f
{
        if (substr($1,1,1)==">")
        if (NR>1)
                    printf "\n%s\t", substr($0,2,length($0)-1)
        else 
            printf "%s\t", substr($0,2,length($0)-1)
        else 
                printf "%s", $0
}END{printf "\n"}

TblToFasta

#! /usr/bin/awk -f
{
  sequence=$NF

  ls = length(sequence)
  is = 1
  fld  = 1
  while (fld < NF)
  {
     if (fld == 1){printf ">"}
     printf "%s " , $fld

     if (fld == NF-1)
      {
        printf "\n"
      }
      fld = fld+1
  }
  while (is <= ls)
  {
    printf "%s\n", substr(sequence,is,60)
    is=is+60
  }
}

filter out length

6 Answers6

FastaToTbl

TblToFasta