2

I have a file I want to parse:

mmu-miR-15-5p/16-5p/195-5p/424-5p/497-5p    0610007P14Rik
mmu-miR-326-3p/330-5p   0610007P14Rik
mmu-miR-326-3p/330-5p   Lmir
mmu-miR-15/16/195/424/497   0610007P14Rik
mmu-miR-15-5p/16-5p/195-5p/424-5p/497-5p/6838-5p    0610007P14Rik
mmu-miR-15/16/195/424-5p/497    Alinf
mmu-miR-326/330-5p  0610007P14Rik
mmu-miR-326/330 0610007P14Rik
mmu-miR-1/206/613   Crgi
mmu-miR-1-3p/206    0610007P14Rik

The desired output, for the first line:

mmu-miR-15-5p   0610007P14Rik
mmu-miR16-5p    0610007P14Rik
mmu-miR195-5p   0610007P14Rik
mmu-miR424-5p   0610007P14Rik
mmu-miR497-5p   0610007P14Rik

and so on...

I just need to replace each / with mmu-miR, start a new line at that point, and repeat the second column on every new line.

I tried the following one-line command in bash:

sed 's/\//\nmmu-miR/g' test.txt

but it keeps the second column only on the last line of each group:

mmu-miR-15-5p
mmu-miR16-5p
mmu-miR195-5p
mmu-miR424-5p
mmu-miR497-5p   0610007P14Rik
mmu-miR-326-3p
mmu-miR330-5p   0610007P14Rik
mmu-miR-326-3p
mmu-miR330-5p   Lmir

I tried to use a while loop and this sed command:

while read line; do 
    lineCols=( $line ); 
    v1=($(echo "${lineCols[0]}"));
    v2=($(echo "${lineCols[1]}"));
    sed 's/\//\n/g' ${v1};
done <test.txt

but got an error:

sed: can't read mmu-miR-15-5p/16-5p/195-5p/424-5p/497-5p: No such file or directory
sed: can't read mmu-miR-326-3p/330-5p: No such file or directory
sed: can't read mmu-miR-326-3p/330-5p: No such file or directory
sed: can't read mmu-miR-15/16/195/424/497: No such file or directory
sed: can't read mmu-miR-15-5p/16-5p/195-5p/424-5p/497-5p/6838-5p: No such file or directory

What am I doing wrong?

Kusalananda
  • 333,661
RKK
  • 77
  • 1
    You should avoid while read line; echo ... constructs (see here for further details). Furthermore, this looks like a job more suited for awk than sed, but that might be a pretty subjective matter. – Valentin B. Nov 14 '16 at 17:38

3 Answers

2

How to achieve this with awk

For better readability and ease of use, create an awk script (myScript.awk) with the following content:

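{
  # Split the first field on "/"; n is the number of pieces stored in a
  n=split($1, a, "/")
  # Split the first piece on "-" so that b[1] and b[2] hold the prefix parts
  split(a[1], b, "-")

  for (i=1; i<n+1; i++) {
    if (i == 1) {
      # The first piece already carries the full prefix.
      # Data is passed as arguments, never as the printf format string.
      printf "%s\t%s\n", a[i], $2
    }
    else {
      # Rebuild the prefix from b[1] and b[2] for the remaining pieces
      printf "%s-%s-%s\t%s\n", b[1], b[2], a[i], $2
    }
  }
}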

How it works:

n=split($1, a, "/")

This line takes the first field (for example "mmu-miR-15-5p/16-5p/195-5p/424-5p/497-5p" for the first line), splits it with separator "/", stores it in array a and stores the number of elements split in n. For the first line:

a[1] = "mmu-miR-15-5p"
a[2] = "16-5p"
a[3] = "195-5p"
a[4] = "424-5p"
a[5] = "497-5p"
n = 5

Remember that awk runs these instructions for every input line, so the arrays will hold different values for the next line!

split(a[1], b, "-")

Similarly, this line takes the first element of a and splits it with separator "-". This yields:

b[1] = "mmu"
b[2] = "miR"
b[3] = "15"
b[4] = "5p"

Once we have those arrays, all we need to do is loop over the elements of a (one output line per "/"-separated element of the input line) and build each output line from pieces of a and b. The first element needs special handling because a[1] already contains "mmu-miR-", hence the if that singles out that case.
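To see the two split() calls in isolation, here is a small sketch that prints both arrays for one line of the sample input. The function name show_split is made up for this illustration and is not part of the script above; the input is assumed to be whitespace-separated, as in the question.

```shell
# show_split is a hypothetical helper name used only for this illustration.
show_split() {
    awk '{
        n = split($1, a, "/")      # a[1..n]: the "/"-separated pieces of field 1
        m = split(a[1], b, "-")    # b[1..m]: the "-"-separated parts of a[1]
        for (i = 1; i <= n; i++) printf "a[%d] = %s\n", i, a[i]
        for (j = 1; j <= m; j++) printf "b[%d] = %s\n", j, b[j]
        printf "n = %d\n", n
    }'
}

# Third line of the sample input
printf 'mmu-miR-326-3p/330-5p\tLmir\n' | show_split
```

For that line, a holds "mmu-miR-326-3p" and "330-5p", b holds "mmu", "miR", "326" and "3p", and n is 2.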

How to run it

awk -f myScript.awk input.txt

I tested it, and it outputs what you ask for in your question.

NOTE As stated in my comment on your question, a single awk invocation is far more efficient and "shell-friendly" than looping over every line of your file in the shell.
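As for the "can't read" errors in the question: sed treats any argument after the script as the name of a file to read, so sed 's/\//\n/g' ${v1} tries to open a file named after the first column. If a shell loop is used anyway, the string has to reach the command on standard input. A minimal sketch of that, using tr instead of sed so that it does not depend on GNU sed's handling of \n in the replacement (expand_first_field is a made-up name for this example):

```shell
# expand_first_field is a hypothetical helper name used only for this sketch.
expand_first_field() {
    # $1: one "/"-separated value from the first column.
    # The string is written to the pipe; it is NOT passed as a file name.
    printf '%s\n' "$1" | tr '/' '\n'
}

expand_first_field 'mmu-miR-326-3p/330-5p'
```

This only splits the first column into separate lines. Adding the prefix and the second column to each line is still easier in a single awk call.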

EDIT NOTE I have modified the script following your comment. It should be fine now!

0

I think you're looking for something like this:

while read -r names tag
    do
        # first entry of the "/"-separated list, e.g. mmu-miR-15-5p
        name=${names%%/*}
        echo "$name $tag"
        # common prefix without the number, e.g. mmu-miR
        realname=`echo "$name" | sed 's|-[0-9].*||'`
        # remaining entries: drop the first one, then turn "/" into spaces
        for port in $(echo "$names" | sed 's|^[^/]*||; s|/| |g')
        do
            echo "$realname-$port $tag"
            #or echo "$realname$port $tag", but I suspect a typo in your initial post
        done
    done < inputFile.txt
SYN
  • 2,863
0

Assuming that the input is a header-less TSV file (i.e. a tab-delimited file with no header line(s)), then you may read it as such with Miller (mlr) and "un-nest" each record by the /-delimited strings in the first field. You may then prepend the string mmu-miR- to each value in the 1st field that does not already have it:

$ mlr --tsv -N nest --evar '/' -f 1 then put -S '$1 !=~ "^mmu-miR-" { $1 = "mmu-miR-" . $1 }' file
mmu-miR-15-5p   0610007P14Rik
mmu-miR-16-5p   0610007P14Rik
mmu-miR-195-5p  0610007P14Rik
mmu-miR-424-5p  0610007P14Rik
mmu-miR-497-5p  0610007P14Rik
mmu-miR-326-3p  0610007P14Rik
mmu-miR-330-5p  0610007P14Rik
mmu-miR-326-3p  Lmir
mmu-miR-330-5p  Lmir
mmu-miR-15      0610007P14Rik
mmu-miR-16      0610007P14Rik
mmu-miR-195     0610007P14Rik
mmu-miR-424     0610007P14Rik
mmu-miR-497     0610007P14Rik
mmu-miR-15-5p   0610007P14Rik
mmu-miR-16-5p   0610007P14Rik
mmu-miR-195-5p  0610007P14Rik
mmu-miR-424-5p  0610007P14Rik
mmu-miR-497-5p  0610007P14Rik
mmu-miR-6838-5p 0610007P14Rik
mmu-miR-15      Alinf
mmu-miR-16      Alinf
mmu-miR-195     Alinf
mmu-miR-424-5p  Alinf
mmu-miR-497     Alinf
mmu-miR-326     0610007P14Rik
mmu-miR-330-5p  0610007P14Rik
mmu-miR-326     0610007P14Rik
mmu-miR-330     0610007P14Rik
mmu-miR-1       Crgi
mmu-miR-206     Crgi
mmu-miR-613     Crgi
mmu-miR-1-3p    0610007P14Rik
mmu-miR-206     0610007P14Rik

The first Miller sub-command, nest, is here used to "un-nest" or "explode" the records into further records by splitting up the 1st field on slashes and duplicating the other fields (only one other field in this case) once for each generated string.

The second Miller sub-command, put, tests whether the value in the resulting 1st field starts with the correct prefix string and adds it if it doesn't. The -S option with put stops Miller from inferring types for the fields, so that all fields are treated as text.


Given the input in the question, we may get the same result using awk like so:

awk -F '\t' '
    BEGIN { OFS=FS }
    {
        nf = split($1,a,"/")
        print a[1], $2
        for (i = 2; i <= nf; ++i)
            print "mmu-miR-" a[i], $2
    }' file

This also reads the file as a tab-delimited file and splits the 1st field on slashes, generating a set of new strings in the array a. It then prints the first generated string together with the 2nd field before iterating over the remaining generated strings, prepending each with the missing mmu-miR- prefix, and outputting them with the value from the 2nd field.
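If the prefix should not be hard-coded, the same idea can be wrapped in a shell function that takes the prefix as an optional argument and, like the Miller put expression, only adds it when a piece does not already start with it. This is a sketch rather than a drop-in replacement: the function name unnest_mirna and the prefix variable are invented for this example, and the input is still assumed to be tab-delimited.

```shell
# unnest_mirna is a hypothetical helper; its optional first argument is the
# prefix to add, defaulting to the "mmu-miR-" used in the question.
unnest_mirna() {
    awk -F '\t' -v prefix="${1:-mmu-miR-}" '
        BEGIN { OFS = FS }
        {
            nf = split($1, a, "/")
            for (i = 1; i <= nf; ++i) {
                # index() returns 1 when a[i] already starts with the prefix
                name = (index(a[i], prefix) == 1) ? a[i] : prefix a[i]
                print name, $2
            }
        }'
}

# Reads tab-delimited data on standard input
printf 'mmu-miR-1/206/613\tCrgi\n' | unnest_mirna
```

With the real data, it would be called as unnest_mirna <file.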

Kusalananda
  • 333,661