12

I have two text files: String.txt and lengths.txt

String.txt:

abcdefghijklmnopqrstuvwxyz

lengths.txt:

5
4
10
7

I want to get this file:

>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz

I'm working with about 28,000 entries and they vary between 200 and 56,000 characters.

At the moment, I'm using:

start=1
end=0
i=0
while read read_l
do
    let i=i+1
    let end=end+read_l
    echo -e ">Entry_$i" >>outfile.txt
    echo "$(cut -c$start-$end String.txt)" >>outfile.txt
    let start=start+read_l
    echo $i
done <lengths.txt

But it's very inefficient. Any better ideas?

  • How about str="$(cat string.txt)"; i=0; while read j; do echo "${str:$i:$j}"; i=$((i+j)); done <lengths.txt ..seems fast enough, as it's done only by the shell.. – heemayl Aug 12 '15 at 11:30
  • It's not a lot faster to be honest. It's still taking quite a long time. I'm quite new to linux/programming so if you think there's a faster method not only using shell, I'm open to ideas. – user3891532 Aug 12 '15 at 11:45
  • Try { while read l<&3; do head -c"$l"; echo; done 3<lengths.txt; } <String.txt. – jimmij Aug 12 '15 at 11:58
  • @jimmij, how about sticking that into an answer – iruvar Aug 12 '15 at 12:20

4 Answers

8

Generally, you don't want to use shell loops to process text. Here, I'd use perl:

$ perl -lpe 'read STDIN,$_,$_; print ">Entry_" . ++$n' lengths.txt < string.txt
>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz

That's a single command that reads both files only once, with buffering (so far more efficiently than the shell's read, which reads one byte at a time, or a few bytes for regular files) and without storing either file fully in memory, so it is going to be several orders of magnitude more efficient than solutions that run external commands in a shell loop.

(Add the -C option if those numbers should be counts of characters in the current locale as opposed to counts of bytes. For ASCII characters like in your sample, that won't make any difference.)
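For example, a character-based run might look like this (the same one-liner, just with -C added; a sketch assuming a UTF-8 locale):

$ perl -C -lpe 'read STDIN,$_,$_; print ">Entry_" . ++$n' lengths.txt < string.txt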

  • That's a convoluted reuse of $_ as both output and input parameter to read, but it reduces the byte count in the script. – Jonathan Leffler Aug 12 '15 at 14:49
  • In a quick test (the OP's sample repeated 100000 times), I find this solution is about 1200 times as fast as @jimmij's (0.3 seconds vs 6 minutes (with bash, 16 seconds with PATH=/opt/ast/bin:$PATH ksh93)). – Stéphane Chazelas Aug 13 '15 at 11:12
7

You can do

{
  while read l<&3; do
    {
      head -c"$l"
      echo
    } 3<&-
  done 3<lengths.txt
} <String.txt

It requires some explanation:

The main idea is to use { head ; } <file, derived from the underestimated @mikeserv answer. In this case, however, we need many invocations of head, so a while loop is introduced, along with a little file-descriptor juggling to feed head input from both files (String.txt as the main file to process, and lines from lengths.txt as the argument to the -c option). The benefit in speed should come from not needing to seek through String.txt each time a command like head or cut is invoked. The echo just prints a newline after each iteration.

How much faster it is (if any), and adding >Entry_i between the lines, is left as an exercise (a sketch of the latter follows).
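For what it's worth, a minimal sketch of that exercise (the same fd trick, plus a counter for the headers):

i=0
{
  while read l<&3; do
    i=$((i+1))
    printf '>Entry_%d\n' "$i"
    { head -c"$l"; echo; } 3<&-
  done 3<lengths.txt
} <String.txt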

jimmij
  • Neat use of the I/O redirection. Since the tag is Linux, you can reasonably assume the shell is Bash and use read -u 3 to read from descriptor 3. – Jonathan Leffler Aug 12 '15 at 14:30
  • @JonathanLeffler, Linux has little to do with bash. The great majority of Linux-based systems don't have bash installed (think Android and other embedded systems). bash being the slowest shell of all, switching to bash will likely degrade performance more significantly than the little gain that switching from read <&3 to read -u3 might bring (which in any case will be insignificant compared to the cost of running an external command like head). Switching to ksh93 that has head builtin (and one that supports the non-standard -c option) would improve performance a lot more. – Stéphane Chazelas Aug 12 '15 at 15:05
  • Note that the argument of head -c (for the head implementations where that non-standard option is available) is a number of bytes, not characters. That would make a difference in multi-byte locales. – Stéphane Chazelas Aug 12 '15 at 15:10
6

bash, version 4

mapfile -t lengths <lengths.txt   # bash 4: read all the lengths into an array
string=$(< String.txt)            # slurp the whole string into a variable
i=0
n=0
for len in "${lengths[@]}"; do
    echo ">Entry_$((++n))"
    echo "${string:i:len}"        # substring expansion: offset i, length len
    ((i+=len))
done

output

>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz
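(mapfile is what needs bash 4; on an older bash, a sketch that fills the array by hand works too:)

lengths=()
while IFS= read -r len; do
    lengths+=("$len")
done <lengths.txt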
glenn jackman
4

What about awk?

Create a file called process.awk with this code:

function idx(i1, v1, i2, v2)
{
    # numerical index comparison, ascending order
    return (i1 - i2)
}
FNR==NR { a[FNR]=$0; next }           # first file: store the lengths
{
    i = 1
    PROCINFO["sorted_in"] = "idx"     # gawk: iterate indices in numeric order
    for (j in a) {
        print ">Entry_" j
        ms = substr($0, i, a[j])
        print ms
        i = i + length(ms)
    }
}

Save it and execute awk -f process.awk lengths.txt string.txt
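Note that PROCINFO["sorted_in"] is gawk-specific (see manatwork's comment below). A portable sketch of the same logic in standard awk, iterating the stored indices numerically instead:

awk 'FNR==NR { a[FNR] = $0; n = FNR; next }
{
    i = 1
    for (j = 1; j <= n; j++) {
        print ">Entry_" j
        print substr($0, i, a[j])
        i += a[j]
    }
}' lengths.txt string.txt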

jcbermu
  • Based on the use of PROCINFO, this is not standard awk but gawk. In that case I would prefer another gawk-only feature, FIELDWIDTHS: awk -vFIELDWIDTHS="$(tr '\n' ' ' < lengths.txt)" '{for(i=1;i<=NF;i++)print">Entry"i ORS$i}' string.txt – manatwork Aug 13 '15 at 09:11