12

I have two text files: String.txt and lengths.txt

String.txt:

abcdefghijklmnopqrstuvwxyz

lengths.txt:

5
4
10
7

I want to get this file:

>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz

I'm working with about 28,000 entries and they vary between 200 and 56,000 characters.

At the moment, I'm using:

start=1
end=0
i=0
while read read_l
do
    let i=i+1
    let end=end+read_l
    echo -e ">Entry_$i" >>outfile.txt
    echo "$(cut -c$start-$end String.txt)" >>outfile.txt
    let start=start+read_l
    echo $i
done <lengths.txt

But it's very inefficient. Any better ideas?

  • How about str="$(cat string.txt)"; i=0; while read j; do echo "${str:$i:$j}"; i=$((i+j)); done <lengths.txt ..seems fast enough, as it's done only by the shell.. – heemayl Aug 12 '15 at 11:30
  • It's not a lot faster to be honest. It's still taking quite a long time. I'm quite new to linux/programming so if you think there's a faster method not only using shell, I'm open to ideas. – user3891532 Aug 12 '15 at 11:45
  • Try { while read l<&3; do head -c"$l"; echo; done 3<lengths.txt; } <String.txt. – jimmij Aug 12 '15 at 11:58
  • @jimmij, how about sticking that into an answer – iruvar Aug 12 '15 at 12:20

4 Answers

8

Generally, you don't want to use shell loops to process text. Here, I'd use perl:

$ perl -lpe 'read STDIN,$_,$_; print ">Entry_" . ++$n' lengths.txt < string.txt
>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz

That's a single command that reads both files only once, with buffering (so far more efficiently than the shell's read, which reads one byte at a time, or a few bytes for regular files) and without storing either file fully in memory, so it is going to be several orders of magnitude more efficient than solutions that run external commands in a shell loop.

(Add the -C option if those numbers should be counts of characters in the current locale as opposed to counts of bytes. For ASCII characters like in your sample, that won't make any difference.)
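For example, a character-based run might look like this (the same one-liner, just with -C added; a sketch assuming a UTF-8 locale):

$ perl -C -lpe 'read STDIN,$_,$_; print ">Entry_" . ++$n' lengths.txt < string.txt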

  • That's a convoluted reuse of $_ as both output and input parameter to read, but it reduces the byte count in the script. – Jonathan Leffler Aug 12 '15 at 14:49
  • In a quick test (the OP's sample repeated 100000 times), I find this solution is about 1200 times as fast as @jimmij's (0.3 seconds vs 6 minutes (with bash, 16 seconds with PATH=/opt/ast/bin:$PATH ksh93)). – Stéphane Chazelas Aug 13 '15 at 11:12
7

You can do

{
  while read l<&3; do
    {
      head -c"$l"
      echo
    } 3<&-
  done 3<lengths.txt
} <String.txt

It requires some explanation:

The main idea is to use { head ; } <file, derived from the underestimated @mikeserv answer. In this case, however, we need many invocations of head, so a while loop is introduced, along with a little file-descriptor juggling to feed head input from both files (String.txt as the main file to process, and lines from lengths.txt as the argument to the -c option). The benefit in speed should come from not needing to seek through String.txt each time a command like head or cut is invoked. The echo just prints a newline after each iteration.

How much faster it is (if any), and adding >Entry_i between the lines, is left as an exercise (a sketch of the latter follows).
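For what it's worth, a minimal sketch of that exercise (the same fd trick, plus a counter for the headers):

i=0
{
  while read l<&3; do
    i=$((i+1))
    printf '>Entry_%d\n' "$i"
    { head -c"$l"; echo; } 3<&-
  done 3<lengths.txt
} <String.txt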

jimmij
  • Neat use of the I/O redirection. Since the tag is Linux, you can reasonably assume the shell is Bash and use read -u 3 to read from descriptor 3. – Jonathan Leffler Aug 12 '15 at 14:30
  • @JonathanLeffler, Linux has little to do with bash. The great majority of Linux-based systems don't have bash installed (think Android and other embedded systems). bash being the slowest shell of all, switching to bash will likely degrade performance more significantly than the little gain that switching from read <&3 to read -u3 might bring (which in any case will be insignificant compared to the cost of running an external command like head). Switching to ksh93 that has head builtin (and one that supports the non-standard -c option) would improve performance a lot more. – Stéphane Chazelas Aug 12 '15 at 15:05
  • Note that the argument of head -c (for the head implementations where that non-standard option is available) is a number of bytes, not characters. That would make a difference in multi-byte locales. – Stéphane Chazelas Aug 12 '15 at 15:10
6

bash, version 4

mapfile -t lengths <lengths.txt   # bash 4: read all the lengths into an array
string=$(< String.txt)            # slurp the whole string into a variable
i=0
n=0
for len in "${lengths[@]}"; do
    echo ">Entry_$((++n))"
    echo "${string:i:len}"        # substring expansion: offset i, length len
    ((i+=len))
done

output

>Entry_1
abcde
>Entry_2
fghi
>Entry_3
jklmnopqrs
>Entry_4
tuvwxyz
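(mapfile is what needs bash 4; on an older bash, a sketch that fills the array by hand works too:)

lengths=()
while IFS= read -r len; do
    lengths+=("$len")
done <lengths.txt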
glenn jackman
4

What about awk?

Create a file called process.awk with this code:

function idx(i1, v1, i2, v2)
{
    # numerical index comparison, ascending order
    return (i1 - i2)
}
FNR==NR { a[FNR]=$0; next }           # first file: store the lengths
{
    i = 1
    PROCINFO["sorted_in"] = "idx"     # gawk: iterate indices in numeric order
    for (j in a) {
        print ">Entry_" j
        ms = substr($0, i, a[j])
        print ms
        i = i + length(ms)
    }
}

Save it and execute awk -f process.awk lengths.txt string.txt
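Note that PROCINFO["sorted_in"] is gawk-specific (see manatwork's comment below). A portable sketch of the same logic in standard awk, iterating the stored indices numerically instead:

awk 'FNR==NR { a[FNR] = $0; n = FNR; next }
{
    i = 1
    for (j = 1; j <= n; j++) {
        print ">Entry_" j
        print substr($0, i, a[j])
        i += a[j]
    }
}' lengths.txt string.txt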

jcbermu
  • Based on the use of PROCINFO, this is not standard awk but gawk. In that case I would prefer another gawk-only feature, FIELDWIDTHS: awk -vFIELDWIDTHS="$(tr '\n' ' ' < lengths.txt)" '{for(i=1;i<=NF;i++)print">Entry"i ORS$i}' string.txt – manatwork Aug 13 '15 at 09:11