How do I break up a file like split to stdout for piping to a command?

Question

I have a large .sql file full of SELECT statements that contain data I want to insert into my SQL Server database. I'm looking for how I could basically take the file's contents, 100 lines at a time, and pass it to the commands I have set to do the rest.

Basically, I'm looking for split that will output to stdout, not files.

I'm also using Cygwin on Windows, so I don't have access to the full suite of tools.

Have you looked at using BULK INSERT? Separate the data from the SQL statement. — bsd, Apr 20 '14 at 11:47

score 5 · Answer 1 · edited Apr 13 '17 at 12:36

I think the easiest way to do this is:

while IFS= read -r line; do
  { printf '%s\n' "$line"; head -n 99; } |
  other_commands
done <database_file

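You need to use read for the first line of each section because there appears to be no other way to stop when the end of the file is reached. read fails at end of file, which ends the loop; head on its own would just return nothing and the loop would never terminate.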
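As a rough sketch, the same loop can be wrapped in a function so the chunk size and the command are parameters. The name chunk_lines is made up for illustration, and this assumes the input is a regular (seekable) file and that head leaves the file offset just past the lines it printed, as GNU head does:

```shell
# chunk_lines FILE N CMD...: run CMD once for every N-line chunk of FILE,
# with the chunk on CMD's standard input.
chunk_lines() {
    file=$1
    n=$2
    shift 2
    while IFS= read -r line; do
        # read took the first line of the chunk; head supplies the other N-1.
        { printf '%s\n' "$line"; head -n "$((n - 1))"; } | "$@"
    done <"$file"
}

# Example (hypothetical file name): count the lines in each 100-line chunk.
# chunk_lines database_file 100 wc -l
```

A final line with no trailing newline makes read fail, so that line would be skipped; the original loop behaves the same way.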

don_crissti · Answer 2 · 2017-02-01T14:48:23.820

Basically, I'm looking for split that will output to stdout, not files.

If you have access to GNU split, the --filter option does exactly that:

‘--filter=command’

    With this option, rather than simply writing to each output file, write
    through a pipe to the specified shell command for each output file.

So in your case, you could either use those commands with --filter, e.g.

split -l 100 --filter='{ cat Header.sql; cat; } | sqlcmd; printf %s\\n DONE' infile

or write a script, e.g. myscript:

#!/bin/sh

{ cat Header.sql; cat; } | sqlcmd
printf %s\\n '--- PROCESSED ---'

and then simply run

split -l 100 --filter=./myscript infile
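To try --filter without a database at hand, here is a small sketch that reads from stdin instead of a file. The function name per_chunk is made up, and it assumes GNU split; the filter is run by the shell once per chunk, with $FILE set to the name of the output file split would otherwise have written:

```shell
# per_chunk N 'FILTER': pipe each N-line chunk of stdin to the shell
# command FILTER (requires GNU split for --filter).
per_chunk() {
    split -l "$1" --filter="$2" -
}

# Example: report the size of each 3-line chunk of 7 input lines.
# seq 1 7 | per_chunk 3 'echo "$FILE: $(wc -l) lines"'
```

With -l, split finishes one filter before starting the next, so the chunks are processed in order.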

mikeserv · Answer 3 · 2014-04-21T16:12:41.997

_linc() ( ${sh-da}sh ${dbg+-vx} 4<&0 <&3 ) 3<<-ARGS 3<<\CMD
        set -- $( [ $((i=${1%%*[!0-9]*}-1)) -gt 1 ] && {
                shift && echo "\${inc=$i}" ; }
        unset cmd ; [ $# -gt 0 ] || cmd='echo incr "#$((i=i+1))" ; cat'
        printf '%s ' 'me=$$ ;' \
        '_cmd() {' '${dbg+set -vx ;}' "$@" "$cmd" '
        }' )
        ARGS
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
CMD

The above function uses sed to apply its argument list as a command string to an arbitrary line increment. The commands you specify on the command line are sourced into a temporary shell function which is fed a here document on stdin consisting of every increment's step worth of lines.

You use it like this:

time printf 'this is line #%d\n' `seq 1000` |
_linc 193 sed -e \$= -e r \- \| tail -n2
    #output
193
this is line #193
193
this is line #386
193
this is line #579
193
this is line #772
193
this is line #965
35
this is line #1000
printf 'this is line #%d\n' `seq 1000`  0.00s user 0.00s system 0% cpu 0.004 total

The mechanism here is very simple:

i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s

That's the sed script. Basically printf just repeats $!n once for every line of the increment after the first. So if you set your increment to 100, printf will write you a sed script consisting of 99 lines that say only $!n (the debug trace further down shows seq 2 for an increment of 3), one insert line for the top end of the here-doc, and one append for the bottom line. That's it. Most of the rest just handles options.

The n (next) command tells sed to print the current line and replace it with the next line of input. The $! address specifies that it should only try that on any line but the last.
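Stripped of the here-doc plumbing, the $!n trick can be sketched on its own. This is a simplification, not the function above: mark_chunks is a made-up name, it uses a plain loop where the original uses printf with seq, and it only inserts a marker line before each chunk:

```shell
# mark_chunks N: copy stdin to stdout, printing a marker line before
# every N-line chunk.
mark_chunks() {
    script=$(
        # One insert command for the top of each chunk...
        printf 'i\\\n--- chunk ---\n'
        # ...followed by N-1 copies of $!n to pull in the rest of the chunk.
        i=1
        while [ "$i" -lt "$1" ]; do
            printf '$!n\n'
            i=$((i + 1))
        done
    )
    sed -e "$script"
}

seq 1 7 | mark_chunks 3
```

This prints the marker, then 1 2 3, the marker, then 4 5 6, the marker, then 7, one item per line. Each sed cycle starts with one line already read, which is why only N-1 copies of $!n are needed.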

Provided only an incrementer it will:

printf 'this is line #%d\n' `seq 10` |
_linc 3
    #output
incr #1
this is line #1
this is line #2
this is line #3
incr #2
this is line #4
this is line #5
this is line #6
incr #3
this is line #7
this is line #8
this is line #9
incr #4
this is line #10

So what's happening behind the scenes here is the function is set to echo a counter and cat its input if not provided a command string. If you saw it on the command line it would look like:

{ echo "incr #$((i=i+1))" ; cat ; } <<HEREDOC
this is line #7
this is line #8
this is line #9
HEREDOC

It executes one of these for every increment. Look:

printf 'this is line #%d\n' `seq 10` |
dbg= _linc 3
    #output
set -- ${inc=2}
+ set -- 2
me=$$ ; _cmd() { ${dbg+set -vx ;} echo incr "#$((i=i+1))" ; cat
}
+ me=19396
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
+ s=
+ . /dev/stdin
+ seq 2
+ printf $!n\n%.0b 1 2
+ sed -f - /dev/fd/4
_cmd <<"19396SPLIT19396"
this is line #1
this is line #2
this is line #3
19396SPLIT19396
+ _cmd
+ set -vx ; echo incr #1
+ cat
this is line #1
this is line #2
this is line #3
_cmd <<"19396SPLIT19396"

REALLY FAST

time yes | sed = | sed -n 'p;n' |
_linc 4000 'printf "current line and char count\n"
    sed "1w /dev/fd/2" | wc -c
    [ $((i=i+1)) -ge 5000 ] && kill "$me" || echo "$i"'

    #OUTPUT

current line and char count
19992001
36000
4999
current line and char count
19996001
36000
current line and char count
[2]    17113 terminated  yes |
       17114 terminated  sed = |
       17115 terminated  sed -n 'p;n'
yes  0.86s user 0.06s system 5% cpu 16.994 total
sed =  9.06s user 0.30s system 55% cpu 16.993 total
sed -n 'p;n'  7.68s user 0.38s system 47% cpu 16.992 total

Above I tell it to increment on every 4000 lines. 17s later and I've processed 20 million lines. Of course the logic isn't serious there - we only read each line twice and count all of their characters, but the possibilities are pretty open. Also if you look closely you might notice it's seemingly the filters providing the input that are taking the majority of the time anyway.

It's worth noting that the sheer complexity of the shell magic in this makes it not portable - it certainly doesn't run on bash 4 on OS X 10.9. :) It wants to expand to use dash, and sed -f - doesn't make BSD sed happy either... not to mention having to pull the heredoc markers back to ^... – keen, Jan 10 '17 at 21:04

Ole Tange · Answer 4 · 2014-04-26T00:48:20.603

GNU Parallel is made for this:

cat bigfile | parallel --pipe -N100 yourscript

It will default to running 1 job per CPU core. You can force running a single job with '-j1'.
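If the chunks have to be handled strictly one after another (probably the case when each one is a batch of SQL), a sketch along these lines may help. The function name ordered_chunks is made up and wc -l is only a stand-in for yourscript; -N with --pipe is the number of records (lines) per job, -j1 runs one job at a time, and -k keeps the output in input order:

```shell
# ordered_chunks N CMD...: feed stdin to CMD in N-line chunks,
# one job at a time, output in input order (requires GNU Parallel).
ordered_chunks() {
    n=$1
    shift
    parallel --pipe -N"$n" -j1 -k "$@"
}

# Example: count the lines in each 3-line chunk of 7 input lines.
# seq 1 7 | ordered_chunks 3 wc -l
```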

Version 20140422 includes a fast version that can deliver 3.5 GB/s. The price is that it cannot deliver the exact 100 lines, but if you know the approximate line length you can set --block to 100 times that (here I assume the line length is close to 500 bytes):

parallel --pipepart --block 50k yourscript :::: bigfile

score 1 · Accepted Answer · answered Apr 20 '14 at 11:29

I ended up with something that seems pretty gross; if there's a better way, please post it:

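#!/bin/bash
# Usage: ./insert.sh FILE LINES_PER_BATCH
# Uses bash features (+=, $'\n', ${var::len}), so it is run with bash, not plain sh.

DONE=false
until $DONE; do
    # Collect up to $2 non-empty lines into $lines.
    for i in $(seq 1 "$2"); do
        read line || DONE=true
        [ -z "$line" ] && continue
        lines+=$line$'\n'
    done
    # Drop the last 10 characters of the batch before sending it.
    sql=${lines::${#lines}-10}
    (cat "Header.sql"; echo "$sql") | sqlcmd
    #echo "--- PROCESSED ---"
    lines=
done < "$1"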

Run with ./insert.sh "File.sql" 100 where the 100 is the number of lines to process at a time.

Ehryk


I'm not sure exactly what assumptions are safe with SQL, but for general safety you should do IFS= read -r line. Consider the difference between echo ' \t\e\s\t ' | { read line; echo "[$line]"; } and echo ' \t\e\s\t ' | { IFS= read -r line; echo "[$line]"; }. Also echo is not safe with arbitrary strings (e.g. line="-n"; echo "$line"); it is safer to use printf '%s\n'. – Graeme Apr 20 '14 at 12:50
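To make the difference concrete, here is a small sketch using printf rather than echo, so the result does not depend on how a particular echo treats backslashes. The variable names are just for illustration:

```shell
# Two leading spaces, a literal backslash followed by t, two trailing spaces.
input='  a\tb  '

# Plain read: trims leading/trailing whitespace and eats the backslash.
plain=$(printf '%s\n' "$input" | { read line; printf '[%s]' "$line"; })

# IFS= read -r: the line comes through untouched.
safe=$(printf '%s\n' "$input" | { IFS= read -r line; printf '[%s]' "$line"; })

printf '%s\n' "$plain" "$safe"
# [atb]
# [  a\tb  ]
```

For SQL text, the first form would silently change any line that is indented or contains a backslash.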
