
Using a bash shell operated through a pipe (non-interactive), I'm trying to pass huge amounts of data to a shell command. So far, I cannot get it to work reliably.

For example, using a here document, it would look like this:

(sed s/X//|base64 -d|lzcat|tar x) << EOF
XXQAAgAD//////////wAzG+wBunDDREwYD51KYXL50sahXmBTOGSine7WC0RATjpIrem5ygsQWKoZ
XwhPmkJAuCyqnO1KQAoFruXjSOsR3KJY+zHvzYFOgpl3ZJa+1+b0cB0w2vYzj53qplKMTjRkchPnr
XZ/nbloA=
EOF

But with huge amounts of data, this won't work since bash tries to load it all into memory before passing it to the command.

On the other hand, if I do it directly without a here document, the data should be passed straight to the command, but then the shell seems to interpret an unpredictable number of lines as shell commands:

(sed s/X//|base64 -d|lzcat|tar x)
XXQAAgAD//////////wAzG+wBunDDREwYD51KYXL50sahXmBTOGSine7WC0RATjpIrem5ygsQWKoZ
XwhPmkJAuCyqnO1KQAoFruXjSOsR3KJY+zHvzYFOgpl3ZJa+1+b0cB0w2vYzj53qplKMTjRkchPnr
XZ/nbloA=

I guess this has something to do with how the non-interactive shell buffers input.

I do not need to return to the shell after the data has been passed, so a solution like the latter one would work for me if it behaved predictably.
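For context, a payload in this shape could be generated along these lines (just a sketch; the leading X is assumed to be nothing more than a per-line prefix that the decoding sed strips off again):

# hypothetical generator, mirroring the decode pipeline sed s/X//|base64 -d|lzcat|tar x
# (--format=lzma because the receiving side uses lzcat)
tar c somedir | xz --format=lzma | base64 | sed 's/^/X/'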

ternary

3 Answers


The idea of having a gigabyte-sized shell script seems absurd to me, so just put the data into a separate file.

If you insist on having just one file, have the shell ignore this data: put it at the end of the file after an exit. The shell (at least bash) then does not read to the end of the file.

Use an external command to extract the data from the file and pass it to the intended commands:

#! /bin/bash

# placeholder for the real processing (e.g. base64 -d | lzcat | tar x)
do_something_with_the_data () {
        wc
}

# the data is read from the script file itself, so $0 must be a regular file
test -f "$0" || exit 3

# print the lines between the DATABLOCK-1 marker and the next empty line
awk '/^DATABLOCK-1$/ { run=1; next; }; run==0 { next; }; '\
'$0=="" { exit; }; { print; }' "$0" |
        do_something_with_the_data

awk '/^DATABLOCK-2$/ { run=1; next; }; run==0 { next; }; '\
'$0=="" { exit; }; { print; }' "$0" |
        do_something_with_the_data

# stop here so the shell never reads the data below as commands
exit 0

DATABLOCK-1
foo bar baz

DATABLOCK-2
x
y
z
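
To tie this back to the question, do_something_with_the_data could hold the decoding pipeline itself; a sketch (DATABLOCK-1 would then contain the X-prefixed base64 text instead of foo bar baz):

do_something_with_the_data () {
        # assumes the block holds the question's X-prefixed base64/lzma/tar payload
        sed 's/^X//' | base64 -d | lzcat | tar x
}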
Hauke Laging
    Should be two upvotes: (1) for the exit [what was I thinking], and (2) for the outrageous implementation of a HereDoc. – Paul_Pedant May 30 '20 at 17:41

You are passing << EOF, which tells the shell to do expansions and substitutions in the data. That's going to give it a headache, and possibly have unintended effects. You should quote the redirection to disable shell parsing of the data, like << 'EOF', but you must not quote the terminating EOF. If the EOF would be the last thing in the script, it is permitted to omit it (IIRC).
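
A quick way to see the difference (demonstration only, not the question's data):

cat << EOF
$HOME
EOF

cat << 'EOF'
$HOME
EOF

The first prints your home directory; the second prints the literal text $HOME.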

Please quantify "huge data". I tested this for a client requirement, and I got bored at 10 MB (and that was long ago, on smaller RAM than you would ever see today).

The sed is probably wrong: it replaces only the first X on each line. You probably want: sed 's/X//g'

tar has nothing to eXtract from. It needs an archive name, presumably "-" to read stdin.
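
Putting those two points together, the decoding stage would read something like this (a sketch of the idea, not a tested fix):

sed 's/X//g' | base64 -d | lzcat | tar -xf -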

Your final version is wrong. The pipeline has no redirection at all, so it will hang forever with sed reading stdin on the command line. The stuff starting XXQAAgAD/ will be interpreted as a command name if it gets that far.

I'm not clear why you would embed a huge amount of static data inside a shell script. That's what data files and pipes are for. What specific issue are you trying to solve here?

Of course, if you had a file which was archived through tar, compressed with xz, encoded with base64, and emailed to you, it all makes perfect sense. Except for the bit where you embed the data into a shell script. And the bit where it removes the first X.

Paul_Pedant
    "The pipeline has no redirection at all, so it will hang forever with sed reading stdin on the command line." -- it's not wrong if the the script is readable from stdin. Not sure what the command line has to do with it. "[tar] needs an archive name" -- well, it does have a default, which on my Debian, seems to be stdin. It could well be a tape drive, too, so yes, -f - would seem prudent. – ilkkachu May 30 '20 at 19:07
    "You probably want: sed 's/X//g'" -- It's very likely that they definitely don't want that, as removing one particular 6-bit block from everywhere in some base64-encoded data doesn't sound useful at all. It's much more likely they just want to remove the Xs from the start of the lines (I don't know what they're used for, but there is one at the start of each and every line of the data.) – ilkkachu May 30 '20 at 19:07
  • One place where a large amount (can't call it huge) of data is included in shell scripts is in several installation scripts that I've seen. The installation script includes either a tar.gz file or (more often) an .rpm or .deb file that it then extracts and installs. I can't say that I like the practice but it is one that I've had to adapt to. – doneal24 May 30 '20 at 21:29

Without the here-doc, it works fine for me, provided the script is available on stdin. If stdin is seekable, Bash seeks back to the end of the first line before running it; if it's not, it reads one character at a time so that the stream is left positioned right after what the shell itself has consumed. Dash (Debian's /bin/sh) doesn't take that care, though.

The content here is a gzip-compressed tar file containing a file called hello.txt (it's different data from that in the question):

$ ls
data.sh
$ cat data.sh 
sed -e 's/^X//' | base64 -d | tar -zxf -
XH4sIANuo0l4AA+3RMQrCQBCF4ak9xZ5AZmc363mCCglEAusoHl8TxM4iRZLm/5rHwCseTHcdhvHo
XL5f16EfJecp4anS+NX1zViSmUnJjOVkUjWbFJOiKm34ed29rCNL7s6/e/u2dL3W8bTFoW930/8Pe
XKwAAAAAAAAAAAAAAAAAAS70BbZqA2QAoAAA=
$ bash < data.sh 
$ cat hello.txt 
hello
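
The same should hold when the script arrives over an actual pipe, since a pipe isn't seekable and Bash then falls back to reading one byte at a time; Dash would most likely consume part of the data (a sketch, not tested here):

$ cat data.sh | bash     # Bash leaves the unread data for the pipeline
$ cat data.sh | dash     # Dash reads ahead, so expect it to swallow some of the data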

You probably want to use tar -f -, as the default input might well be a tape drive, depending on the system (and for GNU tar, how it's compiled).

But really, self-extracting shell scripts like this ask the user to start off by running some code you sent them, and there's something fishy about that. Plus, base64 encoding expands the data significantly, so you'd use less space if you just transferred the tar file as a separate file. That is, if that's possible, which I should probably assume it isn't, since we're talking about scripts like this.

ilkkachu
  • Can you explain why your example reads the file twice, and why it repositions where it does? It works only if you explicitly run bash < data.sh, AND if you edit the whole data set to insert a preceding : on every line, AND if you fix the sed to sub : instead of X (bash V4.3.8). It hangs if you just run ./data.sh, and it fails on every line if they lack a : (which also shows it reads all the data too). The "it" that works is not the "it" that was posted. – Paul_Pedant Jun 01 '20 at 09:31
  • @Paul_Pedant, the : vs X is really inconsequential, I would use something like : here to make it visually distinct from the base64-encoded data, but we can change it back. I also used a whole different set of data: the original (with the Xs and base64 encoding removed) was an LZMA-compressed tar file with a file called foo inside, mine was a gzip-compressed tar file with hello.txt. Mostly because I'm uncomfortable about including binary dumps I didn't create. One could also just remove the leading Xs/colons/whatever and the accompanying sed, they're not really necessary. – ilkkachu Jun 01 '20 at 10:08
  • @Paul_Pedant, as for running with bash < data.sh, yes, that's the important part, that's how it works. They said they're "Using a bash shell operated through a pipe", so I assume cat data.sh | bash would be closer, but the question isn't exactly clear on how it works. Perhaps I should have asked them what they're doing exactly before writing the answer, especially since it's quite possible they're not using Bash, but Dash, because that would explain the "shell seems to interpret an unpredictable amount of lines as shell commands" part. – ilkkachu Jun 01 '20 at 10:13
  • @Paul_Pedant, If you try their script with bash < data.sh, or mine, it works fine. It's not the shell that reads the file in full there, it's the pipeline. – ilkkachu Jun 01 '20 at 10:13
  • @Paul_Pedant, none of this makes any sense to do if one can just transfer a regular tarball over, and run tar xf foo.tar on it from a command line. That would be much better, and what I tried to mention in the last paragraph. But then, I guess I should assume that's not possible since they're even trying something like this. – ilkkachu Jun 01 '20 at 10:24
  • Thanks, got it now, my dumb. The pipeline has no explicit input. This works because (a) it inherits stdin from the bash, and (b) bash is squeaky-clean about stream positioning. – Paul_Pedant Jun 01 '20 at 15:14
  • @Paul_Pedant, yes, exactly! It's not something very commonly done... – ilkkachu Jun 01 '20 at 17:00