
I have the below call to the db2 database command-line tool inside a loop, which runs for 100k iterations.

(The output from db2 is about 5 rows of roughly 20 characters per call, for a total of about 100k calls. The input to the stored procedure is prefixed to each output line from the SP and dumped to a log file.)

while read line
do
    db2 -x "call stored_procedure_XYZ($line)" | sed "s/^/$line/" >> log_file.txt
done < "$infile"

I am trying to make the whole thing run faster by taking the sed and the disk write out of the loop. Is it advisable to store all the output in a variable, and later modify it and write it to disk?

What is the maximum length a shell variable can hold?

(bash, aix 6.1)

– dbza
  • If all you are doing is appending output to a file what is the point of putting it all in a variable first? What else are you doing with the data? – jw013 Dec 10 '14 at 20:11
  • I am thinking writing to disk only once after the loop gets over might improve performance, instead of writing multiple times inside the loop – dbza Dec 10 '14 at 20:21
  • Do you finish processing the data within an iteration, or do you need to process it further after the loop? – muru Dec 10 '14 at 20:29
  • @dbza You should try it yourself. Whether it will make a difference really depends on how slow your disk is. Another thing you can do is move the redirection outside the loop, as long as your loop doesn't output anything else. – jw013 Dec 10 '14 at 20:30
  • @jw013 Yeah I will try, but will the shell variable have a limit for characters? Can it hold all data the file could hold? – dbza Dec 10 '14 at 20:35
  • @muru There is other stuff outside the loop, but in this context all I need to do is prefix the loop counter to the db2 output and write the db2 output to a file. – dbza Dec 10 '14 at 20:36
  • This might be of interest: http://stackoverflow.com/questions/1078031/what-is-the-maximum-size-of-an-environment-variable-value If you can shift processing from outside the loop to inside it, you might gain performance from holding off on the file, without a significant increase in memory usage. – muru Dec 10 '14 at 20:40
  • @dbza Maximum size probably will not matter. As far as disk I/O goes, buffering more than a few megabytes is probably not going to improve performance. Even if a shell has an unlimited variable size, at some point you are going to run out of memory, and your kernel is going to start swapping, at which point you've probably already lost any performance benefits you might have gained. – jw013 Dec 10 '14 at 20:40
  • Maybe I'm misreading, but if your plan to increase performance involves replacing sed with a shell loop, then that is the wrong way to go. It looks like you're already calling sed once for each iteration of the loop - that's where you get the loop counter value? For big jobs you should be using some stream-capable tool - like sed - to tell the shell what to do, not vice versa: `<infile cmd | cmd | cmd | cmd >outfile`. 10 to 1 the shell is the weakest link in your performance chain. – mikeserv Dec 10 '14 at 21:15
  • Can you just have nl do the call stored... part with its -s separator string? It's not very clear what you're doing though - you provide no sample input or output. What does db2 do... and why? – mikeserv Dec 10 '14 at 23:05
  • @mikeserv db2 appears to be some sort of IBM database command line utility. It is highly unlikely that nl does anything close to what it does. – jw013 Dec 11 '14 at 00:53
  • @dbza Since you have edited your question to show the entire loop, I would recommend moving >>log_file.txt outside the loop, so you have done<infile >>log_file.txt as the last line instead (see the sketch after these comments). That way you keep the file open and avoid reopening and reclosing it each iteration. If you want to try anything more complicated though, you should first take measurements to see if disk I/O is really taking enough time to be worth optimizing. You can probably do this by replacing >>log_file.txt with >/dev/null to get rid of disk I/O altogether and see how much of a speed-up you get. – jw013 Dec 11 '14 at 01:22
  • @jw013 Thanks! I'll try that. You can add an answer as well, since this is what I was looking at. – dbza Dec 12 '14 at 01:53
  • @jw013 I guess you meant done<infile >log_file.txt and not >> to append – dbza Dec 12 '14 at 01:54
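
For reference, a sketch of jw013's suggestion: the loop from the question with the redirection moved from the db2 line to after done, so that log_file.txt is opened and closed only once instead of on every iteration:

    while read line
    do
        db2 -x "call stored_procedure_XYZ($line)" | sed "s/^/$line/"
    done < "$infile" >> log_file.txt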

1 Answer


Bash variable size is not fixed. A variable can very likely hold arbitrary amounts of data, as long as malloc can find sufficient memory and contiguous address space. Let's assume you have stored a large amount of data in your variable. When you try to write that data to your file, you may get an error something like this:

/bin/echo ${LARGE_DATA} >> ${YourFile}    
/bin/echo: Argument list too long

This error is related to the maximum length of your command's arguments. Please check the "Limits on size of arguments and environment" section in the execve man page: http://man7.org/linux/man-pages/man2/execve.2.html

"... the memory used to store the environment and argument strings was limited to 32 pages (defined by the kernel constant MAX_ARG_PAGES). On architectures with a 4-kB page size, this yields a maximum size of 128 kB ... "

EDIT:

Please also note that the above error for /bin/echo is just an example; it is possible to get a similar error when you try other ways of writing the file. It is about argument size.
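
(As jw013 notes in the comments below, the limit only applies when an external command is executed; a shell built-in involves no execve call at all. For example, this sketch uses bash's built-in printf, so it is not subject to the argument-list limit:)

    # printf is a shell built-in here: no new process, so no "Argument list too long"
    printf '%s\n' "${LARGE_DATA}" >> "${YourFile}"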

SUGGESTION:

If we consider the write-to-file operations individually: each time the pipeline runs, a pipe is created and file descriptors are opened and closed, and that takes some time. Instead of using /bin/echo or other external commands, you can write your own "WriteFile" program in a higher-level language like C/C++. What you need is I/O redirection:

  1. Open file descriptor
  2. Write data
  3. Close file descriptor
  4. Optimize your code
  5. Done

    Please check system calls like write(2): ssize_t write(int fd, const void *buf, size_t count);

    http://linux.die.net/man/2/write
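
The same open-once/write-many/close-once pattern can also be sketched in the shell itself, without a separate C program, by keeping one file descriptor open across the whole loop (a rough illustration; fd 3 below is an arbitrary choice, and the loop body is the one from the question):

    # open log_file.txt once, in append mode, on fd 3
    exec 3>>log_file.txt
    while read line
    do
        db2 -x "call stored_procedure_XYZ($line)" | sed "s/^/$line/" >&3
    done < "$infile"
    # close fd 3 when the loop is finished
    exec 3>&-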

  • That's an excellent point - I've run into that before in tests. It gets to the point where only shell builtins run because the kernel can't load another executable - you end up dead in the water. – mikeserv Dec 10 '14 at 21:24
  • The argument list too long problem can easily be bypassed. It's not really worth mentioning here. Who would use /bin/echo when nearly every shell has a built-in version? – jw013 Dec 10 '14 at 21:34
  • @jw013 - how do you do it then? This isn't about /bin/echo - it's about /bin/everything - the environment gets full. If you know how to do it I'd like to know. I ran into it doing tests for this answer and took much of what I know about it from here. What else can you do when you can't exec? And what good is a shell in that case anyway? – mikeserv Dec 10 '14 at 22:25
  • @mikeserv If the goal is simply to get the contents of a shell variable into a file, you don't need to use exec() functions at all. If the shell does something natively without calling any exec function, there is no argument list at all. There is no reason to try to put this shell variable into the environment. – jw013 Dec 11 '14 at 00:40
  • @jw013 - I don't think I do misunderstand - my point is the shell's pretty bad at that stuff. That's why all of those other tools are there. It's good for the in-between things - the setup and teardown - but it's not the main-show. Its whole purpose is execing - it is only a shell for other commands. Why would you cripple it? Let it parse args - it's great at that - don't overrun its arg buffers. What does it do natively of any consequence that doesn't require an exec? – mikeserv Dec 11 '14 at 00:43
  • @mikeserv Ok you've lost me. All I thought we were trying to do is put the contents of a shell variable into a file. The answer I am commenting on is suggesting /bin/echo ${VARIABLE} >> file, and I am saying that there is no reason to use /bin/echo over a shell built-in. What are you trying to do? – jw013 Dec 11 '14 at 00:45
  • By my understanding of the question, this "answer" is mostly off-topic and irrelevant. The first sentence gives a rough "probable" answer which I already mentioned in a comment, and the rest of it proceeds to veer off-topic into explaining why a dumb command that nobody would want to run anyways may not work. – jw013 Dec 11 '14 at 00:47
  • @mikeserv Well obviously everything stops working if you exhaust the memory in the machine. That's why I was making the assumption that wouldn't happen. It is only about 10 MB (100k iterations * 100 chars) of data after all and everyone has 10 MB nowadays. The problem isn't running out of memory. And as I explained, you don't need to put this 10 MB shell variable in any argument lists or environments. So what exactly do you think you are running out of? We aren't passing this 10 MB shell variable to db2 if that is what you are thinking because the question doesn't say that at all. – jw013 Dec 11 '14 at 00:50
  • @mikeserv A single 8MB env* var* - and this is where you keep getting stuck. Who said anything about environment variables? Where in the question do you see OP trying to put this variable into the environment? Need I remind you of the difference between an unexported shell variable and an environment variable? – jw013 Dec 11 '14 at 00:55
  • @mikeserv I'm sorry, that is not the case. You can see this yourself: foo='not in the environment'; envfoo='in the environment'; export envfoo; env | grep foo= and see what you can get. Only the exported variable is shown. I've looked at your links and nothing in them contradicts this. – jw013 Dec 11 '14 at 00:59
  • @jw013 - it seems like I remember differently, but maybe I was exporting before. You're right - and this holds it up: var=$(dd bs=8M count=1 </dev/zero| tr \\0 .); cat. Thanks for keeping at it - I can be stubborn. That said, the whole 100k loop output (not to mention doing it there) in a shell variable thing is still an awful idea, I think. – mikeserv Dec 11 '14 at 01:02
  • @jw013 could you check http://unix.stackexchange.com/a/120842/12586 – Cuneyit Kiris Dec 11 '14 at 01:05
  • @jw013 I referred to the execve man page since the others have similar limitations; I didn't say it is because of an execve limitation. Hope I helped clarify my answer. – Cuneyit Kiris Dec 11 '14 at 01:09
  • @jw013 I suggested a solution to the problem (I know the suggestion is off-topic), because keeping 8 MB or whatever amount of data in a variable is not the right thing to do. You need to stream the data the moment it arrives or is generated. – Cuneyit Kiris Dec 11 '14 at 01:15
  • @CuneyitKiris That is an interesting link and quite a long one. Would you care to elaborate as to why you want me to look at it? Like I said, your answer not really address the question at all. There is nothing to put in the environment or argument list here. I don't see how your suggestion of "write a C program that writes data" is relevant or helpful. – jw013 Dec 11 '14 at 01:25
  • @jw013 You need to give the variable as a parameter to some command (it can be a built-in echo, or xargs, sed, etc.) to write a file (correct me if I'm wrong, otherwise please explain). All these commands have their own max parameter size; almost all of them allocate memory to keep the argument data. So how can streaming (which you can handle with small C code) not be the solution, instead of using such commands? – Cuneyit Kiris Dec 11 '14 at 01:38