I know there's the command split, which can split a file into several chunks based on file size, and then there's the command wc, which can count words. I just don't know how to use the two together.
6 Answers
Must it be done with wc? Because here I've run into a very nice attempt to use a regex as a csplit pattern. I don't have a system to test it right now, but the regex itself seems to do the job.
The expression looks like this:
csplit input-file.txt '/([\w.,;]+\s+){500}/'

- nice, but it's not working: bash: syntax error near unexpected token `(' – ixtmixilix Mar 01 '13 at 08:24
- @ixtmixilix You have to quote the argument for the shell. – Gilles 'SO- stop being evil' Mar 01 '13 at 23:26
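If the quoted csplit command does run on your system, you can see how many words ended up in each piece with wc, the other tool mentioned in the question. By default csplit writes its output to files named xx00, xx01, and so on; the loop below is a minimal sketch that assumes no other files in the directory match that pattern:
# report the word count of every piece csplit produced
for f in xx[0-9]*; do
    printf '%s: %s words\n' "$f" "$(wc -w < "$f")"
done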
You can also use dedicated software to split Word files. Word Files Splitter is an excellent tool to split a single Word file into multiple Word files according to the number of pages or the number of sections in the file.
Try doing this :
file=/etc/passwd
count=2
count_file_lines=$(wc -l < "$file")
split -l "$((count_file_lines / count))" "$file"
ls -ltr x*
That will divide the file into $count (2) roughly equal pieces.

- The OP wants to split by words, not lines. Also, why are you using a variable for count if it does not change? – terdon Mar 01 '13 at 11:02
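As a side note on that critique: if your split comes from GNU coreutils, the "divide into N pieces" part can be done directly with -n l/N, which also keeps lines whole. A minimal sketch (GNU-only, since -n is not in POSIX split):
file=/etc/passwd
count=2
split -n "l/$count" "$file" part.   # l/2: two pieces, with no line broken across them
wc -l part.*                        # see how the lines were distributed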
See if this helps (I would try it in a temporary directory first):
perl -e '
    undef $/;                 # slurp mode: read the whole file at once
    $file = <>;

    # peel off 500 whitespace-separated words at a time
    while ($file =~ /\G((\S+\s+){500})/gc) {
        $i++;
        open A, ">", "chunk-$i.txt";
        print A $1;
        close A;
    }

    # whatever is left (fewer than 500 words) becomes the last chunk
    $i++;
    if ($file =~ /\G(.+)\Z/sg) {
        open A, ">", "chunk-$i.txt";
        print A $1;
    }
' your_file_name_here
Edit: Code corrected and tested. Sorry for the previous mistakes.
Edit 2: This will spit out files called chunk-#.txt, starting from chunk-1.txt. Each "chunk" will contain 500 words, and the last "chunk" will contain whatever is left in the file (<= 500 words). You can customize this behavior by changing the relevant parts of the code.
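After running it, a quick check with wc (the tool from the question) confirms the chunk sizes; this assumes the chunks were written to the current directory:
wc -w chunk-*.txt        # every chunk except the last should report 500 words
ls chunk-*.txt | wc -l   # how many pieces were produced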

Unix utilities usually operate on full lines, so your best bet is to first modify your input so that it writes one word per line, like this (you might have to modify this command a bit if you have other characters in your words):
<inputfile tr -c A-Za-z0-9 \\n
Since you're only interested in the words, it might be a good idea to get rid of the blank lines by piping the output into a grep call. Here's what your full command might look like:
<inputfile tr -c A-Za-z0-9 \\n | grep -v '^$' | split -l 500
You could later join the new files to get everything back on a single line (using something like tr -d \\n), but if you're planning on doing more manipulation with Unix tools, keep them this way; split is not the only program that operates on whole lines.

- This is interesting, but completely butchers the content of the text (in this case, a novel). Ideally, I'd like to preserve the line breaks where they are. – Jonathan Mar 02 '13 at 20:49
- Yes, I thought about preserving the line breaks, and I can think of a few hacks to retrieve them after processing (like initially replacing them with a unique character and translating that unique character back to a newline at the end), but I'm afraid I don't know the proper technique to do this in a robust way that guarantees 100% success. I wish someone with more experience could chime in to solve this problem. – rahmu Mar 03 '13 at 02:57
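For what it's worth, here is a rough sketch of the placeholder hack described in that comment, assuming the character @ never occurs in the input (pick any character that is genuinely absent from your text). A word that sits right against a line break gets glued to the marker and is counted as a single token, so the 500-word boundaries are only approximate:
# protect the original line breaks, then put one word per line and split
tr '\n' '@' < inputfile | tr -c 'A-Za-z0-9@' '\n' | grep -v '^$' | split -l 500
# inside each piece, turn the per-word newlines into spaces and the
# placeholders back into the original line breaks
for f in x??; do
    tr '\n' ' ' < "$f" | tr '@' '\n' > "$f.restored"
done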
This simple command line should do the trick. It will create multiple chunks of (at most) 70 characters from the source text file:
cntr=1; sed -e 's/.\{70\}/&\n/g' source.txt | while IFS= read -r chunk; do printf '%s\n' "$chunk" > "chunk$cntr.txt"; cntr=$(( cntr + 1 )); done

- But this splits the text along characters, right? I'm looking for a solution that keeps words intact. – Jonathan Jul 30 '14 at 17:41
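A small variation that at least keeps words intact is to let fold do the wrapping: with -s it breaks lines at blanks rather than mid-word (still a character-width split like the answer above, and any single word longer than the width is still cut). A sketch:
cntr=1
fold -s -w 70 source.txt | while IFS= read -r chunk; do
    printf '%s\n' "$chunk" > "chunk$cntr.txt"
    cntr=$((cntr + 1))
done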
- The csplit command is what you are looking for, in my opinion. – ixtmixilix Mar 01 '13 at 08:25