5

I know there's the command split which can split a file into several chunks based on file size, and then there's the command wc which can count words. I just don't know how to use the two together.
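
For reference, this is roughly what I mean by the two commands on their own (the file name here is just a placeholder):

split -b 100k novel.txt    # splits by size (100 kB pieces), not by words
wc -w novel.txt            # counts the words, but does not split anything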

Jonathan
  • 1,270
  • If you need to do this on a regular basis, I would just recommend thinking in terms of lines or number of chunks. If it absolutely must be about number of words, the csplit command is what you are looking for, in my opinion. – ixtmixilix Mar 01 '13 at 08:25
  • This is for a word frequency analysis, so it's important that the files are split by words. – Jonathan Mar 02 '13 at 01:38

6 Answers

4

Must it be done with wc? Because here I've run into a very nice attempt to use a regex as a csplit pattern. I don't have a system to test it right now, but the regex itself seems to do the job.

The expression looks like this:

 csplit input-file.txt '/([\w.,;]+\s+){500}/'
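
If your csplit doesn't accept \w and \s, or if no single line of the input actually holds 500 words (csplit only splits at line boundaries), a rough awk sketch that splits purely by word count could be used instead; the chunk-N.txt names and the 500-word size are assumptions here, and line breaks inside a chunk are not preserved:

awk -v n=500 '
{
    for (i = 1; i <= NF; i++) {
        out = sprintf("chunk-%d.txt", int(w / n) + 1)   # a new file every n words
        if (out != prev) { if (prev != "") close(prev); prev = out }
        printf("%s ", $i) > out
        w++
    }
}
END { if (prev != "") print "" > prev }                 # end the last chunk with a newline
' input-file.txt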
Eugene S
  • 3,484
0

You can also use this software to split Word files. Word Files Splitter is an excellent tool to split a single Word file into multiple Word files according to the number of pages or number of sections in a Word file.

http://www.windowindia.net/word-files-splitter.html

0

Try doing this:

file=/etc/passwd
count=2
count_file_lines=$(wc -l < "$file")
split -n$((count_file_lines/count)) "$file"
ls -ltr x*

That splits the file into $((count_file_lines / count)) pieces, i.e. roughly $count (here 2) lines' worth of data per piece. Note that split -n divides by byte count, so individual lines can end up cut in two.
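
If the intention is rather a fixed number of lines per piece, split -l (which counts lines instead of bytes) may be closer to it, for example:

split -l "$count" "$file"    # $count lines per output file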

  • The OP wants to split by words, not lines. Also, why are you using a variable for count if it does not change? – terdon Mar 01 '13 at 11:02
0

See if this helps (I would try it in a temporary directory first):

perl -e '
         undef $/;                         # slurp mode: read the whole file in one go
         $file=<>;
         # grab 500 whitespace-separated words at a time and write each group to its own file
         while($file=~ /\G((\S+\s+){500})/gc)
         {
            $i++;
            open A,">","chunk-$i.txt";
            print A $1;
            close A;
         }
         # whatever is left (fewer than 500 words) goes into one final chunk
         $i++;
         if($file=~ /\G(.+)\Z/sg)
         {
          open A,">","chunk-$i.txt";
          print A $1;
         }
        '  your_file_name_here

Edit: Code corrected and tested. Sorry for the previous mistakes.

Edit 2: This will spit out files called chunk-#.txt starting from chunk-1.txt. Each "chunk" will contain 500 words and the last "chunk" will contain whatever is left in the file (<=500 words). You can customize this behavior by changing the relevant parts in the code.
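
To double-check the result afterwards with the wc from the question (chunk-*.txt being the naming scheme described above):

wc -w chunk-*.txt    # every chunk except the last should report 500 words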

Joseph R.
  • 39,549
0

Unix utilities usually operate on full lines, so your best bet is to first reformat your input so that it has one word per line, like this (you might have to adjust this command a bit if your words contain other characters):

<inputfile tr -c A-Za-z0-9 \\n

Since you're only interested in the words, it might be a good idea to get rid of the blank lines by piping the output into a grep call. Here's what your full command might look like:

<inputfile tr -c A-Za-z0-9 \\n | grep -v '^$' | split -l 500

You could later join the new files to get everything back on a single line (using something like tr '\n' ' '), but if you're planning on doing more manipulation with Unix tools, keep them this way: split is not the only program that operates on whole lines.
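
For instance, assuming GNU split's default output names (xaa, xab, …), checking the piece sizes and rejoining one piece could look like this:

wc -w x??           # every piece except the last should report 500 words
<xaa tr '\n' ' '    # one piece back as a single space-separated line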

rahmu
  • 20,023
  • This is interesting, but completely butchers the content of the text (in this case, a novel). Ideally, I'd like to preserve the line breaks where they are. – Jonathan Mar 02 '13 at 20:49
  • Yes, I thought about preserving the line breaks, and I can think of a few hacks to retrieve them after processing (like initially replacing them with a unique character and translating that unique character back to a newline at the end), but I'm afraid I don't know the proper technique to do this in a robust way that guarantees 100% success. I wish someone with more experience could chime in to solve this problem. – rahmu Mar 03 '13 at 02:57
-1

This simple command line should do the trick. It will create multiple chunks of 70 characters each from the source text file.

cntr=1; sed -e 's/.\{70\}/&\n/g' source.txt | while IFS= read -r chunk; do printf '%s\n' "$chunk" > "chunk$cntr.txt"; cntr=$((cntr + 1)); done
Jay
  • 121