5

I know there's the command split which can split a file into several chunks based on file size, and then there's the command wc which can count words. I just don't know how to use the two together.
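
For reference, this is roughly what I mean by the two commands on their own (the file name here is just a placeholder):

split -b 100k novel.txt    # splits by size (100 kB pieces), not by words
wc -w novel.txt            # counts the words, but does not split anything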

Jonathan
  • 1,270
  • If you need to do this on a regular basis, I would just recommend thinking in terms of lines or number of chunks. If it absolutely must be about number of words, the csplit command is what you are looking for, in my opinion. – ixtmixilix Mar 01 '13 at 08:25
  • This is for a word frequency analysis, so it's important that the files are split by words. – Jonathan Mar 02 '13 at 01:38

6 Answers

4

Must it be done with wc? Because here I've run into a very nice attempt to use a regex as a csplit pattern. I don't have a system to test it right now, but the regex itself seems to do the job.

The expression looks like this:

 csplit input-file.txt '/([\w.,;]+\s+){500}/'
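
If your csplit doesn't accept \w and \s, or if no single line of the input actually holds 500 words (csplit only splits at line boundaries), a rough awk sketch that splits purely by word count could be used instead; the chunk-N.txt names and the 500-word size are assumptions here, and line breaks inside a chunk are not preserved:

awk -v n=500 '
{
    for (i = 1; i <= NF; i++) {
        out = sprintf("chunk-%d.txt", int(w / n) + 1)   # a new file every n words
        if (out != prev) { if (prev != "") close(prev); prev = out }
        printf("%s ", $i) > out
        w++
    }
}
END { if (prev != "") print "" > prev }                 # end the last chunk with a newline
' input-file.txt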
Eugene S
  • 3,484
0

You can also use this software to split Word files. Word Files Splitter is an excellent tool to split a single Word file into multiple Word files according to the number of pages or number of sections in a Word file.

http://www.windowindia.net/word-files-splitter.html

0

Try doing this:

file=/etc/passwd
count=2
count_file_lines=$(wc -l < "$file")
split -n$((count_file_lines/count)) "$file"
ls -ltr x*

That splits the file into $((count_file_lines / count)) pieces, i.e. roughly $count (here 2) lines' worth of data per piece. Note that split -n divides by byte count, so individual lines can end up cut in two.
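
If the intention is rather a fixed number of lines per piece, split -l (which counts lines instead of bytes) may be closer to it, for example:

split -l "$count" "$file"    # $count lines per output file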

  • The OP wants to split by words, not lines. Also, why are you using a variable for count if it does not change? – terdon Mar 01 '13 at 11:02
0

See if this helps (I would try it in a temporary directory first):

perl -e '
         undef $/;                         # slurp mode: read the whole file in one go
         $file=<>;
         # grab 500 whitespace-separated words at a time and write each group to its own file
         while($file=~ /\G((\S+\s+){500})/gc)
         {
            $i++;
            open A,">","chunk-$i.txt";
            print A $1;
            close A;
         }
         # whatever is left (fewer than 500 words) goes into one final chunk
         $i++;
         if($file=~ /\G(.+)\Z/sg)
         {
          open A,">","chunk-$i.txt";
          print A $1;
         }
        '  your_file_name_here

Edit: Code corrected and tested. Sorry for the previous mistakes.

Edit 2: This will spit out files called chunk-#.txt starting from chunk-1.txt. Each "chunk" will contain 500 words and the last "chunk" will contain whatever is left in the file (<=500 words). You can customize this behavior by changing the relevant parts in the code.
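
To double-check the result afterwards with the wc from the question (chunk-*.txt being the naming scheme described above):

wc -w chunk-*.txt    # every chunk except the last should report 500 words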

Joseph R.
  • 39,549
0

Unix utilities usually operate on full lines, so your best bet is to first reformat your input so that it has one word per line, like this (you might have to adjust this command a bit if your words contain other characters):

<inputfile tr -c A-Za-z0-9 \\n

Since you're only interested in the words, it might be a good idea to get rid of the blank lines by piping the output into a grep call. Here's what your full command might look like:

<inputfile tr -c A-Za-z0-9 \\n | grep -v '^$' | split -l 500

You could later join the new files to get everything back on a single line (using something like tr '\n' ' '), but if you're planning on doing more manipulation with Unix tools, keep them this way: split is not the only program that operates on whole lines.
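
For instance, assuming GNU split's default output names (xaa, xab, …), checking the piece sizes and rejoining one piece could look like this:

wc -w x??           # every piece except the last should report 500 words
<xaa tr '\n' ' '    # one piece back as a single space-separated line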

rahmu
  • 20,023
  • This is interesting, but completely butchers the content of the text (in this case, a novel). Ideally, I'd like to preserve the line breaks where they are. – Jonathan Mar 02 '13 at 20:49
  • Yes, I thought about preserving the line breaks, and I can think of a few hacks to retrieve them after processing (like initially replacing them with a unique character and translating that unique character back to a newline at the end), but I'm afraid I don't know the proper technique to do this in a robust way that guarantees 100% success. I wish someone with more experience could chime in to solve this problem. – rahmu Mar 03 '13 at 02:57
-1

This simple command line should do the trick. It will create multiple chunks of 70 characters each from the source text file.

cntr=1; sed -e 's/.\{70\}/&\n/g' source.txt | while IFS= read -r chunk; do printf '%s\n' "$chunk" > "chunk$cntr.txt"; cntr=$((cntr + 1)); done
Jay
  • 121