12

Related, but no satisfactory answers: How can I split a large text file into chunks of 500 words or so?

I'm trying to take a text file (http://mattmahoney.net/dc/text8.zip) with > 10^7 words all in one line, and split it into lines with N words each. My current approach (a shell script) works, but it's fairly slow and ugly:

i=0
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
    echo -n "${word} " >> output.txt
    let "i=i+1"

    if [ "$i" -eq "1000" ]
    then
        echo >> output.txt
        let "i=0"
    fi
done

Any tips on how I can make this faster or more compact?

  • if you want it faster, you need to use something else than a bash script. I would recommend some C. It can fit in a few lines. – Jakuje Sep 04 '15 at 19:36

8 Answers

15

Use xargs (17 seconds):

xargs -n1000 <file >output

It uses the -n flag of xargs, which defines the maximum number of arguments per output line. Just change 1000 to 500 or whatever limit you want.
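For example, for 500-word lines (one caveat: xargs performs its own quote processing, so this assumes the input contains no stray single quotes, double quotes, or backslashes):

xargs -n500 <file >output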

I made a test file with 10^7 words:

$ wc -w file
10000000 file

Here are the time stats:

$ time xargs -n1000 <file >output
real    0m16.677s
user    0m1.084s
sys     0m0.744s
chaos
  • 48,171
7

Perl seems quite astonishingly good at this:

Create a file with 10,000,000 space-separated words:

for ((i=1; i<=10000000; i++)); do printf "%s " $RANDOM ; done > one.line

Now, use perl to add a newline after every 1,000 words:

time perl -pe '
    s{ 
        (?:\S+\s+){999} \S+   # 1000 words
        \K                    # then reset start of match
        \s+                   # and the next bit of whitespace
    }
    {\n}gx                    # replace whitespace with newline
' one.line > many.line

Timing

real    0m1.074s
user    0m0.996s
sys     0m0.076s

Verify the results:

$ wc one.line many.line
        0  10000000  56608931 one.line
    10000  10000000  56608931 many.line
    10000  20000000 113217862 total

The accepted awk solution took just over 5 sec on my input file.

glenn jackman
  • 85,964
5

Assuming your definition of a word is a sequence of non-blank characters separated by blanks, here's an awk solution for your single-line file:

awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, (i % 500 ? " " : "\n"); if (NF % 500) print ""}' file
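The same logic spelled out, as a readable sketch (any POSIX awk should handle it):

awk '{
    for (i = 1; i <= NF; ++i)
        printf "%s%s", $i, (i % 500 ? " " : "\n")  # space inside a group, newline after every 500th word
    if (NF % 500)
        print ""                                   # terminate a short final line
}' file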
iruvar
  • 16,725
5

Not really suitable when the number of words per line is large, but for small counts (and ideally with no leading/trailing spaces in your one-line file) this should be quite fast (e.g. 5 words per line):

tr -s '[:blank:]' '\n' <input.txt | paste -d' ' - - - - - >output.txt
don_crissti
  • 82,805
  • 1
  • This is perfectly fine with large numbers as well, and blindingly fast. Just generate the paste string on the fly. For example: tr -s '[:blank:]' '\n' < text8 | paste -d' ' $(perl -le 'print "- " x 1000') (see the sketch after these comments) – terdon Sep 06 '15 at 11:16
  • @terdon - true, though for large numbers one has to build up the command arguments, e.g. as you did or via set etc... and even then there's a system-specific maximum number of arguments (I'm not familiar with all flavors of paste, but I think some implementations have limits on the number of args/input files and/or the output line length...) – don_crissti Sep 07 '15 at 18:54
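Following up on the comments above, a sketch of building the dash list for paste without perl (assuming bash and coreutils seq; the %.0s format consumes each argument from seq while printing nothing, leaving one "- " per word):

n=1000
dashes=$(printf -- '- %.0s' $(seq "$n"))
tr -s '[:blank:]' '\n' <input.txt | paste -d' ' $dashes >output.txt

Leaving $dashes unquoted is deliberate: word splitting turns it into n separate - operands for paste.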
4

Your sed command can be simplified by specifying how many word-space patterns to match in one go. I didn't have any big files to test it on, but without the loop in your original script this should run as fast as your processor can stream the data. As an added benefit, it works equally well on multi-line files.

n=500; sed -r "s/((\w+\s){$n})/\1\n/g" <input.txt >output.txt
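A quick sanity check on the output (an illustrative sketch: it prints the distinct per-line word counts, which should be just $n, plus possibly one smaller value for a short final line):

awk '{print NF}' output.txt | sort -nu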
3

The venerable fmt(1) command, while not strictly operating on "a particular number of words", can fairly quickly wrap long lines to a particular goal (or maximum) width:

perl -e 'for (1..100) { print "a"x int 3+rand(7), " " }' | fmt
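GNU fmt defaults to a maximum width of 75 columns; a different width can be requested with -w, e.g.:

perl -e 'for (1..100) { print "a"x int 3+rand(7), " " }' | fmt -w 60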

Or with modern perl, for a specific number of words, say, 10, and assuming a single space as the word boundary:

... | perl -ple 's/(.*? ){10}\K/\n/g'
thrig
  • 34,938
2

The coreutils pr command is another candidate: the only wrinkle seems to be that it is necessary to force the page width to be large enough to accommodate the output width.

Using a file created with @Glenn_Jackman's 10,000,000-word generator,

$ time tr '[:blank:]' '\n' < one.line | pr -s' ' -W 1000000 -JaT -1000 > many.line

real    0m2.113s
user    0m2.086s
sys     0m0.411s

where the counts are confirmed as follows

$ wc one.line many.line
        0  10000000  56608795 one.line
    10000  10000000  56608795 many.line
    10000  20000000 113217590 total

[Glenn's perl solution is still a little faster, ~1.8s on this machine].
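For reference, the flags used above, as I read the GNU pr manual (worth double-checking against your implementation): -1000 requests 1000 columns, -a fills them across rather than down, -T omits page headers and form feeds, -J merges full lines so they are not truncated to the column width, -s' ' separates columns with a single space, and -W 1000000 forces the page wide enough to hold all the columns.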

steeldriver
  • 81,074
1

In Go, I would try it like this:

//wordsplit.go

//$ go run wordsplit.go bigtext.txt

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "strings"
)

func main() {
    // os.Args[0] is the program itself; the input file is the first argument
    myfile, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer myfile.Close()
    data, err := ioutil.ReadAll(myfile)
    if err != nil {
        log.Fatal(err)
    }
    // Fields splits on any run of whitespace and produces no empty words
    words := strings.Fields(string(data))
    newfile, err := os.Create("output.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer newfile.Close()
    // write 10 words per line; the final line may be shorter
    for i := 0; i < len(words); i += 10 {
        end := i + 10
        if end > len(words) {
            end = len(words)
        }
        newfile.WriteString(strings.Join(words[i:end], " ") + "\n")
    }
    fmt.Printf("Formatted %s into 10 word lines in output.txt\n", os.Args[1])
}
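Note that ioutil.ReadAll pulls the whole file (about 57 MB here) into memory at once; for substantially larger inputs, a bufio.Scanner with bufio.ScanWords would let the program stream one word at a time instead.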