Split: how to split into different percentages?

Question

How can I split a text file into two parts, 70% and 30%, using the split command?

Are you wedded to using the split command? If not, you can easily do this with straight text manipulation, certainly using perl or python. As long as the file is not too large, read it into memory as a string, then split the string. If the file is too big, then more work is needed. – Faheem Mitha, Mar 28 '11 at 09:20
@Faheem Mitha The file is 64MB. I like the idea of using split because it is faster than writing code. I was wondering: if I specify the number of lines corresponding to 70% of the file, I get a big file and a small file. Shouldn't that work? – aneuryzm, Mar 28 '11 at 09:23
Please share your answer. (http://meta.stackexchange.com/questions/12513/should-i-not-answer-my-own-questions) – dogbane, Mar 28 '11 at 09:48

score 18 · Accepted Answer · answered Mar 28 '11 at 10:01

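A quick and dirty approach: the commands below work for percentages above 50%, as long as you only want two files. split cuts the input into chunks of the given size, so a chunk larger than half the file leaves exactly one smaller remainder (by default xaa holds the first 70% and xab the rest).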

1) split 70% based on lines

split -l $(( $(wc -l filename|cut -d" " -f1) * 70 / 100 )) filename

2) split 70% based on bytes

split -b $(( $(wc -c filename|cut -d" " -f1) * 70 / 100 )) filename
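To see why the percentage has to be above 50%, here is a minimal sketch on a 10-line sample file. The scratch directory, the sample data, and the seventy_ and thirty_ output prefixes are assumptions made for illustration; wc reads from stdin here so that it prints only the count.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 10 > filename    # hypothetical 10-line sample file

# 70%: the chunk size is 7 lines, so split writes exactly two files (7 + 3).
split -l $(( $(wc -l < filename) * 70 / 100 )) filename seventy_

# 30%: the chunk size is 3 lines, so split writes four files (3 + 3 + 3 + 1).
split -l $(( $(wc -l < filename) * 30 / 100 )) filename thirty_

wc -l seventy_* thirty_*
```

For a percentage below 50% you would have to concatenate the trailing pieces yourself, or use one of the other approaches in this thread.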

forcefsck

On Mac OS X, wc sometimes prints the line count with leading spaces, which breaks the cut step in this command. Piping through xargs first strips those spaces and makes things work again:
split -l $(( $(wc -l filename | xargs | cut -d" " -f1) * 70 / 100 )) filename
– Emil Stenström Jan 22 '18 at 09:19

don_crissti · Answer 2 · 2018-03-28T14:36:00.830

You could use csplit to split into two pieces (using any percentage), e.g. first piece: the first 20% of lines, second piece: the remaining 80% of lines:

csplit infile $(( $(wc -l < infile) * 2 / 10 + 1))

$(wc -l < infile) : total number of lines
2 / 10 : percentage
+1 : add one line because csplit splits up to but not including line N
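As a rough illustration of that arithmetic, the sketch below runs the same csplit call on a 10-line sample. The scratch directory, the sample data, and the part_ prefix are assumptions; -s (suppress the byte counts) and -f (set the output prefix) are standard csplit options, and without -f the pieces would be named xx00 and xx01.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 10 > infile    # hypothetical 10-line sample file

# 10 * 2 / 10 + 1 = 3, so csplit cuts just before line 3:
# part_00 gets lines 1-2 (20%), part_01 gets lines 3-10 (80%).
csplit -s -f part_ infile $(( $(wc -l < infile) * 2 / 10 + 1 ))

wc -l part_00 part_01
```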

You can only split based on lines though.
Basically, as long as you have the line number via $(( $(wc -l < file) * 2 / 10)) you can use any line-oriented tool:

sed 1,$(( $(wc -l < infile) * 2 / 10))'{
w 20-infile
d
}' infile > 80-infile

or, even cooler:

{ head -n$(( $(wc -l < infile) * 2 / 10)) > 20-infile; cat > 80-infile; } <infile

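though this relies on head leaving the file offset just past the last line it printed, as the standard requires for seekable input. Some heads are dumb, read ahead without repositioning, and leave cat starting too late, so this won't work on all setups...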
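One way to find out whether a given head behaves is to compare what cat receives with what tail -n +N prints for the same file. This is only a sketch under assumed names (the scratch directory and the 1000-line sample are made up), not a definitive portability test:

```shell
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/infile"    # hypothetical sample file

# Number of lines that belong in the first (20%) piece.
n=$(( $(wc -l < "$workdir/infile") * 2 / 10 ))

# The grouped form: head and cat share one open file description.
{ head -n "$n" > "$workdir/20-infile"; cat > "$workdir/80-infile"; } < "$workdir/infile"

# Reference result: everything from line n+1 onward, read independently.
if tail -n +"$(( n + 1 ))" "$workdir/infile" | cmp -s - "$workdir/80-infile"; then
    echo "head left the offset after line $n; the grouped form is safe here"
else
    echo "head read ahead; use head plus tail -n +$(( n + 1 )) instead"
fi
```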

score 2 · Answer 3 · answered Dec 21 '14 at 20:39

{   BS=$(($(wc -c <file) * $P / 100))
    dd count=1 bs="$BS" >file1; cat
} <file >file2 2>/dev/null

...should work for this simple case ($P is the percentage you want in file1): you're only splitting once, so split is probably a little overkill. So long as the file is seekable, dd will do only a single read() on its stdin, and so cat is left to begin its read() at whatever point dd leaves off.

If the file is large, a single count=1 bs=$big_ol_num could get a little unwieldy, since dd has to buffer the whole block at once; it can be broken into smaller blocks with some extra, yet simple, shell math.

A non-seekable input, like a pipe, might skew dd's results because a read() can return fewer bytes than requested, though this can be handled as well with GNU dd's iflag=fullblock.
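The "extra shell math" might look something like the sketch below: copy whole blocks of a moderate size first, then one final short block for the remainder. The 64 KiB block size, the value of P, and the sample file are assumptions chosen for illustration, and the input is assumed to be a regular (seekable) file so that each read() returns a full block.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 100000 > file    # hypothetical sample file
P=70                   # percentage of bytes wanted in file1 (assumed value)
bs=65536               # assumed block size: 64 KiB

want=$(( $(wc -c < file) * P / 100 ))    # bytes that belong in file1

{
    # Whole blocks first...
    dd bs="$bs" count=$(( want / bs )) 2>/dev/null
    # ...then a single short block for whatever is left over.
    rem=$(( want % bs ))
    if [ "$rem" -gt 0 ]; then
        dd bs="$rem" count=1 2>/dev/null
    fi
    # cat picks up where dd stopped and writes the rest to file2.
    cat > file2
} < file > file1
```

Both dd calls write to the group's stdout, so they share one file offset and the second simply continues where the first stopped.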

score 0 · Answer 4 · answered Mar 28 '18 at 13:22

The following code using head and tail works with any ratio (40 to 60 in this case):

FILE_NAME=train.vw
# Lines in the first (40%) part; wc reads stdin so it prints only the count
N=$(( $(wc -l < "$FILE_NAME") * 40 / 100 ))
head -n "$N" "$FILE_NAME" > train_40.vw
tail -n +"$(( N + 1 ))" "$FILE_NAME" > train_60.vw
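If you need this for several ratios, the same head and tail idea can be wrapped in a small function. The name split_pct and the .head/.tail output suffixes are hypothetical, and the scratch file below is sample data; note that the integer division truncates, so the first part is rounded down.

```shell
# split_pct FILE PCT (hypothetical helper): writes the first PCT percent of
# lines to FILE.head and the remaining lines to FILE.tail.
split_pct() {
    file=$1
    pct=$2
    total=$(wc -l < "$file")
    n=$(( total * pct / 100 ))    # multiply first; integer division truncates
    head -n "$n" "$file" > "$file.head"
    tail -n +"$(( n + 1 ))" "$file" > "$file.tail"
}

workdir=$(mktemp -d)
seq 1 10 > "$workdir/train.vw"    # hypothetical 10-line sample file
split_pct "$workdir/train.vw" 40
wc -l "$workdir/train.vw.head" "$workdir/train.vw.tail"
```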
