How can I split a text file into 70% and 30% using the split command?
Are you wedded to using the split command? If not, you can easily do this with straight text manipulation, certainly using perl or python. As long as the file is not too large, read it into memory as a string, then split the string. If the file is too big, then more work is needed. – Faheem Mitha Mar 28 '11 at 09:20
@Faheem Mitha The file is 64MB. I like the idea of using split because it is faster than writing code. I am wondering now: if I specify the number of lines corresponding to 70% of the file, I should get one big file and one small file. Shouldn't that work? – aneuryzm Mar 28 '11 at 09:23
And yes, it worked. Should I delete the question? – aneuryzm Mar 28 '11 at 09:29
Up to you, but not necessary. – Faheem Mitha Mar 28 '11 at 09:36
Please share your answer. (http://meta.stackexchange.com/questions/12513/should-i-not-answer-my-own-questions) – dogbane Mar 28 '11 at 09:48
4 Answers
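A quick and dirty approach: the commands below work for percentages above 50% if you want to end up with only two files. split keeps writing chunks of the requested size until the input is exhausted, so a chunk size of 50% or less produces more than two files.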
1) split 70% based on lines
split -l $[ $(wc -l filename|cut -d" " -f1) * 70 / 100 ] filename
2) split 70% based on bytes
split -b $[ $(wc -c filename|cut -d" " -f1) * 70 / 100 ] filename
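To see the line-based variant in action, here is a minimal sketch on a throwaway 10-line file. The scratch directory, the sample file and the `piece_` prefix are made up for illustration and are not part of the answer; the arithmetic uses `$(( ))` and `wc -l < file`, which prints only the count.

```shell
# Illustrative sketch: work in a scratch directory on a 10-line sample file.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/sample.txt"

# 70% of the line count; this becomes the size of every chunk split writes.
n=$(( $(wc -l < "$workdir/sample.txt") * 70 / 100 ))

# "piece_" is an assumed output prefix; split appends aa, ab, ...
split -l "$n" "$workdir/sample.txt" "$workdir/piece_"

# Show how many lines landed in each piece.
wc -l "$workdir"/piece_*
```

With a percentage of 50 or less the same command would leave more than two `piece_` files behind, which is why the answer restricts itself to values above 50%.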

On Mac OS X, wc sometimes prints the number of lines with leading spaces, which breaks this script. Piping through xargs first removes those spaces and makes things work again:
split -l $[ $(wc -l filename | xargs | cut -d" " -f1) * 70 / 100 ] filename
– Emil Stenström Jan 22 '18 at 09:19
You could use csplit to split into two pieces (using any percentage), e.g. first piece: the first 20% of lines; second piece: the remaining 80% of lines:
csplit infile $(( $(wc -l < infile) * 2 / 10 + 1))
$(wc -l < infile): total number of lines
2 / 10: percentage
+1: add one line because csplit splits up to but not including line N
You can only split based on lines though.
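As a quick check of the `+1`, here is a small sketch on a 10-line sample file. The scratch directory and the `part` prefix are assumptions for illustration; `-s` only silences the byte counts csplit normally prints, and `-f` sets the output prefix.

```shell
# Illustrative sketch: a 10-line sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/infile"

# 10 * 2 / 10 + 1 = 3: csplit cuts just before line 3,
# so the first piece holds lines 1-2 (20%) and the second lines 3-10 (80%).
at=$(( $(wc -l < "$workdir/infile") * 2 / 10 + 1 ))

# "part" is an assumed prefix; csplit appends 00, 01, ...
csplit -s -f "$workdir/part" "$workdir/infile" "$at"

# Show how many lines landed in each piece.
wc -l "$workdir"/part*
```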
Basically, as long as you have the line number via $(( $(wc -l < file) * 2 / 10)) you can use any line-oriented tool:
sed 1,$(( $(wc -l < infile) * 2 / 10))'{
w 20-infile
d
}' infile > 80-infile
or, even cooler:
{ head -n$(( $(wc -l < infile) * 2 / 10)) > 20-infile; cat > 80-infile; } <infile
though some head implementations are dumb and won't comply with the standards, so this won't work on all setups...

P=70  # percentage of bytes that goes into file1 (example value)
{ BS=$(($(wc -c <file) * $P / 100))
dd count=1 bs="$BS" >file1; cat
} <file >file2 2>/dev/null
...should work for this simple case because you're only splitting once, and so split is probably a little overkill. So long as the file is seekable, dd will only do a single read() on <stdin, and so cat is left to begin its read() at whatever point dd leaves it.
If the file is large then a count=1 bs=$big_ol_num could get a little unwieldy, and it can be blocked out with some extra, yet simple, shell math.
A non-seekable input, like from a pipe, might skew dd's results, though this can be handled as well with GNU dd's iflag=fullblock.
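One way the "blocked out" variant might look is sketched below. It is only an interpretation of the remark above, not the answer author's code: the 4096-byte block size, the variable names and the sample file are all assumptions. The first dd copies as many whole blocks as fit, a second dd copies the remainder, and cat picks up wherever dd stopped. It relies on the input being a regular file.

```shell
# Illustrative sketch: build a 10000-byte sample file (1000 lines of 10 bytes).
workdir=$(mktemp -d)
src="$workdir/file"
i=0
while [ "$i" -lt 1000 ]; do printf '%09d\n' "$i"; i=$(( i + 1 )); done > "$src"

P=70      # percentage for the first piece (example value)
blk=4096  # assumed block size
bytes=$(( $(wc -c < "$src") * P / 100 ))

{
  # Whole blocks first ...
  dd bs="$blk" count="$(( bytes / blk ))" 2>/dev/null
  # ... then the leftover bytes, if any, as one short block.
  if [ "$(( bytes % blk ))" -gt 0 ]; then
    dd bs="$(( bytes % blk ))" count=1 2>/dev/null
  fi
  # cat continues from the offset where dd stopped reading.
  cat > "$workdir/file2"
} < "$src" > "$workdir/file1"

wc -c "$workdir/file1" "$workdir/file2"
```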

The following code using head and tail works with any ratio (40 to 60 in this case):
FILE_NAME=train.vw
# Number of lines in the first (40%) piece; reading via stdin makes wc print only the count.
N=$(( $(wc -l < "$FILE_NAME") * 40 / 100 ))
head -n "$N" "$FILE_NAME" > train_40.vw
tail -n +"$(( N + 1 ))" "$FILE_NAME" > train_60.vw
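If you need this more than once, the same head/tail idea could be wrapped in a small function. This is just a sketch: the name `split_by_percent`, its argument order and the sample data are invented for illustration. The final cmp checks that the two pieces concatenate back to the original.

```shell
# split_by_percent FILE PCT FIRST REST  (hypothetical helper, not from the answer)
# Writes the first PCT% of FILE's lines to FIRST and the remainder to REST.
split_by_percent() {
  total=$(wc -l < "$1")
  n=$(( total * $2 / 100 ))
  head -n "$n" "$1" > "$3"
  tail -n +"$(( n + 1 ))" "$1" > "$4"
}

# Illustrative usage on a 10-line sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/train.vw"
split_by_percent "$workdir/train.vw" 40 "$workdir/train_40.vw" "$workdir/train_60.vw"

# Sanity check: the two pieces together must reproduce the input.
cat "$workdir/train_40.vw" "$workdir/train_60.vw" | cmp -s - "$workdir/train.vw" \
  && echo "pieces reassemble to the original"
```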
