This script will split a text file into a given number of sections, without breaking any text line across sections. It is intended for the case where there is only enough free space to hold one section at a time. It works by copying sections of the source file starting from the end, truncating the source after each copy to free up space. So if you have a 1.8GB file and 0.5GB of free space, you would need at least 4 sections (more if you want smaller output files). The final remaining piece (section 1) is simply renamed, as there is no need to copy it. After splitting, the source file no longer exists (there would be no room for it anyway).
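The core idea can be sketched by hand for a single section (the file name and the 1350000000-byte offset here are hypothetical, chosen for the 1.8GB example above): copy the tail out with dd, then cut the source back with truncate so the freed space can hold the next copy.
$ dd bs=128M if=BigFile skip=1350000000 iflag=skip_bytes of=BigFile.004 status=none
$ truncate -s 1350000000 BigFile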
The main part is an awk script (wrapped in Bash), which itself only works out the section sizes (adjusting each section boundary so that it coincides with a newline). It uses the system() function to invoke dd, truncate and mv for all the heavy lifting.
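The wrapping pattern itself is just awk composing a command string and shelling out; the \042 octal escape is simply a literal double quote. A stripped-down illustration, printing the command instead of executing it (the file name and size are made up):
$ awk 'BEGIN { cmd = sprintf ("truncate -s %d \042%s\042", 1000000, "file.txt"); print cmd }'
truncate -s 1000000 "file.txt"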
$ bash --version
GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
$ dd --version
dd (coreutils) 8.28
$ truncate --version
truncate (GNU coreutils) 8.28
The script takes between one and four arguments:
./splitBig Source nSect Dest Debug
Source: is the filename of the file to be split into sections.
nSect: is the number of sections required (default 10).
Dest: is a printf() format used to generate the names of the sections.
Default is Source.%.3d, which appends serial numbers (from .001 up) to the source name.
Section numbers correspond to the original order of the source file.
Debug: generates some diagnostics (default is none).
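For example (illustrative invocations, not taken from the test run below):
$ ./splitBig app.log
would produce app.log.001 through app.log.010 using the defaults, while
$ ./splitBig app.log 4 "app.%d.part"
would produce app.1.part through app.4.part.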
Test Results:
$ mkdir TestDir
$ cd TestDir
$
$ cp /home/paul/leipzig1M.txt ./
$ ls -s -l
total 126608
126608 -rw-rw-r-- 1 paul paul 129644797 Aug 27 15:54 leipzig1M.txt
$
$ time ../splitBig leipzig1M.txt 5
real 0m0.780s
user 0m0.045s
sys 0m0.727s
$ ls -s -l
total 126620
25324 -rw-rw-r-- 1 paul paul 25928991 Aug 27 15:56 leipzig1M.txt.001
25324 -rw-rw-r-- 1 paul paul 25929019 Aug 27 15:56 leipzig1M.txt.002
25324 -rw-rw-r-- 1 paul paul 25928954 Aug 27 15:56 leipzig1M.txt.003
25324 -rw-rw-r-- 1 paul paul 25928977 Aug 27 15:56 leipzig1M.txt.004
25324 -rw-rw-r-- 1 paul paul 25928856 Aug 27 15:56 leipzig1M.txt.005
$
$ rm lei*
$ cp /home/paul/leipzig1M.txt ./
$ ls -s -l
total 126608
126608 -rw-rw-r-- 1 paul paul 129644797 Aug 27 15:57 leipzig1M.txt
$ time ../splitBig leipzig1M.txt 3 "Tuesday.%1d.log" 1
.... Section 3 ....
#.. findNl: dd bs=8192 count=1 if="leipzig1M.txt" skip=86429864 iflag=skip_bytes status=none
#.. system: dd bs=128M if="leipzig1M.txt" skip=86430023 iflag=skip_bytes of="Tuesday.3.log" status=none
#.. system: truncate -s 86430023 "leipzig1M.txt"
.... Section 2 ....
#.. findNl: dd bs=8192 count=1 if="leipzig1M.txt" skip=43214932 iflag=skip_bytes status=none
#.. system: dd bs=128M if="leipzig1M.txt" skip=43214997 iflag=skip_bytes of="Tuesday.2.log" status=none
#.. system: truncate -s 43214997 "leipzig1M.txt"
.... Section 1 ....
#.. system: mv "leipzig1M.txt" "Tuesday.1.log"
real 0m0.628s
user 0m0.025s
sys 0m0.591s
$ ls -s -l
total 126612
42204 -rw-rw-r-- 1 paul paul 43214997 Aug 27 15:58 Tuesday.1.log
42204 -rw-rw-r-- 1 paul paul 43215026 Aug 27 15:58 Tuesday.2.log
42204 -rw-rw-r-- 1 paul paul 43214774 Aug 27 15:58 Tuesday.3.log
$
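As a quick sanity check (not part of the original test output), the three section sizes sum back to the source size: 43214997 + 43215026 + 43214774 = 129644797 bytes, so re-concatenating the sections should reproduce the original byte count:
$ cat Tuesday.?.log | wc -c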
Script:
#! /bin/bash --
export LC_ALL="C"
splitFile () { #:: (inFile, Pieces, outFmt, Debug)
local inFile="${1}" Pieces="${2}" outFmt="${3}" Debug="${4}"
local Awk='
BEGIN {
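# SQ is a literal double quote (octal 042) for quoting filenames in commands;
# szLine is the newline-probe read size, szFile the dd copy block size.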
SQ = "\042"; szLine = 8192; szFile = "128M";
fmtLine = "dd bs=%d count=1 if=%s skip=%d iflag=skip_bytes status=none";
fmtFile = "dd bs=%s if=%s skip=%d iflag=skip_bytes of=%s status=none";
fmtClip = "truncate -s %d %s";
fmtName = "mv %s %s";
}
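# findNl: dd a small probe block at byte offset Seek, and return the offset
# just past the first newline found there; if no newline turns up within
# szLine bytes, warn that the line will be split and return Seek unchanged.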
function findNl (fIn, Seek, Local, cmd, lth, txt) {
cmd = sprintf (fmtLine, szLine, SQ fIn SQ, Seek);
if (Db) printf ("#.. findNl: %s\n", cmd);
cmd | getline txt; close (cmd);
lth = length (txt);
if (lth == szLine) printf ("#### Line at %d will be split\n", Seek);
return ((lth == szLine) ? Seek : Seek + lth + 1);
}
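# Split: work backwards from the last section. Copy each tail section out
# with dd, truncate the source to release the space, and finally rename
# the remaining front part as section 1.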
function Split (fIn, Size, Pieces, fmtOut, Local, n, seek, cmd) {
for (n = Pieces; n > 1; n--) {
if (Db) printf (".... Section %3d ....\n", n);
seek = int (Size * ((n - 1) / Pieces));
seek = findNl( fIn, seek);
cmd = sprintf (fmtFile, szFile, SQ fIn SQ, seek,
SQ sprintf (fmtOut, n) SQ);
if (Db) printf ("#.. system: %s\n", cmd);
system (cmd);
cmd = sprintf (fmtClip, seek, SQ fIn SQ);
if (Db) printf ("#.. system: %s\n", cmd);
system (cmd);
}
if (Db) printf (".... Section %3d ....\n", n);
cmd = sprintf (fmtName, SQ fIn SQ, SQ sprintf (fmtOut, n) SQ);
if (Db) printf ("#.. system: %s\n", cmd);
system (cmd);
}
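# The only input line is the size of inFile, piped in from stat(1).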
{ Split( inFile, $1, Pieces, outFmt); }
'
stat -L -c "%s" "${inFile}" | awk -v inFile="${inFile}" \
    -v Pieces="${Pieces}" -v outFmt="${outFmt}" \
    -v Db="${Debug}" -f <( printf '%s' "${Awk}" )
}
#### Script body starts here.
splitFile "${1}" "${2:-10}" "${3:-${1}.%.3d}" "${4}"
Comments (from the question thread):
Set up logrotate and apply it to the existing file, too. This would then prevent the same scenario in future and allow compressing older logs. However I am not sure how the initial splitting would be done regarding your disk size limitation. – FelixJN Aug 18 '23 at 13:05
grep -Ei "string_to_search|string_to_not_skip". If you need to see lines before or after the found lines, you'll want the -A and -B options. – doneal24 Aug 18 '23 at 13:06
Use dd to write the last n bytes to a file, then use truncate to reduce the size of the logfile by that amount. Loop through that until the file is nicely chopped up. Of course this does not take newlines as the standard cutoff position and needs to be executed with great care. – FelixJN Aug 18 '23 at 13:25
split --line-bytes=100M logfile will split the file into multiple components, always splitting at the end-of-line. (Note - I haven't investigated how this handles multi-byte encodings, such as UTF-8. It should be safe.) – Popup Aug 18 '23 at 15:05
Open it in neovim (nvim -u NONE -R application.log) and just search to your heart's desire in it (once opened, /\Vstring_to_search). I just tried it; generated a 15 GB file with a couple hundred million lines, and yeah, that works. It's not "fast" by any means, but it gives you enough speed to find the places you want to start looking into, copy a couple thousand lines before to a new file, and then work fast. Analyzing the log on an IO-bound server is a bad idea anyways. – Marcus Müller Aug 18 '23 at 16:12
[…] (logrotate) and it might mean you're using the wrong tools to analyze the logs. Specific logs have specific tools; for example, packet logs have tools tailored to making detection of intruders easy; database logs have other tools; webserver logs… Also, if this application is under your control, it might be time to consider better logging formats than plaintext, if, and only if, the application is actually logging the things you need (if it's not, adjust logging to omit irrelevant information). – Marcus Müller Aug 18 '23 at 16:14
Why not less application.log and search for your string within with /? – Stéphane Chazelas Aug 18 '23 at 16:36
"the log file is that of a big corporations. it's easily 10GB per day." That's still not an excuse not to use logrotate. If you split it into smaller chunks and compress (with something like lz4) you should get much more manageable logs. – Popup Aug 21 '23 at 08:04
lz4 (or zstd) is great for logs, as it's blindingly fast to operate on compressed files. Keep the logs compressed, and decompress in a pipe to do whatever you need to.