163

How do I remove the first 300 million lines from a 700 GB text file on a system with 1 TB of disk space in total, of which about 300 GB are available? (My system has 2 GB of memory.) The answers I found use sed, tail, or head.

But I think (please correct me if I'm wrong) I cannot use them, because the disk space is limited to 1 TB and they produce a new file and/or use a temporary file during processing.

The file contains database records in JSON format.

Braiam
  • 35,991
Kris
  • 1,283
  • 19
    How much memory do you have? – choroba Sep 21 '20 at 10:11
  • Unfortunately only 2GB :/ – Kris Sep 21 '20 at 10:15
  • How much of the 1TB is actually available? Do you have 300GB free or is it even less? Can you compress the file to free up some space? – terdon Sep 21 '20 at 10:19
  • I have around 340GB disk space available. I could compress the file - but wouldn't make this even more difficult to rm the first x lines? – Kris Sep 21 '20 at 10:23
  • 20
    If you can compress the file, and if compressing it leaves space enough to store a second copy of the compressed file, then this would be the way to go. A pipeline to solve your issue would stream the decompressed data through some transformation, and then compress the processed data. See e.g. terdon's answer. – Kusalananda Sep 21 '20 at 11:52
  • 1
  • 8
    See Truncating the first 100MB of a file in linux on [SO] for a really interesting option which really is in-place truncation of the front of the file. NOTE, I have not tried it and you would be strongly advised to test it on a disposable file first – Chris Davies Sep 21 '20 at 13:19
  • @roaima utilizing sparse files (for filesystems that support it) in various ways is yet another way to go about it, yeah. I think I wrote a similar answer somewhere that kind of used split... but I can't find it at the moment. – frostschutz Sep 21 '20 at 15:28
  • 11
    Is this a real situation or a hypothetical one ? I can't think of a realistic situation where 700 GB of text would contain something meaningful. Looking for background. – Criggie Sep 21 '20 at 21:06
  • 2
    @Criggie I have worked with DNA data. 700 GB of A,C,G,T was not uncommon, though mostly they were in smaller chunks than a single file. – Ole Tange Sep 22 '20 at 06:57
  • I finally found an older answer I wrote to a similar problem: decompress gzip in place, read and eat, punch hole as you go. In that answer it's used with | gunzip > newfile but you can just use it with | tail -n +300000001 > newfile instead. – frostschutz Sep 22 '20 at 07:46
  • 5
    @Criggie It's actually a realistic one. I'm migrating our company's database and this is 700GB file with profile data in JSON format, each line one profile. The process that imported the data stopped after running several days after 70% done. Now, to not repeat the full import, I want to cut the first 70% (300 million lines). – Kris Sep 22 '20 at 07:54
  • 3
    The easiest way is to scp the file to a different machine with a larger storage, fixing the file, and copying it back. – choroba Sep 22 '20 at 08:19
  • 17
    You have a backup, right? Treat that as your source and overwrite the target file. If you don't have a backup then clearly the data's not important so it doesn't matter if you just delete it and start over – Chris Davies Sep 22 '20 at 08:56
  • 3
    @roaima This is one option: exporting the data from the old DB again, skipping the first 300 million entries, copying it over to the new one and importing it there. It roughly takes a day. I just thought there must be a way to do it in place: the data are already there - just truncate the first part of it and continue. But I'm not in a desperate position. I have backups and no (extreme) time pressure. So in this sense it's an exercise. – Kris Sep 22 '20 at 10:29
  • Safest would be the backups route (as suggested in a proper answer a little further down). The whole idea of truncating data from the front of a real data file makes me shudder, to be honest! – Chris Davies Sep 22 '20 at 10:45
  • 1
    @roaima If you have a backup, use one of the solutions to update the file in place. No need to access the backup (other than to verify it) until or unless the in-place update fails. – Keith Thompson Sep 22 '20 at 16:53
  • 1
    This question is getting attention because it's on the front page of Hacker News. – Keith Thompson Sep 22 '20 at 16:54
  • 1
  • @choroba took the words right out of my mouth - even buying a new 2tb drive should be cost effective and safer than alternatives like deleting portions in place - glad OP clarified this is an exercise at this point – TCooper Sep 23 '20 at 00:39
  • Is this being done for work? Upload to a cloud service and use standard unix utilities. Yes it will cost money but not as much as your time. – verisimilidude Sep 23 '20 at 03:08
  • 1
    A variety of interesting answers, but it seems that you actually have enough disk space for the 30% of the file that you want to keep? So any old standard method should work too. – Michał Politowski Sep 23 '20 at 06:24
  • 2
    This sounds like a trick-question in a job interview. It is a hardware problem - just get a second drive ;) – jonatan Sep 24 '20 at 06:55
  • I'd just program this in C and do it manually. Clearly, it is possible. Just not easily with standard command line utilities. – FUZxxl Sep 28 '20 at 07:13
  • It seems like this would be easier with two pointers inside the document. First pointer is at the start, second pointer goes to line 300 million. Then you copy each line to the start-pointer as you iterate to the end. Then add a null to the end of the file. – Rob Sep 28 '20 at 18:34
  • TFW when you need real log rotation and a new disk. – Warren P Sep 28 '20 at 22:18

13 Answers

159

Removing the first n lines (or bytes) can be done in-place using dd (or, alternatively, using loop devices). It does not use a temporary file and there is no size limit; however, it is dangerous since there is no way to track progress, and any error leaves you with a broken file.

Example: Create a sample file with 1000 lines:

$ seq 1 1000 > 1000lines.txt
$ head -n 3 1000lines.txt
1
2
3
$ tail -n 3 1000lines.txt
998
999
1000

We want to remove the first 300 lines. How many bytes does it correspond to?

$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal

The file is 3893 bytes; we want to remove the first 1092 bytes, leaving us with a new file of 2801 bytes.

To remove these bytes, we use the GNU dd command with conv=notrunc, as otherwise the file would be truncated before its contents could be copied:

$ dd conv=notrunc iflag=skip_bytes skip=1092 if=1000lines.txt of=1000lines.txt
5+1 records in
5+1 records out
2801 bytes (2.8 kB, 2.7 KiB) copied, 8.6078e-05 s, 32.5 MB/s

This removes the first 300 lines, but the file has not been truncated yet, so its last 1092 bytes are now a stale duplicate of data that has already been copied forward:

$ truncate -s 2801 1000lines.txt

This reduces the file to its final size, removing duplicated lines at end of file.

The result:

$ stat -c %s 1000lines.txt 
2801

$ head -n 3 1000lines.txt
301
302
303

$ tail -n 3 1000lines.txt
998
999
1000

The process for a larger file is similar. You may need to set a larger blocksize for better performance (the blocksize option for dd is bs).

The main issue is determining the correct byte offset for the exact line number. In general it can only be done by reading and counting. With this method, you have to read the entire file at least once even if you are discarding a huge chunk of it.
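
Scaled up to the file from the question, the same steps might look roughly like this (an untested sketch; the path, variable names and block size are only illustrative):

# bytes occupied by the first 300 million lines (this reads ~70% of the file once)
skip=$(head -n 300000000 /path/to/records.json | wc -c)
size=$(stat -c %s /path/to/records.json)

# shift everything after that offset to the front of the file, in place
dd conv=notrunc iflag=skip_bytes skip="$skip" bs=64M \
   if=/path/to/records.json of=/path/to/records.json

# cut off the now-duplicated tail
truncate -s $((size - skip)) /path/to/records.json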

Mmmh mmh
  • 898
  • 1
  • 7
  • 7
frostschutz
  • 48,978
  • 36
    I wouldn't use this solution if my life depended on it ;-) Editing file in place may lead to all kinds of issues, including completely losing its contents. – Artem S. Tashkinov Sep 21 '20 at 11:36
  • 17
    @ArtemS.Tashkinov sure... inplace operations are always dangerous. That's even true for nondestructive badblocks, lvm pvmove, mdadm grow, ... anyway for those worried about dd in particular I also added loop device method in another answer. – frostschutz Sep 21 '20 at 15:18
  • How is it known that dd conv=notrunc iflag=skip_bytes skip=1092 if=1000lines.txt of=1000lines.txt did not create a sizable temporary file in the making of output (making this solution not available for OP's case)? – chux - Reinstate Monica Sep 22 '20 at 11:18
  • 12
    dd does not create temporary files. this answer assumes your filesystem behaves normally. if you add network or fuse filesystems into the mix, it might not work this way. – frostschutz Sep 22 '20 at 11:33
  • conv=notrunc is available in FreeBSD's dd too, which is not GNU's version. – Rob Sep 22 '20 at 17:14
  • @Rob it's usually the skip/count/seek_bytes that's missing from other flavors of dd – frostschutz Sep 22 '20 at 17:19
  • @ArtemS.Tashkinov I wouldn't use this solution without backing up the drive first. I don't think mortal danger is required for that. – J... Sep 22 '20 at 19:22
  • I really appreciate both your posts as it opened my eyes on this "infile" way to shorten huge files. The principle are, in hindsight, easy and I wonder why I never thought about it until you pointed it out (well, if I imagined it I would have feared that the writing part could somehow create a "\0 until the writing point" sparse file.. instead of keeping it as is), but thank you for this! It avoids the need for more space on the filesystem, and is one pass, so it's quite efficient too. – Olivier Dulac Sep 23 '20 at 06:50
  • How about UTF-8? Does "wc -c" take that into account? – osiris Sep 23 '20 at 14:57
  • @osiris check the man page. – A.B Sep 23 '20 at 15:05
133

If you have enough space to compress the file, which should free a significant amount of space allowing you to do other operations, you can try this:

gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz

That will first gzip the original input file (file) to create file.gz. Then, you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300M lines, compress the result to save disk space, and save it as newFile.gz. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).

Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000 and this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M and the newFile.gz I created only 213M.
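
To sanity-check the result without decompressing anything back to disk, something like the following should do (illustrative commands, not part of the original answer):

# line counts: the trimmed copy should have exactly 300,000,000 fewer lines
zcat file.gz | wc -l
zcat newFile.gz | wc -l

# and the first line of newFile.gz should be line 300,000,001 of the original
zcat newFile.gz | head -n 1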

terdon
  • 242,166
  • 33
    The entropy of text produced by seq is probably very low, which is why this particular text is very compressible. Something more random will most likely perform much worse. – probably_someone Sep 21 '20 at 19:45
  • 51
    I'm guessing the content of a 700GB text file (with lines) isn't very random either. The OP doesn't specify, but I would guess it contains something like logging data or a database dump. It might well compress to under 25% of the original size. If everything is 7-bit ASCII characters there's room for ~50% reduction already. – AVee Sep 21 '20 at 20:02
  • 2
    The OP stated there is 340GB free space on the disk, so the first gzip may not be needed as the resulting smaller file will probably fit when compressed. That will make it quite a bit faster probably. – AVee Sep 21 '20 at 20:08
  • 2
    @probably_someone yes, that's a fair point. This was just the easiest way to demonstrate. – terdon Sep 21 '20 at 20:35
  • 1
    @AVee tail can produce temp files, so it needs more space while running than will be taken up by the final file. In any case, having a 700GB text file uncompressed is just a waste of space, so you may as well compress it regardless. – terdon Sep 21 '20 at 20:36
  • @probably_someone, a 2:1 compression ratio is sufficient for this method to work, and it's rare for text files to have a compression ratio worse than 4:1. – Mark Sep 22 '20 at 00:48
  • 4
    tail | gzip avoids saving a compressed copy of the lines you want to remove. – Peter Cordes Sep 22 '20 at 02:07
  • Also, of course other compressors like lz4 or xz trade compression speed for space; if you have spare space and aren't I/O bottlenecked, lz4 can compress faster than gzip. (But if you're only doing this once, gzip is probably fine, despite not taking advantage of multiple cores.) – Peter Cordes Sep 22 '20 at 02:13
  • 3
    @PeterCordes gzip and bzip2 can both be done on multiple cores, using pigz and lbzip2, respectively. – James_pic Sep 22 '20 at 08:36
  • @terdon Agreed, but I'm kinda assuming the 700GB file will be deleted once this is done. If it's going to be kept you are definitely correct. – AVee Sep 22 '20 at 08:53
  • 4
    @AVee don't you mean 12.5% for having 7 bit ascii? Or do you have something different in mind? – doetoe Sep 22 '20 at 23:35
  • I would not compress the result, since that would require decompressing it again. – Daniel F Sep 23 '20 at 02:00
  • 3
    @DanielF there's not much reason to ever have a decompressed text file of that size. If you need to extract information from it, you can use zcat or zmore to view it, zgrep to search through it, or even open it directly in a good editor like emacs. – terdon Sep 23 '20 at 07:56
  • @terdon thanks for the info. I didn't know about these commands. – Daniel F Sep 23 '20 at 08:06
  • 20
    @terdon In the end I used your solution (I'm sure others would have worked too - oh my, did this little question explode). The compressed file was only 140GB (JSON data with lots of equal field names) and fit on the disk. Thank you very much! – Kris Sep 23 '20 at 20:50
  • @Kris I submitted this Q&A to HN(https://news.ycombinator.com/item?id=24553499) which would have brought tens of thousands of views.. and there were some nice suggestions in that thread as well – Sundeep Sep 24 '20 at 04:57
  • 1
    I don't see a point to gziping the original file. Why not do: tail -n +300000001 file | gzip > newFile.gz directly? If there's enough space to store the gzip-compressed version of the original file, there must be enough space to store the gzip-compressed version of the truncated version. – jamesdlin Sep 28 '20 at 04:30
  • 3
    RFC 1951 DEFLATE is a very old format. Only use it for backwards compatibility. Your default format should be zstd. It is always better than gzip (smaller and faster at the same time), the cli utility uses multiple threads (-T0), it has a much wider range of time-space trade-off than gzip and it has long range mode. If you want something even faster, use lz4. If you want something that compresses better, use lzma/xz, but it's slow. If you have natural language text, try (slow) ppmd from p7zip-full. – Z.T. Sep 28 '20 at 12:16
  • 2
    @jamesdlin more like the other way around: why keep a large uncompressed text file on disk, ever? That just takes up space for no reason. Since tools like zcat, zgrep and zmore allow you to easily manipulate/extract the compressed data, there's really no good reason to keep the file uncompressed. Also, in this case, when I wrote the answer, it was under the assumption that there isn't enough space to hold the new file without first compressing the old one. And remember that tail can produce temp files while working, which also take up space. – terdon Sep 28 '20 at 15:22
  • @jamesdlin I moved our discussion to a chat room so as not to clutter the comments. – terdon Sep 29 '20 at 12:00
38

On some filesystems, like ext4 or XFS, you can use the fallocate() system call for that (its FALLOC_FL_COLLAPSE_RANGE mode removes a range of bytes from the front of a file in place).
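
For reference, the fallocate(1) tool from util-linux exposes the collapse-range operation. A rough sketch, assuming an ext4/XFS filesystem and computing the byte offset as in the dd answer above (untested; names are illustrative):

SKIP=$(head -n 300000000 file.txt | wc -c)   # bytes taken up by the lines to drop
BLK=$(stat -f -c %S file.txt)                # filesystem block size (often 4096)
ALIGNED=$(( SKIP / BLK * BLK ))              # collapse-range must be block-aligned

# remove the aligned part of the prefix in place, using no extra disk space
fallocate --collapse-range --offset 0 --length "$ALIGNED" file.txt

# the remaining SKIP-ALIGNED bytes (less than one block) still have to be removed,
# e.g. with the dd + truncate approach from the dd answer above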

  • 2
    Nice, this is what I was going to suggest. If the line you want to keep doesn't start on a block boundary, doing this first could still be a useful setup to compress + decompress. Although in-place dd to copy is probably best, and doesn't need any extra space so wouldn't benefit from this at all. – Peter Cordes Sep 22 '20 at 02:16
32

You can do it with losetup, as an alternative to the dd method described here. Again, this method is just as dangerous.

Again, the same test file and sizes (remove lines 1-300 from 1000 lines file):

$ seq 1 1000 > 1000lines.txt
$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal

Create a loop device:

# losetup --find --show 1000lines.txt
/dev/loop0
losetup: 1000lines.txt: \
Warning: file does not fit into a 512-byte sector; \
the end of the file will be ignored.
# head -n 3 /dev/loop0
1 
2 
3 
# tail -n 3 /dev/loop0
921
922
923

Whoops. There are numbers missing. What's going on?

Loop devices require their backing files to be a multiple of the sector size. Text files with lines don't usually fit that scheme, so in order not to miss the content at the end of the file (in the last partial sector), just append some more data first, then try again:

# head -c 512 /dev/zero >> 1000lines.txt
# losetup --find --show 1000lines.txt
/dev/loop1
losetup: 1000lines.txt: \
Warning: file does not fit into a 512-byte sector; \
the end of the file will be ignored.
# tail -n 3 /dev/loop1
999
1000
\0

The warning persists but the content is complete now, so that's okay.

Create another one, this time with the 300 line offset:

# losetup --find --show --offset=1092 1000lines.txt
/dev/loop2
losetup: 1000lines.txt: \
Warning: file does not fit into a 512-byte sector; \
the end of the file will be ignored.
# head -n 3 /dev/loop2
301
302
303
# tail -n 3 /dev/loop2
999
1000
\0

Here's the nice thing about loop devices. You don't have to worry about truncating the file by accident. You can also easily verify that your offsets are indeed correct before performing any action.

Finally, just copy it over, from the offset device to the full one:

cp /dev/loop2 /dev/loop1
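
For a 700 GB file, cp works but reports no progress; as an illustrative alternative (not part of the original answer), GNU dd with a large block size can do the same copy:

dd if=/dev/loop2 of=/dev/loop1 bs=64M status=progress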

Dissolve loop devices:

losetup -d /dev/loop2 /dev/loop1 /dev/loop0

(Or: losetup -D to dissolve all loop devices.)

Truncate the file to target filesize:

truncate -s 2801 1000lines.txt

The result:

$ head -n 3 1000lines.txt 
301
302
303
$ tail -n 3 1000lines.txt 
998
999
1000
frostschutz
  • 48,978
18

Another vote for a custom program, if you really DO need the task done. C, or any powerful enough dynamic language like Perl or Python, will do. I won't write out the source here, but I will describe an algorithm that will prevent data loss while you move data around (a rough shell sketch of step 1 follows the list):

  1. Read your big file from the end, counting line breaks. After gathering some pre-defined number of lines that you can safely fit in the free space, write this chunk as a separate file and cut the big file's tail. Use the chunk's filename to store the line numbers.
  2. After that you will end up with a completely erased big file and lots of much smaller files taking up the same space.
  3. Count your 300 million lines - you can delete all chunks corresponding to unnecessary lines right away, since you know which chunks contain which lines.
  4. If you don't actually need the big file, you can simply operate directly on the remaining chunks with whatever tools you need, using wildcards or stringing them together with cat as necessary.
  5. If you need the big file after all, and the freed-up space is enough to store the sum of the remaining chunks after you've deleted the unnecessary ones, simply combine them together with cp or cat.
  6. If you need the big file and there is not enough space, write another small program that does the reverse of step 1: save the list and the individual length of each chunk to some list file. Read the chunks one by one and append them to a newly created "big file". Each time you've appended a chunk to the big file, delete the separate small file containing that chunk, thus allowing you to reassemble the file in place. If you interrupt the process of writing a chunk at any time, you can restart the writing of the big file by calculating the correct offset for any particular chunk, because you saved each chunk's size in advance.
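
A rough shell sketch of step 1 (file names and the chunk size are illustrative; a real run would use chunks of a few gigabytes and add error handling):

BIG=bigfile.txt
CHUNK_LINES=1000000
i=0
while [ -s "$BIG" ]; do
    tail -n "$CHUNK_LINES" "$BIG" > "chunk.$i"
    # shrink the big file by exactly the number of bytes just copied out
    truncate -s -"$(stat -c %s "chunk.$i")" "$BIG"
    i=$((i + 1))
done
# chunk.0 holds the last CHUNK_LINES lines, chunk.1 the ones before that, and so on;
# chunks consisting entirely of the first 300 million lines can then be deleted,
# and the one chunk straddling the boundary trimmed (step 3).
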
  • What if the chunk you're writing -- containing those lines you counted from the end of the original file -- does not fit in the remaining 300GB on the device? I mean your original file is still there, occupying 700GB? – Armen Michaeli Sep 28 '20 at 13:26
  • @amn "some pre-defined amount of lines that you can safely fit in free space". Either select a really safe static number of lines, or dynamically keep track of the length of what you've already read and stop some megabytes before hitting the free space limit. – Oleg V. Volkov Sep 28 '20 at 14:51
8

With ksh93:

tail -n +300000001 < file 1<>; file

The 1<>; operator is a ksh93-specific variation on the standard 1<> operator (which opens the file in read+write mode without truncation): if the command succeeds, the file is truncated at the position where the command left its stdout when it returned.

With other shells, you can always do the truncating-in-place-afterwards by hand with perl for instance:

{
  tail -n +300000001 &&
    perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file

To get a progress bar, using pv:

{
  head -n 300000000 | pv -s 300000000 -lN 'Skipping 300M lines' > /dev/null &&
    cat | pv -N 'Rewriting the rest' &&
    perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file

(using head | pv and cat | pv because pv would refuse to work if its input and output pointed to the same file. pv -Sls 300000000 would also not work, as pv does not leave the pointer within the file just after the 300000000th line upon exiting the way head does (and is required to by POSIX for seekable files). pv | cat instead of cat | pv would let pv know how much it needs to read and give you an ETA, but it is currently bogus in that it does not take into account the cases where it is not reading from the start of the file, as is the case here).

Note that those are dangerous as the file is being overwritten in place. There is a chance that you run out of disk space if the first 300M lines contained holes (shouldn't happen for a valid text file), and the rest of the file takes up more space than you have spare space on the FS.

  • Or with just Perl: perl -ne 'print if $. > 300_000_000; }{ truncate(STDOUT, tell STDOUT);' < file 1<> file. Or with Perl and pv: < file pv | perl -ne 'print if $. > 300_000_000; }{ truncate(STDOUT, tell STDOUT);' 1<> file, which should work as the whole file passes through pv. – ilkkachu Sep 22 '20 at 20:34
  • 1
    @ilkkachu, yes though perl will likely be less efficient than head/tail at doing the hard and long work here. Your pv variant is quite neater than mine though. Feel free to edit it in if you don't want to add your own answer. – Stéphane Chazelas Sep 22 '20 at 20:42
4

The limiting factor in this problem is the amount of storage, wherever it is located. Significant RAM is not required: fundamentally you can read one byte from wherever your file is stored and then either write or not write that byte out to a new file, wherever that may reside. The infile and outfile can be in totally separate places, on separate partitions, disks, or across a network; you do not need to read and write in the same folder. So, for the attached program, you can simply give full path names for the input and output files to work around disk space limitations. You will be at the mercy of other limits, such as disk or network I/O speed, but it will work. Taking a very long time to work is better than not being able to do it at all.

  • Adjust LL, which is a hardcoded line length used to read in a whole line at a time from the text file; I set it to 2048 characters. Set it to 1000000 if you like, which would require 1 MB of RAM should you have extremely long lines in the text file.
  • If your text file is ridiculously large... I often deal with up to 10 GB text files... consider doing a gzip -9 on it to create a mytextfile.gz. Being a text file, it will likely compress to 5% of its size, which is helpful considering disk I/O speed vs. CPU speed.
  • The program writes your new file, with the first N lines deleted, as an uncompressed text file, so that will likely be huge.
  • This program is written in standard C; I kept it as simple as possible.
  • It checks, and will not harm, your original text file.
  • You do not have to compress your original text file for this to work; compressing it is optional.
  • You can have your original file on one disk or network location and write the output file, with the first N lines deleted, to some other disk or network location; just use full path names, for example:

delete_n_lines.x /home/ron/mybigfile.txt /some_nfs_mounted_disk/mybigfile_deletedlines.txt


/*  this file named    delete_n_lines.c

    compile by    gcc -W delete_n_lines.c -o delete_n_lines.x -lz

    have your huge text file already compressed via "gzip -9" to save disk space

    this program will also read a regular uncompressed text file
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define LL 2048   /* line length, number of characters up to '\n' */

int main( int argc, char *argv[] )
{
    gzFile fin;
    FILE *fout;
    char line[LL];
    long int n = 0;
    long int n_lines_to_delete = 0;

    if ( argc != 4 )
    {
        printf(" Usage: %s <infile> <outfile> <first_N_lines_to_delete>\n\n", argv[0] );
        exit( 0 );
    }

    if ( sscanf( argv[3], "%ld", &n_lines_to_delete ) != 1 )
    {
        printf("\n Error: problem reading N lines to delete\n\n" );
        exit( 0 );
    }

    if ( strcmp( argv[1], argv[2] ) == 0 )
    {
        printf("\n Error: infile and outfile are the same.\n" );
        printf(" don't do that\n\n");
        exit( 0 );
    }

    fout = fopen( argv[2], "w" );
    if ( fout == NULL )
    {
        printf("\n Error: could not write to %s\n\n", argv[2] );
        exit( 0 );
    }

    /* gzopen transparently reads both gzip-compressed and plain text files */
    fin = gzopen( argv[1], "r" );
    if ( fin == NULL )
    {
        printf("\n Error: could not read %s\n\n", argv[1] );
        fclose( fout );
        exit( 0 );
    }

    /* skip the first n_lines_to_delete lines, copy every line after that */
    gzgets( fin, line, LL );
    while ( ! gzeof( fin ) )
    {
        if ( n < n_lines_to_delete )
            n++;
        else
            fputs( line, fout );

        gzgets( fin, line, LL );
    }

    gzclose( fin );
    fclose( fout );

    printf("\n deleted the first %ld lines of %s, output file is %s\n\n", n, argv[1], argv[2] );

    return 0;
}

ron
  • 6,575
  • just actually recognizing a 700GB text file, that's really **** big, being on a 1TB disk would not surprise me if doing gzip on it failed after an hour of trying to compress it. So, requiring 700gb to hold that text file, and assuming > 500gb to hold the resulting output file having only 3m lines deleted, you're gonna have to find storage elsewhere. Other than [buying] another disk, you could make use of some online storage that hosts N terabytes and then mount it in linux by whatever means is acceptable. – ron Sep 21 '20 at 16:14
  • 5
    gzip will have no trouble compressing it.The only problem will be whether there is enough space left for writing the compressed output. – Ángel Sep 21 '20 at 20:53
3

I created a tool that may be of use to you: hexpeek is a hex editor designed for working with huge files and runs on any recent POSIX-like system (tested on Debian, CentOS, and FreeBSD).

One can use hexpeek or an external tool to find the 300-millionth newline. Then, assuming that X is the hexadecimal zero-indexed position of the first octet after the 300-millionth newline, the file can be opened in hexpeek and a single command 0,Xk will delete the first X octets in the file.

hexpeek requires no tmpfile to perform this operation; although the optional backup mode does and would likely need to be disabled via the -backup flag (sadly the current backup algorithm does not accommodate a rearrangement affecting more file space than is available for the backup file).

Of course, a custom C program can accomplish the same thing.
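
For instance, X could be found with standard tools (an illustrative sketch; hexpeek itself is only needed for the in-place deletion):

# byte count of the first 300 million lines = zero-indexed position of the
# first octet after the 300-millionth newline
X=$(head -n 300000000 file.txt | wc -c)
printf 'X = %x (hex)\n' "$X"
# then, inside hexpeek:  0,Xk   (with X replaced by that hex value)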

3

What about using vim for in-place editing?

Vim is already capable of reasoning about lines:

vim -c ":set nobackup nowritebackup" -c ":300000000delete" -c ":wq" filename

Explanation:

vim will execute the various commands passed to the -c switches as if they were passed in an interactive session.

So:

  1. we disable backup copy creation
  2. we delete the first 300 million lines (cursor starts at line 0 on startup)
  3. we save the file

That should do the trick. I have used vim in a similar fashion in the past; it works. It may not be copy-paste safe, so the OP should do some tests and possibly adapt the command to their needs.

Just to be sure, you might want to remove the -c ":wq" switches at the end, and visually inspect the file for correctness.
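
As the comments below point out, the delete command needs an explicit range (and noswapfile) to drop the first N lines; a corrected variant, which would still make vim try to hold the whole 700 GB file in memory, might look like:

vim -c ":set nobackup nowritebackup noswapfile" -c ":1,300000000delete" -c ":wq" filename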

znpy
  • 197
  • 2
    vim will make a backup copy of the original, so you'll need 700G free on the disk which the OP doesn't have. – Stéphane Chazelas Sep 22 '20 at 12:49
  • 1
    i added another switch: -c ":set nobackup nowritebackup" to fix the problem you let me notice – znpy Sep 22 '20 at 13:06
  • @znpy I tested this on a small file with ":3delete", and it only removed the third line leaving the first two intact. How to get rid of first n lines altogether? – user111111111 Sep 22 '20 at 14:04
  • 4
    You'd also need noswapfile. But then the whole file would be loaded in memory. That's probably not an option for a 700G large file. – Stéphane Chazelas Sep 22 '20 at 14:09
  • 5
    The right deletion command appears to be -c ":1,300000000delete", however that stil leaves the problem with vim trying to load the entire file into memory. – JanKanis Sep 22 '20 at 14:34
3

Think of Towers of Hanoi. Sort of.

First, move the lines you want into a new file:

find the start of line 300 million and 1
create a new, empty file
repeat {
  read a decent number of blocks from the end of the old file
  append the blocks to the end of the new file
  truncate the old file by that many blocks
} until you get to the start of line 300 million and 1.

You should now have a file that contains just the lines you want, but not in the right order.

So let's do the same thing again to put them into the right order:

Truncate the original file to zero blocks (i.e. delete the first 300 million lines)
repeat {
  read the same number of blocks from the end of the new file (except the first time, when you won't have an exact number of blocks unless the first 300 million lines were an exact number of blocks long)
  append those blocks to the end of the original file
  truncate the new file by that many blocks
} until you have processed the whole file.

You should now have just the lines you want, and in the right order.

Actual working code is left as an exercise for the reader.
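
For illustration only, a rough shell sketch of the first loop (file names and block size are arbitrary; untested, no error handling):

SRC=old.txt; DST=new.txt
SKIP=$(head -n 300000000 "$SRC" | wc -c)   # bytes taken up by the lines to drop
BLOCK=$((64 * 1024 * 1024))
SIZE=$(stat -c %s "$SRC")
while [ "$SIZE" -gt "$SKIP" ]; do
    N=$(( SIZE - SKIP )); [ "$N" -gt "$BLOCK" ] && N=$BLOCK
    tail -c "$N" "$SRC" >> "$DST"          # append the last N bytes to the new file
    truncate -s -"$N" "$SRC"               # and chop them off the old one
    SIZE=$(( SIZE - N ))
done
# new.txt now holds the wanted data in back-to-front block order; the second loop
# moves the blocks back the same way to restore the original order.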

Ben Aveling
  • 1,440
  • But you can't "append in front" of new file, so it won't do if string order is important. – Oleg V. Volkov Sep 22 '20 at 14:00
  • 1
    If order matters, seek to the appropriate offset in the new file before writing. This will create a sparse file and allow you to fill it up piece by piece. – rrauenza Sep 22 '20 at 18:57
2

There are various approaches to removing the first lines. I recommend splitting the file into chunks, changing them (removing the first lines) and concatenating the chunks again.

In your case it would be very dangerous to change the file in-place. If something goes wrong you have no fallback option!

Here is my working solution (bash). You probably need some improvements ...

function split_into_chunks {
    BIG_FILE=$1

    while [ $(stat -c %s $BIG_FILE) -gt 0 ]
    do
        CHUNK_FILE="chunk.$(ls chunk.* 2>/dev/null | wc -l)"
        tail -10 $BIG_FILE > $CHUNK_FILE
        test -s $CHUNK_FILE && truncate -s -$(stat -c %s $CHUNK_FILE) $BIG_FILE
    done
}

function concat_chunks {
    BIG_FILE=$1
    test ! -s $BIG_FILE || (echo "ERROR: target file is not empty"; return)

    for CHUNK_FILE in $(ls chunk.* | sort -t . -k2 -n -r)
    do
        cat $CHUNK_FILE >> $BIG_FILE
        rm $CHUNK_FILE
    done
}

Test:

$ seq 1000 > big-file.txt 
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt 
53d025127ae99ab79e8502aae2d9bea6  big-file.txt
1000 big-file.txt

$ split_into_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt | tail -12
40 chunk.9
31 chunk.90
30 chunk.91
30 chunk.92
30 chunk.93
30 chunk.94
30 chunk.95
30 chunk.96
30 chunk.97
30 chunk.98
21 chunk.99
0 big-file.txt

$ # here you could change the chunks
$ # the test here shows that the file will be concatenated correctly again

$ concat_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt
53d025127ae99ab79e8502aae2d9bea6  big-file.txt
1000 big-file.txt

Hint: You definitely need to make sure that all your chunks are not too small (very long processing time) and not too big (not enough disk space)! My example uses 10 lines per chunk - I assume that is too low for your task.

sealor
  • 139
  • I'd go so far as to say that OP should, if possible, modify the producer and consumer of this large file to leave everything as chunks (if this is not just a one off). – Dolphin Sep 28 '20 at 11:31
1

I'd do it like this:

<?php
$fp1 = fopen("file.txt", "rb");
// find the position of the 3M'th line:
for ($i = 0; $i < 300_000_000; ++ $i) {
    fgets($fp1);
}
// the next fgets($fp1) call will read line 3M+1 :)
$fp2 = fopen("file.txt", "cb");
// copy all remaining lines from fp1 to fp2
while (false !== ($line = fgets($fp1))) {
    fwrite($fp2, $line);
}
fclose($fp1);
// remove every line that wasn't copied over to fp2
ftruncate($fp2, ftell($fp2));
fclose($fp2);

Or, if I needed it to run fast for some reason, I'd do the same in C++ with mmap() memory mapping; this should run much faster:

#include <iostream>
#include <fstream>
#include <string>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main() {
    const std::string target_file = "file.txt";
    const size_t lines_to_delete = 300000000;

    std::fstream fp1(target_file,
                     std::fstream::in | std::fstream::out | std::fstream::binary);
    fp1.exceptions(std::fstream::failbit | std::fstream::badbit);
    fp1.seekg(0, std::fstream::end);
    const size_t total_size = fp1.tellg();
    fp1.seekg(0, std::fstream::beg);

    const int fd = open(target_file.c_str(), O_RDWR);
    char *content = static_cast<char *>(
        mmap(NULL, total_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // scan forward until we have passed the newline that ends line `lines_to_delete`
    size_t line_no = 0;
    size_t i = 0;
    for (; i < total_size; ++i) {
        if (content[i] == '\n') {
            if (++line_no >= lines_to_delete) {
                ++i;  // first byte of the first line we keep
                break;
            }
        }
    }

    // copy the remainder to the front of the file, then cut off the stale tail
    const size_t kept = total_size - i;
    fp1.write(&content[i], kept);
    fp1.close();
    munmap(content, total_size);
    ftruncate(fd, kept);
    close(fd);
}

  • This should run significantly faster than every other line-accurate answer here, except user431397's answer (but this works on any filesystem, unlike user431397's approach, which only works on certain filesystems)

(But if I don't need the speed, I would probably use the first approach, as the code is much easier to read and probably less likely to contain bugs as a result.)

0

You can just read and write to the file in place and then truncate the file. There may even be a way to do this with cli tools, not sure, but here it is in Java (untested).

import java.io.IOException;
import java.io.RandomAccessFile;

public class RemoveFirstLines {
    public static void main(String[] args) throws IOException {
        RandomAccessFile out = new RandomAccessFile("file.txt", "rw");
        RandomAccessFile in = new RandomAccessFile("file.txt", "r");
        String line = null;
        long rows = 0;
        while ((line = in.readLine()) != null) {
            if (rows >= 300000000L) {          // skip the first 300 million lines
                out.writeBytes(line);
                out.write('\n');
            }
            rows++;
        }
        in.close();
        out.setLength(out.getFilePointer());   // cut off the leftover tail
        out.close();
    }
}
  • You need to keep track of data already moved in some additional file - a simple text file with last successfully written offset will do. It will be needed in case in-place rewrite process is interrupted for any reason, so it can be resumed from same position. – Oleg V. Volkov Sep 22 '20 at 17:25
  • Agreed, that would be prudent. – Chris Seline Sep 23 '20 at 19:32