
I have a plain text data file, consisting of 28 million tab-delimited records, each containing nine fields. The left ends of the first three records look like this:

[screenshot: the left ends of the first three records]

I would like to truncate each of those records after the first four fields. That is, I don't need fields five through nine. After truncation, the records in the image (above) would look like this:

20937085    0f42ab32-22cd-4dcf-927b-a8d9a183d68b    Travelling Man  2001233
20937086    4dce8f93-45ee-4573-8558-8cd321256233    Live Up 2001233
20937087    48fabe3f-0fbd-4145-a917-83d164d6386f    Radiate 2001233

I think the last time I used Emacs for anything substantial was around 1983. I have missed it. For better and for worse, I was distracted by the arrival of the IBM PC. That, or the sheer passage of time, may have had a deadening effect upon the portion of the intellect previously devoted to a different sort of computing.

For whatever reason, Emacs is now a largely foreign language to me. But I think it may provide the only solution within my reach at present.

If anyone can give me a nudge toward a means of automating the removal of fields five through nine from the right ends of those 28 million records, it would be most appreciated.

Ray Woodcock

2 Answers


For the curious, there is a csv-mode that can be used to remove the unwanted columns, but it's not going to perform very well on a file with 28 million records in it. (I tried.)

The Emacs Way

  • Install csv-mode: M-x package-install RET csv-mode RET
  • Open the tab-delimited file.
  • M-x csv-mode
  • C-c C-k to invoke csv-kill-fields. (On a large file, expect each step to be slow.)
  • Answer y when it offers to operate on the whole buffer as the region.
  • At the next prompt, enter 5-9 to delete fields 5 through 9.
  • This works fine on smaller files, but I tried it on a TSV file I generated with 28 million rows, and 10 minutes later my Emacs was still grinding at 100% CPU. If I don't run out of RAM, I think it'll work, but I don't recommend it. UPDATE: 20 minutes later it still wasn't done, so I hit C-g to get out of it. It deleted a lot, but it didn't get to the end of the file. (A non-interactive batch run is sketched below.)
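If you want to stay in Emacs regardless, running the edit non-interactively skips the redisplay overhead, though Emacs still has to hold the whole file in a buffer, so expect it to be slow and memory-hungry. A minimal batch-mode sketch, assuming single-tab delimiters; big.tsv and new.tsv are placeholder names:

# Visit big.tsv, keep only the first four tab-separated fields
# of each line, and write the result to new.tsv.
emacs -Q --batch --eval '(progn
  (find-file "big.tsv")
  (while (re-search-forward "^\\([^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\\)\t.*$" nil t)
    (replace-match "\\1" t))
  (write-file "new.tsv"))'

The regexp captures the first four fields of each line and the replace-match deletes everything after them; C-g still works if it runs too long.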

The Unix Way (with cut)

cut -f 1-4 < big.tsv > new.tsv

This took about 4 seconds to process 28 million lines on my system.
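awk handles the same job nearly as fast, and it generalizes to reordering or filtering fields. A hedged equivalent of the cut command above (same placeholder file names):

# Print only the first four tab-separated fields of each line.
awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $2, $3, $4 }' big.tsv > new.tsv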

g-gundam
  • I used cut as suggested. With the input and output files located on a USB drive plugged into a ten-year-old dual-core laptop with 8GB RAM, it took under eight minutes. A look at the contents of the output file (in Emacs; still too big for, e.g., Notepad++) suggested it yielded the desired result. File size was reduced from 2.9GB to 1.9GB. I will have to wait for another compelling reason to reacquaint myself with Emacs. Many thanks. – Ray Woodcock Dec 21 '22 at 02:41
  • My test files were on an SSD, so that probably helped. – g-gundam Dec 21 '22 at 04:29

How are the fields delimited? Several spaces, or one tab? Either way, consider using cut, specifically cut -f (select by field), rather than Emacs.
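For illustration, both cases are easy to handle from the shell; the file names here are placeholders. TAB is cut's default delimiter, so tab-delimited input needs no -d option, and runs of spaces can be squeezed into single tabs first:

# Tab-delimited input (TAB is cut's default delimiter):
cut -f 1-4 < data.txt > out.txt

# Fields separated by runs of spaces: squeeze them into single tabs first.
tr -s ' ' '\t' < data.txt | cut -f 1-4 > out.txt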

jeffkowalski
  • >28 million **tab-delimited** records. I agree that doing it outside of Emacs might be the better move, especially since the file is so large. – g-gundam Dec 20 '22 at 04:34
  • Agreed, `cut` is all you need here. Otherwise `awk` or `perl` should also be pretty fast. You want to use the (or a) right tool for the job, which is definitely not Emacs for this quantity of text (it can do it, but it will be sloooow by comparison). – phils Dec 20 '22 at 04:45