
For a really big file, like 1GB, wc -l happens to be slow. Do we have a faster way of calculating the number of newlines in a particular file?

prosti
  • Buy faster disks? Given that each byte of the input must be inspected for its 0x0Ainess, I/O is doubtless the bottleneck. – thrig Mar 01 '16 at 21:49
  • If you suspect wc of having too much overhead you may try to implement your own foreach byte in file: if byte == '\n': linecount++ (see the sketch after these comments). If implemented in C or assembler I don't think it'll get any faster, except perhaps in kernel space on an RTOS with highest priority (or even use an interrupt for that - you just can't do anything else with the system... alright, I digress ;-)) – Murphy Mar 01 '16 at 22:07
  • And just to get a feeling for the scale, I did a quick time wc -l some_movie.avi on an uncached file, resulting in 5172672 some_movie.avi -- real 0m57.768s -- user 0m0.255s -- sys 0m0.863s. Which basically proves @thrig right: I/O shatters your performance in this case. – Murphy Mar 01 '16 at 22:18
  • Best way to show it's a disk I/O bottleneck: do time wc -l some_large_file_smaller_than_cache twice in quick succession and see how fast the second operation is, then time wc -l some_large_file_larger_than_cache and see how the time doesn't change between runs. For a ~280MB file here, the time goes from 1.7 seconds to 0.2 seconds, but for a 2GB file it's 14 seconds both times. – EightBitTony Mar 02 '16 at 00:01
  • How slow is too slow for you? What does /usr/bin/time wc -l <file> say? What's your hardware? Is it faster if you run the command repeatedly? We really need more information ;) – marcelm Mar 02 '16 at 11:01
  • @marcelm: Yes, caching definitely plays a significant role: when I repeat the process, the second attempt is up to a thousand times faster. As far as hardware is concerned, I tried this on different systems with SSD and non-SSD disks, and with L1d, L1i, L2 and on some systems L3 caches. The slowest results and greatest caching speedup I found on a VPS with no SSD and 1GB of RAM, where a plain time wc -l file on a 130MB textual log file took ~15s the first time and less than 0.3s the second time. With an SSD disk and >1GB files, execution time was <1s. – prosti Mar 02 '16 at 13:56
  • I have a superfast SSD on x64: `time wc -l 2gb_of_urand` gives 8896625 /home/cat/2gb_of_urand -- real 0m0.768s -- user 0m0.204s -- sys 0m0.236s. – cat Mar 02 '16 at 14:40
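
As a concrete sketch of the byte-by-byte loop Murphy describes above (the 64 KiB chunk size and all names here are my own choices, not taken from the comment; the answers below refine essentially this idea):

/* Hypothetical sketch: read stdin in chunks and count '\n' bytes one at a time. */
#include <stdio.h>
#include <unistd.h>

int main(void){
  char buf[1 << 16];              /* arbitrary 64 KiB chunk size */
  size_t count = 0;
  ssize_t n;
  while((n = read(0, buf, sizeof buf)) > 0){
    for(ssize_t i = 0; i < n; i++)
      if(buf[i] == '\n')
        count++;
  }
  if(n < 0){ perror("read"); return 1; }
  printf("%zu\n", count);
  return 0;
}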

2 Answers


You can try writing it in C:

#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(){
  char buf[BUFSIZ];                /* read in BUFSIZ-sized chunks */
  ssize_t nread;
  size_t nfound = 0;
  /* Count '\n' bytes in each chunk with memchr. */
  while((nread = read(0, buf, BUFSIZ)) > 0){
    char const* p;
    for(p = buf; (p = memchr(p, '\n', nread - (p - buf))); nfound++, p++)
      ;
  }
  if(nread < 0){ perror("Error"); return 1; }
  printf("%zu\n", nfound);
  return 0;
}

Save it as e.g. wcl.c, compile with e.g. gcc wcl.c -O2 -o wcl, and run it with

<yourFile ./wcl

This finds newlines sprinkled in a 1GB file on my system in about 370ms (repeated runs). (Increasing buffer sizes slightly increases the time, which is to be expected -- BUFSIZ should be close to optimal). This is very comparable to the ~380ms I'm getting from wc -l.

Memory-mapping gives me a better time of about 280ms, but it is of course limited to real files (no FIFOs, no terminal input, etc.):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(){
  struct stat sbuf;
  if(fstat(0, &sbuf) < 0){ perror("Can't stat stdin"); return 1; }

  /* Map the whole file backing stdin and scan the mapping directly. */
  char* buf = mmap(NULL, sbuf.st_size, PROT_READ, MAP_PRIVATE, 0/*stdin*/, 0/*offset*/);
  if(buf == MAP_FAILED){ perror("Mmap error"); return 1; }

  size_t nread = sbuf.st_size, nfound = 0;
  char const* p;
  for(p = buf; (p = memchr(p, '\n', nread - (p - buf))); nfound++, p++)
    ;

  printf("%zu\n", nfound);
  return 0;
}

I created my test file with:

 $ dd if=/dev/zero of=1GB bs=1M count=1042

and added some test newlines with:

 $ echo >> 1GB 

and a hex editor.

Petr Skocik
  • I was surprised at the mmap result TBH. I used to think mmaping was faster than read/write, but then I saw some linux benchmarks that showed the opposite. Looks like it is very true in this case. – Petr Skocik Mar 02 '16 at 00:27
  • mmap is going to get vastly better results on linux because it'll map to huge pages these days, and TLB misses are sloooowwwwwww. – jthill Mar 02 '16 at 01:12
  • There may be some benefit to reading different parts of the file in separate threads (e.g. with an OpenMP for loop) so that some progress might be made while one thread is stalled waiting for input. But on the other hand, it might hamper I/O scheduling, so all I can recommend is to try it, and measure! – Toby Speight Mar 02 '16 at 12:52
  • The read() version may benefit from read-ahead. – Barmar Mar 02 '16 at 17:34
  • @TobySpeight Yeah, multithreading might speed it up. Also, scanning two bytes at a time via a 2^16-entry lookup table provided a pretty good speedup last time I played with it. – Petr Skocik Nov 22 '19 at 14:54
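
A rough, hypothetical sketch of what the 2^16-entry lookup-table idea from the comment above could look like; the table layout, chunk size, and names are my own guesses at the technique, not code from the answerer:

/* Sketch: count '\n' two bytes at a time via a 65536-entry table. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned char table[1 << 16];   /* table[v] = number of '\n' bytes in the 2-byte value v */

int main(void){
  for(unsigned v = 0; v < (1u << 16); v++)
    table[v] = ((v & 0xff) == '\n') + ((v >> 8) == '\n');

  static char buf[1 << 16];             /* arbitrary 64 KiB chunk size */
  size_t count = 0;
  ssize_t n;
  while((n = read(0, buf, sizeof buf)) > 0){
    ssize_t i = 0;
    for(; i + 1 < n; i += 2){
      uint16_t v;
      memcpy(&v, buf + i, 2);           /* safe unaligned 2-byte load; symmetric, so endianness doesn't matter */
      count += table[v];
    }
    if(i < n && buf[i] == '\n')         /* odd trailing byte */
      count++;
  }
  if(n < 0){ perror("read"); return 1; }
  printf("%zu\n", count);
  return 0;
}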

You can improve on the solution suggested by @pskocik by reducing the number of calls to read. There are a lot of calls to read BUFSIZ chunks from a 1GB file. The usual approach is to increase the buffer size:

  • just for fun, try increasing the buffer size by a factor of 10. Or 100. On my Debian 7, BUFSIZ is 8192. With the original program, that's 120 thousand read operations. You can probably afford a 1MB input buffer to reduce that by a factor of 100 (see the sketch after this list).
  • for a better approach, an application may allocate a buffer as large as the file, requiring a single read operation. That works well enough for "small" files (though many readers have more than 1GB of memory on their machine, so even a 1GB file qualifies).
  • finally, you could experiment with memory-mapped I/O, which takes care of the buffer allocation for you.
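
As an illustration of the first suggestion, here is a minimal variant of the read() loop from the previous answer using a 1 MiB heap buffer (the size and names are my own arbitrary choices, not from this answer; measure on your own system before assuming it helps):

/* Sketch: same newline counting as the first answer, but with a larger
 * (1 MiB) heap-allocated buffer to cut down the number of read() calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void){
  size_t bufsize = 1 << 20;              /* 1 MiB instead of BUFSIZ */
  char *buf = malloc(bufsize);
  if(!buf){ perror("malloc"); return 1; }

  size_t nfound = 0;
  ssize_t nread;
  while((nread = read(0, buf, bufsize)) > 0){
    char const *p = buf;
    while((p = memchr(p, '\n', nread - (p - buf)))){
      nfound++;
      p++;
    }
  }
  if(nread < 0){ perror("read"); free(buf); return 1; }
  printf("%zu\n", nfound);
  free(buf);
  return 0;
}

Compile and run it the same way as wcl above; as the comment below this answer notes, larger buffers do not necessarily win, so benchmark the change.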

When benchmarking the various approaches, you might keep in mind that some systems (such as Linux) use most of your machine's unused memory as a disk cache. A while back (almost 20 years ago, mentioned in the vile FAQ), I was puzzled by unexpectedly good results from a (not very good) paging algorithm which I had developed to handle low-memory conditions in a text editor. It was explained to me that it ran fast because the program was working from the memory buffers used to read the file, and that only if the file were re-read or written would there be a difference in speed.

The same applies to mmap (in another case still on my to-do list to incorporate into an FAQ, a developer reported very good results in a scenario where the disk cache was the actual reason for improvement). Developing benchmarks takes time and care to analyze the reasons for the good (or bad) performance.


Thomas Dickey
  • You're overestimating the influence of buffer sizes above a certain threshold. Typically, increasing the buffer size beyond 4KB-ish doesn't help much, and may in fact be detrimental because it may push the buffer out of L1 cache. On my machine, testing with dd, using 1MB buffers is slower than 8KB. The 8KB default value for wc is actually chosen pretty well; it will be close to optimal for a large range of systems. – marcelm Mar 02 '16 at 11:06