
For a really big file, like 1GB, wc -l happens to be slow. Do we have a faster way of calculating the number of newlines in a particular file?

prosti
  • Buy faster disks? Given that each byte of the input must be inspected for its 0x0Ainess, I/O is doubtless the bottleneck. – thrig Mar 01 '16 at 21:49
  • If you suspect wc of having too much overhead you may try to implement your own foreach byte in file: if byte == '\n': linecount++ (see the sketch after these comments). If implemented in C or assembler I don't think it'll get any faster, except perhaps in kernel space on an RTOS with highest priority (or even use an interrupt for that - you just can't do anything else with the system... alright, I digress ;-)) – Murphy Mar 01 '16 at 22:07
  • And just to get a feeling for the scale, I did a quick time wc -l some_movie.avi on an uncached file, resulting in 5172672 some_movie.avi -- real 0m57.768s -- user 0m0.255s -- sys 0m0.863s. Which basically proves @thrig right: I/O shatters your performance in this case. – Murphy Mar 01 '16 at 22:18
  • Best way to show it's a disk I/O bottleneck: do time wc -l some_large_file_smaller_than_cache twice in quick succession and see how fast the second operation is, then time wc -l some_large_file_larger_than_cache and see how the time doesn't change between runs. For a ~280MB file here, the time goes from 1.7 seconds to 0.2 seconds, but for a 2GB file it's 14 seconds both times. – EightBitTony Mar 02 '16 at 00:01
  • How slow is too slow for you? What does /usr/bin/time wc -l <file> say? What's your hardware? Is it faster if you run the command repeatedly? We really need more information ;) – marcelm Mar 02 '16 at 11:01
  • @marcelm: Yes, caching definitely plays a significant role: when I repeat the process, the second attempt is up to a thousand times faster. As far as hardware is concerned, I tried this on different systems with SSD and non-SSD disks, and with L1d, L1i, L2 and on some systems L3 caches. The slowest results and greatest caching speedup I found on a VPS with no SSD and 1GB of RAM, where a plain time wc -l file on a 130MB textual log file took ~15s the first time and less than 0.3s the second time. With an SSD disk and >1GB files, execution time was <1s. – prosti Mar 02 '16 at 13:56
  • I have a superfast SSD on x64: `time wc -l 2gb_of_urand` gives 8896625 /home/cat/2gb_of_urand -- real 0m0.768s -- user 0m0.204s -- sys 0m0.236s. – cat Mar 02 '16 at 14:40
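
As a concrete sketch of the byte-by-byte loop Murphy describes above (the 64 KiB chunk size and all names here are my own choices, not taken from the comment; the answers below refine essentially this idea):

/* Hypothetical sketch: read stdin in chunks and count '\n' bytes one at a time. */
#include <stdio.h>
#include <unistd.h>

int main(void){
  char buf[1 << 16];              /* arbitrary 64 KiB chunk size */
  size_t count = 0;
  ssize_t n;
  while((n = read(0, buf, sizeof buf)) > 0){
    for(ssize_t i = 0; i < n; i++)
      if(buf[i] == '\n')
        count++;
  }
  if(n < 0){ perror("read"); return 1; }
  printf("%zu\n", count);
  return 0;
}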

2 Answers


You can try writing it in C:

#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(){
  char buf[BUFSIZ];                /* read in BUFSIZ-sized chunks */
  ssize_t nread;
  size_t nfound = 0;
  /* Count '\n' bytes in each chunk with memchr. */
  while((nread = read(0, buf, BUFSIZ)) > 0){
    char const* p;
    for(p = buf; (p = memchr(p, '\n', nread - (p - buf))); nfound++, p++)
      ;
  }
  if(nread < 0){ perror("Error"); return 1; }
  printf("%zu\n", nfound);
  return 0;
}

Save it as e.g. wcl.c, compile with e.g. gcc wcl.c -O2 -o wcl, and run it with

<yourFile ./wcl

This finds newlines sprinkled in a 1GB file on my system in about 370ms (repeated runs). (Increasing buffer sizes slightly increases the time, which is to be expected -- BUFSIZ should be close to optimal). This is very comparable to the ~380ms I'm getting from wc -l.

Memory-mapping gives me a better time of about 280ms, but it is of course limited to real files (no FIFOs, no terminal input, etc.):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(){
  struct stat sbuf;
  if(fstat(0, &sbuf) < 0){ perror("Can't stat stdin"); return 1; }

  /* Map the whole file backing stdin and scan the mapping directly. */
  char* buf = mmap(NULL, sbuf.st_size, PROT_READ, MAP_PRIVATE, 0/*stdin*/, 0/*offset*/);
  if(buf == MAP_FAILED){ perror("Mmap error"); return 1; }

  size_t nread = sbuf.st_size, nfound = 0;
  char const* p;
  for(p = buf; (p = memchr(p, '\n', nread - (p - buf))); nfound++, p++)
    ;

  printf("%zu\n", nfound);
  return 0;
}

I created my test file with:

 $ dd if=/dev/zero of=1GB bs=1M count=1042

and added some test newlines with:

 $ echo >> 1GB 

and a hex editor.

Petr Skocik
  • I was surprised at the mmap result TBH. I used to think mmaping was faster than read/write, but then I saw some linux benchmarks that showed the opposite. Looks like it is very true in this case. – Petr Skocik Mar 02 '16 at 00:27
  • mmap is going to get vastly better results on linux because it'll map to huge pages these days, and TLB misses are sloooowwwwwww. – jthill Mar 02 '16 at 01:12
  • There may be some benefit to reading different parts of the file in separate threads (e.g. with an OpenMP for loop) so that some progress might be made while one thread is stalled waiting for input. But on the other hand, it might hamper I/O scheduling, so all I can recommend is to try it, and measure! – Toby Speight Mar 02 '16 at 12:52
  • The read() version may benefit from read-ahead. – Barmar Mar 02 '16 at 17:34
  • @TobySpeight Yeah, multithreading might speed it up. Also, scanning two bytes at a time via a 2^16-entry lookup table provided a pretty good speedup last time I played with it. – Petr Skocik Nov 22 '19 at 14:54
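
A rough, hypothetical sketch of what the 2^16-entry lookup-table idea from the comment above could look like; the table layout, chunk size, and names are my own guesses at the technique, not code from the answerer:

/* Sketch: count '\n' two bytes at a time via a 65536-entry table. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned char table[1 << 16];   /* table[v] = number of '\n' bytes in the 2-byte value v */

int main(void){
  for(unsigned v = 0; v < (1u << 16); v++)
    table[v] = ((v & 0xff) == '\n') + ((v >> 8) == '\n');

  static char buf[1 << 16];             /* arbitrary 64 KiB chunk size */
  size_t count = 0;
  ssize_t n;
  while((n = read(0, buf, sizeof buf)) > 0){
    ssize_t i = 0;
    for(; i + 1 < n; i += 2){
      uint16_t v;
      memcpy(&v, buf + i, 2);           /* safe unaligned 2-byte load; symmetric, so endianness doesn't matter */
      count += table[v];
    }
    if(i < n && buf[i] == '\n')         /* odd trailing byte */
      count++;
  }
  if(n < 0){ perror("read"); return 1; }
  printf("%zu\n", count);
  return 0;
}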

You can improve on the solution suggested by @pskocik by reducing the number of calls to read. There are a lot of calls to read BUFSIZ chunks from a 1GB file. The usual approach is to increase the buffer size:

  • just for fun, try increasing the buffer size by a factor of 10. Or 100. On my Debian 7, BUFSIZ is 8192. With the original program, that's 120 thousand read operations. You can probably afford a 1MB input buffer to reduce that by a factor of 100 (see the sketch after this list).
  • for a better approach, an application may allocate a buffer as large as the file, requiring a single read operation. That works well enough for "small" files (though many readers have more than 1GB of memory on their machine, so even a 1GB file qualifies).
  • finally, you could experiment with memory-mapped I/O, which takes care of the buffer allocation for you.
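
As an illustration of the first suggestion, here is a minimal variant of the read() loop from the previous answer using a 1 MiB heap buffer (the size and names are my own arbitrary choices, not from this answer; measure on your own system before assuming it helps):

/* Sketch: same newline counting as the first answer, but with a larger
 * (1 MiB) heap-allocated buffer to cut down the number of read() calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void){
  size_t bufsize = 1 << 20;              /* 1 MiB instead of BUFSIZ */
  char *buf = malloc(bufsize);
  if(!buf){ perror("malloc"); return 1; }

  size_t nfound = 0;
  ssize_t nread;
  while((nread = read(0, buf, bufsize)) > 0){
    char const *p = buf;
    while((p = memchr(p, '\n', nread - (p - buf)))){
      nfound++;
      p++;
    }
  }
  if(nread < 0){ perror("read"); free(buf); return 1; }
  printf("%zu\n", nfound);
  free(buf);
  return 0;
}

Compile and run it the same way as wcl above; as the comment below this answer notes, larger buffers do not necessarily win, so benchmark the change.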

When benchmarking the various approaches, you might keep in mind that some systems (such as Linux) use most of your machine's unused memory as a disk cache. A while back (almost 20 years ago, mentioned in the vile FAQ), I was puzzled by unexpectedly good results from a (not very good) paging algorithm which I had developed to handle low-memory conditions in a text editor. It was explained to me that it ran fast because the program was working from the memory buffers used to read the file, and that only if the file were re-read or written would there be a difference in speed.

The same applies to mmap (in another case still on my to-do list to incorporate into an FAQ, a developer reported very good results in a scenario where the disk cache was the actual reason for improvement). Developing benchmarks takes time and care to analyze the reasons for the good (or bad) performance.


Thomas Dickey
  • You're overestimating the influence of buffer sizes above a certain threshold. Typically, increasing the buffer size beyond 4KB-ish doesn't help much, and may in fact be detrimental because it may push the buffer out of L1 cache. On my machine, testing with dd, using 1MB buffers is slower than 8KB. The 8KB default value for wc is actually chosen pretty well; it will be close to optimal for a large range of systems. – marcelm Mar 02 '16 at 11:06