3

Given a file with m lines, how do I get the n-th line? m can sometimes be smaller than n. I have tried:

method1: sed -ne '10p' file.txt
method2: sed -ne '10p' <file.txt
method3: sed -ne '10{p;q;}' file.txt
method4: awk 'NR==10' file.txt

on LeetCode's https://leetcode.com/problems/tenth-line/. method1 beats the others, and I don't know why; I would expect method3 to be faster.

Are there faster ways?
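(A generalized form of method3 for an arbitrary line number, as a sketch assuming bash and GNU sed; the n variable is just a placeholder, not part of the methods above:)

    n=10                              # line number to extract
    sed -ne "${n}{p;q;}" file.txt     # print line n, then quit so the rest of the file is never read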

Updates:

Following @skwllsp's suggestion, I ran some commands. The results (instruction counts) are:

instructions   command
 428,160,537   perf stat sed -ne '10p' file.txt
 427,426,310   perf stat sed -ne '10p' <file.txt
   1,033,730   perf stat sed -ne '10{p;q;}' file.txt
   1,111,502   perf stat awk 'NR == 10 { print ; exit ;} ' file.txt

method4 has been updated according to @Archemar's answer,

and

     777,525   perf stat tail -n +10 file.txt | head -n 1

which is far fewer instructions than method1.
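The same combination generalized to an arbitrary line number (again a sketch with a placeholder variable N, assuming GNU tail and head):

    N=10
    tail -n +"$N" file.txt | head -n 1    # tail starts at line N; head exits after one line, which stops tail early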

frams
  • The link you have provided requires a subscription to see the stuff :/P – sjsam Apr 26 '16 at 04:27
  • @sjsam: Sorry, I pasted the results page URL. I have changed it to the right one. – frams Apr 26 '16 at 04:31
  • If you feed sed a gigantic file, method 3 will likely stay roughly constant and method 1 will get progressively slower. How big was the file you tested? – Wildcard Apr 26 '16 at 04:33
  • @Wildcard: There are 7 tests. It's a black box to me. The first test case has only 10 lines. – frams Apr 26 '16 at 04:43
  • @frams: I believe the first and second methods make no difference. I have just started a thread here. You may want to stay updated with it. – sjsam Apr 26 '16 at 05:25
  • @sjsam: Thank you very much! But after some tests, it appears that method1 took more time than method2, both on LeetCode and in my terminal. – frams Apr 26 '16 at 05:48
  • @Archemar: useful link. Can it be improved to output only one line? – frams Apr 26 '16 at 06:37

2 Answers

3

Let's measure your tests to see how many instructions each method executes. I created my own file with seq 2000000 > 2000000.txt and want to find out which method is the fastest.


$ perf stat sed -ne '10p' 2000000.txt 
10

 Performance counter stats for 'sed -ne 10p 2000000.txt':

        203.877247 task-clock                #    0.991 CPUs utilized          
                 5 context-switches          #    0.025 K/sec                  
                 3 cpu-migrations            #    0.015 K/sec                  
               214 page-faults               #    0.001 M/sec                  
       405,075,423 cycles                    #    1.987 GHz                     [50.20%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
       838,221,677 instructions              #    2.07  insns per cycle         [75.20%]
       203,113,013 branches                  #  996.251 M/sec                   [74.99%]
           766,918 branch-misses             #    0.38% of all branches         [75.16%]

       0.205683270 seconds time elapsed

So the first method executes 838,221,677 instructions.


$ perf stat sed -ne '10{p;q;}' 2000000.txt 
10

 Performance counter stats for 'sed -ne 10{p;q;} 2000000.txt':

          1.211558 task-clock                #    0.145 CPUs utilized          
                 2 context-switches          #    0.002 M/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               213 page-faults               #    0.176 M/sec                  
         1,633,950 cycles                    #    1.349 GHz                     [23.73%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
           824,789 instructions              #    0.50  insns per cycle        
           164,935 branches                  #  136.135 M/sec                  
            11,751 branch-misses             #    7.12% of all branches         [83.24%]

       0.008374725 seconds time elapsed

So the third method executes only 824,789 instructions. It is much better than the first method, because q makes sed quit right after printing line 10 instead of reading the rest of the file.


The improved fourth method:

$ perf stat awk 'NR == 10 { print ; exit ;} ' 2000000.txt 
10

 Performance counter stats for 'awk NR == 10 { print ; exit ;}  2000000.txt':

          1.357354 task-clock                #    0.162 CPUs utilized          
                 2 context-switches          #    0.001 M/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               282 page-faults               #    0.208 M/sec                  
         1,777,749 cycles                    #    1.310 GHz                     [11.54%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
           919,636 instructions              #    0.52  insns per cycle        
           185,695 branches                  #  136.807 M/sec                  
            11,218 branch-misses             #    6.04% of all branches         [91.64%]

       0.008375258 seconds time elapsed

A little worse than the third method (919,636 vs. 824,789 instructions), but still essentially as efficient.


You can repeat the same tests with your own file to see which method is the best.
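For example, a small loop that measures every candidate against the same file (a sketch, assuming bash and a working perf install; the sh -c wrapper adds a small constant overhead to each measurement):

    file=2000000.txt
    for cmd in "sed -ne '10p' $file" \
               "sed -ne '10{p;q;}' $file" \
               "awk 'NR == 10 { print ; exit }' $file" \
               "tail -n +10 $file | head -n 1"; do
        echo "== $cmd =="
        # perf prints its counters on stderr, so merge the streams before filtering
        perf stat -e instructions sh -c "$cmd" 2>&1 | grep -E 'instructions|elapsed'
    done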


A measurement for the second method:
$ perf stat sed -ne '10p' <2000000.txt 
10

 Performance counter stats for 'sed -ne 10p':

        203.278584 task-clock                #    0.998 CPUs utilized          
                 1 context-switches          #    0.005 K/sec                  
                 3 cpu-migrations            #    0.015 K/sec                  
               213 page-faults               #    0.001 M/sec                  
       403,941,976 cycles                    #    1.987 GHz                     [49.84%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
       835,372,994 instructions              #    2.07  insns per cycle         [74.92%]
       203,327,145 branches                  # 1000.239 M/sec                   [74.90%]
           773,067 branch-misses             #    0.38% of all branches         [75.35%]

       0.203714402 seconds time elapsed

It is as bad as the first method: without q, sed still reads the whole file even though only line 10 is printed.

0

For awk:

 awk 'NR == 10 { print ; exit ;} ' file.txt

I think perl is faster; there was a similar question here about one year ago.
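Something along these lines (a sketch of the usual perl idiom, not taken from that question; $. is perl's input line counter):

    perl -ne 'if ($. == 10) { print; last }' file.txt    # last stops reading after line 10, like sed's q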

see also

  1. cat line X to line Y on a huge file
  2. How can I get a specific line from a file?
Archemar