
I have a list of directories and subdirectories that contain large csv files. There are about 500 million lines across these files; each line is a record. I would like to know

  1. How many lines are in each file.
  2. How many lines are in each directory.
  3. How many lines in total

Most importantly, I need this in 'human readable format', e.g. 12,345,678 rather than 12345678

It would be nice to learn how to do this in three ways: plain-vanilla bash tools, awk etc., and perl (or python).

Hexatonic

8 Answers


How many lines are in each file.

Use wc. Originally short for "word count", it can count lines, words, characters, bytes, and the longest line length. The -l option tells it to count lines.

wc -l <filename>

This will output the number of lines in <filename>:

$ wc -l /dir/file.txt
32724 /dir/file.txt

You can also pipe data to wc:

$ cat /dir/file.txt | wc -l
32724
$ curl google.com --silent | wc -l
63

How many lines are in each directory.

Try:

find . -name '*.pl' | xargs wc -l

Another one-liner:

( find ./ -name '*.pl' -print0 | xargs -0 cat ) | wc -l

Note that wc counts newline characters, not lines: if the last line of a file does not end with a newline, that line will not be counted.

You may use grep -c ^ instead; full example:

#This example prints the line count for every found file, plus a grand total.
total=0
while read -r FILE; do
     #grep -c ^ also counts a final line that has no trailing newline
     count=$(grep -c ^ < "$FILE")
     echo "$FILE has $count lines"
     total=$((total + count))
done < <(find /path -type f -name "*.php")
#Process substitution (bash-only) keeps the loop out of a subshell,
#so $total survives the loop; piping find into while would discard it.
echo "TOTAL LINES COUNTED: $total"
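For the awk route the question mentions, a portable sketch (no GNU extensions) that prints a per-file count and a grand total in one pass; note that if xargs has to split a huge file list across several awk invocations, you will get several totals:

```shell
# Count lines per file and in total with one awk pass.
# -print0 / -0 keep filenames with spaces intact.
find /path -type f -name '*.php' -print0 |
  xargs -0 awk '
    { c[FILENAME]++ }                     # one counter per input file
    END {
      for (f in c) { print f ": " c[f]; t += c[f] }
      print "total: " t
    }'
```

Files with zero lines will not appear in the per-file listing, since awk never reads a record from them.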

How many lines in total

If I understood your request correctly, this will output results in the following format, showing the number of lines for each file:

# wc -l `find /path/to/directory/ -type f`
 103 /dir/a.php
 378 /dir/b/c.xml
 132 /dir/d/e.xml
 613 total

Alternatively, to output just the total number of newline characters without the file-by-file counts, the following command can prove useful:

# find /path/to/directory/ -type f -exec wc -l {} \; | awk '{total += $1} END{print total}'
 613

Most importantly, I need this in 'human readable format' eg. 12,345,678 rather than 12345678

Bash has a printf function built in:

printf "%0.2f\n" $T
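That format rounds to two decimals, though; for the comma-grouped output the question asks for, the %'d conversion (note the apostrophe) inserts the locale's thousands separator. A sketch, assuming /path/to/directory as a placeholder and a locale such as en_US.UTF-8 with comma grouping (the C locale prints no separators):

```shell
# Grand total of lines under a directory, printed with thousands separators.
# Piping everything into a single wc avoids per-batch totals from xargs.
total=$(find /path/to/directory -type f -print0 | xargs -0 cat | wc -l)
LC_NUMERIC=en_US.UTF-8 printf "%'d\n" "$total"
```

The grouping character itself comes from LC_NUMERIC, so the exact output depends on which locales are installed.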

As always, there are many different methods that could be used to achieve the same results mentioned here.

malyy
  • By the way, how do I use printf in your examples? I tried to pipe to it from wc -l, but it didn't work. – Hexatonic Feb 08 '16 at 02:12
  • try

    find . -name '*.pl' | xargs wc -l | awk '{printf ("%0.2f ", $1) } {print $2}'

    change the output of 'printf' for your needs

    – malyy Feb 08 '16 at 15:34
  • This doesn't add commas to the number to make it more human readable though. It just adds zeros to the end. – Hexatonic Feb 08 '16 at 16:05
    echo 1000000000000 | xargs printf "%'d\n" outputs 1,000,000,000,000 – Hexatonic Feb 08 '16 at 16:07
  • this (printf "%'d\n") is what I meant by "change the output of 'printf' for your needs" ;) – malyy Feb 10 '16 at 09:52
    @Hexatonic printf doesn't read its arguments from stdin, but rather from the command line (compare piping to echo vs piping to cat; cat reads from stdin, echo doesn't). Instead, use printf "$(find ... | xargs ...)" to supply the output as arguments to printf. – BallpointBen Aug 21 '18 at 22:02
  • n.b. xargs is limited to a certain size and thus you can get spurious results. Unfortunately only the GNU wc command takes a stream/file input for file listings. See my answer. – Adam Gent Oct 30 '19 at 15:32
  • This doesn't seem to work for files with spaces in the file name. – Aaron Franke Apr 15 '22 at 20:01
  • Although this was not asked in this question, if someone is looking to list the 5 largest files, I used : find . -name '*.pl' | xargs wc -l | sort -k1 -n -r | head -n 5 – Raphaël Mar 31 '23 at 08:47

In many cases, combining the wc command and the wildcard * may be enough.
If all your files are in a single directory, you can call:

wc -l src/*

You can also list several files and directories:

wc -l file.txt readme src/* include/*

This command will show a list of the files and their number of lines.
The last line will be the sum of the lines from all files.


To count all files in a directory recursively:

First, enable globstar by adding shopt -s globstar to your .bash_profile. Support for globstar requires Bash ≥ 4.x, which macOS users can install with brew install bash if needed. You can check your version with bash --version.

Then run:

wc -l **/*

Note that this output will be incorrect if globstar is not enabled.

Thomio
    And for counting files in the current directory recursively: wc -l **/* – Taylor D. Edmiston Mar 31 '18 at 18:35
  • @TaylorEdmiston For me (on Mac) that only counts the files exactly one directory down. It skips the files in the current directory, and for any instance that would be more than one directory deep it warns that it's a directory: "wc: parent_dir/child_dir: read: Is a directory" – M. Justin Sep 16 '18 at 05:56
    @Thomio It requires globstar to be enabled. On macOS, I believe it is disabled out of the box. I've just sent an edit to your answer that adds the command and how to enable globstar. – Taylor D. Edmiston Sep 17 '18 at 17:00
    That is a quality answer. – Bluebaron Mar 20 '20 at 14:26

This command will give the line count of each file under the current directory:

find . -name '*.*' -type f | xargs wc -l
peterh
Suresh.A
    On *nix systems file names don't necessarily have a dot . in them. This command will find files like output.txt and output.thu.txt but will not include files with names like output. Using `-name '*'` would work, but then there's no point in using `-name` at all. The command would be `find . -type f | xargs wc -l` – Stephen P Oct 10 '23 at 17:15

I will just augment @malyy's answer for the following (too big for a comment):

How many lines in total

Many answers pass file names to wc as command-line arguments via xargs. The problem with this is that xargs is limited to a rather small, platform-dependent argument size.

Furthermore, there is a difference between BSD (macOS) and GNU (Linux/Homebrew) wc.

The GNU one is ideal because it can read the file listing from a file instead of arguments (--files0-from).

If you are on mac and have homebrew you should do the following:

find . -name "*.pl" -print0 | gwc -l --files0-from=-

Notice the gwc instead of wc.

Adam Gent
    Found the above to be effective. Used a slightly different flavor:

    find . -type f -name "*.cs" -print0 | wc -l --files0-from=-

    – NathanAW Sep 26 '22 at 13:55
  • @NathanAW I found this too reading the man and it works perfectly. I use it to count the total line of code of project. – XouDo Jun 16 '23 at 08:26

A bit late to the game, but I got a bunch of argument errors with the above due to the size of the directory. This worked for me (while read, unlike for i in $(find …), keeps filenames with spaces intact):

find . -type f | while read -r i; do wc -l "$i"; done >> /home/counts.txt


cat combines the files into one stream and writes everything to stdout; you can then run wc -l on that for a total count of the lines of the files in a directory:

cat /path/to/directory/* | wc -l
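None of the answers above addresses part 2 of the question (lines per directory). A Python sketch of the counting idea that reports per-file, per-directory, and grand totals, comma-grouped via the f-string , specifier (Python ≥ 3.6; /path/to/directory is a placeholder):

```shell
# Per-file, per-directory, and grand totals with thousands separators.
python3 - /path/to/directory <<'EOF'
import os, sys

root = sys.argv[1]
grand = 0
for dirpath, _, filenames in os.walk(root):
    dir_total = 0
    for name in sorted(filenames):
        path = os.path.join(dirpath, name)
        with open(path, "rb") as f:    # binary mode: immune to encoding errors
            n = sum(1 for _ in f)      # a final unterminated line is counted too
        print(f"{path}: {n:,}")
        dir_total += n
    print(f"{dirpath}/ (directory): {dir_total:,}")
    grand += dir_total
print(f"total: {grand:,}")
EOF
```

Unlike wc, this counts a last line that lacks a trailing newline, so totals can differ by one per file.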

If you need only the total lines from all files, use grep:

find . -name '*.csv' | xargs wc -l | grep total

This command works too

wc -l *.csv | grep total
  • (1) This will fail if any filenames contain whitespace or other special characters. (2) While xargs is a handy program, it’s not magic. The kernel imposes a limit on how long a command line can be; xargs cannot magically exceed that limit. Therefore, if there’s too much input, xargs will execute the specified command (wc -l, in your example) multiple times. So you could, in theory, get output like 1 file0, 2 file1, 4 file2, 7 total, 8 file3, 16 file4, 24 total. (2b) Worse, if (hypothetically) the command-line limit is so small that xargs can handle  … (Cont’d) – G-Man Says 'Reinstate Monica' Jul 25 '22 at 13:09
  • (Cont’d) …  only 1000 files at a time, and you have 6001 files, then it will run wc -l seven times; once for files 1-1000, once for files 1001-2000, etc., … and the seventh run will get just one file: file6001.  Since that invocation of wc has only one argument, it will not produce a “total” line.  (There’s a simple trick to work around this. Do you know it? Can you think of it?)  (3) Your grep will capture files that have “total” in their names.  (4) Like most of the other answers, you have ignored the requirement to insert thousands separators (,) every third digit. – G-Man Says 'Reinstate Monica' Jul 25 '22 at 13:09
  • Thanks for your comment! It is very useful! – Viktor Efimoff Jul 30 '22 at 08:25

On the theme of “another way to do it, not necessarily the best way”, how about:

find /path/to/directory -name '*' -exec wc {} +

This is OK for whitespace in filenames but still suffers from the “too many files” problem.

  • Using -name '*' in find is pointless. It produces exactly the same list as find /path/to/directory — plus your "find" includes directories, causing Is a directory errors from wc. And wc without args lists 3 numbers per file; lines, words, and characters. You want -type f on find to get only files without directories, and -l on wc to count only lines: find . -type f -exec wc -l {} + – Stephen P Oct 10 '23 at 17:28