
I am trying to create a loop to sort the files within each of many directories by size, then copy the largest two to another location, keeping the directory structure (shown below).

folder/sample 1  
       .../s1.fastq.gz  
       .../s2.fastq.gz  
       .../s3.fastq.gz  
       .../s4.fastq.gz  
folder/sample 2  
       .../s1.fastq.gz  
       .../s2.fastq.gz  
       .../s3.fastq.gz  
       .../s4.fastq.gz  

I'm new to Linux, so I'm struggling. I tried:

#!/bin/bash
mkdir newfolder

for dir in folder/*
do
echo $dir
ls -S $dir/*.gz | head -n +2 | cp -T newfolder

done

However, I get the following error.

cp: missing destination file operand after 'newfolder'

How do I correctly feed the largest files into cp?

I've also tried using xargs, but I get the error

xargs: invalid option -- 'w'

because I am not correctly feeding in one line at a time.

3 Answers


zsh would be a much better choice of shell for that than bash:

#! /bin/zsh -
ret=0
for dir (folder/*(/)) {
  two_largest_files=($dir/*.gz(N.OL[1,2]))
  if (($#two_largest_files)) {
    mkdir -p newfolder/$dir:t &&
      cp -v $two_largest_files newfolder/$dir:t/ || ret=$?
  }
}
exit $ret

(Note that -v, for verbose output, is not supported by all cp implementations; replace it with (set -x; cp $two...) if yours doesn't support it.)
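For reference, the glob qualifiers doing the work here are N (nullglob: expand to nothing rather than raise an error when nothing matches), . (regular files only), OL (order by file length, i.e. size, descending) and [1,2] (keep only the first two matches). A quick demo in a scratch directory (file names and sizes are made up):

```shell
zsh -c '
  cd "$(mktemp -d)"
  head -c 300 /dev/zero > big.gz     # 300 bytes
  head -c 200 /dev/zero > mid.gz     # 200 bytes
  head -c 100 /dev/zero > small.gz   # 100 bytes
  print -rl -- *.gz(N.OL[1,2])       # the two largest: big.gz, mid.gz
'
```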


There are two issues with your code:

  1. Never, ever try to parse the output of ls; use stat instead.
  2. When there are many files, or filenames contain "funny" characters (such as the space in "sample 1"), use find and xargs. Refer to man find and man xargs for further info.
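Point 1 is easy to demonstrate: word-splitting the output of ls mangles names containing spaces, while a NUL-delimited find keeps them intact. A throwaway demo in a temporary directory (the file names are made up):

```shell
cd "$(mktemp -d)"                       # scratch directory
touch 'sample 1.gz' 'sample 2.gz'       # names with spaces, as in the question

for f in $(ls); do printf 'ls gave: %s\n' "$f"; done
# word-splits into four tokens: "sample", "1.gz", "sample", "2.gz"

find . -type f -print0 | xargs -0 -n1 printf 'find gave: %s\n'
# two intact file names, spaces and all
```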

Do something like:

mkdir newdir

find . -type f -name '*.gz' -print0 |
  xargs -0 -r stat --printf='%s:%n\0' |
  sort -rnz |
  head -zn 2 |
  cut -zd: -f2- |
  xargs -0 -r cp -t newdir

Warning! Untested code (I'm on my phone). Replace the last line with

xargs -0 -r echo cp -t newdir

until it works.

For the curious, see https://mywiki.wooledge.org/ParsingLs
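The pipeline above copies the overall two largest files into one flat directory. If you also want the per-sample layout from the question, the same idea fits in a loop; here is a bash sketch assuming GNU find, sort, head and cut, where "folder" and "newfolder" are the names from the question and find's -printf stands in for stat:

```shell
#!/bin/bash
for dir in folder/*/; do                 # each sample directory
  name=$(basename "$dir")
  mkdir -p "newfolder/$name"
  # Emit "SIZE<TAB>PATH" records, NUL-terminated end to end:
  find "$dir" -maxdepth 1 -type f -name '*.gz' -printf '%s\t%p\0' |
    sort -znr  |                         # biggest size first, numerically
    head -zn 2 |                         # keep the two largest
    cut -zf2-  |                         # drop the size column
    xargs -0 -r cp -t "newfolder/$name"
done
```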


This is quite complex. First, you shouldn't parse the output of ls: filenames can contain newlines (and other whitespace), so things get messy. It's best to use NUL as the record delimiter throughout the pipeline. This is an example:

for dir in folder/*
do
    echo "$dir"
    find "$dir" -type f -exec du -h0 {} + | sort -hrz | head -zn 2 |
        sed -z 's/^[^[:space:]]*[[:space:]]*//' | xargs -0I@ cp -v @ newfolder
done
  1. find finds the files in the given "$dir" (note the quotes). It also applies du to all of them to get their sizes.
  2. sort sorts the results by size.
  3. head limits to top 2.
  4. sed gets rid of the size value before the file name.
  5. xargs builds the actual command using the arguments from the pipeline.

A NUL delimiter normally has to be requested in every command, hence the z flags in sort, head and sed, and the 0 in du and xargs; du's -0 is what terminates each record with a NUL in the first place.
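As a quick illustration of the NUL-delimited stages, here is the sort/head part fed with fake du-style records (sizes and names below are made up; cut is shown as an alternative to the sed step):

```shell
printf '%s\t%s\0' 1.2M big.gz 900K small.gz 8.0K tiny.gz |
  sort -hrz |          # largest human-readable size first
  head -zn 2 |         # keep the top two records
  cut -zf2- |          # drop the size column (tab-delimited)
  xargs -0 -n1 echo
# prints big.gz, then small.gz
```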

(I don't know why you use cp's -T flag, which treats the destination as a regular file rather than a directory. My example drops it and uses -v instead, to give feedback.)