
I have a large number of files in a directory in a linux server with this name pattern:

1_file.txt
2_file.txt
3_file.txt
...
1455728_file.txt

Is there a way to move the first 100000 files (1_file.txt to 100000_file.txt) into directory 1_100000, the second 100000 files (100001_file.txt to 200000_file.txt) into directory 100001_200000, and so on ... ?

Bahram
    Yes. You can do this using a shell script that uses a for loop to mv each file into the correct location. Alternatively, it might be easier to pipe the output from ls into split -l 100000 in order to generate the directories you want. Maybe someone else will come along and write a one-liner for you. – sjy Sep 15 '20 at 00:02
  • You are more likely to get help if you show your attempts or at least sketch one, as sjy has kindly done above. – Quasímodo Sep 15 '20 at 00:37
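Following sjy's suggestion in the comments, here is a rough sketch of the split-based approach (untested at full scale; the organize name and chunk_ prefix are made up for illustration):

```shell
#!/bin/sh
# Sketch of the split-based idea from the comments, untested at the
# full 1.4M-file scale. "organize" and the chunk_ prefix are names
# invented here. Run it from inside the directory holding the files.
organize() {
    batch=${1:-100000}
    # Sort names numerically, keep only the data files, and cut the
    # listing into batch-sized chunks named chunk_aa, chunk_ab, ...
    ls -v | grep '_file\.txt$' | split -l "$batch" - chunk_
    for c in chunk_*; do
        first=$(head -n 1 "$c")
        last=$(tail -n 1 "$c")
        # Build the directory name from the numeric prefixes of the
        # first and last entries in this chunk, e.g. 1_100000.
        dir="${first%%_*}_${last%%_*}"
        mkdir -p "$dir"
        xargs mv -t "$dir" < "$c"
        rm "$c"
    done
}
```

Note that the last directory ends up named after the highest file actually present (e.g. 1400001_1455728) rather than a rounded-up bound, and mv -t is a GNU coreutils option, which should be fine on a Linux server.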

4 Answers


Untested

I would do something like:

#!/bin/bash
bottom=0
while [[ $bottom -lt 1500000 ]] ; do
    myfirst=$((bottom + 1))
    mylast=$((bottom + 100000))
    bottom=$((bottom + 100000))
    dir="${myfirst}_$mylast"
    [[ -d "$dir" ]] || mkdir "$dir"
    seq "$myfirst" "$mylast" |
        while read -r p ; do
            q="${p}_file.txt"
            [[ -f "$q" ]] && echo "$q"
        done |
        xargs --no-run-if-empty echo mv -t "$dir"
done

Remove the echo from echo mv when you want to do it for real.

waltinator

script.sh

#!/bin/bash

step=100000
file_dir=$1

# Counting of files in the directory
shopt -s nullglob
file_list=("${file_dir}"/*)
file_num=${#file_list[@]}

# Every file's common part
suffix='_file.txt'

for ((from = 1, to = step; from <= file_num; from += step, to += step)); do
    new_dir="${from}_${to}"
    mkdir "${file_dir}/${new_dir}"

    if ((to > file_num)); then
        to="$file_num"
    fi

    # Generating filenames by `seq` command and passing them to `xargs`
    seq -f "${file_dir}/%.f${suffix}" "$from" "$to" | xargs mv -t "${file_dir}/${new_dir}"
done

Usage: ./script.sh files

Testing

I have generated files by this command:

printf '%s\0' files/{1..1455728}_file.txt | xargs -0 touch

then do:

$ time ./script.sh files

Time is:

real    10m43,618s
user    0m9,953s
sys     0m19,671s

Quite slow.

Result

$ ls -1v files
1_100000
100001_200000
200001_300000
300001_400000
400001_500000
500001_600000
600001_700000
700001_800000
800001_900000
900001_1000000
1000001_1100000
1100001_1200000
1200001_1300000
1300001_1400000
1400001_1500000
MiniMax

Arithmetic is possible in the shell, but it's always awkward, so I recommend you look for another scripting language to do most of the work here. The following uses awk, but you could use perl equally well. I'd like to be able to say that you could also use python easily in the example below, but aspects of python's syntax make it not obvious how to embed a python script in-line into a pipeline like this. (It can be done, but it's irritatingly tricky.) Note that I don't use awk to perform the actual moves, just to do the calculation needed to produce the needed destination directory. If you use perl or python, they can perform the filesystem operations as well.

Some assumptions:

  • You want to move the file with its full original name. It's not much harder to modify the script to strip off the numeric prefix of the original (although then it had better be the case that the files don't all end in _file.txt).

  • There is only a single _ and no spaces in the filenames. If that's not true, something like the following can still work but you need to be more careful in the awk script and following shell loop.

So, given those, the following should work.

ls | 
awk -F_ '
{
    n = $1 - 1               # working zero based is easier here
    base = n - (n % 100000)  # round down to the nearest multiple of 100,000
    printf "%d_%d %s_%s\n", base + 1, base + 100000, $1, $2
}' |
while read destdir orig
do
    mkdir -p $destdir && mv $orig $destdir
done

So, what's going on here?

ls | ...

This just lists the filenames, and because the output is going to a pipe and not the terminal, it lists them one per line. The files will be sorted by ls's default order, but the rest of the script doesn't care about that and will work fine with a randomized list of filenames.

... | awk -F_ '
{
    n = $1 - 1               # working zero based is easier here
    base = n - (n % 100000)  # round down to the nearest multiple of 100,000
    printf "%d_%d %s_%s\n", base + 1, base + 100000, $1, $2
}' | ...

This is not complicated, but if you haven't played with awk before it's a bit tricky to understand. First, the goal here is to read the filenames one at a time from ls, and then for each filename produce an output line with two fields: the first field with the appropriate destination directory for the original filename, and the second field passing on the original filename so the following part of the pipeline can use it. So, in more detail:

  • The -F_ flag to awk tells it to split each input line into fields on the _ character. Assuming that _ occurs only once in these filenames, awk will assign $1 to the numeric part of the name, and $2 to all the text after the _. Then, the braced block is applied with $1 and $2 set as just described.

  • The calculation of base identifies which block of 100000 files this file belongs in. First, calculate n by subtracting 1 from the initial number of the filename. This zero-bases the number, which makes it easier to work with the modular arithmetic used in the next line. Next, round n down to the nearest multiple of 100,000. If n is already a multiple of 100,000 it is left undisturbed. (If you're not familiar with the % operator: N % M computes the remainder when N is divided by M. So, 5 % 3 == 2, 6 % 3 == 0, and so on.)

  • Finally, the printf assembles the output line necessary for the following stage of the pipeline. It produces a line with two fields, separated by a space. The first is the name of the destination directory, generated by using base to derive the upper and lower bound parts of the directory name; it's here that we move back into a 1-based counting scheme for output. The second field is the reconstructed original input filename.
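You can see what the awk stage emits by feeding it a few sample names by hand:

```shell
# Feed the awk stage three sample names; each output line pairs the
# destination directory with the reconstructed original filename.
printf '%s\n' 1_file.txt 100000_file.txt 100001_file.txt |
awk -F_ '
{
    n = $1 - 1
    base = n - (n % 100000)
    printf "%d_%d %s_%s\n", base + 1, base + 100000, $1, $2
}'
```

This prints:

1_100000 1_file.txt
1_100000 100000_file.txt
100001_200000 100001_file.txt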

... | while read destdir orig
do
    mkdir -p $destdir && mv $orig $destdir
done

This is the final stage of the pipeline, and actually does all the moves. It reads each line produced by the awk script as two fields, and then

  • it ensures the directory exists, using mkdir -p (which does nothing if the directory already exists),
  • and if that succeeds, it moves the original file to the new directory.

It's often a good idea to use the mkdir ... && mv ... pattern in shell scripts, because if mkdir fails for any reason, the rename is not attempted.
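A quick way to see the guard in action (the file and directory names below are throwaway examples; the plain file deliberately blocks the mkdir):

```shell
# A plain file already occupying the target name makes mkdir -p fail,
# and the && then stops the mv from ever running.
cd "$(mktemp -d)"
touch blocked_dir somefile.txt
mkdir -p blocked_dir 2>/dev/null && mv somefile.txt blocked_dir ||
    echo "mkdir failed, mv skipped"
```

Here mkdir -p fails because blocked_dir exists as a regular file, so somefile.txt stays where it was and the fallback message is printed instead.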

This pattern of multiple pipeline stages, each incrementally transforming the data in some simple but useful way, is a very effective way of writing many sorts of shell scripts. It plays to the shell's strengths in process and pipeline control, while allowing you to push the more complex calculations that the shell isn't good at into the more appropriate languages.


Adapted from my answer to your related question:

#! /bin/zsh -

zmodload zsh/files # makes mv and a few other file manipulation commands builtin

batch=100000

highest=(<1->_file.txt(n[-1]))
highest=${highest%%_*}

for ((start = 1; start <= highest; start += batch)); do
  ((end = start + batch - 1))
  files=(<$start-$end>_file.txt(N))
  if (($#files)); then
    mkdir -p ${start}_${end} || exit
    mv -- $files ${start}_${end}/ || exit
  fi
done