
I have many files in a folder Main which are named like these:

2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz
2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz  2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz
2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz
2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz
2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz
2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz

The first 10 characters show the date, and the digits that follow give the time in 24-hour format (hour, then minute). The rest is file detail which we can ignore.

I want to create folders within the Main folder based on the date in the filename, and then another folder inside each date folder based on the hour in the filename. Eventually I want to move the files from the Main folder into the respective hour folder.

Main -> Date -> hh -> file.csv.gz

For example, the file 2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz in the Main folder should end up at the path Main/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz

Can you please help with a bash script that groups the files into folders as described above?

2 Answers


Using the perl rename utility:

Note: perl rename is also known as file-rename, perl-rename, or prename. It is not to be confused with the rename utility from util-linux, which has completely different and incompatible capabilities and command-line options. perl rename is the default rename on Debian. IIRC, on CentOS it's in the prename package and the command should be run as prename rather than rename.

$ rename -n 'if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {
               my ($date,$hour) = ($1,$2);
               my $dir = "./$date/$hour/";
               mkdir $date;
               mkdir $dir;
               s=^=$dir=
             }' *
rename(2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz, ./2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz)
rename(2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz)
rename(2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz)

The -n option is a dry run: it only shows what rename would do, without actually doing it. Remove it (or replace it with -v for verbose output) when you're sure the rename script is going to do what you want.

The script works by first extracting the date and hour portions of each filename (skipping any filenames that don't match). It then creates the directories for the date and date/hour, and renames the file into them.
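
For readers without perl rename, the same three steps (extract, create directories, move) can be sketched in plain shell. This is only an illustration, not part of the rename-based answer: organize_by_date_hour is a hypothetical helper name, and it assumes the filenames always start with YYYY_MM_DD_HH as in the question.

```shell
# Sketch only: organize_by_date_hour is a hypothetical helper, not part of rename.
# Moves files named YYYY_MM_DD_HH... from DIR into DIR/YYYY_MM_DD/HH/.
organize_by_date_hour() (
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip directories and an unmatched glob
        name=${f##*/}                  # strip the leading directory
        case $name in
            [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]*) ;;
            *) continue ;;             # skip names that don't match the pattern
        esac
        date=$(printf '%s\n' "$name" | cut -c1-10)    # YYYY_MM_DD
        hour=$(printf '%s\n' "$name" | cut -c12-13)   # HH
        dest=$dir/$date/$hour/$name
        [ -e "$dest" ] && continue     # never overwrite an existing file
        mkdir -p -- "$dir/$date/$hour" # creates the date and hour directories
        mv -- "$f" "$dest"
    done
)

# Try it on a throwaway directory rather than on real data.
demo=$(mktemp -d)
touch "$demo/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz" "$demo/notes.txt"
organize_by_date_hour "$demo"
find "$demo" -type f
```

The function body runs in a subshell, so its variables do not leak into the calling shell. Unlike rename -n, this sketch has no dry-run mode, so test it on a copy first.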

This assumes that the filenames are in the current directory. If they aren't, you'll have to adjust the m// matching regex in the first line AND the s=== substitution regex in the second-last line.


Alternate version using the File::Path perl core module (which is included with perl), instead of using mkdir twice (the make_path function works like the mkdir -p shell command):

$ rename -v 'BEGIN {use File::Path qw(make_path)};
             if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {
               my $dir = "./$1/$2/";
               make_path $dir;
               s=^=$dir=
             }' *
2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz renamed as ./2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz
2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz
2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz
2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz
2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz
2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz
2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz
2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz
2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz
2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz

This isn't really any better than the first version, but it does demonstrate that you can use any perl code, any perl module to rename and/or move files.


Third version, this one uses File::Basename to split the input pathname into $path and $file portions. It can cope with filenames in the current directory, or in any other directory. File::Basename is a core perl module, so is included with perl. It provides three useful functions, basename() and dirname() (which work similarly to the shell tools of the same name), and fileparse() which is what I'm using in this script to extract both the basename and the directory into separate variables.

rename -n 'BEGIN {use File::Path qw(make_path); use File::Basename};
           my ($file, $path) = fileparse($_);
           if ($file =~ m/(\d{4}_\d\d_\d\d)_(\d\d)/) {
             my $dir = "$path/$1/$2";
             make_path $dir;
             $_ = "$dir/$file"
           }' /home/cas/rename-test/*
rename(/home/cas/rename-test/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz, /home/cas/rename-test/2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz)

BTW, it would be trivial to modify this so that it moved the files to a completely different path - just make it do something like my $dir = "/my/new/path/$1/$2"; instead of my $dir = "$path/$1/$2";
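
The path handling in the third version can be illustrated with the shell's own basename and dirname tools. The sketch below only prints the destination path, it moves nothing. dest_for is a hypothetical helper name, and it assumes the file's basename starts with YYYY_MM_DD_HH (it does not validate this).

```shell
# Sketch only: dest_for is a hypothetical helper name.
# Prints the destination for FILE under ROOT (default: FILE's own directory).
dest_for() (
    file=$(basename -- "$1")                      # name without the directory
    root=${2:-$(dirname -- "$1")}                 # optional new top-level path
    date=$(printf '%s\n' "$file" | cut -c1-10)    # YYYY_MM_DD
    hour=$(printf '%s\n' "$file" | cut -c12-13)   # HH
    printf '%s/%s/%s/%s\n' "$root" "$date" "$hour" "$file"
)

dest_for /home/cas/rename-test/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz
# prints /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz

dest_for /home/cas/rename-test/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz /my/new/path
# prints /my/new/path/2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz
```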

The key thing to understand about how the perl rename utility works is that iff the rename script modifies the $_ variable then rename will attempt to rename the file to the new value of $_. If $_ is unchanged, it will not try to rename it. This is why you can use any perl code to rename files - all it has to do is change $_. Most often you'll probably use very simple sed-like rename scripts (e.g. rename 's/ +/_/g' * to rename spaces in filenames to an underscore) but the rename algorithm can be as complex as you need it to be.
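
The "rename only if the name changed" idea can be mimicked in a few lines of shell. This is a rough analogy, not how perl rename is implemented: rename_if_changed is a hypothetical helper, sed stands in for the perl expression, and names containing newlines are not handled.

```shell
# Sketch only: rename_if_changed is a hypothetical helper that imitates the idea.
rename_if_changed() (
    old=$1
    # Compute the candidate name; same effect as the s/ +/_/g example.
    new=$(printf '%s\n' "$old" | sed 's/  */_/g')
    [ "$new" = "$old" ] && return 0     # name unchanged: nothing to rename
    if [ -e "$new" ]; then              # refuse to overwrite an existing file
        echo "not renaming $old: $new exists" >&2
        return 1
    fi
    mv -- "$old" "$new"
)

demo=$(mktemp -d)
touch "$demo/my report  final.txt" "$demo/plain.txt"
rename_if_changed "$demo/my report  final.txt"   # becomes my_report_final.txt
rename_if_changed "$demo/plain.txt"              # left alone
ls "$demo"
```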

$_ is a very important variable in perl - it's used as the default variable to hold input from file handles and iterators for loops if the programmer doesn't specify one. It's also used as the default operand for several operators (like m//, s///, tr///) and as the default argument for many (but not all) functions. See man perlvar and search for $_ (you'll need to escape that in less as \$_).


BTW, one thing I didn't mention about rename earlier is that it can take filenames either as arguments on the command line or from stdin. It defaults to newline-separated input from stdin (so it won't work with filenames that contain newlines - an annoying but completely valid possibility). You can use the -0 argument to make it use NUL separated input instead of newline-separated...so, it can work with any filenames, taking input from anything that can generate a list of NUL-separated filenames (e.g. find ... -print0, but it's probably better to just use find's -exec ... {} + option).
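
To see why NUL separation matters, here is a small sketch that counts how many names arrive intact. It uses xargs -0 as a stand-in consumer (rename may not be installed everywhere); count_names_nul is a hypothetical helper name.

```shell
# Sketch only: count_names_nul is a hypothetical helper name.
# Prints how many file names under DIR arrive as separate arguments
# when find emits them NUL-separated.
count_names_nul() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh
}

demo=$(mktemp -d)
# One ordinary name and one name containing a newline (odd, but valid).
touch "$demo/plain.csv.gz" "$demo/with
newline.csv.gz"

# Newline-separated output splits the second name across two lines,
# so a line-based reader would see 3 "names".
find "$demo" -type f | wc -l

# NUL-separated output keeps each name whole: 2 arguments.
count_names_nul "$demo"
```

With perl rename itself, the answer's suggestion corresponds to piping find ... -print0 into rename -0, or simply using find's -exec ... {} + form.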

rename will also refuse to rename a file over an existing file unless you use its -f or --force option.

cas
  • Thank you @cas. Amazing answer. I was not aware of prename as I'm new to Linux. Could you please explain the substitution regex s=^=$dir= ? And also how the code would change if I were to pass in a full path? Thanks again for the brilliant answer :) – nidooooz Apr 11 '22 at 08:36
  • The substitution regex just inserts the new directory at the start (^) of the filename, which causes rename to rename the file. Obviously, this won't work if the start of the "filename" is actually a path. To change it to cope with full pathnames as input, you'd have to either add the new subdirectory in between the existing path and the file's basename, or (easier) replace the entire path with a newly constructed path string. perl's File::Basename core module would help with this, it can easily split a pathname into dir and basename portions. – cas Apr 11 '22 at 09:23
  • Thank you...... – nidooooz Apr 13 '22 at 02:56
  • Hi @cas, I'm getting the error bash: /bin/find: Argument list too long I'm running the following command for the third version find /home/cas/rename-test/ -type f rename -n 'BEGIN {use File::Path qw(make_path); use File::Basename};my ($file, $path) = fileparse($_);if ($file =~ m/(\d{4}_\d\d_\d\d)_(\d\d)/) {my $dir = "$path/$1/$2";make_path $dir;$_ = "$dir/$file"}' {} \; Hope it works – nidooooz Apr 13 '22 at 03:39
  • With find, you can either pipe the filenames into rename (use -print0 with the find command, and -0 with the rename command for NUL-separated filenames), or you can use find's -exec option (-exec rename ..... {} +). If you use + with -exec, find will try to fit as many filenames as will fit into a max length command line, and will run rename as many times as necessary to process all filenames. If you use -exec ... {} \; instead of -exec ... {} +, it will run rename once per filename. In none of these cases will you ever get an arg list too long error. – cas Apr 13 '22 at 05:14
  • BTW, you seem to have missed the -exec from your find command....I'm assuming that's a copy-paste error. – cas Apr 13 '22 at 05:15
  • ahh yes, find /home/cas/rename-test/ -type f -exec rename -n 'BEGIN {use File::Path qw(make_path); use File::Basename};my ($file, $path) = fileparse($_);if ($file =~ m/(\d{4}_\d\d_\d\d)_(\d\d)/) {my $dir = "$path/$1/$2";make_path $dir;$_ = "$dir/$file"}' {} \; I missed exec while typing :)..Thank you so much – nidooooz Apr 13 '22 at 07:16
  • I'd use + rather than \; to terminate the -exec. Running rename once per several thousand files is much faster than running it once per file (the exact number of files depends on how long each pathname is - Linux currently has a command line length limit, ARG_MAX, of approx 2 million characters. There's a good summary of how it works in the answers to CP: max source files number arguments for copy utility). Running rename has startup overhead each time it's run, which adds up if you're doing it for lots of files. – cas Apr 14 '22 at 01:53
  • In short, it comes down to the difference between running it once to process, for example, 10000 files vs running it 10000 times and processing one file per run. The former will be much faster because it's not wasting so much extra time on startup overhead. – cas Apr 14 '22 at 01:54
  • Hi @cas, thank you for the suggestion. According to the first version I first cd into the folder and run find . -type f -exec prename -n 'if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {my ($date,$hour) = ($1,$2);my $dir = "./$date/$hour/";mkdir $date;mkdir $dir;s=^=$dir=}' {} + is not giving any result. When I run find ./* -type f -exec prename -n 'if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {my ($date,$hour) = ($1,$2);my $dir = "./$date/$hour/";mkdir $date;mkdir $dir;s=^=$dir=}' {} + it gives bash: /bin/find: Argument list too long. Can you please help? – nidooooz Apr 14 '22 at 02:53
  • you're using find ./* instead of find . with the second one. your shell will expand ./* to all files and dirs in the current dir. – cas Apr 14 '22 at 02:57
  • When I do find . the command just runs for a while without giving any result. find . -type f -exec prename -n 'if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {my ($date,$hour) = ($1,$2);my $dir = "./$date/$hour/";mkdir $date;mkdir $dir;s=^=$dir=}' {} + – nidooooz Apr 14 '22 at 03:06
  • 1. do you need to recurse subdirs? do you have many thousands of matching files in the directory? if the answer to both of those is "no", then there's no need to use find. 2. when you use find, the filenames it produces are always prefixed by the file's directory - at minimum, that will be ./. The regexp uses ^ so it only matches files starting with the pattern, but the names from find start with ./. This can never match. Solution: remove ^ from the regex. 3. use the third version, it will work with files in any dir, including current dir. – cas Apr 14 '22 at 03:16
  • By removing ^ doing find . -type f -exec prename -n 'if (m/(\d{4}_\d\d_\d\d)_(\d\d)/) {my ($date,$hour) = ($1,$2);my $dir = "./$date/$hour/";mkdir $date;mkdir $dir;s=^=$dir=}' {} + gives ./2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz -> ./2021_12_30/04/./2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz...the third version does work, I'm asking this so that it would help me better understand what's going on in the command :) Thanks again for your help – nidooooz Apr 14 '22 at 03:37
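
The result in the last comment follows from the ./ prefix that find puts on every name. A short shell sketch of what happens (the file is the one from the comment, created in a throwaway directory; the variable names are only illustrative):

```shell
demo=$(mktemp -d)
touch "$demo/2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz"

# find prefixes each name with its starting point, here "./".
path=$(cd "$demo" && find . -type f)
echo "$path"
# prints ./2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz

# Inserting the new directory at the start of the whole string (what
# s=^=$dir= does) leaves that "./" stranded in the middle of the path.
echo "./2021_12_30/04/$path"
# prints ./2021_12_30/04/./2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz

# Splitting into directory and file first, then rebuilding, avoids this.
# That is what the third version does with fileparse().
dir=${path%/*}      # "."
file=${path##*/}    # the name without the "./" prefix
echo "$dir/2021_12_30/04/$file"
# prints ./2021_12_30/04/2021_12_30_04_56_Diameter_CDR_pid5906_ins3_thread_2_19104.csv.gz
```

The stray ./ in the middle is harmless to the filesystem, so the move would still land in the right place, but splitting the path first is cleaner and also works for files outside the current directory.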