What is a faster way to extract the year from file names to move them to year based directories than my current approach of using `cut` and `rev`?

Question

I have a web application that accesses a remote storage running Linux to fetch some files. The problem is that the remote storage currently holds 3 million files, so accessing them the normal way is a bit tricky.

So I needed a script to make this a little easier to use. The script reorganizes the files into multiple folders depending on their creation date and, especially, their names. I wrote the script and it worked just fine and did what it was meant to do, but it was too slow: 12 hours to complete the work (12:13:48 to be precise).

I think that the slowness is coming from the multiple cut and rev calls I make.

Example:

I get the file names with an ls command that I loop into with for, and for each file I get the parent directory and, depending on the parent directory, I can get the correct year:

    case "$parent" in
        ( "Type1" )
            year=$(echo "$fichier" | rev | cut -d '_' -f 2 | rev);;

        ( "Type2" )
            year=$(echo "$fichier" | rev | cut -d '_' -f 2 | rev);;

        ( "Type3" )
            year=$(echo "$fichier" | rev | cut -d '_' -f 1 | rev | cut -c 1-4);;

        ( "Type4" )
            year=$(echo "$fichier" | rev | cut -d '_' -f 1 | rev | cut -c 1-4);;

        ( "Type5" )
            year=$(echo "$fichier" | rev | cut -d '_' -f 1 | rev | cut -c 1-4);;
    esac

For Type1 files:

the file==>MY_AMAZING_FILE_THAT_IMADEIN_YEAR_TY.pdf

I need to get the year so I perform a reverse cut:

year=$(echo "$file" | rev | cut -d '_' -f 2 | rev );;

For Type2 files:

the file==>MY_AMAZING_FILE_THAT_IMADE_IN_YEAR_WITH_TY.pdf

etc...

Then I can move the file freely: mv $file /some/path/destination/$year/$parent

And this is the simplest example; some files are much more complex. So to get one piece of information I need four operations: 1 echo, 2 rev and 1 cut.

While the script is running I get speeds of 50 to 100 files/s. I measured this by running wc -l output.txt on the script's output.

Is there anything I can do to make it faster, or another way to cut the file names? I know that I can use sed, awk or string operations, but I did not really understand how.

"Is there anything I can do to make it faster?" Sure, but you don't give enough information for a more useful answer. — Satō Katsura, Oct 11 '17 at 13:24
@SatōKatsura What information do I need to add? I'll gladly provide it. — Kingofkech, Oct 11 '17 at 13:27
What do the "much more complex" filenames look like? What does the directory structure look like? How does $file get its value? What does your current code look like? Please [edit] your question. — Kusalananda, Oct 11 '17 at 13:41
@Kingofkech How are you reading the list of files? How are you renaming them? Most people have a hard time optimizing code they didn't see, you know. — Satō Katsura, Oct 11 '17 at 13:59
You're doing way too many operations per file, and also getting the list of files with ls is not a good idea. You can probably do everything with a single find and Perl rename. But again, you don't give enough information for a full answer. Good luck. — Satō Katsura, Oct 11 '17 at 14:13
I cannot use find, because using find to loop over file names is really bad. You can ask @Kusalananda, he is the one who recommended it in another question. — Kingofkech, Oct 11 '17 at 14:18
@Kingofkech You can do it with find, no problem. Are the years the only four-digit number in the filenames? — Kusalananda, Oct 11 '17 at 14:19
find detects files with spaces as multiple lines. I've already asked this question and you answered it here — Kingofkech, Oct 11 '17 at 14:20
@Kingofkech That's because you probably parse the output of find. I'm currently updating my answer. Are the years the only four-digit number in the filenames? — Kusalananda, Oct 11 '17 at 14:21
Real example of the files:
Type 1: FA_ERDXSER_CALSE_RASM_2017047361_YEAR_20170922.pdf

Type 2: FILE_SENT_PAID_1998027890_YEARMMdd.pdf — Kingofkech, Oct 11 '17 at 14:25
YEAR is the year, MM the month and dd the day; for example, 20170223 is the 23rd of February 2017. I need to get the YEAR in order to move the file. — Kingofkech, Oct 11 '17 at 14:42
Please explain what you want to do instead of what you are doing. As you point out, your approach is really not optimal. First because you're using ls which means it will break on weird file names and then because this seems way too complicated. We can't really help you though since you don't clearly explain what you are trying to do with these files. — terdon, Oct 11 '17 at 14:44

Kusalananda · Accepted Answer · 2017-10-11T15:00:56.503

To get the YEAR portion of the filename MY_AMAZING_FILE_THAT_IMADEIN_YEAR_TY.pdf without using external utilities:

name='MY_AMAZING_FILE_THAT_IMADEIN_YEAR_TY.pdf'

year=${name%_*}    # remove everything after the last '_'
year=${year##*_}   # remove everything up to the last (remaining) '_'
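
The other pipeline in the question (last field, then `cut -c 1-4`) can be handled with parameter expansion in the same way. A minimal sketch, assuming bash; the two function names are hypothetical, and the year 2017 stands in for the YEAR placeholder purely for illustration:

```bash
#!/bin/bash
# Hypothetical helpers; names and sample years are for illustration only.

# Year is the second-to-last '_'-separated field
# (the "rev | cut -d '_' -f 2 | rev" pipeline in the question).
year_from_second_last_field() {
    local name=$1
    name=${name%_*}              # drop the last '_' and everything after it
    printf '%s\n' "${name##*_}"  # keep what follows the last remaining '_'
}

# Year is the first four characters of the last field
# (the "rev | cut -d '_' -f 1 | rev | cut -c 1-4" pipeline in the question).
year_from_last_field() {
    local name=$1
    name=${name##*_}             # keep only the last field
    printf '%s\n' "${name:0:4}"  # first four characters (bash substring expansion)
}

year_from_second_last_field 'MY_AMAZING_FILE_THAT_IMADEIN_2017_TY.pdf'
year_from_last_field 'FILE_SENT_PAID_1998027890_20170223.pdf'
```

The functions are only there to make the sketch easy to try. Inside a loop over millions of files it is probably better to write the expansions inline, since calling a function through `$(...)` forks a subshell for every file.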

After update to the question:

Moving PDF files from under topdir to a directory /some/path/destination/<year>/<parent> where <year> is the year found in the filename of the file, and <parent> is the basename of the original directory that the file was found in:

find topdir -type f -name '*.pdf' -exec bash ./movefiles.sh {} +

movefiles.sh is a shell script in the current directory:

#!/bin/bash

destdir='/some/path/destination'

for name; do
    # get basename of directory
    parent=${name%/*}
    parent=${parent##*/}

    # get the year from the filename:
    #  - Pattern:  _YYYY_         (in the middle somewhere)
    #  - Pattern:  _YYYYMMDD.pdf  (at end)
    if [[ "$name" =~ _([0-9]{4})_ ]] ||
       [[ "$name" =~ _([0-9]{4})[0-9]{4}\.pdf$ ]]; then
        year="${BASH_REMATCH[1]}"
    else
        printf 'No year in filename "%s"\n' "$name" >&2
        continue
    fi

    # make destination directory if needed
    # (remove echo when you have tested this at least once)
    if [ ! -d "$destdir/$year/$parent" ]; then
        echo mkdir -p "$destdir/$year/$parent"
    fi

    # move file
    # (remove echo when you have tested this at least once)
    echo mv "$name" "$destdir/$year/$parent"
done
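
Before running this against millions of files, it may be worth checking the two patterns on a few sample names. A small sketch, assuming bash; `extract_year` is a hypothetical wrapper around the same two tests, and the sample paths are made up from the file names given in the comments (with 2017 in place of YEAR):

```bash
#!/bin/bash
# extract_year is a hypothetical helper wrapping the two patterns from movefiles.sh.
# It sets the variable "year" and returns non-zero when neither pattern matches.
extract_year() {
    local name=$1
    if [[ $name =~ _([0-9]{4})_ ]] ||
       [[ $name =~ _([0-9]{4})[0-9]{4}\.pdf$ ]]; then
        year=${BASH_REMATCH[1]}
    else
        year=
        return 1
    fi
}

# Sample paths are assumptions for illustration, not real files.
for name in \
    'Type1/FA_ERDXSER_CALSE_RASM_2017047361_2017_20170922.pdf' \
    'Type2/FILE_SENT_PAID_1998027890_20170223.pdf' \
    'Type3/no_year_here.pdf'
do
    if extract_year "$name"; then
        printf '%s -> %s\n' "$name" "$year"
    else
        printf '%s -> no year found\n' "$name"
    fi
done
```

Note that the patterns are matched against the whole pathname, so a directory component containing something like `_2016_` would also match the first pattern. If that can happen in your tree, matching against `${name##*/}` instead would restrict the test to the file name.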

Thank you for your answer, but will those instructions make my code faster than it is right now (compared to the cut and double rev)? — Kingofkech, Oct 11 '17 at 13:33
@Kingofkech Yes, but it also depends on the context of your code. How does $file get its value, for example? Update your question to give a bit more context to what it is you're doing. — Kusalananda, Oct 11 '17 at 13:34
Thank you so much, I think this is much better. I had not thought of using find to call a script. — Kingofkech, Oct 11 '17 at 14:46

RomanPerekhrest · Answer 2 · 2017-10-11T13:37:41.913


You may use a sed approach to extract the year value:

year=$(sed -E 's/.*_([0-9]{4})_TY\.pdf/\1/' <<<"$file")
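
Run once per file, this starts a new sed process each time, which is likely to cost about as much as the original pipeline. One possible variation is to feed many names to a single sed. A sketch, assuming bash 4 or later (for `mapfile`) and file names without newlines; the sample names are made up for illustration:

```bash
#!/bin/bash
# Sample names are assumptions; they follow the ..._YEAR_TY.pdf pattern.
names=(
    'MY_AMAZING_FILE_THAT_IMADEIN_2015_TY.pdf'
    'ANOTHER_FILE_2017_TY.pdf'
)

# One sed process handles every name, one name per line.
# Assumes no file name contains a newline.
mapfile -t years < <(printf '%s\n' "${names[@]}" |
    sed -E 's/.*_([0-9]{4})_TY\.pdf/\1/')

# Pair each name with the year extracted for it.
for i in "${!names[@]}"; do
    printf '%s -> %s\n' "${names[i]}" "${years[i]}"
done
```

A name that does not match the pattern passes through sed unchanged, so the result should be checked before it is used as a directory name.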

answered Oct 11 '17 at 13:27 · edited Oct 11 '17 at 13:37
