Why did my folder names end up like this, and how can I fix this using a script?

Question

Sorry if this has an answer elsewhere, I've no idea how to search for my problem.

I was running some simulations on a redhat linux HPC server, and my code for handling the folder structure to save the output had an unfortunate bug. My matlab code to create the folder was:

folder = [sp.saveLocation, 'run_', sp.run_number, '/'];

where sp.run_number was an integer. I forgot to convert it to a string, but for some reason running mkdir(folder); (in matlab) still succeeded. In fact, the simulations ran without a hitch, and the data got saved to the matching directory.

Now, when the folder structure is queried/printed I get the following situations:

When I try to tab autocomplete: run_ run_^A/ run_^B/ run_^C/ run_^D/ run_^E/ run_^F/ run_^G/ run_^H/ run_^I/
When I use ls: run_ run_? run_? run_? run_? run_? run_? run_? run_? run_? run_?.
When I transfer to my mac using rsync the --progress option shows: run_\#003/ etc. with (I assume) the number matching the integer in sp.run_number padded to three digits, so the 10th run is run_\#010/
When I view the folders in finder I see run_ run_ run_ run_ run_ run_ run_ run_ run_ run_?
Looking at this question and using the command ls | LC_ALL=C sed -n l I get:

run_$
run_\001$
run_\002$
run_\003$
run_\004$
run_\005$
run_\006$
run_\a$
run_\b$
run_\t$
run_$

I can't manage to cd into the folders using any of these representations.

I have thousands of these folders, so I'll need to fix this with a script. Which of these options is the correct representation of the folder? How can I programmatically refer to these folders so I rename them with a properly formatted name using a bash script? And I guess for the sake of curiosity, how in the hell did this happen in the first place?

"When I try to tab autocomplete: ... If I try to type ..." Why type and not let autocomplete complete if for you? Also ^A is not literally ^ followed by A, but Ctrl-A (you can type it using Ctrl-V Ctrl-A since Ctrl-A is generally a shortcut for the shell). — muru, Aug 26 '19 at 02:46
@muru that doesn't work... I get as far as run_ and I have to type something — Bamboo, Aug 26 '19 at 02:49
Sorry commented before I saw your edit, that manages to get me in via cd — Bamboo, Aug 26 '19 at 02:51
You need to configure tab completion to cycle through the names - ^ see dupe. — muru, Aug 26 '19 at 03:01
I've edited the question now. My main problem is not about using cd, but rather that I have a huge number of files so I don't want to do this manually — Bamboo, Aug 26 '19 at 03:21
"How did this happen" is a Matlab programming question you'll probably get better answers on in a more suitable place. Is your goal to rename all the directories to run_1, run_2, etc? If you have thousands of them, how are the ones beyond the first byte's worth represented (| tail is probably good enough)? — Michael Homer, Aug 26 '19 at 03:41
FYI, you can "fix" such directory names with the perl rename utility, e.g. rename -n 's/([[:cntrl:]])/ord("$1")/eg' run_*/. You may have to use perl's unpack() function instead of ord() if sp.run_number can exceed one 8-bit value (i.e. >255). The -n option in my example is for a dry run; remove it to actually rename. — cas, Aug 26 '19 at 03:59
@MichaelHomer, I would like to get them back to run_1, run_2 etc, yes. Each simulation only goes up to 10 runs, but I have hundreds of simulations. @cas can you expand on your comment, maybe make it an answer? — Bamboo, Aug 26 '19 at 04:05
BTW, the "some reason" why mkdir in matlab did this is because the ONLY invalid characters in a file or directory name on unix filesystems are NUL and forward-slash /. Any other character is valid, including control characters. I don't know what matlab would have done if sp.run_number was 0 (probably either abort with an error or produce run_, as the NUL byte would terminate the directory name string). Of course, this would be also problematic for 16-bit (or higher) values that had a NUL byte in them, and would also vary according to the endian-ness of the system running matlab. — cas, Aug 26 '19 at 05:19
On my cygwin installation I can do a simple for f in *; do mv "$f" run$i.txt; ((i++)); done. That takes the files (in no particular order, I assume) and renames them as run.txt run1.txt run2.txt etc. Note that echo swallows the non-printable chars, so echo * does not show you the true content of the * expansion. — Peter - Reinstate Monica, Aug 27 '19 at 15:21

cas · Accepted Answer · 2019-08-26T05:02:02.967

You can use the perl rename utility (aka prename or file-rename) to rename the directories.

NOTE: This is not to be confused with rename from util-linux, or any other version.

rename -n 's/([[:cntrl:]])/ord($1)/eg' run_*/

This uses perl's ord() function to replace each control-character in the filename with the ordinal number for that character. e.g ^A becomes 1, ^B becomes 2, etc.

The -n option is for a dry-run to show what rename would do if you let it. Remove it (or replace it with -v for verbose output) to actually rename.

The e modifier in the s/LHS/RHS/eg operation causes perl to execute the RHS (the replacement) as perl code, and the $1 is the matched data (the control character) from the LHS.
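
If only the util-linux rename is available (as comes up in the comments below), the same ord() mapping can be approximated in plain bash. The following is a minimal sketch, not part of the original answer: fix_run_dirs is a hypothetical helper name, and it assumes every affected directory is named run_ followed by exactly one control character, as in the question's listing. It defaults to a dry run and must be run from the directory that contains the run_ folders.

```shell
#!/usr/bin/env bash
# Sketch: rename run_<control char> directories to run_<decimal value>.
# fix_run_dirs is a hypothetical helper name, not part of any existing tool.
fix_run_dirs() {
    local mode=${1:-dry-run} d suffix n
    for d in run_[[:cntrl:]]/; do
        [ -d "$d" ] || continue          # the glob matched nothing
        d=${d%/}                         # drop the trailing slash
        suffix=${d#run_}                 # the single control character
        printf -v n '%d' "'$suffix"      # leading quote: numeric value of the char
        if [ -e "run_$n" ]; then
            printf 'skip: run_%s already exists\n' "$n" >&2
        elif [ "$mode" = apply ]; then
            mv -- "$d" "run_$n"
        else
            printf 'would rename %q -> run_%s\n' "$d" "$n"
        fi
    done
}
```

Calling fix_run_dirs with no argument only prints the planned renames; fix_run_dirs apply performs them. Existing targets are skipped rather than overwritten.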

If you want zero-padded numbers in the filenames, you could combine ord() with sprintf(). e.g.

$ rename -n 's/([[:cntrl:]])/sprintf("%02i",ord($1))/eg' run_*/ | sed -n l
rename(run_\001, run_01)$
rename(run_\002, run_02)$
rename(run_\003, run_03)$
rename(run_\004, run_04)$
rename(run_\005, run_05)$
rename(run_\006, run_06)$
rename(run_\a, run_07)$
rename(run_\b, run_08)$
rename(run_\t, run_09)$

The above examples work only if sp.run_number in your matlab script was in the range 1..31 (the ASCII control characters; 127, DEL, is also matched by [[:cntrl:]]), so that it produced control characters in the directory names.

To deal with ANY 1-byte character (i.e. from 0..255), you'd use:

rename -n 's/run_(.)/sprintf("run_%03i",ord($1))/e' run_*/

If sp.run_number could be > 255, you'd have to use perl's unpack() function instead of ord(). I don't know exactly how matlab outputs an unconverted int in a string, so you'll have to experiment. See perldoc -f unpack for details.

e.g. the following will unpack both 8-bit and 16-bit unsigned values and zero-pad them to 5 digits wide:

 rename -n 's/run_(.*)/sprintf("run_%05i",unpack("SC",$1))/e' run_*/

Thanks for the details! I'm trying to test it out with the -n option, but it's telling me it's an invalid option - the version information gives me rename from util-linux 2.23.2, so I'm not sure it's the same function — Bamboo, Aug 26 '19 at 04:29
that's why i specified the perl version of the rename utility. util-linux's rename is very different, far less capable, and the command line options are incompatible. if you're running debian or similar, try installing the file-rename package. otherwise install the appropriate package for your distro. it may already be installed, try running prename or file-rename instead of just rename. — cas, Aug 26 '19 at 04:31
Yeah I thought that was the case. I'll see if I can get one of those to work. Thanks again for taking the time to help me out! — Bamboo, Aug 26 '19 at 04:34

score 11 · Answer 2 · edited Jun 11 '20 at 14:16

And I guess for the sake of curiosity, how in the heck did this happen in the first place?
folder = [sp.saveLocation, 'run_', sp.run_number, '/'];
where sp.run_number was an integer. I forgot to convert it to a string, but for some reason running mkdir(folder); (in matlab) still succeeded.

So, it would appear that mkdir([...]) in Matlab concatenates the members of the array to build the filename as a string. But you gave it a number instead, and numbers are what the characters on a computer really are. So, when sp.run_number was 1, it gave you the character with value 1, and then the character with value 2, etc.

Those are control characters, they don't have printable symbols, and printing them on a terminal would have other consequences. So instead, they're often represented by different sorts of escapes: \001 (octal), \x01 (hex), ^A are all common representations for the character with value 1. The character with value zero is a bit different, it's the NUL byte that is used to mark the end of a string in C and in the Unix system calls.

If you went higher than 31, you'd start to see printable characters, 32 is space (not very visible though), 33 = !, 34 = " etc.

So,

run_ run_^A/ run_^B/ — The first run_ corresponds to the one with a zero byte; the string ends there. The others show that your shell likes to display the control codes in caret notation, such as ^A. The notation also hints at the fact that the char with numerical value 1 can be entered as Ctrl-A, though you need to tell the shell to interpret it not as a control character but as a literal; Ctrl-V Ctrl-A should do that, at least in Bash.
ls: run_ run_? run_? — ls doesn't like to print unprintable characters on the terminal, it replaces them with question marks.
rsync: run_\#003/ — that one's new to me, but the idea is the same, the backslash marks an escape, and the rest is the numerical value of the character. It seems to me that the number here is in octal, like in the more common \003.
using the command ls | LC_ALL=C sed -n l ... run_\006$ run_\a$ run_\b$ run_\t$ — \a, \b and \t are C escapes for alarm (bell), backspace and tab, respectively. They have the numerical values 7, 8 and 9, so it should be clear why they come after \006. Using those C escapes is yet another way to mark the control characters. The trailing dollar signs mark the line ends.
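
The notations listed above can be reproduced from a single name. This is only an illustrative sketch: show_notations is a hypothetical helper, and it assumes bash plus a cat that supports -v (GNU and BSD cat both do).

```shell
#!/usr/bin/env bash
# show_notations is a hypothetical helper, used only for this illustration.
show_notations() {
    local name=$1
    printf '%s\n' "$name" | cat -v              # caret notation
    printf '%s\n' "$name" | LC_ALL=C sed -n l   # octal or C escapes, $ marks the line end
    printf '%q\n' "$name"                       # quoting that bash can read back in
}

show_notations $'run_\001'
show_notations $'run_\a'
```

For the first call, the cat -v line should read run_^A and the sed line run_\001$, matching the tab-completion and sed listings in the question. All three lines describe the same bytes on disk.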

As for cd, assuming my assumptions are right, cd run_ should go to that one single directory without an odd trailing character, and cd run_? should give an error since the question mark is a glob character that matches any single character, and there are multiple matching filenames, but cd only expects one.

Which of these options is the correct representation of the folder?

All of them, in a sense...

In Bash, you can use the \000 and \x00 escapes inside $'...' quotes to represent the special characters, so $'run_\033' (octal) or $'run_\x1b' (hex) correspond to the directory with the character value 27 (which happens to be ESC). (I don't think Bash supports escapes with decimal numbers.)
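
As a small sketch of that quoting in use (the scratch directory and the run_1/run_2 target names are made up for the demonstration), either escape form can address one of the odd directories directly:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)                          # scratch directory for the demo
mkdir -- "$workdir"/$'run_\001' "$workdir"/$'run_\x02'

# Octal and hex escapes name the same bytes, so either form works
# regardless of which one was used to create the directory.
mv -- "$workdir"/$'run_\x01' "$workdir/run_1"
mv -- "$workdir"/$'run_\002' "$workdir/run_2"

ls "$workdir"
```

The same quoting works with cd, so cd $'run_\001' is one way to enter a single directory without relying on tab completion.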

cas's answer has a script to rename those, so I won't go there.

If it's GNU ls, there are some quoting options including -b/--escape and --quoting-style=, or the QUOTING_STYLE environment variable, to control how non-printing characters are shown. I don't think there's an option to make it prefer octal escapes over the character versions, though. — Toby Speight, Aug 27 '19 at 16:19

score 3 · Answer 3 · edited Sep 01 '19 at 11:20

Easiest would be to create the wrong filename and the correct filename in the same environment where the mishap happened, and then just move/rename the folders to the correct names.

To avoid collisions with existing names, it is better to use a separate destination folder.

./saveLocationA/wrongname1 -> ./saveLocationB/correctname1
./saveLocationA/wrongname2 -> ./saveLocationB/correctname2
./saveLocationA/wrongname3 -> ./saveLocationB/correctname3

If possible, I would prefer fixing the script and just running it again; fixing some weird bug post mortem probably costs more and can introduce new problems.

Good luck!
