Same file, different filename due to encoding problem?

Question

I was about to diff a backup against its source to manually verify that the data is correct. Some characters, like åäö, are not shown correctly in the original data, but since the clients (over Samba) interpret them correctly, that is nothing to worry about. The data restored from backup shows the characters correctly, so diff does not treat them as the same files with differences, but as completely different files.

md5 sums, same file but different name.

# md5sum /original/iStock_000003637083Large-barn*
e37c34968dd145a0e25692e1cb7fbdb1  /original/iStock_000003637083Large-barn p? strand.jpg

# md5sum /frombackup/iStock_000003637083Large-barn*
e37c34968dd145a0e25692e1cb7fbdb1  /frombackup/iStock_000003637083Large-barn på strand.jpg

Mount options and filesystems

/dev/sdb1 on /original type ext4 (rw,noatime,errors=remount-ro)
/dev/sdc1 on /frombackup type ext4 (rw)

Locale

LANG=sv_SE.UTF-8
LANGUAGE=
LC_CTYPE="sv_SE.UTF-8"
LC_NUMERIC="sv_SE.UTF-8"
LC_TIME="sv_SE.UTF-8"
LC_COLLATE="sv_SE.UTF-8"
LC_MONETARY="sv_SE.UTF-8"
LC_MESSAGES="sv_SE.UTF-8"
LC_PAPER="sv_SE.UTF-8"
LC_NAME="sv_SE.UTF-8"
LC_ADDRESS="sv_SE.UTF-8"
LC_TELEPHONE="sv_SE.UTF-8"
LC_MEASUREMENT="sv_SE.UTF-8"
LC_IDENTIFICATION="sv_SE.UTF-8"
LC_ALL=

od -c

# ls "/original/iStock_000003637083Large-barn p� strand.jpg" | od -c
0000000   /   v   a   r   /   w   w   w   /   m   e   d   i   a   b   a
0000020   n   k   e   n   _   i   m   a   g   e   s   /   k   u   n   d
0000040   i   d   8   0   /   _   B   a   r   n   /   i   S   t   o   c
0000060   k   _   0   0   0   0   0   3   6   3   7   0   8   3   L   a
0000100   r   g   e   -   b   a   r   n       p 345       s   t   r   a
0000120   n   d   .   j   p   g  \n
0000127


# ls "/frombackup/iStock_000003637083Large-barn på strand.jpg" | od -c
0000000   /   d   a   t   a   /   v   a   r   /   w   w   w   /   m   e
0000020   d   i   a   b   a   n   k   e   n   _   i   m   a   g   e   s
0000040   /   k   u   n   d   i   d   8   0   /   _   B   a   r   n   /
0000060   i   S   t   o   c   k   _   0   0   0   0   0   3   6   3   7
0000100   0   8   3   L   a   r   g   e   -   b   a   r   n       p 303
0000120 245       s   t   r   a   n   d   .   j   p   g  \n
0000135

Have sd[bc]1 been populated on the same machine? I.e., w/ the same mount-option and locale settings? — tink, Mar 07 '13 at 19:46
No, good spot. I did however pull it fresh from backup on the same machine just now, and the problem persists. See the output of 'od' added in the edit. — user135361, Mar 08 '13 at 08:34

score 7 · Accepted Answer · edited Apr 13 '17 at 12:36

Unix filesystems tend to be locale-agnostic in the sense that file names consist of bytes and it's the application's business to decide what those bytes mean if they fall outside the ASCII range. The convention on unix today is to encode filenames and everything else in UTF-8, apart from some legacy environments (mostly Asian). Windows filesystems, on the other hand, tend to have an encoding that is specified in the filesystem properties.

If you need to work with filenames in a different encoding, create a translated view of that filesystem with convmvfs. See "working with filenames in a different encoding over ssh".

It appears that your original system has filenames encoded in latin-1. Your current system uses UTF-8, and the one-byte sequence representing å in latin-1 (\345) is an invalid sequence in UTF-8 which ls prints as ?. Your backup process has somehow resulted in filenames encoded in UTF-8. Samba translates filenames based on its configuration.
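As a small illustration of the point above, the following Python sketch uses the byte values from the od output in the question (octal 345 on the original, 303 245 on the backup). The shortened file names and the helper name `decodes_as_utf8` are chosen for this example only.

```python
# Byte values copied from the od output in the question:
# the original has the single byte \345, the backup has \303 \245.
latin1_name = b"p\345 strand.jpg"
utf8_name = b"p\303\245 strand.jpg"


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The lone byte \345 is not valid UTF-8, which is why ls cannot display it.
print(decodes_as_utf8(latin1_name))
print(decodes_as_utf8(utf8_name))

# Reading the original bytes as Latin-1 and writing them back as UTF-8
# yields exactly the name found on the backup.
recoded = latin1_name.decode("latin-1").encode("utf-8")
print(recoded == utf8_name)
```

Both names denote the same text; only the byte encoding of å differs.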

To access the original files with your native encoding, make a recoded view:

mkdir /original-recoded
convmvfs -o icharset=LATIN1,ocharset=UTF8 /original /original-recoded
diff -r /original-recoded /frombackup

(You may need other options depending on what permissions and ownership you want to obtain.)
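If mounting a recoded view is not an option, a rough alternative is to compare the directory listings after decoding the names to text. This is not what convmvfs does: it is only a sketch that compares names in a single directory, not file contents, and it assumes that any name which is not valid UTF-8 is Latin-1. The function names here are hypothetical.

```python
import os


def to_text(raw: bytes, legacy: str = "latin-1") -> str:
    """Decode a file name as UTF-8, falling back to an assumed legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


def name_differences(names_a, names_b, legacy="latin-1"):
    """Return (only_in_a, only_in_b) after decoding both listings to text."""
    a = {to_text(n, legacy) for n in names_a}
    b = {to_text(n, legacy) for n in names_b}
    return sorted(a - b), sorted(b - a)


def compare_dirs(dir_a: str, dir_b: str, legacy: str = "latin-1"):
    """Compare the entries of two directories by decoded name."""
    # Passing a bytes path makes os.listdir return the raw byte names.
    return name_differences(
        os.listdir(os.fsencode(dir_a)),
        os.listdir(os.fsencode(dir_b)),
        legacy,
    )
```

Calling `compare_dirs("/original", "/frombackup")` would then list only the names that are genuinely missing on one side. Trying UTF-8 first is a heuristic: a Latin-1 name can, in rare cases, also happen to be valid UTF-8.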

Thanks for the explanation of how it works. Not sure this is really helping me, are you telling me I (likely) have filesystems that have different encodings, and thus I need to create a translated view of .. etc? — user135361, Mar 08 '13 at 08:33
@user135361 You have data sets where filenames have different encodings. I've expanded my answer. — Gilles 'SO- stop being evil', Mar 09 '13 at 00:25

score 1 · Answer 2 · answered Mar 07 '13 at 19:47

In Unix/Linux, a filename can contain any byte except '\0' (ASCII NUL) and '/' (slash, the directory separator). In particular, if you want to give your files names in Kanji in some weird encoding, just go ahead. You will probably only see gibberish in ls(1) or other commands, but nothing bad will happen. That is what you are seeing: på is rendered as p?, where the '?' is a common stand-in for an unknown/non-ASCII character.

Try running both filenames through od -c, i.e. do something like:

ls /the/dir/offending/fi* | od -c

(the glob is to filter out non-relevant names, adjust to taste).

Only if the output differs would I start worrying. But given your Swedish setup, I suspect the correct name is på. Perhaps the other one is a name in Latin-4 left over from a previous setup?
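For comparison, the same kind of inspection can be sketched in Python. This is only a rough imitation of od -c: printable ASCII is shown as-is and every other byte as a three-digit octal escape. The function names `od_c_like` and `dump_dir` are made up for this example.

```python
import os


def od_c_like(raw: bytes) -> str:
    """Render printable ASCII as-is and every other byte as \\ooo (octal)."""
    parts = []
    for byte in raw:
        if 0x20 <= byte < 0x7F:
            parts.append(chr(byte))
        else:
            parts.append(f"\\{byte:03o}")
    return "".join(parts)


def dump_dir(path: str) -> None:
    """Print the raw bytes of every entry name in a directory."""
    # A bytes path makes os.listdir return undecoded byte names.
    for raw in os.listdir(os.fsencode(path)):
        print(od_c_like(raw))
```

A Latin-1 å shows up as a single `\345`, while a UTF-8 å shows up as the pair `\303\245`.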

Although not a solution, I think you provide a valuable explanation of how it works. Additionally, I didn't know of 'od', edited to provide od output. — user135361, Mar 08 '13 at 08:32
