2

I am trying to convert a markdown file to PDF using pandoc. Since my markdown contains Chinese characters, I use the following command to produce the PDF:

pandoc --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

But pandoc complains that the file contains non-UTF-8 characters that it cannot handle; the exact error message is:

Error producing PDF.
! Undefined control sequence.
pandoc.exe: Cannot decode byte '\xbd': >Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

From what I have found on the internet, this is largely due to the encoding of the markdown file and may have nothing to do with pandoc itself. My file contains a lot of Chinese and English characters, and I have converted it to UTF-8 encoding.

Things I have tried, without success

I transferred the file to my CentOS server and tried to find where the invalid characters are, or failing that, to simply remove them, but without success.

Grep for non-UTF-8 characters

Following the instructions here and here (in fact, I have tried several of the top answers in the two posts, but they did not work), I have verified that the system locale is set to UTF-8; the output of localectl status is:

   System Locale: LANG=en_US.UTF-8
       VC Keymap: us
      X11 Layout: us

I tried to grep for non-UTF-8 characters. The command used is grep -axv '.*' test.md, but it output nothing. (I took that to mean there are no invalid bytes that cannot be decoded as UTF-8.)
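Another check worth trying (this assumes GNU iconv; other implementations word the error differently): without the -c flag, iconv aborts at the first invalid sequence and reports its byte position instead of silently dropping it:

```shell
# Validate the file: GNU iconv exits non-zero and reports the position
# of the first byte sequence that is not valid UTF-8
iconv -f utf-8 -t utf-8 test.md > /dev/null
```

If it prints nothing and exits zero, the file really is valid UTF-8.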

Try to discard invalid characters

I followed the instructions here, trying to remove non-UTF-8 characters from my file. The command I used is:

iconv -f utf-8 -t utf-8 -c test.md > output.md

After that, when I tried to convert output.md to PDF using pandoc, I got the same error message, which suggests that the file still contains non-UTF-8 characters.
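When grep and iconv both say the file is clean but the consuming tool still complains, a small script can settle the question. Below is a diagnostic sketch (it assumes python3 is available) that prints the byte offset and line number of the first invalid UTF-8 sequence, or confirms the file is valid:

```shell
python3 - test.md <<'EOF'
import sys

data = open(sys.argv[1], 'rb').read()
try:
    data.decode('utf-8')
    print('file is valid UTF-8')
except UnicodeDecodeError as err:
    # count newlines before the bad byte to get its line number
    line = data.count(b'\n', 0, err.start) + 1
    print('invalid byte 0x%02x at offset %d (line %d)'
          % (data[err.start], err.start, line))
EOF
```

If this also reports the file as valid, the invalid bytes are most likely being produced later in the toolchain rather than read from the file.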

My question

I am surprised that the above methods do not work. How can I pinpoint which part of the file is causing the problem, or how can I really remove the non-UTF-8 characters from the file so that I can compile it without error?

Other information

  • You can find the markdown file here.

  • If you are using a Linux system, you may need to set CJKmainfont to another valid Chinese font name available on your system.

  • On a Linux system, it seems the command to produce a PDF from markdown with Chinese text should be (change the font to a valid one):

    pandoc --latex-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

jdhao
  • 606
  • ... why are you posting a question about Windows here instead of on [su]? – muru Dec 26 '17 at 07:17
  • Cross-posted: https://stackoverflow.com/questions/47954642/how-to-find-the-where-the-non-utf8-character-is-in-my-file – muru Dec 26 '17 at 07:18
  • Yes, it was posted by me, because someone there suggested I should go to another Stack Exchange site. – jdhao Dec 26 '17 at 07:28
  • I think this problem has nothing to do with Windows. – jdhao Dec 26 '17 at 07:29
  • I don't see what it has to do with Unix or Linux either. – muru Dec 26 '17 at 07:30
  • I want to find which command on the Linux system can locate where the invalid character is in my file. I think it is perfectly related to Linux. – jdhao Dec 26 '17 at 07:34
  • Dupe of https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8, then. You have tried two answers, there are others. – muru Dec 26 '17 at 07:37
  • This is not a duplicate of that question, muru, because as is pointed out in this question, the tests conclude that the file does not contain invalid UTF-8, and so this question is Why is pandoc complaining about a UTF-8 problem that is not there?. Or it would be. I have attempted to duplicate the results in the question. I have replicated several of the tests saying that the file is valid UTF-8. But I am unable to replicate the questioner's original problem. On Debian Linux, with a UTF-8 locale, pandoc converted the file to PDF for me just fine, with no complaint. – JdeBP Dec 26 '17 at 09:53
  • (It converted it to something, I should say. I just looked for pandoc complaining about UTF-8, which it had no complaints about, and didn't inspect the output file.) – JdeBP Dec 26 '17 at 10:11
  • On both my Windows and CentOS systems, it just complains about an invalid UTF-8 stream. BTW, when could you successfully convert the markdown to PDF: before or after you transformed the markdown file? – jdhao Dec 26 '17 at 10:17

2 Answers

3

OK, after long hours of wrestling with the problem and digging, I finally found the root cause.

The cause

The problem is that in test.md, text starting with a backslash appears in several places where it should really be taken literally. For example:

* 一般现在时\过去时\将来时,simple present\past\future
* 现在(过去\将来)进行时,present(past\ future) continuous
* 现在(过去\将来)完成时,present(past\future) perfect
* 现在(过去\将来)完成进行时,present(past\future) perfect continuous

The backslashes in the above paragraph are just intended as separators between alternatives. That is valid markdown, but unfortunately pandoc processes them as raw TeX commands.

Solution

Use the following command instead:

pandoc -f markdown-raw_tex --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

Alternatively, wrap the text containing backslashes in backticks (but this is not always desired), or escape each backslash by doubling it.
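For example, doubling every backslash mechanically could be sketched with sed (run it on a copy: this assumes the file contains no intentional raw LaTeX, which the substitution would break):

```shell
# Escape every backslash so pandoc treats it as a literal character
sed 's/\\/\\\\/g' test.md > test-escaped.md
```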

Some thought

The error message from Pandoc is misleading as the problem is not related to UTF-8 decoding. I have no idea why the error message is like that.

Also, the error messages for this issue are not consistent. For example, if you compile the above text containing backslashes using

pandoc -f markdown --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

The error message will be something like:

Error producing PDF.
! Undefined control sequence.
l.75 一般现在时\过去时

then it will be much easier to find where the problem is, instead of digging into UTF-8 related problems.

Follow-ups

This is indeed a bug in xelatex: it may produce invalid UTF-8 bytes in its log when it encounters undefined control sequences, but pandoc assumes that what it receives is a valid UTF-8 sequence, hence the error. For a more detailed explanation, see this post.

update 2017.12.29
With the release of Pandoc 2.0.6, this behaviour is handled more properly:

Allow lenient decoding of latex error logs, which are not always properly UTF8-encoded

Now it is easier to debug this kind of issue.

jdhao
  • 606
0

pandoc is complaining about byte \xbd (hexadecimal "bd"), so grep for that, e.g.

grep -n $'\xbd' file 

e.g. if I create a small file with 4 lines, one of which contains the \xbd character:

a
b
c½
d

then grep -n will tell me it's on line 3:

$ grep -n $'\xbd' file 
3:c½

NOTE: the $'\xbd' requires a unix shell like bash. See man bash and search for "QUOTING" for details.


BTW, the \xbd character is an extended ASCII character. It may be part of a broken Unicode sequence (many Unicode characters, when encoded as UTF-8, have 0xbd as one of their byte values). On my screen it displays as a '1/2' fraction. Here's what ascii has to say about it:

$ ascii bd
ASCII 11/13 is decimal 189, hex bd, octal 275, bits 10111101: meta-=
cas
  • 78,579
  • Thanks for your answer, but the grep command does not work; I have tried it. I have found the real reason and will post an answer later. – jdhao Dec 27 '17 at 03:08
  • the grep command does work. I even copy-pasted an example from my own terminal session where I tested it (and an explanation of what the 0xbd character is) that shows it working. It answers your question as posted. If your question does not accurately reflect what the problem is, then you need to fix your question so that it is answerable. – cas Dec 27 '17 at 03:13
  • I mean the grep command can not find 0xbd in my file. There is no 0xbd in my file, but pandoc just complains. BTW, can you grep 0xbd in my uploaded file? – jdhao Dec 27 '17 at 03:30
  • @cas You said: "It may be a broken unicode sequence". That's exactly what this is all about. If you look at the file in question, you will find it is full of 0xbd bytes, and also lots of other "extended ascii" bytes, which is perfectly normal for UTF-8 text. Pandoc complains about one 0xbd byte that is out of sequence, i.e. is not part of a valid UTF-8 byte sequence. Nobody seems to have been able to find this invalid UTF-8 sequence in the file, so the hypothesis is now that Pandoc is wrong. – Johan Myréen Dec 27 '17 at 03:39
  • that's pretty much what i was just about to post a comment saying - it's a problem with pandoc. it seems to be a different version than what i have on debian (1.19.24) because --pdf-engine=xelatex is not a valid option. changing it to --latex-engine=xelatex and removing the -V CJKmainfont=KaiTi (font is non-existent on my system), results in pandoc complaining ! Undefined control sequence. followed by l.721 ...�度有问题,应该把\textwidth换成. Running grep -n '度有问题,应该把' test.md shows line 614. that may be worth taking a closer look at. – cas Dec 27 '17 at 03:45
  • In pandoc 2.0, --latex-engine was changed to --pdf-engine. The root problem is that text starting with a backslash should be escaped, as suggested here. But the error message is really misleading. This file is a valid UTF-8 file, with no such invalid characters. – jdhao Dec 27 '17 at 03:51
  • line 614 of test.md seems to be the first line with tex commands like \textwidth embedded - i can't understand the text but I presume that they're meant to be literal strings, not tex commands. they probably need to be written as \\textwidth etc. – cas Dec 27 '17 at 03:52
  • in other words, garbage-in garbage-out. this Q can be closed as "off-topic: problem went away when a typo was corrected". – cas Dec 27 '17 at 03:56
  • No, this is not a typo. This is a problem with pandoc, I think. It outputs the wrong error message; the correct error message should be something like "undefined command". – jdhao Dec 27 '17 at 03:57
  • the problem was caused by you failing to correctly escape the \\ in your input text. there's the "garbage in". it would be xelatex complaining about the 'undefined command', not pandoc. pandoc then has to deal with that (the garbage out). By analogy: you're feeding Finnish text into a Swahili to Russian translator and the translator responds with "Que?" or perhaps "out of cheese error". – cas Dec 27 '17 at 04:02