
How do you configure tidy to parse XML instead of HTML?

Explanation:
A while ago, a co-worker showed me a trick to use tidy to clean up XML.

Apparently, you create a tidyrc file like so:

input-xml: yes
quiet: yes
indent: yes
indent-attributes: yes
indent-spaces: 4
char-encoding: utf8
wrap: 0
wrap-asp: no
wrap-jste: no
wrap-php: no
wrap-sections: no

Even after adding this to ~/.tidyrc, tidy still tries to parse the input as HTML (its default) rather than XML:

$ cat -v foo.out | tidy > foo.xml
line 3 column 1 - Error: <data> is not recognized!
line 3 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected <data>

I've tried various permissions:

[root@mongo-test3 tmp]# ls -ial ~
 51562 -rw-------  1 root root 11550 Jul 16 02:17 .bash_history
 50973 -rw-r--r--  1 root root    18 May  1 00:40 .bash_logout
 51538 -rw-r--r--  1 root root   176 May  1 00:40 .bash_profile
 51537 -rw-r--r--  1 root root   124 May  1 00:40 .bashrc
 51561 -rwxr-xr-x  1 root root   164 Jul 16 22:16 .tidyrc

I've tried naming the file both .tidyrc and plain tidyrc.

Versions:
I've tried this on both Mac OS X and CentOS 6.4.

Mac OSX 10.8.4

Darwin spuders-macbook-pro 12.4.0 Darwin Kernel Version 12.4.0: Wed May 1 17:57:12 PDT 2013; root:xnu-2050.24.15~1/RELEASE_X86_64 x86_64

CentOS 6.4

Linux mongo-test3 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Research:
Normally I would ask the person who taught me this trick, but they can no longer be reached.

Workaround:
As a workaround, I can use the -xml flag, but I would prefer to get the tidyrc working:

$ cat -v foo.out | tidy -xml > foo.xml

1 Answer

If you look through the man page for tidy, you'll find the following description of the HTML_TIDY environment variable:

Name of the default configuration file. This should be an absolute path, since you will probably invoke tidy from different directories. The value of HTML_TIDY will be parsed after the compiled-in default (defined with -DTIDY_CONFIG_FILE), but before any of the files specified using -config.

So it would appear that tidy has a compile-time option that hard-codes the configuration files it looks for, which is the behavior you're expecting from ~/.tidyrc.

Looking through some of tidy's online documentation on Raggett's page I came across this blurb:

Alternatively, you can name the default config file via the environment variable named "HTML_TIDY". Note this should be the absolute path since you are likely to want to run Tidy in different directories. You can also set a config file at compile time by defining CONFIG_FILE as the path string, see platform.h.

So after downloading the source for tidy and looking inside the file platform.h I found the following lines:

/* #define TIDY_CONFIG_FILE "/etc/tidy_config.txt" */ /* original */
/* #define TIDY_CONFIG_FILE "/etc/tidyrc" */
/* #define TIDY_CONFIG_FILE "/etc/tidy.conf" */

/*
  Uncomment the following #define if you are on a system
  supporting the HOME environment variable.
  It enables tidy to find config files named ~/.tidyrc if 
  the HTML_TIDY environment variable is not set.
*/
/* #define TIDY_USER_CONFIG_FILE "~/.tidyrc" */

If you know C/C++, you'll recognize that all of these lines are commented out, so in effect the tidy that I have was built with every compiled-in config file location disabled. I also double-checked the spec file from which the package for my Fedora 14 system was built (tidy.spec) to make sure it didn't contain any configure commands that would override the above settings in platform.h. I found no such overrides.

Therefore it would appear that the stock tidy isn't built to look for a configuration file at any default location, ~/.tidyrc included.
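One way to check this on your own system, without reading the source, is to ask tidy which settings are actually in effect. This is only a minimal sketch: it assumes your build supports the -show-config option and prints each option name at the start of a line, followed by its type and current value. The function name tidy_reads_xml is made up for illustration.

```shell
# Hypothetical helper: succeeds if the tidy found on PATH reports
# input-xml as enabled in its effective configuration.
# Assumption: `tidy -show-config` prints lines such as
#   input-xml    Boolean    yes
tidy_reads_xml() {
    tidy -show-config 2>/dev/null |
        grep -Eq '^input-xml[[:space:]].*yes[[:space:]]*$'
}
```

Something like `tidy_reads_xml && echo "config picked up" || echo "still defaulting to HTML"` would then tell you whether a given setup is being honored.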

So what are your options?

Well, you can still pass the configuration file to tidy on the command line:

$ ... | tidy -config ~/.tidyrc > foo.xml
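If typing -config every time gets tedious, a small shell function can supply it for you. The sketch below is just one way to do that; the name xmltidy is hypothetical, and it assumes the config file lives at ~/.tidyrc as in the question.

```shell
# Hypothetical wrapper: always hand tidy the user's config file explicitly.
# "$@" forwards any further options or file names unchanged.
xmltidy() {
    tidy -config "$HOME/.tidyrc" "$@"
}
```

With that defined in your shell, `cat -v foo.out | xmltidy > foo.xml` behaves like the command above.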

Additionally, you could make use of another feature of tidy that may have gone unnoticed above: its ability to read the environment variable HTML_TIDY. It needs to be an absolute path, so you can't use "~/.tidyrc", but you could do this:

$ export HTML_TIDY="$HOME/.tidyrc"
$ cat -v foo.out | tidy > foo.xml

If you want to make that variable permanent, just add it to your $HOME/.bashrc file:

export HTML_TIDY="$HOME/.tidyrc"
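A slightly more defensive variant, if you prefer one, only exports the variable when the file is readable and turns a relative path into an absolute one, since HTML_TIDY should be absolute. Treat this as a sketch: the function name use_tidyrc is invented for illustration and is not part of tidy.

```shell
# Hypothetical helper for ~/.bashrc: export HTML_TIDY for the given config
# file (default: ~/.tidyrc), but only if that file is readable.
use_tidyrc() {
    local rc="${1:-$HOME/.tidyrc}"
    # HTML_TIDY should be an absolute path, so anchor relative names at $PWD.
    case "$rc" in
        /*) ;;
        *)  rc="$PWD/$rc" ;;
    esac
    if [ -r "$rc" ]; then
        export HTML_TIDY="$rc"
    else
        echo "use_tidyrc: cannot read $rc" >&2
        return 1
    fi
}
```

Calling `use_tidyrc` with no argument after the definition in $HOME/.bashrc has the same effect as the plain export, except that it complains when the file is missing.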


  • Thanks for the thorough answer. I actually noticed the HTML_TIDY parameter you mentioned just minutes before you posted this. Adding that to my path did the trick! – spuder Jul 17 '13 at 02:59
  • That's an amazingly thorough answer. – Thufir Jan 02 '19 at 23:15