
I use Firefox and I don't have any issues viewing and reading English text on the loaded websites.

If I click Save in Firefox and save the web page in question as a text file, I can read everything in the text file - all characters are readable.

However, when I use DownThemAll to save these same web pages as .html - which appears to be the only way with DtA - the saved HTML files contain unreadable characters, and the kicker is that they are on exactly the lines I'm interested in reading and extracting. Firefox's View Source shows the same unreadable output.

Basically I'm trying to scrape a site (yunfile.com) to gather file names and download links - everything would be fine except I CANNOT read the file names.

Here's an example link: http://page3.dfpan.com/file/syg65488/0141cd27 The problem I'm having is with the file name line where it says Downloading:

HTML file text reads: ¡£¢¢£¥£¢½ãòá碽áòá

In Firefox the same text reads: 20110601.part1.rar

Is there a program and a command I can run to convert these HTML files?

Any suggestions would be greatly appreciated.

1 Answer


This isn't an encoding problem. What's happening is that the server returns HTML with the file names mangled, and there's a bit of JavaScript to unmangle them.

Fortunately the mangling is performed by JavaScript that isn't hidden or obfuscated, so it's easy to undo. The JS code is:

function codeAndEncode(_key,_str){
     var keyUnicodeSum=0;
     var codedStr = "";
     // add up the character codes of the key ("111" gives 147)
     for( j = 0; j<_key.length; j++ ){
          keyUnicodeSum += _key.charCodeAt( j );
     }
     // XOR each character of the string with that sum
     for( i = 0; i<_str.length; i++ )
     {
          var _strXOR = _str.charCodeAt(i) ^ keyUnicodeSum;
          codedStr += String.fromCharCode( _strXOR );
     }
     return codedStr;
}

var filename = codeAndEncode("111", "ëúòüýúòý¡£¢¢£¥£¢½ãòá碽áòá");

That's pretty simple: compute a value from the key and XOR it with each character of the string. Since XOR is its own inverse, mangling and unmangling are the same operation. You can translate this into whatever language you're using for your scraper. For example, here's some Perl code that undoes the mangling:

$ perl -CA -l -we 'my $sum = 0; $sum += ord foreach split //, $ARGV[0]; print $ARGV[1] ^ (chr($sum) x length($ARGV[1]))' 111 "ëúòüýúòý¡£¢¢£¥£¢½ãòá碽áòá"
xiaonian20110601.part1.rar
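
If your scraper is in Python rather than Perl, the same unmangling translates directly. Here's a minimal sketch (the function name is mine; only the XOR scheme comes from the page's script):

def code_and_encode(key, s):
    # XOR every character of s with the sum of the key's character codes,
    # mirroring the site's codeAndEncode() JavaScript
    key_sum = sum(ord(c) for c in key)
    return "".join(chr(ord(c) ^ key_sum) for c in s)

print(code_and_encode("111", "ëúòüýúòý¡£¢¢£¥£¢½ãòá碽áòá"))
# prints: xiaonian20110601.part1.rar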

The page's unmangling script uses document.getElementById("file_show_filename") and document.getElementById("file_down_filename") to identify the nodes in the HTML tree that need to be unmangled. You can adapt that, too, to whatever HTML parser your scraper uses.
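
For instance, if you're scraping in Python, a rough sketch with BeautifulSoup might look like this (assumptions: the page still uses those two ids and the key "111" seen above, the saved file decodes as UTF-8, and page.html is just a placeholder name - check the page's inline script and encoding if any of that changes):

from bs4 import BeautifulSoup

def code_and_encode(key, s):
    # same XOR scheme as the site's JavaScript
    key_sum = sum(ord(c) for c in key)
    return "".join(chr(ord(c) ^ key_sum) for c in s)

def extract_filename(html, key="111"):
    soup = BeautifulSoup(html, "html.parser")
    # ids used by the page's own unmangling code
    node = soup.find(id="file_show_filename") or soup.find(id="file_down_filename")
    return code_and_encode(key, node.get_text(strip=True))

with open("page.html", encoding="utf-8") as f:   # placeholder file name
    print(extract_filename(f.read()))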

The aim of mangling the file names is to make scraping more difficult, so it's likely that the site administrators will make the mangling harder to reproduce over time. If you want to keep recovering the file names no matter what tricks the site pulls, you can run Firefox in an automated environment. See Are there any good tools besides SeleniumRC that can fetch webpages including content post-painted by JavaScript? and How can I run Firefox on Linux headlessly (i.e. without requiring libgtk-x11-2.0.so.0)?