
Possible Duplicate:
Does anybody here have experience in automating some tasks in web applications using curl?

Here is what I need to do. I'm wondering which platform is best suited: something easy to understand and easy to code in. I may have to outsource it, as this may be well above my skill level.

Some background:

I have access to some information databases and websites through my library. They are accessed by first loading a library webpage, entering my library card number in a dialog box, and clicking the Submit link. That opens the authenticated webpage (via cookies or some such, I presume) for the service I want to obtain data from.
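
In scripting terms, I imagine the login step would look roughly like the following sketch (Python with the requests library; the URL and the form-field name are made up, since the real ones would have to be read from the library page's HTML):

```python
# Rough sketch only: the login URL and the form-field name are placeholders,
# since the real ones have to be read from the library page's HTML source.
import requests

session = requests.Session()  # keeps the authentication cookies between requests

login = session.post(
    "https://library.example.org/login",      # hypothetical login URL
    data={"cardnumber": "YOUR_CARD_NUMBER"},  # hypothetical form field
)
login.raise_for_status()

# Because the cookies live in the session object, later requests to the
# authenticated service reuse them automatically.
page = session.get("https://database.example.org/some/article")
print(page.status_code, len(page.text))
```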

What I want to achieve:

I want to create a compilation of suitably named PDF files in a folder. Alternatively, and preferably, I would like to create one PDF file that contains all the saved pages, with each page hyperlinked from an index page inside that one PDF.
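
For the "one PDF" part, my understanding is that a package like pypdf could stitch the individual PDFs together and add bookmarks as a table of contents. A sketch under that assumption (filenames and titles are invented):

```python
# Assumes the individual pages have already been saved as PDF files and
# that the pypdf package is installed (pip install pypdf). Filenames and
# titles are invented; older PyPDF2 releases spell the parameter "bookmark".
from pypdf import PdfWriter

pages = [
    ("Article about X", "article_x.pdf"),
    ("Database entry Y", "entry_y.pdf"),
]

writer = PdfWriter()
for title, path in pages:
    # append() pulls in every page of the file and adds an outline entry
    # (bookmark), which PDF readers display as a clickable table of contents.
    writer.append(path, outline_item=title)

with open("compiled.pdf", "wb") as out:
    writer.write(out)
```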

These pages are to be sourced from multiple websites. Access to the sites is either free, password-based, or library-based (which, as far as I can tell, requires screen-based interaction).

Also, on one of the websites accessed via the library, the address in the browser's address bar does not change as I move between pages (terrible). So the many pages I want to download for offline review do not lend themselves to a simple wget-style command. As far as I can tell, the script needs some way of clicking the right tabs on the website so that the page loads; once it has loaded, the page needs to be printed to a PDF file with a suitable name and compiled into the one PDF file.
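
If a real browser has to do the clicking, I imagine a Selenium-based version would look roughly like this sketch (current Selenium with headless Chrome; the entry URL, the link text, and the output name are placeholders, and Chrome's DevTools printToPDF command is assumed to be available):

```python
# The entry URL, the link text and the output filename are placeholders.
# Uses current Selenium with headless Chrome and the DevTools printToPDF
# command, so chromedriver + Chrome need to be installed.
import base64

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://database.example.org/start")               # hypothetical entry page
driver.find_element(By.LINK_TEXT, "Quarterly report").click()  # "click the right tab"

# Print whatever is currently loaded to PDF; the address bar never changed,
# but the page content did.
result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
with open("quarterly_report.pdf", "wb") as f:
    f.write(base64.b64decode(result["data"]))

driver.quit()
```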

What platform should I use to have this mini application / script developed?

Can somebody help me decide which platform is ideally suited for this kind of application? Ideally, I would like the solution to be function-call oriented, so that if I have to add a webpage a month after it is developed, I do not have to go running back to the developer for such "configuration" changes.
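
By "configuration" I mean something like the following sketch: the list of pages lives in a plain JSON file and the script just loops over it, so adding a page later only means editing that file (the file name and the keys are arbitrary choices here):

```python
# The file name and the keys are arbitrary choices for this sketch.
# pages.json might look like:
# [
#   {"title": "Article about X", "url": "https://example.org/x"},
#   {"title": "Database entry Y", "url": "https://example.org/y"}
# ]
import json

def save_page_as_pdf(title, url):
    """Placeholder for whatever fetch-and-print routine ends up being used."""
    print(f"would fetch {url} and save it as {title}.pdf")

with open("pages.json") as f:
    pages = json.load(f)

for entry in pages:
    save_page_as_pdf(entry["title"], entry["url"])
```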

The platform does not have to be Unix, although I think using a Unix platform creates the maximum flexibility. I can run it off my Mac, or off a host online, or on my Raspberry Pi :)

Thank you!!


Update:

I just heard from an IT-savvy friend that http://seleniumhq.org/ or http://scrapy.org/ may be good options. I will study them as well.

jim70

2 Answers


OK, so I did some research after I received the link to Scrapy and realised that what I am describing is a web scraper. For anybody else who may care, here is some of the information I collected.

Still not sure how to proceed, but it sounds like BeautifulSoup and Mechanize might be the easiest way forward. Twill also looks quite good because of its simplicity. Any thoughts?
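
For anyone else reading, this is roughly what I understand the Mechanize + BeautifulSoup combination to look like (a sketch only; the login URL, the form position, and the field name are invented):

```python
# Sketch only: the login URL, the form position and the field name are invented.
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)      # many catalogue sites have no useful robots.txt
br.open("https://library.example.org/login")

br.select_form(nr=0)             # assume the login form is the first form on the page
br["cardnumber"] = "YOUR_CARD_NUMBER"
response = br.submit()

soup = BeautifulSoup(response.read(), "html.parser")
print(soup.title.string if soup.title else "no <title> found")
```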


Compilation of links from my research

A presentation: Overview of Python web scraping tools http://www.slideshare.net/maikroeder/overview-of-python-web-scraping-tools

mechanize http://wwwsearch.sourceforge.net/mechanize/

Beautiful Soup: We called him Tortoise because he taught us. http://www.crummy.com/software/BeautifulSoup/

twill: a simple scripting language for Web browsing http://twill.idyll.org/

Selenium - Web Browser Automation http://seleniumhq.org/

PhantomJS: Headless WebKit with JavaScript API http://phantomjs.org/


Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).

Twill is a simple scripting language built on top of Mechanize

BeautifulSoup + urllib2 also works quite nicely.

Scrapy looks like an extremely promising project; it's new.

Anyone know of a good Python based web crawler that I could use? - Stack Overflow https://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use


PycURL Home Page http://pycurl.sourceforge.net/


Scrapy assessment - it seems BeautifulSoup + Mechanize may be simpler (my comment from here), with eventlet to get concurrency

python - Is it worth learning Scrapy? - Stack Overflow https://stackoverflow.com/questions/6283271/is-it-worth-learning-scrapy


Refine, reuse and request data | ScraperWiki https://scraperwiki.com/


jim70

I've always used LWP (libwww-perl) or WWW::Mechanize for jobs like this - there are several kinds of programming tasks I'd use python for, but I prefer perl for anything involving text processing.

Probably the most complicated one I wrote was several years ago when my partner and I owned a small bookshop - she needed a program to extract information about books from a book distributor's web site (keyed on ISBN or barcode) and insert relevant details into her (postgresql) stock database.

Note that writing web-scrapers can be tedious and time-consuming - you spend a lot of time reading the HTML source of various web pages and figuring out how to identify and extract only the information you're looking for.

It's not particularly difficult but it does require a good knowledge of HTML and at least mid-level programming skills.

It is likely that you'll have to write a different scraper for each database site, rather than one that does them all - although you could then write a wrapper script that either included them as functions or called separate scripts depending on the site.
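
As a sketch of that wrapper idea (written in Python here to match the rest of the thread, though the same shape works in Perl; the site names and scraper functions are invented):

```python
# Site names and scraper functions are invented; each real scraper would do
# the site-specific navigation and PDF saving.
from urllib.parse import urlparse

def scrape_site_a(url):
    print(f"site-A scraper would handle {url}")

def scrape_site_b(url):
    print(f"site-B scraper would handle {url}")

# Dispatch table: pick the scraper based on the hostname of the requested URL.
SCRAPERS = {
    "db-a.example.org": scrape_site_a,
    "db-b.example.org": scrape_site_b,
}

def scrape(url):
    host = urlparse(url).hostname
    if host not in SCRAPERS:
        raise ValueError(f"no scraper written for {host}")
    SCRAPERS[host](url)

scrape("https://db-a.example.org/record/123")
```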

Web sites tend to change, too. A scraper that's been working perfectly well for six or twelve months may suddenly stop working because the site has been redesigned and it no longer works the way your script expects it to.

So, if any of the databases have some kind of API for programmatic access (e.g. using REST or SOAP or even RSS) then use that rather than scraping HTML. Unfortunately, this is quite unlikely for the kind of databases available through libraries (the db owners tend to have pre-web attitudes to data and are more interested in controlling and restricting access than anything else). They don't want to make it easy for anyone to access their data with a program rather than a browser, and some go to significant effort to obfuscate their sites, making the HTML difficult to understand or requiring a javascript interpreter to extract links and other data.

For a good example of this, look no further than TV listing web sites - some of them really do not want people using their data to automate recording schedules for programs like MythTV, so there's an ongoing technology war between the site developers and site-scraper authors.

There are javascript interpreters for perl (including one for use with WWW::Mechanize, called WWW::Scripter), but sometimes it's easier to examine the site's javascript code yourself, figure out what it's doing to obfuscate the HTML, and write perl code to de-obfuscate the data without a js interpreter.

cas
  • Craig, thanks a lot for the detailed explanation of my options! At this point, I am not looking to extract data and data tables. I am simply thinking of printing the entire pages to PDFs for now. Data extraction will likely break more frequently because of how websites change format, and fixing those breakages will take time that I will not be able to spend. On the other hand, changing or adding the pages that I want included in a PDF download should be possible even for me. So is Python better for this application? – jim70 Sep 09 '12 at 15:37
  • 1. Whichever language you are most comfortable with is 'better'; both perl and python have more than adequate web-scraping libraries available (but IMO it's a task that falls more naturally into perl's problem domain). 2. Even with just printing entire pages to PDF, you'll still need to parse the HTML, just to find the A HREF links (and possibly de-obfuscate them too). 3. It sounds like Selenium may be a better fit for your needs than either perl or python, although depending on exactly how Selenium works you may have to settle for one PDF file per page (just a guess, I have no experience with it).
  • – cas Sep 09 '12 at 22:12
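
To illustrate the point in the comment above about still needing to parse the HTML for the A HREF links, here is a minimal BeautifulSoup sketch (the URL is a placeholder; requests and beautifulsoup4 are assumed to be installed):

```python
# The URL is a placeholder; requests and beautifulsoup4 are assumed installed.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://database.example.org/index").text
soup = BeautifulSoup(html, "html.parser")

# Print every link target and its visible text, which is the raw material
# for deciding which pages to fetch and print to PDF.
for a in soup.find_all("a", href=True):
    print(a["href"], "->", a.get_text(strip=True))
```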