Automating tasks on a website on a headless server

Question

I want to automate, from my Debian server, a task that can only be done on a website (after logging in). There is no public API available, so I can't use one.

Is there a way to do so? I thought about a text-based browser or something similar.

We would need to know how the login is performed, what the task is, and what the webpage looks like (simple HTML, PHP, JavaScript, etc.). Your question is not answerable in its current form. — terdon, May 24 '13 at 11:11
@terdon https://play.google.com/apps/publish ;) I guess the issue is that uploading a file is involved. — Leandros, May 24 '13 at 11:15
curl has file upload posts, why not just cron a curl script? — lynks, May 24 '13 at 11:40
@Leandros yep, you can write a shell script that uses curl, wget etc, to interact with a website in just about any way you need. The only caveat of course would be with CAPTCHAs. — lynks, May 24 '13 at 13:03
@lynks There aren't any captchas, I just need to login. Could this be an issue? — Leandros, May 24 '13 at 15:15

score 4 · Answer 1 · answered May 24 '13 at 11:12


Have a look at WWW::Mechanize (examples at http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize/Examples.pod). It represents the web page as an object and makes its elements accessible via methods.

For example

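use WWW::Mechanize;
my $m = WWW::Mechanize->new;    # $listname and $password are set elsewhere
$m->get("https://lists.ccs.neu.edu/bin/admindb/$listname");
$m->set_visible( $password );   # fill the first visible field of the form
$m->click;                      # submit the form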

There are ports for (at least) Ruby and Python, too.


krissi


You can use the Firefox Live HTTP Headers plugin to record a session between your browser and the website, so that you understand the full interaction, too. Which cookies do you need to store and present? Which forms are submitted, with which variables and hidden variables? Once you have all that, it should be achievable to automate the task. – Tim Kennedy May 24 '13 at 12:30
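
To make the "hidden variables" part concrete, here is a minimal sketch (Python 3, standard library only) that lists the fields of every form on a page, hidden ones included. The sample HTML, its field names and the `extract_forms` helper are made up for illustration; a real page would come from the session you recorded.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the action, method and named <input> fields of each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> encountered
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:  # inputs without a name are not submitted
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Return a list of form descriptions found in the given HTML text."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.forms


# Hypothetical login page, only here to show the idea.
SAMPLE_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""

if __name__ == "__main__":
    for form in extract_forms(SAMPLE_PAGE):
        print(form["method"], form["action"], form["fields"])
```

The hidden values found this way (here the made-up `csrf_token`) usually have to be sent back together with the username and password, otherwise the login request is likely to be rejected.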

Anthon · Accepted Answer · 2013-05-25T13:11:41.063

You can run Selenium on a headless installation on your server, e.g. by programming the actions in Python using pyvirtualdisplay.

pyvirtualdisplay lets you use an Xvfb, Xephyr or Xvnc display, so you can take screenshots (or connect remotely to see what is going on).

On Ubuntu 12.04 install:

sudo apt-get install python-pip tightvncserver xtightvncviewer
sudo pip install selenium pyvirtualdisplay

and run the following (this uses the newer Selenium 2 API; the older API is still available as well):

import subprocess
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By

def browse_it(port=None):
    browser = webdriver.Firefox()
    browser.get('http://unix.stackexchange.com/questions')
    # print the title of every question listed on the page
    for question in browser.find_elements(By.CLASS_NAME, 'question-hyperlink'):
        print(question.text)
    if port:
        print('--------\nconnect using:\n  vncviewer '
              'localhost:{}\nand click the xmessage to quit'.format(port))
        # block here until the xmessage window is dismissed
        subprocess.call(['xmessage', 'click to quit'])
    browser.quit()

def browse_it_hidden(rfbport=5904):
    # run the browser inside a VNC-backed virtual display
    with Display(backend='xvnc', rfbport=str(rfbport)) as disp:
        browse_it(rfbport)

if __name__ == '__main__':
    browse_it_hidden()

The xmessage call keeps the browser from quitting until you click it; in automated test runs you would not want this. You can also call browse_it() directly to test in the foreground.

The elements returned by Selenium's find_element.....() methods do not offer things like selecting the parent of an element you just found, something you might expect from HTML parsing packages (I read somewhere that this is on purpose). These limitations can be a hassle if you scrape pages you have no control over. When testing your own site, just make sure every element you want to test is generated with an id or a unique class, so it can be selected without hassle.

This sounds interesting, is there documentation for getting started? — Leandros, May 24 '13 at 15:34
@Leandros I extended this with an example. When browsing for more, look for examples using the newer Selenium 2 API, like the one I used here. There are an awful lot of Selenium (1) examples around that still work, but they take somewhat more effort to set up. — Anthon, May 25 '13 at 13:03

score 1 · Answer 3 · answered May 24 '13 at 12:15

You could use either of:

Perl with WWW::Mechanize, or even roll your own using their HTTPClient
Selenium/WebDriver
a Google Chrome or Firefox Extension (existing or one that you write)
a shell script using curl and wget (you'll need to save and resend session data)
HtmlUnit
...

Basically any language that lets you query a networked resource would do...
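
As a rough illustration of the "save and resend session data" point, here is a sketch using only Python 3's standard library. The opener keeps the cookies it receives at login and sends them along with later requests. The helper names (`make_session`, `post_form`, `fetch`), the form field names and the example.com URLs in the comments are assumptions for illustration, not something taken from a particular site. It only covers plain url-encoded forms; a file upload would need a multipart/form-data body instead.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def post_form(opener, url, fields):
    """POST url-encoded form fields and return the response body as text."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(url, data=data, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch(opener, url):
    """GET a page with the cookies collected so far."""
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage; URLs and field names depend entirely on the site:
#
#   opener, jar = make_session()
#   post_form(opener, "https://example.com/login",
#             {"username": "me", "password": "secret"})
#   page = fetch(opener, "https://example.com/task")
```

Run from cron, a script like this does roughly what a curl script with a cookie file would do.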
