2

What is the best way to continuously monitor a web server (HTTP) and download newly added files with minimal delay (ideally < 1 min)?

Ryan
  • 423
  • Are you interested in monitoring in the sense of checking the server is up, and complaining to the admin (you) if it isn't, OR only in looking for new files as they appear? – ilkkachu Aug 24 '16 at 19:28
  • Giving an idea: first download the list of files from the server with curl <ip_address>. Most probably, it will give you the list of contents uploaded on that server. After a given timeout, run the same command again and check for differences. [It will only work if the server is missing index.html] – SHW Aug 25 '16 at 07:45

5 Answers

3

Monitoring

First of all, for monitoring I recommend Nagios: the core is free and open source, but if you need the GUI you have to pay for it, and it's worth the money.

You can also use Icinga, PRTG, or whatever suits you best.

Collectd (the collection daemon) is also a free monitoring tool; you can install it with yum on RHEL derivatives or apt-get on Debian-based ones. You can read this paper if you want to go with Collectd.
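
For example, installing collectd is usually a one-liner; the exact package name and repository may vary (on RHEL-like systems it typically comes from EPEL):

# RHEL/CentOS and derivatives (run as root; package may come from EPEL)
yum install collectd

# Debian/Ubuntu and derivatives (run as root)
apt-get install collectd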

Running a task every x < 1 minute

For the second part of your question, running a job periodically every x where x is less than a minute: as you know, you cannot do that with cron jobs directly, but you can use some of the tricks explained by Gilles in this question to do what you want.

It's better to have a script that does what you need and runs forever, starting it at boot if necessary. It can be as simple as:

while true; do yourJob; sleep someTime; done

Or you can even go with some more complex scripts depending on what you need.
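
As a minimal sketch of such a loop (the URL and target directory below are placeholders), you could use wget with -N (timestamping), which only re-downloads a file when the server reports a newer Last-Modified date:

#!/bin/sh
# Poll the server every 30 seconds and fetch the file only when it changed.
# http://example.com/files/report.csv and /srv/mirror are placeholders.
while true; do
    wget -q -N -P /srv/mirror http://example.com/files/report.csv
    sleep 30
done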

You can also use watch command. For instance:

watch -n1  command

It will run your command every second, forever.

As you might have guessed, you can also run your shell script with watch if all you need is a simple script running every x seconds (x less than a minute), rather than something more complex.

The choice is yours.

1

This depends on a couple of factors.

If you have control over the web server, the easiest thing to do would be to install a (RESTful?) service providing the number of files changed since the last check or download. This minimizes data transfer as well as load on both client and server; even more so if the upload/modification of files on the server can be tracked directly, e.g. in the upload script, instead of relying on the file system.
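
As a sketch of what the client side could look like, assuming a hypothetical /changed-since endpoint that returns one changed file name per line (endpoint, URL, and paths are placeholders):

# Hypothetical endpoint and paths; adapt to whatever service you install.
SINCE=$(date -u -d '1 minute ago' +%Y-%m-%dT%H:%M:%SZ)   # GNU date
curl -s "http://example.com/changed-since?ts=$SINCE" |
while read -r name; do
    curl -s -o "/srv/mirror/$name" "http://example.com/$name"
done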

For the latter, I'd look into a file-monitoring solution such as famd.

If you have no control over the server, then you have to detect modifications before being able to download them. The easiest thing would be to use a web mirroring utility such as w3mir, since these already take care of checking/supplying the ETag and Last-Modified / If-Modified-Since headers. This means that you'll have to issue fewer calls, and therefore be able to run the utility more often.
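
You can see the same mechanism with plain curl (URL is a placeholder): -R stores the server's Last-Modified time on the local copy, and -z sends If-Modified-Since, so the server answers 304 with no body when nothing has changed:

# First fetch: store the body and keep the server's timestamp on it.
curl -s -R -o page.html http://example.com/page.html
# Later polls: the body is only transferred when the server copy is newer.
curl -s -R -o page.new -z page.html http://example.com/page.html
[ -s page.new ] && mv page.new page.html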

As to how to run the utility, it depends on where it runs. You can use a cron job on a Unix machine, or just run it in a loop.

If you do the former, however, you'd be well advised to use some sort of semaphore to keep a mirroring process from starting before the previous instance has terminated. It can be as simple as creating a lock file:

# Skip this run if a previous instance is still holding the lock.
if [ -r /tmp/mirror.lock ]; then
    echo "lock file found" | logger -t webmirror
    exit 0
fi
touch /tmp/mirror.lock
...whatever...
rm /tmp/mirror.lock

But you'll also have to catch any signal that might kill your script; otherwise, in case of a temporary error, the lock file might be left behind and keep all further instances from running even after the error has been resolved.

Or you could check that the lock file isn't older than some reasonable amount and delete it if it is; or you could check how many instances of the script ps reports (normally one, the current; if more, the current one had better abort) and do without the lock file altogether.
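
A sketch of the signal-catching variant of the lock file above (same hypothetical lock path):

#!/bin/sh
LOCKFILE=/tmp/mirror.lock

if [ -r "$LOCKFILE" ]; then
    echo "lock file found" | logger -t webmirror
    exit 0
fi

# Clean up on normal exit and on the usual fatal signals, so a crashed run
# cannot block every later run.
trap 'rm -f "$LOCKFILE"' EXIT HUP INT TERM

touch "$LOCKFILE"
...whatever...

Alternatively, flock(1) from util-linux avoids the stale-lock problem entirely, since the kernel releases the lock the moment the process dies; the whole script then collapses to something like flock -n /tmp/mirror.lock yourMirrorCommand.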

LSerni
  • 4,560
0

You could do a simple curl command in a cron job, but I recommend you start using a monitoring solution with web monitoring capabilities. There are plenty of free ones: just google "open source web monitoring solutions" and you'll find many options!
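
A minimal sketch of the cron approach, with placeholder URL and path; note that cron cannot run more often than once a minute, which is right at the edge of the "< 1 min" requirement in the question:

# crontab entry: every minute, quietly fetch the page to a local file
* * * * * curl -s -o /var/tmp/latest.html http://example.com/latest.html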

0

If you are really looking at files, then you can do a HEAD request on the URL and the server should return a key (the 'ETag') that will tell you if the file has changed. On an Apache server this is based on the ctime of the file, so the ETag may change even if the file did not.
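
You can check this from the command line; -I makes curl send a HEAD request (the URL is a placeholder):

# Show only the caching-related headers from the HEAD response.
curl -sI http://example.com/some/file | grep -iE '^(etag|last-modified):'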

But since the network is likely to be more costly than writing to disk, if you download the contents of the file you might as well just store it to disk.

You don't say how many files there are or how large they are. If there are a large number of files, or the files take a very long time to download, or you want to put the minimum amount of load on the server, then the script below should be changed so that each query happens once a minute, or as often as possible if a download takes more than a minute.

Below is a very simple Ruby script that will do what I think you want:

#!/usr/bin/env ruby

require 'net/https'
require 'fileutils'

def main(roots, **options)
  cache = Hash.new # remembers the last ETag seen for each URI
  ok = true
  path = options[:path]
  while (ok)
    roots.each do |root|
      uri = URI.parse(root)
      http = Net::HTTP.new(uri.host, uri.port)
      case uri.scheme
      when 'https'
        http.use_ssl = true
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      when 'http'
      else
        raise "unknow type #{uri.to_s}"
      end

      # If we've seen this URI before, do a cheap HEAD request first and skip
      # the GET when the ETag has not changed.
      need_get = true
      if (c = cache[uri.request_uri])
        response = http.request(Net::HTTP::Head.new(uri.request_uri))
        if response.code.to_i == 200
          if response['etag'] == c['etag']
            need_get = false
          end
        end
      end

      if need_get
        response = http.request(Net::HTTP::Get.new(uri.request_uri))
        cache[uri.request_uri] = { 'etag' => response['etag'] }
        filename = File.join(path, uri.request_uri)
        FileUtils.mkdir_p(File.dirname(filename)) # create sub-directories as needed
        need_write = true
        if File.exist?(filename)
          # you could check if the file changed here, but it does not save you much.
        end
        if need_write
          File.open(filename, 'w') { |file| file.write(response.body) }
        end
      end
    end
    sleep 30 # wait half a minute before polling again
  end
end

begin
  main(['http://example.com/ten.html', 'http://example.com/eleven'], path: "/tmp/downloaded_files")
rescue => error
  puts error
end
gam3
  • 316
  • I like this solution, but I don't know Ruby. Does it do a recursive crawl and fetch? For caching, can you create an ETag-like key if it is missing, e.g. from file size and last-modified datetime? – Massimo Aug 26 '16 at 08:24
  • It does not crawl, but it could. Almost all the load is in doing a GET rather than a HEAD. Once you have done the GET, it is trivial to compare the new data with the current file. I would suggest that you run this script on some of the links/sites you want to watch and see if they are supplying ETag data. – gam3 Aug 26 '16 at 09:53
0

As FarazX said, there are several monitoring solutions like Nagios, Pandora FMS, and so on, but maybe these tools are too big for your purpose. Perhaps UptimeRobot is enough for you.

Take a look at the proposals and choose the best one for you, but bear in mind that a monitoring solution with more options gives you more possibilities for your environment.