I am trying to mirror a site to archive.org, but using curl is very slow, so I wanted to try aria2 instead.

I first make a link map of the site using this command

wget -c -m --restrict-file-names=nocontrol https://www.example.com/

and then run this command using curl

find . -type f -exec curl -v "https://web.archive.org/save/https://{}" ';'

For reference, the command I actually ran was this variant, which also keeps a good enough log of what I was doing:

find . -type f -exec curl -v "https://web.archive.org/save/https://{}" ';' 2> >(grep 'Rebuilt URL' >>/tmp/error ) >/tmp/stdout

This was working fine, the find-command produced output such as

./www.example.com/index

and curl magically ignored the leading ./

Well, Aria2 wasn't so smart. This command

find . -type f -exec aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}" ';'

led to this error:

07/24 23:40:45 [ERROR] CUID#7 - Download aborted. URI=https://web.archive.org/save/https://./www.example.com/index

(Note the extra ./ in the middle of the URL).

I then found this question that helped me modify the output from find

find . -type f -printf '%P\n'

returns

www.example.com/index

(no leading ./)

However, when I feed this to aria2, the concatenated URL still contains ./ in the middle!?

find . -type f -printf '%P\n' -exec aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}" ';'

gives this error message

www.example.com/index

07/24 23:52:34 [NOTICE] Downloading 1 item(s)
[#d44753 0B/0B CN:1 DL:0B]                                                                                     
07/24 23:52:35 [ERROR] CUID#7 - Download aborted. URI=https://web.archive.org/save/https://./www.example.com/index
Exception: [AbstractCommand.cc:351] errorCode=29 URI=https://web.archive.org/save/https://./www.example.com/index
  -> [HttpSkipResponseCommand.cc:232] errorCode=29 The response status is not successful. status=502

07/24 23:52:35 [NOTICE] Download GID#d44753fe24ebf448 not complete: 

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
d44753|ERR |       0B/s|https://web.archive.org/save/https://./www.example.com/index

How do I get rid of the ./ so aria2 is fed proper and correct URLs?

Bonus questions:

  1. It would be great if I could (re)move the pages after processing their URL. That is, move index from ./www.example.com/index to ./processed/www.example.com/index. How do I do that? With something in the -exec of the find command? Or does that require a full-fledged script?

  2. What are the optimal settings for aria2 for this purpose?

d-b

2 Answers

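The last one doesn't work because -exec is independent of -printf: -printf only writes the shortened path to standard output, while {} in -exec is still replaced by the full pathname, including the leading ./.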

But you can use xargs instead of -exec:

find . -type f -printf '%P\n' \
    | xargs -I{} aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}"

You can also let multiple aria2c instances run in parallel with xargs -P <num>.
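
As a rough sketch of the parallel variant, the function below only prints the aria2c commands that would run, so the URLs can be checked before anything is submitted. The function name preview_jobs and the value 4 for -P are illustrative assumptions, not recommendations, and -printf and -d are GNU find/xargs extensions.

```shell
# preview_jobs: print the aria2c commands that would run, one per regular
# file under the directory given as $1 (default: current directory).
# -P 4 lets xargs run up to four commands at once (arbitrary example value).
# -d '\n' makes xargs treat each input line as one item, so spaces or
# quotes in file names are passed through unchanged.
# 'echo' turns this into a dry run.
preview_jobs() {
    find "${1:-.}" -type f -printf '%P\n' \
        | xargs -d '\n' -P 4 -I{} \
            echo aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}"
}
```

Running preview_jobs . in the mirror directory lists the commands; removing the echo would make xargs actually start them, up to four at a time.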


An even better option would be to skip the pipe and xargs altogether and use process substitution (available in bash and zsh, not in plain sh), so that aria2c reads the output of find as its input file (-i):

aria2c -x 16 -s 1 -i <(find . -type f -printf 'https://web.archive.org/save/https://%P\n')
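
If process substitution is not available, or you want to look at the URLs first, the same list can be written to an ordinary file. This is only a sketch: the function name make_url_list and the paths in the example are made up for illustration, and -printf is specific to GNU find.

```shell
# make_url_list: write one Save Page Now URL per regular file found under
# directory $1 into the file named by $2.
# Keep $2 outside of $1, otherwise the list file itself shows up in the list.
make_url_list() {
    find "$1" -type f -printf 'https://web.archive.org/save/https://%P\n' > "$2"
}

# Example (not run here; the paths and the -j value are placeholders):
#   make_url_list . /tmp/urls.txt
#   aria2c -x 16 -s 1 -j 4 -d /tmp/aria2-out -i /tmp/urls.txt
```

In the example, -j sets how many URLs from the list aria2c handles at the same time, and -d sends whatever aria2c downloads to a separate directory so it does not land in the mirror tree.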
pLumo

Adding -printf just produces output; it does not modify what {} is replaced by.

It seems curl is a bit smarter (or, alternatively, applies more magic) than aria2 and removes the ./ itself. The initial ./ in the found pathname is there because find prefixes every pathname with the top-level directory that you start the search from, which here is . (the current directory).

To call aria2 or curl with a URL that does not contain the initial ./, use

find . -type f -exec sh -c '
    for pathname do
        pathname=${pathname#./}
        aria2c -x 16 -s 1 "https://web.archive.org/save/https://$pathname"
    done' sh {} +

This will call a child shell with a batch of found pathnames. The child shell loops over these and removes the initial ./ from each one using a standard parameter expansion before calling, in this case, aria2c.
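
To see the parameter expansion in isolation, here is a tiny sketch. The helper name strip_dot_slash exists only for this illustration.

```shell
# strip_dot_slash: print $1 with one leading "./" removed, if present.
# ${1#./} deletes the shortest match of the pattern "./" from the start of
# the value; anything that does not start with "./" is left alone.
strip_dot_slash() {
    printf '%s\n' "${1#./}"
}

strip_dot_slash ./www.example.com/index   # prints: www.example.com/index
strip_dot_slash www.example.com/index     # prints: www.example.com/index
```

Only a prefix is affected, so a ./ in the middle of a pathname stays as it is.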

In general:

topdir=/some/directory/path  # no '/' at the end

find "$topdir" -type f -exec sh -c '
    topdir="$1"; shift
    for pathname do
        pathname=${pathname#"$topdir"/}  # quoted so $topdir is taken literally, not as a pattern
        aria2c -x 16 -s 1 "https://web.archive.org/save/https://$pathname"
    done' sh "$topdir" {} +
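
For the first bonus question, the same idea could be extended so that a file is moved only after its URL was submitted successfully. This is a sketch under assumptions: submit_url and process_file are hypothetical helper names, the processed/ directory is taken from the question, and whether aria2c's exit status is a good enough success signal for the Wayback Machine is something to check yourself.

```shell
# submit_url: hypothetical wrapper around the aria2c call used above.
# $1 is a pathname without the leading "./".
submit_url() {
    aria2c -x 16 -s 1 "https://web.archive.org/save/https://$1"
}

# process_file: submit one found pathname and, only if that succeeds,
# move the file to the same relative location under processed/.
process_file() {
    pathname=${1#./}
    submit_url "$pathname" || return 1
    mkdir -p "processed/$(dirname "$pathname")" &&
        mv -- "$pathname" "processed/$pathname"
}
```

One way to drive it from bash, skipping files that were already moved: find . -path ./processed -prune -o -type f -print0 | while IFS= read -r -d '' f; do process_file "$f"; done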

Kusalananda