I am trying to mirror a site to archive.org, but using curl is very slow, so I wanted to try aria2 instead.
I first make a link map of the site using this command
wget -c -m --restrict-file-names=nocontrol https://www.example.com/
and then run this command using curl
find . -type f -exec curl -v "https://web.archive.org/save/https://{}" ';'
(Actually I used this command to get a good enough log of what I was doing
find . -type f -exec curl -v "https://web.archive.org/save/https://{}" ';' 2> >(grep 'Rebuilt URL' >>/tmp/error ) >/tmp/stdout
- included it here for reference)
This was working fine. The find command produced output such as
./www.example.com/index
and curl magically ignored the leading ./
Well, aria2 wasn't so smart. This command
find . -type f -exec aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}" ';'
led to this error:
07/24 23:40:45 [ERROR] CUID#7 - Download aborted. URI=https://web.archive.org/save/https://./www.example.com/index
(Note the extra ./ in the middle of the URL.)
I then found this question that helped me modify the output from find
find . -type f -printf '%P\n'
returns
www.example.com/index
(no leading ./)
However, when feeding this to aria2, the concatenated URL still contains ./ in the middle!
find . -type f -printf '%P\n' -exec aria2c -x 16 -s 1 "https://web.archive.org/save/https://{}" ';'
gives this error message
www.example.com/index
07/24 23:52:34 [NOTICE] Downloading 1 item(s)
[#d44753 0B/0B CN:1 DL:0B]
07/24 23:52:35 [ERROR] CUID#7 - Download aborted. URI=https://web.archive.org/save/https://./www.example.com/index
Exception: [AbstractCommand.cc:351] errorCode=29 URI=https://web.archive.org/save/https://./www.example.com/index
-> [HttpSkipResponseCommand.cc:232] errorCode=29 The response status is not successful. status=502
07/24 23:52:35 [NOTICE] Download GID#d44753fe24ebf448 not complete:
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
d44753|ERR | 0B/s|https://web.archive.org/save/https://./www.example.com/index
How do I get rid of the ./ so that aria2 is fed proper, correct URLs?
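For what it's worth, my understanding is that -printf only changes what find prints, while the {} placeholder in -exec still expands to the path exactly as find found it, ./ included. Below is a minimal sketch of one possible workaround, assuming a POSIX shell: strip the prefix with the ${1#./} parameter expansion before building the URL. The save_url helper and the throwaway demo tree are hypothetical names made up for illustration, and the sketch only prints the URLs rather than contacting archive.org.

```shell
#!/bin/sh
# Sketch only: build the save URL from a find path with the leading "./" removed.

# save_url is a hypothetical helper; ${1#./} drops one leading "./" if present.
save_url() {
    printf 'https://web.archive.org/save/https://%s\n' "${1#./}"
}

# Throwaway demo tree standing in for the wget mirror (an assumption).
demo=$(mktemp -d)
mkdir -p "$demo/www.example.com"
: > "$demo/www.example.com/index"

# Print each URL; file names containing newlines would break this simple loop.
(cd "$demo" && find . -type f) | while IFS= read -r path; do
    url=$(save_url "$path")
    echo "$url"
done

rm -rf "$demo"
```

To actually submit the pages, the echo line could presumably be swapped for the same call as above, aria2c -x 16 -s 1 "$url"; I have not verified how archive.org responds to that.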
Bonus questions:
It would be great if I could (re)move the pages after processing their URL. That is, move index from ./www.example.com/index to ./processed/www.example.com/index. How do I do that? Something in the exec of the find command? Or does that require a full-fledged script?
What are the optimal settings for aria2 for this purpose?