I need to download hundreds of JSON files from a list of URLs stored in a CSV. As it would be impractical to do so manually, I wrote a short bash script that reads the URLs one by one using Python and passes them to curl for download. In addition, with the aid of tee, the script saves the downloaded data to files, naming each one with a standard code. The codes are stored in the same CSV, and the files are stored in a separate folder named 'Data':
    output_dir=Data
    python pass_urls_to_std_out.py | xargs curl -k | tee $output_dir/$(python pass_cd_to_std_out.py) > /dev/null
pass_urls_to_std_out.py is just a short Python script that reads the URLs and writes them one by one to standard output. pass_cd_to_std_out.py is an analogous script that reads the codes one by one. The > /dev/null at the end only keeps tee from printing all of curl's output on screen.
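For reference, the first helper amounts to printing a single column of the CSV, something a plain shell tool could also do (a sketch only; the filename urls.csv and the column number are hypothetical, since the CSV layout is not shown here):

    # Hypothetical stand-in for pass_urls_to_std_out.py, assuming the URL
    # is the second comma-separated field of urls.csv:
    cut -d, -f2 urls.csv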
Unfortunately, the script works fine for small sets of files, but as the number of files to download grows, I get these error messages:
    tee: 4208005.json: Too many open files
    tee: 4110607.json: Too many open files
    tee: 4304903.json: Too many open files
    tee: 4217303.json: Too many open files
    tee: 4212809.json: Too many open files
    tee: 4214003.json: Too many open files
    tee: 4208302.json: Too many open files
    tee: 4203501.json: Too many open files
    ....
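The limit being hit is the per-process cap on open file descriptors, since tee tries to open every destination file simultaneously. The cap can be checked with ulimit (1024 is a common default on Linux):

    # Show the maximum number of files this shell's processes may hold open.
    ulimit -n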
Is there a way to redirect output sequentially, to one file at a time (or 10 or 20 at a time), without tee trying to open all the files at once?
[Edit] As Kamil Maciorowski correctly pointed out, as I wrote it, the output of pass_cd_to_std_out.py is not passed to tee one argument at a time; it is expanded and passed as multiple arguments all at once.
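In other words, the command substitution runs once, before tee starts, so the pipeline effectively executed a single tee holding every filename (an illustrative expansion using codes from the error messages above; note that only the first name gets the Data/ prefix):

    python pass_urls_to_std_out.py | xargs curl -k |
        tee Data/4208005.json 4110607.json 4304903.json > /dev/null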
I rewrote the script as a for loop:
    #!/bin/bash
    output_dir=Data
    for url in $(eval python pass_urls_to_std_out.py); do
        curl -k $url > $output_dir/$(python pass_cd_to_std_out.py)
    done
Unfortunately, $output_dir is evaluated only once, and thus the output looks like:
    Data/1200401.json
    4205407.json
    4106902.json
    2304400.json
    3304557.json
    3205309.json
    1600303.json
    1400100.json
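Reading the two streams in lockstep would avoid this, since each iteration opens exactly one output file. A minimal sketch, assuming both helper scripts print one item per line in matching order:

    #!/bin/bash
    output_dir=Data

    # Pair each URL with its code on one tab-separated line, then read
    # them together; the redirection opens a single file per iteration.
    paste <(python pass_urls_to_std_out.py) <(python pass_cd_to_std_out.py) |
    while IFS=$'\t' read -r url code; do
        curl -k "$url" > "$output_dir/$code"
    done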
$output_dir/$(python pass_cd_to_std_out.py) will expand to multiple arguments. Problems: (1) globbing. (2) Data/ will be prepended to the first one only. (3) Regardless of the number of arguments, there is exactly one tee, so it will try to save identical data to all destinations. Is this what you want? // All this makes me suspect "the script works fine for small sets of files" is not really true, unless the cardinality of each of your "small sets" is one. – Kamil Maciorowski Nov 28 '22 at 14:47

[…] curl and use --output-dir=Data with -O to save all the output in files in the Data directory? – Kusalananda Nov 29 '22 at 10:25
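Kusalananda's suggestion would look roughly like this; it is a sketch that assumes each URL's remote filename is already the desired code, since -O names every download after the remote file (--output-dir requires curl 7.73.0 or newer):

    # One curl invocation per URL; each file lands in Data/ under its remote name.
    python pass_urls_to_std_out.py |
        xargs -n 1 curl -k --create-dirs --output-dir Data -O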