
I need to download hundreds of JSON files from a list of URLs stored in a CSV. Since it would be impractical to do so manually, I wrote a short bash script that reads the URLs one by one using Python and passes them to curl to download. In addition, with the aid of tee, the script saves the downloaded data to files, naming each one with a standard code. The codes are stored in the same CSV and the files are stored in a separate folder named 'Data':

output_dir=Data
python pass_urls_to_std_out.py | xargs curl -k | tee $output_dir/$(python pass_cd_to_std_out.py) > /dev/null

pass_urls_to_std_out.py is just a short Python script that reads the URLs and writes them one by one to standard output.

pass_cd_to_std_out.py is an analogous script that reads the codes one by one.

> /dev/null is used just to avoid tee printing all the output of curl on screen.
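
For context, each helper simply prints one field of the CSV per line. A rough shell equivalent, assuming a file named list.csv with the URL in the first column and the code in the second (the file name and column order here are assumptions), would be:

cut -d, -f1 list.csv    # like pass_urls_to_std_out.py: one URL per line
cut -d, -f2 list.csv    # like pass_cd_to_std_out.py: one code per line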

Unfortunately, the script works fine for small sets of files, but when the number of files I'm trying to download increases, I get these error messages:

tee: 4208005.json: Too many open files
tee: 4110607.json: Too many open files
tee: 4304903.json: Too many open files
tee: 4217303.json: Too many open files
tee: 4212809.json: Too many open files
tee: 4214003.json: Too many open files
tee: 4208302.json: Too many open files
tee: 4203501.json: Too many open files
....

Is there a way to sequentially redirect output to one file at a time (or 10, 20 at a time), without tee trying to open all the files at once?

[Edit] As Kamil Maciorowski correctly pointed out, as I wrote it, the output of pass_cd_to_std_out.py is not passed to tee one item at a time; it is expanded and passed as multiple arguments all at once.
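
The expansion is easy to reproduce with printf standing in for the helper script (purely an illustration, not the actual script):

echo Data/$(printf '%s\n' 1200401.json 4205407.json)
# prints: Data/1200401.json 4205407.json
# i.e. the Data/ prefix is attached to the first word only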

I rewrote the script as a for loop:

#!/bin/bash

output_dir=Data

for url in $(eval python pass_urls_to_std_out.py); do
    curl -k $url > $output_dir/$(python pass_cd_to_std_out.py)
done

Unfortunately, $output_dir is evaluated only once and thus the output looks like:

Data/1200401.json
4205407.json
4106902.json
2304400.json
3304557.json
3205309.json
1600303.json
1400100.json
    In many shells $output_dir/$(python pass_cd_to_std_out.py) will expand to multiple arguments. Problems: (1) globbing. (2) Data/ will be prepended to the first one only. (3) Regardless of the number of arguments, there is exactly one tee, so it will try to save identical data to all destinations. Is this what you want? // All this makes me suspect "the script works fine for small sets of files" is not really true, unless the cardinality of each of your "small sets" is one. – Kamil Maciorowski Nov 28 '22 at 14:47
  • Thank you Kamil for your advice. My mistake stems from the fact that I tend to always think in a Pythonic way and write one liners as much as possible instead of for cycles. As you correctly pointed out, as I wrote it, instead of the output of pass_cd_to_std_out.py being passed one by one as arguments of tee, it's expanded and passed as multiple arguments. I rewrote the script as a for cycle, but the problem you pointed out as (2) still stands – Francesco Ghizzo Nov 29 '22 at 08:59
  • Edit: it looks like the for loop is NOT the correct way to go and the output_dir variable is always gonna be evaluated once in a for loop (see: http://mywiki.wooledge.org/BashFAQ/001). Still can't understand why this is the case – Francesco Ghizzo Nov 29 '22 at 09:33
  • Can't you just get all the URLs in a single invocation of curl and use --output-dir=Data with -O to save all the output in files in the Data directory? – Kusalananda Nov 29 '22 at 10:25

1 Answer


I found out that if, instead of piping everything on the fly, I save every step to a file, it somehow works:

  1. Check whether the directory Data exists, and create it otherwise:
    download_dir=Data

    if [ ! -d "$download_dir" ]; then
        mkdir "$download_dir"
    fi

  2. Create a file with all the URLs:
    python pass_urls_to_std_out.py >> urls.txt
    
  3. Create a file with all the filenames:
    python pass_cd_to_std_out.py >> file_names.txt
    
  4. Read both files line by line in parallel and, for each pair, download the data from the URL and save it to the corresponding filename:
    paste urls.txt file_names.txt | while IFS=$'\t' read -r url file_name; do
        curl -k --output-dir=Data -o "$file_name" "$url"
    done
    

I added the --output-dir option thanks to Kusalananda's suggestion as well.
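
Putting the steps together, a minimal sketch of the whole script might look like this (it assumes the helper scripts print exactly one URL or code per line, in matching order, and that curl is 7.73.0 or newer, since --output-dir only takes effect together with -o/-O):

    #!/bin/bash

    download_dir=Data

    # create the output directory if it does not exist yet
    mkdir -p "$download_dir"

    # dump URLs and codes to intermediate files
    # (> rather than >> so reruns don't append duplicates)
    python pass_urls_to_std_out.py > urls.txt
    python pass_cd_to_std_out.py > file_names.txt

    # paste joins the two files line by line with a tab;
    # read splits each line back into a URL and a filename
    paste urls.txt file_names.txt | while IFS=$'\t' read -r url file_name; do
        curl -k --output-dir "$download_dir" -o "$file_name" "$url"
    done

mkdir -p simply does nothing when the directory already exists, so the explicit existence test from step 1 is not needed in this form.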