
I have a file containing a list of URLs (one entry per line).

After processing it with the script below to extract the host (server) names, the extraction itself works, but host names that appeared multiple times in the input also appear multiple times in the output. I want each name to appear only once. I tried uniq and sort -u, but they didn't help. This is the code I used to extract the hosts:

function extract_parts {
    if [ -f "wget-list" ]; then
        while read a; do
            a=${a:8}
            host=$(echo -e "$a" | awk -F '/' '{print $1}' | sort -u)
            # host=$(echo -e "$a" | awk -F '/' '{print $1}' | uniq -iu)

            echo -e ${host}
        done <<< $(cat ./wget-list)
    fi
}

where the wget-list contains (as a truncated example):

https://downloads.sourceforge.net/tcl/tcl8.6.12-html.tar.gz
https://downloads.sourceforge.net/tcl/tcl8.6.12-src.tar.gz
https://files.pythonhosted.org/packages/source/J/Jinja2/Jinja2-3.1.2.tar.gz
https://files.pythonhosted.org/packages/source/M/MarkupSafe/MarkupSafe-2.1.1.tar.gz
https://ftp.gnu.org/gnu/autoconf/autoconf-2.71.tar.xz
https://ftp.gnu.org/gnu/automake/automake-1.16.5.tar.xz

Result after the script (only the hosts, without the https:// and path parts):

downloads.sourceforge.net
downloads.sourceforge.net
files.pythonhosted.org
files.pythonhosted.org
ftp.gnu.org
ftp.gnu.org

Desired output (the above, but with no duplicates):

downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
– Mcgiwer

2 Answers


If you have GNU grep, the default on Linux, you could simplify with:

extract_parts(){
    # -o prints only the matched part; \K discards the scheme, keeping the host
    grep -oP 'https?://\K[^/]+' "$1" | sort -u
}

Output

$ extract_parts wget-list
downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
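
If GNU grep is not available, a rough equivalent with POSIX sed (a sketch, not part of the original answer; the function name extract_parts_sed is made up here) would be:

extract_parts_sed(){
    # keep only the part between "://" and the next "/" (-n plus the p flag prints matching lines only)
    sed -n 's,^https\{0,1\}://\([^/]*\).*,\1,p' "$1" | sort -u
}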

Correction of your script

Your text manipulation is fragile: you strip the scheme by cutting a fixed 8 characters, and you try to sort and deduplicate one line at a time, which makes no sense (sort -u applied to a single line can never remove duplicates across lines).

A working copy:

if [[ -f wget-list ]]; then
    while IFS= read -r line; do
        # with -F '/', field 3 of "https://host/path" is the host
        host=$(awk -F '/' '{print $3}' <<< "$line")
        echo "$host"
    done < ./wget-list | sort -u    # deduplicate the whole loop's output
fi

The sort has to wrap the output of the whole while loop to work as you want.
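
To make the difference concrete, a minimal side-by-side (hypothetical commands against the same wget-list):

# wrong: sort -u runs once per line, so it only ever sees one line
while IFS= read -r line; do
    awk -F '/' '{print $3}' <<< "$line" | sort -u
done < ./wget-list

# right: sort -u runs once, on the loop's entire output
while IFS= read -r line; do
    awk -F '/' '{print $3}' <<< "$line"
done < ./wget-list | sort -u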

  • While this works, it doesn't answer the question (why uniq wasn't working for OP). – marcelm Dec 19 '22 at 12:51
  • Yes, added 'correction of your script' paragraph – Gilles Quénot Dec 19 '22 at 19:20
  • @Gilles Quenot Thanks, and I have a related question, because I think this could be optimized for better performance: in the while loop, you can cut the "https://" out before processing the host=(...) line. This way, the $3 will become $1 (sketched below). – Mcgiwer Dec 20 '22 at 11:36
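
A minimal sketch of that variant (not from the thread; it assumes every line starts with https://, as in the example wget-list):

if [[ -f wget-list ]]; then
    while IFS= read -r line; do
        line=${line#https://}                          # strip the scheme first
        host=$(awk -F '/' '{print $1}' <<< "$line")    # the host is now field 1
        echo "$host"
    done < ./wget-list | sort -u
fi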
while read a; do

You're reading one line...

host=$(echo -e "$a"  | awk -F '/' '{print $1}' | sort -u)

and printing that in a pipeline which you then sort and take the unique lines out of. That'll give you one unique line.

Then you're doing the same for the next line, totally separately.

Instead, just pass the whole file through a pipeline, e.g.

$ < ./wget-list sed -e 's,^https://,,' | awk -F/ '{print $1}' | sort -u
downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
– ilkkachu
  • Thanks, I will check it out. I tried to do it without regex because I haven't fully learned it yet. – Mcgiwer Dec 18 '22 at 14:12
  • UPDATE: OK, thanks, it helped. Now I only need to find a method that, when splitting with awk, would set the $.. automatically, where ".." is the number of the separated part. – Mcgiwer Dec 18 '22 at 14:40
  • @Mcgiwer: Some general advice: 1. As a rule of thumb, don't use while read if you can avoid it. 2. If you must use while read, keep the interior of the loop as simple as possible. 3. Remember that while is one big command, and so you can pipe into/out of it. 4. You don't need awk for simple field splitting, as cut can also do that and is more straightforward to use (but awk is probably more useful to learn in the long run); see the cut sketch after these comments. – Kevin Dec 18 '22 at 22:43
  • @ilkkachu I have a question related to your code. In the original code, where I used the while loop, I cut out the "https://" with the bash code ${variable:8}. Wouldn't it be possible to use that somehow in your code? – Mcgiwer Dec 20 '22 at 11:30
  • @Mcgiwer, well, that's the shell's substring operator, so you can't use exactly that unless you do the text processing in the shell. But that's not really a good idea, since tools like sed and awk are just better at that (and faster). Consider that your original loop launches three separate processes for each input line (the pipeline in the command substitution), and that's just unnecessary overhead. – ilkkachu Dec 20 '22 at 11:58
  • You could use the equivalent function in awk, substr(string, start[, length]), though. E.g. < ./wget-list awk -F/ '{$0 = substr($0, 9); print $1}' | sort -u. ($0 is the current line, and assigning back to it has awk redo the field splitting into $1, $2, ... Also, string indexing starts from 1, not 0 as in the shell, so to drop the first 8 characters we take the string starting from character 9, not 8.) – ilkkachu Dec 20 '22 at 12:00
  • You could also use just < ./wget-list awk -F/ '{print $3}' | sort -u, and then we wouldn't even need to care whether the protocol is https:, http: or ftp:, as it would just take the part between the second and third slashes. – ilkkachu Dec 20 '22 at 12:01
  • @ilkkachu I had asked because my code did that text processing (using the mentioned bash substring) to remove the "https://" part and simplify future extractions (especially in case a longer entry appeared in the file). – Mcgiwer Dec 20 '22 at 17:16
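
A minimal sketch of the cut alternative Kevin mentioned above (hypothetical, not from the thread; it relies on the https://host/path shape, where fields 1 and 2 when splitting on / are "https:" and an empty string):

# field 3 of "https://host/path" when splitting on "/" is the host
cut -d/ -f3 ./wget-list | sort -u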