
I have a file containing a list of URLs (one entry per line).

After processing it with the script below to extract the host (server) names, the extraction itself works, but host names that appeared multiple times in the input also appear multiple times in the output. I want each name to appear only once. I tried uniq and sort -u, but they didn't help. This is the code I used to extract the hosts:

function extract_parts {
    if [ -f "wget-list" ]; then
        while read a; do
            a=${a:8}
            host=$(echo -e "$a" | awk -F '/' '{print $1}' | sort -u)
            # host=$(echo -e "$a" | awk -F '/' '{print $1}' | uniq -iu)

            echo -e ${host}
        done <<< $(cat ./wget-list)
    fi
}

where the wget-list contains (as a truncated example):

https://downloads.sourceforge.net/tcl/tcl8.6.12-html.tar.gz
https://downloads.sourceforge.net/tcl/tcl8.6.12-src.tar.gz
https://files.pythonhosted.org/packages/source/J/Jinja2/Jinja2-3.1.2.tar.gz
https://files.pythonhosted.org/packages/source/M/MarkupSafe/MarkupSafe-2.1.1.tar.gz
https://ftp.gnu.org/gnu/autoconf/autoconf-2.71.tar.xz
https://ftp.gnu.org/gnu/automake/automake-1.16.5.tar.xz

Result after the script (only the hosts, without the https:// and path parts):

downloads.sourceforge.net
downloads.sourceforge.net
files.pythonhosted.org
files.pythonhosted.org
ftp.gnu.org
ftp.gnu.org

Desired output (the above, but with no duplicates):

downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
– Mcgiwer

2 Answers


If you have GNU grep, the default on Linux, you could simplify with:

extract_parts(){
    # -o prints only the matched part; \K discards the scheme, keeping the host
    grep -oP 'https?://\K[^/]+' "$1" | sort -u
}

Output

$ extract_parts wget-list
downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
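
If GNU grep is not available, a rough equivalent with POSIX sed (a sketch, not part of the original answer; the function name extract_parts_sed is made up here) would be:

extract_parts_sed(){
    # keep only the part between "://" and the next "/" (-n plus the p flag prints matching lines only)
    sed -n 's,^https\{0,1\}://\([^/]*\).*,\1,p' "$1" | sort -u
}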

Correction of your script

Your text manipulation is fragile: you strip the scheme by cutting a fixed 8 characters, and you try to sort and deduplicate one line at a time, which makes no sense (sort -u applied to a single line can never remove duplicates across lines).

A working copy:

if [[ -f wget-list ]]; then
    while IFS= read -r line; do
        # with -F '/', field 3 of "https://host/path" is the host
        host=$(awk -F '/' '{print $3}' <<< "$line")
        echo "$host"
    done < ./wget-list | sort -u    # deduplicate the whole loop's output
fi

The sort has to wrap the output of the whole while loop to work as you want.
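
To make the difference concrete, a minimal side-by-side (hypothetical commands against the same wget-list):

# wrong: sort -u runs once per line, so it only ever sees one line
while IFS= read -r line; do
    awk -F '/' '{print $3}' <<< "$line" | sort -u
done < ./wget-list

# right: sort -u runs once, on the loop's entire output
while IFS= read -r line; do
    awk -F '/' '{print $3}' <<< "$line"
done < ./wget-list | sort -u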

  • While this works, it doesn't answer the question (why uniq wasn't working for OP). – marcelm Dec 19 '22 at 12:51
  • Yes, added 'correction of your script' paragraph – Gilles Quénot Dec 19 '22 at 19:20
  • @Gilles Quenot Thanks, and I have a related question, because I think this could be optimized for better performance: in the while loop, you can cut the "https://" out before processing the host=(...) line. This way, the $3 will become $1 (sketched below). – Mcgiwer Dec 20 '22 at 11:36
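
A minimal sketch of that variant (not from the thread; it assumes every line starts with https://, as in the example wget-list):

if [[ -f wget-list ]]; then
    while IFS= read -r line; do
        line=${line#https://}                          # strip the scheme first
        host=$(awk -F '/' '{print $1}' <<< "$line")    # the host is now field 1
        echo "$host"
    done < ./wget-list | sort -u
fi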
while read a; do

You're reading one line...

host=$(echo -e "$a"  | awk -F '/' '{print $1}' | sort -u)

and printing that in a pipeline which you then sort and take the unique lines out of. That'll give you one unique line.

Then you're doing the same for the next line, totally separately.

Instead, just pass the whole file through a pipeline, e.g.

$ < ./wget-list sed -e 's,^https://,,' | awk -F/ '{print $1}' | sort -u
downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
– ilkkachu
  • Thanks, I will check it out. I tried to do it without regex because I haven't fully learned it yet. – Mcgiwer Dec 18 '22 at 14:12
  • UPDATE: OK, thanks, it helped. Now I only need to find a method that, when splitting with awk, would set the $.. automatically, where ".." is the number of the separated part. – Mcgiwer Dec 18 '22 at 14:40
  • @Mcgiwer: Some general advice: 1. As a rule of thumb, don't use while read if you can avoid it. 2. If you must use while read, keep the interior of the loop as simple as possible. 3. Remember that while is one big command, and so you can pipe into/out of it. 4. You don't need awk for simple field splitting, as cut can also do that and is more straightforward to use (but awk is probably more useful to learn in the long run); see the cut sketch after these comments. – Kevin Dec 18 '22 at 22:43
  • @ilkkachu I have a question related to your code. In the original code, where I used the while loop, I cut out the "https://" with the bash code ${variable:8}. Wouldn't it be possible to use that somehow in your code? – Mcgiwer Dec 20 '22 at 11:30
  • @Mcgiwer, well, that's the shell's substring operator, so you can't use exactly that unless you do the text processing in the shell. But that's not really a good idea, since tools like sed and awk are just better at that (and faster). Consider that your original loop launches three separate processes for each input line (the pipeline in the command substitution), and that's just unnecessary overhead. – ilkkachu Dec 20 '22 at 11:58
  • You could use the equivalent function in awk, substr(string, start[, length]), though. E.g. < ./wget-list awk -F/ '{$0 = substr($0, 9); print $1}' | sort -u. ($0 is the current line, and assigning back to it has awk redo the field splitting into $1, $2, ... Also, string indexing starts from 1, not 0 as in the shell, so to drop the first 8 characters we take the string starting from character 9, not 8.) – ilkkachu Dec 20 '22 at 12:00
  • You could also use just < ./wget-list awk -F/ '{print $3}' | sort -u, and then we wouldn't even need to care whether the protocol is https:, http: or ftp:, as it would just take the part between the second and third slashes. – ilkkachu Dec 20 '22 at 12:01
  • @ilkkachu I had asked because my code did that text processing (using the mentioned bash substring) to remove the "https://" part and simplify future extractions (especially in case a longer entry appeared in the file). – Mcgiwer Dec 20 '22 at 17:16
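
A minimal sketch of the cut alternative Kevin mentioned above (hypothetical, not from the thread; it relies on the https://host/path shape, where fields 1 and 2 when splitting on / are "https:" and an empty string):

# field 3 of "https://host/path" when splitting on "/" is the host
cut -d/ -f3 ./wget-list | sort -u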