I have a file containing a list of URLs (one per line).
After processing it to extract the host- (server-)names
with the script below (which works correctly),
the host names that appeared multiple times in the input
were appearing multiple times in the displayed output.
I want each name to appear only once.
I tried uniq and sort -u, but they didn't help.
Below is the code I used to extract the hosts:
function extract_parts {
    if [ -f "wget-list" ]; then
        while read a; do
            a=${a:8}
            host=$(echo -e "$a" | awk -F '/' '{print $1}' | sort -u)
            # host=$(echo -e "$a" | awk -F '/' '{print $1}' | uniq -iu)
            echo -e ${host}
        done <<< $(cat ./wget-list)
    fi
}
where the wget-list file contains (as a truncated example):
https://downloads.sourceforge.net/tcl/tcl8.6.12-html.tar.gz
https://downloads.sourceforge.net/tcl/tcl8.6.12-src.tar.gz
https://files.pythonhosted.org/packages/source/J/Jinja2/Jinja2-3.1.2.tar.gz
https://files.pythonhosted.org/packages/source/M/MarkupSafe/MarkupSafe-2.1.1.tar.gz
https://ftp.gnu.org/gnu/autoconf/autoconf-2.71.tar.xz
https://ftp.gnu.org/gnu/automake/automake-1.16.5.tar.xz
Result after the script (only the hosts, without the https:// and path parts):
downloads.sourceforge.net
downloads.sourceforge.net
files.pythonhosted.org
files.pythonhosted.org
ftp.gnu.org
ftp.gnu.org
Desired output (the above, but with no duplicates):
downloads.sourceforge.net
files.pythonhosted.org
ftp.gnu.org
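For illustration, here is a minimal sketch of one way the loop could produce that output. In the script above, sort -u runs once per input line, so each invocation only ever sees a single host and has nothing to deduplicate. The sketch moves sort -u to the end of the loop so that it sees every host at once. The optional path argument is an assumption added for illustration, and ${a#https://} stands in for ${a:8} only so that the sketch also runs in a plain POSIX sh.

```shell
# Sketch only: same extraction as the loop above, with deduplication
# applied to the loop's combined output instead of to each line.
extract_parts() {
    list=${1:-wget-list}    # assumption: optional path argument, defaults to wget-list
    if [ -f "$list" ]; then
        while read -r a; do
            a=${a#https://}                               # drop the leading "https://"
            printf '%s\n' "$a" | awk -F '/' '{print $1}'  # keep the part before the first "/"
        done < "$list" | sort -u                          # sort -u now sees all hosts together
    fi
}
```

Calling extract_parts with no argument reads ./wget-list, as the original function does.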
(1) The a=${a:8} statement strips the first eight characters off $a. This will give undesired results if you ever get a URL beginning with http:// (or ftp://, etc.) instead of https://. (2) You should always quote all shell variable references (e.g., "$host") unless you have a good reason not to, and you’re sure you know what you’re doing. (3) Why are you using the -e option of echo? P.S. printf is better than echo. … (Cont’d) – G-Man Says 'Reinstate Monica' Dec 21 '22 at 22:53
(4) Rather than <<< $(cat ./wget-list), < ./wget-list is better. (5) See Why is using a shell loop to process text considered bad practice? – G-Man Says 'Reinstate Monica' Dec 21 '22 at 22:53
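As a rough sketch of where the comments above point, the whole job can be done without a shell loop. The function name list_hosts and its path argument are hypothetical, and the sketch assumes every line has the form scheme://host/path. Splitting on "/" turns such a line into the fields "scheme:", "" and "host", so the third field is the host whatever the scheme is, and no fixed character count is needed.

```shell
# Hypothetical helper: print each distinct host from a URL list, one per line.
# "$1" is the path to the list (for example, wget-list).
list_hosts() {
    # Lines with fewer than three "/"-separated fields (no scheme://) are skipped.
    awk -F '/' 'NF >= 3 { print $3 }' < "$1" | sort -u
}
```

For example, list_hosts ./wget-list reads the file through a plain redirection rather than <<< $(cat ...), and sort -u runs once over all the hosts.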