Compare two URL lists and print newly added URLs to a new file

Question

I am producing two files that contain lists of URLs; I will refer to them as old and new. I would like to compare the two files and, if there are any URLs in new that are not in old, write them to an extra_urls file.

I've read a bit about the diff command, but from what I can tell it also takes the order of the lines into account. I don't want the order to have any effect on the output. I just want the extra URLs in new written to the extra_urls file, no matter where they appear in either of the other two files.

How can I do this?

Barmar · Accepted Answer · 2015-12-17T18:03:01.113

14

You can use the comm command to compare two files, and selectively show lines unique to one or the other, or the lines in common. It requires the inputs to be sorted, but you can sort them on the fly, by using process substitution.

comm -13 <(sort old.txt) <(sort new.txt)
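To see what -13 selects, here is a small sketch on made-up data (the example URLs and the temporary directory are assumptions, not taken from the question). It sorts into temporary files instead of using process substitution, so it should also run under a plain sh, and it redirects the result into extra_urls as the question asked:

```sh
# Hypothetical sample data in a scratch directory
workdir=$(mktemp -d)
printf '%s\n' 'http://a.example/' 'http://b.example/' > "$workdir/old.txt"
printf '%s\n' 'http://c.example/' 'http://b.example/' 'http://a.example/' > "$workdir/new.txt"

# comm needs sorted input; sort each list into its own file
sort "$workdir/old.txt" > "$workdir/old.sorted"
sort "$workdir/new.txt" > "$workdir/new.sorted"

# -1 suppresses lines found only in old, -3 suppresses lines found in both,
# so only the lines unique to new are left
comm -13 "$workdir/old.sorted" "$workdir/new.sorted" > "$workdir/extra_urls"
cat "$workdir/extra_urls"
```

The order of the lines in old.txt and new.txt does not matter here, because both lists are sorted before comm sees them.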

If your shell doesn't support process substitution (for example, a script run with plain /bin/sh rather than bash), it can be emulated using named pipes. An example is shown on Wikipedia.
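A minimal sketch of the named-pipe approach, assuming mkfifo and mktemp are available on the system (the sample data and the pipedir variable are made up for illustration):

```sh
# Hypothetical sample data in a scratch directory
pipedir=$(mktemp -d)
printf '%s\n' 'http://a.example/' 'http://b.example/' > "$pipedir/old.txt"
printf '%s\n' 'http://c.example/' 'http://b.example/' 'http://a.example/' > "$pipedir/new.txt"

# Two named pipes stand in for <(sort old.txt) and <(sort new.txt)
mkfifo "$pipedir/old.fifo" "$pipedir/new.fifo"

# Each sort runs in the background and blocks until comm opens its pipe
sort "$pipedir/old.txt" > "$pipedir/old.fifo" &
sort "$pipedir/new.txt" > "$pipedir/new.fifo" &

comm -13 "$pipedir/old.fifo" "$pipedir/new.fifo" > "$pipedir/extra_urls"
wait

# The pipes are ordinary directory entries, so remove them when done
rm -f "$pipedir/old.fifo" "$pipedir/new.fifo"
```

If named pipes are not an option either, sorting into ordinary temporary files and passing those to comm gives the same result.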

edited Dec 17 '15 at 18:03

answered Nov 23 '15 at 15:53

Barmar

9,927

Concise but effective, exactly what was needed. Excellent bit of code for what I required. – neilH Nov 23 '15 at 16:15
Hmm, but if the input is sorted, then diff will do the same thing, right? – justhalf Nov 24 '15 at 06:52
diff will show all the differences. comm allows you to select whether you want to see the lines from file 1, file 2, or the ones they have in common. – Barmar Nov 24 '15 at 07:02
Hi Barmar, not sure you will check this but just in case: I've moved this script onto my Synology NAS to run from there. Since running my script from the Synology I'm now getting the syntax error: line 60: syntax error: unexpected "(" – neilH Dec 17 '15 at 17:58
What version of bash is it running? It may not support process substitution. – Barmar Dec 17 '15 at 17:59
On my Synology I have to use #!/bin/sh at the top of my scripts rather than bash, if that helps? – neilH Dec 17 '15 at 18:06
I added a link to Wikipedia, which shows how to emulate process substitution with named pipes. Does that help? – Barmar Dec 17 '15 at 18:07
Thanks for that mate, I will check when I get home, literally about to get kicked out of the office, will let you know, thanks – neilH Dec 17 '15 at 18:10
Yes thanks worked Barmar, thanks for your help yet again :) – neilH Dec 18 '15 at 17:02

terdon · Answer 2 · 2015-11-24T10:13:42.673

6

I would just use grep:

grep -vFf old new > extra_urls

Explanation

-f : tells grep to read its search patterns from a file. In this case, old.
-v : tells grep to invert the match, to only print non-matching lines.
-F : tells grep to interpret its search patterns as strings, not regular expressions. That way, the . of the URL will be matched literally.

Combined, these make grep print any lines in new that were not in old. The order of the URLs in the file is irrelevant.
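One caveat worth illustrating: with -F alone, a pattern matches anywhere in a line, so an old URL that is a prefix of a new URL hides the new one, and a blank line in old is an empty pattern that matches every line. Adding the standard -x option (match whole lines only) avoids both. A small sketch with made-up data (the URLs and the grepdir variable are assumptions):

```sh
# Hypothetical sample data in a scratch directory
grepdir=$(mktemp -d)
printf '%s\n' 'http://example.com/page' > "$grepdir/old"
printf '%s\n' 'http://example.com/page' 'http://example.com/page2' 'http://example.org/' > "$grepdir/new"

# Substring matching: page2 is dropped because it contains the old URL
grep -vFf "$grepdir/old" "$grepdir/new" > "$grepdir/extra_substring"

# Whole-line matching: only the exact old URL is dropped
grep -vxFf "$grepdir/old" "$grepdir/new" > "$grepdir/extra_urls"

cat "$grepdir/extra_urls"
```

Whether substring matching is acceptable depends on the data; for lists with one complete URL per line, -x is probably the safer default.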

edited Nov 24 '15 at 10:13

answered Nov 23 '15 at 16:03

terdon

242,166

Hi terdon, thanks for your input. I've just tested this and it produced a blank extra_urls file despite there being new URLs in the new file. – neilH Nov 23 '15 at 16:14
@bms9nmh hmm, that's odd. Please [edit] your question to give an example of your input files. You might also want to come into the site's chat room where we can discuss this further. – terdon Nov 23 '15 at 16:16
You'll want to add -F for plain text patterns – glenn jackman Nov 24 '15 at 00:36

score 1 · Answer 3 · answered Nov 23 '15 at 15:31

1

Since order is important to you, use awk

awk '
    NR == FNR {old[$1]=1; next}
    !($1 in old)
' old new > extra
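A variation on the same idea, sketched on made-up data: it keys on the whole line ($0) instead of the first field and prints each new URL only once, even if it is repeated in new. The sample URLs and the awkdir variable are assumptions for illustration.

```sh
# Hypothetical sample data in a scratch directory
awkdir=$(mktemp -d)
printf '%s\n' 'http://a.example/' 'http://b.example/' > "$awkdir/old"
printf '%s\n' 'http://c.example/' 'http://a.example/' 'http://c.example/' 'http://d.example/' > "$awkdir/new"

awk '
    NR == FNR { old[$0] = 1; next }   # first file: remember every old URL
    !($0 in old) && !seen[$0]++       # second file: print unseen, non-old URLs once
' "$awkdir/old" "$awkdir/new" > "$awkdir/extra"

cat "$awkdir/extra"
```

The output keeps the order in which the URLs first appear in new; no sorting is needed.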

answered Nov 23 '15 at 15:31

glenn jackman

85,964

1

Hi glenn, just to clarify, order isn't important. The order of the URLs isn't an issue, just the difference between the two files, i.e. the additional URLs. I don't want the difference in order to affect the output in any way. – neilH Nov 23 '15 at 15:37
@bms9nmh: you could just change > extra to | sort > extra, or to | sort -u > extra if you only want a new URL to appear in the output once, regardless of how many times it's in the input. The input order is liable to affect the output order unless you do extra work somewhere along the way to prevent it. – Steve Jessop Nov 23 '15 at 22:07
@steve, meh, comm is the best answer for this question, although grep -Fvf is good too – glenn jackman Nov 24 '15 at 00:37

score 0 · Answer 4 · edited Nov 27 '15 at 19:23

0

I have an application called meld. It shows two (or three) files side by side, highlights the differences, and allows selectively copying from one file to the other or deleting characters.

Meld can be installed from a terminal with

sudo apt-get install meld

edited Nov 27 '15 at 19:23

Volker Siegel

17,283

answered Nov 27 '15 at 19:11

krazykyngekorny

1

score 0 · Answer 5 · 2021-03-11T18:07:45.517

Here's a more general solution that can find and compare URLs in text files that contain more than just URLs:

#!/bin/sh
# diffl.sh
# DIFF with Links - a "diff utility"-like .sh script
# (dash, bash, zsh compatible) that can find missing
# web links in one file compared to a group of files
# Please note that: for simplicity, in this script, only
# URLs containing "://" are taken into consideration,
# although there can be URLs that do not contain it
# (such as mailto:user@site.com)
GetOS () {
    OS_kernel_name=$(uname -s)

    case "$OS_kernel_name" in
        "Linux")
            eval $1="Linux"
        ;;
        "Darwin")
            eval $1="Mac"
        ;;
        "CYGWIN"*|"MSYS"*|"MINGW"*)
            eval $1="Windows"
        ;;
        "")
            eval $1="unknown"
        ;;
        *)
            eval $1="other"
        ;;
    esac
}
DetectShell () {
    eval $1="";
    if [ -n "$BASH_VERSION" ]; then
        eval $1="bash";
    elif [ -n "$ZSH_VERSION" ]; then
        eval $1="zsh";
    elif [ "$PS1" = '$ ' ]; then
        eval $1="dash";
    else
        eval $1="undetermined";
    fi
}
PrintInTitle () {
    printf "\033]0;%s\007" "$1"
}
PrintJustInTitle () {
    PrintInTitle "$1">/dev/tty
}
trap1 () {
    CleanUp
    printf "\nAborted.\n">/dev/tty
}
CleanUp () {
    #Restore "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
    trap - INT
    trap - TSTP

    #Clear the title:
    PrintJustInTitle ""

    #Restore initial IFS:
    #IFS=$old_IFS
    unset IFS
}
DisplayHelp () {
    printf "\n"
    printf "diffl - DIFF by URL web Links\n"
    printf "\n"
    printf "    What it does:\n"
    printf "        - compares the URL web links in the two provided files (<file1> and <file2>) and shows the missing web links that are found in one but not in the other\n"
    printf "    Syntax:\n"
    printf "        <caller_shell> '/path/to/diffl.sh' <file1> <file2> ... <fileN> [flags]\n"
    printf "        - where:\n"
    printf "            - <caller_shell> can be any of the shells: dash, bash, zsh, or any other shell compatible with the \"dash\" shell syntax\n"
    printf "            - '/path/to/diffl.sh' represents the path of this script\n"
    printf "            - <file1> and <file2> represent the files to be compared\n"
    printf "                       - if more than two files are provided as parameters (<file1>, <file2>, ..., <fileN>): the web links in <file1> are compared with all the web links in <file2>, ... <fileN>\n"
    printf "            - [flags] can be:\n"
    printf "                --help or -h\n"
    printf "                    Displays this help information\n"
    printf "    Output:\n"
    printf "        - lines starting with '<' signify web links from <file1>\n"
    printf "        - lines starting with '>' signify web links from <file2>, ..., <fileN>\n"
    printf "    Notes:\n"
    printf "               - for simplicity, in this script, only URLs containing \"://\" are taken into consideration, although there can be URLs that do not contain it (such as mailto:user@site.com)\n"
    printf "\n"
}
GetOS OS
#################################################################################
# Uncomment the next line if your OS is not Linux or Mac (and eventually
# modify the commands used (sed, sort, uniq) according to your system):
#################################################################################
#OS="userdefined"
DetectShell current_shell
if [ "$current_shell" = "undetermined" ]; then
    printf "\nWarning: This script was designed to work with dash, bash and zsh shells.\n\n">/dev/tty
fi
#Get the program parameters into the array "params":
params_count=0
for i; do
    params_count=$((params_count+1))
    eval params_$params_count=\"\$i\"
done
params_0=$((params_count))
if [ "$params_0" = "0" ]; then #if no parameters are provided: display help
    DisplayHelp
    CleanUp && exit 0
fi
#Create a flags array. A flag denotes special parameters:
help_flag="0"
i=1;
j=0;
while [ "$i" -le "$((params_0))" ]; do
    #Indirect lookup: copy the value of params_<i> into params_i
    eval params_i=\"\${params_$i}\"
    case "${params_i}" in
    "--help" | "-h" )
        help_flag="1"
    ;;
    * )
        j=$((j+1))
        eval selected_params_$j=\"\$params_i\"
    ;;
    esac
    i=$((i+1))
done
selected_params_0=$j
#Rebuild params array:
for i in $(seq 1 $selected_params_0); do
    eval params_$i=\"\${selected_params_$i}\"
done
params_0=$selected_params_0
if [ "$help_flag" = "1" ]; then
    DisplayHelp
else #Run program:
NL=$(printf '%s' "\n\n"); #final NewLine is deleted
#or use:
#NL=$'\n'

error1="false"
error2="false"
error3="false"
{ sed --help >/dev/null 2>/dev/null; } || { error1="true"; }
{ sort --help >/dev/null 2>/dev/null; } || { error2="true"; }
{ uniq --help >/dev/null 2>/dev/null; } || { error3="true"; }
if [ "$error1" = "true" -o "$error2" = "true" -o "$error3" = "true" ]; then
    {
        printf "\n"
        if [ "$error1" = "true" ]; then printf '%s' "ERROR: Could not run \"sed\" (necessary in order for this script to function correctly)!"; fi
        if [ "$error2" = "true" ]; then printf '%s' "ERROR: Could not run \"sort\" (necessary in order for this script to function correctly)"; fi
        if [ "$error3" = "true" ]; then printf '%s' "ERROR: Could not run \"uniq\" (necessary in order for this script to function correctly)"; fi
        printf "\n"
    }>/dev/stderr
    exit
fi

if [ "$OS" = "Linux" -o "$OS" = "Mac" -o "$OS" = "userdefined" ]; then
    #   command1: sed -E 's/([a-zA-Z]*\:\/\/)/\\${NL}\1/g'
    sed_command1='sed -E '"'"'s/([a-zA-Z]*\:\/\/)/'"\\${NL}"'\1/g'"'";
    #   command2: sed -n 's/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ \t]*\).*/\4\5\7/p'
    sed_command2='sed -n '"'"'s/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ \t]*\).*/\4\5\7/p'"'"
    #   command3: sed -E 's/(.) [0-9]* (.*)/\1 \2/g'
    sed_command3='sed -E '"'"'s/(.) [0-9]* (.*)/\1 \2/g'"'";
    #   command4: sed -E 's/^1/>/g;s/^0/</g'
    sed_command4='sed -E '"'"'s/^1/>/g;s/^0/</g'"'"
else
    printf '\n%s\n\n' "Error: Unsupported OS!">/dev/stderr
    exit 1
fi

#Get the program parameters into the array "files":
count=0
for i; do
    count=$((count+1))
    eval files_$count=\"\$i\"
done
files_0=$((count))

error="false"
if [ "$files_0" -lt "2" ]; then
    printf '\n%s\n' "ERROR: Please provide at least two parameters!">/dev/stderr
    error="true"
fi

if [ "$error" = "true" ]; then
    printf "\n"
    exit 1
fi

error="false"
for i in $(seq 1 $files_0); do
    eval current_file=\"\$files_$i\"
    if [ ! \( -e "$current_file" -a -f "$current_file" \) ]; then
        printf '\n%s\n' "ERROR: File \"$current_file\" does not exist or is not a regular file!">/dev/stderr
        error="true"
    fi
done

if [ "$error" = "true" ]; then
    printf "\n"
    exit 1
fi

#Proceed to finding and comparing links:

#Trap "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
trap 'trap1' INT
trap 'trap1' TSTP

old_IFS="$IFS" #Store initial IFS value
IFS="
"

{
    PrintJustInTitle "Searching for links [1]..."
    mask="00000000000000000000"
    {
        count=0
        for link in $(\
            cat "$files_1" |\
            eval $sed_command1 |\
            eval $sed_command2\
        ); do
            count_prev=$count
            count=$((count+1))
            if [ "${#count_prev}" -lt "${#count}" ]; then
                mask="${mask%?}"
            fi
            number="$mask$count"
            printf '%s\n' "0 $number $link"
            PrintJustInTitle "Links found [1]: $((count))..."
        done;

        PrintJustInTitle "Sorting results [1]..."
    }|sort -u -k 3

    PrintJustInTitle "Searching for links [2]..."
    mask="00000000000000000000"
    {
        count=0
        for i in $(seq 2 $files_0); do
            eval current_file=\"\$files_$i\"
            for link in $(\
                cat "$current_file" |\
                eval $sed_command1 |\
                eval $sed_command2\
            ); do
                count_prev=$count
                count=$((count+1))
                if [ "${#count_prev}" -lt "${#count}" ]; then
                    mask="${mask%?}"
                fi
                number="$mask$count"
                printf '%s\n' "1 $number $link"
                PrintJustInTitle "Links found [2]: $((count))..."
            done
        done

        PrintJustInTitle "Sorting results [2]..."
    }|sort -u -k 3

    PrintJustInTitle "Searching for unique links [3]..."
}|{\
    sort -k 3|uniq -u -f 2|sort|eval $sed_command3|eval $sed_command4

    PrintJustInTitle "Done";
}

CleanUp

fi

Syntax:
- <caller_shell> '/path/to/diffl.sh' <file1> <file2> ... <fileN>
What it does:
- this will show the URL web links that <file1> and the group of files <file2>, ..., <fileN> don't have in common
Notes:
- for simplicity, in this script, only URLs containing "://" are taken into consideration

(1) This appears to be similar enough to your other answer that many of my comments there probably apply here, as well. (2) When you post an answer that’s derived from somebody else’s work, you should say so. — G-Man Says 'Reinstate Monica', Mar 13 '21 at 02:06
This script compares only the web links inside files, whereas the other answer compares: date modified, size, path of files... these are different things (and by the way, I am the author of https://unix.stackexchange.com/questions/59336/compare-directories-but-not-content-of-files#621962 ) — , Mar 15 '21 at 13:16
