How can I replace multiple substrings from multiple lines in a file matching a pattern from a different file?

Question

I have a file that contains multiple IP addresses and hostnames, and another file that contains some folders with many IP addresses per line.

ip_hostname.txt

host1 10.1.1.1
host2 10.2.2.2
host3 10.3.3.3
host100 10.50.50.50

path_ips.txt

/path1/foo/bar 10.1.1.1 10.2.2.2 10.3.3.3
/path2/foo/bar 10.3.3.3 10.7.7.7
/path3/foo/bar 10.4.4.4 10.8.8.8 10.29.29.29 10.75.75.75
/path100/foo/bar 10.60.60.60

I want to replace each IP address in path_ips.txt with the matching hostname from ip_hostname.txt.

Desired output for path_ips.txt

/path1/foo/bar host1 host2 host3
/path2/foo/bar host3 host7
/path3/foo/bar host4 host8 host29 host75
/path100/foo/bar host60

I tried to do it with sed in a nested while read loop as follows:

#!/bin/sh
while read -r line
do
IP=$(echo $line| awk '{print $1}')
HN=$(echo $line| awk '{print $2}')
    while read -r line2
    do
           sed -i "s/$IP/$HN/g" path_ips.txt
    echo $line2 #to see the progress
    done < path_ips.txt


done < ip_hostname.txt

It worked well the first time, when the list of IP addresses and hostnames was not so large, but when I tried it with a larger list in ip_hostname.txt, it behaved strangely and the outcome was not as desired. Needless to say, it also takes too long to finish.

Is there a better, more efficient way to do it?

Does this answer your question? How to perform replacements defined in one file on another file — Philippos, Aug 12 '22 at 07:30
Please read why-is-using-a-shell-loop-to-process-text-considered-bad-practice. — Ed Morton, Aug 12 '22 at 14:11

user000001 · Answer 1 · 2022-08-12T06:06:33.187

6

The problem with your script is that you run a separate sed command for each IP address that is matched, which makes the script really slow when the file is big.

You also have a nested loop, so you get O(N*M) time complexity in your algorithm.

A better approach would be to use awk to do the replacing, so that you can do all the replacements in a single pass:

$ awk 'NR==FNR{h[$2]=$1;next}{for (i=2;i<=NF;i++) if ($i in h) $i = h[$i]}1' ip_hostname.txt path_ips.txt 
/path1/foo/bar host1 host2 host3
/path2/foo/bar host3 host7
/path3/foo/bar host4 host8 host29 host75
...
/path100/foo/bar host60

Or, in a more readable format:

awk '
    NR == FNR {
      h[$2] = $1
      next
    }
    {
      for (i=2; i<= NF; i++)
        if ($i in h)
          $i = h[$i]
    }
    1
' ip_hostname.txt path_ips.txt

This should have a complexity of O((N+M) log(N)), where N is the size of ip_hostname.txt and M is the size of path_ips.txt. Note that ip_hostname.txt must fit in memory for this to work, but that shouldn't be a problem on a modern computer unless the file is several GBs in size.
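
Since awk prints to standard output, path_ips.txt itself is left unchanged by the command above. A minimal sketch of writing the result back, assuming mktemp is available and that a temporary file next to the target is acceptable (the replace_ips name and the temp-file handling are illustrative, not part of the answer above):

```shell
#!/bin/sh
# Hypothetical helper: replace_ips MAP_FILE TARGET_FILE
# Rewrites TARGET_FILE in place, mapping IPs to hostnames from MAP_FILE.
replace_ips() {
    map=$1
    target=$2
    # Create the temporary file next to the target so mv stays on one filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if awk '
        NR == FNR { h[$2] = $1; next }      # first file: IP -> hostname
        {
            for (i = 2; i <= NF; i++)       # skip field 1 (the path)
                if ($i in h) $i = h[$i]
        }
        1
    ' "$map" "$target" > "$tmp"
    then
        mv "$tmp" "$target"                 # replace only if awk succeeded
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as replace_ips ip_hostname.txt path_ips.txt. Note that the rewritten file takes on the permissions of the temporary file, so they may need to be restored afterwards.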

edited Aug 12 '22 at 06:06

answered Aug 12 '22 at 05:57

You may want to start the inner for loop at 2 rather than 1, to skip the pathname field. Also, you will have to explicitly assume that the pathname field does not contain whitespaces. – Kusalananda Aug 12 '22 at 06:05
@Kusalananda: Thank you for the suggestion about the starting index. As for the pathname containing whitespace, this will only be a problem if one of the words in the pathname is an IP address, but I don't see an easy way around it. – user000001 Aug 12 '22 at 06:08
Yes, the delimiter is unspecified in the question, so there's an additional issue with pathnames containing double spaces. These would be shrunk to single spaces when the new lines are written out by awk. Assuming that the pathnames are "nice" would resolve this. – Kusalananda Aug 12 '22 at 06:22
to anyone new to awk: if confused about NR == FNR {} {} 1, read https://backreference.org/2010/02/10/idiomatic-awk/ 'Two-file processing' part – Puttin Feb 14 '23 at 08:16

score 2 · Answer 2 · answered Aug 12 '22 at 08:44

This can be done entirely in sed, but the awk answer is generally more readable.

#file toggle
1{x;s:^$:<IPs>:;x}
/^EOF$/{x;s:<IPs>:<paths>:;x;d}
#store hostname file
x;/<IPs>/{x;H;d}
#process path file
x;s: :>&:;s:$: :;G
:loop
    s:>( [^ ]+)( .*<paths>.*)\n([^ ]+)\1: \3>\2\n\3\1:
tloop
s:> .*::p

As shown in the example, the code assumes a space is the field delimiter. That means the result will most likely be wrong if any path contains spaces.

This was tested with GNU sed, but you might have a different sed version. If this doesn't run, let me know.

Run:

sed -nrf SCRIPT_FILE ip_hostname.txt <(echo EOF) path_ips.txt > output.txt

Note: <(echo EOF) is used to tell my script when the first input file ends.

Ed Morton · Answer 3 · 2022-08-12T14:32:20.830

Using any POSIX awk:

$ cat tst.awk
NR==FNR {
    map[$2] = $1
    next
}
match($0,/([[:space:]]+([0-9]{1,3}\.){3}[0-9]{1,3})+$/) {
    path = substr($0,1,RSTART-1)
    $0 = substr($0,RSTART,RLENGTH)
    for ( i=1; i<=NF; i++ ) {
        $i = ($i in map ? map[$i] : $i)
    }
    $0 = path OFS $0
}
{ print }

$ awk -f tst.awk ip_hostname.txt path_ips.txt
/path1/foo/bar host1 host2 host3
/path2/foo/bar host3 10.7.7.7
/path3/foo/bar 10.4.4.4 10.8.8.8 10.29.29.29 10.75.75.75
/path100/foo/bar 10.60.60.60

That will work even if the path contains spaces, unless the file name part of the path can end with a space followed by a string that looks like an IP address, e.g. /path/foo/bar 1.1.1.1 where the 1.1.1.1 is not an IP address but part of the name of the file bar 1.1.1.1. If that can happen, then you need some other format for path_ips.txt to separate paths from IP addresses.
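
One such format, purely as a sketch: assume path_ips.txt put a tab between the path and the space-separated IP list, and that paths never contain a tab. Neither assumption comes from the question, and the map_ips_tab name is hypothetical. The path can then be split off unambiguously:

```shell
#!/bin/sh
# Hypothetical helper: map_ips_tab MAP_FILE TARGET_FILE
# Assumes each TARGET_FILE line is: <path><TAB><IP> <IP> ...
map_ips_tab() {
    awk '
        NR == FNR { map[$2] = $1; next }    # first file: IP -> hostname
        {
            tab = index($0, "\t")           # first tab ends the path
            if (tab == 0) { print; next }   # no tab: leave the line untouched
            path = substr($0, 1, tab - 1)
            n = split(substr($0, tab + 1), ips, " ")
            out = ""
            for (i = 1; i <= n; i++) {
                ip = ips[i]
                out = out (i > 1 ? " " : "") (ip in map ? map[ip] : ip)
            }
            print path "\t" out
        }
    ' "$1" "$2"
}
```

With this layout, a path such as /path/foo/bar 1.1.1.1 is kept verbatim because everything before the first tab is never treated as an IP address.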

score 2 · Answer 4 · answered Aug 12 '22 at 21:16

We will use ip_hostname.txt to build a series of sed commands in BRE syntax. Each command is a basic s/ip/hostname/ substitution. Care is taken to escape the contents that go into the left-hand and right-hand sides of each command.

These commands are then fed to a second sed, which performs the substitutions on path_ips.txt.

< ip_hostname.txt \
sed -e '
  1i s/$/ /
  $a s/ $//
  h;s/ .*//;s:[\&/]:\\&:g
  x;s/.* //;s/[.]/[.]/g;G
  s:\(.*\)\n\(.*\):s/ \1 / \2 /:
' - |  sed -f - path_ips.txt

Sample contents of the sed commands that are generated for this data set

s/$/ /
s/ 10[.]1[.]1[.]1 / host1 /
s/ 10[.]2[.]2[.]2 / host2 /
s/ 10[.]3[.]3[.]3 / host3 /
s/ 10[.]4[.]4[.]4 / host4 /
s/ 10[.]7[.]7[.]7 / host7 /
...
s/ $//

score 1 · Answer 5 · answered Mar 30 '23 at 13:37

The following perl script reads in the first input file (ip_hostname.txt), and builds an associative array (hash) called %IPs, where the keys are the IP addresses and the values are the hostnames.

As a performance optimisation, each key of the %IPs hash is actually a pre-compiled regex (qr//) of the IP address, with word boundary markers (\b) and escaped metacharacters (\Q & \E) so that . means a literal . rather than any character.

Pre-compiling the regexes will minimise the amount of CPU time spent compiling the regex, down from one per IP address per line of path_ips.txt (i.e. number of lines in ip_hostname.txt multiplied by the number of lines in path_ips.txt) to just one per IP address. This will make a significant difference in performance if either or both files are large.

Variable $first is used to keep track of whether the script is reading the first file or not. It is initialised to true (1) before the main while loop and set to false (0) at the end of each input file.

After processing the first file, for each line of the second file (path_ips.txt), it iterates over the %IPs hash to search for each IP address and replace it with the associated host name. Then it prints the (possibly modified) input line.

Only matching IP addresses on each line are changed, everything else (including whitespace) is left as-is.

#!/usr/bin/perl
use strict;
my %IPs;
my $first = 1;
while(<>) {
  if ($first) {
    chomp;                   # strip the trailing newline
    my ($host,$ip) = split;  # assumes whitespace delimited input
    $IPs{qr/\b\Q$ip\E\b/} = $host;
  } else {
    foreach my $ip (keys %IPs) {
      s/$ip/$IPs{$ip}/g;
    };
    print;
  };
  $first = 0 if eof;
};
#use Data::Dump qw(dd);
#dd %IPs;

Save as, e.g., map-hostnames.pl and make executable with chmod +x.

Sample output (with an ip_hostname.txt file edited to contain mappings for all the IPs/hosts mentioned in your question):

$ ./map-hostnames.pl ip_hostname.txt path_ips.txt 
/path1/foo/bar host1 host2 host3
/path2/foo/bar host3 host7
/path3/foo/bar host4 host8 host29 host75
/path100/foo/bar host60

BTW, if you want to see what the %IPs hash looks like, uncomment the last two lines of the script (requires the Data::Dump module to be installed). It'll look like this, but with the contents of your real ip_hostname.txt file:

{
  "(?^:\\b10\\.1\\.1\\.1\\b)"    => "host1",
  "(?^:\\b10\\.29\\.29\\.29\\b)" => "host29",
  "(?^:\\b10\\.2\\.2\\.2\\b)"    => "host2",
  "(?^:\\b10\\.3\\.3\\.3\\b)"    => "host3",
  "(?^:\\b10\\.4\\.4\\.4\\b)"    => "host4",
  "(?^:\\b10\\.50\\.50\\.50\\b)" => "host100",
  "(?^:\\b10\\.60\\.60\\.60\\b)" => "host60",
  "(?^:\\b10\\.75\\.75\\.75\\b)" => "host75",
  "(?^:\\b10\\.7\\.7\\.7\\b)"    => "host7",
  "(?^:\\b10\\.8\\.8\\.8\\b)"    => "host8",
}
