How to turn multiple URLs into domains with the command line

Question

I have .csv files with multiple columns and ',' as the separator. The URLs are in the first column. I need to turn all URLs into domains without removing the other columns.

Example of the data I have:

https://www.example.com/dog/url/path/cat.php,column2,$3,4
http://www.unix.random.com/index.html,column2,$3,4
http://example.com/dog/cat.php,column2,$3,4
www.example.com/dog/,column2,$3,4
example.com/url/path/cat/dog,column2,$3,4
https://example.com/,column2,$3,4
https://www.unix.random.com,column2,$3,4
http://www.example.com,column2,$3,4
http://example.com,column2,$3,4
www.random.com,column2,$3,4
example.com/,column2,$3,4

I need to turn every URL in column 1 into a domain name without touching the other columns. The other columns contain no '/'. I need to keep subdomains, except for www.

The output needs to be:

example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4

How can I do this?

Will the domain names always end in .com? How about .net or .org? — jubilatious1, Jul 27 '22 at 11:29

score 2 · Answer 1 · answered Jul 19 '22 at 14:38

I believe this works:

sed -E 's#^(.*://)?(www\.)?##; s#^([^,/]+)[^,]*#\1#'

The first sed command (s#^(.*://)?(www\.)?##) matches the protocol and the 'www.', if present, and replaces them with nothing. The second sed command (s#^([^,/]+)[^,]*#\1#) captures everything up to the first slash or comma, then matches the rest of the first field up to the first comma, and replaces the whole match with the captured part. In effect, it removes everything from the first slash to the first comma.
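To see what each substitution contributes, the two expressions can be run one after the other on a single sample line. This is only an illustrative sketch: the function names strip_prefix and strip_path are made up here and are not part of the answer.

```shell
#!/bin/sh
# Step 1: drop an optional scheme ("https://") and an optional leading "www."
strip_prefix() { sed -E 's#^(.*://)?(www\.)?##'; }

# Step 2: keep the text before the first "/" or ",", drop the rest of field 1
strip_path() { sed -E 's#^([^,/]+)[^,]*#\1#'; }

line='https://www.example.com/dog/url/path/cat.php,column2,$3,4'

printf '%s\n' "$line" | strip_prefix
# example.com/dog/url/path/cat.php,column2,$3,4

printf '%s\n' "$line" | strip_prefix | strip_path
# example.com,column2,$3,4
```

Chaining both functions gives the same result as the single sed command with two expressions shown above.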

answered Jul 19 '22 at 14:38

DBear

First, let's understand the elements that make up a URL. In https://www.copahost.com/blog/page.html, https:// is the protocol, www. is the subdomain, copahost.com is the domain, and /blog/page.html is the path. You want to "extract" the domain portion from the URL; I suggest doing them one URL at a time.

The following sed string is one way of extracting the domain portion.
– Ian Christy, Jul 29 '22 at 18:54

score 2 · Answer 2 · edited Jul 29 '22 at 11:06

Using sed

$ sed -E 's~(^[^/]*//)?(www\.)?([^/]*)[^,]*~\3~' input_file
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4

edited Jul 29 '22 at 11:06

Philippos

answered Jul 19 '22 at 15:33

sseLtaH

Ed Morton · Accepted Answer · 2022-07-19T16:43:59.263

2

Using any awk:

$ awk 'BEGIN{FS=OFS=","} {sub("^([^/:]+://)?(www[.])?","",$1); sub("/.*","",$1)} 1' file
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4
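Because awk splits each line on commas first, both sub() calls only ever see the first field. A quick check of that, assuming a hypothetical row whose second column contains a slash (the question rules this out, so it is purely an illustration; the function name to_domain is also made up):

```shell
#!/bin/sh
# Wrap the awk command from this answer so it can be used as a filter.
to_domain() {
  awk 'BEGIN{FS=OFS=","} {sub("^([^/:]+://)?(www[.])?","",$1); sub("/.*","",$1)} 1'
}

# Column 2 contains "a/b"; only column 1 should be rewritten.
printf '%s\n' 'http://www.unix.random.com/index.html,a/b,$3,4' | to_domain
# unix.random.com,a/b,$3,4
```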

edited Jul 19 '22 at 16:43

answered Jul 19 '22 at 16:33

Ed Morton

Adamq · Answer 4 · 2022-07-19T23:22:25.840

This may not be the answer you are looking for, but sed can behave inconsistently across operating systems, and its syntax is hard to read.

This may be even worse, but another option is to use Node.js on the command line with the -e flag, which evaluates a string as JavaScript. The downside is that Node.js must be installed on the system.

This code takes everything piped to it from standard input and prints the modified string to standard output:

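cat infile.csv | node -e 'const stdin = process.stdin;
// decode input as text so multi-byte characters split across chunks stay intact
stdin.setEncoding("utf8");
let data = "";
// collect all of standard input before processing it
stdin.on("data", chunk => data += chunk);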
stdin.on("end", () => {
  console.log(
    data
      .trim()
      .split("\n")
      .filter(Boolean)
      .map((line) => {
        const parts = line.split(",");
        const url = new URL((!/^http(s)?\:\/\//.test(line) ? "https://" : "") + parts.shift());
        return `${url.host.replace(/^www\./,"")},${parts.join(",")}`
      })
      .join("\n"))
});' > outfile.csv

You may have trouble overwriting your input file if that's what you want to do, because redirecting the output to the same file can truncate it before it has been read. To solve that, you can pass the file name as an argument after the code instead of using a pipe:

node -e 'const fs = require("fs");
const infile = process.argv[1];
const data = fs.readFileSync(infile).toString();
const output = data
  .trim()
  .split("\n")
  .filter(Boolean)
  .map((line) => {
    const parts = line.split(",");
    const url = new URL((!/^http(s)?\:\/\//.test(line) ? "https://" : "") + parts.shift());
    return `${url.host.replace(/^www\./,"")},${parts.join(",")}`
  })
  .join("\n");
fs.writeFileSync(infile, output + "\n")' file.csv
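Another way around the overwrite problem, which works with any filter in this thread, is to write the result to a temporary file and copy it back only once the filter has succeeded. A minimal sketch, assuming mktemp is available on the system; the helper name rewrite_in_place is hypothetical, and the filter used is the awk command from the accepted answer:

```shell
#!/bin/sh
# rewrite_in_place FILE: replace the contents of FILE with the filtered version.
# (The function name is made up for this sketch.)
rewrite_in_place() {
  tmp=$(mktemp) || return 1
  # Filter taken from the awk answer; any of the other filters would do.
  if awk 'BEGIN{FS=OFS=","} {sub("^([^/:]+://)?(www[.])?","",$1); sub("/.*","",$1)} 1' "$1" > "$tmp"
  then
    cat "$tmp" > "$1"   # overwrite only after the filter has finished successfully
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

It would be called as rewrite_in_place file.csv. Copying the result back with cat, rather than renaming the temporary file, keeps the original file's permissions.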

score 0 · Answer 5 · answered Jul 27 '22 at 16:11

Using Raku (formerly known as Perl_6)

raku -pe 's{ (^ <-[/]>* \/\/ )? (w**3 \.)? (<-[/]>*) <-[,]>* } = "$2";'

[Above is a translation of sed code from @HatLess].

raku -pe 's{ ^ (.* "://" )? (www\.)? } = ""; s{ ^ (<-[,/]>+) <-[,]>* } = "$0";'

[Above is a translation of sed code from @D_Bear].

Sample Output (both cases):

example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4

https://raku.org
