I have .csv files with multiple columns and ',' as the separator. The URLs are in the first column. I need to turn all URLs into domains without removing the other columns.

Example of the data I have:

https://www.example.com/dog/url/path/cat.php,column2,$3,4
http://www.unix.random.com/index.html,column2,$3,4
http://example.com/dog/cat.php,column2,$3,4
www.example.com/dog/,column2,$3,4
example.com/url/path/cat/dog,column2,$3,4
https://example.com/,column2,$3,4
https://www.unix.random.com,column2,$3,4
http://www.example.com,column2,$3,4
http://example.com,column2,$3,4
www.random.com,column2,$3,4
example.com/,column2,$3,4 

I need to turn every URL in column 1 into a domain name without touching the other columns; the other columns contain no '/'. I need to keep subdomains, except for www.

The output needs to be:

example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4 

How can I do this?

5 Answers

I believe this works:

sed -E 's#^(.*://)?(www\.)?##; s#^([^,/]+)[^,]*#\1#'

The first sed command (s#^(.*://)?(www\.)?##) matches the protocol and a leading 'www.' and replaces them with nothing. The second sed command (s#^([^,/]+)[^,]*#\1#) captures everything up to the first slash or comma, then matches the rest of the first field up to the first comma, and replaces the whole match with the captured part. In effect, it removes everything from the first slash up to the first comma.
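To see the two stages separately, here is a small sketch. The function names strip_prefix and strip_path are made up for illustration, and it assumes a sed that supports -E (GNU and BSD sed both do):

```shell
# Stage 1: drop an optional protocol and an optional leading "www."
strip_prefix() { sed -E 's#^(.*://)?(www\.)?##'; }

# Stage 2: keep the first field up to its first "/", drop the rest of that field
strip_path() { sed -E 's#^([^,/]+)[^,]*#\1#'; }

line='https://www.example.com/dog/url/path/cat.php,column2,$3,4'

printf '%s\n' "$line" | strip_prefix
# example.com/dog/url/path/cat.php,column2,$3,4

printf '%s\n' "$line" | strip_prefix | strip_path
# example.com,column2,$3,4
```

Chaining both expressions in one sed call, as in the one-liner above, gives the same result as the pipeline.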

DBear
  • First, let's understand the elements that make up a URL:

        https://   www.        copahost.com   /blog/page.html
        protocol   subdomain   domain         path

    You want to "extract" the domain portion from the URL; I suggest doing them one URL at a time.

    The following sed string is one way of extracting the domain portion.

    – Ian Christy Jul 29 '22 at 18:54

Using sed

$ sed -E 's~(^[^/]*//)?(w+\.)?([^/]*)[^,]*~\3~' input_file
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4
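
One edge case worth knowing about: (w+\.)? matches any run of w's followed by a dot, not just www., so a host such as w.example.com would lose its first label. Below is a sketch of a stricter variant. The function names loose and strict are only for illustration, and the stricter pattern is a suggestion, not part of the original command:

```shell
# The pattern as posted, wrapped in a function for comparison
loose() { sed -E 's~(^[^/]*//)?(w+\.)?([^/]*)[^,]*~\3~'; }

# Variant that strips only a literal "www." label
strict() { sed -E 's~(^[^/]*//)?(www\.)?([^/]*)[^,]*~\3~'; }

printf '%s\n' 'http://w.example.com/x,column2' | loose    # example.com,column2
printf '%s\n' 'http://w.example.com/x,column2' | strict   # w.example.com,column2
```

For the sample data in the question, both versions give the same output.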
Philippos
sseLtaH

Using any awk:

$ awk 'BEGIN{FS=OFS=","} {sub("^([^/:]+://)?(www[.])?","",$1); sub("/.*","",$1)} 1' file
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4
Ed Morton

This may not be the answer you are looking for, but sed can be inconsistent across OSs and its syntax is hard to read.

This may be even worse, but another option is to use Node.js on the command line with the -e flag, which evaluates a string as code. The downside is that Node.js must be installed on the system.

This code takes everything piped to it from standard input and prints the modified string to standard output:

cat infile.csv | node -e 'let data = "";
process.stdin.setEncoding("utf8");
process.stdin.on("data", chunk => data += chunk);
process.stdin.on("end", () => {
  console.log(
    data
      .trim()
      .split("\n")
      .filter(Boolean)
      .map((line) => {
        // Split off the first column and parse it as a URL, adding a scheme if it has none
        const parts = line.split(",");
        const first = parts.shift();
        const url = new URL((/^https?:\/\//.test(first) ? "" : "https://") + first);
        return `${url.host.replace(/^www\./, "")},${parts.join(",")}`;
      })
      .join("\n"))
});' > outfile.csv

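Redirecting the output straight back to the input file is unlikely to work, because the shell truncates the output file when it sets up the redirection, typically before the input has been read. If you want to overwrite the input file, you can pass the file name as an argument after the code instead of using a pipe: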

node -e 'const fs = require("fs");
const infile = process.argv[1];
const data = fs.readFileSync(infile, "utf8");
const output = data
  .trim()
  .split("\n")
  .filter(Boolean)
  .map((line) => {
    // Split off the first column and parse it as a URL, adding a scheme if it has none
    const parts = line.split(",");
    const first = parts.shift();
    const url = new URL((/^https?:\/\//.test(first) ? "" : "https://") + first);
    return `${url.host.replace(/^www\./, "")},${parts.join(",")}`;
  })
  .join("\n");
// Write the result back over the input file, with a trailing newline
fs.writeFileSync(infile, output + "\n")' file.csv
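
For comparison, the usual shell-side way around the overwrite problem is to write to a temporary file and then move it over the original. Here is a minimal sketch, assuming mktemp is available. The function name domains_in_place is hypothetical, and the filter is the awk command from Ed Morton's answer on this page (any filter that reads a file and writes to standard output could take its place):

```shell
# Rewrite column 1 of a CSV file in place by going through a temporary file.
# Redirecting straight back to "$1" would truncate it before it is read.
domains_in_place() {
  tmp=$(mktemp) || return 1
  if awk 'BEGIN{FS=OFS=","} {sub("^([^/:]+://)?(www[.])?","",$1); sub("/.*","",$1)} 1' "$1" > "$tmp"
  then
    mv "$tmp" "$1"    # replace the original only if the filter succeeded
  else
    rm -f "$tmp"      # leave the original untouched on failure
    return 1
  fi
}
```

Usage would be domains_in_place file.csv. Note that mv replaces the file, so the result takes on the permissions of the temporary file instead of those of the original.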
Adamq

Using Raku (formerly known as Perl_6)

raku -pe 's{ (^ <-[/]>* \/\/ )? (w**3 \.)? (<-[/]>*) <-[,]>* } = "$2";'  

[Above is a translation of sed code from @HatLess].

raku -pe 's{ ^ (.* "://" )? (www\.)? } = ""; s{ ^ (<-[,/]>+) <-[,]>* } = "$0";' 

[Above is a translation of sed code from @D_Bear].

Sample Output (both cases):

example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
unix.random.com,column2,$3,4
example.com,column2,$3,4
example.com,column2,$3,4
random.com,column2,$3,4
example.com,column2,$3,4 

https://raku.org

jubilatious1