I have this sample:

           "name": "The title of website",
           "sync_transaction_version": "1",
           "type": "url",
           "url": "https://url_of_website"

I want to get the following output:

"The title of website"    url_of_website

I need to remove the protocol prefix from the URL, so that only url_of_website is left (with no http at the front). The problem is that I'm not very familiar with making sed read multiple lines. Some research led me to https://unix.stackexchange.com/a/337399/256195, but I still can't produce the result.

The valid JSON document I'm trying to parse is a Google Chrome Bookmarks file. Sample:

{
   "checksum": "9e44bb7b76d8c39c45420dd2158a4521",
   "roots": {
      "bookmark_bar": {
         "children": [ {
            "children": [ {
               "date_added": "13161269379464568",
               "id": "2046",
               "name": "The title is here",
               "sync_transaction_version": "1",
               "type": "url",
               "url": "https://the_url_is_here"
            }, {
               "date_added": "13161324436994183",
               "id": "2047",
               "meta_info": {
                  "last_visited_desktop": "13176472235950821"
               },
               "name": "The title here",
               "sync_transaction_version": "1",
               "type": "url",
               "url": "https://url_here"
            } ]
         } ]
      }
   }
}
MatthewRock
  • 6,986
Tuyen Pham
  • 1,805
  • 3
    Can you post a valid json object? Also jq or json is the proper tool for this, not sed. – jesse_b Nov 29 '18 at 14:49
  • 4
    You don't parse JSON with sed. JSON is a structured document format unsuitable for parsing by anything other than a JSON parser. Doing it with sed would require you to implement a JSON parser in sed that could handle the different entity encoding etc. that could be present in the data (especially in URLs). – Kusalananda Nov 29 '18 at 14:51
  • @Jesse_b: Thanks, I've just added the json object, and if possible jq and json also work if it can solve the issue. – Tuyen Pham Nov 29 '18 at 14:52
  • @Kusalananda: Thanks, I'll edit the title and change content to suit the context. – Tuyen Pham Nov 29 '18 at 14:53

1 Answer

This works on the JSON document given in the question:

$ jq -r '.roots.bookmark_bar.children[]|.children[]|["\"\(.name)\"",.url]|@tsv' file.json
"The title is here"     https://the_url_is_here
"The title here"        https://url_here

This iterates over the entries of the .roots.bookmark_bar.children array, then over each entry's own .children array, and formats every bookmark found there the way you showed in the question: the quoted name and the URL, with a tab character between the two pieces of data.

If the double quotes are not necessary, you could change the cumbersome ["\"\(.name)\"",.url] to just [.name,.url].

To trim the leading https:// from the URLs, use

.url|ltrimstr("https://")

instead of just .url.
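
Putting the pieces together, a complete command might look like the sketch below. To keep it self-contained it writes a cut-down copy of the sample from the question to a temporary file (a stand-in for the real Bookmarks file), and it assumes jq 1.5 or later for @tsv and ltrimstr.

```shell
# Write a trimmed copy of the question's sample to a temporary file
# (a stand-in for the real Bookmarks file).
tmpfile=$(mktemp)
cat >"$tmpfile" <<'EOF'
{
  "roots": {
    "bookmark_bar": {
      "children": [ {
        "children": [
          { "name": "The title is here", "type": "url", "url": "https://the_url_is_here" },
          { "name": "The title here", "type": "url", "url": "https://url_here" }
        ]
      } ]
    }
  }
}
EOF

# Quote the name, strip a leading "https://" from the URL, and emit
# both fields as one tab-separated line per bookmark.
result=$(jq -r '
  .roots.bookmark_bar.children[]
  | .children[]
  | ["\"\(.name)\"", (.url | ltrimstr("https://"))]
  | @tsv
' "$tmpfile")

printf '%s\n' "$result"
rm -f "$tmpfile"
```

The parentheses around .url | ltrimstr("https://") matter: without them, the pipe would end the array element early and the filter would no longer build a two-element array.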

Kusalananda
  • 333,661
  • Thanks. At the end of the file I get this error: jq: error (at Bookmarks:23397): Cannot iterate over null (null). 23397 is the last line of the file. – Tuyen Pham Nov 29 '18 at 15:08
  • So I've just modified your command; the correct one should be: jq -r '.roots.bookmark_bar.children[]|.children[]?|["\"\(.name)\"",.url]|@tsv', which eliminates the above error. One more question: is that a space or a tab between the title and the URL? What if I need to insert a tab between them? – Tuyen Pham Nov 29 '18 at 15:17
  • 1
    @TuyenPham, it's a tab. "@tsv" is a jq formatter for tab-separated values. You could also use @csv to get output like "The title here","https://url_here" – glenn jackman Nov 29 '18 at 15:20
  • @TuyenPham I only had the partial document that you provided to look at, so no wonder there were errors. Good work sorting them out! The @tsv command formats the array that it gets as a tab-delimited string. – Kusalananda Nov 29 '18 at 15:20
  • How do I trim both http:// and https://? – Tuyen Pham Nov 29 '18 at 15:26
  • I see, it's .url|ltrimstr("https://")|ltrimstr("http://")|ltrimstr("www.") to trim http://, https:// and www. – Tuyen Pham Nov 29 '18 at 15:28
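
The refinements worked out in the comments above can be combined into a single command. The following is only a sketch: the inline JSON is a made-up miniature document (the names and example.* URLs are hypothetical), with one folder and one top-level bookmark that has no "children" key, which is the situation that would otherwise produce the "Cannot iterate over null" error.

```shell
# Hypothetical miniature bookmarks document: one folder with two bookmarks,
# plus one top-level bookmark that has no "children" key.
json='{
  "roots": { "bookmark_bar": { "children": [
    { "type": "folder", "name": "Folder", "children": [
      { "name": "Plain", "type": "url", "url": "http://www.example.com/a" },
      { "name": "Secure", "type": "url", "url": "https://example.org/b" }
    ] },
    { "name": "Top level", "type": "url", "url": "https://example.net/" }
  ] } }
}'

# .children[]? skips entries without a "children" array instead of failing;
# each ltrimstr removes its prefix only when the string starts with it.
trimmed=$(printf '%s' "$json" | jq -r '
  .roots.bookmark_bar.children[]
  | .children[]?
  | [.name, (.url | ltrimstr("https://") | ltrimstr("http://") | ltrimstr("www."))]
  | @tsv
')

printf '%s\n' "$trimmed"
```

Note that the ? only suppresses the error: the top-level bookmark in this example is skipped, not printed, because the filter still looks for bookmarks exactly one folder deep.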