2

I have a text log file

$ cat aaa
673                  20160405 root "/path_to/gis/20160401/20160301_placement_map_org.dbf" ""
673                  20160405 root "/path_to/gis/20160401/20160310_20160401ent_map_org.dbf" ""
790890               20170201 jle  "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""
5883710              20160406 dho  "/path_to/gis/20160401/20160401_Pina_Asc_Rapid_Report_Minesouth.pdf" ""
673                  20160405 dho  "/path_to/gis/20160401/20160310_20160401 placement map org.dbf" ""

Now I have this script, which should output just the full path of each file:

#!/bin/bash

function nodatechk() {
    arr=("$@")
    for ((i=3;i<${#arr[@]};i+=5));
    do
      echo "${i}" "${arr[i]}"
    done
}

r=( $(grep gis aaa) ) 

nodatechk "${r[@]}"

The output is broken because the 3rd and 5th lines have a space inside the path element, even though it is enclosed in double quotes.

How can I fix this? (BTW, I know I can use awk or cut to print columns, but in this case I just want to use grep.) Thanks.

Gao
  • Look at the order of expansions in man bash. $(grep gis aaa) undergoes word splitting, but literal double quotes aren't there, they're a result of the command substitution, so they don't influence the word splitting. – choroba Dec 04 '18 at 23:34
  • Note that awk '/gis/ && n++ % 5 == 3 {print n-1, $0}' < aaa is simpler and would be several orders of magnitude faster for large inputs. You generally don't want to use shell loops to process text. – Stéphane Chazelas Dec 05 '18 at 08:34
  • You don't seem to be using grep to output any columns. It is unclear what you want to be doing with grep to output the column that you are interested in. awk would be a better choice for that. It's also unclear whether the columns are tab or space separated. – Kusalananda Dec 06 '18 at 17:13
  • I guess you want to print the fourth word (column), i.e., the pathname, from every line in the input file that contains gis (anywhere in the line), treating a string enclosed in quotes as a single word, even if it contains space(s). It would be nice if you (1) said so, (2) showed what output you want, and (3) included some lines in your file that *don't* contain gis; as it is, the whole idea of doing this with grep seems misguided. … (Cont’d) – G-Man Says 'Reinstate Monica' Dec 07 '18 at 00:00
  • (Cont’d) …  Also, it would be nice if you gave the bigger picture.  For example, are the columns separated by spaces or tabs (or a combination)?  If tabs (only), is it a single tab per column, or is it enough tabs to make the columns line up visually (regardless of how long the words are)?  And can the values contain tabs?  From your attempt, I guess you assume that every line will have exactly five columns — there will always be exactly one thing after the pathname.  But will it always be "", or might it be "foo" — or might it be "foo bar" (with embedded spaces)?  … (Cont’d) – G-Man Says 'Reinstate Monica' Dec 07 '18 at 00:00
  • (Cont’d) …  Might the first three columns ever contain quotes?  Will the fourth and fifth columns always begin and end with ", even if they're non-null and don't contain any blanks?  And can the pathname possibly *contain any quote characters? … … … … … … … … … … … P.S. How would you do this with awk and/or cut, given that the column you want contains spaces?* – G-Man Says 'Reinstate Monica' Dec 07 '18 at 00:00
  • @StéphaneChazelas: I agree that the OP's attempt is very bad — so bad that (I believe) you misunderstood it.  I believe that they are just trying to do awk '/gis/ { print $4 }', but without treating spaces between quotes as field separators.  The whole every-fifth-thing logic is there only because they expect $(grep gis aaa) to give them an array of words, with five words per line. – G-Man Says 'Reinstate Monica' Dec 07 '18 at 00:03

3 Answers

4

The problem has its roots in this line:

 r=( $(grep gis aaa) )

As you will immediately see if you try:

 printf '<%s>\n' $(grep gis aaa)

The unquoted command substitution is split on the characters in "$IFS" (space, tab and newline by default), so the double quotes in the file do not protect anything.

It also exposes the values from the file to globbing, which will expand some *, ? and […] patterns (which ones depends on the files in your current directory and on the state of several shell options).
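To make both effects visible, here is a minimal sketch. The scratch directory, the files file1 and file2, and the sample line with a * in its path are all made up for illustration; they are not from the question's log.

```bash
#!/bin/bash
# Hypothetical scratch directory and sample data, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch file1 file2
printf '%s\n' '673 20160405 dho "/path_to/gis/a * b.dbf" ""' > aaa

# Unquoted command substitution: the output is split on IFS,
# then every resulting word is treated as a glob pattern.
words=( $(grep gis aaa) )

# The quoted path is broken at its spaces, and the lone * is
# replaced by the names of the files in the current directory.
printf '<%s>\n' "${words[@]}"
```

With default shell options, the quote characters end up as ordinary characters inside the words, and the * becomes aaa, file1 and file2.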

One (not recommended) solution is to change IFS to the split character and disable globbing for the split:

 IFS=$'\n'; set -f; r=( $(grep gis aaa) )
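Part of the reason this approach is awkward is that IFS and the noglob option are global state, so they should be put back afterwards. A sketch of what that might look like, using a made-up two-line sample file and assuming globbing was enabled beforehand:

```bash
#!/bin/bash
# Hypothetical scratch directory and sample data, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' \
  '673 20160405 dho "/path_to/gis/a * b.dbf" ""' \
  '790890 20170201 jle "/path_to/gis/c d.kmz" ""' > aaa

old_ifs=$IFS          # remember the current field separators
IFS=$'\n'             # split on newlines only
set -f                # disable globbing for the split
r=( $(grep gis aaa) )
set +f                # re-enable globbing (assumes it was on before)
IFS=$old_ifs          # restore the original field separators

# Each array element is now one whole line of the file.
printf '<%s>\n' "${r[@]}"
```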

But a simpler solution is to use what the shell already provides:

 readarray -t r < <(grep gis aaa)

That will split on newlines (assuming there are no newlines in the pathnames).
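As a quick check, the following sketch reads a made-up two-line sample into an array and prints one element per line. It assumes bash 4.0 or later, since readarray is not available in older versions; the sample data is hypothetical.

```bash
#!/bin/bash
# Hypothetical scratch directory and sample data, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' \
  '673 20160405 dho "/path_to/gis/a * b.dbf" ""' \
  '790890 20170201 jle "/path_to/gis/c d.kmz" ""' \
  '12 20160405 root "/path_to/other/e.txt" ""' > aaa

# One array element per matching line; -t drops the trailing newline.
# No word splitting or globbing is applied to the lines.
readarray -t r < <(grep gis aaa)

printf '<%s>\n' "${r[@]}"
```

Unlike the IFS approach, nothing global has to be changed or restored here.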

Then, to avoid splitting each line again to get each part, which could expose the line to whitespace splitting and globbing, let's remove the leading and trailing parts of the lines.

If from each line we remove everything from the beginning up to the "/ (double quote and slash) and everything from the " (double quote and space) to the end, we will get a clean pathname:

 #!/bin/bash

 function nodatechk() {
    for l do
        l="/${l#*\"/}"                # Remove leading text up to `"/`
        l=${l%\" *}                   # Remove trailing text from `" `
        printf '%s\n' "$l"
    done
 }

 readarray -t r < <(grep gis aaa)

 nodatechk "${r[@]}"
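
To see what the two parameter expansions do, here is a sketch that applies them to one line from the question and keeps the intermediate value. The variable names step1 and step2 are only for illustration.

```bash
#!/bin/bash
line='790890               20170201 jle  "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""'

# Remove the shortest prefix ending in "/ and put the slash back.
step1="/${line#*\"/}"

# Remove the shortest suffix that starts with a double quote and a space.
step2=${step1%\" *}

printf '%s\n' "$step1" "$step2"
```

The first printed line still carries the closing quote and the empty last column; the second is the bare pathname with its spaces intact.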
0

A grep-only solution is

grep gis aaa | grep -o '^[^"]*"[^"]*"' | grep -o '"[^"]*"$'

The first grep is the same as what you have in the question.  Obviously, it selects lines that contain gis (anywhere in the line).  The second grep,

grep -o '^[^"]*"[^"]*"'

matches everything up through (and including) the first quoted string on the line (i.e., columns 1 through 4), and, because of the -o option, outputs only that part of the line.  The third grep,

grep -o '"[^"]*"$'

matches the last quoted string on the line (which, at this point, is column 4 from the original line) and outputs only that string.
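
The two -o stages can be followed on a single line. This sketch uses a shortened, made-up path and stores each stage in a variable (stage2 and stage3 are illustrative names); it assumes a grep that supports -o, such as GNU or BSD grep.

```bash
#!/bin/bash
# Hypothetical sample line in the same layout as the log in the question.
line='5883710 20160406 dho "/path_to/gis/20160401/a b.pdf" ""'

# Stage 2: keep everything up to and including the first quoted string.
stage2=$(printf '%s\n' "$line" | grep -o '^[^"]*"[^"]*"')

# Stage 3: keep only the quoted string at the end of what is left.
stage3=$(printf '%s\n' "$stage2" | grep -o '"[^"]*"$')

printf '%s\n' "$stage2" "$stage3"
```

Note that the result still includes the surrounding double quotes.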


P.S. If your file has one tab between each pair of columns, and the values don't contain tabs, the simple way to get the fourth column is

awk -F'\t' '/gis/ { print $4 }' aaa
-1

I read this post and solved the problem by using 'eval'. So I changed this line:

r=( $(grep gis aaa) )

to

eval r="( $(grep gis aaa) )"

Gao