
I've been given some code that is supposed to work, but it doesn't, and I'm trying to understand why. I'm learning bash and awk for that reason, but I find them quite confusing. I would be really grateful if someone could help me understand this awk code.

cvgMids.txt contains many lines of the following format:

<http://rdf.freebase.com/ns/g.11b74p1stp>   <http://rdf.freebase.com/ns/type.object.type>   <http://rdf.freebase.com/ns/cvg.video_game_soundtrack>  .
<http://rdf.freebase.com/ns/g.11bc4msmrn>   <http://rdf.freebase.com/ns/type.object.type>   <http://rdf.freebase.com/ns/cvg.cvg_developer>  .
<http://rdf.freebase.com/ns/g.11bxxz28q6>   <http://rdf.freebase.com/ns/type.object.type>   <http://rdf.freebase.com/ns/cvg.computer_videogame> .
  • What is the point of BEGIN{i=0;}? I don't see the variable i being used in any of the following lines.

  • What is <(cat cvgMids.txt) <(gzip -dc freebase-rdf-latest.gz) > cvg_predicates.txt for? I get that the input files go at the end of the awk command, but all the parentheses etc. confuse me.

awk 'BEGIN{i=0;}
FNR == NR {
    if($1 in a) next;
    a[$1] = $1;
    next
}
FNR<NR {
    if($1 in a) {print $0;}}' <(cat cvgMids.txt) <(gzip -dc freebase-rdf-latest.gz) > cvg_predicates.txt
Jeff Schaller
  • What is the code intended to do, and in what way is it failing? It seems to have been written by someone who has little experience with awk. – steeldriver Mar 08 '20 at 13:15
  • "Everytime $1 contains the whole line" ... It would contain the first column. Unless the FS is set to something non-default (and I don't see that), it won't contain the whole line. – muru Mar 08 '20 at 13:22
  • After a[$1] = $1; I added a print a[$1]; and the whole line is being printed. – prof chaos Mar 08 '20 at 13:23
  • And as for the <(), see https://unix.stackexchange.com/questions/294635/what-is-the-bash-file-contents-syntax-called/294636#294636 – muru Mar 08 '20 at 13:23
  • If you meant print a[$1]; no, it doesn't - not for me - I get the first column as expected. – muru Mar 08 '20 at 13:25
  • @steeldriver I believe its purpose is to get the first column (for example http://rdf.freebase.com/ns/g.11b74p1stp) and use it to fetch all lines from freebase-rdf-latest.gz that contain it. – prof chaos Mar 08 '20 at 13:25
  • You are right, muru, I must have made a mistake. It does print the first column. – prof chaos Mar 08 '20 at 13:28

1 Answer


What the snippet appears to do is output the lines from the uncompressed contents of freebase-rdf-latest.gz whose first whitespace-delimited field ($1) matches the first whitespace-delimited field of any line in cvgMids.txt. However, it could be written more simply.

In particular:

  • as you noted, i is not used anywhere so the BEGIN block may be eliminated

  • the sequence

    if($1 in a) next;
    a[$1] = $1;
    next
    

    could be reduced to

    a[$1];
    next
    

    (the array's values are never used, only its indices and it's almost certainly as efficient to re-assign the index multiple times as to test and conditionally assign it)

  • in the rule-action

    FNR<NR {
        if($1 in a) {print $0;}}
    

    you don't really need FNR<NR since you've already dealt with the case FNR==NR, and FNR>NR is not going to happen [1]. Also, {print $0;} is the default action. So it would be more idiomatic to write

    $1 in a 
    
  • <(cat cvgMids.txt) and <(gzip -dc freebase-rdf-latest.gz) are shell process substitutions. Functionally, the first is equivalent to passing cvgMids.txt directly as a file argument (it's both a Useless Use of cat and a useless use of process substitution). Perhaps it was used for aesthetic reasons.
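To make the process substitution point concrete, here is a minimal bash sketch. The temporary file and its contents are made up for illustration and merely stand in for cvgMids.txt. It shows that <(...) expands to a file name that awk opens like any other file, so reading a plain file through <(cat ...) gives the same result as naming the file directly.

```shell
#!/usr/bin/env bash
# Hypothetical sample file standing in for cvgMids.txt.
workdir=$(mktemp -d)
printf '%s\n' 'k1 x y .' 'k2 x z .' > "$workdir/mids.txt"

# <(cmd) expands to a file name connected to cmd's output
# (often something like /dev/fd/63; the exact path is system dependent).
subst_path=$(echo <(true))

# awk just sees a file name in both cases, so the results are identical.
via_subst=$(awk '{print $1}' <(cat "$workdir/mids.txt"))
via_file=$(awk '{print $1}' "$workdir/mids.txt")

echo "process substitution expands to: $subst_path"
printf '%s\n' "$via_subst"
rm -rf "$workdir"
```

The second substitution, <(gzip -dc ...), is the useful one: it lets awk read the decompressed stream as if it were a file, without writing the uncompressed data to disk.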

Putting it all together, we get

awk 'FNR == NR {a[$1]; next} $1 in a' cvgMids.txt <(gzip -dc freebase-rdf-latest.gz) > cvg_predicates.txt

However, the simplified version is functionally equivalent to the original, so if the original is not working, the simplified version won't work either.
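One way to check that equivalence is to run both programs on a tiny data set. The sketch below assumes bash, and the file names and triples are invented placeholders rather than real Freebase data. It runs the program from the question and the simplified one-liner on the same inputs and compares their output.

```shell
#!/usr/bin/env bash
# Made-up miniature inputs; names and contents are placeholders.
workdir=$(mktemp -d)
mids="$workdir/mids.txt"
dump="$workdir/dump.gz"

printf '%s\n' \
  '<g.1> <type> <cvg.developer> .' \
  '<g.2> <type> <cvg.soundtrack> .' \
  '<g.1> <type> <cvg.publisher> .' > "$mids"

printf '%s\n' \
  '<g.1> <name> "Alpha" .' \
  '<g.3> <name> "Gamma" .' \
  '<g.2> <name> "Beta" .' \
  '<g.1> <genre> "Puzzle" .' | gzip -c > "$dump"

# The program from the question, unchanged.
original=$(awk 'BEGIN{i=0;}
FNR == NR {
    if($1 in a) next;
    a[$1] = $1;
    next
}
FNR<NR {
    if($1 in a) {print $0;}}' <(cat "$mids") <(gzip -dc "$dump"))

# The simplified one-liner.
simplified=$(awk 'FNR == NR {a[$1]; next} $1 in a' "$mids" <(gzip -dc "$dump"))

if [ "$original" = "$simplified" ]; then
  echo "identical output:"
  printf '%s\n' "$simplified"
else
  echo "outputs differ"
fi
rm -rf "$workdir"
```

With this sample, both versions keep the three lines whose first field is <g.1> or <g.2> and drop the <g.3> line, since <g.3> never appears as a first field in the key file.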


[1] Unless your code modifies FNR and/or NR, which is legal but rarely done in practice.
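As a small illustration of why FNR == NR singles out the first file, the following sketch prints both counters for two throwaway files (the file names and contents are made up). NR counts records across all input, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read.

```shell
# Two throwaway input files with hypothetical contents.
workdir=$(mktemp -d)
printf '%s\n' a b   > "$workdir/first.txt"
printf '%s\n' c d e > "$workdir/second.txt"

# Print the record, NR, FNR, and which branch FNR == NR would select.
counters=$(awk '{ print $1, NR, FNR, (FNR == NR ? "first-file" : "later-file") }' \
  "$workdir/first.txt" "$workdir/second.txt")
printf '%s\n' "$counters"
rm -rf "$workdir"

# Output:
# a 1 1 first-file
# b 2 2 first-file
# c 3 1 later-file
# d 4 2 later-file
# e 5 3 later-file
```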

steeldriver