
I don't think this question has been asked before, so I don't know if sed is capable of this.

Suppose I have a bunch of numbers in a sentence that I need to expand into words, a practical example being to swap the numbered citations in a typical essay into MLA format:

essay.txt:

Sentence 1 [1]. sentence two [1][2]. Sentence three[1][3].

key.txt (a tab-delimited file):

1   source-one
2   source-two
3   source-three
...etc

Expected Result.txt:

Sentence 1 [source-one]. sentence two [source-one][source-two]. Sentence three[source-one][source-three].

Here's my pseudocode attempt, but I don't understand enough about sed or tr to do it right:

 cat essay.txt | sed "s/$(awk '{print $1}' key.txt)/$(awk '{print $2}' key.txt)/g"

PS: If there's a trick in Notepad++ for mass find-and-replace using multiple terms, that'd be great. As it is, find-and-replace seems to work on only one term at a time, but I need a way to do it en masse for many terms at once.


4 Answers


You should use perl instead:

$ perl -ne '
  ++$nr;                    # count lines across all input files
  if ($nr == $.) {          # $. is the line number within the current file,
                            # so this only holds while reading key.txt
    @w = split;             # split the key line on whitespace
    $k{$w[0]} = $w[1];      # map citation number => source name
  }
  else {                    # now reading essay.txt
    for $i (keys %k) {
      s/(\[)$i(\])/$1.$k{$i}.$2/ge   # replace every [number] with [source]
    }
    print;
  }
  close ARGV if eof;        # reset $. at the end of each file
' key.txt essay.txt
Sentence 1 [source-one]. sentence two [source-one][source-two]. Sentence three[source-one][source-three].
  • This is an incredible and information-dense answer, but I don't know perl conventions very well. I was hoping I could steadily learn more about command-line tools for something that is seemingly straightforward hah. I do know a little about perl, would you care to explain what each line does in case I need to troubleshoot it for other things? – Tom Sep 22 '16 at 05:58

awk can do effectively the same as perl here, a little more simply, although implementations other than GNU may waste a little CPU time unnecessarily splitting the (large?) text file:

awk 'NR==FNR{a["\\["$1"\\]"]="["$2"]";next} {for(k in a) gsub(k,a[k]);print}' key.txt essay.txt
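
Run against the sample essay.txt and key.txt above, this should print:

Sentence 1 [source-one]. sentence two [source-one][source-two]. Sentence three[source-one][source-three].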

Since you asked for an explanation:

  • awk operates by taking a 'script' consisting of pattern-action pairs. It reads one or more files (or standard input) one 'record' at a time, where by default each record is a line, and splits each record into fields, by default at whitespace (which includes tab). It then runs the script against each record: every pattern is tested in turn (unless directed otherwise), and when a pattern matches, its action is executed. Patterns often look at the current record and/or its fields; actions often do something to or with them. Here I specify two files, key.txt essay.txt, so it reads those two files in that order, line by line. The script can be put in a file instead of on the command line, but here I chose not to.

  • the first pattern is NR==FNR. NR is a builtin variable which is the Number of the Record being processed; FNR is similarly the number of the record within the current input file. For the first file (key.txt) these are equal; for the second file (and any others) they are unequal.

  • the first action is {a["\\["$1"\\]"]="["$2"]";next}. awk has 'associative' or 'hashed' arrays; arrayname[subexpr], where subexpr is a string-valued expression, reads or sets an element of the array. $number, e.g. $1 $2 etc., references the fields, and $0 references the whole record. Per the above, this action executes only for lines in key.txt, so for example on the last line of that file $1 is 3 and $2 is source-three, and this stores an array entry with a subscript of \[3\] and a content of [source-three]; see below for why I chose these values. The "\\[" and "\\]" are string literals using escapes whose actual values are \[ and \], while "[" "]" are just [ ], and string operands with no operator between them are concatenated. Finally this action executes next, which means: skip the rest of the script for this record, go back to the top of the loop, and start on the next record.

  • the second pattern is empty, so it matches every line in the second file and executes the action {for(k in a) gsub(k,a[k]);print}. The for(k in a) construct creates a loop, much like Bourne-type shells do in for i in this that other; do something with $i; done, except that here the values of k are the subscripts of the array a. For each such value it executes gsub (global substitute), which finds all matches of a given regular expression and replaces them with a given string; I chose the subscripts and contents in the array (above) so that, for example, \[3\] is a regular expression which matches the text string [3], and [source-three] is the text string you want to substitute for every such match. gsub operates on the current record $0 by default. After doing this substitution for all values in a, it executes print, which by default outputs $0 as it now stands, with all the desired substitutions done. A standalone sketch of these two pieces appears just after this list.
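
If it helps to see those pieces in isolation, here is a minimal sketch (it reads only key.txt, so the NR==FNR guard is dropped). The first command just prints the array that the first action builds; the order of a for (k in a) scan is unspecified, so your output may be ordered differently:

awk '{a["\\["$1"\\]"]="["$2"]"} END{for(k in a) print k, "->", a[k]}' key.txt
\[1\] -> [source-one]
\[2\] -> [source-two]
\[3\] -> [source-three]

The second shows gsub applying one of those stored pairs by hand:

awk 'BEGIN{s="Sentence three[1][3]."; gsub("\\[3\\]","[source-three]",s); print s}'
Sentence three[1][source-three].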

Note: GNU awk (gawk), which is common especially on Linux but not universal, has an optimization where it doesn't actually do the field-splitting if nothing in the patterns or actions executed needs the field values. On other implementations a small amount of CPU time may be wasted, which cuonglm's perl method avoids, but unless your files are humongous this likely won't even be noticeable.

bash$ sed -f <( sed -rn 's#([0-9]+)\s+(.*)#s/\\[\1]/[\2]/g#p' key.txt ) essay.txt

Sentence 1 [source-one]. sentence two [source-one][source-two]. Sentence three[source-one][source-three].
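
In case the nesting is opaque: the inner sed turns each line of key.txt into a sed substitution command, and the outer sed -f reads that generated script (via process substitution) and applies it to essay.txt. Running the inner command on its own shows the generated script (with GNU sed, which the -r and \s already assume):

bash$ sed -rn 's#([0-9]+)\s+(.*)#s/\\[\1]/[\2]/g#p' key.txt
s/\[1]/[source-one]/g
s/\[2]/[source-two]/g
s/\[3]/[source-three]/g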

You can use sed's in-place substitution inside a loop to achieve this:

$ cp essay.txt Result.txt
$ while read n k; do sed -i "s/\[$n\]/\[$k\]/g" Result.txt; done < key.txt
$ cat Result.txt 
Sentence 1 [source-one]. sentence two [source-one][source-two]. Sentence three[source-one][source-three].
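
Note that this invokes sed once per line of key.txt, rewriting Result.txt on each pass, so for a long key file one of the single-pass approaches above will be faster.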