
I'm trying to bowdlerize email addresses in a fixed length text file by generating a random string the same length as the input. I'm passing the string as a backreference in sed.

To simplify, this script (temp):

#!/bin/zsh
IFS=$'\n'       # make newlines the only separator
set -f          # disable globbing

#show me the input from the command line
echo $1 ${#1}

function randString() {
    # just echo for demonstration
    echo $1 ${#1}
    # this is the bit I really want:
    # cat /dev/urandom | LC_ALL=C tr -cd "[a-z]" | head -c${#1}
}

for line in $(cat $1); do
    echo $line |
        sed "s/\([a-zA-Z0-9]\{2,\}\)@\([a-zA-Z0-9]\{2,\}\)\.\([a-zA-Z0-9]\{2,\}\)/$(randString \\1)@$(randString \\2).$(randString \\3)/"
done

with this data (temp.txt):

me myemail@someserver.com
you youremail@anotherserver.biz

run like this:

./temp temp.txt

gives me this output:

temp.txt 8
me myemail 2@someserver 2.com 2
you youremail 2@anotherserver 2.biz 2

The problem being that ${#1} returns 2 no matter what string I feed it. How is the proper string coming back from $1 while the length from ${#1} is so very wrong? Is setting IFS for the file loop killing my function?

NOTE: I'm on Mac so no GNU extensions.

2 Answers

A couple of diagnostic bits that show what is happening.

With the line echo rand $1 ${#1} >&2 added to the randString function, this is the output:

temp.txt 8
rand \1 2
rand \2 2
rand \3 2
me myemail 2@someserver 2.com 2
rand \1 2
rand \2 2
rand \3 2
you youremail 2@anotherserver 2.biz 2

By echoing the input to stderr with >&2, we can see that randString is being called with the arguments \1, \2 and \3 (which have a length of 2), and not the strings that those backreferences were supposed to indicate.

The next test is to preface the call to sed with echo so we can see what arguments it is getting. The output from that:

sed s/\([a-zA-Z0-9]\{2,\}\)@\([a-zA-Z0-9]\{2,\}\)\.\([a-zA-Z0-9]\{2,\}\)/\1 2@\2 2.\3 2/

With this, sed is being told to replace the strings with something like \1 2, i.e. a backreference to the input string, followed by a space and the number 2. The strings from the input email address are coming from sed, not from the echo in the function.

This is because the command substitutions in the string (the $(...) expansions) are processed by zsh before the string is passed to sed as an argument. To get the matched strings passed to the function, sed itself would have to call the shell function, and there may be no way to do that with the default version of sed.
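The expansion order can be checked in isolation. This is a minimal sketch, assuming a hypothetical helper arg_len (not part of the original script) that prints the length of its argument. The shell builds the complete sed script first, so the helper only ever sees the two characters \1.

```shell
# arg_len is a hypothetical stand-in for randString: it prints the length of $1.
arg_len() { printf '%s' "${#1}"; }

# The command substitution runs here, in the shell, before sed is started.
# arg_len receives the two literal characters \1, so it prints 2.
sed_script="s/\(someserver\)/$(arg_len '\1')/"

# Show the script that sed will actually receive.
printf '%s\n' "$sed_script"

# sed can only insert the text that was computed beforehand.
result=$(printf 'someserver\n' | sed "$sed_script")
printf '%s\n' "$result"
```

By the time sed runs, nothing of the function call is left in its script, only the function's output.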


Edited to add: a quick script for munging email addresses that is mostly zsh:

#!/usr/bin/env zsh
setopt extendedglob

coproc cat /dev/urandom | LC_ALL=C tr -cd "[:lower:]"

getRand (){
    print -r -- ${1//(#m)[[:alnum:]]/$(read -psk var;echo $var)}
}

while read line; do
    print -r -- ${line/(#m)[[:alnum:]](#c2,)@[[:alnum:]](#c2,).[[:alnum:]](#c2,)/$(getRand ${MATCH})}
done < ${1:?}

Gairfowl

#!/bin/zsh
IFS=$'\n'       # make newlines the only separator
set -f          # disable globbing

In zsh, -f (as in zsh -f, like csh -f) means skip reading startup files, not disable globbing (except in sh/ksh emulation). To disable globbing, you need set -o noglob or set +o glob (or the setopt/unsetopt variants).

You'd use set -f in other Bourne-like shells to work around their misfeature by which globbing is performed upon unquoted expansions. But zsh doesn't have that misfeature as the globsubst option is disabled by default (when not in sh/ksh emulation mode).
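For comparison, here is a small sketch of the behaviour that set -f works around in other Bourne-like shells (sh, bash, dash). The temporary directory and file names are made up for the demonstration, and the path returned by mktemp is assumed to contain no whitespace. Under zsh with default options, the unquoted expansion would not be globbed in the first place.

```shell
# Demonstration for sh/bash/dash; the file names are arbitrary.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt"
pat="$dir/*.txt"

set -- $pat            # unquoted: the shell globs the expanded value
unquoted_count=$#      # one word per matching file
set -- "$pat"          # quoted: the value stays a single literal word
quoted_count=$#

set -f                 # disable globbing in these shells
set -- $pat            # unquoted again, but no filename generation now
noglob_count=$#
set +f                 # re-enable globbing

rm -rf "$dir"
printf '%s %s %s\n' "$unquoted_count" "$quoted_count" "$noglob_count"
```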

#show me the input from the command line
echo $1 ${#1}

It should be print -r -- $1 $#1, echo -E - $1 $#1, or printf '%s\n' "$1 $#1"; otherwise it won't work properly for values of $1 that contain backslashes, or for some values that start with -.
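A portable sketch of the printf variant, wrapped in a hypothetical function named show so that it can be tried with awkward values:

```shell
# show is a hypothetical name; it prints its argument followed by the argument's length.
show() { printf '%s\n' "$1 ${#1}"; }

show '-n'       # a value that echo could mistake for an option
show 'a\tb'     # a value containing a backslash, printed verbatim
```

Because the value is passed as an argument to the %s format rather than being part of the format string, printf neither interprets the backslash nor treats the leading - as an option.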

function randString() {

I'd pick either the Bourne syntax, randString() {, or the Korn syntax, function randString {, rather than combining the two (though that is only a matter of taste).

    # cat /dev/urandom | LC_ALL=C tr -cd "[a-z]" | head -c${#1}

Concatenating a single file makes little sense.

Beware that with most tr implementations, tr -cd "[a-z]" would also produce [ and ] characters.

$ echo '[]123ab' | tr -cd '[a-z]'
[]ab
$ echo '[]123ab' | tr -cd a-z
ab
}
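Putting the two remarks above together, the commented-out generator could be written along these lines. This is only a sketch: rand_string is a hypothetical name, and it assumes /dev/urandom exists and that head accepts -c, as it does on macOS and GNU systems.

```shell
# rand_string N: print N random lowercase ASCII letters (hypothetical helper).
# The redirection replaces cat, and a-z without brackets keeps [ and ] out of the output.
rand_string() {
    LC_ALL=C tr -dc a-z < /dev/urandom | head -c "$1"
}

rand_string 8; echo
```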

for line in $(cat $1); do

That's not how you process text in shells. See Why is using a shell loop to process text considered bad practice?

    echo $line |
        sed "s/\([a-zA-Z0-9]\{2,\}\)@\([a-zA-Z0-9]\{2,\}\)\.\([a-zA-Z0-9]\{2,\}\)/$(randString \\1)@$(randString \\2).$(randString \\3)/"

In there, the shell performs the $(...) expansions before calling sed. randString \\1 calls randString with a literal \1 as its argument, so you end up calling sed with s/\([a-zA-Z0-9]\{2,\}\)@\([a-zA-Z0-9]\{2,\}\)\.\([a-zA-Z0-9]\{2,\}\)/\1 2@\2 2.\3 2/ as its argument.

Also beware that what [a-z] and co. match depends on the locale.

Here, you should rather run one invocation of a text processing utility, preferably one that can generate random strings. Something like:

#! /bin/sh -
exec perl -Tpe 's{\w{2,}@\w{2,}\.\w{2,}}{
  $& =~ s/\w/chr 97 + rand(26)/ger}ge' -- "$@"

This uses sh instead of zsh, as sh is more than enough here, and calls perl to do the text processing and generate the random strings in a much simpler and more efficient manner.

Or write a perl script instead:

#! /usr/bin/perl --

while (<<>>) {
  s{\w{2,}@\w{2,}\.\w{2,}}{$& =~ s/\w/chr 97 + rand(26)/ger}ge;
  print;
}

This uses <<>> instead of <>. The -p option implies a <> loop, which treats an argument such as ls| as a command whose output is to be processed rather than as a file called ls|, and that is rather dangerous. The -T option somewhat mitigates the security issue; <<>> addresses it.

Doing something similar in zsh internally is possible but would not be pretty.

#! /bin/zsh -
zmodload zsh/mathfunc
zmodload zsh/mapfile
set -o extendedglob

pattern='[[:alnum:]](#c2,)@[[:alnum:]](#c2,).[[:alnum:]](#c2,)'
for file do
  print -rn -- ${mapfile[$file]//(#m)$~pattern/${MATCH//(#m)[[:alnum:]]/${(L)$(( [##36] rand48() * 26 + 10 ))}}}
done

Here using:

  • $mapfile[$file] to load the contents of files
  • zsh's own glob operators to do the matching instead of regexps
  • rand48() to generate random numbers, here between 10 and 35 in base 36 to output letters A to Z, converted to lowercase with the L parameter expansion flag.