With sed
If you can format the file in the form of a sed script you can do it automatically. The following should work with a GNU sed; with a BSD sed it will work if you use -i '' in place of the bare -i for the second sed invocation...
sed -ne's|[]\*&^.$/[]|\\&|g' \
    -e's|..*|/^@&",/d|p' <./list.txt |
sed -i -e'h;s/[^,]*[^@]*//' -f- -eg ./data.txt
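To see what that first sed hands to the second, you can run it on its own. If list.txt held, say, hotmail.com and 123.com (domains borrowed from the examples further down), the generated script would be:
/^@hotmail\.com",/d
/^@123\.com",/d
...which the second sed then runs between its opening h;s/[^,]*[^@]*// and its closing g.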
If you do...
-e's|..*|/^@&",/Id|p' ...
...in the second line, a GNU sed will delete matches for any line in list.txt case-insensitively, but the I flag will amount to a syntax error with most any other sed.
It tries to optimize the matching by removing the first field and everything before the first @ in the second field at the head of the script it runs for every line, then doing the match checks, and, if the line makes it through all of them, getting back a copy of the line it saved at the top of the script in hold space. That way sed doesn't need to match /^[^,]*,[^,]*.../ for every pattern. If list.txt is very long, though, it will not be a fast process regardless; grep -F should be preferred in that case (and probably in this case).
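A rough fixed-string sketch along those lines, with a grep that accepts -f - for patterns on stdin (./data.out is just an illustrative output name here, and it matches the @domain", string anywhere on a line rather than strictly in the second field, which should come to the same thing for data shaped like this):
# build @domain", strings from list.txt, then drop every data.txt line containing one
sed -ne's|..*|@&",|p' <./list.txt |
grep -viFf - ./data.txt >./data.out
Drop the -i if you want the matches case-sensitive.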
Both sed and grep can be made to perform better - and in many cases significantly so - if the character set they have to consider is reduced in size. For example, if you are currently in a UTF-8 locale then doing:
( export LC_ALL=C
  sed -ne's|[]\*&^.$/[]|\\&|g' \
      -e's|..*|/^@&",/Id|p' |
  sed -i -e'h;s/[^,]*[^@]*//' -f- \
      -eg ./data.txt
) <./list.txt
...can make a world of difference in that rather than having to consider some umpteen-thousand different characters as matches, the regex engine need only consider 128 possibilities. It should not affect the results in any way - each char is a byte in the C locale and all will get due consideration.
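If you want to see the effect on your own machine, one rough way (./domains.txt is just a scratch name for illustration) is to extract the domains once and then time the same grep with and without the C locale:
# extract the mail domains once
cut -d\" -f4 ./data.txt | cut -d@ -f2 >./domains.txt
# same count, current locale vs. C locale
time grep -Fcixf ./list.txt ./domains.txt
time env LC_ALL=C grep -Fcixf ./list.txt ./domains.txt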
sed -i is not a reliable switch to use in the best of cases, and it should be avoided if at all possible.
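If the in-place edit isn't actually a requirement, the usual workaround is to write to a temporary and rename it over the original - a minimal sketch of the first pipeline done that way (./data.tmp is just an illustrative name):
sed -ne's|[]\*&^.$/[]|\\&|g' \
    -e's|..*|/^@&",/d|p' <./list.txt |
sed -e'h;s/[^,]*[^@]*//' -f- -eg ./data.txt >./data.tmp &&
mv ./data.tmp ./data.txt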
Still, to do this with grep and sed -i:
( export LC_ALL=C
  cut -d\" -f4 | cut -d@ -f2 |
  grep -Fixnf ./list.txt |
  sed -e's|:*\([0-9]*\).*|:\1|p' \
      -e's||\1!{p;n;b\1|p' \
      -e's||};n|' |
  sed -ni -f- -e:n -e'p;n;bn' \
      ./data.txt
) <./data.txt
That is the quickest way I can imagine it might be done with sed's -i. It breaks down like this:
cut | cut
The first two cuts reduce each ./data.txt input line from...
"foxva****omes****","scott@hotmail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
...down to just the mail domain...
hotmail.com
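You can check that reduction on a single line:
printf '%s\n' '"foxva****omes****","scott@hotmail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"' |
cut -d\" -f4 | cut -d@ -f2
hotmail.com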
grep
grep can then compare that input against each line in its pattern file (-f ./list.txt) using case-insensitive (-i), fixed-string (-F), whole-line (-x) matches, and it reports the line number (-n) at the head of each line of its output.
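So if, for example, the domains on lines 10 and 20 of the cut output turned up in list.txt, grep's output would look something like (aol.com here being just a made-up second hit):
10:hotmail.com
20:aol.com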
sed -e
The next sed strips grep's output down to nothing but the line numbers and writes out another sed script, which looks like this (given hypothetical grep matches on lines 10 and 20):
:10
10!{p;n;b10
};n
:20
20!{p;n;b20
};n
sed -ni -f-
The last sed reads - stdin - as its script and only ever executes it the one time. It doesn't backtrack and run the script afresh for every input line as is commonly done with sed scripts; rather, as it executes its script the first and only time, it works its way through the input - and it needs only try a single test per input line.
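Pasted together with the two -e expressions from the command line, the complete script that last sed executes (for the hypothetical matches on lines 10 and 20) comes to:
:10
10!{p;n;b10
};n
:20
20!{p;n;b20
};n
:n
p;n;bn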
Given our previous example, for lines 1 through 9 sed will do:
- If the current line is not (!) the 10th, then { print (p) the current line, overwrite the current line with the next (n) input line, and branch (b) back to the label (:) named 10.
...and for the last series of lines sed will print (p), then overwrite the current line with the next (n), then branch (b) back to the :n label, until it has consumed all of the input.
That doesn't work if ./data.txt is very large, though, because sed gets stuck trying to process a script far larger than it can reliably handle. The way around that is to take the input in chunks. This can be done reliably - even in a pipeline - if you use the right kind of reader, and dd is that right kind of reader.
I created a test file like this:
sh -c ' _1=\"foxva****omes****\",\"scott@
_2='\''","8*** Rd","Ne***ah","Wi***in","54***","*******"'\''
n=0
for m do printf "$_1%s$_2\n$_1$((n+=1))not_free.com$_2\n" "$m"
done
' $(cat ~/Downloads/list.txt) >/tmp/data.txt
...where list.txt is the one obtained per your other question. It alternates a line whose domain comes from list.txt with a line whose domain does not, like...
"foxva****omes****","scott@11mail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@1not_free.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@123.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@2not_free.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
I then brought it up to a little over 80MB like...
while [ "$(($(wc -c <data.txt)/1024/1024))" -lt 80 ]
do cat <<IN >./data.txt
$( cat ./data.txt ./data.txt)
IN
done
ls -hl ./data.txt
wc -l <./data.txt
-rw-r--r-- 1 mikeserv mikeserv 81M Jul 19 22:22 ./data.txt
925952
...and then I did...
(   trap rm\ data.tmp 0; export LC_ALL=C
    <./data.txt dd bs=64k cbs=512 conv=block |
    while   dd bs=64k cbs=512 conv=unblock \
               count=24 of=./data.tmp
            [ -s ./data.tmp ]
    do
        <./data.tmp cut -d\" -f4 | cut -d@ -f2 |
        grep -Fixnf ./list.txt |
        sed -e's|:*\([0-9]*\).*|:\1|p' \
            -e's||\1!{p;n;b\1|p' \
            -e's||};n|' |
        sed -nf- -e:n -e'p;n;bn' ./data.tmp
    done 2>/dev/null
) | wc -l
1293+1 records in
7234+0 records out
474087424 bytes (474 MB) copied, 21.8488 s, 21.7 MB/s
462976
You can see right there that the whole process took 22 seconds, and that the output line count is at least correct - 462976 is half of 925952 and the input should have come out halved.
The technique works because dd's reads and writes can be counted upon to the byte - even over a pipe - if you know what you're about. And you can even break the input out by line with the same degree of precision if you can reliably convert by a maximum-line-length block size (which is 512 here, or {_POSIX_LINE_MAX}).
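To put numbers on that for the run above: conv=block pads every line out to exactly cbs=512 bytes, so each full 64k block holds 65536/512 = 128 lines, and each count=24 chunk handled by the loop covers at most 24 * 128 = 3072 lines. The byte count bears it out - the 474087424 bytes the blocking dd reports writing, divided by 512, is 925952, exactly ./data.txt's line count.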
The imaginative reader might rightly surmise that the same technique could be applied to an in-stream of any kind - even the live-log kind - with only a slight modification here or there (namely, to do it safely, the first dd's arguments would need to change from bs= to obs=). In every case, though, you would need some assurance of the maximum input line size, and, if a line can legitimately end in a <space> character, some additional filter mechanism inserted before the dd processes to protect the trailing <spaces> from being stripped by dd conv=unblock (which works by stripping all trailing blanks from each cbs-sized conversion block and appending a newline). tr and (un)expand spring to mind as likely candidates for such a filter.
This is not the fastest way to do this - for that you'd want to look to a merge sort (-m) operation, I expect - but it is pretty quick, and it will work with your data. It does kind of break the sed -i thing, though - but I think that will be true no matter which way you go.