How to search for a string inside a lot of .gz files in an Amazon S3 bucket subfolder? I tried mounting it via s3fs and running zgrep, but it's very slow. Do you use any other methods?
Is there any Amazon service I could use to zgrep them quickly?
I find the quickest way is to copy them locally first then do a local zgrep:
aws s3 cp s3://bucket/containing/the/logs . --recursive
This will copy (cp) all the logs to your current directory (.), including all subfolders (--recursive).
Then a local zgrep:
zgrep "search words" *.gz
Or, to recursively search subdirectories too:
find . -name '*.gz' -print0 | xargs -0 zgrep "STRING"
(Taken from unix.stackexchange.com.)
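If the prefix also contains files you don't need, the CLI's include/exclude filters can limit the copy to .gz objects; the bucket path below is the same placeholder as above:
aws s3 cp s3://bucket/containing/the/logs . --recursive --exclude "*" --include "*.gz"
The filters are applied in order, so excluding everything and then including *.gz downloads only the compressed logs.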
It's not grep, but you can now query logs with Athena:
First, create a table over the access logs stored in your S3 bucket:
CREATE EXTERNAL TABLE IF NOT EXISTS S3Accesslogs(
BucketOwner string,
Bucket string,
RequestDateTime string,
RemoteIP string,
Requester string,
RequestID string,
Operation string,
Key string,
RequestURI_operation string,
RequestURI_key string,
RequestURI_httpProtoversion string,
HTTPstatus string,
ErrorCode string,
BytesSent string,
ObjectSize string,
TotalTime string,
TurnAroundTime string,
Referrer string,
UserAgent string,
VersionId string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$'
)
LOCATION 's3://s3-server-access/logs/'
Then, you can query it with SQL:
SELECT requestdatetime, bucket, remoteip, requesturi_key
FROM s3accesslogs
WHERE bucket IN ('bucket1', 'bucket2')
AND remoteip = '123.45.67.89'
ORDER BY requestdatetime DESC
LIMIT 100;
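If you prefer to run the query outside the console, here is a minimal boto3 sketch; the database name and results location are placeholder assumptions, not values from the answer above:

import time
import boto3

athena = boto3.client('athena')

# Database and OutputLocation are placeholders -- replace with your own
resp = athena.start_query_execution(
    QueryString="SELECT requestdatetime, bucket, remoteip, requesturi_key FROM s3accesslogs WHERE remoteip = '123.45.67.89' ORDER BY requestdatetime DESC LIMIT 100",
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results/'}
)
qid = resp['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print the result rows (the first row is the column headers)
if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])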
I had the same problem. I tried writing a Python script to download each file and run it through zgrep, but it took 30 seconds just to run one grep command. The files were around 200 MB each, so the overall time was very high, and I had to do it for hundreds of files.
In my case, I wrote a Lambda function and ran it in parallel for different .gz files. Each Lambda instance uses its /tmp storage to download one .gz file and run zgrep on it:
import os
import boto3

s3 = boto3.client('s3')               # Bucket and s3object come from the Lambda event
os.chdir('/tmp/')                      # Lambda can only write under /tmp
os.makedirs('s3_logs', exist_ok=True)  # don't fail on a warm (reused) container
os.chdir('s3_logs')
s3.download_file(Bucket, s3object, '/tmp/s3_logs/file.gz')
query = 'zgrep String file.gz'
result = os.popen(query).read()
print(result)
Running multiple Lambda instances in parallel makes this several times faster. Remember, though, that a Lambda function's /tmp storage is only 512 MB by default.
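A rough sketch of the fan-out is below; the bucket, prefix, and the function name grep-gz-worker are hypothetical placeholders for the worker described above:

import json
import boto3

s3 = boto3.client('s3')
lam = boto3.client('lambda')

# List every .gz object under the subfolder and hand each one to its own Lambda invocation
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='your-bucket', Prefix='logs/subfolder/'):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('.gz'):
            continue
        lam.invoke(
            FunctionName='grep-gz-worker',   # hypothetical name for the worker above
            InvocationType='Event',          # asynchronous, so invocations run in parallel
            Payload=json.dumps({'bucket': 'your-bucket', 'key': obj['Key'], 'pattern': 'String'})
        )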
Now in 2023:
Either use "Mountpoint for Amazon S3" and then grep (see the sketch after this list); OR
Use cloudgrep with:
cloudgrep --bucket your_bucket --query your_query
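For the Mountpoint route, a minimal sketch (the bucket name, mount point, and subfolder path are placeholders) looks like this:

mkdir -p /mnt/logs
mount-s3 your_bucket /mnt/logs
find /mnt/logs/path/to/subfolder -name '*.gz' -print0 | xargs -0 zgrep "search words"

zgrep still has to download each file's bytes to decompress them; the mount just removes the explicit copy step.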