8

How can I search for a string inside a lot of .gz files in an Amazon S3 bucket subfolder? I tried mounting the bucket via s3fs and running zgrep, but it's very slow. Do you use any other methods?

Is there maybe an Amazon service I could use to zgrep them quickly?

4 Answers

8

I find the quickest way is to copy them locally first and then run a local zgrep:

aws s3 cp s3://bucket/containing/the/logs . --recursive

This will copy (cp) all the logs to your current directory (.) and include all subfolders too (--recursive).
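
If you only need a subset of the logs, you can limit what gets copied with exclude/include filters (a sketch; the date pattern is a placeholder for whatever your log file names look like):

aws s3 cp s3://bucket/containing/the/logs . --recursive --exclude "*" --include "2016-09-*.gz"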

Then a local zgrep:

zgrep "search words" *.gz

Or to search subdirectories recursively too:

find . -name '*.gz' -print0 | xargs -0 zgrep "STRING"

(Taken from unix.stackexchange.com.)
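
If there are many files and you have multiple CPU cores, a parallel variant may speed this up (a sketch using GNU xargs; -H makes zgrep print the matching file name, and -P should be tuned to your core count):

find . -name '*.gz' -print0 | xargs -0 -n 10 -P 4 zgrep -H "STRING"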

geedoubleya
  • Unfortunately, as I mentioned in my question, it's too slow (a lot of files to check). I'm looking for a faster solution. – Michal_Szulc Sep 26 '16 at 21:09
  • There is a big difference between searching locally on a laptop with an SSD and on a network-mounted partition. When I get a minute I will do a comparison. @Michal_Szulc – geedoubleya Sep 28 '16 at 09:26
  • I also tried copying and zgrep-ing all resources to a local machine with an SSD (using your script), but this solution requires a lot of disk space. That's why I'm looking for some kind of remote solution (using the API or any other AWS service) that is either fast or does not require a lot of disk space. – Michal_Szulc Sep 28 '16 at 09:52
  • Thanks, this was really quick and helpful. I just needed to find some text in one of my files in S3 (not gzipped), and I followed these steps with grep and it worked really quickly. – mrgoos Nov 12 '20 at 07:16
6

It's not grep, but you can now query logs with Athena:

First, create a table from your S3 bucket:

CREATE EXTERNAL TABLE IF NOT EXISTS S3Accesslogs(
  BucketOwner string,
  Bucket string,
  RequestDateTime string,
  RemoteIP string,
  Requester string,
  RequestID string,
  Operation string,
  Key string,
  RequestURI_operation string,
  RequestURI_key string,
  RequestURI_httpProtoversion string,
  HTTPstatus string,
  ErrorCode string,
  BytesSent string,
  ObjectSize string,
  TotalTime string,
  TurnAroundTime string,
  Referrer string,
  UserAgent string,
  VersionId string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$'
) 
LOCATION 's3://s3-server-access/logs/'

Then, you can query it with SQL:

SELECT requestdatetime, bucket, remoteip, requesturi_key
FROM s3accesslogs
WHERE bucket IN ('bucket1', 'bucket2')
    AND remoteip = '123.45.67.89'
ORDER BY requestdatetime DESC
LIMIT 100;
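
If you'd rather submit the query from the command line than the Athena console, the AWS CLI can run it too (a sketch; the results bucket and database name are placeholders):

aws athena start-query-execution \
    --query-string "SELECT requestdatetime, bucket, remoteip, requesturi_key FROM s3accesslogs WHERE remoteip = '123.45.67.89' LIMIT 100" \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://your-athena-results-bucket/
aws athena get-query-results --query-execution-id <execution id returned by the previous call>
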
Curtis Mattoon
  • @Mattoon I'm afraid that my previous comment (https://unix.stackexchange.com/questions/312436/search-inside-s3-bucket-with-logs#comment549965_312439) also applies to your solution. Still searching for a solution! – Michal_Szulc Oct 18 '17 at 20:13
  • Yeah, I'm hitting some painfully slow queries as well. They don't allow indexing either. This might be a use-case for Redshift or something, but I'm not sure yet. – Curtis Mattoon Oct 19 '17 at 19:16
  • See the updated syntax in the AWS docs: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html – Per Christian Henden Aug 29 '23 at 16:31
1

I had the same problem. I tried writing a Python script to download each file and run it through zgrep, but it took about 30 seconds just to run one grep command. The files were also around 200 MB each, so the overall time was very high, and I had to do it for hundreds of files.

In my case, I wrote a Lambda function and ran it in parallel for different .gz files. Use the /tmp storage to download the .gz file in each Lambda instance and run zgrep:

import os
import boto3

s3 = boto3.client('s3')

# work in Lambda's writable /tmp storage
os.chdir('/tmp/')
os.mkdir('s3_logs')
os.chdir('s3_logs')

# Bucket and s3object come from the function's event payload
s3.download_file(Bucket, s3object, '/tmp/s3_logs/file.gz')

query = 'zgrep String file.gz'
result = os.popen(query).read()
print(result)

Running multiple Lambda instances in parallel can make it at least several times faster. Remember, though, that Lambda's /tmp storage is only 500 MB.
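
One way to fan the work out is to trigger one asynchronous invocation per .gz key from a shell loop (AWS CLI v2 syntax; a sketch in which the function name, bucket, and prefix are placeholders, and the function is assumed to read the bucket/key from its event):

# list the .gz keys and invoke the function asynchronously for each one
for key in $(aws s3 ls s3://your-bucket/logs/ --recursive | awk '{print $4}' | grep '\.gz$'); do
    aws lambda invoke --function-name zgrep-worker \
        --invocation-type Event \
        --cli-binary-format raw-in-base64-out \
        --payload "{\"bucket\": \"your-bucket\", \"key\": \"$key\"}" \
        /dev/null
done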

0

Now in 2023:

  • Either use "Mountpoint for Amazon S3" and then grep (see the sketch after this list); OR

  • Use cloudgrep with:

    cloudgrep --bucket your_bucket --query your_query
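
For the Mountpoint route, the steps look roughly like this (a sketch; the bucket name, mount point, and path are placeholders):

mkdir -p /mnt/s3
mount-s3 your_bucket /mnt/s3
zgrep "search words" /mnt/s3/path/to/logs/*.gz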

chris