Batch checking contents of an S3 bucket with AWS Lambda

A while ago, I faced the challenge of checking the contents of a few thousand folders in an S3 bucket to see whether they meet certain criteria, e.g. the number and type of files per folder.
The best way to achieve this proved to be a node.js script run on Lambda with an API Gateway in front, for a number of reasons:

  • The aws-sdk is preinstalled for every Lambda function and can be required without any installation
  • Granting Lambda read access to a certain bucket is quite easy
  • Lambdas can have a number of triggers, e.g. any AWS event, HTTP requests through API Gateway or a cron-like schedule
  • API Gateway makes it easy to provide a simple REST interface for your endpoint or even offer a user-friendly UI (a minimalistic Angular 1 app hosted in S3 in my case)
  • Most things can be handled for you by the Serverless framework, without the need to fiddle around with the AWS console

The Script

One thing worth mentioning is that AWS limits the number of objects per listObjectsV2 call to 1000. If your bucket contains more elements (in my case up to 25,000), the API response will contain a field called NextContinuationToken, which allows you to fire another request that continues where the previous one got capped.
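For reference, a truncated listObjectsV2 response looks roughly like this (abridged, the values are made up):

  {
    IsTruncated: true,
    KeyCount: 1000,
    MaxKeys: 1000,
    Name: 'my-bucket',
    Prefix: 'your/folder/within/the/bucket',
    Contents: [
      { Key: 'your/folder/within/the/bucket/0001/image.jpg', Size: 52344 /* ... */ },
      // ... up to 1000 entries per call
    ],
    NextContinuationToken: '1dKX...' // pass this as ContinuationToken in the next request
  }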

We use these tokens to call getObjects recursively until the list is exhausted and invoke handleObjectList on the elements of each response. An object outside the function scope can be used to collect data from each call and keep it for the onFinished function to calculate the final result.

In this example we also provide parameters to ignore certain elements or to set the bucket name dynamically when invoking the execution via REST call. This is of course optional, but proved to be quite useful for my use case.
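A request body like the following would, for example, check my-example-bucket and skip the folders 0001 and 0002 (the field names match what the script below reads, the values are just placeholders):

  {
    "bucket": "my-example-bucket",
    "ignore": ["0001", "0002"]
  }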

Another thing that I found quite useful was to publish the results to an SNS topic for further processing - this is optional as well, but I nonetheless left the code in the snippet.

The actual script is not really rocket science and looks as follows.

bucket_checker.js
'use strict';

module.exports.checkBucket = (event, context, callback) => {
  const AWS = require('aws-sdk');
  const s3 = new AWS.S3();
  const sns = new AWS.SNS();

  const requestBody = JSON.parse(event.body);

  // [OPTIONAL] can be used to store results of the iterations
  const resultJSON = {};

  // [OPTIONAL] request parameters to control the script, e.g. "ignore": ["1", "2", "3"] to skip elements 1, 2 and 3
  const scriptParams = {
    ignore: requestBody.ignore || []
  };

  console.log(scriptParams);

  const s3Params = {
    Bucket: requestBody.bucket,
    Prefix: 'your/folder/within/the/bucket' // limits the results to only a certain folder
  };

  let onFinished = () => {
    const result = {
      checkedBucket: s3Params.Bucket,
      // ..
    };

    // SNS is optional, but proved quite useful. If not needed, just call callback(null, response) directly
    sns.publish({
      TopicArn: "arn:aws:sns:<topic>",
      Message: JSON.stringify(result)
    }, (err, data) => {
      console.log(err ? 'error publishing to SNS' : 'Message published to SNS');

      // Required if you want to use an AWS API Gateway in front
      const response = {
        "statusCode": 200,
        "headers": {
          "Access-Control-Allow-Origin": "*", // Required for CORS support to work
          "Access-Control-Allow-Credentials": true // Required for cookies, authorization headers with HTTPS
        },
        "body": JSON.stringify(result)
      };

      callback(null, response);
    });
  };

  let handleObjectList = (data) => {
    // Keys of the form "a/b/c.jpg"
    const keys = data.Contents.map(c => c.Key);

    keys.forEach(key => {
      // do something with the filename, e.g. aggregate the data in resultJSON
    });
  };

  const getObjects = (token) => {
    if (token) {
      s3Params.ContinuationToken = token;
    }

    s3.listObjectsV2(s3Params, (err, objectsResponse) => {
      if (err) {
        console.log(err, err.stack); // an error occurred
      } else {
        handleObjectList(objectsResponse);

        if (objectsResponse.NextContinuationToken) {
          console.log('Continuing Request with Token ', objectsResponse.NextContinuationToken);
          // Recursive call with the ContinuationToken of the previous request
          getObjects(objectsResponse.NextContinuationToken);
        } else {
          onFinished();
        }
      }
    });
  };

  getObjects();
};
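The handleObjectList stub above is where the actual criteria go. As a rough, purely illustrative sketch of how the "number and type of files per folder" check could be aggregated in resultJSON (the concrete criteria depend on your use case):

  let handleObjectList = (data) => {
    data.Contents.map(c => c.Key).forEach(key => {
      // "a/b/c.jpg" -> folder "a/b", extension "jpg"
      const folder = key.substring(0, key.lastIndexOf('/'));
      const extension = key.split('.').pop().toLowerCase();

      resultJSON[folder] = resultJSON[folder] || { total: 0, byType: {} };
      resultJSON[folder].total++;
      resultJSON[folder].byType[extension] = (resultJSON[folder].byType[extension] || 0) + 1;
    });
  };

onFinished can then iterate over resultJSON, compare each folder against the expected counts and build the final result from that.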

Deploy

The easiest way to put this to work is to let the Serverless framework set everything up. In this example, it will

  • Zip, upload and deploy your code as a Lambda function
  • Create an API Gateway as an entry point
  • Wire everything together and set the correct permissions and roles
serverless.yml
service: bucket-checker

provider:
  name: aws
  runtime: nodejs6.10
  region: eu-central-1
  stage: stage
  profile: bucket-check
  memorySize: 1536
  timeout: 180
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:Get*"
        - "s3:List*"
      Resource: "*"
    - Effect: "Allow"
      Action:
        - "sns:*"
      Resource: "*"

functions:
  checkBucket:
    handler: bucket_checker.checkBucket
    name: ${self:provider.stage}-checkBucket
    description: Checks the contents of a given S3 Bucket
    events:
      - http:
          path: bucket/validate
          method: post
          cors: true
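With both files in place, deploying and calling the endpoint boils down to the following (the endpoint URL is a placeholder, the real one is printed by the deploy command):

  # requires the Serverless CLI: npm install -g serverless
  serverless deploy

  # invoke the function through the generated API Gateway endpoint
  curl -X POST https://<api-id>.execute-api.eu-central-1.amazonaws.com/stage/bucket/validate \
    -H 'Content-Type: application/json' \
    -d '{"bucket": "my-example-bucket", "ignore": []}'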

Regarding runtime

The script's runtime depends on a number of factors, mostly the memory size of the Lambda function, the number of items in the bucket and the type of processing done on these items.

Using the maximum available memory size, a simple analysis of 25,000 items takes ~30s for my use case.

If you find the runtime to be slow, make sure that you are allocating an appropriate amount of memory. Also make sure to set the Prefix parameter in the S3 config properly, as it can greatly reduce the number of items that have to be checked.