This is an annotated version of the CloudFormation template that sets up the serverless data lake.
AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: A serverless datalake workshop.
Resources:
This is the S3 bucket that will contain the data lake.
IngestionBucket:
Type: AWS::S3::Bucket
This is the CloudWatch log group that will contain the log data generated by the log file generator.
ApacheLogs:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub /${AWS::StackName}/apache
RetentionInDays: 1
This is the Kinesis Data Firehose delivery stream that receives the logs from CloudWatch Logs.
CloudWatch Logs publishes the logs in a compressed JSON format, so a Lambda function extracts the raw log data, and that is what is written to S3. You could use the compressed log data in AWS Glue, but the compression makes the files harder to read, and the JSON envelope adds hierarchy to the data in the data lake and obscures it. To keep the workshop simple, this data is uncompressed and written as a CSV.
ApacheLogsKinesis:
Type: AWS::KinesisFirehose::DeliveryStream
DependsOn: GenerateSampleDataFunction
Properties:
DeliveryStreamType: DirectPut
ExtendedS3DestinationConfiguration:
RoleARN: !GetAtt ApacheLogsServiceRole.Arn
BucketARN: !GetAtt IngestionBucket.Arn
BufferingHints:
IntervalInSeconds: 60
SizeInMBs: 3
CloudWatchLoggingOptions:
Enabled: False
CompressionFormat: UNCOMPRESSED
Prefix: weblogs/live/
ProcessingConfiguration:
Enabled: true
Processors:
- Type: Lambda
Parameters:
- ParameterName: LambdaArn
ParameterValue: !Sub ${TransformKinesis.Arn}
- ParameterName: BufferSizeInMBs
ParameterValue: 3
- ParameterName: BufferIntervalInSeconds
ParameterValue: 60
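To make the transformation concrete, the sketch below round-trips a payload shaped like what CloudWatch Logs delivers to Firehose: a gzip-compressed JSON envelope, base64-encoded in the Firehose record. The log group name and log message here are made up for illustration; they are not from the workshop code.

```python
import base64
import gzip
import json

# An envelope shaped like a CloudWatch Logs subscription delivery
# (illustrative values, not real workshop data).
envelope = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/my-stack/apache",
    "logEvents": [
        {"id": "1", "timestamp": 1, "message": "GET /index.html 200"},
    ],
}
# CloudWatch Logs gzips the JSON; Firehose base64-encodes the record data.
record_data = base64.b64encode(gzip.compress(json.dumps(envelope).encode()))

# The transform Lambda reverses both steps to recover the raw log lines.
payload = json.loads(gzip.decompress(base64.b64decode(record_data)))
messages = [e["message"] for e in payload["logEvents"]]
print(messages)
```

This is why the raw stream is awkward to query directly: the log lines are buried two layers deep before the Lambda unwraps them.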
This is the subscription filter that publishes the logs from CloudWatch Logs to the Kinesis Data Firehose delivery stream.
CloudWatchLogsToKinesis:
Type: AWS::Logs::SubscriptionFilter
Properties:
DestinationArn: !Sub ${ApacheLogsKinesis.Arn}
FilterPattern: ""
LogGroupName: !Sub ${ApacheLogs}
RoleArn: !Sub ${LogsToKinesisServiceRole.Arn}
This is the IAM role that CloudWatch Logs assumes in order to publish the data to Kinesis Data Firehose. It needs authorization to write to the delivery stream.
LogsToKinesisServiceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub ${AWS::StackName}_logs_kinesis_role
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: !Sub logs.${AWS::Region}.amazonaws.com
Action: sts:AssumeRole
This is the IAM policy that grants CloudWatch Logs authorization to write to the Kinesis Data Firehose delivery stream.
LogsToKinesisRolePolicy:
Type: AWS::IAM::ManagedPolicy
Properties:
ManagedPolicyName: !Sub ${AWS::StackName}_logs_kinesis_policy
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- 'firehose:*'
Resource:
- !Sub '${ApacheLogsKinesis.Arn}'
- Effect: Allow
Action:
- 'iam:PassRole'
Resource:
- !Sub '${LogsToKinesisServiceRole.Arn}'
Roles:
- !Ref 'LogsToKinesisServiceRole'
This is the IAM role that Kinesis Data Firehose assumes to call the transformation Lambda function and write the result to S3.
ApacheLogsServiceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub ${AWS::StackName}_weblog_delivery_role
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: firehose.amazonaws.com
Action: sts:AssumeRole
This is the IAM policy that grants Kinesis Data Firehose access to call the transformation Lambda function and write the result to S3.
ApacheLogsRolePolicy:
Type: AWS::IAM::ManagedPolicy
Properties:
ManagedPolicyName: !Sub ${AWS::StackName}_weblog_delivery_policy
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- 's3:*'
Resource:
- !Sub '${IngestionBucket.Arn}/*'
- !Sub '${IngestionBucket.Arn}'
- Effect: Allow
Action:
- 'lambda:InvokeFunction'
- 'lambda:InvokeAsync'
Resource:
- !Sub '${TransformKinesis.Arn}'
Roles:
- !Ref 'ApacheLogsServiceRole'
This is the Lambda function that transforms the CloudWatch Logs format into a simple CSV. It uncompresses the log payload and strips away the JSON envelope that carries the CloudWatch Logs metadata.
TransformKinesis:
Type: 'AWS::Serverless::Function'
Properties:
Handler: transformKinesis.handler
Runtime: python2.7
Description: ''
MemorySize: 512
Timeout: 60
CodeUri: ./src
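The actual `transformKinesis.py` source is not shown here, but a minimal Firehose transformation handler for this kind of record could look like the sketch below. It follows the Firehose data-transformation contract (each output record needs `recordId`, `result`, and base64 `data`); the handling details are an assumption, not the workshop's exact code.

```python
import base64
import gzip
import json

def handler(event, context):
    """Hypothetical sketch of a Firehose transform that unwraps the
    CloudWatch Logs envelope and emits plain log lines."""
    output = []
    for record in event["records"]:
        payload = json.loads(
            gzip.decompress(base64.b64decode(record["data"]))
        )
        # CloudWatch Logs sends CONTROL_MESSAGE records as health
        # checks; they carry no log data, so drop them.
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"],
                           "result": "Dropped"})
            continue
        # Keep only the raw log lines, one per row, for easy querying.
        lines = "".join(e["message"] + "\n" for e in payload["logEvents"])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode()).decode(),
        })
    return {"records": output}
```

Firehose buffers the transformed output (per the BufferingHints above) and writes it under the `weblogs/live/` prefix.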
This is the Lambda function that randomly generates website log data for one minute. It is scheduled to run every minute.
GenerateSampleDataFunction:
Type: 'AWS::Serverless::Function'
Properties:
Handler: writelogs.lambda_handler
Runtime: python2.7
Description: ''
MemorySize: 512
Timeout: 60
CodeUri: ./src
Events:
Schedule:
Type: Schedule
Properties:
Schedule: rate(1 minute)
Environment:
Variables:
LOG_GROUP_NAME: !Sub /${AWS::StackName}/apache
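As an illustration of what "randomly generated website data" might mean here, the sketch below fabricates one Apache common-log-format line. The field choices are assumptions; the workshop's `writelogs.py` may generate different fields, and the real function would push these lines to the `LOG_GROUP_NAME` log group via the CloudWatch Logs API (for example, boto3's `put_log_events`).

```python
import random
from datetime import datetime, timezone

def fake_apache_line(now=None):
    """Illustrative generator of one Apache common-log-format line
    (hypothetical; not the workshop's actual writelogs code)."""
    now = now or datetime.now(timezone.utc)
    ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    path = random.choice(["/index.html", "/cart", "/checkout"])
    # Weight toward 200s so errors are the minority, as in real traffic.
    status = random.choice([200, 200, 200, 404, 500])
    size = random.randint(200, 5000)
    ts = now.strftime("%d/%b/%Y:%H:%M:%S +0000")
    return f'{ip} - - [{ts}] "GET {path} HTTP/1.1" {status} {size}'
```

Because the generator runs every minute and the subscription filter pattern is empty, every generated line flows through to the delivery stream.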
This is the Lambda function that backs the LoadSampleData custom resource. It copies the sample data from a public S3 bucket into the IngestionBucket. On delete, it removes all the data from the bucket so the bucket itself can be deleted.
LoadSampleDataFunction:
Type: 'AWS::Serverless::Function'
Properties:
Handler: load-data-files.lambda_handler
Runtime: python2.7
Description: ''
MemorySize: 512
Timeout: 240
Policies:
- S3CrudPolicy:
BucketName: !Ref IngestionBucket
CodeUri: ./src
Environment:
Variables:
BUCKET_NAME: !Ref IngestionBucket
This is a custom CloudFormation resource that pre-populates the sample data and cleans up the IngestionBucket when the stack is deleted.
It also uploads the lab instructions to the bucket, substituting the actual bucket name into the instructions to reduce the number of copy-and-paste errors.
LoadSampleData:
Type: Custom::LoadSampleData
DependsOn:
- IngestionBucket
Properties:
ServiceToken: !GetAtt LoadSampleDataFunction.Arn
StackName: !Ref AWS::StackName
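Custom resources work by having the Lambda POST a result back to a pre-signed `ResponseURL` that CloudFormation includes in the request event; the stack hangs until that response arrives. The sketch below builds such a response body. The field names are fixed by the custom-resource protocol, but the handler structure is an assumption, not the workshop's `load-data-files.py`.

```python
import json

def make_response(event, status, physical_id="sample-data"):
    """Build the JSON body a custom resource must POST back to
    CloudFormation's ResponseURL (hypothetical helper; field names
    follow the custom-resource protocol)."""
    return json.dumps({
        "Status": status,                    # "SUCCESS" or "FAILED"
        "PhysicalResourceId": physical_id,   # keep stable across updates
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    })
```

The handler would branch on `event["RequestType"]`: copy the sample files on Create/Update, and empty the bucket on Delete before responding, since CloudFormation cannot delete a non-empty S3 bucket.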
These outputs are displayed in the AWS console after the stack has been created. The WorkshopInstructionsUrl is a link to the customized instructions for this workshop.
Outputs:
WorkshopInstructionsUrl:
Description: Follow the link for the instructions for the serverless datalake workshop.
Value: !Sub https://s3.${AWS::Region}.amazonaws.com/${IngestionBucket}/instructions/instructions.html