This repository has been archived by the owner on Jul 16, 2024. It is now read-only.


serverlessdatalake2018.md


Introduction

This is an annotated version of the CloudFormation template that creates the serverless data lake workshop environment.

Standard Header

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: A serverless datalake workshop.
Resources:

IngestionBucket

This is the S3 bucket that will hold the data lake's data.

  IngestionBucket:
    Type: AWS::S3::Bucket

ApacheLogs

This is the CloudWatch log group that will contain the log data generated by the log file generator.

  ApacheLogs:
    Type: AWS::Logs::LogGroup
    Properties: 
      LogGroupName: !Sub /${AWS::StackName}/apache
      RetentionInDays: 1

ApacheLogsKinesis

This is the Kinesis Data Firehose delivery stream that receives the logs from CloudWatch Logs and writes them to S3.

CloudWatch Logs publishes the logs in a compressed JSON format, so a Lambda function extracts the raw log data, and that extracted data is what is written to S3. You could use the compressed log data in AWS Glue, but the compression makes the files harder to read, and the JSON envelope adds hierarchy to the data in the data lake and obscures it. To simplify the workshop, this data is uncompressed and written as a CSV.

  ApacheLogsKinesis:
    Type: AWS::KinesisFirehose::DeliveryStream
    DependsOn: GenerateSampleDataFunction
    Properties: 
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration:
        RoleARN: !GetAtt ApacheLogsServiceRole.Arn
        BucketARN: !GetAtt IngestionBucket.Arn
        BufferingHints:
          IntervalInSeconds: 60
          SizeInMBs: 3
        CloudWatchLoggingOptions:
          Enabled: False
        CompressionFormat: UNCOMPRESSED
        Prefix: weblogs/live/
        ProcessingConfiguration:
          Enabled: true
          Processors: 
          - Type: Lambda
            Parameters:
            - ParameterName: LambdaArn
              ParameterValue: !Sub ${TransformKinesis.Arn}
            - ParameterName: BufferSizeInMBs
              ParameterValue: 3
            - ParameterName: BufferIntervalInSeconds
              ParameterValue: 60

CloudWatchLogsToKinesis

This is the subscription filter that publishes the logs from CloudWatch Logs to the Kinesis Firehose delivery stream. The empty FilterPattern matches every log event in the group.

  CloudWatchLogsToKinesis:
    Type: AWS::Logs::SubscriptionFilter
    Properties: 
      DestinationArn: !Sub ${ApacheLogsKinesis.Arn}
      FilterPattern: ""
      LogGroupName: !Sub ${ApacheLogs}
      RoleArn: !Sub ${LogsToKinesisServiceRole.Arn}

LogsToKinesisServiceRole

This is the IAM role that CloudWatch Logs assumes in order to publish the data to Kinesis Firehose. It needs permission to write to the delivery stream.

  LogsToKinesisServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub ${AWS::StackName}_logs_kinesis_role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub logs.${AWS::Region}.amazonaws.com
            Action: sts:AssumeRole

LogsToKinesisRolePolicy

This is the IAM policy that grants CloudWatch Logs permission to write to the Kinesis Firehose delivery stream.

  LogsToKinesisRolePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${AWS::StackName}_logs_kineis_policy
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - 'firehose:*'
            Resource:
              - !Sub '${ApacheLogsKinesis.Arn}'
          - Effect: Allow
            Action:
              - 'iam:PassRole'
            Resource:
              - !Sub '${LogsToKinesisServiceRole.Arn}'
      Roles:
        - !Ref 'LogsToKinesisServiceRole'

ApacheLogsServiceRole

This is the IAM role that Kinesis Firehose assumes to call the transformation Lambda function and write the result to S3.

  ApacheLogsServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub ${AWS::StackName}_weblog_delivery_role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: firehose.amazonaws.com
            Action: sts:AssumeRole

ApacheLogsRolePolicy

This is the IAM policy that grants Kinesis Firehose access to call the transformation Lambda function and write the result to S3.

  ApacheLogsRolePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${AWS::StackName}_weblog_delivery_policy
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - 's3:*'
            Resource:
              - !Sub '${IngestionBucket.Arn}/*'
              - !Sub '${IngestionBucket.Arn}'
          - Effect: Allow
            Action: 
                - 'lambda:InvokeFunction'
                - 'lambda:InvokeAsync'
            Resource:
              - !Sub '${TransformKinesis.Arn}'
      Roles:
        - !Ref 'ApacheLogsServiceRole'

TransformKinesis

This is the Lambda function that transforms the CloudWatch Logs record format into a simple CSV. It uncompresses the log payload and strips away the JSON envelope that carries the CloudWatch Logs metadata.

  TransformKinesis:
    Type: 'AWS::Serverless::Function'
    Properties:
      Handler: transformKinesis.handler
      Runtime: python2.7
      Description: ''
      MemorySize: 512
      Timeout: 60
      CodeUri: ./src
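The handler itself lives in `./src` and is not shown here. As an illustration only, a minimal Python 3 sketch of this kind of CloudWatch Logs-to-Firehose transform looks like the following (the real `transformKinesis.handler` runs on python2.7 and may differ in detail):

```python
import base64
import gzip
import json


def handler(event, context):
    """Sketch of a CloudWatch Logs -> Firehose transform.

    Each incoming record's data is base64-encoded, gzip-compressed JSON
    from the CloudWatch Logs subscription. This strips the JSON envelope
    and re-emits the raw log lines, newline-separated.
    """
    output = []
    for record in event['records']:
        payload = json.loads(gzip.decompress(base64.b64decode(record['data'])))
        if payload.get('messageType') != 'DATA_MESSAGE':
            # Control messages (e.g. the initial subscription test) are dropped.
            output.append({'recordId': record['recordId'], 'result': 'Dropped'})
            continue
        lines = ''.join(e['message'] + '\n' for e in payload['logEvents'])
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(lines.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}
```

Firehose buffers the transformed records (per the BufferingHints above) before writing the batch to the `weblogs/live/` prefix in S3.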

GenerateSampleDataFunction

This is the Lambda function that randomly generates sample website log data for one minute. It is scheduled to run every minute.

  GenerateSampleDataFunction:
    Type: 'AWS::Serverless::Function'
    Properties:
      Handler: writelogs.lambda_handler
      Runtime: python2.7
      Description: ''
      MemorySize: 512
      Timeout: 60
      CodeUri: ./src
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Schedule: rate(1 minute)
      Environment:
        Variables:
          LOG_GROUP_NAME: !Sub /${AWS::StackName}/apache
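The generator's source is in `./src/writelogs.py` and is not reproduced here. As a hypothetical sketch of the kind of line it produces, a random Apache-style log entry can be built like this (the real function would then loop for about 60 seconds, pushing batches to the log group named in `LOG_GROUP_NAME` via the CloudWatch Logs API):

```python
import random
import time

# Hypothetical request paths for the fake website traffic.
PATHS = ['/index.html', '/products', '/cart', '/checkout']


def make_log_line(now=None):
    """Build one Apache-style access log line with random values."""
    now = now if now is not None else time.time()
    ip = '.'.join(str(random.randint(1, 254)) for _ in range(4))
    ts = time.strftime('%d/%b/%Y:%H:%M:%S +0000', time.gmtime(now))
    path = random.choice(PATHS)
    status = random.choice([200, 200, 200, 301, 404, 500])
    size = random.randint(200, 5000)
    return '%s - - [%s] "GET %s HTTP/1.1" %d %d' % (ip, ts, path, status, size)
```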

LoadSampleDataFunction

This is the Lambda function that backs the LoadSampleData custom resource. It copies the sample data from a public S3 bucket into the IngestionBucket. On delete, it removes all the data in the bucket so the bucket itself can be deleted.

  LoadSampleDataFunction:
    Type: 'AWS::Serverless::Function'
    Properties:
      Handler: load-data-files.lambda_handler
      Runtime: python2.7
      Description: ''
      MemorySize: 512
      Timeout: 240
      Policies:
      - S3CrudPolicy:
          BucketName: !Ref IngestionBucket
      CodeUri: ./src
      Environment:
        Variables:
          BUCKET_NAME: !Ref IngestionBucket

LoadSampleData

This is a custom CloudFormation resource that pre-populates the sample data and cleans up the IngestionBucket when the stack is deleted.

It also uploads the lab instructions to the bucket, substituting the actual bucket name into the instructions to reduce copy-and-paste errors.

  LoadSampleData:
    Type: Custom::LoadSampleData
    DependsOn:
      - IngestionBucket
    Properties: 
      ServiceToken: !GetAtt LoadSampleDataFunction.Arn
      StackName: !Ref AWS::StackName

Outputs

These outputs are displayed in the AWS console after the stack has been created. The WorkshopInstructionsUrl is a link to the customized instructions for this workshop.

Outputs:
  WorkshopInstructionsUrl:
    Description: Follow the link for the instructions for the serverless datalake workshop.
    Value: !Sub https://s3.${AWS::Region}.amazonaws.com/${IngestionBucket}/instructions/instructions.html