Log Analysis at Scale
Posted 14-Apr-2017
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Logging at Scale
Alex Smith - @alexjs
Solutions Architect
April 2016
Stealing Content…
‘Your First 10m Users’
ARC301 – re:Invent 2015
http://bitly.com/2015arc301
- Joel Williams
AWS Solutions Architect
>1 User
• Amazon Route 53 for DNS
• A single Elastic IP
• A single Amazon EC2 instance
• With full stack on this host
• Web app
• Database
• Management
• And so on…
[Diagram: User → Amazon Route 53 → Elastic IP → Amazon EC2 instance. From ARC301.]
>1 User
• A single place to read logs from
[Diagram: User → Amazon Route 53 → Elastic IP → Amazon EC2 instance. From ARC301.]
@alexjs hacks – top URLs
# awk -F\" '{print $2}' access_log \
  | awk '{print $2}' \
  | sort | uniq -c | sort -rn
11208 /
3287 /2016/04/23/welcome
@alexjs hacks – HTTP response codes
# awk '{print $9}' access_log \
| sort | uniq -c | sort -rn
19307 200
1239 404
120 503
1 416
@alexjs hacks - top User-Agents
# awk -F\" '{print $6}' access_log | sort | uniq -c | sort -rn
3774 Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; Microsoft; Lumia 640 XL)
2949 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
2928 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
2900 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.5.2171.95 Safari/537.36
@alexjs hacks – requests per second (realtime)
# tail -F access_log \
  | perl -e 'while (<>) {$l++; if (time > $e) {$e=time; print "$l\n"; $l=0}}'
1
1
68
99
912
424
http://bitly.com/bashlps
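In the same spirit as the hacks above, the share of non-2xx responses can be pulled from the same log. This assumes the combined log format used earlier (status code in field 9); the sample lines are fabricated for illustration:

```shell
# Fabricated access_log lines in combined format, for a self-contained demo.
cat > /tmp/access_log.sample <<'EOF'
1.2.3.4 - - [23/Apr/2016:10:00:01 +0000] "GET / HTTP/1.1" 200 1234 "-" "curl/7.43"
1.2.3.4 - - [23/Apr/2016:10:00:02 +0000] "GET /missing HTTP/1.1" 404 512 "-" "curl/7.43"
1.2.3.4 - - [23/Apr/2016:10:00:03 +0000] "GET / HTTP/1.1" 200 1234 "-" "curl/7.43"
EOF

# Field 9 is the HTTP status code; count everything outside the 2xx range.
awk '$9 !~ /^2/ {err++} END {printf "%d errors out of %d requests\n", err, NR}' \
  /tmp/access_log.sample
# 1 errors out of 3 requests
```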
Users >1000
[Diagram: User → Amazon Route 53 → ELB load balancer → web instances across two Availability Zones → RDS DB instance, active/standby (Multi-AZ).]
Users >1 million+
[Diagram: User → Amazon Route 53 → Amazon CloudFront/Amazon S3 → ELB load balancer → web instances → RDS DB instance, active (Multi-AZ), plus read replicas; also DynamoDB, Amazon SQS, ElastiCache, worker instances, internal app instances, Amazon SES, Lambda, and Amazon CloudWatch. From ARC301.]
Why start with SQL?
• Established and well-worn technology.
• Lots of existing code, communities, books, and tools.
• You aren’t going to break SQL DBs in your first 10 million
users. No, really, you won’t.*
• Clear patterns to scalability (especially in analytics)
*Unless you are doing something SUPER peculiar with the data or you have MASSIVE amounts of it.
…but even then SQL will have a place in your stack.
Why might you need NoSQL?
• Super low-latency applications
• Metadata-driven datasets
• Highly nonrelational data
• Need schema-less data constructs*
• Massive amounts of data (again, in the TB range)
• Rapid ingest of data (thousands of records/sec)
*Need != "it's easier to do dev without schemas"
Log Dispatcher Architecture Revisited
[Diagram: App servers → Kinesis Firehose → Amazon S3 (JSON) and an Elasticsearch log index → visualisation.]
• Simple Storage Service
• Canonical logging target for ELB, CloudFront, etc.
• Virtually unlimited amounts of storage
• Support for Lambda operations
• Very fast – ideal for feeding other services (Redshift, EMR/Hadoop)
• Data can be automatically pushed here from Amazon Kinesis Firehose
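The "canonical logging target" bullet can be made concrete: ELB writes access logs into an S3 bucket once logging is enabled on the load balancer, and analysis starts by syncing them down with the AWS CLI. The bucket, prefix, and account ID below are hypothetical:

```shell
# Hypothetical bucket/prefix; ELB delivers logs here automatically once
# access logging is enabled. Needs credentials, so shown as a comment:
# aws s3 sync s3://my-log-bucket/AWSLogs/123456789012/elasticloadbalancing/ ./elb-logs/

# Once local, the same awk hacks from earlier apply. Simulated with a
# fabricated ELB-style log line (space-separated; client ip:port is field 3):
echo '2016-04-23T10:00:01.000Z my-elb 10.0.0.1:54321 10.0.1.5:80 0.00004 0.001 0.00003 200 200 0 1234 "GET http://example.com/ HTTP/1.1"' \
  > /tmp/elb.log

# Top client IPs, stripping the port with cut:
awk '{print $3}' /tmp/elb.log | cut -d: -f1 | sort | uniq -c | sort -rn
```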
Three Problems of Persistence
• Somewhere to stage
• Somewhere to live (long tail)
• Somewhere to search
Redshift
• PostgreSQL-based MPP database
• Petabyte-scale data warehousing
• Choice of nodes:
  • Dense compute
  • Dense storage
• Already compatible with your existing BI tools
• Up to 128 nodes, ~2 PB per cluster
Three Problems of Persistence
• Somewhere to stage
• Somewhere to live
• Somewhere to search (streaming data)
Amazon Elasticsearch Service
• Elasticsearch
  • Popular, open source
  • Commonly used for log and clickstream analysis
• Managed solution
  • We prepackage Kibana
  • Integrated with IAM, Firehose, etc.
Elasticsearch Index Mapping
curl -XPUT 'https://search-loggingatscale-demo-[...].us-east-1.es.amazonaws.com/blog-apache-combined' -d '
{
  "mappings": {
    "blog-apache-combined": {
      "properties": {
        "datetime": {
          "type": "date",
          "format": "dd/MMM/yyyy:HH:mm:ss Z"
        },
        "agent": {
          "type": "string",
          "index": "not_analyzed"
        }, [...]
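With the mapping in place, parsed log lines go in as JSON documents; Elasticsearch's bulk endpoint takes newline-delimited action/document pairs. A minimal sketch — the endpoint is the placeholder from the slide, and the `request`/`response` fields (beyond the `datetime` and `agent` shown in the mapping) are illustrative:

```shell
# Build a two-line bulk payload: action metadata, then the document itself.
cat > /tmp/bulk.ndjson <<'EOF'
{"index": {"_index": "blog-apache-combined", "_type": "blog-apache-combined"}}
{"datetime": "23/Apr/2016:10:00:01 +0000", "agent": "curl/7.43", "request": "/", "response": 200}
EOF

# Hypothetical endpoint (placeholder, as in the slide); needs a live domain:
# curl -XPOST 'https://search-loggingatscale-demo-[...].us-east-1.es.amazonaws.com/_bulk' \
#      --data-binary @/tmp/bulk.ndjson

# Bulk payloads must be newline-delimited action/document pairs:
wc -l < /tmp/bulk.ndjson
```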
ElasticSearch Index Mappingcurl -XPUT 'https://search-loggingatscale-demo-[...].us-east-1.es.amazonaws.com/blog-apache-combined' -d '
{
"mappings": {
"blog-apache-combined": {
"properties": {
"datetime": {
"type": "date",
"format": "dd/MMM/yyyy:HH:mm:ss Z”
},
"agent": {
"type": "string",
"index": "not_analyzed”
}, [...]
Logging Architecture
[Diagram: App servers → log aggregator (Kafka/Kinesis/MQ) → log index/persist (Elasticsearch, etc.) → visualisation.]
Logging Architecture
[Diagram: App servers → log aggregator (Kafka/Kinesis/MQ) → Elasticsearch → visualisation.]
Amazon Kinesis
• Firstly, a massively scalable, low-cost way to send JSON objects to a 'stream' hosted by AWS
• Users can write applications (using the KCL) to take data from the stream and parse/evaluate it
• Apps can be written in Java, Lambda (Node, Python, Java), etc.
Amazon Kinesis: New Features (re:Invent 2015)
Kinesis Streams
• What was previously Kinesis
• Still very customisable, for innovative stream workloads
• Users still write an app to parse data from the stream
Kinesis Firehose
• Fully managed data ingest service
• Provision endpoint
• Send data to endpoint
• ???
• Data!
• Outputs to S3, Redshift, or the Elasticsearch Service (and can do two at once)
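The "send data to endpoint" step looks like this with the AWS CLI. The stream names are hypothetical, and the CLI expects record data base64-encoded:

```shell
# A log event as JSON; the CLI sends record data base64-encoded.
payload=$(printf '{"path": "/", "status": 200}' | base64)

# Hypothetical stream names; these calls need AWS credentials, so they are
# shown as comments rather than executed here:
# aws kinesis put-record --stream-name my-log-stream \
#     --partition-key "$(hostname)" --data "$payload"
# aws firehose put-record --delivery-stream-name my-firehose-stream \
#     --record "Data=$payload"

# Round-trip check: the payload decodes back to the original JSON.
echo "$payload" | base64 --decode
```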
Amazon Kinesis: New Features (Apr 2016)
Amazon Kinesis Agent
• Standalone Java application from AWS
• Collect and send logs to Kinesis Firehose
• Built-in:
• File rotation
• Failure retries
• Checkpoints
• Integrated with CloudWatch for alerting
Amazon Kinesis Agent
• Multiple input options
• SINGLELINE
• CSVTOJSON
• LOGTOJSON
• LOGTOJSON
• Hoorah!
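Those processing options are set in the agent's config file (`/etc/aws-kinesis/agent.json`). A minimal sketch — the file pattern and delivery stream name are hypothetical:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/httpd/access_log*",
      "deliveryStream": "my-firehose-stream",
      "dataProcessingOptions": [
        { "optionName": "LOGTOJSON", "logFormat": "COMBINEDAPACHELOG" }
      ]
    }
  ]
}
```

With `LOGTOJSON` and `COMBINEDAPACHELOG`, each Apache log line is converted to a JSON object before it is sent to Firehose, which is what makes the downstream S3/Elasticsearch steps above work without extra parsing.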
Kibana
• Pre-packaged with Amazon ElasticSearch Service
• Easy to manage with freeform data
• Dashboards!
Your existing BI tools
• As before – your data exists on S3 (JSON)
• S3 -> Redshift
• Commission a Redshift cluster with IAM roles
• Write a manifest of the files to load (JSON)
• Issue a load
• Redshift is PgSQL compatible
• Drivers exist for many tools
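The three load steps above, sketched with hypothetical bucket, table, and role names. The manifest enumerates the S3 objects to load, and the COPY is issued over the PostgreSQL wire protocol (psql or any PgSQL driver):

```shell
# Step 2: a load manifest enumerating the JSON log files on S3
# (bucket and paths are hypothetical).
cat > /tmp/manifest.json <<'EOF'
{
  "entries": [
    {"url": "s3://my-log-bucket/2016/04/23/part-0000.json", "mandatory": true},
    {"url": "s3://my-log-bucket/2016/04/23/part-0001.json", "mandatory": true}
  ]
}
EOF

# Step 3: issue the load. Needs a live cluster and an IAM role with S3 read
# access, so shown as comments:
# psql -h my-cluster.xxxx.us-east-1.redshift.amazonaws.com -U admin -d logs <<'SQL'
# COPY access_logs
# FROM 's3://my-log-bucket/manifests/2016-04-23.manifest'
# IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
# MANIFEST FORMAT AS JSON 'auto';
# SQL

# Sanity check: two files listed in the manifest.
grep -c '"url"' /tmp/manifest.json
```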
Recap / Lessons / Next
• Logging is really hard.
• Use tools like Amazon Kinesis Firehose, the Kinesis Agent, and the Amazon Elasticsearch Service to make it easier
• Reuse data, tools and people where possible