Post on 17-Apr-2020
Log Analytics with Amazon Elasticsearch Service
Christoph Schmitter (csc@amazon.de)
What we'll cover
• Understanding Elasticsearch capabilities
• Elasticsearch, the technology
• Aggregations; ad-hoc analysis
• Amazon Elasticsearch Service is a drop-in replacement for self-managed Elasticsearch
• Q&A
Understanding Elasticsearch capabilities
Scenario: Log data analytics
• Application monitoring and event diagnosis
• You need to monitor the performance of your application, web servers, and hardware
• You need easy to use, yet powerful data visualization tools to detect issues in near real-time
• You want the ability to dig into your logs in an intuitive, fine-grained way
• Kibana provides fast, easy visualization
Scenario: Batch data analytics
• Reporting and Analysis
• You are a mobile app developer
• You have to monitor/manage users across multiple app versions
• You want to analyze and report on usage and migration between app versions
• Use Kibana for dashboarding. Use the query API for deeper analysis
Scenario: Full-text search
• Traditional search
• Your application or website provides search capabilities over diverse documents
• You are tasked with making this knowledge base searchable and accessible
• You need key search features including text matching, faceting, filtering, fuzzy search, auto complete, and highlighting
• Use the query API to support application search
CloudTrail delivers API calls to you
• AWS API call monitoring
• You need to understand the changing landscape of your AWS resources
• You need to do security analysis and compliance auditing
• You want the ability to dig into your logs in an intuitive, fine-grained way
How Elasticsearch can help
• Combined with Kibana, Elasticsearch provides a tool for search, real-time analytics, and data visualization
Demo Architecture
[Diagram: AWS Resources, Log lines, CloudTrail Logs, Amazon CloudWatch Logs, Amazon Elasticsearch Service]
Demo: Log Analytics
Elasticsearch, the technology
Elasticsearch is like a database
Search      Database
Value       Value
Field       Column
Document    Row
Index       Table
Cluster     Database
Queries     SQL
Documents are the core entity
[Diagram: a document with an ID and field values F1, F2]
{
  "eventVersion": "1.03",
  "eventTime": "2016-06-01T00:16:19Z",
  "eventSource": "dynamodb.amazonaws.com",
  "eventName": "DescribeStream",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "52.51.24.XX",
  "userAgent": "leb-kcl-580935a6-5f94-4ce0-ac69-cdeb609ba16a,amazon-kinesis-client-library-java-lambda_1.2.1, aws-internal/3",
  "requestParameters": {
    "streamArn": "arn:aws:dynamodb:eu-west-1:17816119XXXX:table/restaurant/stream/2016-04-08T18:07:53.837"
  },
  "responseElements": null,
  "requestID": "KC608PH8POAF2I184E2SL1PS2FVV4KQNSO5AEMVJF66Q9ASUAAJG",
  "eventID": "49b56379-903b-4f04-8ce5-d21bbfcf8ab3",
  "eventType": "AwsApiCall",
  "apiVersion": "2012-08-10",
  "recipientAccountId": "17816119XXXX",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAJBQVRM7LN25CAHX7Y:awslambda_338_20160531233813522",
    "arn": "arn:aws:sts::178161197791:assumed-role/geospatial-rec-engine-ApplicationExecutionRole-9LPKB77QMR97/awslambda_338_20160531233813522", ...
Lucene provides text analysis and indexing
Term ID  Term   Postings
0        quick  1,3,5
1        brown  2,3,4,6
2        fox    1,7,9
3        lazy   2,8
4        dog    24
[Diagram: IndexWriter, IndexSearcher, Segment]
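The term-to-postings table above is an inverted index. As a rough illustration only, the sketch below builds one in plain Python. The whitespace tokenizer, the `build_inverted_index` helper, and the sample documents are assumptions made for this example; Lucene's analyzers and on-disk segment format are far more involved.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analyzer (an assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical documents, keyed by document ID.
docs = {
    1: "The quick fox",
    2: "Brown lazy dog",
    3: "Quick brown bear",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["brown"])  # [2, 3]
```

Looking up a term is then a single dictionary access, which is why term queries stay fast as the document count grows.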
Elasticsearch query processing
Query
Terms: quick, brown, fox, lazy; lorem, ipsum, dolor, sit
Index lookup
Matches: id: 216, id: 305, id: 486, id: 713
Query logic and post-filtering; scoring, aggs
Sorted matches (results): id: 713, id: 305, id: 486, id: 216
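The stages on this slide (index lookup, scoring, sorted results) can be approximated in a few lines. This is a toy sketch, not Elasticsearch's actual scoring: the `search` helper is hypothetical, the postings are invented, and the score is simply the number of query terms a document matches.

```python
def search(index, query):
    """Look up each query term, then rank documents by matched-term count."""
    scores = {}
    for term in query.lower().split():
        for doc_id in index.get(term, []):  # index lookup
            scores[doc_id] = scores.get(doc_id, 0) + 1  # toy scoring
    # Highest score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical postings: term -> document IDs.
index = {
    "quick": [216, 713],
    "brown": [305, 713],
    "fox": [486, 713],
    "lazy": [305],
}
results = search(index, "quick brown fox lazy")
print(results)  # [(713, 3), (305, 2), (216, 1), (486, 1)]
```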
Aggregations; ad-hoc analysis
Faceting: basic aggregation
• Query: shirt
Facets:
Carhartt (1092)
Russell Athletic (1087)
Dickies (954)
RALPH LAUREN (823)
Wrangler (701)
Doublju (259)
Levi's (12)
[Diagram: a document with an ID and field values F1, F2]
Elasticsearch Aggregations
• Buckets – a collection of documents meeting some criterion
• Metrics – calculations on the content of buckets.
Bucket: time
Metric: count
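A time bucket with a count metric corresponds to a date histogram aggregation. The sketch below only builds the request body for a `_search` call. The `eventTime` field comes from the sample CloudTrail record; the aggregation name, the `time_bucket_count` helper, and the hourly interval are assumptions. The `interval` key matches Elasticsearch versions of this talk's era; newer releases use `calendar_interval` or `fixed_interval` instead.

```python
import json


def time_bucket_count(field="eventTime", interval="hour"):
    """Build a search body that buckets documents by time."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            "events_over_time": {  # hypothetical aggregation name
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


body = time_bucket_count()
print(json.dumps(body, indent=2))
```

No explicit metric is needed for a count: each returned bucket carries a `doc_count`.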
A more complicated aggregation
Bucket: ARN
Bucket: Region
Bucket: eventName
Metric: Count
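Nesting buckets means placing one aggregation inside another. A minimal sketch, assuming terms aggregations on the `userIdentity.arn`, `awsRegion`, and `eventName` fields of the sample CloudTrail record; the `nested_terms` helper and the `by_...` aggregation names are made up for this example.

```python
import json


def nested_terms(fields):
    """Nest one terms aggregation per field, outermost field first."""
    agg = None
    for field in reversed(fields):
        level = {"terms": {"field": field}}
        if agg is not None:
            level["aggs"] = agg  # attach the inner level built so far
        agg = {"by_" + field.split(".")[-1]: level}
    return agg


body = {
    "size": 0,
    "aggs": nested_terms(["userIdentity.arn", "awsRegion", "eventName"]),
}
print(json.dumps(body, indent=2))
```

Terms buckets on string fields generally only make sense when the field is indexed without analysis, which is what the `not_analyzed` mapping later in the deck is for.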
More kinds of aggregations
Buckets
• Date histogram
• Histogram
• Range
• Terms
• Filters
• Significant terms
Metrics
• Count
• Average
• Sum
• Min
• Max
• Std. Dev
• Unique Count
• Percentiles
Setting up your cluster
[Diagram: the documents (Id, Id, Id, ...) of an Index/Type divided among Shard 1, Shard 2, Shard 3, and Shard 4]
Shards: independent collections of documents
Deployment of indices to a cluster
• Index 1
– Shard 1
– Shard 2
– Shard 3
• Index 2
– Shard 1
– Shard 2
– Shard 3
[Diagram: Amazon ES cluster holding primary and replica copies of shards 1, 2, and 3]
Instance 1 (master): shard copies 1, 3, 3, 1
Instance 2: shard copies 2, 1, 1, 2
Instance 3: shard copies 3, 2, 2, 3
Determining storage
• Data:Index ratio is typically close to 1:1
• Add a replica, double the storage
• Figure out data node count based on storage
– Current limits; 10T EBS, 32T instance store
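The sizing rules above reduce to simple arithmetic. A back-of-the-envelope sketch, assuming a 1:1 data-to-index ratio and one replica as on the slide; the 2,000 GB of source data and 512 GB of storage per node are made-up inputs, and both helper functions are hypothetical. It leaves out any headroom for operating overhead.

```python
import math


def required_storage_gb(source_gb, replicas=1, index_ratio=1.0):
    """Index size times the number of copies (primary plus replicas)."""
    return source_gb * index_ratio * (1 + replicas)


def data_node_count(storage_gb, per_node_gb):
    """Smallest node count whose combined storage covers the requirement."""
    return math.ceil(storage_gb / per_node_gb)


total = required_storage_gb(2000, replicas=1)
nodes = data_node_count(total, per_node_gb=512)
print(total)  # 4000.0
print(nodes)  # 8
```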
Determining instance type
• Instance type is workload-dependent
• T2; dev, test, QA
• M3; solid performance
• R3; heavier queries, aggs
• I2; largest storage option
Best practices
• Use the minimum number of shards that keeps each shard at 50G of data or less
• Number of replicas = 1
• For all prod workloads: use 3 dedicated masters
• Use the _bulk API. Some ingest mechanisms do this automatically
• Increase index.refresh_interval for higher throughput
Indexing strategy
Indexing strategy for streaming data
• Use an index per time period, typically index-per-day, high volume can go to index-per-hour
• Shard the index according to data size; use 50GB as a soft limit per shard
• Master nodes increase cluster stability
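The first two bullets can be turned into small helpers. This is a sketch under assumptions: the 50 GB soft limit is from the slide, while the `cwl` prefix, the `YYYY.MM.DD` suffix format, and both function names are illustrative choices.

```python
import math
from datetime import date


def shard_count(index_size_gb, max_shard_gb=50):
    """Fewest shards that keep each shard at or under the soft limit."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


def daily_index_name(prefix, day):
    """Index-per-day name, for example cwl-2016.06.01."""
    return f"{prefix}-{day:%Y.%m.%d}"


print(shard_count(120))  # 3
print(daily_index_name("cwl", date(2016, 6, 1)))  # cwl-2016.06.01
```

With one index per day, old data can be dropped by deleting whole indices instead of deleting individual documents.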
Index settings control sharding and more
curl -XPUT <endpoint>/<index>/_settings -d '{
  "number_of_shards" : 5,
  "number_of_replicas" : 1,
  "refresh_interval": "5s"
}'
Mappings control how data is indexed
curl -XPUT <endpoint>/<index> -d '{
  "mappings" : {
    <type> : {
      "properties" : {
        "eventName" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'
Index templates simplify mapping creation
curl -XPUT <endpoint>/_template/<name> -d '{
  "template" : "<wildcard e.g. cwl-*>",
  "settings" : { "number_of_shards" : 2 },
  "mappings" : {
    <type, e.g. _default_> : {
      "dynamic_templates" : [ {
        <name> : { "index" : "not_analyzed" } } ],
      "properties" : {
        "@timestamp" : { "type" : "date" } } }
  }
}'
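A template applies to every new index whose name matches its wildcard. The sketch below approximates that match with Python's `fnmatchcase`. This is an approximation for illustration (Python's matcher also understands `?` and `[...]`), and the `template_applies` helper and the index names are hypothetical.

```python
from fnmatch import fnmatchcase


def template_applies(pattern, index_name):
    """Return True if the index name matches the template's wildcard."""
    return fnmatchcase(index_name, pattern)


# Hypothetical index names checked against the cwl-* pattern.
for name in ["cwl-2016.06.01", "cwl-2016.06.02", "cloudtrail-2016.06.01"]:
    print(name, template_applies("cwl-*", name))
```

Combined with index-per-day naming, this means each day's index picks up its settings and mappings without a separate setup call.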
Don't forget the query API!
Direct access to the Elasticsearch API
$ curl -XPUT https://<endpoint>/blog -d '{
  "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } }'

$ curl -XPOST http://<endpoint>/blog/post/1 -d '{
  "author":"jon handler",
  "title":"Amazon ES Launch" }'

$ curl -XPOST https://<endpoint>/blog/post/_bulk -d '{ "index" : { "_index" : "blog", "_type" : "post", "_id" : "2"}}
{"title":"Amazon ES for search", "author": "carl meadows"}
{ "index" : { "_index":"blog", "_type":"post", "_id":"3" } }
{ "title":"Analytics too", "author": "vivek sriram"}
'

$ curl -XGET http://<endpoint>/_search?q=ES
{"took":16,"timed_out":false,"_shards":{"total":3,"successful":3,"failed":0},
 "hits":{"total":2,"max_score":0.13424811,"hits":[
  {"_index":"blog","_type":"post","_id":"1","_score":0.13424811,"_source":{"author":"jon handler", "title":"Amazon ES Launch" }},
  {"_index":"blog","_type":"post","_id":"2","_score":0.11506981,"_source":{"title":"Amazon ES for search", "author": "carl meadows"}}]}}
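The `_bulk` body is newline-delimited JSON: an action line followed by a source line for each document, with no commas between lines and a newline after the last one. A minimal sketch that builds such a body for the two blog posts above; the `bulk_body` helper is hypothetical, and sending the body to the endpoint is left out.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(source))  # document source line
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = bulk_body("blog", "post", [
    ("2", {"title": "Amazon ES for search", "author": "carl meadows"}),
    ("3", {"title": "Analytics too", "author": "vivek sriram"}),
])
print(body, end="")
```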
Elasticsearch is a full-featured search engine
• Built on Lucene, the popular, open-source library
• Search structured and unstructured data with complex, boolean queries
• Supports common search features: geo search, aggregations, highlighting, search suggestions, and more
Challenges with self-managed Elasticsearch
• Easy to get started, challenging to scale
• Scaling ingest pipelines is difficult
• Undifferentiated heavy lifting
Amazon Elasticsearch Service
Amazon ES overview
[Diagram: Amazon Route 53, Elastic Load Balancing, IAM, CloudWatch, Elasticsearch API, CloudTrail]
Easy cluster configuration and reconfiguration
AWS
• Elasticsearch Version
• Data nodes, count and type
• Master nodes, count and type
• Storage option – EBS/instance
• HA option
• Advanced options
High availability with Zone Awareness
[Diagram: Amazon ES cluster with Instance 1, Instance 2, Instance 3, and Instance 4 across Availability Zone 1 and Availability Zone 2; copies of shards 1, 2, and 3 are distributed across the instances]
Monitor with CloudWatch metrics
• FreeStorageSpace – monitor and alarm before the cluster runs out of space
• CPUUtilization – alarm at 80% CPU to signal the need to scale up
• ClusterStatus.yellow – check whether replication requires additional nodes
• JVMMemoryPressure – check instance type and count for sufficient resources
• MasterCPUUtilization – monitoring for master nodes is separated from data nodes
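The 80% CPU alarm can be expressed as parameters for the CloudWatch PutMetricAlarm API. This sketch only builds the parameter dictionary and does not call AWS. The `AWS/ES` namespace with `DomainName` and `ClientId` dimensions reflects my understanding of how the service publishes metrics; the alarm name, the five-minute period, the three evaluation periods, and the example domain and account values are assumptions.

```python
def cpu_alarm_params(domain_name, account_id, threshold=80.0):
    """Parameters for a CloudWatch alarm on the domain's CPUUtilization."""
    return {
        "AlarmName": f"{domain_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ES",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Average",
        "Period": 300,           # seconds per datapoint (assumption)
        "EvaluationPeriods": 3,  # consecutive breaches before alarming (assumption)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = cpu_alarm_params("logs-domain", "123456789012")
print(params["MetricName"], params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.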
Integration with the AWS ecosystem
[Diagram: Logstash, REST, CWL Agent, EC2 Instances, Amazon Kinesis, Amazon RDS, Amazon DynamoDB, Amazon SQS Queue, Logstash Cluster, Amazon Elasticsearch Service, Amazon CloudWatch, AWS Lambda, AWS CloudTrail, Access Logs, Amazon VPC Flow Logs, Amazon S3 bucket, AWS IoT, Amazon Kinesis Firehose, Amazon ECS]
Security with IAM
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:user/susan"},
    "Action": [ "es:ESHttpGet", "es:ESHttpPut", "es:ESHttpPost",
                "es:CreateElasticsearchDomain", "es:ListDomainNames" ],
    "Resource": "arn:aws:es:us-east-1:###:domain/logs-domain/<index>/*"
  } ]
}
Pay for compute and storage you use
• With Amazon Elasticsearch Service, you pay only for the compute and storage resources you use. The AWS Free Tier is available to qualifying customers.
Wrap up
• Combined with Kibana, Elasticsearch provides search and visualization for streaming data and full-text use cases.
• Elasticsearch is based on Lucene, which reads and writes search indices
• Aggregations allow you to analyze your data, splitting into Buckets and computing Metrics
• Amazon Elasticsearch Service makes it easy to set up and manage your Elasticsearch cluster on AWS
• Amazon ES is a great way to get started with Elasticsearch!
Q&A
• Christoph Schmitter, Solutions Architect: csc@amazon.de
• https://run.qwiklab.com/searches/elasticsearch
Demo Screenshots