
  • 13 June 2019 © MARKLOGIC CORPORATION

    Data Hub Performance Optimization

    JAMES CLIPPINGER, VP Strategic Accounts

    ERIN MILLER, Senior Manager, Performance and Reliability Engineering

  • What to check when your Data Hub is slow

    Code or Infrastructure?

    - Finding infrastructure problems

    - Tracking down resource bottlenecks

    - Debugging slow harmonization

    - Figuring out a slow production application

    Agenda

  • MarkLogic version

    Performance expectations

    MarkLogic system ErrorLog

    Find the resource bottleneck using Meters and testing

    If there is no bottleneck:

    - Increase workload

    - Isolate workload

    Infrastructure performance checklist

  • Are you running the most recent version of MarkLogic?

    Yes, upgrading can be a pain, but:

    - Performance issues are getting fixed all the time

    - Metrics and performance monitoring are improving all the time

    Do you want to spend time chasing a problem that has already been fixed, or one that newer monitoring would have detected for you?

    Don’t kick it old school

  • Five key resources used by MarkLogic

    - Disk bandwidth, disk space, CPU, RAM, network bandwidth

    Resource needs of ingest and harmonization generally easy to predict based on design

    Can infrastructure meet the resource needs? Test and do the math.

    Don’t be surprised by ingest, harmonization, or reindexing requirements

    Reasonable performance expectations
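
    To make "do the math" concrete, here is a back-of-envelope estimate of the sustained write bandwidth an ingest run demands. Every number in it is a hypothetical placeholder for illustration, not a benchmark from this deck:

      // Hypothetical sizing estimate -- every number below is an assumption;
      // plug in your own rates and expansion factors.
      const docsPerSecond      = 2000;   // target sustained ingest rate
      const avgDocSizeMB       = 0.05;   // ~50 KB per source document
      const indexExpansion     = 3;      // on-disk size vs. raw content (indexes, term lists)
      const writeAmplification = 1.5;    // journal writes plus merge rewrites

      const sustainedWriteMBs =
        docsPerSecond * avgDocSizeMB * indexExpansion * writeAmplification;
      // => 450 MB/s the storage layer must absorb for ingest alone

    If the storage you have cannot sustain that rate, no amount of code tuning will hide it.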

  • Recent releases of MarkLogic tell you much more about infrastructure performance issues

    - “Slow” messages: write, read, fsync, and background

    - “Memory” messages and warnings

    - “Hung” messages

    Check ErrorLog.txt
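
    A quick way to scan for those messages from Query Console. A minimal sketch in server-side JavaScript, assuming the default Linux log location:

      // Sketch: scan the ErrorLog for Slow / Memory / Hung messages.
      // Default Linux log path; adjust for your install and host.
      const log = xdmp.filesystemFile('/var/opt/MarkLogic/Logs/ErrorLog.txt');
      const hits = log.toString()
        .split('\n')
        .filter(line => /(Slow|Hung|Memory)/.test(line));
      hits.slice(-20);   // most recent 20 matches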

  • 2018-12-12 09:00:14.510 Info: Forest f1 state changed from open to error

    2018-12-12 09:00:14.510 Info: Database DB-1 is offline

    2018-12-12 09:00:14.512 Alert: XDMP-FORESTERR: Error in merge of forest f1: XDMP-MERGESPACE: Not merging due to disk space limitations, need=17039MB, have=14696MB

    Analyzing logs
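
    To catch the disk-space squeeze before a merge fails, you can poll forest status. A minimal sketch, assuming the usual forest-status structure (device-space reported in MB); the database name and threshold are examples:

      // Sketch: flag forests running low on device space (the merge error
      // above fires when a merge needs more space than the device has free).
      const warnings = [];
      for (const forestId of xdmp.databaseForests(xdmp.database('data-hub-FINAL'))) {
        const status = xdmp.forestStatus(forestId);
        const name   = fn.string(fn.head(status.xpath('*:forest-name')));
        const freeMB = Number(fn.string(fn.head(status.xpath('*:device-space'))));
        if (freeMB < 20000) {                // example threshold: < ~20 GB free
          warnings.push(name + ': only ' + freeMB + ' MB free on its device');
        }
      }
      warnings;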

  • 2017-07-06 12:10:33.553 Warning: Slow fsync/var/opt/MarkLogic/Forests/doc-stress-F4/Label, 2.044 sec2017-07-06 12:11:50.868 Notice: Slow open /var/opt/MarkLogic/Forests/doc-stress-F10/00000183/Label, 1.659 sec

    2017-07-06 12:11:59.734 Warning: Slow utime/var/opt/MarkLogic/Forests/doc-stress-F10/Label, 2.006 sec

    Analyzing logs

  • 2017-02-21 10:24:47.093 Debug: Retrying xdmp:invoke 8616837058919558659 Update 1 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml

    2017-02-21 10:24:47.990 Debug: Retrying xdmp:invoke 8616837058919558659 Update 2 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml

    2017-02-21 10:24:48.851 Debug: Retrying xdmp:invoke 8616837058919558659 Update 3 because XDMP-DEADLOCK: Deadlock detected locking /documents/1.xml

    ….

    Analyzing logs

  • Disk bandwidth

    - Cloud: check expected bandwidth

    - On prem: test using fio to establish a storage performance baseline (harder to do on shared storage)

    CPU

    - 80-85% utilization tends to be the maximum for a responsive system

    Lock Wait Load: how does it compare to expectations?

    Resource bottlenecks visible in Meters
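
    The numbers behind the Meters charts are also available from the Management REST API. A minimal sketch, assuming the default port 8002, placeholder credentials, and the JSON property names the status view typically returns (adjust to what your version emits):

      // Sketch: pull cluster-wide host status from the Management REST API.
      // Host, port, and credentials below are placeholders.
      const response = xdmp.httpGet(
        'http://localhost:8002/manage/v2/hosts?view=status&format=json',
        { authentication: { method: 'digest', username: 'admin', password: 'admin' } }
      );
      // First item is the response metadata, second is the body.
      const body = fn.head(fn.tail(response)).toObject();
      // Cluster-level CPU, memory, and I/O rollup (property names may vary by version).
      body['host-status-list']['status-list-summary'];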

  • Crank up the workload

    - More client threads

    - More MarkLogic app server/task server threads

    Work to get to a bottleneck

    - Isolate workloads: ingest, harmonize, and user queries separately

    If request rate does not go up and there’s no bottleneck, contact Support

    If there is no bottleneck

  • MarkLogic Data Hub Platform

  • Transformation from “as-is” to “curated data”

    What are your SLAs?

    But the ingest is so fast! Managing expectations

    It’s just a black box—how do I figure this out?!

    Challenge: Harmonization is slow

  • USE methodology (Gregg, http://www.brendangregg.com/usemethod.html)

    For every component in your architecture, monitor and analyze:

    - Utilization

    - Saturation

    - Errors

    For MarkLogic: use Meters to isolate bottlenecks

    - Disk space/storage; IOPS; CPU; RAM

    Code or Infrastructure?

  • Meet your friend, Request Monitoring!

    - New in 9.0-7

    - Fine-grained data about every query run by an app server

    - https://docs.marklogic.com/guide/performance/request_monitoring

    - Output goes to /var/opt/MarkLogic/Logs/APP-SRV-PORT_RequestLog.txt

    Challenge: Slow transformations

  • I just want to capture stats about all requests that are > 1 second

    Configure request monitoring on the data-hub-FINAL and data-hub-STAGING app servers

    - In the root directory of the data-hub-MODULES database, place a .api file that contains info about the metrics you want to capture and the constraints (if any)

    Request Monitoring with thresholds

  • My transform job

  • {"time":"2019-04-25T16:53:41Z", "url":"/v1/resources/ml:sjsFlow?rs=job-id=e9576f6b-7aa3-4243-bd1a-edc634343631=rs=flow-name=join=rs=target-database=data-hub-FINAL=rs=options=%7B dhf.collection = source=small%2F00%2F %2C entity = WebPage %2C flow = join %2C flowType= harmonize %7D=rs=entity-name=WebPage=rs=identifiers=...200-urls here for docs harmonized--=database=data-hub-STAGING", "user":"admin", "elapsedTime":112.311659, "requests":1, "listCacheHits":1344, "listCacheMisses":200, "listSize":811708, "inMemoryListHits":201, "expandedTreeCacheHits":369, "expandedTreeCacheMisses":631, "compressedTreeCacheHits":623, "compressedTreeCacheMisses":8, "compressedTreeSize":292496, "valueCacheHits":78, "valueCacheMisses":2943, "regexpCacheHits":408, "regexpCacheMisses":7, "filterHits":600, "fragmentsAdded":200, "dbProgramCacheHits":1002, "fsLibraryModuleCacheHits":203, "dbLibraryModuleCacheHits":26, "readLocks":48470, "writeLocks":200, "lockTime":4.382538, "commitTime":0.000703, "runTime":112.312699, "indexingTime":1.89982}

    Monitoring on a DHF endpoint: Output
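
    Because every entry in the request log is a standalone JSON object like the one above, a quick triage pass can be done in Query Console. A sketch, using the one-second threshold from the earlier slide; the port in the file name is only an example:

      // Sketch: read a request log and keep entries slower than 1 second,
      // sorted worst-first. 8011 stands in for your FINAL/STAGING app server port.
      const logText = xdmp.filesystemFile('/var/opt/MarkLogic/Logs/8011_RequestLog.txt');
      const slow = logText.toString().split('\n')
        .filter(line => line.trim().length > 0)
        .map(line => JSON.parse(line))
        .filter(req => req.elapsedTime > 1)
        .sort((a, b) => b.elapsedTime - a.elapsedTime)
        .map(req => ({ url: req.url, elapsedTime: req.elapsedTime, lockTime: req.lockTime }));
      slow;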

  • In DHF 4.x, harmonization code is in plugins

    - Main, Collector, Content, Header, Triples all run in Query mode

    - Writer runs in Update mode

    Thought experiment—what happens if I do a search in my Writer plugin?

    Good news: in DHF 5.0, we give you better guardrails!

    DHF and transactions

  • My writer plugin
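
    The plugin screenshot is not reproduced in this transcript. As a rough sketch of the shape of a DHF 4.x SJS writer plugin (the signature is approximated from the generated scaffolding, and the collection property name is an assumption), the safe version does nothing but the insert; the commented-out search is the thought experiment from the previous slide, and it would take read locks inside the update transaction, right on top of every write lock the batch holds:

      // writer.sjs -- sketch of a DHF 4.x writer plugin (signature approximated).
      function write(id, envelope, options) {
        // Don't do this here: a search inside the writer runs in the update
        // transaction, so its read locks serialize against other writers.
        // const related = cts.search(cts.collectionQuery('WebPage'));

        // Do the one thing a writer should do: persist the envelope.
        xdmp.documentInsert(id, envelope, {
          permissions: xdmp.defaultPermissions(),
          collections: options.collections || []   // property name assumed
        });
      }

      module.exports = write;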

  • Characterizing the problem

    - Lots of possible reasons: data growth over time, resource bottlenecks, un-optimized code

    - When did it start? What’s the pattern?

    Different ingest/harmonization flows impacting database search performance

    “When I run ingest and transform, my search application slows down”

    Challenge: Production application slows down over time

  • This assumes that you’ve followed USE and Clip’s suggestions and found an infrastructure bottleneck

    Look for the hockey stick—try to provision more infrastructure resources before you get there

    - What to expand when you’re expanding?

    - IOPS, CPU, RAM? Forests/hosts?

    Solution: Cluster Expansion

  • Remember, all your data uses resources—memory and storage, impacts on search, term list sizes, etc. If you’re not using the data and don’t need it, archive it

    Use Tiered Storage for less frequently accessed data

    - HDFS

    Archive unused data

    - S3 and Azure BLOB storage

    - Even if you are an on-prem customer, this is a cheap and effective storage mechanism

    Solution: Archive strategy

  • Keep requests independent

    - Isolate your workload by request. DHF does this for you

    Watch your locks

    Avoid unnecessary bottlenecks

    - Don’t create a serial number generator

    Putting it together: writing scalable code
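
    For example, a shared "next ID" document is exactly the kind of serial number generator to avoid: every request queues on its one write lock. A sketch of the alternative, with made-up URIs and properties:

      // Anti-pattern: a single "next ID" document that every request locks
      // and updates puts the whole cluster behind one write lock.
      //
      // Scalable alternative (a sketch): derive identifiers from a UUID so
      // concurrent inserts never contend on a shared document.
      declareUpdate();
      const id = sem.uuidString();
      xdmp.documentInsert('/trades/' + id + '.json', { tradeId: id, status: 'new' });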

  • Limit the request’s resources

    - Using SQL or SPARQL? Use cts or Optic search clauses to limit scope

    - Big result set? Paginate

    - Don’t write queries where result set grows with size of data

    - e.g., “give me all the trades in the database”—what happens as the DB grows?

    - If you need to do this, batch!

    Putting it together: writing scalable code
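
    A sketch of both habits in server-side JavaScript; the collection, property name, and page size are invented, and the range query assumes a matching range index exists:

      // Sketch: bound the query with index-resolvable constraints and page
      // through matches instead of materializing an unbounded result set.
      const pageSize = 100;
      const start = 1;   // 1-based position of the first result in this page

      const query = cts.andQuery([
        cts.collectionQuery('Trade'),
        cts.jsonPropertyRangeQuery('tradeDate', '>=', xs.date('2019-01-01'))
      ]);

      // cts.search is lazy, so fn.subsequence keeps the working set to one
      // page no matter how large the database grows.
      fn.subsequence(cts.search(query), start, pageSize);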

  • Realistically, bottlenecks are often a combination of un-optimized code and resource limits

    To figure out what’s what, use the Utilization-Saturation-Errors (USE) methodology

    Best process to efficiently scale:

    - First, optimize your code as best you can

    - Then, look at expansion—add RAM, add hosts, scale out and/or up

    Putting it together: USE to resolve bottlenecks

  • [email protected]. Really. Email Support. You can email us: [email protected] and [email protected], but Support is monitored 24/7

    Trying to figure out what those logs mean? https://help.marklogic.com/Knowledgebase/

    Oh look! Erin wrote a whitepaper about this:

    - https://www.marklogic.com/resources/performance-testing-marklogic/

    And another:

    - https://www.marklogic.com/resources/understanding-system-resources/

    Resources

  • Thank you
