Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015
TRANSCRIPT
![Page 1: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/1.jpg)
TROUBLESHOOTING COUCHBASE
David Haikney, Head of Technical Support, Couchbase EMEA
![Page 2: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/2.jpg)
©2015 Couchbase Inc. 2
What Can Possibly Go Wrong?
Help! My Node is Down
Why Did My Operation Fail?
Some Handy Tools
The following presentation is based on actual events…
![Page 3: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/3.jpg)
Help! My Node is Down
![Page 4: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/4.jpg)
Why Does A Node Go Down?...
[Diagram: four-node cluster, Node 1 to Node 4]
![Page 5: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/5.jpg)
…Because Heartbeats Have Gone Missing
[Diagram: four-node cluster, Node 1 to Node 4]
![Page 6: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/6.jpg)
![Page 7: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/7.jpg)
1. Server Itself is Down
Server (or VM) going offline takes Couchbase with it
Occurs during planned or “unscheduled” maintenance
Troubleshooting Tips!
What matters is server status at the time of the event
(VMs) check status in management console
Check server’s uptime
Couchbase’s own uptime: cbstats | grep uptime
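The uptime check can be scripted. Below is a minimal Python sketch, assuming the cbstats output has already been captured as text in the usual `name: value` layout. The helper names `parse_stats` and `restarted_within` are hypothetical, and the sample text is made up for illustration.

```python
def parse_stats(text):
    """Parse 'name: value' lines (the layout assumed for cbstats output)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            stats[name.strip()] = value.strip()
    return stats


def restarted_within(stats, seconds):
    """True if the 'uptime' stat (in seconds) is shorter than the window."""
    return int(stats["uptime"]) < seconds


# Illustrative sample only, not real output from a cluster.
sample = """
 threads:    4
 uptime:     5400
"""
stats = parse_stats(sample)
print("restarted in the last 2 hours:", restarted_within(stats, 2 * 3600))
```

Comparing this value with the server's own uptime suggests whether only the Couchbase process restarted or the whole machine did.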
![Page 8: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/8.jpg)
2. Server is Unreachable
Server unavailable on network
Usually related to wider datacenter event
Our assailant may have fled the scene!
Troubleshooting Tips!
Verify connectivity from other cluster nodes (ping / ssh)
Check system and network logs around time of first report
Standard network monitoring should be deployed
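Alongside ping and ssh, a TCP-level check against the Couchbase ports can be run from another cluster node. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the default ports (8091 for administration, 11210 for data) should be adjusted to the deployment.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_node(host, ports=(8091, 11210), timeout=2.0):
    """Map each port to whether it accepted a connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_node("10.0.0.9")` (a placeholder address) returns a dict such as `{8091: True, 11210: False}`, which hints at whether the whole server or only one service is unreachable.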
![Page 9: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/9.jpg)
3. Couchbase Service is Offline
Server is available but Couchbase service is not running
Relatively rare since babysitter restarts failed processes
Troubleshooting Tips!
Check which Couchbase processes are running
Check dmesg for Linux’s OOM Killer:
May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer
May need to reduce the server quota
Attempt to restart the service
Possibly warming up - monitor for progress with cbstats
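Scanning a saved dmesg or syslog capture for OOM Killer activity is easy to automate. The sketch below is one possible approach: the `find_oom_events` helper is hypothetical, and it only matches the "invoked oom-killer" wording shown on the slide.

```python
import re

# Matches lines such as "... kernel: memcached invoked oom-killer".
OOM_PATTERN = re.compile(r"(?P<process>\S+) invoked oom-killer")


def find_oom_events(log_text):
    """Return (process, line) pairs for each oom-killer line found."""
    events = []
    for line in log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("process"), line.strip()))
    return events


sample = "May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer"
for process, line in find_oom_events(sample):
    print(process, "->", line)
```

Note that the process named on this line is the one whose allocation triggered the OOM Killer, which is not necessarily the process that was killed.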
![Page 10: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/10.jpg)
4. Couchbase Too Slow to Respond
Service is running but failed to send timely heartbeats
Common trigger of autofailover
THP, Sizing, Swap, Hypervisor Disturbance
Troubleshooting Tips!
Ensure Transparent Huge Pages (THP) are disabled on Linux systems
Follow sizing best practice
Check for system swap usage
Avoid VM over-commit or migration
Increase the autofailover timeout
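The THP and swap checks can be sketched in Python by parsing two Linux files: `/sys/kernel/mm/transparent_hugepage/enabled` (some distributions use a different path) and `/proc/meminfo`. The helper names are hypothetical, and the parsers take the file contents as text so they can be tried on saved copies.

```python
def active_thp_mode(text):
    """Return the bracketed mode from text like 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def swap_used_kb(meminfo_text):
    """Return SwapTotal minus SwapFree (kB) from /proc/meminfo contents."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]
```

A mode other than `never`, or a non-zero swap figure on a Couchbase node, would be worth investigating before raising the autofailover timeout.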
![Page 11: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/11.jpg)
Sizing Matters
8 Buckets
60 Design Docs
10 XDCR Streams
8 CPU Cores
See Perry Krug’s session at 5:15pm today!
![Page 12: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/12.jpg)
Why Did My Operation Fail?
![Page 13: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/13.jpg)
The lifecycle of a simple get() Operation
[Diagram: Client (Your Application, Couchbase SDK) connected over the Network to Couchbase Server Node 1 and Couchbase Server Node 2]
![Page 14: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/14.jpg)
Possible Pitfalls: Operation couldn’t be dispatched
[Diagram: Client (Your Application, Couchbase SDK)]
Client unsuccessfully initialised
Server connections exhausted
Stop-the-world garbage collection
Consecutive failovers
Troubleshooting Tips!
Use a singleton pattern
Check node health on cluster
Check vbucket map from server
If necessary, restart client
Tune garbage collection settings
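The singleton tip means creating one shared connection per process rather than one per request, which helps avoid exhausting server connections. A minimal thread-safe sketch follows. `get_connection` and the `factory` argument are hypothetical names, and the factory stands in for whatever SDK call opens the real connection.

```python
import threading

_lock = threading.Lock()
_connection = None


def get_connection(factory):
    """Create the shared connection once, then always return the same one."""
    global _connection
    if _connection is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both create it.
            if _connection is None:
                _connection = factory()
    return _connection
```

Application code then calls `get_connection(open_bucket)` everywhere, where `open_bucket` is a placeholder for the application's own connection function.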
![Page 15: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/15.jpg)
Possible Pitfalls: Operation Did Not Complete
[Diagram: Client (Your Application, Couchbase SDK) and Couchbase Server Node]
Did operation arrive at server? Firewalls often intervene!
Did server have difficulty responding?
Did client receive a response?
Troubleshooting Tips!
Look for a pattern as to which clients and servers are affected
Code defensively and retry (at least once)
Check network and server health
When all else fails, tcpdump / wireshark…
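"Code defensively and retry" can be sketched as a small wrapper with exponential backoff. This is an illustrative helper, not part of any Couchbase SDK: `with_retry` and its parameters are hypothetical, and the caller decides which exception types count as retryable.

```python
import time


def with_retry(operation, retryable=(Exception,), attempts=3,
               base_delay=0.05, sleep=time.sleep):
    """Call operation(); on a retryable error, back off and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the wrapper easy to exercise without real delays. Retrying is only safe for operations that can be repeated without side effects.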
![Page 16: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/16.jpg)
tcpdump and Wireshark
Wireshark is Couchbase aware
Uses tcpdump packet capture
Can be noisy so filter wisely
Specific client
Specific server
Port 11210
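The three filters above map directly onto a tcpdump capture expression. The sketch below only builds the expression and the command line, it does not run the capture. The function names and the example addresses are hypothetical.

```python
def capture_filter(client=None, server=None, port=11210):
    """Build a capture filter limited to one client, one server and a port."""
    parts = []
    if client:
        parts.append(f"host {client}")
    if server:
        parts.append(f"host {server}")
    parts.append(f"port {port}")
    return " and ".join(parts)


def tcpdump_command(interface, outfile, expression):
    """Return an argument list that writes matching packets to a pcap file."""
    return ["tcpdump", "-i", interface, "-w", outfile, expression]


expression = capture_filter(client="10.0.0.5", server="10.0.0.9")
print(" ".join(tcpdump_command("eth0", "couchbase.pcap", expression)))
```

The resulting pcap file can then be opened in Wireshark for protocol-level inspection.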
![Page 17: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/17.jpg)
Possible Pitfalls: Not The Response I Expected
Customer: My operation failed with a “Key Does Not Exist” Error
CB Support: OK, are you sure the key exists?
Customer: Yes, our logs show the key being created and we don’t do deletes.
CB Support: The delete counter on your cluster is increasing…
Troubleshooting Tips!
Simplest explanation is often the correct one
Test your application code in node down, failover and rebalance scenarios
Defensive coding for all Couchbase operations
E.g. for Temporary Out Of Memory
Have a simple client for quick tests to isolate the problem
< 10 lines of Python!
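The support exchange above comes down to comparing two snapshots of server counters. A minimal sketch follows, assuming the stats have already been collected into dictionaries. The `increased_counters` helper is hypothetical, and the stat name `delete_hits` in the example is used for illustration only, since actual counter names depend on the server version.

```python
def increased_counters(before, after, keyword="delete"):
    """Return counters containing `keyword` that grew between snapshots."""
    deltas = {}
    for name, new_value in after.items():
        if keyword in name and name in before:
            delta = int(new_value) - int(before[name])
            if delta > 0:
                deltas[name] = delta
    return deltas


# Illustrative snapshots; names and values are assumptions, not real data.
before = {"delete_hits": "10", "get_hits": "5"}
after = {"delete_hits": "12", "get_hits": "9"}
print(increased_counters(before, after))
```

A non-empty result while the application claims to perform no deletes points the investigation back at the client code or at document expiry settings.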
![Page 18: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/18.jpg)
Some Other Handy Tools
![Page 19: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/19.jpg)
Scenario: Troubleshooting a 2.2.0 XDCR case
5 million docs went swimming one day,
Off to a cluster far away,
XDCR said quack-quack-quack-quack,
But only 4,999,999 docs came back
![Page 20: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/20.jpg)
Finding the Needle in the Haystack
Identify a single vbucket that exhibits the problem
Immediately narrows the problem space to 1/1024th of the data set
cbstats vbucket-details
Interrogate the files on disk to find the discrepancy
couch_dbdump --no-body --json --by-id <file>
jq tool is very useful for CLI JSON parsing!
Diff the source cluster and the destination cluster
Easier on a static data set but still feasible on a live cluster
With ops in flight, perform the diff twice and take the intersection
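The "diff twice and take the intersection" step can be sketched with Python sets. This assumes the dump for a vbucket has been saved as one JSON object per line with an `id` field. That layout is an assumption about the `--json` output, and the helper names are hypothetical.

```python
import json


def doc_ids(dump_text):
    """Collect the 'id' field from each non-empty JSON line of a dump."""
    ids = set()
    for line in dump_text.splitlines():
        line = line.strip()
        if line:
            ids.add(json.loads(line)["id"])
    return ids


def missing(source_ids, dest_ids):
    """Ids present on the source vbucket but absent on the destination."""
    return source_ids - dest_ids


def persistent_missing(first_diff, second_diff):
    """Ids missing in both passes; in-flight docs drop out of the result."""
    return first_diff & second_diff
```

Documents still being replicated during the first pass appear on the destination by the second pass, so only genuinely missing ids survive the intersection.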
![Page 21: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/21.jpg)
Final Thoughts
You now know the common causes of Node and Operation Failures
Troubleshooting requires taking a logical path through the scenarios
Tools exist to help you isolate the problem
We are an open kitchen: issues.couchbase.com
Support team available for 24 x 7 x 365 emergency assistance…
Co-located with developers for fastest response time
… but we hope you’ll never need us!
![Page 22: Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015](https://reader030.vdocuments.mx/reader030/viewer/2022032618/55b419b8bb61eb9e298b4753/html5/thumbnails/22.jpg)
Thank you.