Challenges in end-to-end performance Dr. Tim Chown, Jisc, UK [email protected] 19/10/2016


Page 1: Challenges in end-to-end performance

Challenges in end-to-end performance
Dr. Tim Chown, Jisc, UK [email protected]

19/10/2016

Page 2: Challenges in end-to-end performance


Overview

» There are many new use cases emerging in the field of data-intensive research, particularly in the sciences
› These are placing an increasing requirement on the network to transfer large volumes of data to/from compute facilities or to/from storage/archive, while achieving the best possible end-to-end performance
» This challenge exists for existing research fields
› e.g. astrophysics, particle physics, genomics, …
» But also for new fields, and new types of networked scientific equipment
› e.g., electron microscopy, where there may be no local compute facility, but a fast turnaround on processing significant data volumes is required
» We’ll hear more about examples of both of these today

30/09/2016 Challenges in Achieving Optimal End-to-End Network Performance

Page 3: Challenges in end-to-end performance


Factors affecting end-to-end performance?

» Achieving optimal end-to-end performance is a multi-faceted, nuanced problem
» It includes:

› Appropriate provisioning between the end sites (campuses, national e-Infrastructure facilities, and cloud resources) by Jisc and other ISPs

› Properties of the local campus network (at each end), including their Janet connectivity capacity, internal LAN design, the performance of firewall/IDS systems, and the configuration of other network devices on the path

› End system configuration and tuning; e.g., network stack buffer sizes, disk I/O, memory management, etc.

› The choice of software tools used to transfer data, and the underlying network protocols they use

»Note: It’s not practical to expect researchers to understand these issues in detail, but an appreciation of them at a high level would be useful


Page 4: Challenges in end-to-end performance


Goals for today

» To help you form a strategy for supporting data-intensive research, by bringing together networking people from many organisations to hear about and discuss the issues
» With regards to end-to-end performance, today’s workshop focuses on network engineering in the ‘last mile’ within end-site campuses
› This is where we currently see most of the problems
› What approaches should we take?
» Today’s presentations cover a range of related topics, with time set aside for discussion
› Important to raise awareness of the issues, the challenges, and their context
› See examples of current good practices on local network engineering
› Consider related topics, such as network performance measurement, and the application of appropriate security policies

Page 5: Challenges in end-to-end performance


Janet end-to-end performance initiative

» Jisc has set up its Janet end-to-end performance initiative (E2EPI) to help address these challenges
» The E2EPI aims to:
› Promote dialogue between Jisc, Janet-connected campus/site computing service groups, and research communities
› Engage with existing and emerging data-intensive research communities
› Hold workshops, facilitate discussion on e-mail lists, etc.
› Help researchers manage expectations
› Establish and share best practices in identifying and rectifying causes of poor performance
› Include a diverse set of applications, e.g. low-latency applications such as LOLA
» More information:
› https://www.jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative


Page 6: Challenges in end-to-end performance


Supporting data-intensive science

» Today’s campus networks are generally designed to support day-to-day application traffic; web, email, social media, video on demand, etc.

» A campus connection to Janet will be sized for this traffic
› Traffic volume tends to grow organically, and fairly predictably
» The firewall or IDS architecture is designed for thousands of concurrent short traffic flows
› Usually, by policy, all campus traffic flows through it
» The challenge is how we adapt our network architectures and design choices to also support very high throughput flows, to/from the campus
› Noting that some of these flows may place significant step changes on our campus connectivity requirements
› And that we’d prefer not to rate-limit, or use resilient links for science data
› Implies you need to conduct regular networking ‘future looks’ – do you?


Page 7: Challenges in end-to-end performance


The local network engineering context

» Appropriate local campus network engineering is an important part of the end-to-end picture
» But it’s important not to overlook other elements, for example:
› Janet is well-provisioned, as we’ll hear shortly; most problems have tended to be towards the edges, but it’s not unknown for problems to exist in the backbone
› The performance of the end systems; a fast network path to a slow disk I/O subsystem is of little practical use
› Researchers may choose the transfer tools they use, or have these imposed by the project partners they’re working with; the tools might not be optimal
› Researchers may try (say) scp, get poor performance, and give up
› Applications will behave differently, especially with respect to packet loss, depending on whether they use TCP or UDP
» You should keep these aspects in mind, to best support your researchers

Page 8: Challenges in end-to-end performance


Example issue: TCP or UDP?

» There are TCP and UDP-based transfer tools available
» TCP-based applications can be very sensitive to packet loss
› A small fraction of 1% packet loss can have a significant effect
› Thus we need to minimise packet loss for TCP applications
› GridFTP can mitigate this by using multiple parallel TCP streams
› Google’s recent work on TCP-BBR is very promising; now in the Linux kernel
» UDP-based applications are less sensitive to loss
› Aspera is gaining some popularity as a commercial data transfer tool
› But UDP is not considerate of TCP applications; TCP flows will back off in the presence of competing UDP; consider your traffic as a whole
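
To put rough numbers on the loss sensitivity described above, here is a minimal Python sketch of the widely cited Mathis et al. approximation, throughput ≤ (MSS / RTT) × C / √p. It is not taken from the slides: the function name, the constant C ≈ 1.22, and the example MSS, RTT and loss values are illustrative assumptions. The model describes classic loss-based congestion control (Reno-style), not TCP-BBR, and treating n parallel streams as n independent flows is a deliberate simplification.

```python
import math

# Constant from the Mathis et al. model for periodic loss (an assumption here).
MATHIS_C = math.sqrt(3 / 2)


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on one loss-based TCP flow's throughput, in bit/s."""
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * MATHIS_C / math.sqrt(loss_rate)


if __name__ == "__main__":
    mss = 1460   # bytes; typical for a 1500-byte MTU (illustrative)
    rtt = 0.020  # seconds; hypothetical 20 ms path
    for loss in (1e-6, 1e-5, 1e-4, 1e-3):
        mbps = mathis_throughput_bps(mss, rtt, loss) / 1e6
        print(f"loss {loss:.4%}: about {mbps:,.0f} Mbit/s per stream")

    # Crude view of parallel streams: n independent flows, each bounded as above.
    single = mathis_throughput_bps(mss, rtt, 1e-4)
    for streams in (1, 4, 8):
        total_mbps = streams * single / 1e6
        print(f"{streams} stream(s) at 0.01% loss: about {total_mbps:,.0f} Mbit/s")
```

Under these assumptions the bound scales with 1/√p and with 1/RTT, so a hundredfold increase in loss cuts the achievable rate roughly tenfold, and long paths are hit hardest.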

Page 9: Challenges in end-to-end performance


Researcher expectations?

» We need to think about how we might help set researcher expectations
» Noting some researchers may have little idea of what the network can do for them
» One way is by example:
› To send 1 PB of data over a 100 Gbit/s link would take just under a day (0.93 days)
› Or on a 10 Gbit/s link you can move 50 GB in under a minute
» The snag is that in practice it’s hard to get the maximum theoretical throughput, for a variety of reasons, including:
› Competing traffic on the same path
› Limitations in network devices
› Choice of transfer tools used (scp, GridFTP, Aspera, …)
› The impact of packet loss
› Limited buffer sizes for large round-trip time links

» Ultimately, good communication between IT staff and researchers is vital
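
The example figures above can be reproduced with a short Python sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), a fully utilised link, and no protocol overhead, so the results are best-case lower bounds on transfer time. The function names are illustrative, and the 20 ms round-trip time in the buffer-size calculation is a hypothetical value, not one from the slides.

```python
PB = 10**15  # bytes, decimal petabyte (assumption)
GB = 10**9   # bytes, decimal gigabyte (assumption)


def ideal_transfer_seconds(size_bytes: float, link_bps: float) -> float:
    """Best-case transfer time: payload bits divided by the link rate."""
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return size_bytes * 8 / link_bps


def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    if link_bps <= 0 or rtt_s < 0:
        raise ValueError("link_bps must be positive and rtt_s non-negative")
    return link_bps * rtt_s / 8


if __name__ == "__main__":
    days = ideal_transfer_seconds(1 * PB, 100e9) / 86400
    print(f"1 PB at 100 Gbit/s: {days:.2f} days")

    seconds = ideal_transfer_seconds(50 * GB, 10e9)
    print(f"50 GB at 10 Gbit/s: {seconds:.0f} s")

    # Hypothetical 20 ms RTT: the TCP buffer must hold roughly one BDP.
    megabytes = bdp_bytes(10e9, 0.020) / 1e6
    print(f"BDP at 10 Gbit/s with 20 ms RTT: about {megabytes:.0f} MB")
```

The bandwidth-delay product is why buffer sizes matter on long paths: a sender whose buffer is smaller than one BDP cannot keep the link full, however fast the link is.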


Page 10: Challenges in end-to-end performance


Good news!

» The good news is that a lot of good work has already been done
» For example:
› The Worldwide LHC Computing Grid and NRENs (like Janet) have established an optical network (LHCOPN) and an overlay network (LHCONE) to handle the transfer of Large Hadron Collider data between sites
› International work, especially by ESnet, including their FasterData resource, and their publication of the ‘Science DMZ’ design pattern
› The Data Transfer Zone (DTZ) deployment at RAL
› We’ll hear about these, and others, later…
» By building on this work we can more widely enable new types of data workflows, with virtual co-location of data and compute, analysis of live streamed data (rather than locally stored data), etc.
› Fully exploit the capacity of the Janet backbone
› Increase the potential for the UK’s research output


Page 11: Challenges in end-to-end performance


The ESnet ‘Science DMZ’ approach

» ESnet published the Science DMZ design pattern in 2012/13
› See https://www.es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
» It comprises four key elements:
› Network architecture; avoiding local bottlenecks
› Network performance measurement; deployment of perfSONAR
› An appropriate, tailored security model
› High performance data transfer node (DTN) design and configuration
» The NSF’s Campus Cyberinfrastructure Program has funded this model ($60m total) in over 100 US universities, and continues to offer awards in similar areas:
› See http://www.nsf.gov/pubs/2016/nsf16567/nsf16567.htm
› But there is no current funding equivalent in the UK


Page 12: Challenges in end-to-end performance


The Pacific Research Platform

» The PRP is an example of what becomes possible as the ‘last mile’ issues are addressed
» Seeded by $5m of NSF funding
› Sites connected at 10-100 Gbit/s
» Science-driven, by an initial set of 15 teams of scientists across multiple disciplines and organisations
› Particle physics, astronomy, astrophysics, biomedicine, genomics, structural biology, earth sciences, climate modeling, scalable visualisation, …
» See http://prp.ucsd.edu/

Page 13: Challenges in end-to-end performance


Extending Science DMZ?

» The Science DMZ specifies a design pattern
» It turns out that a number of campuses/sites in the UK have already deployed elements of the pattern, without knowing it
» An interesting question is how we might extend the existing Science DMZ principles
› On-demand provision?
› To specific end systems?
› Using SDN? (perhaps with tools like Cisco’s ACI)
› With IPv6? (noting for example that GridPP are aiming to support IPv6-only resources)
» We might also consider the potential for a UK equivalent to the PRP
» And maybe with a better name than ‘Science DMZ’?

Page 14: Challenges in end-to-end performance


Useful resources

» Janet E2EPI mail list:
› Open for anyone to join; currently around 70 list members
› See https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=E2EPI
» ESnet FasterData knowledgebase:
› Lots of good material on host and network tuning, transfer tools, network expectations, and Science DMZ principles
› See http://fasterdata.es.net/
» SWITCH’s TCP throughput calculator:
› Includes the impact on TCP throughput for a given loss rate, and a bandwidth-delay product (BDP) calculator
› See https://www.switch.ch/network/tools/tcp_throughput/


Page 15: Challenges in end-to-end performance


GÉANT eduPERT

» eduPERT is a collaborative effort by a variety of campus and NREN participants to document and share experiences in end-to-end performance problems
› See http://services.geant.net/edupert
» Includes the searchable eduPERT knowledgebase (a wiki), which contains entries added over the last 10 years
› See http://kb.pert.geant.net/PERTKB/WebHome
» Originally designed to be a coordination point between Performance Enhancement Response Teams (PERTs)
» In practice, it’s open to anyone to register and contribute
› To join the mail list: https://lists.geant.org/sympa/info/pert-discuss


Page 16: Challenges in end-to-end performance


GÉANT SIG-PMV

» The Special Interest Group for Performance Monitoring and Verification (SIG-PMV) is a new, open group studying the use of appropriate performance monitoring and measurement tools by researcher, campus and NREN groups
› Its initial activity will be to conduct surveys of communities to identify the existing tools being used, and potential gaps that may exist
› Includes consideration of small node perfSONAR and WiFiMon
» See https://wiki.geant.org/display/PMV/SIG-PMV
» Next meeting: November 3rd 2016 at SWITCH offices in Zurich
› An eduPERT hands-on training event follows on the 4th November; this will be focused on deploying perfSONAR
› Details and registration: https://eventr.geant.org/events/2494
» To join the mail list: https://lists.geant.org/sympa/info/pmv-discuss


Page 17: Challenges in end-to-end performance


Today’s talks

» Our high-level schedule for the day…

» This morning:
› The Janet perspective
› An example of an emerging data-intensive science use-case
› An overview of perfSONAR
› Use of perfSONAR and Science DMZ principles to resolve problems at Diamond
» In the afternoon:
› A set of community talks on network engineering practices
› Rounded off by ESnet talks (remote)
› With time at the end to see what we have consensus on…


Page 18: Challenges in end-to-end performance

jisc.ac.uk


We want to hear from you - do get in touch!

Dr Tim Chown

Senior Network Services Developer
Jisc, [email protected]
