The Science DMZ – perfSONAR & Network Monitoring
Jason Zurawski - ESnet Engineering & Outreach
Operating Innovative Networks (OIN)
October 3rd & 4th, 2013
With contributions from S. Balasubramanian, E. Dart, B. Johnston, A. Lake, E. Pouyoul, L. Rotman, B. Tierney and others @ ESnet
Lawrence Berkeley National Laboratory U.S. Department of Energy | Office of Science
Overview
Part 1 (Today):
• What is ESnet?
• Science DMZ Introduction & Motivation
• Science DMZ Architecture
Part 2 (Today):
• perfSONAR
• Science DMZ Security Best Practices
Part 3 (Today & Tomorrow):
• The Data Transfer Node
• Data Transfer Tools
• Conclusions & Discussion
2 – ESnet Science Engagement ([email protected]) - 10/2/13
The Data Transfer Trifecta: The “Science DMZ” Model
Dedicated Systems for Data Transfer: Data Transfer Node
• High performance
• Configured for data transfer
• Proper tools
Performance Testing & Measurement: perfSONAR
• Enables fault isolation
• Verify correct operation
• Widely deployed in ESnet and other networks, as well as sites and facilities
Network Architecture: Science DMZ
• Dedicated location for DTN
• Proper security
• Easy to deploy - no need to redesign the whole network
Test and Measurement – Keeping the Network Clean
The wide area network, the Science DMZ, and all its systems can be functioning perfectly
Eventually something is going to break
• Networks and systems are built with many, many components
• Sometimes things just break – this is why we buy support contracts
Other problems arise as well – bugs, mistakes, whatever
We must be able to find and fix problems when they occur
Why is this so important? Because we use TCP!
Where Are The Problems?
[Diagram: path from the source (S) on the Source Campus through Regional, NREN and Backbone networks to the destination (D) on the Destination Campus. Callouts: congested or faulty links between domains; congested intra-campus links; latency-dependent problems inside domains with small RTT.]
Local Testing Will Not Find Everything
[Diagram: the source (S) on the Source Campus connects through Regional networks and the R&E Backbone to the destination (D) on the Destination Campus; one device on the path is a switch with small buffers. Callouts: performance is good when RTT is < ~10 ms; performance is poor when RTT exceeds ~10 ms.]
Soft Network Failures
Soft failures are where basic connectivity functions, but high performance is not possible.
TCP was intentionally designed to hide all transmission errors from the user:
• “As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users.” (From IEN 129, RFC 761)
Some soft failures only affect high bandwidth long RTT flows.
Hard failures are easy to detect & fix
• Soft failures can lie hidden for years!
One network problem can often mask others
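Why a small amount of loss only hurts the long-RTT flows can be illustrated with the widely cited Mathis et al. approximation for TCP throughput, rate ≤ MSS / (RTT · sqrt(p)). The sketch below is not from the slides: the 1460-byte MSS, the 0.01% loss rate, the RTT values, and the choice of 1 for the constant factor are all illustrative assumptions.

```python
import math


def mathis_throughput_mbps(mss_bytes: float, rtt_ms: float, loss_rate: float) -> float:
    """Approximate ceiling on single-stream TCP throughput, in Mbps.

    Uses the Mathis et al. approximation rate <= MSS / (RTT * sqrt(p)),
    with the constant factor taken as 1 for simplicity (an assumption).
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    rtt_s = rtt_ms / 1000.0
    bits_per_second = (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))
    return bits_per_second / 1e6


if __name__ == "__main__":
    # Same loss rate (0.01%), increasing round-trip time.
    for rtt_ms in (1, 10, 50, 100):
        mbps = mathis_throughput_mbps(1460, rtt_ms, 0.0001)
        print(f"RTT {rtt_ms:>3} ms -> at most ~{mbps:,.1f} Mbps per stream")
```

Under this model the ceiling falls in proportion to RTT, so a loss rate that is invisible on a 1 ms campus path can cap a cross-country flow at a small fraction of the link rate.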
Network Monitoring
• All networks do some form of monitoring.
• Addresses needs of local staff for understanding state of the network
  o Would this information be useful to external users?
  o Can these tools function on a multi-domain basis?
• Beyond passive methods, there are active tools.
  o E.g. often we want a ‘throughput’ number. Can we automate that idea?
  o Wouldn’t it be nice to get some sort of plot of performance over the course of a day? Week? Year? Multiple endpoints?
perfSONAR = Measurement Middleware
perfSONAR
All the previous network diagrams have little perfSONAR boxes everywhere
• The reason for this is that consistent behavior requires correctness
• Correctness requires the ability to find and fix problems
  - You can’t fix what you can’t find
  - You can’t find what you can’t see
  - perfSONAR lets you see
Especially important when deploying high performance services
• If there is a problem with the infrastructure, need to fix it
• If the problem is not with your stuff, need to prove it
  - Many players in an end to end path
  - Ability to show correct behavior aids in problem localization
What is perfSONAR?
perfSONAR is a tool to:
• Set network performance expectations
• Find network problems (“soft failures”)
• Help fix these problems
All in multi-domain environments
• These problems are all harder when multiple networks are involved
perfSONAR provides a standard way to publish active and passive monitoring data
• This data is interesting to network researchers as well as network operators
The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols from ESnet and Internet2
http://psps.perfsonar.net
All components are available as RPMs, and bundled into a CentOS 6-based “netinstall” and a “Live CD”
• perfSONAR tools are much more accurate if run on a dedicated perfSONAR host, not on the DTN
Very easy to install and configure
• Usually takes less than 30 minutes
perfSONAR Toolkit
Hands On – Configuration of a pSPT
The best source of information is here:
• http://code.google.com/p/perfsonar-ps/wiki/pSPerformanceToolkit331
There are two use cases for configuration:
• Diagnostic
  - Burn CD, insert, boot, done!
  - You can’t configure regular testing, but you can test to this host, or log on and test with it
• Permanent
  - Couple of steps to install the Linux distro
Hands On – Configuration of a pSPT
We will be using VMs:
• perfsonar-ws-2.internet2.edu – perfsonar-ws-10.internet2.edu
• Note – Some of you have to share, pair up!
Visit your VM in a web browser first, e.g.:
• http://perfsonar-ws-XX.internet2.edu (where XX is your number)
Hands On – Enabling SSH
Click on “Enabled Services”
• Note you may need to ‘ok’ a security warning
Username: “root”
Password: “psworkshop”
Hands On – Via the Web Interface …
Hands On – Setting up SSH
Click ‘SSH’ to enable the SSH service
Click “Save”
• A progress bar will appear
• When done, “Configuration Saved And Services Restarted” will appear
• Note: If you are sharing, only one of you will need to make this change
SSH is now available on your host
Hands On – Configuration of a pSPT
Open a terminal
SSH to root@perfsonar-ws-XX.internet2.edu (where XX is your number):
Hands On – Administrative Info
• Do this first, otherwise a lot of other stuff won’t work.
• Authentication is required
• Always remember to save when you are done.
20 – 10/2/13, © 2013 ESnet, Internet2 J. Zurawski – [email protected]
Click on ‘edit’ to edit (of course):
Hands On – Administrative Info
Press “OK” and “Save” when done:
Hands On – Administrative Info
Hands On – NTP
• Do this second. Note that it may take a day to fully stabilize the clock
• Pick 4–5 close servers for NTP
• We have a fast way to do this, or you can manually select
• Can also add your own servers if you don’t like ours
• Note: Clocks are stable, no one should ‘save’, but feel free to play around and select closer ones if you want.
Press “select closest” to run a selection
Hands On – NTP
Add in servers manually
Hands On – NTP
Hands On – Services
• Services should be enabled/disabled from this screen (don’t use chkconfig, we overwrite that with each save…)
• Shortcuts to enable bandwidth only vs latency only
• SSH is disabled by default!
• Note: Don’t ‘save’ after this part either, but feel free to see what the buttons do.
• Select/de-select via buttons. Pick a use case as well
Hands On – Services
Hands On – Regular Testing
• All regular testing follows the same pattern:
  - Select a Type
  - Select Parameters
  - Add Hosts
  - Save
• Will only go over BWCTL here
• Initial
Hands On – Regular Testing
Create test parameters
Hands On – Regular Testing
Add Hosts
Hands On – Regular Testing
Enter a new host
Hands On – Regular Testing
Hands On – Regular Testing
Let’s use these:
• Test to 1 or 2 of your neighbors (perfsonar-ws-X.internet2.edu)
• Test to Internet2
  - Ping/OWAMP: owamp.losa.net.internet2.edu, owamp.chic.net.internet2.edu, owamp.hous.net.internet2.edu, owamp.salt.net.internet2.edu
  - Traceroute/BWCTL: bwctl.losa.net.internet2.edu, bwctl.chic.net.internet2.edu, nms-bwctl.hous.net.internet2.edu, nms-bwctl.salt.net.internet2.edu
Set up Latency, BW, Ping, and Traceroute tests
Transition – What did we just do?
• perfSONAR interface is meant to be simple (e.g. so easy even an Engineer Scientist CIO could do it)
• Enabling this on campus is the first step to seeing a simulation of performance for a bulk data tool. Ideally you would place the perfSONAR server where the users are (e.g. if they are still traversing a firewall, why don’t you learn their pain?)
• Configuring regular tests is systematic – pick regional and far away destinations.
• Dust off NetFlow, and see where the data is going – configure tests to those locations too.
Use the correct tool for the job
• To determine the correct tool, maybe we need to start with what we want to accomplish …
What do we care about measuring?
• Packet Loss, Duplication, out-of-orderness (transport layer)
• Achievable Bandwidth (e.g. “Throughput”)
• Latency (Round Trip and One Way)
• Jitter (Delay variation)
• Interface Utilization/Discards/Errors (network layer)
• Traveled Route
• MTU Feedback
The Metrics
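The first metric group (loss, duplication, out-of-orderness) can be derived from nothing more than the sequence numbers of probe packets as they arrive. The following is a minimal sketch of that bookkeeping, not how OWAMP itself is implemented; the function name and the convention that probes are numbered 0 to sent_count - 1 are assumptions made for illustration.

```python
from collections import Counter


def summarize_sequence(sent_count: int, received_seqs: list) -> dict:
    """Count lost, duplicated and reordered probes from arrival order.

    Assumes probes were sent with sequence numbers 0 .. sent_count - 1.
    """
    arrivals = Counter(received_seqs)
    lost = sent_count - len(arrivals)
    duplicates = sum(count - 1 for count in arrivals.values())

    # A first arrival counts as reordered if a higher sequence number
    # has already been seen before it.
    reordered = 0
    highest_seen = -1
    seen = set()
    for seq in received_seqs:
        if seq in seen:
            continue  # duplicates are counted separately
        seen.add(seq)
        if seq < highest_seen:
            reordered += 1
        else:
            highest_seen = seq

    return {"lost": lost, "duplicates": duplicates, "reordered": reordered}


if __name__ == "__main__":
    # Six probes sent; probe 4 never arrives, probe 2 arrives late and twice.
    print(summarize_sequence(6, [0, 1, 3, 2, 2, 5]))
```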
perfSONAR Toolkit Services
PS-Toolkit includes these measurement tools:
• BWCTL: network throughput
• OWAMP: network loss, delay, and jitter
• traceroute
Test scheduler:
• runs bwctl, traceroute, and owamp tests on a regular interval
Measurement Archives (data publication)
• SNMP MA – router interface Data
• pSB MA -- results of bwctl, owamp, and traceroute tests
Lookup Service: used to find services
PS-Toolkit includes these web100-based Troubleshooting Tools
• NDT (TCP analysis, duplex mismatch, etc.)
• NPAD (TCP analysis, router queuing analysis, etc.)
Toolkit Web Interface
Deployment By The Numbers
• Last updated early Sept 2013. Adoption trend increases with each release. CC-NIE and innovation platform helped as well.
World-Wide perfSONAR-PS Deployments: 950+ as of October 2013
Adoption = A Checkbox?
• Can say that about other technologies like IPv6 too …
• Much like a car insurance policy, most will continue to pay the premiums even though they believe they drive ‘safely’
• Most that have adopted have done so for a specific reason (e.g. it works)
• ~35 Countries
• ~205 Domains
• ~950 Instances
• 30% have made the upgrade to the latest version so far (~2 months out from release)
• Other macro trends:
  • Those that deploy, deploy more than 1
  • Huge uptick in Europe and Asia.
  • Network Providers, Campuses, and VOs
  • Not just the “usual” suspects
    - Commercial entities, African NRENs, non-DOE government – many are IPv6 only (!)
We can’t wait for users to report problems and then fix them (soft failures can go unreported for years!)
Things just break sometimes
• Failing optics
• Somebody messed around in a patch panel and kinked a fiber
• Hardware goes bad
Problems that get fixed have a way of coming back
• System defaults come back after hardware/software upgrades
• New employees may not know why the previous employee set things up a certain way and back out fixes
Important to continually collect, archive, and alert on active throughput test results
Importance of Regular Testing
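Alerting on archived throughput results can be as simple as comparing a recent average against thresholds. The sketch below assumes the results have already been fetched as a list of Mbps values; it does not query a perfSONAR measurement archive, and the threshold values are made up. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_throughput(samples_mbps: list, warn_mbps: float, crit_mbps: float) -> tuple:
    """Return (status_code, message) for a set of recent throughput results."""
    if not samples_mbps:
        # No data is itself worth flagging: the test may have stopped running.
        return UNKNOWN, "UNKNOWN - no throughput results available"
    average = sum(samples_mbps) / len(samples_mbps)
    if average < crit_mbps:
        return CRITICAL, f"CRITICAL - average {average:.1f} Mbps is below {crit_mbps} Mbps"
    if average < warn_mbps:
        return WARNING, f"WARNING - average {average:.1f} Mbps is below {warn_mbps} Mbps"
    return OK, f"OK - average {average:.1f} Mbps"


if __name__ == "__main__":
    # Hypothetical results from the last few scheduled tests.
    code, message = check_throughput([938.3, 912.0, 440.5], warn_mbps=900, crit_mbps=500)
    print(code, message)
```

Treating an empty result set as UNKNOWN rather than OK matters here: a test that silently stopped running would otherwise look healthy.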
perfSONAR Dashboard: http://ps-dashboard.es.net
Adding Attenuator to Noisy Link
Host Tuning Example
• Host Configuration – spot when the TCP settings were tweaked…
• Example taken from REDDnet (UMich to TACC, using BWCTL measurement)
• Host Tuning: http://fasterdata.es.net/fasterdata/host-tuning/linux/
Regular perfSONAR Tests
We run regular tests to check for two things
• TCP throughput
• One way delay and packet loss
perfSONAR has mechanisms for managing regular testing between perfSONAR hosts
• Statistics collection and archiving
• Graphs
• Dashboard display
• Integrate with NAGIOS
This infrastructure is deployed now – perfSONAR hosts at facilities can take advantage of it
At-a-glance health check for data infrastructure
Throughput Detail Graph
• Temporary drop in performance was due to re-route around a fiber cut
• Latency increase
• Clean otherwise (performance stayed high)
• Other than that, it’s stable, and performs well (over 2Gbps per stream)
• This is a powerful tool for expectation management
What are you going to measure?
• Achievable bandwidth
  - 2-3 regional destinations
  - 4-8 important collaborators
  - 4-8 (more if you are willing, especially to start) times per day to each destination
  - 20-30 second tests within a region, longer across oceans and continents
• Loss/Availability/Latency
  - OWAMP: ~10-20 collaborators over diverse paths
• Interface Utilization & Errors (via SNMP)
What are you going to do with the results?
• NAGIOS Alerts
• Reports to user community
• Dashboard
Develop a Test Plan
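It can help to check how much test traffic a plan like this actually generates. The sketch below takes the upper end of the ranges above (3 regional destinations plus 8 collaborators, 8 tests per day, 30 seconds each) and computes the daily test time. The class and field names are hypothetical, and it counts one direction per test, which is a simplifying assumption.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class ThroughputPlan:
    destinations: int   # number of hosts tested against
    tests_per_day: int  # tests per day to each destination
    duration_s: int     # length of each test in seconds

    def total_tests(self) -> int:
        return self.destinations * self.tests_per_day

    def seconds_per_day(self) -> int:
        return self.total_tests() * self.duration_s

    def duty_cycle(self) -> float:
        """Fraction of the day the host spends running throughput tests."""
        return self.seconds_per_day() / SECONDS_PER_DAY


if __name__ == "__main__":
    plan = ThroughputPlan(destinations=3 + 8, tests_per_day=8, duration_s=30)
    print(f"{plan.total_tests()} tests/day, {plan.seconds_per_day()} s/day, "
          f"{plan.duty_cycle():.1%} of the day")
```

Even at the upper end of the suggested ranges, the tester is busy for only a few percent of the day, which leaves room to add destinations as the slide suggests.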
http://psps.perfsonar.net/toolkit/hardware.html
Dedicated perfSONAR hardware is best
• Server class is a good choice
• Desktop/Laptop/Mini (Mac, Shuttle) can be problematic, but work in a diagnostic capacity
Other applications will perturb results
Separate hosts for throughput tests and latency/loss tests are preferred
• Throughput tests can cause increased latency and loss
• Latency tests on a throughput host are still useful however
1Gbps vs 10Gbps testers
• There are a number of problems that only show up at speeds above 1Gbps
Virtual Machines do not always work well as perfSONAR hosts (use-case specific)
• Clock sync issues are a bit of a factor
• Throughput is reduced significantly for 10G hosts
• VM technology and motherboard technology have come a long way, YMMV
• NDT/NAGIOS/SNMP/1G BWCTL are good choices for a VM, OWAMP/10G BWCTL are not
Host Considerations
perfSONAR Deployment Locations
Critical to deploy such that you can test with useful semantics
perfSONAR hosts allow parts of the path to be tested separately
• Reduced visibility for devices between perfSONAR hosts
• Must rely on counters or other means where perfSONAR can’t go
Effective test methodology derived from protocol behavior
• TCP suffers much more from packet loss as latency increases
• TCP is more likely to cause loss as latency increases
• Testing should leverage this in two ways
  - Design tests so that they are likely to fail if there is a problem
  - Mimic the behavior of production traffic as much as possible
• Note: don’t design your tests to succeed
  - The point is not to “be green” even if there are problems
  - The point is to find problems when they come up so that the problems are fixed quickly
Sample Site Deployment
ATLAS Dashboard
Trouble ticket comes in: “I’m getting terrible performance from site A to site B”
If there is a perfSONAR node at each site border:
• Run tests between perfSONAR nodes
  - Performance is often clean
• Run tests from end hosts to perfSONAR host at site border
  - Often find packet loss (using owamp tool)
  - If not, problem is often the host tuning or the disk
  - If not that, suspect a switch buffer overflow problem
    • These are the hardest to prove
If there is not a perfSONAR node at each site border
  - Try to get one deployed
  - Run tests to other nearby perfSONAR nodes
Common perfSONAR Use Case
WAN Test Methodology – Problem Isolation
Segment-to-segment testing is unlikely to be helpful
• TCP dynamics will be different
• Problem links can test clean over short distances
• An exception to this is hops that go thru a firewall
Run long-distance tests
• Run the longest clean test you can, then look for the shortest dirty test that includes the path of the clean test
In order for this to work, the testers need to be already deployed when you start troubleshooting
• ESnet has at least one perfSONAR host at each hub location.
  - Many (most?) R&E providers in the world have deployed at least 1
• If your provider does not have perfSONAR deployed ask them why, and then ask when they will have it done
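The "longest clean test, shortest dirty test" rule can be read as set arithmetic over path segments: whatever the shortest dirty test crosses that no clean test crossed is where to look first. The sketch below is one possible reading of that rule, not a perfSONAR feature; the segment names are hypothetical labels. Only long clean tests should be fed in, since the slide warns that problem links can test clean over short distances.

```python
def suspect_segments(tests: list) -> list:
    """Return segments of the shortest dirty test not covered by any clean test.

    tests: list of (segments, is_clean) pairs, where segments is a list of
    path-segment names in order.
    """
    covered_by_clean = set()
    dirty_paths = []
    for segments, is_clean in tests:
        if is_clean:
            covered_by_clean.update(segments)
        else:
            dirty_paths.append(list(segments))
    if not dirty_paths:
        return []
    shortest_dirty = min(dirty_paths, key=len)
    # Keep path order so the result reads like a route.
    return [seg for seg in shortest_dirty if seg not in covered_by_clean]


if __name__ == "__main__":
    results = [
        (["lab", "esnet", "internet2", "regional"], True),
        (["lab", "esnet", "internet2", "regional", "campus_border"], False),
        (["lab", "esnet", "internet2", "regional", "campus_border", "campus_core"], False),
    ]
    print(suspect_segments(results))
```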
Network Performance Troubleshooting Example
[Diagram: a University Campus and a National Laboratory connected across the WAN over 10GE and Nx10GE links. Each site has a border perfSONAR host and a Science DMZ perfSONAR host. The end-to-end path is marked “Poor Performance”.]
Wide Area Testing – Full Context
[Diagram: the same campus-to-lab path in full context, with 10GE, Nx10GE and 100GE links and perfSONAR hosts at the site borders, in the Science DMZs, and throughout the intervening networks. Approximate latencies: Campus ~1 msec, Regional path ~2 msec, Internet2 path ~15 msec, ESnet path ~30 msec, Lab ~1 msec. The end-to-end path is marked “Poor Performance”.]
Wide Area Testing – Long Clean Test
[Diagram: the full-context topology again, with a 48 msec test path between perfSONAR hosts marked “Clean, Fast”. Approximate latencies: Campus ~1 msec, Regional path ~2 msec, Internet2 path ~15 msec, ESnet path ~30 msec, Lab ~1 msec.]
Wide Area Testing – Poorly Performing Tests Illustrate Likely Problem Areas
[Diagram: the full-context topology again, with test paths of 48 msec and 49 msec between perfSONAR hosts. Some test paths are marked “Clean, Fast” and others “Dirty, Slow”. Approximate latencies: Campus ~1 msec, Regional path ~2 msec, Internet2 path ~15 msec, ESnet path ~30 msec, Lab ~1 msec.]
Lessons From This Example
This testing can be done quickly if perfSONAR is already deployed
Huge productivity
• Reasonable hypothesis developed quickly
• Probable administrative domain identified
• Testing time can be short – an hour or so at most
Without perfSONAR cases like this are very challenging
Time to resolution measured in months
In order to be useful for data-intensive science, the network must be fixable quickly, because it will break
The Science DMZ model allows high-performance use of the network, but perfSONAR is necessary to ensure the whole kit functions well
perfSONAR Community
perfSONAR-PS is working to build a strong user community to support the use and development of the software.
perfSONAR-PS Mailing Lists
• Announcement Lists:
  - https://mail.internet2.edu/wws/subrequest/perfsonar-ps-announce
  - https://mail.internet2.edu/wws/subrequest/performance-node-announce
• Users List:
  - https://mail.internet2.edu/wws/subrequest/performance-node-users
More on perfSONAR
http://psps.perfsonar.net/
https://code.google.com/p/perfsonar-ps/
Overview
Part 1 (Today):
• What is ESnet?
• Science DMZ Introduction & Motivation
• Science DMZ Architecture
Part 2 (Today):
• perfSONAR
• Science DMZ Security Best Practices
Part 3 (Today & Tomorrow):
• The Data Transfer Node
• Data Transfer Tools
• Conclusions & Discussion
State of the Campus
Show of hands – is there a firewall on your campus?
• Do you know who ‘owns’ it? Maintains it? Is it being maintained?
• Have you ever asked for a ‘port’ to be opened? White list a host? Does this involve an email to ‘a guy’ you happen to know?
• Has it prevented you from being ‘productive’?
In General …
• Yes, they exist.
• Someone owns them, and probably knows how to add rules – but the ‘maintenance’ question is harder to answer.
  - Like a router/switch, they need firmware updates too…
• Will it impact you – ‘it depends’. Yes, it will have an effect on your traffic at all times, but will you notice?
  - Small streams (HTTP, Mail, etc.) – you won’t notice slowdowns, but you will notice blockages
  - Larger streams (Data movement, Video, Audio) – you will notice slowdowns
Say Hello to your Frienemy: The Campus Firewall
To be 100% clear – the firewall is a useful tool:
• A layer of protection that is based on allowed, and disallowed, behaviors
• One stop location to install instructions (vs. implementing in multiple locations)
• Very necessary for things that need ‘assurance’ (e.g. student records, medical data, protecting the HVAC system, IP Phones, and printers from bad people, etc.)
To be 100% clear again, the firewall delivers functionality that can be implemented in different ways
• Filtering ranges can be implemented via ACLs
• Port/Host blocking can be done on a host by host basis
• IDS tools can implement near real-time blocking of ongoing attacks that match heuristics
The role of Campus Firewalls
I am not here to make you throw away the Firewall
• The firewall has a role; it’s time to define what that role is, and is not
• Policy may need to be altered (pull out the quill pens and parchment)
• Minds may need to be changed
I am here to make you think critically about campus security as a system. That requires:
• Knowledge of the risks and mitigation strategies
• Knowing what the components do, and do not do
• Humans to implement and manage certain features – this may be a shock to some (lunch is never free)
When Security and Performance Clash
What does a firewall do?
• Streams of packets enter into an ingress port – there is some buffering
• Packet headers are examined. Have I seen a packet like this before?
  - Yes – If I like it, let it through, if I didn’t like it, goodbye.
  - No – Who sent this packet? Are they allowed to send me packets? What port did it come from, and what port does it want to go to?
• Packet makes it through processing and switching fabric to some egress port. Sent on its way to the final destination.
Where are the bottlenecks?
• Ingress buffering – can we tune this? Will it support a 10G flow, let alone multiple 10G flows?
• Processing speed – being able to verify quickly is good. Verifying slowly will make TCP sad
• Switching fabric/egress ports. Not a huge concern, but these can drop packets too
• Is the firewall instrumented to know how well it is doing? Could I ask it?
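The “have I seen a packet like this before?” step is essentially a lookup in a flow table, with a slower rule evaluation only for flows not yet seen. The toy model below illustrates that split and the last question on the slide (a device that counts its own slow-path work). It is a sketch for discussion, not a description of any real firewall; the class names, the port-only rule, and the addresses (taken from documentation ranges) are all assumptions.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    src_port: int
    dst: str
    dst_port: int


class ToyFirewall:
    """Caches a verdict per flow; unseen flows take the slow rule-evaluation path."""

    def __init__(self, allowed_dst_ports):
        self.allowed_dst_ports = set(allowed_dst_ports)
        self.verdicts = {}          # flow -> True (permit) / False (drop)
        self.slow_path_lookups = 0  # simple instrumentation counter

    def permit(self, flow: Flow) -> bool:
        if flow in self.verdicts:
            # Seen before: reuse the earlier decision.
            return self.verdicts[flow]
        # Not seen before: evaluate the (very simplified) rule set.
        self.slow_path_lookups += 1
        verdict = flow.dst_port in self.allowed_dst_ports
        self.verdicts[flow] = verdict
        return verdict


if __name__ == "__main__":
    fw = ToyFirewall(allowed_dst_ports={10200})
    test_flow = Flow("192.0.2.10", 40000, "198.51.100.5", 10200)
    print(fw.permit(test_flow), fw.permit(test_flow), fw.slow_path_lookups)
```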
Causes of Jitter
• Processing Delay: Time to process a packet
• Queuing Delay: Time spent in ingress/egress queues to device
• Transmission Delay: Time needed to put the packet on the wire
• Propagation Delay: Time needed to travel on the wire
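Two of these four components can be computed directly from link speed and distance, which shows why they contribute little to jitter on a fixed path: for a given packet size and route they do not change from packet to packet. The sketch below assumes a 1500-byte packet and a signal speed in fiber of roughly 200,000 km/s (about two thirds of the speed of light); both are illustrative assumptions, not figures from the slides.

```python
FIBER_SPEED_KM_PER_S = 200_000  # approximate signal speed in optical fiber


def transmission_delay_us(packet_bytes: int, link_bps: float) -> float:
    """Time to clock a packet onto the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6


def propagation_delay_ms(distance_km: float, speed_km_per_s: float = FIBER_SPEED_KM_PER_S) -> float:
    """One-way time for a signal to travel the given distance, in milliseconds."""
    return distance_km / speed_km_per_s * 1e3


if __name__ == "__main__":
    for name, bps in (("1G", 1e9), ("10G", 10e9)):
        print(f"1500-byte packet on {name}: {transmission_delay_us(1500, bps):.2f} us")
    print(f"1000 km of fiber: {propagation_delay_ms(1000):.1f} ms one way")
```

Processing and queuing delay, by contrast, depend on what else the device is doing at that moment, so they are the components that vary between packets.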
When Security and Performance Clash
Let’s look at two examples that highlight two primary network architecture use cases:
• Totally protected campus, with a border firewall
- Central networking maintains the device, and protects all in/outbound traffic
- Pro: end of the line customers don’t need to worry (as much) about security
- Con: end of the line customers *must* be sent through the disruptive device
• Unprotected campus, protection is the job of network customers
- Central networking gives you a wire and wishes you best of luck
- Pro: nothing in the path to disrupt traffic, unless you put it there
- Con: Security becomes an exercise that is implemented by all end customers
Brown University – Firewalls for All
Brown University Example
Results to host behind the firewall:
Brown University Example
In front of the firewall:
Brown Univ. Example – TCP Dynamics
Want more proof – let’s look at a measurement tool through the firewall.
• Measurement tools emulate a well behaved application
‘Outbound’, not filtered:
• nuttcp -T 10 -i 1 -p 10200 bwctl.newy.net.internet2.edu
• 92.3750 MB / 1.00 sec = 774.3069 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.2879 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.3019 Mbps 0 retrans
• 111.7500 MB / 1.00 sec = 938.1606 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.3198 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.2653 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.1931 Mbps 0 retrans
• 111.9375 MB / 1.00 sec = 938.4808 Mbps 0 retrans
• 111.6875 MB / 1.00 sec = 937.6941 Mbps 0 retrans
• 111.8750 MB / 1.00 sec = 938.3610 Mbps 0 retrans
• 1107.9867 MB / 10.13 sec = 917.2914 Mbps 13 %TX 11 %RX 0 retrans 8.38 msRTT
Thru the firewall
‘Inbound’, filtered:
• nuttcp -r -T 10 -i 1 -p 10200 bwctl.newy.net.internet2.edu
• 4.5625 MB / 1.00 sec = 38.1995 Mbps 13 retrans
• 4.8750 MB / 1.00 sec = 40.8956 Mbps 4 retrans
• 4.8750 MB / 1.00 sec = 40.8954 Mbps 6 retrans
• 6.4375 MB / 1.00 sec = 54.0024 Mbps 9 retrans
• 5.7500 MB / 1.00 sec = 48.2310 Mbps 8 retrans
• 5.8750 MB / 1.00 sec = 49.2880 Mbps 5 retrans
• 6.3125 MB / 1.00 sec = 52.9006 Mbps 3 retrans
• 5.3125 MB / 1.00 sec = 44.5653 Mbps 7 retrans
• 4.3125 MB / 1.00 sec = 36.2108 Mbps 7 retrans
• 5.1875 MB / 1.00 sec = 43.5186 Mbps 8 retrans
• 53.7519 MB / 10.07 sec = 44.7577 Mbps 0 %TX 1 %RX 70 retrans 8.29 msRTT
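When comparing runs like the two above, it can help to pull the per-interval rate and retransmit counts out of the text. The following is a minimal sketch that assumes the interval lines look exactly like the ones shown on these slides; the regular expression and function names are illustrative and are not part of nuttcp.

```python
import re
from typing import Dict, List

# Matches interval lines such as "4.5625 MB / 1.00 sec = 38.1995 Mbps 13 retrans".
# The final summary line (which has %TX/%RX between Mbps and retrans) does not match.
INTERVAL_RE = re.compile(
    r"(?P<mb>\d+\.\d+) MB\s*/\s*(?P<sec>\d+\.\d+) sec\s*=\s*"
    r"(?P<mbps>\d+\.\d+) Mbps\s+(?P<retrans>\d+) retrans"
)


def parse_intervals(text: str) -> List[Dict[str, float]]:
    """Return one record per interval line found in the pasted output."""
    rows = []
    for line in text.splitlines():
        m = INTERVAL_RE.search(line)
        if m:
            rows.append({"mbps": float(m.group("mbps")),
                         "retrans": int(m.group("retrans"))})
    return rows


def summarize(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean interval rate and total retransmits across the parsed intervals."""
    if not rows:
        return {"mean_mbps": 0.0, "total_retrans": 0}
    return {"mean_mbps": sum(r["mbps"] for r in rows) / len(rows),
            "total_retrans": sum(r["retrans"] for r in rows)}
```

Feeding it the inbound intervals highlights the pattern in the slide: retransmits in every interval and rates far below the outbound test.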
tcptrace output: with and without a firewall
[Figure: tcptrace plots, panels labeled “firewall” and “No firewall”]
The Pennsylvania State University – Firewalls for Some
Unprotected campus, protection is the job of network customers
The Pennsylvania State University
• Initial report from network users: performance poor in both directions
• Outbound and inbound (normal issue is inbound through protection mechanisms)
• From previous diagram – CoE firewall was tested
• Machine outside/inside of firewall. Test to point 10ms away (Internet2 Washington)
jzurawski@ssstatecollege:~> nuttcp -T 30 -i 1 -p 5679 -P 5678 64.57.16.22
5.8125 MB / 1.00 sec = 48.7565 Mbps 0 retrans
6.1875 MB / 1.00 sec = 51.8886 Mbps 0 retrans
…
6.1250 MB / 1.00 sec = 51.3957 Mbps 0 retrans
6.1250 MB / 1.00 sec = 51.3927 Mbps 0 retrans
184.3515 MB / 30.17 sec = 51.2573 Mbps 0 %TX 1 %RX 0 retrans 9.85 msRTT
The Pennsylvania State University
• Observation: net.ipv4.tcp_window_scaling did not seem to be working
• 64K of buffer is default. Over a 10ms path, this means we can hope to see only 50Mbps of throughput:
• BDP (50 Mbit/sec, 10.0 ms) = 0.06 Mbyte
• Implication: something in the path was not respecting the specification in RFC 1323, and was not allowing the TCP window to grow
• TCP window of 64 KByte and RTT of 1.0 ms <= 500.00 Mbit/sec.
• TCP window of 64 KByte and RTT of 5.0 ms <= 100.00 Mbit/sec.
• TCP window of 64 KByte and RTT of 10.0 ms <= 50.00 Mbit/sec.
• TCP window of 64 KByte and RTT of 50.0 ms <= 10.00 Mbit/sec.
• Reading documentation for firewall:
• TCP flow sequence checking was enabled
• What would happen if this was turned off (both directions)?
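The window-limited figures above follow from the fact that TCP can have at most one window of data in flight per round trip. Here is a small sketch of that arithmetic, assuming 64 KByte means 65536 bytes; with that assumption the results come out slightly above the rounded numbers on the slide (about 52 Mbit/sec at 10 ms instead of 50). The function names are illustrative.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on TCP throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def bdp_bytes(rate_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_mbps * 1e6 * (rtt_ms / 1000.0) / 8


if __name__ == "__main__":
    for rtt in (1.0, 5.0, 10.0, 50.0):
        print(f"64 KByte window, RTT {rtt} ms <= {max_throughput_mbps(65536, rtt):.2f} Mbit/sec")
```

Read the other way, `bdp_bytes(50, 10.0)` gives 62500 bytes, which is the 0.06 Mbyte quoted above: a window stuck at 64 KByte is just enough to sustain about 50 Mbit/sec on this path and no more.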
The Pennsylvania State University
jzurawski@ssstatecollege:~> nuttcp -T 30 -i 1 -p 5679 -P 5678 64.57.16.22
55.6875 MB / 1.00 sec = 467.0481 Mbps 0 retrans
74.3750 MB / 1.00 sec = 623.5704 Mbps 0 retrans
87.4375 MB / 1.00 sec = 733.4004 Mbps 0 retrans
…
91.7500 MB / 1.00 sec = 770.0544 Mbps 0 retrans
88.6875 MB / 1.00 sec = 743.5676 Mbps 28 retrans
69.0625 MB / 1.00 sec = 578.9509 Mbps 0 retrans
2300.8495 MB / 30.17 sec = 639.7338 Mbps 4 %TX 17 %RX 730 retrans 9.88 msRTT
The Pennsylvania State University
Impacting real users:
Science DMZ Security
Goal – disentangle security policy and enforcement for science flows from that of business systems
Rationale
• Science flows are relatively simple from a security perspective
• Narrow application set on Science DMZ hosts
- Data transfer, data streaming packages
- Performance / packet loss monitoring tools
- No printers, document readers, web browsers, building control systems, staff desktops, etc.
• Security controls that are typically implemented to protect business resources often cause performance problems
• Sizing security infrastructure designed for business networks to handle large science flows is expensive
In Big Data Science, Performance Is a Core Requirement Too
Core information security principles • Confidentiality, Integrity, Availability (CIA)
In data-intensive science, performance is an additional core mission requirement (CIAP)
• CIA principles are important, but if the performance isn’t there the science mission fails
• This isn’t about “how much” security you have, but how the security is implemented
• We need to be able to appropriately secure systems in a way that does not compromise performance
Science DMZ Placement Outside the Firewall
The Science DMZ resources are placed outside the enterprise firewall for performance reasons
• The meaning of this is specific – Science DMZ traffic does not traverse the firewall data plane
• This has nothing to do with whether packet filtering is part of the security enforcement toolkit
Lots of heartburn over this, especially from the perspective of a conventional firewall manager
• Lots of organizational policy directives mandating firewalls
• Firewalls are designed to protect converged enterprise networks
• Why would you put critical assets outside the firewall???
The answer is that firewalls are typically a poor fit for high-performance science applications
The Ubiquitous Firewall
The workhorse device of network security – the firewall – has a poor track record in high-performance contexts
• Firewalls are typically designed to support a large number of users/devices, each with low throughput requirements
- Data intensive science typically generates a much smaller number of connections that are much higher throughput
Modern firewalls are far more than a packet filter:
• Decode certain application protocols (IDS/IPS functionality, URL filter, etc.)
• Rewrite headers (e.g. NAT)
• VPN Gateway
None of these are relevant to Science DMZ applications
What’s Inside Your Firewall?
Vendor: “But wait – we don’t do this anymore!”
• It is true that vendors are working toward line-rate 10G firewalls, and some may even have them now
• 10GE has been deployed in science environments for over 10 years
• Firewall internals have only recently started to catch up with the 10G world
• 100GE is being deployed now, 40Gbps host interfaces are available now
• Firewalls are behind again
In general, IT shops want to get 5+ years out of a firewall purchase
• This often means that the firewall is years behind the technology curve
• Whatever you deploy now, that’s the hardware feature set you get
• When a new science project tries to deploy data-intensive resources, they get whatever feature set was purchased several years ago
Firewall Capabilities and Science Traffic
Firewalls have a lot of sophistication in an enterprise setting
• Application layer protocol analysis (HTTP, POP, MSRPC, etc.)
• Built-in VPN servers
• User awareness
Data-intensive science flows don’t match this profile
• Common case – data on filesystem A needs to be on filesystem Z
- Data transfer tool verifies credentials over an encrypted channel
- Then open a socket or set of sockets, and send data until done (1TB, 10TB, 100TB, …)
• One workflow can use 10% to 50% or more of a 10G network link
Do we have to use a firewall?
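To put the transfer sizes and link shares mentioned above in perspective, the sketch below estimates how long a single transfer occupies the link. It assumes decimal terabytes (1 TB = 10^12 bytes) and ignores protocol overhead; the function name and defaults are illustrative.

```python
def transfer_hours(size_tb: float, link_gbps: float = 10.0, utilization: float = 0.5) -> float:
    """Hours needed to move size_tb at the given share of the link rate."""
    bits = size_tb * 1e12 * 8                      # decimal TB, an assumption
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600.0


if __name__ == "__main__":
    for size in (1, 10, 100):
        for share in (0.1, 0.5):
            print(f"{size} TB at {share:.0%} of 10G: {transfer_hours(size, 10.0, share):.1f} hours")
```

Under these assumptions, 1 TB at half of a 10G link takes under half an hour, while 100 TB at a tenth of the link takes more than nine days. These are long-lived, high-rate flows, not the short sessions an enterprise firewall is sized for.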
Firewalls vs Router Access Control Lists
When you ask a firewall administrator to allow data transfers through the firewall, what do they ask for?
• IP address of your host
• IP address of the remote host
• Port range
• That looks like an ACL to me – I can do that on the router!
Firewalls make expensive, low-performance ACL filters compared to the ACL capabilities that are typically built into the router
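The three items a firewall administrator asks for map directly onto a stateless match. The sketch below expresses that match with the Python standard library `ipaddress` module; `AclRule` and `permits` are hypothetical names, the addresses are documentation prefixes, and a real router would express this in its own ACL syntax and evaluate it in hardware.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable


@dataclass(frozen=True)
class AclRule:
    """One permit entry: remote prefix, local prefix, destination port range."""
    src: str       # remote host or prefix in CIDR form, e.g. "198.51.100.7/32"
    dst: str       # local host or prefix in CIDR form
    port_lo: int
    port_hi: int


def permits(rules: Iterable[AclRule], src_ip: str, dst_ip: str, dst_port: int) -> bool:
    """Return True if any rule matches; otherwise fall through to an implicit deny."""
    s, d = ip_address(src_ip), ip_address(dst_ip)
    for rule in rules:
        if (s in ip_network(rule.src) and d in ip_network(rule.dst)
                and rule.port_lo <= dst_port <= rule.port_hi):
            return True
    return False
```

Nothing here needs per-flow state or payload inspection, which is why the same policy can live on the router.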
Security Without Firewalls
Does this mean we ignore security? NO!
• We must protect our systems
• We just need to find a way to do security that does not prevent us from getting the science done
Lots of other security solutions
• Host-based IDS and firewalls
• Intrusion detection (Bro, Snort, others), flow analysis, …
• Tight ACLs reduce attack surface (possible in many but not all cases)
• Key point – performance is a mission requirement, and the security policies and mechanisms that protect the Science DMZ should be architected so that they serve the mission
If Not Firewalls, Then What?
• Remember – the goal is to protect systems in a way that allows the science mission to succeed
• There are multiple ways to solve this – some are technical, and some are organizational/sociological
• Note: this is harder than just putting up a firewall and thinking you are done
Other Security Tools
Intrusion Detection Systems (IDS)
• One example is Bro – http://bro-ids.org/
• Bro is high-performance and battle-tested
- Bro protects several high-performance national assets
- Bro can be scaled with clustering: http://www.bro-ids.org/documentation/cluster.html
• Other IDS solutions are available also
Blackhole Routing to block attacks
Netflow, IPFIX, sflow, etc. can provide visibility
Other Security Tools (2)
Aggressive access lists
• More useful with project-specific DTNs
• If the purpose of the DTN is to exchange data with a small set of remote collaborators, the ACL is pretty easy to write
• Large-scale data distribution servers are hard to handle this way (but then, the firewall ruleset for such a service would be pretty open too)
Limitation of the application set
• One of the reasons to limit the application set in the Science DMZ is to make it easier to protect
• Keep unnecessary applications off the DTN (and watch for them anyway using a host IDS – take violations seriously)
Other Security Tools (3)
Using a Host IDS is recommended for hosts in a Science DMZ
There are several open source solutions that have been recommended:
• OSSec: http://www.ossec.net/
• Rkhunter: http://rkhunter.sourceforge.net (rootkit detection + FIM)
• chkrootkit: http://chkrootkit.org/
• Logcheck: http://logcheck.org (log monitoring)
• Fail2ban: http://www.fail2ban.org/wiki/index.php/Main_Page
• denyhosts: http://denyhosts.sourceforge.net/
Using OpenFlow to help secure the Science DMZ
Using OpenFlow to control access to a network-based service seems promising
• E.g.: Sam Russell’s work at REANNZ:
- http://pieknywidok.blogspot.com.au/2013/01/thimble-secure-high-speed-connectivity.html
• This could significantly reduce the attack surface for any authenticated network service
Collaboration Within The Organization
All stakeholders should collaborate on Science DMZ design, policy, and enforcement
The security people have to be on board
• Remember: in some organizations security people already have political cover – it’s called the firewall
• If a host gets compromised, the security officer can say they did their due diligence because there was a firewall in place
• If the deployment of a Science DMZ is going to jeopardize the job of the security officer, expect pushback
The Science DMZ is a strategic asset, and should be understood by the strategic thinkers in the organization
• Changes in security models
• Changes in operational models
• Enhanced ability to compete for funding
• Increased institutional capability – greater science output
Is it possible to get a firewall that can handle 10G flows?
Yes, but just barely, and it will cost around $500K.
• Will this $500K give you any added security over router ACLs?
10G host interfaces have been around for 10 years, and true 10G firewalls for only a couple years
How long will it take for there to be a true 40G firewall? Or 100G?
Thought Experiment
• We’re going to do a thought experiment
• Consider a network between three buildings – A, B, and C
• This is supposedly a 10Gbps network end to end (look at the links on the buildings)
• Building A houses the border router – not much goes on there except the external connectivity
• Lots of work happens in building B – so much so that the processing is done with multiple processors to spread the load in an affordable way, and results are aggregated after
• Building C is where we branch out to other buildings
• Every link between buildings is 10Gbps – this is a 10Gbps network, right???
Notional 10G Network Between Buildings
[Diagram – Building Layout: WAN, perfSONAR, Building A, Building B (many internal 1G links), Building C, with 10GE links between buildings and to other buildings]
Clearly Not A 10Gbps Network
If you look at the inside of Building B, it is obvious from a network engineering perspective that this is not a 10Gbps network
• Clearly the maximum per-flow data rate is 1Gbps, not 10Gbps
• However, if you convert the buildings into network elements while keeping their internals intact, you get routers and firewalls
• What firewall did the organization buy? What’s inside it?
• Those little 1G “switches” are firewall processors
This parallel firewall architecture has been in use for years
• Slower processors are cheaper
• Typically fine for a commodity traffic load
• Therefore, this design is cost competitive and common
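The per-flow ceiling can be illustrated with a toy model of the parallel design: each flow is pinned to one processor by a hash, and no flow can exceed that processor's capacity however fast the external links are. The processor count of 10, the CRC32 hash, and the proportional sharing rule are all assumptions made for the sketch; real devices differ.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_processor(flow_id: str, n_processors: int) -> int:
    """Pin a flow to one processor by hashing its identifier."""
    return zlib.crc32(flow_id.encode("utf-8")) % n_processors


def delivered_gbps(flows: List[Tuple[str, float]], n_processors: int = 10,
                   processor_gbps: float = 1.0) -> Dict[str, float]:
    """flows is a list of (flow_id, demand_gbps); returns the rate each flow gets."""
    load = defaultdict(float)
    for flow_id, demand in flows:
        load[assign_processor(flow_id, n_processors)] += demand
    result = {}
    for flow_id, demand in flows:
        total = load[assign_processor(flow_id, n_processors)]
        # An overloaded processor shares its capacity proportionally.
        share = min(1.0, processor_gbps / total) if total > 0 else 1.0
        result[flow_id] = demand * share
    return result
```

A single data transfer that could fill 10 Gbps gets at most 1 Gbps in this model, while a thousand small commodity flows spread across the processors and pass largely unimpeded. That is the "typically fine for a commodity traffic load" case from the slide.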
Notional 10G Network Between Devices
[Diagram – Device Layout: WAN, perfSONAR, Border Router, Firewall (many internal 1G links), Internal Router, with 10GE links between devices and to other buildings]
Notional Network Logical Diagram
[Diagram: WAN, Border Router, Border Firewall, Internal Router, and perfSONAR, connected by 10GE links]
Security As a System
Component based security is wrong. Needs to be a system.
• E.g. the firewall by itself has limited use, and can be easily broken by a motivated attacker
System:
• Cryptography to protect user access and data integrity
• IDS to monitor before (and after) events
• Host-based security is better for performance, but takes longer to implement. Firewalls are bad on performance but easy to plop down in a network.
• Let your router help you – if you know communication patterns (and know those that should be disallowed), why not use filters?
Campus CI Plan. Make one, update it often. Shows funding bodies you know what is going on and have plans to address risks, and foster growth
Economic argument – if you are non-competitive for grants because you approached security from the wrong side, are you better off in the long run?
Security As a System
Data Provenance
• Some bureaucratic document states that all campus traffic must be a) encrypted and b) passed through a firewall for packet inspection. Why?
- a) What data is private, and what isn’t? Student records, sure. Maybe even sensitive grant-related research. Encrypting all data is not necessary if you stop to think about the data. At least make it a user choice.
- b) Firewalls work when you can’t be sure of a traffic profile (e.g. they stop everything and give it the business). If you know the traffic profile, use that to your advantage. Data from X sites on ports Y, and Z.
• Policy is:
- Written by those that often do not have practical experience
- Outdated almost immediately
• Review (create) CI Plan regularly.
Security As a System
User Management
• What is better: a centrally managed user system for all resources, or independently managed accounts on each machine?
• Central
- Pro: Easier administration when adding/deleting
- Con: Single point of failure
• Individual
- Pro/Con: Breach of one machine doesn’t necessarily imply that accounts on others are compromised (N.B. I think we are all guilty of recycling passwords though…)
• Answer depends on your campus, which is another reason why the DMZ is a blueprint, not a packaged solution
Security As a System
Device Profiles
• All the devices are equal (untrusted)
- Has the number of phones/tablets eclipsed hard campus resources for any of you yet?
- You should absolutely not trust these, or *many* of your hard campus resources
• Some are more equal than others (trusted)
- Does the Physics group have a dedicated admin who ‘gets it’? They know Linux, and have implemented host-based security, plus split out heavy hitters from normal users?
- Give them a fast path (Penn State Model)
- If policy needs to be changed, start handing out certificates to groups that complete a training. CYA…
Sample Security Analysis from the University of Illinois (Nick Buraglio)
How is security handled on campus now?
Firewalls
IPS
ACLs
Black hole routing
IDS
Host IDS
SNMP collection
The first 2 (Firewalls and IPS) are the only ones with performance implications. Can we create a secure environment without them?
Sample Security Analysis from the University of Illinois (Nick Buraglio)
• Management and Security Concerns:
- “Adding visibility is essential for accountability”
- “Timely mitigation of issues is required”
- “Automated mitigation is highly desirable”*
- “Once you’ve broken into a DMZ host you have an outpost in enemy territory”
Sample Security Analysis from the University of Illinois (Nick Buraglio)
University of Illinois management, network engineers, and security staff decided on the following for their Science DMZ:
• Flow Data for accountability (netflow/sflow/jflow)
• SNMP collection for baseline creation and capacity planning
• Router ACLs for best practice ingress blocks
• Passive network IDS for monitoring (Bro)
• Host IDS on all hosts outside the firewall (OSSec)
• IDS triggered black hole routing for mitigation
- Triggers from both network and host IDS
• Bogon (bogus IP address) filtering
Summary So Far
Monitoring is a key part of the story – ensures things work, don’t break, and stay fixed
Emulate the user’s use case: site the monitoring near them, and talk to them regularly about their experience
Security needs to evolve with technology and use case – one size fits all is wrong.
Revisit security choices often. The firewall team doesn’t need to be the bad guys as long as you are working toward the same goal.
The Science DMZ – perfSONAR & Network Monitoring
Questions?
Jason Zurawski - [email protected]
ESnet Science Engagement – [email protected]
http://fasterdata.es.net