A Day in the Life of a NetAdmin - The Calm After the Storm


Post on 07-Apr-2018


Page 1: A Day in the Life of a NetAdmin - The Calm After the Storm

8/6/2019 A Day in the Life of a NetAdmin - The Calm After the Storm

http://slidepdf.com/reader/full/a-day-in-the-life-of-a-netadmin-the-calm-after-the-storm 1/2

A Day in the Life of a Network Administrator

THE CALM AFTER THE STORM…

How did a day of crisis prompt my company to rethink the way we managed our network? Pretty easily. We knew we needed to change, and we’d already had our fair share of problems, but it took a real crisis to spur us into action. With our new network management solution in place, we are shocked at what had been happening on our network for who knows how long! Now we know exactly what’s working and what’s not, which is a great position to be in. As a network administrator or manager, I’m sure you’ve lived through a few crises yourself. If you haven’t already learned from them, perhaps our story will drive you to action too.

HOW IT ALL STARTED

It started with an unforeseen network outage on a weekday, right in the middle of our peak sales season. Sure, our network had gone down before, even for several hours, but never at such a critical time. We basically lost access to our customers on one of our busiest sales days of the year. We do nearly 25% of our annual business during our peak season, so any kind of network availability issue costs the company big dollars.

In hindsight, we know our crisis that day was complex. Multiple things were going wrong, but we had no clear way of seeing if and how they were related. We could only guess at the underlying causes, because we had no unified way of troubleshooting or visualizing the network dependencies. It turned out to be a very educational day for us: we learned from the crisis and changed the way we manage the network. As a result, we’re much better equipped to minimize downtime and outages and, in some cases, even prevent them.

FROM NETWORK SLOWDOWN TO A CRISIS

The first hint of a problem showed up early that morning. One of our sales reps called to say that her Webex session was slow. These slowdowns were usually temporary and resulted from occasional spikes in traffic. We were all set to add a new T1 line the following month, but that wouldn’t help us that day.

In the next ten minutes we got two more calls: one from a sales rep saying that his VoIP phone conversations had become hard to follow, and the other from our Webmaster, who had noticed that our order entry application web pages were taking a long time to load. When it rains, it pours. Next, we got a call from our telesales manager telling us that all sales reps were experiencing noticeable problems on the phone. With this news, we knew it was not an isolated user problem at all; it was now a network-wide issue. We had to act fast.

ABOUT ME

My name is Mark Brown and I’m a Network Administrator. I have a degree in Information Technology and have been in my job for almost four years.

MY COMPANY

I work for a medical device and technology reseller. My boss (Director of IT) and I are responsible for supporting 80 people at our office. We do nearly half of our business online and the rest via our telesales team. For a relatively small company, we have a pretty sophisticated infrastructure and key business apps that need to be available 24x7.

TECHNOLOGY ENVIRONMENT

Our web site and app servers are located in a datacenter upstate, but our email servers, file servers, VoIP servers and demo machines are in house. Our sales team uses Webex regularly, and we migrated to a VoIP system about two years ago. Altogether, we have approximately 20 servers, 90 workstations and phones, and 40 network devices.

BEFORE AND AFTER

For the last six months, we’ve been using a network and systems management solution called WhatsUp Gold. It basically runs our network infrastructure, so I can focus on what I need to get done. I used to be forever behind schedule, even coming in on weekends. Now all that has changed, and it’s a great feeling, personally and professionally, to be ahead of what’s going on rather than behind it.


TROUBLESHOOTING THE OLD-FASHIONED WAY

We checked the stats on the VoIP systems management portal and, sure enough, latency was high and call quality was down. It looked like network congestion. We’d used a free tool called Wireshark to troubleshoot network traffic issues before, so we set it up to monitor the current problem. Yet when we looked at the results from Wireshark, the traffic seemed within range. We began to think the problem could be our external link.

Next step: call the service provider. We spent 30 minutes on the phone troubleshooting our gateway router and external link. Our service provider told us both tested fine, so it wasn’t a link (or internet connectivity) problem after all. During the call they gave us some interesting news: they were seeing occasional bursts of traffic on our external link. Perhaps it was a congestion problem after all.

We checked the traffic again via Wireshark and there it was! We had missed it before, because the traffic was fluctuating wildly. We were most certainly congested, and the traffic was coming from inside our network. There was an extraordinary number of packets carrying HTTP and RTP headers, and much of it looked like unnecessary traffic. Everyone knew how critical the peak season was to our revenues, and they were especially careful not to load the network. So, what would explain this burst of traffic?
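To illustrate why wildly fluctuating traffic can look "within range" at first glance, here is a minimal sketch in Python. It is not how Wireshark reports traffic; the function name `summarize_link`, the sample readings, and the 10 Mbps capacity are all hypothetical values chosen for illustration. It compares the overall average of a series of throughput readings with the worst short-window average.

```python
from statistics import mean

def summarize_link(samples_mbps, capacity_mbps, window=5):
    """Compare the overall average with the worst sliding-window average."""
    overall = mean(samples_mbps)
    # Average each run of `window` consecutive samples.
    window_means = [
        mean(samples_mbps[i:i + window])
        for i in range(len(samples_mbps) - window + 1)
    ]
    worst = max(window_means)
    return {
        "overall_avg": overall,
        "worst_window_avg": worst,
        # Flag congestion if any short window reaches the link capacity.
        "congested": worst >= capacity_mbps,
    }

# Hypothetical per-second readings in Mbps: mostly quiet, with one burst.
samples = [2.0] * 50 + [10.0] * 10
report = summarize_link(samples, capacity_mbps=10.0)
print(report)
```

With these made-up numbers the overall average sits at roughly a third of the assumed capacity, while the burst window saturates the link. Looking only at the average would miss the congestion.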

DIGGING DEEPER

We checked the stats on the VoIP server. Sure enough, the retransmission of failed packets was overloading network I/O, and CPU utilization was high. In fact, some of the calls were being routed to our backup VoIP server, which shared the same system as our order entry app server. We checked the backup server and it was overloaded too. Now we knew why the order entry app was so slow: it faced a double whammy of network congestion and server performance issues.

The crisis was now full blown. We needed to fix the situation fast. We had eliminated any issue from the service provider and from our own end users. The network was congested off and on, and the bursts were being seen on the external link. Clearly, this meant that one or more of our internal devices were communicating with an external site. And then it dawned on us: it was a virus. That would explain the traffic and the connection to the external site. Now it was not just an issue of performance, but possibly a security breach as well.
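The reasoning above (internal devices sending bursts to an external site) can be sketched as a "top talkers" calculation. This is only an illustration under stated assumptions: the flow records, the `10.0.0.0/8` internal range, and the function name `external_top_talkers` are hypothetical, and the addresses come from ranges reserved for private use and documentation. In practice the records would come from a packet capture or flow export.

```python
from collections import Counter
from ipaddress import ip_address, ip_network

def external_top_talkers(flows, internal_net, top_n=5):
    """Rank internal hosts by bytes sent to addresses outside the network."""
    internal = ip_network(internal_net)
    totals = Counter()
    for src, dst, nbytes in flows:
        # Count only traffic that leaves the internal network.
        if ip_address(src) in internal and ip_address(dst) not in internal:
            totals[src] += nbytes
    return totals.most_common(top_n)

# Hypothetical flow records: (source, destination, bytes).
flows = [
    ("10.0.1.15", "203.0.113.9", 900_000),
    ("10.0.1.15", "203.0.113.9", 850_000),
    ("10.0.2.40", "10.0.1.5", 500_000),    # internal to internal, ignored
    ("10.0.2.41", "198.51.100.7", 12_000),
]
talkers = external_top_talkers(flows, "10.0.0.0/8")
print(talkers)
```

A host that dominates this list while users report doing nothing unusual is a reasonable first candidate for quarantine.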

INOCULATING THE NETWORK

Containing the virus by shutting down the external link was not an option. The network was too critical to the business in peak season. The next few hours were a mad rush to find the infected machines and quarantine them. We didn’t have a topology diagram, and we didn’t know what was on each subnet, which made it harder. In the end, we found four infected workstations and one server. After we shut the last one off, traffic returned to normal. We rebooted the primary VoIP server too, and after nearly six hectic hours, we were back to business as usual.

PUTTING A LONG-TERM SOLUTION IN PLACE

This was a wake-up call for my boss, the CEO, and for me. We had discussed purchasing a network management solution before, but we were never able to carve out the budget for it. Now we knew that if we had had a solution in place, we could have saved the valuable hours we spent trying to find and solve the problem. The right performance monitoring would have alerted us to network traffic congestion and persistent high utilization on the servers. We would have known the network topology and been able to visualize the affected subnets and nodes. And active monitoring for signs of failure, like the high number of dropped packets, would have alerted us to impending faults long before they impacted our end users and our business. It didn’t take us long to make the decision to put WhatsUp Gold in charge of our network management. Thankfully, life has been quieter ever since.
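As a rough illustration of what alerting on "persistent high utilization" means, the sketch below raises an alert only when a metric stays above a threshold for several consecutive readings, so a single spike does not trigger it. This is not WhatsUp Gold’s implementation or configuration; the function name `sustained_breach`, the 90% threshold, and the sample readings are assumptions made for the example.

```python
def sustained_breach(readings, threshold, min_consecutive):
    """Return True if readings exceed threshold min_consecutive times in a row."""
    run = 0
    for value in readings:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Hypothetical CPU utilization readings (percent), one per polling interval.
cpu = [35, 92, 40, 91, 93, 95, 96, 50]
alert = sustained_breach(cpu, threshold=90, min_consecutive=3)
print(alert)
```

Requiring consecutive breaches is one simple way to separate a persistent problem from a momentary spike; the right threshold and run length depend on the device being watched.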