a whole new ballgame: leveraging big data in baseball€¦ · the red sox lead their arch rival new...

53
Bruce Yellin Advisory Systems Engineer EMC Corporation [email protected] A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL An article by EMC Proven Professional Knowledge Sharing Elite Author

Upload: others

Post on 25-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

Bruce YellinAdvisory Systems EngineerEMC [email protected]

A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL

An article by EMC Proven Professional

Knowledge Sharing Elite Author

Page 2: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 2

Table of Contents

Introduction – Baseball, Big Data, and Advanced Analytics ...................................... 9

Big Data and Baseball - Players, Coaches, Trainers, and Managers ....................... 13

Sportvision ................................................................................................................ 15

PITCHf/x ................................................................................................................ 16

HITf/x ..................................................................................................................... 25

FIELDf/x ................................................................................................................ 28

Big Data and the Business of Baseball ..................................................................... 33

Player Development .................................................................................................. 36

Revenue From Fans .................................................................................................. 38

Revenue From Media ................................................................................................ 42

Big Data Helps Create Algorithmic Baseball Journalism ......................................... 43

Listen To Your Data - Grady the Goat - The Curse of the Bambino ......................... 46

Conclusion: The Future of Big Data Baseball ........................................................... 47

Appendix - PITCHf/x Metadata ................................................................................... 50

Footnotes ..................................................................................................................... 51

PLEASE NOTE: It is recommended that this paper be printed in color.

Disclaimer: The views, processes, or methodologies published in this article are those

of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or

methodologies.

Page 3: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 3

Big data and advanced analytics touch every facet of our lives. We see it in action with

on-line advertising, cash register printed coupons, and the marketing of airline tickets.

Big data measurements range from hundreds of terabytes to petabytes, with analytics

refining data quantity into quality actionable information. Baseball has

always been a data-rich statistical paradise and is called a data-

driven sport, yet the amount of data it used to generate pales in

comparison to today’s game. No wonder that Joe Maddon, manager

of the Tampa Bay Rays, calls the impact of big data on baseball its second great

renaissance1.

Up until the last twenty years, we judged players based on basic statistics. Now we have

much more in-depth data to make more accurate assessments. Big data can now help

hitters battle pitchers, pitchers battle hitters, and properly position defensive players for

the pitcher-hitter dynamic. Just as big data has helped companies worldwide become

more profitable and efficient, it will also help Major League Baseball (MLB) do the same.

Big data will give teams a competitive and financial advantage, and bring the fan a more

exciting experience. The data and analytics will come from action on the field and

business decisions. It will be available to customers in the ballparks and fans around the

world, giving them insights that were impossible just a few years ago.

Pretend you are at a game on a really humid, windless, hot August night at Boston’s

century-old Fenway Park. The Red Sox lead their arch rival New York Yankees 3-2 with

2 outs in the top of the 9th inning. On 2nd base is Robinson Cano and

on 3rd, Brett Gardner who as one of the fastest runners in baseball

could score the tying run in about 3 seconds. Just 20 minutes earlier,

most of the 34,000 fans stood and sang Neil Diamond’s classic

“Sweet Caroline” with gusto. Now they are standing again, this time

vocalizing their disdain for the Yankees. Their thunderous applause, foot stomping, and

120+ decibel roar can be heard blocks away as the pitch is delivered.

The 92 mph four-seam fastball crossed the 17” wide home plate

and thuds into Jarrod “Salty” Saltalamacchia’s catcher’s mitt. An

instant later, the umpire calls “Strike Two!” against former Red Sox

hero and now Yankees “Evil Empire” foot soldier, Kevin “Youk”

Youkilis. Youk doesn’t like the umpire’s call as boisterous fans, packed like sardines in

Page 4: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 4

bars along Yawkey Way, Lansdowne Street, and Brookline Avenue scream “Yes!” in

near-unison, accompanied by nearly 4 million ESPN viewers across America yelling for

or against the right-handed hitter to succeed or fail. A high-definition TV camera zooms

in at the perspiration dripping from under Kevin’s batting helmet. On the mound,

Boston’s hard-throwing right-handed warrior Clay Buchholz, his double-knit polyester

uniform drenched in sweat, has delivered his 108th pitch of the

evening. Statistically, right handed pitchers would rather face right-

handed hitters, so this matchup favors Buchholz, a former teammate

of Youk.

Buchholz takes a long time between pitches. He must get Youk out because the next

batter is even more dangerous. Over the next 24 seconds, both teams reset for the next

pitch. Before the game, the Yankees’ big data scientists charted the likely type and

location of Buchholz’s next pitch. Youk steps out of the batter’s box waiting for 3rd base

coach Robby Thompson to relay Yankee manager Joe Girardi’s offensive signal.

Thompson touches his cap, chest, thigh, gestures with his hands, then double-claps

telling Youk to expect Buchholz to jam him inside based on a PITCHf/x big data analysis.

Gardner and Cano are also looking at the signals so they know what is being attempted.

PITCHf/x data shows the Sox the location and type of the pitch Youk likes to hit. Even

with the diminishing skills of a 34-year old,

Youk is still a dangerous career .281 hitter who

thrives on pitches in the middle of the plate or

low and inside2. Red Sox bench coach Torey

Lovullo signals the fielders where to play

based on pitching coach Juan Nieves’ signal to

Salty. The impassioned fans begin to holler

again. Salty signals the 6’3” Buchholz a 1-2-3-2-1 with his fingers and moves his mitt

high. The first “1” is a fastball, “2” is an even number for an inside location, and the rest

is a decoy should Cano steal the signs and relay them to Youk. Salty and Buchholz have

agreed with 2 strikes to throw a fastball high and inside to set up their possible follow-up

pitch. Both sides have big data insight into the pitch speed, trajectory, and spin.

Page 5: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 5

Wielding a bat like a gladiator’s sword, Youk is about to go into combat against 9

defenders. Youk and Buchholz are ready, both team’s benches and bullpens are

focused, and the fans are making Fenway Park vibrate with excitement. Buchholz has

an incredibly difficult job - release the ball at the precise moment and with the correct

spin and angle, because if it is off by a fraction of a millisecond or degree it will either

miss the strike zone, or Youk might hit it and tie the game in a mere matter of seconds.

Prior to the game, the relevant big data was sliced and diced and presented on the

pitching coach’s iPad. Various game parameters such as his opponents, nightly rest, by-

inning pitch counts to tally endurance factors, and even the day/night impact on each

type of pitch Buchholz throws could be factored in. This helped Buchholz and Salty

develop a pitch strategy for the game based on the humidity, wind speed, and the

strengths and weaknesses of the Yankees. In

general, if Youk hits a pitch, there is a 40%

chance it will be a ground ball (GB), 40% a fly

ball (FB), and a 20% chance of a line drive

(LD)3. Youk has also done his pre-game iPad

homework and knows Buchholz favors the fastball in a 2-

strike count. He expects the pitch to cross the plate in an area

high center to lower outside, so he prepares himself mentally

to go with the pitch and try to hit it to right field. Buchholz and

Salty are going to attempt to deceive Youk. The runners

increase their lead, Salty shifts from center of home plate to

the right side, and Buchholz, throwing from the stretch position, begins his delivery.

Buchholz throws a ball with an outer shell made from two pieces of leather laced with

108 red stitches. The stitches are 7 hundredths of an inch above the smooth surface and

push against the airflow, causing it to rotate at hundreds to thousands of RPM

depending on the pitch. The red stiches on the seam’s axis can appear to the

batter as a “red dot” when the ball is released. Hall of Famer Reggie Jackson

says “If you can’t see the rotation, you have to be able to recognize if it’s a curve ball or

a slider. And if you can’t, you’re not going to be a major league player. Anyone who can

hit above .270 can see the red dot. Anyone who saw that dot on the ball, you knew it

was the slider. If it was a really big dot, you knew it was a hanger and you could hit it

out.”4

Page 6: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 6

In a tenth of a second, the ball completes 2 revolutions, travels 13’, and air resistance

starts to slow it down5. The stitches break up the airflow and reduce drag, enabling the

ball to go further than if it was smooth. Nicknamed the “Greek

God of Walks”, Youk has a keen eye for the ball with 20-11 visual

acuity (he sees from 20’ what most people see from 11’)6. He

spots the rotation and by processing the pitcher’s release point,

his hands and fingers, the ball’s spin, and probability of what Buchholz likes to throw,

surmises it’s a 4-seam fastball. The ball’s trajectory is inside and high, not low and

outside as Youk had orignally guessed. He must “protect” the plate at all costs and

prevent getting a 3rd stike, so in the next .15 seconds, he “tells” his body to swing

defensively. Salty also recognizes he is out of position and immediately moves his glove

to the left. The ball is now 25’ away from the plate. Youk’s knees flex as his weight shifts

from a flat stance to his back right leg, and his torso twists clockwise. His left foot strides

12” out and he shifts his weight forward as his coiled torso generates a massive amount

of upper body torque. It is a 90 mph fastball. Drag has slowed it by 7 mph. It travels the

55’5” between his release and home in .434 seconds assuming linear decelaration.

It is high. Gravity pulled the ball down 2.92’, offset by a backspin of -.98’, it is about to

cross the plate almost 2’ lower that its release point. Just 23’ and 1/7th of a second away,

it drops at a 100 downward trajectory as Youk’s hips begin a counterclockwise turn. He

begins his swing by maneuvering the bat through its arc and shifts his weight from back

right to front left. If he blinks, he would lose sight for .05 seconds, and it will be all over. A

camera flash might cost him .15 seconds7. Youk’s brain and muscle memory instruct his

hands for a high and inside fastball. His 31.5 ounce model T141 Louisville Slugger round

maple bat is moving at 85 mph and collides with the spinning cork, yarn, and leather-

MUST DECIDE in .03 seconds to swing and

another .03 seconds to swing high,

low, inside, outside

Brain processes information, gauges ball speed and

location.08 seconds

The batter sees the ball, sends image to brain

.1 seconds

Drag decelerates Buchholz’s 90 mph fastball to 83 mph as it

approaches Youkilis and crosses home plate in .434 seconds

Brain signals

the legs to

start stride

.03 seconds

2.75" diameter bat

must hit 3" diameter

ball within 1/8" of

dead center in ±.007

seconds.01s swing starts

.05s can still check

.15s power swing

55' 50' 45' 40' 35' 30' 25' 20' 15' 10' 5' 0'

Sec Traveled:.434s .27s .24s .18s .1s

Feet Traveled:55' 32' 29' 22' 13'

Page 7: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 7

covered 5.2 ounce hardball for a millisecond “…with 6,000-8,000 pounds of force”8.

“SMACK” is heard as the impact produces a Batted Ball Speed (BBS) of 120 mph.

His timing has to be within 5 milliseconds in one of

two dimensions to hit a fair ball. Fortunately for the

Red Sox, the swing’s 2nd dimension is off. The ball

misses the bat’s sweet spot by just over ½” at -220

instead of 390 and Youk harmlessly fouls it off.

The count remains at no balls, 2 strikes. The runners go back to their bases and

everyone resets, including the fans. Red Sox manager John Farrell, a former pitcher,

signals Salty for a curveball down and away based on how Youk handled the last pitch.

Yankee manager Joe Girardi, a former catcher, also believes it will be a curve and

signals that to Robby Thompson. TV viewers are told to expect a changeup.

The Red Sox bench coach gestures his outfielders Jonny Gomes, Jacoby Ellsbury, and

Shane Victorino to the right since Buchholz will toss

a curveball low and outside. They are excellent

outfielders, but the coach knows from FIELDf/x9 big

data that correct positioning could make the

difference between a hit and an out. Outfielders

generally cover 23’ in a second. Their reaction time

is 0.35 seconds forward, 0.52 seconds sideward, and 0.65 seconds backward creating a

defensive oval10.

Buchholz agrees to Salty’s

finger-signals for a curveball.

The fans are standing and

the noise is deafening.

Before the game, Youk’s iPad displayed

Buchholz’s PITCHf/x chart to the right11.

His hitting coach, Kevin Long, had said:

1) Buchholz tires after 105 pitches 2) Buchholz likes to use his curveball 3) Buchholz’s curveball slows down to 74-75 mph after 84 pitches.

Page 8: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 8

Youk readies for a slow curveball. The pitch is on its way. But instead

of low and away, Buchholz’s 110th pitch is coming down the center of

the plate. With an unquenchable desire to be tonight’s hero, Youk’s bat

strikes the curveball at 390 and rockets it towards deep center field.

In the physics of baseball, the mass and velocity of the bat and ball

determine the “conservation of momentum” and predict the equal and

opposite forces at that instant. The “coefficient of restitution” or the bounce

of a ball after it is deformed by the bat, causes the ball to gain forward motion because

of kenetic energy. The bat deforms slighty before springing back to its natural shape –

notice its shape versus the yellow line. Gardner and

Cano are off with the “crack of the bat”. Ellsbury, out

of position, instinctively turns to his left, sprinting

towards the green wall. The fan’s chanting instantly turns into a prayer.

If the ball falls in, the Yanks will lead as Gardner is nearing home and Cano is sprinting

from 2nd with the go-ahead run. Some in the crowd scream to the baseball gods for

Ellsbury to make the catch. Others turn silent as all hope is seemingly gone. Based on

where he was playing, he needs 4.4 seconds to reach the hit, but only has 4.3 seconds

as the ball sails over his head. Gardner ties the score at 3-3. Shortstop Stephen Drew

and 2nd baseman Dustin Pedroia race to short center field to catch the relay throw. Cano

has crossed 3rd base on his way home. Youk’s

smash hit the base of the wall near the “Stop N’

Shop” sign and ricochets away from Ellsbury. In this

battle of wills, Cano crosses home for the Yankee

lead as Youk lumbers into 2nd base with a double.

The Sox, with a glorious victory seemingly in hand, now trail by a run in the top of the 9th

with 2 outs. Left-handed reliever Craig Breslow replaces Buchholz and strikes out left-

handed designated hitter Travis Haffner to end the inning.

The Yankees in the bottom of the 9th bring the all-time leader in

saves and future Hall of Famer, right-handed Mariano Rivera, to the

mound. His devastating cut fastball fools the best of hitters. The cutter’s two

finger fastball grip is off-centered against the ball’s seams. Delivered like his fastball, it

has dynamic late movement half-way to home – the distance for which a batter has to

Page 9: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 9

decide to swing or not.12 It breaks down and in towards a lefty or away for a righty just

before it reaches home. His WHIP

is 1.00013. Legendary Pedro Martinez, by

comparison, had a career 1.054 WHIP14, while the 2013 MLB average was 1.3015.

First to bat is Salty, a switch hitter batting left-handed. The Yankees shift players to the

right based on FIELDf/x charts. PITCHf/x’s heat map shows Rivera’s distinct cutter

locations when he faces a lefty or a righty,

and how well he succeeds at avoiding the

center of the plate16. During his career,

lefties are batting just .209 and righties

.21517. Salty pops up the 1st pitch cutter to

2nd baseman Robinson Cano for the 1st out.

Right-handed hitting 3rd baseman Will Middlebrooks is ready to face Rivera. He is fed a

steady diet of outside and low cutters and hits a ground ball to 1st baseman Lyle

Overbay for the 2nd out. Boston desperately needs a run to tie. Left-handed hitting

shortstop Stephen Drew, batting .249 with 12 home runs, is their last hope. A key to

Rivera’s success is that batters who do hit his cutter do not hit it

well. Rivera jams Drew inside with a cutter on the 1st pitch that

breaks his bat – not uncommon against Rivera. The ball is tapped

to Overbay for the 3rd out as the Yankees beat their arch rivals.

The crowd is silent.

Introduction – Baseball, Big Data, and Advanced Analytics

The modern game of baseball dates back to the late 1840’s and is one of those pure

sports where statistics has thrived. Attend a game today and a giant scoreboard displays

the batter’s picture, his batting average, home runs, RBIs, and other information. Fans

on either side of you may have bought a program with the latest statistics on each

team’s players, as well as articles, paid advertisements, and the all- important blank

Page 10: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 10

scorecard for each side. To the right is a fan’s

scorecard used in the late 1800’s. Serious fans

pencil in the player’s last names and uniform

numbers, then summarize each pitch, foul ball, hit,

out, walk, error, and other events in coded

shorthand eerily close to those used at the dawn

of the game. Fielders have positional numbers

such as 1 for the pitcher, 2 for the catcher, and so

on. While the concept hasn’t changed much, the scorecard is now available as an

Android and iPhone application that helps keep score and links fans into the powerful

work of baseball statistics and sabermetrics (“Society for American Baseball Research”

plus “metrics”).

Core “counting” statistics such as at bats, runs, hits, singles, doubles, triples, home runs,

and more originated with the game, and became integral to America’s pastime with the

formation of the Elias Sports Bureau in 191318. In 1947, the Brooklyn Dodgers hired a

full-time statistician. Modern-era baseball

managers have used statistics to run their

team, perhaps none more famous than the

fiery Hall of Fame Baltimore Orioles manager

Earl Weaver. Well before the birth of the

personal computer, Weaver tracked pitcher-

hitter matchups on index cards like the one on

the right to fine-tune his platooning system19

and make pitching changes in the 1960s.

Managers like Weaver kept charts like the

one on the left which shows the 1st and 5th

innings were best for scoring runs20, perhaps

because the pitcher has yet to reach his

groove, then tires, or has already shown the

batters his repertoire a few times that game.

They kept situational charts, such as the

best time to bunt21, or with a runner on 3rd base and 1 out, they could expect to score on

average .920 runs22. Bill James evolved the statistical art through his baseball books and

Page 11: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 11

Runs Scored from Situations

Runners No Outs 1 Out 2 Outs

--- .537 .294 .114

x-- .907 .544 .239

-x- 1.138 .720 .347

--x 1.349 .920 .391

xx- 1.515 .968 .468

x-x 1.762 1.140 .522

-xx 1.957 1.353 .630

xxx 2.399 1.617 .830

Value ?

2013 # Cost # Cost Per

Games 160 $20,625 44 $636,364

Appearences 673 $4,903 181 $154,696

Walks 72 $45,833 23 $1,217,391

Hits 167 $19,760 38 $736,842

Runs 103 $32,039 21 $1,333,333

Home Runs 53 $62,264 7 $4,000,000

Chris Davis Alex Rodriguez

is credited with the name sabermetrics in 1980. Sabermetrics

derived new statistics from the basic ones, such as Slugging

Percentage

, On-Base Percentage

, and On-base Plus Slugging

.

Michael Lewis’s “Moneyball: The Art of Winning an Unfair Game“ documented the birth

of modern baseball analytics and the actions of Oakland Athletics general manager Billy

Beane. Beane bucked the preconceived notions of his scouts and explored analytics to

find a better, more affordable way to assemble a winning team. Rather than

concentrate on batting averages, RBIs, and stolen bases, he focused on new

variables like on-base percentage and slugging percentage to find under-

valued talent. Moneyball became legendary when Brad Pitt starred in a 2011 movie

based on the book. Moneyball has become an adjective for finding hidden value, and a

guiding mantra that benefits individuals, corporations, and nations, not just baseball

teams.

In 2002, Beane could only spend $41 million to compete in a league of expensive stars

and richer teams like the Yankees (with their $125 million budget). That year, both the

Athletics and Yankees won 103 games. One of Beane’s approaches was to find

undervalued, patient players who helped produce runs by walking23, in lieu of overvalued

players with high metrics such as batting average and stolen bases. Oakland made the

playoffs from 2000 through 2003.

Chris Davis is one of the better undervalued stars these days24. He was a 5th round 2006

draft pick of the Texas Rangers, who traded him to the Orioles in 2009. By 2013, Davis

was making $3.3 million or 3.5% of the team payroll, and led both leagues with 53 home

runs. In comparison, Yankee’s Alex Rodriguez is the highest paid player at $28 million a

year – more than the Houston Astros payroll25. How do

they stack up?26 Davis was “paid” $62,264 for each of

his 53 home runs while Rodriguez got a whopping

$4,000,000 for each of his 9 homers.

Page 12: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 12

Concepts like big data have turned raw data into useful information for the fans, players,

coaches, managers, and other organizations that promote baseball. Over the last 5

years, sensors, high-speed digital photography, and Doppler radar have led to an

explosion of baseball data.

Fans especially love the new

and exciting ways to follow their

heroes, such as this real-time

big data Internet view of balls

and strikes crossing home at

the July 6, 2013 Yankee-Oriole

game.

While baseball is still just a game, batters are always looking for an advantage over the

pitcher and vice versa. If a batter knows what type of pitch they will get, they gain a

significant upper hand by being better prepared to hit it! Big data technology may get

them that leg up. At the 2012 Sloan Sports Conference, a paper “Predicting the Next

Pitch“27 discussed an advanced analytics approach whereby batters could dramatically

increase their knowledge of what the next pitch would be. Rather than guessing or

looking for a specific pitch, like a fastball, their big data analysis of 2008 data showed

they could increase the chances of getting that fastball from 18% to an amazing 311%!

(Note: MLB rules limit players and coaches to printed reports during a game – i.e. no

electronic real-time data can actively influence play on the field28.)

Moneyball’s predictive statistics created radical theories that brought significant changes

to a team’s structure. Now, big

data is adding new insights and

dimensions to the game that never

existed before. Fans expect more,

and big data is ready to deliver29.

They are exposed to the “how” and

“why” of player actions, leading

them to experience the game at a whole new level. Teams are employing big data to find

overlooked domestic and international baseball talent they never knew existed, as well

Page 13: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 13

as finding relationships between people and merchandise, ticketing, and the media. That

is big data. Beane’s statistics are just a part of it.

Big Data and Baseball - Players, Coaches, Trainers, and Managers

We have seen a fraction of the statistical metrics and analytics employed by baseball.

Even before computers, baseball was about tracking and improving. With the advent of

computers, new metrics and ways of thinking

about the sport emerged. In 2007, tools like

PITCHf/x escalated the amount of baseball

data, dwarfing the data collections that went on

before it30. Pure statistics can distort the game –

a run, hit, or out. PITCHf/x and tools like it add

insight and enhance the game. In just the last 5 years, advances in computer science

have unlocked a Pandora’s Box of data and analysis.

Players always relied on their coaches, and in the early years of the game, some had

photographs of their opponents or their own swings. Then came motion pictures shot on

film. Eventually, each team developed a VCR tape library of each hitter’s at-bats in

addition to footage of opposing pitchers and their arsenals. Hall of Fame right-fielder

Tony Gywnn played 20 years for the San Diego Padres with a career .338 batting

average and a .847 OPS31. Going back to 1983, he recorded his own games on video

tape, played it back on one VCR and recorded his swings on another VCR. He would

then further edit the “swings” tape into 3 more tapes – good at-bats, at-bats with hits,

and just the swings that produced the hits32. Gwynn was a trendsetter in this regard.

These days, pitchers can graphically review the opposing lineup before the game

through a scouting big data iPad. They examine batter

strengths and weakness against the variety, location,

and speed of pitches they throw to see if it gets them a

strike out, pop up, or a (less fortunate) home run in game

situations. After the game, the data helps the pitcher see

what he did well and areas needing improvement. A

hitter can likewise study an opposing pitcher and dissect each pitch to gain an

advantage the next time he faces him33. It can improve the predictability of pitches by

Page 14: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 14

2009 FB% 2011 FB%

All 74.5 72.1

1-0 69.5 67.3

2-0 82.1 79.7

2-1 69.1 66.8

3-1 85.8 83.5

Fastball decline - hitter's counts

12.5% to 50% by leveraging data relationships such as “…Pitcher/Batter prior,

Pitcher/Count prior, the previous pitch, and the score of the game.” 34 Bloomberg Sports

offers this as a software subscription to players.

Big data has already impacted the game. It used to be that when a batter was ahead in

the count, they expected a fastball. Between 2009 and 2011, the percentage of fastballs

has dropped almost 2.4%35. The pitcher and pitching coaches are

employing dynamic, less predictable mixtures than those that have

occurred over the last 150 years of the game.

Big data technology helps bring excitement to baseball. Technology researcher Gartner

Group says “Big data is high-volume, high-velocity and high-variety information

assets that demand cost-effective, innovative forms of information processing for

enhanced insight and decision making.”36 With metrics that change over time, high-

volume is measured in hundreds of terabytes or petabytes, high-velocity in milliseconds,

and high-variety involves complexity, using rows and columns to process structured,

semi structured, as well as unstructured data such as social media and tweets. You

know you have a big data problem when the dataset’s size exceeds the ability of typical

database software tools to capture,

store, manage, and analyze it. Big

data analytics company Vertica

uses the illustration seen here to

define the typical dataset37.

A fundamental part of the big data definition is management of disparate data. A typical

sabermetric database might have hundreds of thousands of records, but they run into a

problem when they couple that database to scouting and medical reports, video, radar,

camera tracking, crowd size, crowd noise, home/away, defensive shifts, hotel

accommodations, length of road travel and duration, slumps, weather, health and fitness

data, team chemistry, financial contract length and value, bonuses, trade data, and other

information. In a few years, wearable sensor player data may need to be integrated (they

are unlikely to be put in the balls themselves). And that is just the start. Let’s see how

various products use big data to help analyze, predict, and optimize player performance,

and revolutionize baseball by supplying data that never existed before.

Page 15: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 15

Sportvision

Emmy Award-winner Sportvision is a sports broadcasting technology company that

introduced K-Zone in 2001, the same year the Gartner Group

published their big data definition38,39. Moneyball had not yet hit the

presses, “the” baseball tool was Microsoft Excel, and all MLB statistics

totaled less than 2% of the baseball data collected today. Sportvision’s COMMANDf/x,

FIELDf/x, PITCHf/x, HITf/x, and SCOUTrax systems thrive on big data in baseball,

capturing 2.5 terabytes of data a game and 6 petabytes for all regular season games

played by 30 teams. COMMANDf/x analyzes the position of the catcher’s mitt to

determine if the pitcher hits the agreed upon spot. FIELDf/x digitally tracks the exact

location of every player and hit, and for the first time, brings movement analysis to the

game. Sportvison general manager Ryan Zander says “It’s almost overwhelming how

much data we’re creating.”40 While some of the data is non-action, such as pitcher

warm-ups, birds flying, and players waiting between innings, overall they collect 2-2.5

million records per game.

Sportvision is on the cusp of the big data movement. Much of their revenue comes from

selling data generated from missile tracking concepts. Its 2006 PITCHf/x product used

in-house compute and storage technology. Along the way, they changed computing

models and currently employ multiple replicated cloud databases using Amazon Web

Services (AWS) and business intelligence/visualization software from Domo41.

Sportvision handles streaming data from every MLB game, sometimes concurrently, and

performs real-time analysis as well as split-off data feeds for live Internet and TV action.

Sportvision focuses on giving fans a richer experience while optimizing the management

and marketing of sports businesses. While this article focuses on MLB, their amazing

enhancement techniques are also employed in football,

hockey, golf, basketball, soccer, stock car racing, and the

Olympics. For example, their Emmy Award-winning “1st

and Ten Line”42 uses a yellow virtual marker to show

football fans where a team has to go to reach a first down.

Their MLB product suite illustrates how big data computer vision impacts the sport.

Algorithms measure 3D objects such as people, baseballs, and background shapes by

leveraging artificial intelligence, cognitive science (study of the mind and its processes),

Page 16: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 16

and machine learning. PITCHf/x visualization of the ball crossing the plate and

underlying data reach fans in stadiums and at home through the Internet, mobile apps,

and televised games. Trajectory and ball speed off the bat is captured and portrayed by

HITf/x. Sportvision data allows us to objectively answer questions that burn in the hearts

of baseball aficionados everywhere, such as:

Who is the best center fielder when playing shallow?

Which double-play shortstop/2nd baseman combinations are the most efficient?

Where should the infielder position himself for a relay throw from the outfield?

Who has the weakest/strongest throwing arms from the outfield?

While the Oakland Athletics used Moneyball concepts to find undervalued players, tools

like PITCHf/x and FIELDf/x now employ advanced big data metrics unavailable in that

era. These tools precisely measure player performance and calculate if they are in the

top 10% of batted ball velocity, use optimal paths 85% of the time to make a catch, or

whether they have 98% throwing accuracy – a real-time refinement of Moneyball.

In the future, big data sports analytics has to meet new challenges, such

as the demands that 3DTV will have when depth is added to a

broadcast. While 3DTV has had fits and starts, fans thrill to the batters

view of the ball as it approaches the plate, or root for a fielder trying to catch up to a

batted ball. Higher definition video is just around the corner. 4K Ultra HDTV has 8-mega

pixels or 4X the resolution of a 1080P 2-mega pixel HDTV, and will put a strain on this

real-time big data system. Who knows – fans may want to see the stitching on the ball or

the grains of wood on bats when this staggering level of detail becomes common place.

PITCHf/x

Sportvision’s PITCHf/x system tracks every pitch

in every game using three permanently mounted

digital cameras to follow each pitch at 30 frames

per second (FPS) through a centralized tracking

system. First installed in 200743, one camera is

placed high over 1st base and the other high over

home plate. A third centerfield-mounted camera

stays focused on home plate to establish the

strike zone since it differs for each batter.

Page 17: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 17

For each time t, 9 values are collected: 3 positions x, y, z 3 velocities vx, vy, vz 3 accelerations ax, ay, az

Each camera is attached to its own computer allowing each image to

be instantly scanned for a baseball. If one is found, the next frame is

checked for a ball in-flight to make sure it is a pitch. The cameras

triangulate the ball’s exact mid-flight location through real-time

digitally processed pattern-recognition of each video frame. Raw

data is transformed into (x,y,z)

coordinates of the pitch. PITCHf/x

calculates 60 data points per pitch including the release point, spin,

speed, break, arc, location, and where the ball crosses the plate to within an inch. It

algorithmically determines 8 pitch types: fastball, sinker, curve,

changeup, slider, split-finger fastball, cut fastball, and a

knuckleball44. The real-time data is used by broadcasters and

internet providers like ESPN and Bloomberg, and augmented

by staff at the game with box scores and live play-by-play.

The combined data stream is transmitted to servers and the video frames are converted

into XML data files. Fans in the ballpark can follow the PITCHf/x action on the center

field screens or almost instantly on their iPads. At home, fans see PITCHf/x enhanced

TV graphics as the ball is in flight (broadcasters may call it by their trade name). Multiply

this gigabyte-per-pitch activity by the often simultaneous 15 league games at the same

sub-second, all processed by Major League Baseball Advanced Media (MLBAM), and

you get an incredible high-volume and high-velocity big data feed. The XML data is

available online for free within seconds. This bonanza of baseball insider information

includes the details of a pitcher's repertoire and their approach to a batter.

The ball’s speed, location, and trajectory are precisely measured as soon as the pitcher

releases it until it crosses home plate. It is algorithmically possible to discern the effect of

a fast versus slow pitch, the ball’s break up or down, left or

right, and its forward or reverse spin. As a

result, its initial velocity, gravity, drag, and

the Magnus force45 can be quantified and

displayed. Positional data allows the ball to

be shown from above looking at home, a

pitcher’s view towards home, and the

batter’s view looking at the pitcher.

Page 18: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 18

PITCHf/x produces statistics that sabermetricians only dreamt about. More than strike,

ball, and hit outcomes, it gives information on the ball’s path and exactly where it

crosses the strike zone. A trail can be

graphically added to the ball for instant

replays46. The data also allows for follow-

up coaching analysis or further fan

enjoyment. Cameras log 36 events per

pitch, and there are roughly 300 pitches a game not counting 8 warm up pitches every

half-inning. With 15 match ups (30 teams) and 162 games a year, the system records

over 26 million real-time XML records a season. Compared to the older radar gun

approach, PITCHf/x is analogous to visiting a doctor for a continuous electrocardiogram

as opposed to just checking your pulse. The radar gun gives a single pitch observation

while PITCHf/x gives the set of flight dynamics.

PITCHf/x allows for extensive automated and accurate data on velocity, pitch type,

location, and result. It has tracked over 4 million pitches from 2007-2011 and is used by

16 leagues across the U.S., Canada, South Korea, and the Dominican Republic47.

Unfortunately, the system cannot go back in time to analyze pitching great Sandy Koufax

or home run slugger Babe Ruth – we have incomplete views of these legends.

Coaches, managers, and scouts use various tools to eke an out

an extra win and better their playoff chances. Fans can use the

same technology to enhance the game. Every year there is

something new and exciting to learn about this game. For

example, big data tools like PITCHf/x are available on smart

phones. Other tools slice and dice the data into amazing analytics

such as pitch plots of horizontal movement versus speed.

Fans love the way PITCHf/x shows the exact spot where the ball crosses the graphical

plate48, especially when they think the umpire has blown the call, which happens about

every six called pitches that do not involve a swing according to baseball author Tobias

Moskowitz49. Truth be told, making 300 split- second correct decisions a game on a ball

darting through the air at 90 mph is hard to do. It is also possible the umpire is biased

against a batter or pitcher and calls a strike on an actual “ball” and vice-versa. PITCHf/x

is invaluable in determining the accurate and optimal strike zone location. It shows which

Page 19: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 19

Year W L ERA H/9 HR/9 BB/9 SO/9 OPS

2008 0 0 1.93 5.8 0.6 2.6 7.7 .501

2009 10 7 4.42 8.3 1.2 3.8 7.2 .716

2010 19 6 2.72 7.3 0.6 3.4 8.1 .637

2011 12 13 3.49 7.7 0.9 2.5 8.7 .659

2012 20 5 2.56 7.4 0.7 2.5 8.7 .602

2013 3 5 3.94 9.5 1.0 1.6 7.5 .720

batters swing at bad pitches, and it is easy to chart strike zone consistency

by pitch type and velocity. In this PITCHf/x illustration, Ichiro Suzuki of the

Seattle Mariners felt a pitch was outside (it is) and was ejected from a

game after disagreeing with an umpire’s “strike” call.

Referencing the charts below, umpire Eric Cooper called the action for a

2012 New York Mets (squares) game against the Orioles (triangles)50. The

chart on the left was against left-handed batters, calls on right-handed batters in the

chart on the right. Red is a called strike and green is a called ball. It was remarkable that

Cooper was correct in most of his judgment calls. Big data in action!

PITCHf/x can also help analyze the ball’s in-flight behavior. In this 7 pitch sequence, the

speed (SPD), break (BRK – the arc

of the ball - similar to gravity), left-

right horizontal movement caused by

its spin (PFX), pitch type, and the

result are shown. Notice the big

difference between the speed and the break of the curveball versus the fastball.

Here is another example of PITCHf/x helping to pinpoint pitching issues. David Price is a

27-year old left-handed Tampa Bay Rays ace who won the

2012 American League Cy Young award with a 20-5 record, a

2.56 ERA, and a .602 OPS. He made $4.3 million in 2012 and

$10 million in 2013. Great things were expected of him, so

fans were concerned with his 3-5 record over the first 12 games of 2013. His ERA

jumped up 53%, hits per 9 innings were up 28%, home runs/9 were up 43%, strike

outs/9 were down 14%, and OPS was up 20%. What was wrong? If this was just a short-

Page 20: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 20

term issue, fans would not have been overly concerned, but the problem occurred start

after start. BrooksBaseball.com’s PITCHf/x data showed his four-seam fastball (little

movement) was down 1.9 mph and his sinking fastball (downward and horizontal

movement) down 2.3 mph. The slower these pitches are, the less effective his changeup

is. His reliance on faster pitches dropped 5% while favoring slower pitches by 5%.

With PITCHf/x data publically available at no

charge, a fan used it to create this custom graphic

to the right showing Price’s issues51. From a

batter’s perspective, you can see the movement

and ball’s speed as it leaves Price’s hand. The fan

noted it is not good to have a fastball and sinker

(circled in red) look similar to the hitter and that

Price’s curveball was more effective with a larger

drop in 2012 than during his first 12 appearances

in 2013.

R.A. Dickey, winner of the 2012 National League Cy Young award, is one of the premier

knuckleball pitchers. A knuckleball is aerodynamically difficult to hit. It rotates once or

twice after release whereas a fastball rotates 16-17 times. Generally thrown between 60-

80 mph, a knuckleball wobbles on its way to home, and is further

influenced by the wind and drag. On June 18, 2012, Dickey faced the

Orioles and had 13 strikeouts in a 5-0 shutout. This PITCHf/x

visualization shows his 114 pitches from the catcher’s viewpoint, with

84% of them being knuckleballs, - red strikes and yellow balls52.

Page 21: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 21

Verlander

2011 # %

min

Vel

max

Vel

avg

Vel

Fastball 2,139 55% 91.1 101.4 95.0

Curveball 729 19% 74.4 84.9 79.4

Changeup 719 18% 82.6 91.0 86.8

Slider 327 8% 81.9 90.6 85.9

Total 3,914

PITCHf/x produces readable XML data which can be easily manipulated a myriad of

ways and is described in upcoming pages.

TexasLeaguers.com uses the exact same PITCHf/x XML data to produce a bird’s eye

view of the pitches53. On the left is Dickey’s release point. The right-hander stands to the

3rd base side of the pitching rubber. His knuckleball “* KN” is incredibly straight, with

virtually no left-right bending after it is released. On the right, the ball dramatically drops

as it crosses home. In its last 15’ of flight, it drops almost 3’ making it extremely difficult

to hit! PITCHf/x provides 50’ of flight from the pitchers release to home, but visualization

services like TexasLeaguers calculate the data closer to the release point and show 55’

of travel. “The ball's path is calculated by plugging the acceleration and velocity data

provided by MLBAM into the standard equation for acceleration: x = ax0*t2 + vx0*t + x0”

54.

Let’s leverage PITCHf/x big data to explore the success of the 2011 American League

Cy Young winner, Tigers right-handed pitcher, Justin Verlander. That year, he had an

impressive 24-5 record, averaged over 7 innings per

game, held batters to a low .191 average, and struck out

over a quarter of his opponents. PITCHf/x shows nearly

half of his 4,000 pitches were 95 mph fastballs, mixing in

a 79 mph curveball (19%), 87 mph changeup (18%), and an 86 mph slider (8%)55.

His success depends on his devastating circle changeup. It is thrown like a

fastball but with a dramatically different grip. You can see the index finger

and thumb form a circle that doesn’t touch the seams, and the ball is in the

palm of his hand. Verlander’s changeup is 8+ mph slower than his blazing fastball, and

that is the key. His delivery fools the batter into believing the pitch is a fastball, but it

takes .037 seconds longer to arrive. That is enough to throw off the batter’s timing.

When batters expect a changeup, their reaction time to the fastball is too slow. PITCHf/x

shows that the difference in velocity makes his changeup very effective.

Page 22: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 22

Spin Rate RPM AVG SLG

Swinging

Strike

Low Spin <2300 .244 .403 8%

Avg Spin 2300-2500 .185 .340 10%

Plus Spin 2500-2700 .178 .251 12%

High Spin >2700 .166 .212 15%

The more a ball spins, the harder it is to hit. This chart shows batting averages, slugging

percentages, and swinging strike effectiveness by spin rates56. Verlander attributes part

of his success to his 3,004 RPM curveball which rotates 23%

more than MLB average of 2,450 rpm. He threw it 19% of the

time in 2011. PITCHf/x “…calculates the spin rate of pitched

balls based on the fit trajectory,”57 and is an approximation. Video doesn’t directly

capture the rotation, just where the ball is in each frame. It takes more advanced camera

technology with higher FPS and data rates to count the rpm of a white ball with red

seams using video. Fortunately, PITCHf/x can be augmented with additional data such

as TrackMan’s radar system that does capture the spin of the ball58 without a camera.

On the left is Verlander’s tight release point cluster for 201159. The pitch stratification on

the right is very consistent. As a right handed pitcher, his 95 mph fastball and his 86 mph

changeup stay 5-10” to the left part of home plate – speed is the only difference.

The game footage illustration below shows Verlander with PITCHf/x insights from 4

distinct pitch types superimposed on each other with captions60. He mixes 100 mph

fastballs, 79 mph curves, 86 mph changeups, and 88 mph sliders. In this at-bat, veteran

Minnesota Twins Justin Morneau shows multiple swings and misses. All 4 pitch types

had the same motion and release point. The screen shot on the right shows the pitches

have significantly separated with different velocities. In baseball slang, that is called

nasty.

Page 23: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 23

Lincecum

Pitch ID

Start

Speed

End

Speed Break Rotation

Release

Point

2459447 82.9 75.7 5.5 1,042.1 6.12

2576000 86.1 80.0 3.7 689.8 6.05

Coaches today look at release points for all their pitchers when their velocity drops. It is

hard to spot a 3-6” drop from the dugout, but real-time data shows this instantly. The

ultimate use case is in the field of kinesiology where predictive analytics could prevent a

shoulder, torso, elbow, wrist, or hand injury when a “bad” pitching motion is detected.

Suppose you are a right handed batter

facing the Giant’s 2008-09 National

League Cy Young winner, righty Tim

Lincecum, and you want get a feel for his

slider. PITCHf/x big data can display

images of batters similar to you and show

what you might see in a game. The

search finds event IDs #2459447 and

#2576000. This visualization of slider #2576000 leaves Tim’s hand at 86.1 mph and has

slowed to 85.5 mph on its way to home61.

This illustration shows slider #2459447

was just outside at 75.7 mph, spun at

1,042 rpm (8 rotations from its release

until it crossed home), broke to the right

5.46”, and is was called it a strike. The 2nd

slider #2576000 crossed the plate much

lower and at 80 mph, rotated at 689 rpm

(5 rotations), had a little less break to the right at 3.65”,

and the was also called a strike.

Page 24: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 24

Detroit Tigers Miguel Cabrera is a feared hitter. As the

American League 2012 Triple Crown winner, PITCHf/x

shows his hitting strengths and weaknesses62. He crushes

low and inside pitches. Opposing managers use spray

charts to position players where hits are likely to go63. The

legend shows if the ball was an out, error, or hit. In 2012, if

an opposing manager with this small sample size had his

pitcher throw Miguel a curveball, he might use the

following spray chart and have

his 3rd baseman and shortstop

play in the blue and red oval

areas on the right. Spray

charts can be automatically

created by coupling HITf/x and

PITCHf/x data64.

PITCHf/x is also useful to teach “framing”. As the ball crosses the plate, an algorithm can

determine if the umpire will call it a ball or a strike. Good catchers “frame” or move their

glove once they catch the ball in order to influence the umpire’s decision. For example, if

borderline pitches are called strikes 50% of the time, increasing the called strike success

to 60% through framing gives a team a terrific advantage. One study showed that

framing saves an average of 20 runs a season65. Another study found that Rays’ catcher

Jose Molina saved 73 runs one year by framing the pitch66. That difference could mean

a couple of extra victories, potentially worth millions of dollars to a team. Based on

PITCHf/x data, Molina frames 13.4% of balls into called strikes67.

PITCHf/x data is free and in XML format, so let’s take a

much closer look at it. The data is found at

http://gd2.mlb.com/components/game/mlb/. Pick a year

>=2007, a month, and a day, and find the game you want

and the inning. This example is from 9th inning of game 4

of the October 28, 2012 World Series at Comerica Park,

with the Giants 1st baseman Brandon Belt facing Tigers

pitcher Phil Coke. The full URL is

http://gd2.mlb.com/components/game/mlb/year_2012/month_10/day_28/gid_2012_10_28_sfnmlb_detmlb_1/inning/inning_9.xml.

Page 25: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 25

Open the inning_9.XML. The metadata definitions are in the appendix. Coke throws 6

pitches: Ball, Ball, Foul ball, Foul ball, Ball, and Swinging Strike. Scroll down to this

sequence of -<atbat num=”65” for the pitches thrown to the PlayerID #474832 Brandon

Belt (Elias Sports Bureau/MLB numbering scheme). The pitcher Phil Coke is #457435.

Let’s focus on the XML data for pitch #4 that Belt fouls off in the red oval:

<pitch des="Foul" des_es="Foul" id="507" type="S" tfs="230849"

tfs_zulu="2012-10-29T03:08:49Z" x="109.01" y="132.11" sv_id="121028_230849"

start_speed="94.4" end_speed="86.0" sz_top="3.32" sz_bot="1.53"

pfx_x="8.23" pfx_z="10.3" px="-0.315" pz="2.919" x0="2.562" y0="50.0"

z0="6.035" vx0="-10.7" vy0="-137.956" vz0="-6.156" ax="15.708"

ay="33.556" az="-12.449" break_y="23.7" break_angle="-43.3"

break_length="4.1" pitch_type="FF" type_confidence=".676" zone="1"

nasty="41" spin_dir="141.469" spin_rate="2651.720" cc="" mt=""/>

The coordinates for the x-axis are to the catcher’s right, y-axis

towards the pitcher, and the z-axis vertical upward68. XML is hard

to visualize, but it says the four-seam fastball lost 8.4 mph

[start_speed = "94.4" minus end_speed = "86.0"] and

is Fouled off [pitch des = "Foul"].

HITf/x

Whereas PITCHf/x follows the pitcher’s delivery, HITf/x follows the ball’s trajectory off

the bat. HITf/x reprocesses the same PITCHf/x images to extract ball data whenever the

bat hits it – no separate cameras are required69. It tracks the ball to the pitcher’s mound

and measures its speed off the bat as well as vertical and horizontal angles as it leaves

the bat. This helps determine if the ball is a hit but not where it lands. Unlike publically

available PITCHf/x data, HITf/x data is privately licensed, and was only available to the

Page 26: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 26

Date Hitter Team Pitcher Team Ballpark

Speed

off bat Feet VLA HLA

9/1/12 Edwin Encarnacion TOR J.P. Howell TB Rogers Center 118.9 488 24.0 101.5

4/15/12 Travis Hafner CLE Luis Mendoza KC Kauffman Stadium 117.2 481 29.7 64.4

8/17/12 Giancarlo Stanton MIA Josh Roenicke COL Coors Field 116.3 494 24.9 95.5

6/3/12 Nelson Cruz TEX Bobby Cassevah LAA Angel Stadium 116.3 484 26.5 100.1

7/2/12 Cameron Maybin SD Trevor Cahill ARI Chase Field 115.7 485 24.9 98.0

public for a short while in 2009. Much of the HITf/x functionality was incorporated into

Sportvision’s latest effort, FIELDf/x, which is discussed in the next section.

Algorithms kick in at the split-second the ball hits the bat to extract three-dimensional

trajectory and movement parameters. Other data can be integrated such as pitch

tracking, atmospheric conditions, park dimensions, etc. to analyze the expected

path. The course can be affected by the ball’s spin, humidity, temperature, drag,

and wind speed70. On humid days, the baseball doesn’t travel as far because

the yarn in the ball soaks up moisture causing it to absorb more of the bat’s energy.

HITf/x uses 20 optical samples of the batted ball to produce

the bat’s contact point, ball speed off the bat, the elevation

angle, and where the ball is heading71. These angles are key

to understanding the ball’s flight and affecting the defensive

alignment designed to prevent the hitter from safely reaching

base.

The distance a ball travels is highly dependent on

three factors. One is the speed off the bat and its

backspin. Dry balls travel further during warm, humid

weather at higher altitude with the wind blowing from

home to the outfield. HITf/x correlates how hard it is

struck with the odds of it being a hit72. An 85 mph hit

could be a single, with a double, triple, and home run

requiring 95-100 mph. The 2nd factor is the vertical launch angle (VLA), such as 200 (900

would be straight up in the air and 00 would

be a straight line drive). The last one is the

horizontal launch angle (HLA) or spray of the

ball (the location it is headed with 450 being the right field foul line and 1350 is the left

field foul line)73. These angles are unaffected by conditions like the weather, and

mathematically show

whether it is a home run

or in the playing area. In

2012, the hardest hit ball belongs to Toronto’s Edwin Encarnacion (119 mph).74

Page 27: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 27

Let’s use the Colorado Rockies, who play at

Denver’s Coors Field a mile above sea level, to

explain the atmospheric impact on a hit. Thinner air

allows the ball to travel further, so a 400’ home run at

sea level travels 34’ further at Coors75. On August

17, 2012, the years’ longest home run (green

highlight in the previous table and image to the right)

was hit at Coors by right-handed Giancarlo Stanton

of the Miami Marlins off of Rockies righty pitcher

Josh Roenicke on a 6th inning breaking-ball. With HITf/x data augmented by Hit

Tracker76, the ball traveled 494’. It left Stanton’s bat at 116 mph at a 24.90 VLA and a

95.50 HLA. Hit Tracker doesn’t use cameras or radar, but relies on trajectory data to

calculate the path, and HITf/x data for direction, flight time, and absolute landing spot.

Not only is the data exciting to fans, but useful to position fielders. If Yankee right-hander

Phil Hughes faced the Orioles, the Yankee manager would shift the defense based on a

spray chart lineup card of previous Hughes-Orioles duels 77. This shows where Markakis,

Machado, and Davis have hit Hughes’ pitches. The shortstop would move to the right of

2nd base when Davis came to bat. A sufficiently large sample size is clearly required.

Based on this chart where Red Sox slugger David Ortiz hits the

ball78, you might move your shortstop to the right of 2nd base,

your 2nd baseman into short right field, and let your 3rd

baseman play in the shortstop position. The Rays employ a

shift based on spray patterns, and in 2011 their league leading

defense saved 85 runs79.

Page 28: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 28

Scouts can use HITf/x to objectively evaluate prospects and approximate their success

in a team’s home stadium. This helps determine if their observations truly exemplify the

player’s ability, find out which are under/over-valued, or figure out if they are just lucky.

HITf/x can establish a hitter’s true performance profile. If they are in a slump, their batted

ball speed could be compared to their historical results. If they average 80 mph but could

only muster 70 this week, a coach might want to explore the power drop-off. This helps

rule out bad luck as the culprit, such as hitting balls right at fielders, and helps determine

if they have a mechanical issue with their swing. While this is just one indicator, it is

preferable to looking at a declining batting average and explaining they are having a 0-

22 hitting “streak”. Another example is whether game conditions, or the amount of foul

territory in a series of parks, could be causing their short-term poor performance.

FIELDf/x

Most of baseball’s statistical history

focuses on hitting and pitching, with

defensive stats by and large limited to

outs, errors, and double-plays once a

fielder reaches a ball80. A shortstop who

ranges far to his right to scoop up a ball,

leaps, spins, and fires the ball to 1st base

to get the runner out, is merely credited with an assist. A runner who times the pitcher’s

delivery, takes an aggressive lead, sprints to 2nd and slides head-first for a stolen base is

simply credited with a stolen base. To value a hit, the defense and its alignment, crowd

noise, park dimensions, weather, and other factors should be taken into account.

FIELDf/x real-time data tells us how the play happened, not simply that it occurred. With

FIELDf/x, defensive luck can be factored out of their ability to play a position and a

runner is given credit for his running abilities.

Box scores report numeric conclusions – runs, hits, errors, averages, innings, pitches,

etc. Quantifying fielding, throw accuracy, and efficient base running has been elusive

and often limited to a stop watch. A fielder’s skill or lack thereof is treated as a binary

event, wrapped in adjectives or ignored. FIELDf/x revolutionizes player evaluation and

reporting with everything on the field digitally tracked. Even sabermetric enhancements

Page 29: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 29

pale in comparison to the in-depth big data analysis that cutting-edge FIELDf/x provides

– the majority of what playing baseball is all about. Sportvision calls FIELDf/x the “…holy

grail of baseball metrics”81 and is a tremendous leap forward for defensive positioning

and player reaction times. We can now quantify if a fielder took off in the right direction

when the ball was hit, or if he went forward when he should have gone backward. It

leads to precise and measurable objective views of how much a player adds or subtracts

from a team’s capabilities, and how interrelated his actions are to other events on the

field.

With empirical FIELDf/x data, the notion of a stolen base becomes a

subject unto itself. Hall of Famer Rickey Henderson’s plaque calls him

the “Man of Steal” as he holds the MLB record with 1,406 stolen bases

over a 25-year career. If FIELDf/x were in use when he played, we

might learn more about how he “reads” the pitcher, how he evaluated a

catcher’s throwing arm, and how he maximized a 4-5 step lead, sliding

into 2nd base 2.9 seconds later82. Conversely, teams could use data to better defend

against the stolen base. Big data systems show the length of the initial and secondary

lead off a bag against a particular pitcher’s wind-up, catch by the catcher, the catcher’s

throw to 2nd base, and the tag on the runner. It determines how fast a runner needs to be

against specific pitcher-catcher-fielder combinations. Teaching a runner to get a slightly

longer lead could mean the difference between a win or a loss, or reduce successful

pitcher pick-offs if the runner can leave a fraction of a second later and still steal the

base. The variables include the lead length, runner’s speed, the type of pitch thrown, the

catcher’s reaction time and the strength and accuracy of their arm, and whether the

runner slides feet first or head first. The catcher must get their throw to 2nd base in 2.0-

2.2 seconds83.

Until recently, precision data required a point-in-time radar gun. FIELDf/x uses four high

definition Prosilica GX computer controlled cameras mounted at high points in the

stadium to cover the exact position of all offensive and defensive players every second

they are on the field84. Initially installed in 2010 at AT&T

Park (San Francisco Giants)85, motion capture processing

is used on the images to interpret the action. The system

precisely follows the ball as it crosses the plate and

Page 30: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 30

computer vision algorithms turn the offensive and defensive players, offensive coaches,

and umpires86 into trackable graphic images. An operator adds the umpire’s ball-strike

judgment call and people’s names. As with PITCHf/x, details are recorded as the

“…pitcher releases the ball, batter hits the ball, fielder gains possession of a ball (fields

it, or catches it from a throw) and fielder throws the ball.”87 With a hit ball, FIELDf/x

displays its angle of elevation, maximum height, and how far it travels, all while showing

the fielder’s speed to the landing spot and the distance they cover. It captures action at

1/30th of a second, resulting in approximately 50,000 offensive and defensive player

object-recognition measurements88 and 600,000 – 1,000,000 data points89,90 a game. It

results in an amazing high-volume, high-velocity, and high-variety big data feed.

The system provides a bird's-eye view of all on-field action – leads, speeds, routes, and

the ball’s fair or foul trajectory. It reports on a fielder’s range, reaction time, speed to the

ball, throwing velocity, and throwing accuracy to a base or as a relay throw. For

example, 3rd basemen have about 1½ seconds to glove a ball while middle infielders

have an extra ½ second giving them greater range91. FIELDf/x shows if a fielder takes

an optimal path to the ball, or plays better in the daytime or nighttime, or on grass or

artificial turf. And as we’ve already seen, managers use this data to position fielders, and

decide on pitch selection based on the situation, the pitcher, and the opponent.

The long debate over the quickest left fielder, the shortstop with the greatest range, or

who is worthy of a Gold Glove can now be quantified. A full season of

FIELDf/x empirical data makes sabermetric statistics on errors per

chance, assists, or passed balls passé. FIELDf/x metrics can help with

contract negotiations by giving general managers a player’s “hustle

value” comparison, or help pair up shortstops and 2nd basemen based on their combined

time to complete a double play. For example, if shortstop “Rick” threw to 2nd baseman

“John” in .4 seconds and then John threw to 1st baseman “Phil” in .5 seconds, you could

then determine if other player pairings could complete the play in less than .9 seconds.

Here is an example of how FIELDf/x allows coaches to accurately evaluate a fielder’s

defensive worth to a club. The Yankees lead the Rays in the 5th inning 2-1 with 2 outs on

July 2, 2012. Rays Will Rhymes hits an 87 mph pitch from Yankee Freddy Garcia to

center field. The ball is 31’ high as Yankee center fielder Curtis Granderson runs 44’ at

Page 31: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 31

12.9 mph towards it92. The YELLOW path he takes is sub-optimal. Between innings,

coaches might have discussed this with Granderson to understand what went wrong.

FIELDf/x generates an order of magnitude more real-time data than PITCHf/x. Each

camera has two link-aggregated gigabit Ethernet ports that capture action at 30 FPS and

stream it at up to 240 MB/s during the game. They optically track a ball that could travel

600’, as well as 9 fielders, 4 umpires, a batter, 3 base runners, ball boys, and whoever

enters the field. The system can tell players apart even when they cross each other. All

this comes out to 2.5 million records for a 3 hour game93. Combined with other feeds,

Sportvision and MLB have amassed over 30 petabytes of data.94 It is stored and

analyzed in near-real time and is available to a manager for his decisions or to evaluate

player performance. Teams rely on analysts (data scientists) to sift through the daily

data volumes for unknown insights that could lead to an extra run or better defense.

Solid defense is key to a winning season. Correctly positioning your players for each

opponent, park geometries, weather conditions, pitcher/pitch, and other variables can

save runs and help your team get to the playoffs. That requires big data. Sabermatrician

Bill James equated runs scored and runs allowed to a team’s winning percentage95

. For example, a team that scores 800 runs and gives up

700,

, will win 92 games that year. A 5% FIELDf/x

defensive improvement could reduce the runs allowed to 665 runs

. Four extra wins could make or break a season.

This FIELDf/x image shows the ball’s HIT PATH on the right96. A straight line connects

the left fielder’s starting point to where the ball is going. The fielder runs 85’ at 20.1 mph

to the hit at a sub-optimal 3.9 seconds after taking 1.2 seconds to react to it. The path

Page 32: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 32

was 15’ longer than a straight line. His

fielding efficiency is .824

.The catch probability is

calculated in real-time from the digital

photography. FIELDf/x adds a descriptive

language to simple box scores.

These still frames (with graphics enhanced to make the action easy to follow) are from a

FIELDf/x test of an Athletics-Mariners game on September 20, 200897. There is 1 out

with Seattle’s Kenji Johjima at bat, Jeremy Reed on 3rd base, Miguel Cairo on

2nd base, and Bryan LaHair on 1st base. Oakland defensively has 3rd baseman Jack

Hannahan , shortstop Bobby Crosby , 2nd baseman Cliff Pennington , center

fielder Ryan Sweeney , and left fielder Aaron Cunningham .

The action begins with Oakland’s Kirk Saarloos’s pitch to catcher Rob Bowen .

Johjima hits it deep to left field. We know from historical FIELDf/x data that

Cunningham covers 130’ in 7 seconds, or the time a high fly ball stays in the air. His

coach positioned him based on who is at bat and perhaps the type of pitch the pitcher

was throwing. Defensive players go after the hit and position themselves

Page 33: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 33

for relay throws and a possible play at the plate. Runners prepare to tag up in

the event the ball is caught. The ball hits the wall and the runners sprint home. The relay

home is late and two runs score. Runners stay on 1st and 2nd.

These tools allow managers to create optimal defensive plans and decide which

shortstop to use based on weather, altitude, day game after a night game, type of turf,

etc. Coaches learn more about the pitchers their team will face, and the pitcher can

study the batters they will oppose in more detail than a traditional scouting report would

contain. A manager can build a list of pitch sequences for the pitcher to throw to

increase the chances of getting a hitter out. Likewise, a manager can let the hitter know

what type of pitch and its location to expect in a situation – runners on, wind, day/night,

turf, etc. With this level of detail available in real-time, players in training sessions can

get instant feedback on what they did well and what they need to work on98. While the

real-time data can’t actively influence play on the field, it does help improve player

preparation.

Big Data and the Business of Baseball

On the surface, baseball is a simple game and fun to play. For those running a baseball

business, however, simplicity couldn’t be farther from the truth. The overarching

Page 34: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 34

organization is Major League Baseball, and they have 30 team franchisees. MLB owns

the brand, logos, and other aspects licensed for use as defined by a constitution. The

league employs revenue sharing with 31% of each team’s net local revenue distributed

evenly to all teams99. Small market teams like the Kansas City Royals get more money

than they put in, and very rich teams like the Yankees get back less than they contribute.

At a high level, each team is governed by a MLB franchise contract and permitted to

select players, set prices, and operate like a for-profit business. Teams put a “product”

on the field to generate fan interest. With each team having its own business model that

is tailored to its geographic franchise, some teams are quite profitable while others are

lucky to break even. In 2013, the Yankees were the most valuable team at $2.3 billion

with 2012 revenue of $471 million. Somewhat surprisingly, the Rays were the lowest

valued team at $451 million, yet they had 2012 revenue of $167 million100.

To bring in sufficient revenues to keep a franchise profitable, a club must begin with the

“old school” basics of putting a winning team on the field that fans love to watch, since

the fervor of great play in front of a sellout crowd also generates concession revenue.

The club then needs to layer on big data improvements, such as assembling a better

team for less money, establishing a dynamic and profitable customer relationship, and

positioning the team as a marketing and media revenue generating engine.

Consider yourself a team’s general manager, with your goal to assemble a

championship contender. A guideline you might have to follow is based on money

buying success. These are possible strategies you might use and teams that fit that

profile:

Highest salary – expensive veterans, top-of-their-game all-stars (Yankees)

High salary – past their prime stars (Cubs)

Medium salary – primary focus on pitchers, secondary focus on hitters (Giants)

Low salary – primary focus on hitting, secondary focus on pitching (Angels)

Lowest salary – minor league players for MLB minimums (Athletics)

There are many more such variables, combinations, and strategies. You probably do not

want to group hitters with pitchers, and you probably want to add factors like park

dimensions, typical weather conditions, number of day versus night games, travel

schedules, and more. You want left-handed hitters if your ballpark has a short right field

dimension. Some combinations may not make any sense, while others will have a strong

correlation. You will likely get more walks if you have a lot of shorter players. If you want

Page 35: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 35

the oldest players or the heaviest players, you are getting veterans who really know how

to play the game, but may suffer

more injuries or tire easier. Pick

the youngest players and they

might be faster, but have less

experience. Or you could use

Billy Beane-style variables to

assemble a team focusing on

Defensive Runs Saved101. A

strategy of winning at all costs

may attract fans to the stadium, in which case you might want to outspend the other

teams. Any or all or none of these can be part of big data mining in assembling the right

team for your goals. To the right is an example of a strategy that could be tried102.

Leveraging big data cannot only help a team make it to the championships, but generate

meaningful extra revenue at the same time. Tickets are sold at a premium, souvenirs

and concession prices increase, and on cold October nights, clubs sell a lot of hot

chocolate. The 2009 Yankees earned an extra $72 million by playing in the

postseason103. Broadcasters view the World Series as a revenue bonanza, especially

when the teams are in top markets and the best-of seven series goes to 7 games. For

example, the 2009 World Series featured the Philadelphia Phillies against the Yankees.

An average of 19 million viewers saw the first 6 games jumping to 22 million viewers for

Game 7104. Championship games also bring additional revenues to local businesses

such as bars, restaurants, and MLB merchandise retailers105.

MLB players earn a contractual minimum of $490,000 in 2013 and the highest paid

player makes $29 million106. In 2011, Phillies pitcher Cliff Lee signed a $120 million 5-

year contract that earned him $25 million in 2013107. That equates to $8,067 for each of

his 3,099 pitches he made that year108. Teams collectively spend over $3 billion on

player salaries109, and try to determine each player’s worth based on their contributions.

Each team typically has 225 major and minor league players, plus 100-200 other full-

time employees, 1000-2000 part-time ushers, game-time concession sales people, etc.

Hundreds of thousands of direct and indirect workers make part of their living from

baseball, including parking attendants, police/security, hotel staff, peanut farmers, gas

station attendants, garment workers, newspaper writers, TV and radio commentators

Page 36: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 36

and engineers, etc. In truth, baseball is a simple game played by kids in 45 countries110

and viewed by fans of all ages worldwide, while backed by a trillion dollar global industry

that touches the lives of a significant portion of the planet in one way or another.

Businesses hate uncertainty, and baseball is anything but certain. We want to know the

future, but we don’t. In the business of baseball, you have a lot of data at your disposal,

and need to make a lot of decisions, with the goal of making the business viable despite

this uncertainty. That is where big data helps the most – discovery:

What roster of players should be assembled?

Players get injured - who should replace them and how will it affect the team?

How long should their contracts be and how much are they worth?

How to predict the future performance of a player?

What arm angles and pitcher velocity changes can predict medical issues, and how to use medical care and conditioning to get them and keep them healthy?

With big data systems like PITCHf/x and FIELDf/x, each prospect might have a hitting,

base running, fielding, and throwing score during a game. Over the course of a season,

players would have an empirical rating based on their play – a comprehensive report

card. For example, if college player “Fred” has a 92 rating and “Bill” a 96, teams might

offer Bill a higher signing bonus based on their ratings. The approach would also work to

evaluate and compensate older players who substitute experience for raw athleticism as

their career wanes. Some executives might even employ personality and/or medical

testing to find the best athletes that can earn the team tens of millions more dollars.

Player Development

Fans yearn to be part of an exciting team – great

pitching, power, speed, and defense. They come

to the park to see an exciting game and cheer

their team to victory. All clubs would like to field a

team of feared all-stars, but their collective cost is

too high. The 2013 Yankee player payroll was ten

times that of the Astros. It is true that generally

the team with the best, most expensive players

tend to win more games and make it to the playoffs more often than teams with the least

expensive players, but it is far from a certainty. Injuries to star players can quickly turn

expensive teams into mediocre ones. This wins per dollar graph from 2001 to 2010111

shows teams above the line won more often than those below the line without an

Page 37: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 37

ironclad correlation to payroll. The Athletics won an average of 87 games at $804,597

per win while the Yankees spent $2,244,897 for each of their 98 wins. There is obviously

more to a sucessful team than player’s paychecks.

In 2013, teams with above average minor league affiliates were the Cardinals, Mariners,

and Rays112. Prospects come from high schools, colleges, and key locations around the

world. In 2012, 27% or 224 MLB players came from the Dominican Republic (87),

Venezuela (58), Canada (15), Japan (13), Cuba (11), Mexico (9), Panama (7), and

others113. In the pre-Moneyball era, you scouted these locals to find talented players for

bargain-basement contracts. Big data has made the task a bit easier. With empirical

data and the power of the Internet, scouts can “watch” every game from their office and

analyze potential players before they hop on a plane or drive to visit them in action. They

can get a feel for a pitcher’s habits and understand what they throw in game situations.

Are they effective when behind 3 balls no strikes? Do they hold a runner at 1st base?

Does their changeup have a big break? Do they keep their first pitch down?

In the pre-Moneyball era, players were often ranked by “tools” with “5-tools” being the

best. Teams were prone to draft players who had that “look” – handsome, athletic body,

and charisma. Today, they want players who get hits from optimal mechanics, have the

fastest reaction times, take the shortest path on the bases, and have the most accurate

and strongest throwing arms. The big data objective is to employ metrics like these114:

60th percentile of ability to go with the pitch

92th percentile of forward exit velocity

62th percentile of path to ball

72th percentile of throwing arm accuracy

40th percentile of speed

As a result, Sportvision tools are appearing at little leagues, colleges,

and baseball schools, and are used alongside radar guns and stop-watches. SmartKage

uses HITf/x and PITCHf/x in a video-sensor batting cage to provide over 40 performance

metrics to student athletes115. Their hitting, pitching, fielding, running, catching, and

agility are quantified to improve the player. Scouts also receive this data116. SmartKage

is installed in over 300 locations, 3,500 colleges117, and many minor league parks118.

Page 38: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 38

Hitters' Questions Model Data

What does he throw? Top 2 Pitches

Pitch Repertoire/Variety

Horizontal Pitch Location

Vertical Pitch Location

How hard does he throw? FB Velocity

What kind of movement? Horizontal Movement

Vertical Movement

Where does he come from? Release Point

How does he like to pitch? Swinging Strike %

Zone %

Edge %

Top 2-pitch Sequence

Clustering Pitchers

With all the players in MLB, it is hard to find plentiful data on specific hitter-pitcher

matchups. This small sample size problem is evident when a manager picks a batting

order against a particular pitcher and decides to use someone who is 2 for 12 against a

pitcher, 1 for 3, or even 3 for 20 over a 3-4 year period because they lack a sufficient

statistical history to work with. What if pitchers were grouped with others who pitched the

same way and clustered based on fastball,

changeup, and curve release points, velocity,

and horizontal and vertical break

characteristics? A grouping might show how a

hitter could do against this model of a pitcher’s

ability. Vince Gennaro used a YarcData Urika

big data appliance to comb through a vast amount of PITCHf/x data to create pitcher

clusters against left-handed batters119. In red are

pitchers such as Gio Gonzalez and Ricky Romero. In

green, Jon Lester and Cole Hamels. Purple has CC

Sabathia and Boone Logan. Bruce Chen and Johan Santana are blue groups, with

David Price and Chris Sale in yellow. Gennaro’s

analysis evaluated 12 different attributes as shown in

this table120. This data could help a manager use a

stronger statistical correlation to select a lineup. Instead

of picking a batter against CC Sabathia, the manager

would use the group stats of Sabathia, Craig Breslow,

Rex Brothers, Clayton Kershaw, and Boone Logan.

To Gennaro, “…big data is about discovery. It's about asking the right questions to spot

a customer trend, to identify a new market, or, in personnel matters, to find a gem of an

employee, someone who you didn't know existed until their name started showing up

regularly on Twitter or LinkedIn. Big data helps you to know what don't know.”121

Revenue From Fans

Team owners will tell you the key to a profitable baseball franchise starts with an exciting

team, for excitement equals fans. Ticket revenue is just the tip of the iceberg. Selling

more tickets means higher concession revenue, increased merchandise sales, more

advertisers, etc. Management has many approaches to sell tickets, from giveaway

player bobble-head dolls, firework evenings, co-sponsored free caps, balls, or bats for

Page 39: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 39

kids, and other items. There are also special stadium events such as “Old-Timers”

reunions, retiring a player’s number, and tributes. It’s not just in the ballpark; TV viewers

on their couches, passengers listening to car radio, Internet baseball fans, all experience

the fervor through licensed broadcasts. A happy fan base translates into higher revenue,

and higher TV/radio ratings lead to higher advertising revenue.

While a widely accepted goal is to sell all of your seats before the season begins to

season subscribers, not all teams share that view. The Nationals “…limit the number of

season tickets available for 2013 to 20,000 of our 41,000 available seats.”122 Single-

game tickets have higher club revenue than games purchased on a discounted season

plan. Clubs have also experimented with various ticket sale strategies in order to

increase direct ticket revenue. Variable and dynamic are two approaches to allow for

“market” ticket pricing against a hot club or discount tickets to a last-place opponent.

When their team wins, ticket prices can increase, or vice-versa if they lose.

With variable pricing, teams set seat prices based on the day, game time, and pre-

season popularity of the opponent. For example, a $19 Monday seat may cost $21 for a

Wednesday game. The Red Sox introduced premium prices for Opening Day and

Yankees games, and reduced prices for weeknight games in colder months of April, May

and September123. This increases attendance for games with less desirable opponents,

and raises prices for high demand tickets against top teams. Dynamic pricing uses big

data to gauge real-time interest to set prices based on supply and demand. Similar to

the airlines, prices are higher when your team is on a hot streak, a player is about to

break an important record, your team is vying for the pennant, or a great summer day.

Twists on dynamic pricing include allowing the fan to upgrade their seat when the

stadium is not full at a lower price than if they purchased the better seat weeks earlier.

Merchandise creates substantial club revenue because they incur no cost while profiting

from each item sold. Hats, jerseys, and foam fingers use licensed team logos and

names of players past and present that fans relate to. In 2012, almost 25 million fans

bought items124. Fans purchased $3.2 billion worth of goods in 2011125 and it is

estimated that MLB keeps 4% of each sale before distributing the revenue to teams126.

Page 40: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 40

The Nationals are a technologically advanced club that

uses big data for fan marketing and ticket promotion. Their

“Ultimate Ballpark Access"127 card is a fan’s ticket, cash,

and rewards card that gets them personalized discounts in

the park. The card is key to their Customer Relationship

Management (CRM) system that integrates marketing, sales, and customer service to

get a better understanding of the fan. In exchange, fans get a better game experience

with points exchangeable for merchandise or food discounts, seat upgrades, promotions

during rain delays for free beer with a hot dog purchase, and other advantages128. With

cash tied to the card, fans are encouraged and incentivized to use it at the game.

Real-time data, including a fan’s location in the park, allows the Nationals to redeploy

concession staff when queuing causes long waits, offer promotions to fans that arrive

early, and reward frequent fans. Knowing more about each fan even allows vendors to

greet them by their loyalty card name. “Hi Debi, thanks for coming today! As a loyal fan, I

see you are entitled to a free hot dog. Want it with mustard”? They even increase sales

during inclement weather through parking and subway discounts if fans buy a ticket.

All Nationals business groups share a common CRM view - ticketing, merchandise

manager, concession coordinator, etc. They strive to improve the customer experience

and allow fans to dictate their level of participation in the system. The Nationals track

how often a season ticket holder attends a game and make it easy for that fan to resell

their tickets on StubHub since the goal is a full stadium129. If they see a pattern where a

fan buys tickets when certain teams visit, such as when the Mets come to play, they can

offer them a Nationals-Mets ticket package. Their loyalty card is also tied in with other

merchants such as grocery stores for extra ballpark discounts. They increase the value

of the Nationals brand by knowing more about their fans and market directly to them.

Business managers also want to know their fans better. Data was always available on

season or luxury box ticket holders, but insight into fans that came once a year was hard

to come by. Big data to the rescue, especially with a CRM card! For example, they can

deduce that a fan who buys beer and cotton candy in the 3rd inning likely has a child at

the game, and can text them a 7th inning 50% discount offer on children’s caps.

Page 41: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 41

Offering fans rich interactive media is part of the big data

ticket buying experience. For example, the Yankees ticket

website uses an interactive park tour to find a view you

like. MLB runs ticketing for most of the teams, and tracks

the buying habits of its fans to improve the fan experience

and help gain insight into its customers. Getting to the park

is made easier using a smart phone or tablet for directions and parking details. MLB’s

app also shows team statistics and video highlights from games attended, player songs,

special offers, and ticket upgrades130. Fans can receive smart phone alerts about the

best parking, gates to use, and bathrooms with short lines, see where they are in the

park, bring up a food and drink menu, and have their order delivered to their seats!

More seated fans means more advertising revenue. Time slots before the game begins,

between innings, during pitching changes, and on the rain tarp during inclement weather

are ideal for advertisers looking to capture the attention of 30,000-50,000 fans. Wrigley

Field (Chicago Cubs) is expanding their advertising square footage, including a left field

Jumbotron expected to “…generate $300 million over five years to use for Wrigley

renovations.”131 Advertising revenue is closely tied to fan attendance, which reflects

whether the team is in contention and exciting to watch. For example, the 2010 Mets at

Citi Field drew 2,559,738 fans with a 79-83 record132. A year later, their record fell to 77-

85 and drew 8% fewer fans. Advertising revenue dropped $1.9 million to $44.2 million133.

MLBAM was created to develop a common look-and-feel web site for each team134.

They now generate revenue, and in 2012, brought in $650 million through 3 million

MLB.tv and MLB.com subscribers. AtBat is their top smart app with almost 7 million

downloads. It provides play-by-play action plus the PITCHf/x graphical flight of the ball

as it crosses home135. Through electronic devices, fans increase their enjoyment through

real-time statistics and individual replays within 15 seconds. “...MLB.tv subscription

allows you to stream the game through an Xbox or any Wi-Fi connection. Meanwhile,

BAM's AtBat app allows for streaming on

your iPhone or iPad. It also allows fans to

watch the action in "Game Day" mode , a

data-enriched, live-graphic presentation of the

game.”136 It takes a lot of real-time big data

expertise to unify diverse data feeds behind

Page 42: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 42

the scenes, mask the complexity and make it appear simple and seamless to the fan.

MLBAM is also working on bringing PITCHf/x, HITf/x, and FIELDf/x to the home video

game console. Imagine being at home, pinch hitting for a player with real runners on

base, trying to swing a virtual bat at Mariano Rivera’s cut fastball coming at you on your

big screen TV at 90 mph. The gaming potential is tremendous and promises to offer

greater fan immersion compared to standard XBOX or PlayStation baseball games.

It should not be a surprise that running a successful ball team is a big data problem. The

data shows all sorts of patterns137. Who buys tickets? Who buys merchandise? Who

likes what? What’s the best time to make them an offer? What form should that offer

take? On the game side, what combination of players produces the best results? Where

should you position your defense? What pitches in what sequence should you look for?

It’s come down to Moneyball on steroids. Moneyball was “…based on 80,000 box scores

and play by play data (1963-2002), or about 1.5GB of just outcome data”138, versus

today where multi-terabytes of data are produced in one game – a big data explosion!

There’s lots of data and answers in that data that leads to some amazing conclusions.

Revenue From Media

Clubs also make money from TV and radio deals. Teams contractually share revenue

from national contracts with CBS, ESPN, TBS, FOX, etc., amounting to $12.4 billion over

8 years, or nearly $52 million a season per club139. Teams also have local contracts. For

example, the Yankees own the YES Network which paid them $85 million in 2013 for

broadcasting rights, eventually reaching $350 million a year by 2042140. In 2012, the

Yankees sold 49% of YES to FOX Sports for $3.4 billion141. In 2013, the Dodgers and

Time Warner Cable signed a $7-8 billion, 20-25 year deal worth $280 million a year142.

Why are sports so popular with networks and advertisers? One big reason is the DVR.

Viewers record their favorite shows and fast-forward through commercials to achieve a

near non-stop experience. That behavior hurts product advertising and the revenue it

generates. People may record their favorite TV show to skip the commercials, but

watching DVR'ed sports results in a mediocre experience. There are many ways to track

a score, so once you find out the result, you tend to either not watch your DVR’ed game,

Page 43: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 43

or watch it with less enjoyment. That is why advertisers pay so much to air commercials

between innings – people want to watch sports games live and not record them.

A team can make out well when it comes to negotiating advertising contracts with

broadcasters. For a broadcaster, a single game represents four 30-second TV or radio

commercials per half inning, or 56 commercials during a 9-inning game. In an extreme

example, Fox TV sold each 2010 baseball playoff 30-second slot for $225,000, and

$450,000 for World Series slots143, with the 2012 All-Star game going for $550,000 per

commercial144. There may also be advertisements during the game, such as the “Pepsi

home run blast” or the “Kia player of the game”. There are also commercials before and

after the game, but it is harder to guarantee advertisers the same viewer demographics,

ratings points, and market share of the 18-49 year olds they are trying to reach.

Print advertisers buy space on pocket schedules, media guides, game day handouts,

yearbooks, and the back of paper tickets. Ads are also placed in concourses around the

park, electronic scoreboards, outfield fences, behind home plate and the bases, and

scrolling messages around the field. Premium giveaways with an advertiser’s message,

such as a souvenir cup or trinket, are handed to fans as they enter the park. MLB

attracts an interesting demographic for advertisers. For example, 25% of the Milwaukee

Brewers fans earn $100,000 or more compared to 17% of the Milwaukee Designated

Market Area145. The team’s demographic audience, market, and team interest dictate the

prices charged for signage. For example, a big market team like the Red Sox might

charge $300,000 for the same home plate signage that costs $50,000-$60,000 for a half-

inning at Kauffman Stadium (Royals)146. When the Red Sox play the Royals, the Boston

TV audience sees advertisements that cost 1/6 the price of Fenway ads.

Big Data Helps Create Algorithmic Baseball Journalism

Automated Insights (AI) created realtime.automatedinsights.com147 which robo-writes

content about any MLB game in real-time. It includes big data mined

observations from previous games, be they from the day before or 100

years ago. With instant updates to games in progress and recaps derived from the final

box score, facts and insights are added to yield a colorful account of the play-by-play.

Originally named StatSheet.com, AI leverages structured statistics to produce big data

automated written content. They use linguistic algorithms to automatically publish 500-

Page 44: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 44

1,000 word recaps such as this from a

Red Sox – Tigers 2013 game148. They

generate recaps for 417 other websites

covering the NFL, MLB, NCAA Division

I college basketball, and others,

consisting of 400,000+ tweets on score

updates, play-by-play accounts, and

more149. Their “…network is already

comprised of more than one million

unique pages of content and 15,000-

20,000 original articles are added every month.”150

Software-generated content can turn quantitative information into English, but qualitative

writing still requires a human touch. Some benefits of “robo-writing” include:

never gets writer’s block

works morning, noon, and night, and produces content quickly

can alter its writing style through parameter changes

can analyze data without getting tired

low cost per article

Programs can take something simple like “Adam Dunn of the White Sox homered to left

center” and instantly turn it into:

“That was an MLB leading 35th home run for the year. Adam

Dunn moves into sole possession of 50th place all time for

home runs in Major League Baseball. When there are two or

more home runs in a game, the White Sox are 34-10.”151

Rather than invest in a large data center with physical servers, storage systems, backup

and disaster recover capabilities, large power and cooling bills, and more to pull off this

big data feat, they leverage the elastic cloud computing capabilities of Amazon’s AWS.

On peak Tuesdays, they spin up 300 AWS servers on demand for one hour to input

millions of Yahoo data points and author millions of customized fantasy sports reports,

each tailored to an individual fantasy team owner!152 A report could delve into a player’s

actions versus historical events, a manager’s decisions, playoff implications, and more.

They create hundreds of reports a second and AWS bills them for the actual usage.

Page 45: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 45

This US patent shows the AI basic flow153. Statistics flow from an information provider

(305) to any device that can relay baseball

stats according to an agreed API, XML,

CSV, or spreadsheet format. A content

generation engine (320) issues a SQL

query against the content generation

database (330) for statistics that includes

event IDs such as the one used in the

PITCHf/x metadata, or other classification trends, streaks, new records, etc.

The actual content is based on templates designed by AI journalists and programmers

which include <phrases> and <variables>. The templates can include “…event preview,

in-progress event update, event recap/summary, statistical analysis, progress report,

etc.”154 Once a template is selected, text is varied to avoid duplication of content by

using <phrase variation> and <statistic>. The variation allows the tone of the content to

be hopeful, positive, optimistic, angry, indifferent, pessimistic, judgmental, etc. Their

algorithms also derive uniqueness from the significance of the event.

Known as computational journalism, machine written news, automated content creation,

algorithmic journalism, and robotic reporter, to name a few, companies like AI mine big

data to create humanized content around statistics by adding the who, what, where,

when, and why with nouns, verbs, and other parts of speech. To guard against

repetitiveness, they use artificial intelligence to scan big data datasets for patterns,

insights, and trends using natural language constructs. While not yet at the level of a

professional writer, they do a fine job with plain English. Their approach can also

generate content for real estate listings, financial reports, and traffic alerts.

AI employs journalists who specialize on meta-data structure and content, and work with

programmers on data filtering parameters to fit that syntax. They coach algorithms to

identify various data “angles”. Content may originate from a hit or an out, winner or loser,

record setters, and players who had a great day. The methods involve context and data

mining searches from proprietary databases, external data feeds, and external big data

sources. Automated mining can free up a real reporter from the drudgery of digging out

data clues. The big data aspect adds historical perspective for “why” an event is special.

Working with data rich subjects, turning statistics like RBIs and home runs into

Page 46: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 46

sentences and paragraphs is a natural fit for robo-writing. AI’s programmers help fill the

gap created by enormous rich data streams and the demand for formulaic journalism155.

Here are two different articles written about the same game. Which was written by the

NY Times and which by an algorithm?

BOSTON — Things looked bleak for the Angels when they trailed by two runs in the ninth inning, but Los Angeles recovered thanks to a key single from Vladimir Guerrero to pull out a 7-6 victory over the Boston Red Sox at Fenway Park on Sunday. Guerrero drove in two Angels runners. He went 2-4 at the plate. “When it comes down to honoring Nick Adenhart, and what happened in April in Anaheim, yes, it probably was the biggest hit (of my career),” Guerrero said. “Because I’m dedicating that to a former teammate, a guy that passed away.” After Chone Figgins walked, Bobby Abreu doubled and Torii Hunter was intentionally walked, the Angels were leading by one when Guerrero came to the plate against Jonathan Papelbon with two outs and the bases loaded in the ninth inning. He singled scoring Abreu from second and Figgins from third, which gave Angels the lead for good.

BOSTON — In the hopes of channeling one of their most famous postseason moments, Boston reached back 23 years on Sunday and had Dave Henderson throw out the ceremonial first pitch. It was Henderson’s home run that rallied the Red Sox when they were down to their last strike against the Angels in the 1986 American League Championship Series. Now with the Red Sox facing elimination by the Angels again, here was Henderson, proof that no game, no series, is over until the final out is secure. But this time, all these years later, it was the Angels who made that point. Down to their final strike three times in the ninth inning — just as Henderson was in 1986 — Los Angeles scored three runs off Boston’s dominant closer Jonathan Papelbon, to win the game, 7-6, and eliminate the Red Sox from the postseason. Henderson’s magic, apparently, is not reusable.

Still not sure? An algorithm created the one on the left from a box score with on-line

quotes thrown in for color156. While the artificial article is not as good as the journalist’s,

the quality isn’t noticeably sub-par. It shows state of the art linguistic algorithms

combined with big data. Algorithms may help with data mining research, create content

that journalists virtually never do such as Little League game commentary, handle

simple medical checkup explanations involving cholesterol readings, or traffic reports.

They can also report on the finer points of pitch movement, trajectories off the bat, and a

shortstop’s movement to his left, while updating it every second. Algorithms could also

alert a coach in natural language to a tiring pitcher’s lower release point or velocity.

Listen To Your Data - Grady the Goat - The Curse of the Bambino

As good as baseball statistics and analytics have become, they don’t guarantee a win,

nor is everyone a true believer. An example of where data was available to help make a

better decision - a premise of big data - was during the 2003 ALCS pitting the Red Sox

against the Yankees. The Red Sox were just 5 defensive outs away from winning their

5th pennant since 1918 with a chance to win the World Series, something they had not

Page 47: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 47

done in 85 years. That they failed to win the game can be traced to the decision of Red

Sox manager Grady Little to “go with his gut”, and not with the data available to him, at a

critical moment late in the game.

The celebrated Red Sox pitcher Pedro Martinez was on the mound, and

was brilliant giving up only 3 hits through 7 innings. He had now thrown

over 100 pitches, however, and his team’s data analysis showed that his effectiveness

dropped noticeably after 6 innings or the 105-pitch mark. Historically, batters had a .555

OPS through 6 innings facing Martinez, jumping to .758 after that. As Martinez passed

the 105-pitch mark, the Yankees began to tee off on the previously dominant Red Sox

“ace”, chipping away at the Sox’s 5-2 lead. Astonishingly, Little refused to pull Martinez

from the game until he had thrown 123 pitches; his last pitch, thrown to Yankees catcher

Jorge Posada, saw the Yankees tie the game, which they went on to win. The so-called

“Curse of the Bambino”, which linked the Red Sox trading the legendary Babe Ruth in

1919 to the Yankees to their World Series drought since then, was seemingly alive and

well. Little was fired 11 days later; his refusal to rely on clear-cut evidence that a new

pitcher was needed had broken the hearts of Red Sox Nation for yet another year.

Conclusion: The Future of Big Data Baseball

Baseball is a game where a “once in a lifetime” batter has a .400 average and a pitcher

that wins 15 games a year is ranked in the top 10%. The Moneyball era marked a radical

overhaul of baseball analytics and the refinement of the process of baseball. Teams

today invest millions on advanced analytics to determine the players they need, how to

use them, how to help them, maximize revenue, and transform their team’s relationship

to their fans. A symbiotic relationship between the athlete and the nerd has emerged.

Fans continue to be exposed to this data explosion, with degrees of difficulty on fielding

plays replacing adjectives like “great catch”, or base running metrics that make for hotly

debated player comparisons. At home, fans see a liberal dose of data capabilities

because of advertising demands and new metrics that grab everyone’s attention.

Management and the media continue to expand revenues and increase fan interest.

Coaches are able to keep players healthier during the grueling season, and turn a

player’s average season into great ones with a few extra hits a month. As Vince

Gennaro says “There are going to be pregame meetings on where the shortstop is going

Page 48: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 48

Client

Server

File System

Append Only Log

Write

AcknowledgeMem Table – Storage row

Flush

Commit Log Data – SS Table

WRITE PATH

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

Client

Server

File SystemRandom Seek

Request

MissMem Table – Storage row

Read

Data – SS Table

Key Cache

Bloom Filter

Acknowledge READ PATH

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

DATA 1 DATA 2 DATA 3 DATA 4

to play exactly for each of the hitters”157. Players will make better catches, runners will

get more stolen bases, and the pitcher/hitter relationship will become more intense.

Big data baseball continues to grow as PITCHf/x and other systems reach minor league

affiliates. Over time, 7,500 major and minor league players could be tracked a year. Add

7,700 U.S. colleges158, 37,000 American high schools, and big data could someday track

over a million players. By following players around the world, there will be millions of

players that could be affordably tracked. Big data has come to baseball!

More players equal more games and more data as demonstrated by Gennaro’s use of

Urika. Management will gain more insight into which high school players to draft, find

undrafted “gems”, decide who to put together on teams, how to help player

development, determine negotiating strategies based on empirical player values, and

more. The data collected is complex and difficult to derive information from using

classical database tools, and some teams have begun to experiment with Hadoop.

Hadoop is a big data processing method where structured and unstructured data is

spread amongst dozens to hundreds of computers in parallel to quickly answer complex

questions, similar to the speed of a Google search request. The Rays are using Hadoop

to help derive extra revenue and create a better fan experience159. Coaches use

Cassandra, Hive, or Hbase analytics160,161 to discover better defenses, optimize running,

or give the hitter or pitcher an advantage. The following is the basic Cassandra logic.

Big data sports managers know that live baseball entertainment is not limited to the

ballpark. They would love, for example, to tie PITCHf/x into the untapped market of real-

time video games. Fans trying to bat against Mariano Rivera with real game footage

could bring in substantial revenue. Someday, the data might tie into holographic baseball

in your den, so fans could physically “stand tall” against the pitcher with bases loaded!

Page 49: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 49

Teams don’t want to eliminate the human

factors that make baseball a great game,

but instead leverage big data to help them

test theories and make smarter decisions in

all facets of their business. Grady Little’s

“old school” gut feeling days and the era of

Earl Weaver note cards may be coming to

an end. The “new school” big data explosion shows the decision pool of data is growing

at a rate beyond the abilities of a simple, singular data mining approach162. Mastering

these new analytics should yield valuable competitive insights in all facets of the game.

The Yankees have realized this and employ 21 statisticians163 who apply their

knowledge of analytics, modeling, and computer science to the customer relationship,

and solve complex business problems by asking “what if” questions of the data.

Perhaps the day will come when the fan in their seat will ask their smart phone about the

likelihood of a runner on 1st base scoring if the batter gets a curveball and drives it into

right field. Or transit officials may someday reduce car traffic and improve mass transit

service through real-time insight into the number of fans attending a game, fan

sentiment, and the score. We want to have a great time at the game and owners want us

excited about their product. Trying to understand the data relationships is a big

challenge, and teams that succeed will increase their revenue. When teams succeed,

they enhance our national pastime. Play ball!

Page 50: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 50

Appendix - PITCHf/x Metadata

The following definitions support the inning_9.XML file used earlier in this paper.

Field This data value Description

des Foul ball Pitch description

des es Foul ball Pitch description Spanish

ID 507 Unique pitch number of the game

type S B ball, S strike, X in play

tfs 230849 Event #

tfs_zulu 2012-10-29 to 3:08:49Z Starting date and time of event

x 109.01 x-pixel at home plate

y 132.11 y-pixel at home plate

sv_id 121028_230849 date/time stamp when PITCHf/x first detects the pitch YYMMDD_hhmmss

start_speed 94.4 release point velocity mph

end_speed 86.0 crossing front of home plate velocity mph

sz_top 3.32 distance from ground to the top of the strike zone, feet

sz_bot 1.53 distance from ground to the bottom of the strike zone, feet

pfx_x 8.23 horizontal movement of the pitch between the release point and home plate, inches

pfx_z 10.3 vertical movement of the pitch between the release point and home plate, inches

px -0.315 left/right distance of the pitch from the middle of the plate as it crossed home plate, feet

pz 2.919 height of the pitch as it crossed the front of home plate, feet

x0 2.562 left/right distance of the pitch, measured at the initial point, feet

y0 50.0 distance from home plate to measure the initial parameters, feet

z0 6.035 height of the pitch measured at the initial point, feet

vx0 -10.7 x velocity of the pitch measured at initial point, feet per second

vy0 -137.96 y velocity of the pitch measured at initial point, feet per second

vz0 -6.156 z velocity of the pitch measured at initial point, feet per second

ax 15.708 x acceleration of the pitch measured at the initial point, feet per second per second

ay 33.556 y acceleration of the pitch measured at the initial point, feet per second per second

az -12.449 z acceleration of the pitch measured at the initial point, feet per second per second

break_y 23.7

distance from home plate to the point in the pitch trajectory where it achieved its greatest

deviation from the straight line between the release point and the front of home plate, feet

break_angle -43.3

angle from vertical to the straight line path from the release point to where the pitch crossed

the front of home plate, as seen from the catcher’s/umpire’s perspective, degrees

break_length 4.1

greatest distance between the pitch's trajectory between the release point to the front of

home plate, and the straight line from the release point to the front of home plate, inches

pitch_type FF

most probable pitch type according to a neural net classification algorithm developed by Ross

Paul of MLBAM - FF=Four-seam Fastball

type_confidence 0.676

value of classification algorithm’s output node corresponding to the most probable pitch

type, multiplied by 1.5 if the pitch is known by MLBAM as part of the pitcher’s repertoire

zone 1

location of the pitch based on the boxes into which the Gameday app divides the strike zone

for its hot/cold zone graphics

nasty 41 how hard a pitch it is to hit 0-100

spin_dir 141.469 direction of spin of the baseball, degrees

spin_rate 2651.72 measured as the number of rotations the ball, rpm

cc Comments?

mt Unknown?

Page 51: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 51

Footnotes

1 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played

2 http://espn.go.com/mlb/player/hotzones/_/id/5375/kevin-youkilis

3 http://www.fangraphs.com/graphs.aspx?playerid=1935&position=1B/3B&page=0&type=mini

4 http://blog.ceciliatan.com/?p=478

5 Adapted - http://www.ericcressey.com/troubleshooting-baseball-hitting-timing-is-not-always-the-problem

6 www.providencejournal.com/sports/red-sox/content/20120316-baseball-vision-when-20-20-eyesight-just-wont-cut-it.ece

7 “The Physics of baseball”, by Robert K. Adair, ISBN 0-06-008436-7, Page 39

8 http://www.bostonbaseball.com/whitesox/baseball_extras/physics.html

9 http://www.sportvision.com/baseball/fieldfx

10 http://baseball.physics.illinois.edu/FieldFX-TDR-GregR.pdf

11

http://www.brooksbaseball.net/pfxVB/pfx.php?s_type=2&sp_type=1&year=2012&month=8&day=4&batterX=&pitchSel=453329&game=gid_2012_08_04_minmlb_bosmlb_1/&prevGame=gid_2012_08_04_minmlb_bosmlb_1/&prevDate=84&inning1=y 12

http://princeofslides.blogspot.com/2011/05/sab-r-metrics-gif-movies-and-pitch.html 13

http://www.baseball-reference.com/players/r/riverma01.shtml 14

http://www.baseball-reference.com/players/m/martipe02.shtml 15

http://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=1&season=2013&month=0&season1=2013&ind=0&team=0,ss&rost=0&age=&filter=&players=0 16

http://www.fangraphs.com/heatmap.aspx?playerid=844&position=P&pitch=FC 17

http://articles.latimes.com/2013/jul/29/sports/la-sp-0730-mariano-rivera-20130730 18

http://en.wikipedia.org/wiki/Elias_Sports_Bureau 19

http://sabr.org/sabermetrics 20

http://www.predictiveanalyticsworld.com/sanfrancisco/2012/presentations/pdf/Day2_1430_Vartanian.pdf 21

http://baseballanalysts.com/archives/2006/07/empirical_analy_1.php 22

http://media.hometeamsonline.com/photos/baseball/SOBRATO/TG_Strategy_Presentation.pdf 23

http://turbotodd.wordpress.com/2011/10/26/information-on-demand-2011-a-data-driven-conversation-with-michael-lewis-billy-beane/ 24

http://www.baseball-reference.com/players/d/davisch02.shtml 25

http://deadspin.com/2013-payrolls-and-salaries-for-every-mlb-team-462765594 26

http://baseballplayersalaries.com/ 27

http://www.sloansportsconference.com/wp-content/uploads/2012/02/98-Predicting-the-Next-Pitch_updated.pdf 28

http://sports.yahoo.com/fantasy/blog/roto_arcade/post/Been-Caught-Stealing-MLB-warns-Phillies-about-s?urn=fantasy,240535 29

http://www.geekwire.com/2013/major-league-baseball-dominates-big-data-mobile-game/ 30

http://seanlahman.com/blog/wp-content/uploads/2013/08/SABR-43-presentation.pdf 31

http://espn.go.com/mlb/player/stats/_/id/1150/tony-gwynn 32

http://sportsillustrated.cnn.com/vault/article/magazine/MAG1115578/2/index.htm 33

http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 34

http://www.sloansportsconference.com/?p=6137 35

http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 36

http://www.gartner.com/it-glossary/big-data/ 37

http://www.sloansportsconference.com/wp-content/uploads/2013/Slides/205/HP%20Vertica.pdf 38

http://www.sportvision.com/ 39

“The Technology of Baseball” by Thomas K. Adamson. ISBN 978-1-4296-9955-6, P.15 40

http://www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 41

http://blogs.wsj.com/cio/2013/07/15/baseball-all-stars-data-gets-more-sophisticated-with-field-fx/ 42

http://en.wikipedia.org/wiki/1st_&_Ten_(graphics_system) 43

http://www.dnallen.com/stuff/AllenSkillVLuck11.pdf 44

“Broadcasting Baseball: A History of the National Pastime on Radio and Television”, ISBN 978-0-7864-4644-5, P. 204 45

“Why a Curveball Curves”, ISBN 978-1-58816-794-1. Page 63, “The Physics of a Curveball” by Dr. Peter Brancazio, my professor at Brooklyn College, 1971. The Magnus effect is the top spin causing the ball to move downward in addition to gravity, back spin causing the ball to rise, all due to the ball’s interaction with the air and the wake it creates. 46

http://sportsvideo.org/main/blog/2010/10/tbs-sportvision-offer-hd-viewers-new-angles-with-pitchfx/ 47

http://www.kbardesign.com/141266/1404601/motion-design/sportvision-baseball-presentation 48

http://www.hardballtimes.com/main/blog_article/ichiro-strike-three/ 49

“Scorecasting: The Hidden Influences Behind How Sports Are Played and Games Are Won” by Tobias Moskowitz, ISBN 978-0-307-59180-7, P.14. 50

http://www.brooksbaseball.net/pfxVB/zoneTrack.php?&game=gid_2012_06_18_balmlb_nynmlb_1/&innings=yyyyyyyyy&month=06&day=18&year=2012 51

http://rayscoloredglasses.com/2013/05/09/velocity-concerns-for-rays-david-price-extend-beyond-his-fastball/ 52

http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6%2F18%2F2012&to=6%2F18%2F2012

Page 52: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 52

53

http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6/18/2012&to=6/18/2012 54

http://pitchfx.texasleaguers.com/readingcharts.php 55

http://www.fangraphs.com/pitchfx.aspx?playerid=8700&position=P 56

http://sportsillustrated.cnn.com/2011/writers/tom_verducci/04/12/fastballs.trackman/index.html 57

http://baseballanalysts.com/archives/fx_visualizatio_1/ 58

An Introduction to TrackMan Baseball - http://baseballanalysts.com/archives/fx_visualizations/002587-print.html 59

http://www.fangraphs.com/pitchfxg.aspx?playerid=8700&position=P&season=2011&date=0&dh=0 60

http://wapc.mlb.com/play/?content_id=26750779&topic_id=&c_id=mlb&tcid=vpp_copy_26750779&v=3 61

http://www.kbardesign.com/141266/1400894/motion-design/pitchfx-data-visualization 62

http://msn.foxsports.com/mlb/player/miguel-cabrera/hotzone/140955?q=miguel-cabrera 63

http://pitchfx.texasleaguers.com/batter/408234/?pitchers=A&count=AA&pitches=AA&from=1/1/2012&to=12/31/2012 64

http://www.sportvision.com/news/research-confirms-hard-throwers-advantage 65

http://wapc.mlb.com/play/?c_id=mlb&content_id=20171503&topic_id=7417714&v=3 66

http://www.sportvision.com/news/baseballs-strike-zone-illusionist 67

http://www.sportvision.com/news/baseballs-strike-zone-illusionist 68

http://phys.csuchico.edu/baseball/talks/AAPT(Jul-2013)/slides.pdf 69

http://baseball.physics.illinois.edu/TrackingTechnologiesBaseball.pdf 70

http://www.sportvision.com/baseball/hitfx 71

http://www.sportvision.com/baseball/hitfx 72

http://sportsvideo.org/main/blog/2013/07/sportvisions-hit-speed-data-making-broadcast-debut-on-fox-sports-regional-networks/ 73

http://seamheads.com/2010/05/09/meet-the-new-park-factors-part-ii/ 74

http://www.beyondtheboxscore.com/2009/6/6/900817/graph-of-the-day-average-batted 75

http://baseballengineer.com/2010/04/13/meet-the-new-park-factors-–-part-ii/ 76

http://www.hittrackeronline.com/detail.php?id=2012_3635&type=hitter 77

pitchfx.texasleaguers.com/pitcher/461833/?batters=A&count=AA&pitches=AA&from=4/1/2012&to=9/30/2013 78

http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played 79

http://www.billjamesonline.com/best_defensive_teams_of_the_decade/ 80

http://www.baseballnation.com/2011/4/23/2128357/will-fieldf-x-go-public 81

http://www.businessinsider.com/fieldfx-sportvision-technology-baseball-tracking-2011-03 82

http://en.wikipedia.org/wiki/Rickey_Henderson 83

http://www.fangraphs.com/community/2010-pitchfx-summit-recap/ 84

http://www.alliedvisiontec.com/apac/news/news-display/article/lets-play-ballbrmachine-vision-cameras-in-professional-baseball.html 85

http://www.baseballprospectus.com/article.php?articleid=11869 86

New Technologies in Baseball, Aug 7, 2010, Rand Pendleton, Sportvision Dir Video Broadcast Development 87

http://baseballanalysts.com/archives/fx_visualizatio_1/ 88

http://www.baseballprospectus.com/article.php?articleid=11869 89

http://www.slideshare.net/timothyf/fisher-baseball-baseball-stats-overload?from_search=13 90

http://www.seanlahman.com/2013/08/baseball-in-the-age-of-big-data/ 91

http://www.baseballprospectus.com/article.php?articleid=11869 92

http://www.kbardesign.com/141266/1429848/motion-design/baseball-reaction-time-concepts 93

www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 94

http://www.intelfreepress.com/news/moneyball-brings-big-data-to-silicon-valleys-minor-league/6239 95

http://en.wikipedia.org/wiki/Pythagorean_expectation 96

http://60ft6in.com/2012/12/09/technology-for-the-fans/ 97

http://www.nytimes.com/2009/07/10/sports/baseball/10cameras.html?_r=1& 98

http://www.youtube.com/watch?v=HIJRZW3AoOw 99

http://www.fangraphs.com/library/business/revenue-sharing/ 100

http://www.forbes.com/mlb-valuations/list/ 101

http://espn.go.com/blog/statsinfo/tag/_/name/defensive-runs-saved 102

https://infocus.emc.com/william_schmarzo/using-the-big-data-strategy-document-to-win-the-world-series/ 103

http://www.cnbc.com/id/38046122 104

http://articles.latimes.com/2011/oct/21/sports/la-sp-world-series-tv-20111022 105

http://www.tauntongazette.com/news/x1565405444/World-Series-a-boost-for-local-business 106

http://sports.newsday.com/long-island/data/baseball/mlb-salaries-2013/ 107

http://www.baseball-reference.com/players/l/leecl02.shtml 108

http://www.fangraphs.com/statss.aspx?playerid=1636&position=P#battedball 109

http://www.cbssports.com/mlb/salaries 110

http://wiki.answers.com/Q/Which_countries_play_baseball 111

http://blogs.sas.com/content/cokins/2011/08/23/moneyball-turning-the-odds-on-the-casino-with-analytics/ 112

http://www.minorleagueball.com/2013/1/28/3925786/2013-baseball-farm-system-rankings 113

http://progressive-economy.org/files/2012/05/fact512.opening.day_1.pdf 114

http://www.sloansportsconference.com/?p=7809 115

http://www.smartkage.com/#/services/professional 116

https://www.prbuzz.com/sports/114405-more-good-news-for-boston-area-baseball-players-and-fans-of-all-ages.html 117

http://sabr.org/analytics/speakers/2012 118

http://www.reuters.com/article/2011/08/25/idUS162922+25-Aug-2011+MW20110825 119

http://www.yarcdata.com/Solutions/baseball-analytics.php

Page 53: A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN BASEBALL€¦ · The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd ... hero and

2014 EMC Proven Professional Knowledge Sharing 53

120

http://vincegennaro.mlblogs.com/2013/04/22/clustering-pitchers-by-similarity-part-1/ 121

http://www.yarcdata.com/Resources/video9.php 122

http://www.forums.mlb.com/n/pfx/forum.aspx?nav=messages&webtag=ml-washington&tid=38839 123

http://www.csnne.com/blog/red-sox-talk/red-sox-introduce-new-ticket-pricing-structure?p=ya5nbcs&ocid=yahoo 124

http://www.statista.com/topics/968/major-league-baseball/#chapter3 125

http://www.ocregister.com/articles/licensed-347956-products-major.html 126

http://www.askmen.com/sports/business/14_sports_business.html 127

http://washington.nationals.mlb.com/was/ticketing/ultimate_ballpark_access.jsp 128

http://emerging.uschamber.com/events/2013/bhs-big-data-and-why-it-matters 129

http://nationals.brochure-mlb.com/2013/faqs.php 130

http://newyork.yankees.mlb.com/mobile/attheballpark/index.jsp?c_id=nyy 131

http://www.nwherald.com/2013/07/12/cubs-get-signage-approval-ending-stadium-revenue-excuses/atyifb4/ 132

http://www.baseball-reference.com/teams/NYM/attend.shtml 133

http://bats.blogs.nytimes.com/2013/03/05/attendance-and-revenue-fall-at-citi-field/?_r=0 134

http://www.marketwatch.com/story/baseball-prospectus-announces-multi-year-partnership-with-mlbam-2013-09-04 135

http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/ 136

http://www.fastcompany.com/1822802/mlb-advanced-medias-bob-bowman-playing-digital-hardball-and-hes-winning 137

http://www.bloomberg.com/video/applying-big-data-thinking-to-sports-LZvCgdCFTxGYMtBXCVGm6g.html 138

http://sabr.org/latest/2013-sabr-analytics-conference-research-presentations - transcription of Vince Gennaro, "The Big Data Approach to Baseball Analytics" 139

http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/ 140

http://www.bloomberg.com/news/2013-03-28/yankees-value-surges-to-2-3-billion-on-yes-deal-forbes-says.html 141

http://sportsbusinessnews.com/content/big-business-major-league-baseball-and-television 142

http://www.nytimes.com/2013/01/29/sports/baseball/dodgers-sweet-tv-deal-will-taste-bitter-to-fans.html 143

www.bloomberg.com/news/2010-10-03/fox-says-ad-slots-for-baseball-playoff-series-world-series-are-90-sold.html 144

http://www.adweek.com/news/television/fox-sells-out-mlb-all-star-game-141581 145

http://milwaukee.brewers.mlb.com/mil/sponsorship/demographics/index.jsp 146

http://www.boston.com/sports/baseball/redsox/articles/2007/04/25/sox_have_dice_k_but_rivals_reaping_ad_dollars/ 147

http://www.forbes.com/sites/jasonbelzer/2013/02/26/automated-insights-poised-to-revolutionize-sports-media/ 148

http://www.sportce.com/red-sox-beat-tigers-5-2-advance-to-world-series-automated-insights/ 149

http://statsheet.com/sites 150

http://statsheet.com/about 151

http://www.youtube.com/watch?feature=player_embedded&v=M9oeJDkBjqA 152

http://upstart.bizjournals.com/companies/startups/2013/05/22/automated-insights-turns-data-to-stories.html?page=all 153

U.S. Patent US 20110246182A1 154

U.S. Patent US 20110246182A1 155

http://itknowledgeexchange.techtarget.com/cio/big-data-meets-automated-content-development/ 156

http://www.globalaffairs.org/forum/threads/computerised-journalism.62888/ 157

www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html 158

http://wiki.answers.com/Q/How_many_colleges_are_there_in_the_US 159

http://www.7x7.com/tech-gadgets/lightspeed-venture-s-barry-eggers-how-big-data-disrupting-baseball 160

http://hadoop.apache.org/ 161

http://pivotallabs.com/patrick-mcfadin-killer-apps-using-apache-cassandra/ 162

http://resources.idgenterprise.com/original/AST-0082578_Turn_Big_Data_Inward_With_IT_Analytics.pdf 163

http://www.ft.com/cms/s/2/3f5cc88c-0b21-11e1-ae56-00144feabdc0.html#axzz1dkG8fZcd

EMC believes the information in this publication is accurate as of its publication date.

The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC

CORPORATION MAKES NO RESPRESENTATIONS OR WARRANTIES OF ANY KIND

WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND

SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR

FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires

an applicable software license.