data scientist (analytics)

What is Data Science?

Upload: sharfaraj-nowaz-sayem

Post on 21-Jul-2016


Page 1: Data Scientist (Analytics)

What is Data Science?


Data science is the study of the generalizable extraction of knowledge from data. It incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. "Data science" is also a buzzword, often used interchangeably with analytics or big data, and often abused for marketing anything involving data processing, in particular to re-brand existing competitive intelligence and business analytics approaches.


Figure: Drew Conway’s Venn diagram of data science


Data Scientist


A data scientist solves complex data problems by employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work with various elements of computer science, mathematics, and statistics. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.


Desirable qualities of a Data Scientist


• Data grappling skills: they should know how to move data around and manipulate data with some programming language or languages.
• Data viz experience: they should know how to draw informative pictures of data. That should in fact be the very first thing they do when they encounter new data.
• Knowledge of stats, error bars, confidence intervals: ask them to explain this stuff to you. They should be able to.


• Experience with forecasting and prediction, both general and specific (ex): lots of variety here, and if you have more than one data scientist position open, I’d try to get people from different backgrounds (finance and machine learning, for example) because you’ll get great cross-pollination that way.
• Great communication skills: data scientists will be a big part of your business and will contribute to communications with big clients.
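The "error bars, confidence intervals" item in the list above can be made concrete with a small sketch. It computes an approximate 95% confidence interval for a sample mean using the normal approximation. The function name `mean_confidence_interval` and the sample values are assumptions made up for illustration; they do not come from the slides.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Uses the normal approximation mean +/- z * s / sqrt(n). For small samples
    a t-quantile would be more appropriate than z = 1.96.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = z * std_err
    return mean, mean - half_width, mean + half_width


# Hypothetical measurements, made up for this illustration.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
mean, low, high = mean_confidence_interval(sample)
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval's half-width is what would be drawn as an error bar around the mean in a plot.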


Why are we statisticians here?


There is a debate in the data science community over whether statisticians are needed in the field. The following reasons show that statistics, and statisticians, are a vital part of data science.

"Data grappling skills" are things we have learnt along the way in modern regression and advanced data analysis, which between them guarantee intensive R usage. These are things we explicitly teach in statistical computing, with even more R.

"Data viz experience" begins with our introductory Statistics classes, and then goes on in great depth in statistical graphics and visualization, with even more of the accompanying R. The habit of starting to understand any new data by drawing pictures is certainly something we inculcate.


"Knowledge of stats, error bars, confidence intervals" needs no elaboration.

"Experience with forecasting and prediction": again, both regression and advanced data analysis are full of this.

"Great communication skills": graphics, regression, and advanced data analysis all require, and grade on, the ability to write comprehensible and useful data analysis reports. The research projects class involves a lot of this, as well as regular oral presentations. It would be good if we did more on this front, however.


Big Data


Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.


Alternatively, Big Data can be defined as very large distributed aggregations of loosely structured data, often incomplete and inaccessible:
• Petabytes/exabytes of data
• Millions/billions of people
• Billions/trillions of records
• Loosely structured and often distributed data
• Flat schemas with few complex interrelationships
• Often involving time-stamped events
• Often made up of incomplete data
• Often including connections between data elements that must be probabilistically inferred
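Several of the characteristics above (loosely structured records, time-stamped events, incomplete data) can be illustrated with a small streaming sketch. It assumes a hypothetical log of JSON-lines records with an optional `timestamp` field in ISO 8601 form. The field names, the sample records, and the function `count_events_per_day` are all made up for illustration.

```python
import json
from collections import Counter


def count_events_per_day(lines):
    """Stream over JSON-lines records and count events per calendar day.

    Records are processed one at a time, so memory use depends on the number
    of distinct days, not on the number of records.
    """
    per_day = Counter()
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # malformed record
            continue
        timestamp = record.get("timestamp") if isinstance(record, dict) else None
        if not isinstance(timestamp, str) or len(timestamp) < 10:
            skipped += 1  # incomplete record: no usable timestamp
            continue
        per_day[timestamp[:10]] += 1  # "YYYY-MM-DD" prefix of the timestamp
    return per_day, skipped


# Hypothetical event log; in practice `lines` could be an open file handle.
raw_lines = [
    '{"timestamp": "2016-07-20T09:15:00", "user": "a", "action": "view"}',
    '{"timestamp": "2016-07-20T09:17:30", "action": "click"}',
    '{"user": "b", "action": "view"}',
    'not valid json',
    '{"timestamp": "2016-07-21T11:02:10", "user": "a"}',
]
counts, skipped = count_events_per_day(raw_lines)
print(dict(counts), "skipped:", skipped)
```

Because the function only iterates once and never holds the full log in memory, the same pattern applies to files far larger than this toy input, though truly large data sets would call for distributed tooling.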


Some examples of Big Data problems are:

Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.

By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.

Hospitals are analyzing medical data and patient records to predict those patients that are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
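As a rough illustration of the hospital readmission example, the sketch below fits a tiny logistic regression by gradient descent. This is only one possible technique, chosen here as an assumption; the slides do not say which method hospitals use. The two features, the eight toy patients, and the function names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this patient under the current model.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def readmission_risk(weights, bias, x):
    """Estimated probability of readmission for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up toy data: [prior admissions in the past year, length of stay in days / 10].
patients = [[0, 0.2], [1, 0.3], [0, 0.5], [3, 0.9],
            [4, 1.2], [2, 1.0], [5, 0.8], [1, 0.1]]
readmitted = [0, 0, 0, 1, 1, 1, 1, 0]

weights, bias = fit_logistic(patients, readmitted)
low_risk = readmission_risk(weights, bias, [0, 0.2])
high_risk = readmission_risk(weights, bias, [4, 1.1])
print(f"low-risk patient: {low_risk:.2f}, high-risk patient: {high_risk:.2f}")
```

A hospital could rank discharged patients by such a score and intervene for those at the top, which is the use case the slide describes. A real model would need far more data, validated features, and careful evaluation.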


Big Data management tools we will use in the future:

• RHadoop
• Hive
• Pig
• Python