Post on 21-Jan-2017

Page 1: Final INFO-B501_Prashnti_Sravani_Kunle

ANALYZING DATA EXTRACTED FROM BREAST CANCER SURVEILLANCE CONSORTIUM TO DETERMINE THE MITIGATING FACTORS OF BREAST CANCER IN U.S. WOMEN

Prasanthi Kodthala, Sravani Vemireddy, Olakunle Oladiran

INTRODUCTION

Over the years, various forms of cancer have become prevalent across the world. Breast cancer appears to be the leading malignant disease among women, with about 14.6% of new cases, as reported by the National Cancer Institute.

Our aim is to examine the factors that could point us toward the causes of breast cancer.

Source: http://seer.cancer.gov/statfacts/html/breast.html

BACKGROUND

• In the United States, breast cancer is the most common cancer in women. As of 2016, there are about 2,300 new cases of breast cancer in men and about 246,660 new cases in women each year.

• The precise causes of breast cancer are still unclear.

• Risk factors for breast cancer include age, family history, and menopause, among others.

BACKGROUND (CONT’D)

Statistical Analyses as a Tool

We use statistical tools to observe the relationships between the factors mentioned above, which provides some insight for our study.

It is worth noting that these factors may or may not depend on one another. The correlation analysis we employ measures the strength of the relationship between factors, but it does not imply that the factors are responsible for the prevalence of breast cancer. We hope that the analyses point to a direction for better understanding how to reduce breast cancer prevalence.

DATA OVERVIEW & HYPOTHESIS

The dataset used to examine the factors that could provide insight into breast cancer prevention was extracted with permission from the Breast Cancer Surveillance Consortium (BCSC), a research resource for studies designed to assess the delivery of breast cancer screening and related patient outcomes in the US.

We chose this dataset for the following reasons:

• BCSC is a credible source of datasets for most breast cancer-related issues.

• It is supported by a Statistical Coordinating Center and by the National Cancer Institute (NCI).

OBJECTIVES

The main objective of this research is to investigate the stated hypotheses, which center on the major factors that may provide insight into the prevalence of breast cancer.

We analyzed selected datasets that may help reveal the factors to be managed, which could lead toward a cancer-free society.

Hypotheses:

• Determine the extent of correlation between parameters such as menopause, hormone therapy use, ever having given birth, previous breast cancer, and family history of breast cancer

• Evaluate screening tests for breast cancer (i.e., digital mammogram and ultrasound) to determine which would be more beneficial

• Determine whether level of education is a factor in breast cancer awareness

METHODS USED

1. Correlation Analysis:

Pearson product-moment correlation coefficient

\[
\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y}
\]

• Cov means covariance

• $\mathbb{E}$ is the expectation operator

• $\mu_X$ is the expected value of random variable $X$

• $\sigma_X$ is the standard deviation of $X$

• $\rho_{X,Y}$ ranges from -1 to +1

2. Descriptive Analysis:

• Bar chart

• Histogram
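As a minimal sketch of the Pearson formula listed under Correlation Analysis, the Python below computes the coefficient directly from its definition, using the population covariance and standard deviations. The function name `pearson_r` and the sample values are illustrative assumptions; they are not BCSC data and not the project's actual code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: the mean of the products of deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Population standard deviations.
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


# Made-up illustrative values (NOT BCSC data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

A perfectly increasing linear relationship gives a value at the +1 end of the range and a perfectly decreasing one gives a value at the -1 end, which matches the stated range of the coefficient.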

BREAKDOWN OF THE PARAMETERS

The following datasets were extracted from the spreadsheet provided by the Breast Cancer Surveillance Consortium:

• Number of women with or without menopause

• Number of women on or not on hormone therapy

• Number of women with or without previous breast cancer

• Number of women with or without a family history of breast cancer

• Number of women who had or had not ever given birth

For insights into screening tests:

• Number of women who had undergone ultrasound screening

• Number of women who had undergone digital mammograms

For level of education in relation to breast cancer awareness:

• Level of education of women in the dataset

WORKFLOW

1. Data extraction from the BCSC website to a spreadsheet

2. Created tables in the MySQL database

3. Established a connection to the MySQL database for data retrieval and analysis in a Python GUI

4. Performed correlation analysis between the parameters

5. Data visualization using a histogram to show level of education

6. Comparative analysis between screening tests using bar charts

DATA EXTRACTION & ANALYSIS

Tools Used for Data Analysis

Data Extraction and Normalization

• Excel 2013

Database Design and Connection

• MySQL Workbench

• PuTTY for Secure Shell tunneling

Statistical Analyses

• Anaconda GUI for Python

DATA EXTRACTION & ANALYSIS (CONT’D)

Step One

With the permission of BCSC, tables were extracted from their website, and MS Excel 2013 was used to normalize some of the datasets because they contained ‘Missing’ values.

Step Two

Tables were created in the MySQL database for the variables to be analyzed, and queries were written to insert the data into those tables.
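The table creation and insertion described in Step Two can be sketched as follows. This is an illustration under stated assumptions, not the project's actual schema: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL so that it runs without a server, and the table name `risk_factors`, its columns, and the inserted counts are all hypothetical.

```python
import sqlite3

# Stand-in for the project's MySQL database: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table: one row per group of women, one count per parameter.
cur.execute(
    """
    CREATE TABLE risk_factors (
        id INTEGER PRIMARY KEY,
        menopause INTEGER,
        hormone_therapy INTEGER,
        ever_given_birth INTEGER,
        previous_breast_cancer INTEGER,
        family_history INTEGER
    )
    """
)

# Made-up placeholder counts (NOT BCSC data).
rows = [
    (120, 45, 98, 7, 18),
    (150, 60, 131, 9, 22),
    (90, 30, 70, 4, 15),
]

# Parameterized insert; SQLite uses "?" placeholders, whereas MySQL
# drivers for Python typically use "%s".
cur.executemany(
    """
    INSERT INTO risk_factors
        (menopause, hormone_therapy, ever_given_birth,
         previous_breast_cancer, family_history)
    VALUES (?, ?, ?, ?, ?)
    """,
    rows,
)
conn.commit()

# Read the data back to confirm the insert worked.
row_count = cur.execute("SELECT COUNT(*) FROM risk_factors").fetchone()[0]
first_row = cur.execute(
    "SELECT menopause, hormone_therapy FROM risk_factors ORDER BY id"
).fetchone()
print(row_count, first_row)
conn.close()
```

Using parameterized inserts rather than string formatting keeps the values separate from the SQL text, which avoids quoting errors when loading spreadsheet data.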

DATA EXTRACTION & ANALYSIS (CONT’D)

Table Creation in the Database

DATA EXTRACTION & ANALYSIS (CONT’D)

Inserting Data into the Database

DATA EXTRACTION & ANALYSIS (CONT’D)

Step Four

Prepare Anaconda by installing and importing the required Python libraries and tools, such as:

• NumPy

• Pandas (which comes with Anaconda by default)

• Jupyter Notebook

• Matplotlib

• Seaborn

• SSH (for localhost tunneling)

• MySQL_DB

Step Five

Secure Shell tunneling had to be configured so that Anaconda could connect to the database. This was done through PuTTY.

DATA EXTRACTION & ANALYSIS (CONT’D)

Python Libraries Installation and SSH Tunneling

DATA EXTRACTION & ANALYSIS (CONT’D)

Step Six

Jupyter Notebook was started and the database connection was initiated in the notebook. Queries were written in Python to pull the required datasets from the database.
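Pulling a dataset from the database into the notebook might look like the sketch below. It assumes pandas is installed, and it again substitutes an in-memory `sqlite3` database for the MySQL connection that was reached through the SSH tunnel. The `screening` table, its columns, and its values are hypothetical placeholders.

```python
import sqlite3

import pandas as pd

# Stand-in connection; in the project this would be a MySQL connection
# opened against the tunneled local port.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE screening ("
    "year INTEGER, digital_mammogram INTEGER, ultrasound INTEGER)"
)
# Made-up placeholder counts (NOT BCSC data).
conn.executemany(
    "INSERT INTO screening VALUES (?, ?, ?)",
    [(2001, 10, 3), (2002, 14, 5), (2003, 19, 8)],
)
conn.commit()

# Run the SQL query and load the result set straight into a DataFrame.
df = pd.read_sql_query(
    "SELECT year, digital_mammogram, ultrasound FROM screening ORDER BY year",
    conn,
)
conn.close()
print(df)
```

Loading query results into a DataFrame means the later descriptive and correlation steps can work on one in-memory table instead of issuing further queries.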

DATA EXTRACTION & ANALYSIS (CONT’D)

Table of the Parameters Used for the First Hypothesis

DATA EXTRACTION & ANALYSIS (CONT’D)

Step Six

A table giving a general description of the data was produced, along with a correlation table showing the strength and direction of the relationships between the variables. Scatterplots of the correlations were also generated using the seaborn library.
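The descriptive table and the correlation table described above can be produced with pandas roughly as follows. This is a sketch assuming pandas is installed; the column names and values are made-up placeholders rather than the BCSC parameters.

```python
import pandas as pd

# Made-up placeholder values (NOT BCSC data); column names are hypothetical.
df = pd.DataFrame(
    {
        "hormone_therapy": [10, 20, 30, 40, 50],
        "ever_given_birth": [12, 19, 33, 38, 52],
        "family_history": [9, 7, 6, 4, 2],
    }
)

# General description: count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# Pairwise Pearson correlation between all columns.
corr = df.corr(method="pearson")

print(summary)
print(corr)
```

The sign of each entry in `corr` gives the direction of the relationship and its magnitude gives the strength. With seaborn installed, `seaborn.pairplot(df)` is one way to draw scatterplots for every pair of columns.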

DATA EXTRACTION & ANALYSIS (CONT’D)

Correlation Table

DATA EXTRACTION & ANALYSIS (CONT’D)

Scatterplots

DATA EXTRACTION & ANALYSIS (CONT’D)

Step Seven

For the second hypothesis, a descriptive analysis comparing the rate of acceptance of digital mammograms and ultrasound was conducted using bar charts.

DATA EXTRACTION & ANALYSIS (CONT’D)

Step Eight

Descriptive analysis also shows that level of education could be a major determinant of breast cancer awareness.

Key:

Lower_Than_HS = Lower than High School

HS = High School

SC = Some College

CPCG = College/Post College Graduate

DISCUSSION

Based on the results of the analyses, we noticed a few positive and negative relationships among the five parameters we analyzed. The strength and direction of these relationships break down as follows:

Positive correlations

Ever Given Birth and Current Hormonal Therapy

Not Given Birth and Family History

Current Hormonal Therapy and Previous Breast Cancer

Negative correlations

No Previous Breast Cancer and Current Hormonal Therapy

Not Given Birth and Current Hormonal Therapy

Given Birth and Family History

LIMITATIONS

• Although age and Body Mass Index (BMI) are major risk factors for breast cancer, we could not include them when checking correlations with the other parameters because of differences in the data.

• Because we could not find reliable recent datasets, we conducted our analyses on a retrospective dataset.

The takeaway is that other factors could have been analyzed, which might have given us deeper insight into reducing the likelihood of breast cancer.

FUTURE IMPLEMENTATIONS

We were able to build a framework for Python to pull data from a database and conduct various analyses on it. We hope that in the near future, when more recent datasets are available, it will be easier to conduct a full-scale analysis. Speculations, such as the suggestion that digital mammograms may themselves contribute to breast cancer, could also be examined. We would need to do the following:

• Refine the Python code and the embedded SQL queries to allow more flexibility and more automation, which would probably reduce the processing time of the system.

• Build a dashboard to simplify data extraction, transformation, and loading into the database, and to make the data readily available for descriptive analyses.

CONCLUSION

Examining the results of the correlation analysis, we found some insights regarding the relationships between variables such as:

• Current Hormonal Therapy use and Given Birth

However, we stress that correlation does not necessarily mean causation. The use of hormone therapy and having children is therefore not sufficient to conclude that a woman will have breast cancer. Factors other than those considered could be responsible for breast cancer. We assume that a randomized controlled experiment would be required to ascertain that these factors actually cause breast cancer.

As the years progressed, ultrasounds took over digital mammograms, which shows that the digital mammogram is the effective screening test for detecting cancer.

We therefore conclude, based on the results of the data analyzed, that level of education plays an important role in creating breast cancer awareness.

REFERENCES

• Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer

Surveillance Consortium (HHSN261201100031C).

• American Cancer Society: Cancer Facts and Figures 2016. Atlanta, Ga: American Cancer Society, 201

• Downey, Allen B.: Think Stats: Exploratory Data Analysis in Python (Version 2.0.27).

ACKNOWLEDGEMENTS

We acknowledge our professor, Dr. Purkayastha, for his help in giving direction to the research, especially in establishing a successful connection between Python and the MySQL database.
