ANALYZING DATA EXTRACTED FROM BREAST CANCER
SURVEILLANCE CONSORTIUM TO DETERMINE THE
MITIGATING FACTORS OF BREAST CANCER IN U.S. WOMEN
Prasanthi Kodthala, Sravani Vemireddy, Olakunle Oladiran
INTRODUCTION
Over the years, various forms of cancer have become prevalent
across the world. Breast cancer appears to be the leading form
of malignant disease, with about 14.6% of new cases among women,
as reported by the National Cancer Institute.
Our aim is to investigate the factors that could point us
toward the causes of breast cancer.
Source: http://seer.cancer.gov/statfacts/html/breast.html
BACKGROUND
• In the United States, breast cancer is said to be the most common cancer in women. As of 2016, there are about 2,300
new cases of breast cancer in men and about 246,660 new cases in women each year.
• The precise causes of breast cancer are still unclear.
• Risk factors for breast cancer include age, family history, and menopause, among others.
Statistical Analyses as a Tool
We use statistical tools to examine the relationships between the factors mentioned above, which may provide insights
for our study.
Note that these factors may or may not depend on one another, and that correlation, which we use to analyze the
relationships, does not imply that the factors are responsible for the prevalence of breast cancer. We hope the analyses
will point to directions for better understanding how to reduce breast cancer prevalence.
DATA OVERVIEW & HYPOTHESIS
The dataset used to examine the factors that could provide insights into breast cancer prevention was
extracted with permission from the Breast Cancer Surveillance Consortium (BCSC), a research resource for
studies designed to assess the delivery of breast cancer screening and related patient outcomes in the US.
We chose this dataset for the following reasons:
• BCSC is a credible source of datasets for most breast cancer-related issues.
• It is supported by a Statistical Coordinating Center and the National Cancer Institute (NCI).
OBJECTIVES
The main objective of this research is to investigate the hypotheses stated below, which concern the major factors that
may provide insights into the prevalence of breast cancer.
We analyzed selected datasets that may help reveal the factors to be managed, which could
lead toward a cancer-free society.
Hypotheses:
• Determine the extent of correlation between parameters such as menopause, hormone therapy use, Ever Given Birth,
previous breast cancer, and family history of breast cancer
• Evaluate screening tests (i.e., digital mammogram and ultrasound) for breast cancer to determine which is more
beneficial
• Determine whether level of education is a factor in breast cancer awareness
METHODS USED
1. Correlation Analysis:
Pearson product-moment correlation coefficient
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
• $\operatorname{Cov}$ means covariance
• $\mathbb{E}$ is the expectation operator
• $\mu_X$ is the expected value of random variable $X$
• $\sigma_X$ is the standard deviation of $X$
• $\rho_{X,Y}$ ranges from -1 to +1
2. Descriptive Analysis:
• Bar chart
• Histogram
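As a minimal sketch of the Pearson coefficient listed under Correlation Analysis, the function below computes it directly from the definition (covariance divided by the product of the standard deviations) using NumPy. The function name `pearson_r` and the sample values are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation from the definition:
    E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance: mean of the products of deviations.
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std uses the population formula (ddof=0) by default,
    # which matches the covariance computed above.
    return cov / (x.std() * y.std())


# Made-up illustrative values, not BCSC data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which is a convenient cross-check when applying the same calculation to real columns.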
BREAKDOWN OF THE PARAMETERS
The following datasets were extracted from the spreadsheet provided by the Breast Cancer Surveillance Consortium:
• Number of women with or without menopause
• Number of women who are or are not on hormone therapy
• Number of women with or without previous breast cancer
• Number of women with or without family history of breast cancer
• Number of women who had or had not ever given birth
For insights into screening tests:
• Number of women who had undergone ultrasound screening
• Number of women who had undergone digital mammograms
For level of education in relation to breast cancer awareness:
• Level of education of women in the dataset
WORKFLOW
1. Data extraction from the BCSC website to a spreadsheet
2. Created tables in the MySQL database
3. Established a connection to the MySQL database for data retrieval and analysis in a Python GUI
4. Performed correlation analysis between the parameters
5. Data visualization using a histogram to show level of education
6. Comparative analysis between screening tests using bar charts
Tools Used for Data Analysis
Data Extraction and Normalization
• Excel 2013
Database Design and Connection
• MySQL Workbench
• PuTTY for Secure Shell tunneling
Statistical Analyses
• Anaconda GUI for Python
DATA EXTRACTION & ANALYSIS
Step One
With the permission of BCSC, tables were extracted from their website, and MS Excel 2013 was used to normalize
some of the datasets because they contained ‘Missing’ values.
Step Two
Tables were created in the MySQL database for the variables to be analysed, and queries were written to insert the
data into those tables.
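The table creation and insert queries described in Step Two might look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL server, so that it runs without any setup. The table name `risk_factors`, its columns, and the row values are hypothetical placeholders, not the authors' schema or BCSC data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table: one row per year with counts for two parameters.
cur.execute(
    """
    CREATE TABLE risk_factors (
        year INTEGER PRIMARY KEY,
        hormone_therapy_yes INTEGER NOT NULL,
        given_birth_yes INTEGER NOT NULL
    )
    """
)

# Placeholder rows (made-up numbers, not BCSC data).
rows = [(2005, 120, 340), (2006, 115, 352), (2007, 98, 361)]

# Parameterized insert: values are passed separately from the SQL text.
cur.executemany(
    "INSERT INTO risk_factors (year, hormone_therapy_yes, given_birth_yes) "
    "VALUES (?, ?, ?)",
    rows,
)
conn.commit()

row_count = cur.execute("SELECT COUNT(*) FROM risk_factors").fetchone()[0]
```

Against a real MySQL server the SQL would be much the same, although common MySQL drivers for Python use `%s` rather than `?` as the parameter placeholder.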
Table Creation in the Database
Inserting Data into the Database
Step Four
Anaconda was prepared by installing and importing the required Python libraries and tools, such as:
• NumPy
• Pandas (which comes with Anaconda by default)
• Jupyter Notebook
• Matplotlib
• Seaborn
• SSH (for localhost tunneling)
• MySQL_DB
Step Five
Secure Shell tunneling had to be configured so that Anaconda could connect to the database. This was done
through PuTTY.
Python Libraries Installation and SSH Tunneling
Step Six
Jupyter Notebook was started and the database connection was opened from the notebook. Queries were written in Python
to pull the required datasets from the database.
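One common way to pull a table into a notebook is `pandas.read_sql_query`, sketched below. To keep the example runnable without a MySQL server or SSH tunnel, it again uses an in-memory `sqlite3` database as a stand-in. The table, columns, and values are hypothetical, and the source does not state which function the authors actually used.

```python
import sqlite3

import pandas as pd

# Stand-in database so the sketch runs without MySQL or an SSH tunnel.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE risk_factors ("
    "year INTEGER, hormone_therapy_yes INTEGER, given_birth_yes INTEGER)"
)
conn.executemany(
    "INSERT INTO risk_factors VALUES (?, ?, ?)",
    [(2005, 120, 340), (2006, 115, 352), (2007, 98, 361)],  # made-up rows
)
conn.commit()

# Run the query and load the result set into a DataFrame.
df = pd.read_sql_query("SELECT * FROM risk_factors ORDER BY year", conn)
conn.close()
```

With a tunnel in place, the same call would take a connection to the forwarded local port instead of the SQLite connection.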
Broadsheet of the Parameters Used for First Hypothesis
Step Six
A table giving a general description of the data was produced, along with a correlation table showing the strength and
direction of the relationships between the variables. Scatterplots of the correlations were also generated using the seaborn library.
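A descriptive table and a correlation table of this kind can be produced in pandas with `DataFrame.describe` and `DataFrame.corr`. The sketch below assumes a small DataFrame with hypothetical column names and made-up counts; it is not the study's data or necessarily the authors' exact code.

```python
import pandas as pd

# Made-up yearly counts for three hypothetical columns (not BCSC data).
df = pd.DataFrame(
    {
        "hormone_therapy_yes": [120, 115, 98, 90, 84],
        "given_birth_yes": [340, 352, 361, 370, 382],
        "family_history_yes": [60, 58, 63, 61, 65],
    }
)

# General description: count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# Pairwise Pearson coefficients; the sign gives the direction of each
# relationship and the magnitude gives its strength.
corr = df.corr(method="pearson")
```

With seaborn installed, `seaborn.pairplot(df)` is one way to draw the pairwise scatterplots, although the source does not say which seaborn function was used.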
Correlation Broadsheet
Scatterplots
Step Seven
For the second hypothesis, a descriptive analysis comparing the rate of acceptance of digital mammograms and
ultrasounds was conducted using bar charts.
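A grouped bar chart comparing the two screening tests could be drawn with Matplotlib roughly as follows. The years and counts are made-up placeholders chosen only to show the layout, and the axis labels are assumptions; none of it reflects the BCSC figures.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

years = [2005, 2006, 2007]   # hypothetical years
mammogram = [100, 150, 210]  # made-up counts, not BCSC data
ultrasound = [80, 85, 90]    # made-up counts, not BCSC data

x = np.arange(len(years))    # one group position per year
width = 0.4                  # width of each bar within a group

fig, ax = plt.subplots()
# Shift the two series left and right of each group position.
bars_m = ax.bar(x - width / 2, mammogram, width, label="Digital mammogram")
bars_u = ax.bar(x + width / 2, ultrasound, width, label="Ultrasound")
ax.set_xticks(x)
ax.set_xticklabels([str(year) for year in years])
ax.set_xlabel("Year")
ax.set_ylabel("Number of women screened")
ax.legend()
```

In a notebook the figure is displayed inline; elsewhere, `fig.savefig(...)` writes it to a file.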
Step Eight
Descriptive analysis also suggests that level of education could be a major determinant of breast cancer
awareness.
Key:
Lower_Than_HS = Lower than High School
HS = High School
SC = Some College
CPCG = College/Post College Graduate
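The counts behind an education-level chart could be tallied as in the sketch below, which maps the abbreviations from the key to their full labels. The list of responses is made up for illustration and is not BCSC data; the variable names are likewise assumptions.

```python
from collections import Counter

# Codes and labels taken from the key above.
EDUCATION_LABELS = {
    "Lower_Than_HS": "Lower than High School",
    "HS": "High School",
    "SC": "Some College",
    "CPCG": "College/Post College Graduate",
}

# Made-up responses, not BCSC data.
responses = ["HS", "SC", "CPCG", "CPCG", "Lower_Than_HS", "SC", "CPCG"]

counts = Counter(responses)

# Report every category in key order, including any with zero responses.
labelled_counts = {
    label: counts.get(code, 0) for code, label in EDUCATION_LABELS.items()
}
```

Passing the labels and counts to a bar-plotting function then gives one bar per education level.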
DISCUSSION
Based on the results of the analyses, we noticed several positive and negative relationships among the five parameters we
analyzed. The breakdown by direction of relationship is as follows:
Positive correlations
• Ever Given Birth and Current Hormonal Therapy
• Not Given Birth and Family History
• Current Hormonal Therapy and Previous Breast Cancer
Negative correlations
• No Previous Breast Cancer and Current Hormonal Therapy
• Not Given Birth and Current Hormonal Therapy
• Given Birth and Family History
LIMITATIONS
• Although age and body mass index (BMI) are major risk factors for breast cancer, we could not include them when
checking correlations with the other parameters because of differences in the data.
• Because we were not able to find reliable recent datasets, we conducted our analyses on a retrospective dataset.
As a result, other factors that could have given us deeper insights into reducing the
likelihood of breast cancer were not analyzed.
FUTURE IMPLEMENTATIONS
We built a framework for Python to pull data from a database and conduct various analyses on it. We
hope that in the near future, when more recent datasets become available, it will be easier to conduct a full-scale
analysis. Speculations, such as the claim that digital mammograms may themselves be responsible for breast cancer, could
also be examined. We would need to do the following:
• Refine the Python code and the embedded SQL queries to allow more flexibility and more automation,
which would probably reduce the processing time of the system.
• Build a dashboard to simplify data extraction, transformation, and loading into the
database, and to make the data readily available for descriptive analyses.
CONCLUSION
Examining the correlation results, we found some insights into the relationships between
variables, such as:
• Current Hormonal Therapy use and Given Birth
However, we stress that correlation does not necessarily mean causation. The correlation between
hormone therapy use and having children is therefore not sufficient to conclude that a woman will develop
breast cancer. Factors other than those considered here could be responsible for breast cancer.
We assume that a randomized controlled experiment would be required to establish that these
factors actually cause breast cancer.
As the years progressed, digital mammograms overtook ultrasounds, which suggests that the digital mammogram is
the more widely accepted screening test for detecting breast cancer.
We therefore conclude, based on the results of the data analyzed, that level of education plays an important role
in creating breast cancer awareness.
REFERENCES
• Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer
Surveillance Consortium (HHSN261201100031C).
• American Cancer Society: Cancer Facts and Figures 2016. Atlanta, Ga: American Cancer Society, 201
• Downey, Allen B. Think Stats: Exploratory Data Analysis in Python (Version 2.0.27).
ACKNOWLEDGEMENTS
We acknowledge the efforts of our professor, Dr. Purkayastha, for his help in giving direction to the
research, especially in establishing a successful connection between Python and the MySQL database.