Internship summary

Internship Summary, Junting Ma (Sarah)

Upload: junting-ma

Posted on 16-Jan-2017


TRANSCRIPT

Page 1: internship summary

Internship Summary
Junting Ma (Sarah)

Page 2: internship summary

Introduction
• Found mutations and conditions in 81 pages of clinical trial data
• Learned JavaScript and MongoDB
• Learned Python and made a web crawler
• Applied machine learning methods to classify publications
• Future work

Page 3: internship summary

JavaScript and MongoDB
• Found the top cancers in the top 100 cities by cancer occurrence and visualized the result.

Page 5: internship summary

• Counted the number of genes that appeared in each clinical trial and visualized the data.

Page 6: internship summary

• Found the corresponding drugs and publication IDs for each mutation.

Page 7: internship summary

• Counted the cancer trials in each medical group and visualized the data.

Page 8: internship summary

• Found the top cancers and medical groups, ranked by the number of trials in each phase.

Page 10: internship summary

Made a web crawler to get contact information for our company

• Motivation: The contact-search project involved a lot of simple, repetitive work: opening a page, looking for email addresses, copying and pasting, searching keywords on Google… We could save a lot of time by writing programs to do that work automatically.
• Improvement: The Python program does all of the above automatically.
• Result: Got about 500 email addresses from 300 medical institution names (searching four keywords for each institution).
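
The email-collection step could be sketched roughly as follows. This is a minimal illustration, not the program written during the internship: the regular expression, the helper names (`extract_emails`, `build_queries`, `fetch`), the sample page, and the four keywords are all assumptions made for the example.

```python
import re
from urllib.request import urlopen

# Simple pattern for typical addresses; it is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return the unique email addresses in text, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen

def build_queries(institution, keywords):
    """Pair one institution name with each search keyword."""
    return [f"{institution} {keyword}" for keyword in keywords]

def fetch(url, timeout=10):
    """Download a page as text; undecodable bytes are replaced, not fatal."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# Hypothetical page content and keywords, used only to exercise the helpers.
page = '<p>Contact: <a href="mailto:Info@example.org">info@example.org</a>, lab@example.org</p>'
emails = extract_emails(page)
queries = build_queries("Example Medical Center",
                        ["contact", "email", "staff", "directory"])
```

In a real run, each query would be sent to a search engine, the result pages downloaded with something like `fetch`, and `extract_emails` applied to each page.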

Page 11: internship summary

Applied machine learning methods to classify publications

Motivations:
• Classifying 40,000 publications manually would take a huge amount of time.
• Publications in the same class tend to share similar words or phrases in their title and purpose sections.
• Machine learning methods are well suited to classification problems.

Page 12: internship summary

Introduction to machine learning methods

• Supervised learning: Use a data set whose classes are already known (the training set) to build a model that predicts the classes of unknown data.
• Unsupervised learning: Group data based on similarity, without being given any training set.

Page 13: internship summary

• Idea: Use unsupervised learning methods to group our data based on the similarity of their titles and purposes.

Try different unsupervised learning methods to choose the best one for our data:
i) Principal Component Analysis (PCA)
ii) K-means clustering
iii) Hierarchical clustering

Page 14: internship summary

• 1) PCA
• Dimension reduction.
• Projects the data into a new space, trying to capture more features of the data using fewer dimensions.
• Result: Each new dimension in the new space captures only a few features of the data. Not good.
• Problem: Hard to interpret.

Page 15: internship summary

• 2) K-means clustering

• Problems:
• Only locally optimal and unstable (different starting points may give different results).
• Hard to decide which points to start with and how many groups to divide the data into.
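
The sensitivity to starting points can be seen in a small sketch. The code below is a plain implementation of the standard k-means iteration (Lloyd's algorithm) on four made-up 2-D points; the function names and the data are assumptions chosen for illustration, not the publication data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Run k-means from the given starting centers; return (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no change means the run has converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers

def within_ss(points, labels, centers):
    """Total squared distance from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Corners of a wide rectangle: the natural grouping is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels_a, centers_a = kmeans(points, [(0, 0), (4, 0)])
labels_b, centers_b = kmeans(points, [(0, 0), (0, 1)])
cost_a = within_ss(points, labels_a, centers_a)
cost_b = within_ss(points, labels_b, centers_b)
```

With the first pair of starting points the run settles on the left/right grouping (`cost_a` is 1.0); with the second it converges to a top/bottom grouping with a much larger cost (`cost_b` is 16.0). Both runs have converged, which is what "locally optimal" means here.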

Page 16: internship summary

• 3) Hierarchical clustering
• Start with every data point in its own group.
• Repeat: merge the two “closest” groups.
• Stop: when all groups have been merged into a single group.
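
The three steps above can be sketched in a few lines of Python. The slides do not say how the distance between two groups was measured, so this sketch assumes average linkage (the mean distance over all cross-group pairs); the function names and the 1-D sample values are also made up for illustration.

```python
from itertools import combinations

def average_linkage(group_a, group_b, items, dist):
    """Mean pairwise distance between two groups of item indices."""
    total = sum(dist(items[i], items[j]) for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))

def agglomerate(items, dist):
    """Merge the two closest groups until only one group remains.

    Returns the merge history as (group_a, group_b, height) tuples,
    which is the information a dendrogram draws.
    """
    groups = [[i] for i in range(len(items))]  # start: one group per item
    merges = []
    while len(groups) > 1:
        # Find the pair of groups with the smallest linkage distance.
        height, a, b = min(
            (average_linkage(groups[a], groups[b], items, dist), a, b)
            for a, b in combinations(range(len(groups)), 2)
        )
        merges.append((groups[a], groups[b], height))
        groups[a] = groups[a] + groups[b]  # new list; history entries stay intact
        del groups[b]                      # safe because a < b
    return merges

values = [1.0, 2.0, 6.0, 6.5, 12.0]  # made-up 1-D data
merges = agglomerate(values, lambda x, y: abs(x - y))
```

Each entry in `merges` records which two groups were joined and at what distance; cutting the tree at a chosen height corresponds to ignoring every merge above that height.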

Page 26: internship summary

• Advantages:
• 1. Gives us a structure for the whole picture.
• 2. Lets us divide our data according to the tree structure instead of guessing.
• 3. Lets us look into the data part by part.

Page 27: internship summary

Steps of processing our data

• 1) Count the occurrences of each word in each publication title and build a dictionary of all the words that appear.

• 2) Calculate the distance between publications in a high-dimensional space (each word in the dictionary represents one dimension).

• 3) Use the distances to build a dendrogram.
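
Steps 1 and 2 might look like the sketch below. The slides do not name the distance measure, so cosine distance is assumed here; the tokenizer, the function names, and the three sample titles are likewise hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())

def build_dictionary(titles):
    """Sorted list of every word that appears in any title."""
    return sorted({word for title in titles for word in tokenize(title)})

def count_vector(title, dictionary):
    """Word counts for one title, one dimension per dictionary word."""
    counts = Counter(tokenize(title))
    return [counts[word] for word in dictionary]

def cosine_distance(u, v):
    """1 minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Made-up titles, not taken from the publication set.
titles = [
    "EGFR mutation in lung cancer",
    "EGFR inhibitor for lung cancer",
    "Insulin therapy in diabetes",
]
dictionary = build_dictionary(titles)
vectors = [count_vector(title, dictionary) for title in titles]
d01 = cosine_distance(vectors[0], vectors[1])
d02 = cosine_distance(vectors[0], vectors[2])
```

Here `d01` is smaller than `d02` because the first two titles share more words. The full matrix of pairwise distances is what step 3 turns into a dendrogram.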

Page 28: internship summary

Outputs:

Page 29: internship summary

Cut at 0.45
Problem: Hard to know what each group means.
Solution: Apply supervised learning, using the labels from the dendrogram, to build a tree model with keywords.

Page 30: internship summary

• Usage:
1) Shows us the features of the publications in each group.
2) Gives us some useful keywords that we could put into our search engine.
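
To illustrate how a tree model surfaces keywords, the sketch below picks the single word whose presence best separates the groups, using information gain, which is one common criterion for choosing a decision tree's first split. The slides do not say which tree algorithm or criterion was used, so this is an assumption, and the word sets and group labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of group labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """Entropy reduction from splitting on whether a title contains word."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without = [lab for doc, lab in zip(docs, labels) if word not in doc]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder

def best_keyword(docs, labels):
    """Word with the highest information gain over the whole vocabulary."""
    vocabulary = sorted(set().union(*docs))
    return max(vocabulary, key=lambda w: information_gain(docs, labels, w))

# Hypothetical titles (as word sets) and the groups a dendrogram cut gave them.
docs = [
    {"egfr", "lung", "cancer"},
    {"egfr", "inhibitor", "cancer"},
    {"brca1", "breast", "cancer"},
    {"ovarian", "tumor", "screening"},
]
labels = [1, 1, 2, 2]
keyword = best_keyword(docs, labels)
gain = information_gain(docs, labels, keyword)
```

In this toy data the chosen keyword is "egfr", since it appears in every group 1 title and in no group 2 title. A full tree would repeat the same choice inside each branch.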

Page 31: internship summary

• Problems:
• The keywords shown on the tree may not be what we want.
• Using phrases instead of single words may give us a better interpretation.
• Not all publications are useful to us.
• Solutions:
• Count the occurrences of each 2-word phrase, 3-word phrase, 4-word phrase, …. The top phrases may give us valuable information on how to make classifications.
• Compare publication titles with trial titles and analyze only the publications whose titles are similar to trial titles.
• Build the cluster tree based on phrases instead of words.
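
Counting n-word phrases can be sketched as follows. The tokenizer, the function names, and the sample titles are assumptions for illustration; only the idea of counting 2-, 3- and 4-word phrases comes from the slides.

```python
import re
from collections import Counter

def phrases(title, n):
    """All runs of n consecutive words in a title, lowercased."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def phrase_counts(titles, n):
    """Count how often each n-word phrase occurs across all titles."""
    return Counter(p for title in titles for p in phrases(title, n))

# Made-up titles, not taken from the publication set.
titles = [
    "Non small cell lung cancer and EGFR",
    "Treatment of non small cell lung cancer",
    "Small cell lung cancer survival",
]
counts_3 = phrase_counts(titles, 3)
counts_2 = phrase_counts(titles, 2)
top_3 = counts_3.most_common(3)  # the most frequent 3-word phrases
```

The same phrase lists can replace single words as the dictionary when building the count vectors for the cluster tree.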

Page 32: internship summary

New outputs (3-word phrases):

Page 33: internship summary

Cut at 0.36

Page 35: internship summary

Take all the group 1 data out and subdivide it.

Page 37: internship summary

4-word phrases

Page 40: internship summary

2-word phrases

Page 43: internship summary

Future work
• 1) Delete useless phrases from the dictionary and rebuild the trees.
• 2) Build more tree models based on phrases of different lengths.
• 3) Read the publication titles placed in one group to see whether the classification algorithm makes sense.
• 4) Do supervised learning after we have manually classified a number of publications. This can improve accuracy and lets the computer classify the publications the way we want rather than by similarity alone.
• 5) Learn new methods in an advanced machine learning class.

Page 44: internship summary

Thanks for listening!