TRANSCRIPT
Internship Summary
Junting Ma (Sarah)
Introduction
• Found mutations and conditions in 81 pages of clinical trial data
• Learned JavaScript and MongoDB
• Learned Python and built a web crawler
• Applied machine learning methods to classify publications
• Future work
JavaScript and MongoDB
• Found the top cancers in the top 100 cities by cancer occurrence and visualized the result.
• Counted the number of genes appearing in each clinical trial and visualized the data.
• Found the corresponding drugs and publication IDs for each mutation.
• Counted the cancer trials in each medical group and visualized the data.
• Found the top cancers and medical groups by number of trials in each phase.
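The counting steps above were done with MongoDB queries; as a minimal Python sketch of the same idea, the snippet below counts how many trials each gene appears in. The record layout and field names (`trial_id`, `genes`) are hypothetical stand-ins for the real collection.

```python
from collections import Counter

# Hypothetical trial records; in the real project these came from MongoDB.
trials = [
    {"trial_id": "NCT001", "genes": ["EGFR", "KRAS"]},
    {"trial_id": "NCT002", "genes": ["EGFR", "TP53", "TP53"]},
    {"trial_id": "NCT003", "genes": ["BRAF"]},
]

# Count how many trials each gene appears in (deduplicate within a trial).
gene_counts = Counter()
for trial in trials:
    gene_counts.update(set(trial["genes"]))

print(gene_counts.most_common())  # EGFR appears in 2 trials, the rest in 1
```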
Built a web crawler to get contact information for our company
• Motivation: In the contact-searching project, we did a lot of simple, repetitive work: opening pages, looking for email addresses, copying and pasting, searching key words on Google… We could save a lot of time by writing a program to do that work automatically.
• Improvement: The Python program does everything mentioned above automatically.
• Result: Got about 500 email addresses from 300 medical institution names (searching four key words for each institution).
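The email-extraction step of such a crawler can be sketched with a regular expression over the fetched page text. This is a simplified illustration, not the project's actual crawler, and the pattern is a common approximation rather than a full address validator.

```python
import re

# Simple email pattern; real-world address matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html: str) -> list:
    """Return unique email addresses found in a page, in order of appearance."""
    seen, result = set(), []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.add(match)
            result.append(match)
    return result

page = '<p>Contact: info@example.org or <a href="mailto:lab@example.org">lab</a></p>'
print(extract_emails(page))  # ['info@example.org', 'lab@example.org']
```

The same function can then be applied to pages returned by each key-word search, accumulating addresses per institution.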
Applied machine learning methods to classify publications
Motivations:
• Classifying 40,000 publications manually would take a huge amount of time.
• Publications in the same class tend to share similar words or phrases in their title and purpose sections.
• Machine learning methods are well suited to classification problems.
Introduction to machine learning methods
• Supervised learning: use a data set whose classes are already known (the training set) to build a model that predicts the classes of unknown data.
• Unsupervised learning: classify data based on similarities, without any training set.
• Idea: Use unsupervised learning methods to classify our data based on the similarity of their titles and purposes.
Try different unsupervised learning methods to find which one fits our data best:
i) Principal Component Analysis (PCA)
ii) K-means clustering
iii) Hierarchical clustering
• 1) PCA
• Dimension reduction.
• Projects the data into a new space, trying to capture more of the data's variation with fewer dimensions.
• Result: Each new dimension captured only a few features of the data. Not good for us.
• Problem: Hard to interpret.
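A minimal NumPy sketch of PCA via SVD (not the project's actual code) shows what "explained variance per component" means; with sparse word-count data, these ratios stay small, which matches the result above. The random matrix here is only a stand-in for a title word-count matrix.

```python
import numpy as np

def pca(X: np.ndarray, k: int):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()  # variance ratio per component
    return Xc @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # stand-in for title word counts
Z, ratio = pca(X, k=2)
print(Z.shape, ratio)                      # 2-D projection and its variance share
```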
• 2) K-means clustering
• Problems:
• Optimizes locally and is unstable (different starting points may give different results).
• Hard to decide which points to start with and how many groups to divide the data into.
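A bare-bones NumPy k-means (again a sketch, not the project's code) makes the instability concrete: the algorithm only refines whatever centroids it was seeded with, so different seed choices can converge to different partitions.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=50):
    """Plain k-means; the result depends on which points seed the centroids."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
# Two well-separated blobs of 20 points each.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels_a, _ = kmeans(X, k=2, init_idx=[0, 39])  # one seed in each blob
labels_b, _ = kmeans(X, k=2, init_idx=[0, 1])   # both seeds in the same blob
```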
• 3) Hierarchical clustering
• Start with each data point in its own group.
• Repeat: merge the two "closest" groups.
• Stop when all groups have been merged into a single group.
• Advantages:
• 1. Gives us a structure for the whole picture.
• 2. Divides our data according to the tree structure instead of guessing.
• 3. Lets us look into the data part by part.
Steps of processing our data
• 1) Count each word's occurrences in each publication title and build a dictionary of all words that appear.
• 2) Calculate the distances between publications in a high-dimensional space (each word in the dictionary is one dimension).
• 3) Use the distances to build a dendrogram.
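The three steps above can be sketched end to end with SciPy's hierarchical clustering. This is an illustrative reconstruction with made-up titles, assuming cosine distance and average linkage; the project's actual distance measure and linkage method are not stated in the slides.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

titles = [
    "egfr mutation in lung cancer",
    "kras mutation in lung cancer",
    "survey of hospital staffing",
]

# 1) Build the word dictionary and count occurrences per title.
vocab = sorted({w for t in titles for w in t.split()})
X = np.array([[Counter(t.split())[w] for w in vocab] for t in titles], dtype=float)

# 2) Pairwise distances in the high-dimensional word space.
D = pdist(X, metric="cosine")

# 3) Agglomerative linkage; cutting the tree at a height yields the groups.
Z = linkage(D, method="average")
groups = fcluster(Z, t=0.45, criterion="distance")
print(groups)  # the two mutation titles end up in the same group
```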
Outputs:
Cut at 0.45.
Problem: Hard to know what each group means.
Solution: Applied supervised learning, using the labels from the dendrogram, to build a tree model with key words.
• Usage: 1) Shows us the features of the publications in a group. 2) Gives us useful key words that we can put into our search engine.
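One way to build such a tree model is a decision tree fit on the word counts with the dendrogram groups as labels; the words it splits on are the candidate key words. This sketch assumes scikit-learn and uses a tiny hypothetical vocabulary and label set, not the project's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

vocab = ["mutation", "lung", "survey", "hospital"]
# Word-count rows per publication title (hypothetical data).
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
y = [1, 1, 2, 2]  # group labels taken from the dendrogram cut

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# The printed rules name the words that separate the groups.
print(export_text(tree, feature_names=vocab))
```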
• Problems:
• The key words shown in the tree may not be the ones we want.
• Using phrases instead of single words may give us better interpretation.
• Not all publications are useful to us.
• Solutions:
• Calculate the occurrences of each 2-word, 3-word, 4-word, … phrase. The top phrases may give us valuable information on how to make classifications.
• Compare publication titles to trial titles and analyze only the publications whose titles are similar to trial titles.
• Build the cluster tree based on phrases instead of words.
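The phrase-counting step can be sketched with a sliding window over the title words. This is an illustration with made-up titles; the real dictionary would be built over all 40,000 publications.

```python
from collections import Counter

def phrase_counts(titles, n):
    """Count every n-word phrase across a list of titles."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

titles = [
    "egfr mutation in non small cell lung cancer",
    "kras mutation in non small cell lung cancer",
]
# The most frequent 3-word phrases are candidates for the phrase dictionary.
print(phrase_counts(titles, 3).most_common(3))
```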
New outputs (3-word phrases):
Cut at 0.36.
Take all of the group 1 data out and subdivide it.
4-word phrases
2-word phrases
Future work
• 1) Delete useless phrases from the dictionary and rebuild the trees.
• 2) Build more tree models based on phrases of different lengths.
• 3) Read the publication titles assigned to each group to check whether the classification algorithm makes sense.
• 4) Do supervised learning after we manually classify a number of publications. This should improve accuracy and let the computer classify publications the way we want, rather than just by similarity.
• 5) Learn new methods in an advanced machine learning class.
Thanks for listening!