TRANSCRIPT
Internship Summary
Junting Ma (Sarah)
Introduction
• Found mutations and conditions in 81 pages of clinical trial data
• Learned JavaScript and MongoDB
• Learned Python and built a web crawler
• Applied machine learning methods to classify publications
• Future work
JavaScript and MongoDB
• Found the top cancers in the top 100 cities by cancer occurrence and visualized the result.
• Counted the number of genes appearing in each clinical trial and visualized the data.
• Found the corresponding drugs and publication IDs for each mutation.
• Counted the cancer trials in each medical group and visualized the data.
• Found the top cancers and medical groups by number of trials in each phase.
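The counting steps above were done with MongoDB queries; as a minimal Python sketch of the same idea, the snippet below counts how many trials each gene appears in. The record layout and field names (`trial_id`, `genes`) are hypothetical stand-ins for the real collection.

```python
from collections import Counter

# Hypothetical trial records; in the real project these came from MongoDB.
trials = [
    {"trial_id": "NCT001", "genes": ["EGFR", "KRAS"]},
    {"trial_id": "NCT002", "genes": ["EGFR", "TP53", "TP53"]},
    {"trial_id": "NCT003", "genes": ["BRAF"]},
]

# Count how many trials each gene appears in (deduplicate within a trial).
gene_counts = Counter()
for trial in trials:
    gene_counts.update(set(trial["genes"]))

print(gene_counts.most_common())  # EGFR appears in 2 trials, the rest in 1
```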
Built a web crawler to get contact information for our company
• Motivation: In the contact-searching project, we did a lot of simple, repetitive work: opening pages, looking for email addresses, copying and pasting, searching key words on Google… We could save a lot of time by writing a program to do that work automatically.
• Improvement: The Python program does everything mentioned above automatically.
• Result: Got about 500 email addresses from 300 medical institution names (searching four key words for each institution).
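The email-extraction step of such a crawler can be sketched with a regular expression over the fetched page text. This is a simplified illustration, not the project's actual crawler, and the pattern is a common approximation rather than a full address validator.

```python
import re

# Simple email pattern; real-world address matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html: str) -> list:
    """Return unique email addresses found in a page, in order of appearance."""
    seen, result = set(), []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.add(match)
            result.append(match)
    return result

page = '<p>Contact: info@example.org or <a href="mailto:lab@example.org">lab</a></p>'
print(extract_emails(page))  # ['info@example.org', 'lab@example.org']
```

The same function can then be applied to pages returned by each key-word search, accumulating addresses per institution.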
Applied machine learning methods to classify publications
Motivations:
• Classifying 40,000 publications manually would take a huge amount of time.
• Publications in the same class tend to share similar words or phrases in their title and purpose sections.
• Machine learning methods are well suited to classification problems.
Introduction to machine learning methods
• Supervised learning: use a data set whose classes are already known (the training set) to build a model that predicts the classes of unknown data.
• Unsupervised learning: classify data based on similarities, without any training set.
• Idea: Use unsupervised learning methods to classify our data based on the similarity of their titles and purposes.
Try different unsupervised learning methods to find which one fits our data best:
i) Principal Component Analysis (PCA)
ii) K-means clustering
iii) Hierarchical clustering
• 1) PCA
• Dimension reduction.
• Projects the data into a new space, trying to capture more of the data's variation with fewer dimensions.
• Result: Each new dimension captured only a few features of the data. Not good for us.
• Problem: Hard to interpret.
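A minimal NumPy sketch of PCA via SVD (not the project's actual code) shows what "explained variance per component" means; with sparse word-count data, these ratios stay small, which matches the result above. The random matrix here is only a stand-in for a title word-count matrix.

```python
import numpy as np

def pca(X: np.ndarray, k: int):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()  # variance ratio per component
    return Xc @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # stand-in for title word counts
Z, ratio = pca(X, k=2)
print(Z.shape, ratio)                      # 2-D projection and its variance share
```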
• 2) K-means clustering
• Problems:
• Optimizes locally and is unstable (different starting points may give different results).
• Hard to decide which points to start with and how many groups to divide the data into.
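A bare-bones NumPy k-means (again a sketch, not the project's code) makes the instability concrete: the algorithm only refines whatever centroids it was seeded with, so different seed choices can converge to different partitions.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=50):
    """Plain k-means; the result depends on which points seed the centroids."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
# Two well-separated blobs of 20 points each.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels_a, _ = kmeans(X, k=2, init_idx=[0, 39])  # one seed in each blob
labels_b, _ = kmeans(X, k=2, init_idx=[0, 1])   # both seeds in the same blob
```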
• 3) Hierarchical clustering
• Start with each data point in its own group.
• Repeat: merge the two "closest" groups.
• Stop when all groups have been merged into a single group.
• Advantages:
• 1. Gives us a structure for the whole picture.
• 2. Divides our data according to the tree structure instead of guessing.
• 3. Lets us look into the data part by part.
Steps of processing our data
• 1) Count each word's occurrences in each publication title and build a dictionary of all words that appear.
• 2) Calculate the distances between publications in a high-dimensional space (each word in the dictionary is one dimension).
• 3) Use the distances to build a dendrogram.
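The three steps above can be sketched end to end with SciPy's hierarchical clustering. This is an illustrative reconstruction with made-up titles, assuming cosine distance and average linkage; the project's actual distance measure and linkage method are not stated in the slides.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

titles = [
    "egfr mutation in lung cancer",
    "kras mutation in lung cancer",
    "survey of hospital staffing",
]

# 1) Build the word dictionary and count occurrences per title.
vocab = sorted({w for t in titles for w in t.split()})
X = np.array([[Counter(t.split())[w] for w in vocab] for t in titles], dtype=float)

# 2) Pairwise distances in the high-dimensional word space.
D = pdist(X, metric="cosine")

# 3) Agglomerative linkage; cutting the tree at a height yields the groups.
Z = linkage(D, method="average")
groups = fcluster(Z, t=0.45, criterion="distance")
print(groups)  # the two mutation titles end up in the same group
```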
Outputs:
Cut at 0.45.
Problem: Hard to know what each group means.
Solution: Applied supervised learning, using the labels from the dendrogram, to build a tree model with key words.
• Usage: 1) Shows us the features of the publications in a group. 2) Gives us useful key words that we can put into our search engine.
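One way to build such a tree model is a decision tree fit on the word counts with the dendrogram groups as labels; the words it splits on are the candidate key words. This sketch assumes scikit-learn and uses a tiny hypothetical vocabulary and label set, not the project's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

vocab = ["mutation", "lung", "survey", "hospital"]
# Word-count rows per publication title (hypothetical data).
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
y = [1, 1, 2, 2]  # group labels taken from the dendrogram cut

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# The printed rules name the words that separate the groups.
print(export_text(tree, feature_names=vocab))
```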
• Problems:
• The key words shown in the tree may not be the ones we want.
• Using phrases instead of single words may give us better interpretation.
• Not all publications are useful to us.
• Solutions:
• Calculate the occurrences of each 2-word, 3-word, 4-word, … phrase. The top phrases may give us valuable information on how to make classifications.
• Compare publication titles to trial titles and analyze only the publications whose titles are similar to trial titles.
• Build the cluster tree based on phrases instead of words.
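The phrase-counting step can be sketched with a sliding window over the title words. This is an illustration with made-up titles; the real dictionary would be built over all 40,000 publications.

```python
from collections import Counter

def phrase_counts(titles, n):
    """Count every n-word phrase across a list of titles."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

titles = [
    "egfr mutation in non small cell lung cancer",
    "kras mutation in non small cell lung cancer",
]
# The most frequent 3-word phrases are candidates for the phrase dictionary.
print(phrase_counts(titles, 3).most_common(3))
```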
New outputs (3-word phrases):
Cut at 0.36.
Take all of the group 1 data out and subdivide it.
4-word phrases
2-word phrases
Future work
• 1) Delete useless phrases from the dictionary and rebuild the trees.
• 2) Build more tree models based on phrases of different lengths.
• 3) Read the publication titles assigned to each group to check whether the classification algorithm makes sense.
• 4) Do supervised learning after we manually classify a number of publications. This should improve accuracy and let the computer classify publications the way we want, rather than just by similarity.
• 5) Learn new methods in an advanced machine learning class.
Thanks for listening!