Automating NSF HERD Reporting Using Machine Learning and Administrative Data Rodolfo H. Torres CIMA Session: The Use of Advance Analytics to Drive Decisions 2018 APLU Annual Meeting New Orleans Marriott, New Orleans LA November 11, 2018 This research has been supported in part by the National Science Foundation under the EAGER Awards 1547464 / 1547513 Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Page 1: Automating NSF HERD Reporting Using Machine Learning and ... · CIMA Session: The Use of Advance Analytics to Drive Decisions 2018 APLU Annual Meeting New Orleans Marriott, New Orleans

Automating NSF HERD Reporting Using Machine Learning and Administrative Data

Rodolfo H. Torres

CIMA Session: The Use of Advance Analytics to Drive Decisions
2018 APLU Annual Meeting
New Orleans Marriott, New Orleans LA
November 11, 2018

This research has been supported in part by the National Science Foundation under the EAGER Awards 1547464 / 1547513. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Project in Collaboration with

Luke Huan
Initial co-PI; Former Professor, EECS / ITTC, University of Kansas
Current Position: Head of Beijing Big Data Lab, Baidu Research

Joshua Rosenbloom
Professor and Chair, Department of Economics, Iowa State University

Joseph St. Amand
Former Graduate Student, EECS, University of Kansas
Current Position: Chief Technology Officer, Patients Voices

Adrienne Sadovsky
Principal Analyst Senior, Office of Research, University of Kansas


From https://www.nsf.gov/statistics/srvyherd/#sd

“The Higher Education Research and Development Survey,…, is the primary source of information on R&D expenditures at U.S. colleges and universities. The survey collects information on R&D expenditures by field of research and source of funds and also gathers information on types of research and expenses …The survey is an annual census of institutions that expended at least $150,000 in separately budgeted R&D in the fiscal year.”

The HERD Survey

In FY 2016, 902 institutions reported a total of $72B in R&D expenditures, of which $39B came from federal sources.


• R&D expenditures by source of funds (federal government, state and local government, business, nonprofit, institutional, and other)

• R&D expenditures passed through to sub-recipients or received as a sub-recipient

• Federally funded R&D expenditures by federal agency

• R&D expenditures by purpose of work (e.g., Basic Research, Applied Research, Development)

• Federally and non-federally funded R&D expenditures by field (e.g., Computer Sciences, Chemistry, Economics)

Some Features of the HERD Survey
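The breakdowns listed above amount to summing project-level expenditures along different dimensions. A minimal sketch of that grouping in Python follows; the record layout, source labels, and amounts are hypothetical and chosen only for illustration, not taken from the survey or from KU data.

```python
from collections import defaultdict

# Hypothetical project-level records; fields, sources, and amounts are
# illustrative only.
projects = [
    {"field": "Chemistry", "source": "federal", "amount": 120_000},
    {"field": "Chemistry", "source": "business", "amount": 30_000},
    {"field": "Economics", "source": "federal", "amount": 45_000},
    {"field": "Economics", "source": "institutional", "amount": 15_000},
]

def totals_by(records, *keys):
    """Sum expenditure amounts grouped by the given record keys."""
    totals = defaultdict(int)
    for rec in records:
        totals[tuple(rec[k] for k in keys)] += rec["amount"]
    return dict(totals)

# One grouping per survey breakdown: by source, and by field and source.
by_source = totals_by(projects, "source")
by_field_and_source = totals_by(projects, "field", "source")
```

The grouping itself is mechanical; the hard part, and the subject of the rest of the talk, is assigning the field and purpose labels to each project in the first place.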


Sample of Tables in the HERD Report

Total and federally financed higher education R&D expenditures, by type of R&D: 2010–2016 (dollars in thousands)

| Fiscal year | Total: All R&D expenditures | Total: Basic research | Total: Applied research | Total: Development | Federal: All R&D expenditures | Federal: Basic research | Federal: Applied research | Federal: Development |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2010 | 61,286,610 | 40,416,177 | 15,478,375 | 5,392,058 | 37,477,582 | 25,399,596 | 9,361,940 | 2,716,046 |
| 2011 | 65,274,393 | 42,809,196 | 16,733,579 | 5,731,618 | 40,768,251 | 27,331,458 | 10,498,586 | 2,938,207 |
| 2012 | 65,729,007 | 42,401,697 | 17,295,653 | 6,031,657 | 40,142,223 | 26,469,347 | 10,577,754 | 3,095,122 |
| 2013 | 67,013,138 | 43,305,409 | 17,390,865 | 6,316,864 | 39,445,931 | 26,071,617 | 10,327,219 | 3,047,095 |
| 2014 | 67,196,537 | 42,989,478 | 17,745,860 | 6,461,199 | 37,960,175 | 24,905,121 | 10,015,778 | 3,039,276 |
| 2015 | 68,566,890 | 43,865,982 | 18,022,569 | 6,678,339 | 37,848,552 | 24,945,232 | 9,969,994 | 2,933,326 |
| 2016 | 71,833,308 | 45,101,655 | 19,986,766 | 6,744,887 | 38,793,542 | 24,944,577 | 10,893,286 | 2,955,679 |

SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey https://ncsesdata.nsf.gov/herd/2016/html/HERD2016_DST_08.html
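As a quick arithmetic check on the table above, the three types of R&D should add up to the "All R&D expenditures" column, and the federal share follows by division. The sketch below copies the 2010 and 2016 rows from the table; the function names are hypothetical.

```python
# Rows copied from the table above (dollars in thousands), in the order:
# (total_all, total_basic, total_applied, total_development,
#  federal_all, federal_basic, federal_applied, federal_development)
herd = {
    2010: (61_286_610, 40_416_177, 15_478_375, 5_392_058,
           37_477_582, 25_399_596, 9_361_940, 2_716_046),
    2016: (71_833_308, 45_101_655, 19_986_766, 6_744_887,
           38_793_542, 24_944_577, 10_893_286, 2_955_679),
}

def components_add_up(row):
    """Check that basic + applied + development equals the 'all' column."""
    total_all, tb, ta, td, fed_all, fb, fa, fd = row
    return tb + ta + td == total_all and fb + fa + fd == fed_all

def federal_share(row):
    """Fraction of all R&D expenditures that was federally financed."""
    return row[4] / row[0]

for year, row in sorted(herd.items()):
    print(year, components_add_up(row), f"{federal_share(row):.1%}")
```

Both rows are internally consistent, and the federal share falls from roughly 61% in 2010 to roughly 54% in 2016, in line with the $39B of $72B quoted for FY 2016.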


Source NSF https://ncsesdata.nsf.gov/herd/2015/html/HERD2016_DST_05.html

Expenditures by Field and Source 2016 (dollars in thousands)

| Field | All R&D expenditures | Federal government | State and local government | Institution funds | Business | Nonprofit organizations | All other sources |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All R&D fields | 71,833,308 | 38,793,542 | 4,025,280 | 17,974,962 | 4,210,563 | 4,614,800 | 2,214,161 |
| Science | 56,290,662 | 31,090,354 | 3,023,028 | 13,541,084 | 3,031,096 | 3,868,151 | 1,736,949 |
| Computer and information sciences | 2,077,884 | 1,442,771 | 49,502 | 399,965 | 90,288 | 59,588 | 35,770 |
| Geosciences, atmospheric sciences, and ocean sciences | 3,087,774 | 1,992,990 | 157,693 | 614,647 | 109,478 | 127,763 | 85,203 |
| Atmospheric science and meteorology | 626,518 | 513,275 | 18,416 | 68,923 | 6,319 | 7,660 | 11,925 |
| Geological and earth sciences | 999,351 | 605,706 | 47,334 | 226,541 | 51,237 | 32,557 | 35,976 |
| Ocean sciences and marine sciences | 1,097,864 | 665,121 | 59,874 | 241,440 | 32,896 | 69,841 | 28,692 |
| Geosciences, atmospheric sciences, and ocean sciences, nec | 364,041 | 208,888 | 32,069 | 77,743 | 19,026 | 17,705 | 8,610 |
| Life sciences | 40,887,850 | 21,798,334 | 2,437,745 | 9,700,749 | 2,569,302 | 3,038,475 | 1,343,245 |
| Agricultural sciences | 3,293,092 | 976,912 | 873,403 | 1,031,049 | 166,341 | 134,067 | 111,320 |
| Biological and biomedical sciences | 13,048,981 | 7,707,943 | 554,094 | 2,983,417 | 552,727 | 958,620 | 292,180 |
| Health sciences | 22,393,716 | 12,098,295 | 813,806 | 5,025,036 | 1,802,695 | 1,832,951 | 820,933 |
| Natural resources and conservation | 689,725 | 315,559 | 115,681 | 193,967 | 14,949 | 30,632 | 18,937 |
| Life sciences, nec | 1,462,336 | 699,625 | 80,761 | 467,280 | 32,590 | 82,205 | 99,875 |
| Mathematics and statistics | 681,661 | 444,419 | 25,714 | 170,414 | 8,844 | 23,601 | 8,669 |
| Physical sciences | 4,893,565 | 3,286,816 | 93,518 | 1,044,829 | 139,153 | 200,852 | 128,397 |
| Astronomy and astrophysics | 622,008 | 418,147 | 1,839 | 122,375 | 4,578 | 34,443 | 40,626 |
| Chemistry | 1,775,071 | 1,097,719 | 48,331 | 421,143 | 82,673 | 82,956 | 42,249 |
| Materials science | 172,086 | 111,802 | 4,579 | 38,435 | 9,518 | 5,465 | 2,287 |
| Physics | 2,124,098 | 1,523,751 | 33,703 | 417,189 | 37,851 | 71,221 | 40,383 |
| Physical sciences, nec | 200,302 | 135,397 | 5,066 | 45,687 | 4,533 | 6,767 | 2,852 |
| Psychology | 1,218,721 | 761,433 | 49,603 | 291,319 | 13,084 | 84,105 | 19,177 |
| Social sciences | 2,366,571 | 898,576 | 145,563 | 908,025 | 50,569 | 282,278 | 81,560 |
| Anthropology | 96,505 | 39,440 | 2,501 | 42,190 | 1,982 | 7,860 | 2,532 |
| Economics | 396,393 | 112,338 | 37,543 | 166,032 | 8,910 | 54,860 | 16,710 |
| Political science and government | 385,245 | 103,681 | 15,042 | 177,119 | 3,991 | 61,439 | 23,973 |
| Sociology, demography, and population studies | 504,594 | 269,371 | 27,602 | 135,118 | 8,213 | 52,471 | 11,819 |
| Social sciences, nec | 983,834 | 373,746 | 62,875 | 387,566 | 27,473 | 105,648 | 26,526 |
| Sciences, nec | 1,076,636 | 465,015 | 63,690 | 411,136 | 50,378 | 51,489 | 34,928 |
| Engineering | 11,381,727 | 6,583,476 | 699,032 | 2,335,527 | 1,055,444 | 359,441 | 348,807 |
| Aerospace, aeronautical, and astronautical engineering | 883,260 | 623,571 | 24,846 | 115,771 | 80,432 | 31,049 | 7,591 |
| Bioengineering and biomedical engineering | 1,084,355 | 650,752 | 56,057 | 254,840 | 46,976 | 53,428 | 22,302 |
| Chemical engineering | 885,273 | 467,678 | 40,386 | 199,334 | 121,432 | 34,328 | 22,115 |
| Civil engineering | 1,331,155 | 591,637 | 221,119 | 348,873 | 84,724 | 46,354 | 38,448 |
| Electrical, electronic, and communications engineering | 2,517,147 | 1,742,632 | 51,270 | 416,262 | 167,403 | 55,818 | 83,762 |
| Industrial and manufacturing engineering | 239,078 | 148,464 | 10,846 | 56,714 | 16,372 | 3,652 | 3,030 |
| Mechanical engineering | 1,435,828 | 860,745 | 55,454 | 279,079 | 169,124 | 33,587 | 37,839 |
| Metallurgical and materials engineering | 771,683 | 442,893 | 29,270 | 181,287 | 74,702 | 19,512 | 24,019 |
| Engineering, nec | 2,233,948 | 1,055,104 | 209,784 | 483,367 | 294,279 | 81,713 | 109,701 |
| Non-S&E | 4,160,919 | 1,119,712 | 303,220 | 2,098,351 | 124,023 | 387,208 | 128,405 |


Categorizing each project by purpose and field of research requires considerable time and effort, as it is done “manually” at KU. The manual process is:

• Labor intensive (expensive)

• Subjective

• Questionable reliability and validity

Goals

• Apply machine‐learning and text analysis tools to automate project classification

• Ease administrative burden

• Generate more objective classifications

A Proof of Concept Project


• We identified 1,700 historical awards that had been manually classified. We tried to classify them using the project Title, SOW/Abstract, PI Home Department, and additional metadata. We treated the “purpose” and “field” classifications as two separate tasks.

• After eliminating awards for which electronic abstracts were not available, we were left with a set of roughly 1,500 awards that could be used as a training data set.

• We used the “bag-of-words” model to represent the data; each word is treated as a separate “feature”. There were 17,046 separate data fields or features. However, using tools for feature weighting and selection, we reduced this number to a few hundred. This feature extraction “pipeline” is configurable, allowing us to experiment with different means of producing features for the classification models.

Methods
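The bag-of-words representation and the feature reduction step can be illustrated with a small standard-library sketch. The slides do not say which weighting or selection tools were used, so the TF-IDF scoring below is an assumption standing in for them, and the titles, function names, and `max_features` parameter are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters; each word becomes a feature."""
    return re.findall(r"[a-z]+", text.lower())

def fit_vocabulary(docs, max_features):
    """Score words by TF-IDF summed over documents and keep the top ones.

    TF-IDF is an assumed stand-in for the unspecified weighting tools.
    """
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(docs)
    score = Counter()
    for toks in tokenized:
        for word, tf in Counter(toks).items():
            score[word] += tf * math.log(n_docs / doc_freq[word])
    top = sorted(score, key=lambda w: (-score[w], w))[:max_features]
    return sorted(top)

def vectorize(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

# Hypothetical award titles, for illustration only.
titles = [
    "Catalysis of polymer reactions",
    "Polymer membranes for water treatment",
    "Labor markets and wage inequality",
    "Wage dynamics in regional labor markets",
]
vocab = fit_vocabulary(titles, max_features=5)
X = [vectorize(t, vocab) for t in titles]
```

Changing the scoring function or `max_features` is the kind of experiment a configurable pipeline allows; the classifiers only ever see the rows of `X`.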


We divided the awards into a “testing” set (about 30% of the data) and a “training” set (which was then divided into 5 parts for cross-validation). We explored the application of established machine-learning models:

• Decision Tree

• Support Vector Machine

• Logistic Regression

• Random Forest

• Naïve Bayes

• Neural Network

We evaluated the quality of the models on a per-category basis in terms of an F1-score, by comparison with the manual (“human”) classification.

Methods (cont.)
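The data split described on this slide can be sketched as follows. This is a minimal illustration using only the standard library; the seed, the function names, and the integer stand-ins for award records are assumptions, not details from the project.

```python
import random

def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of the items and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_folds(train, k=5):
    """Yield (fit, validate) pairs; each item is validated exactly once."""
    folds = [train[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit, validate

award_ids = list(range(1500))  # stand-in for the roughly 1,500 usable awards
train, test = split_train_test(award_ids)
folds = list(k_folds(train, k=5))
```

The testing set is set aside once and never used for model selection; the five folds rotate through the training set so that each candidate model is fitted on four parts and validated on the fifth.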


Methods (cont.)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 (Precision * Recall) / (Precision + Recall)

| Actual Outcome | Predicted Outcome: In Field | Predicted Outcome: Not in Field |
| --- | --- | --- |
| In Field | TP | FN |
| Not in Field | FP | TN |
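The per-category evaluation can be sketched directly from these definitions: each field in turn is treated as "In Field" and everything else as "Not in Field". The labels below are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the slides do not specify.

```python
def f1_per_category(actual, predicted):
    """Return {category: (precision, recall, f1)} from one-vs-rest counts."""
    scores = {}
    pairs = list(zip(actual, predicted))
    for cat in sorted(set(actual) | set(predicted)):
        tp = sum(a == cat and p == cat for a, p in pairs)
        fp = sum(a != cat and p == cat for a, p in pairs)
        fn = sum(a == cat and p != cat for a, p in pairs)
        # Assumed convention: a score with a zero denominator counts as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cat] = (precision, recall, f1)
    return scores

# Hypothetical human and model labels for six awards.
actual    = ["B2", "B2", "B2", "H1", "H1", "E"]
predicted = ["B2", "B2", "H1", "H1", "E",  "E"]
scores = f1_per_category(actual, predicted)
```

For the "B2" label in this toy example, TP = 2, FP = 0, and FN = 1, so precision is 1, recall is 2/3, and the F1 score is 0.8.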


• Greater success with Field of Study than Research Purpose

• Best models: Logistic Regression and Support Vector Machine

• Surprisingly, using the project Title alone, we did better than with the SOW/Abstract

• Potentially compromising factors:

o SOW/Abstract not sufficiently clear

o Models cannot understand complex relationships between the words

o Words have different meanings in different contexts

o Insufficient sample size

Results
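Two of the compromising factors listed above follow directly from the bag-of-words representation, as this small sketch suggests. The example titles are hypothetical and are not drawn from the award data.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, discarding word order."""
    return Counter(text.lower().split())

# Word order is lost: these two titles describe different projects
# but produce identical feature counts.
a = bag_of_words("Teaching physics through music")
b = bag_of_words("Teaching music through physics")
same_features = (a == b)

# Context is lost: "cell" is one shared feature although the two
# titles point to different fields.
c = bag_of_words("Solar cell efficiency")
d = bag_of_words("Stem cell therapy")
shared = set(c) & set(d)
```

A model that sees only these counts cannot tell the first pair apart, and it treats "cell" as the same evidence in both titles of the second pair.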


[Figure: Training and Testing F1 Scores. Y-axis: F1 Score (0.00 to 1.20); x-axis: Field Label; series: Training Score, Testing Score.]

Results (cont.)


[Figure: Label Distribution. Axis range: 0 to 200.]

[Figure: Training and Testing F1 Scores. Y-axis: F1 Score (0.00 to 1.20); x-axis: Field Label.]

Results (cont.)

F1 scores vs. sample size


Conclusions and Future Work

• It is feasible to classify the projects using machine-learning if enough data is available

• Need to collect more data points

• Need to better understand in which areas the tools do not perform well, and why

• Recruit other universities:

o Expand training data

o Determine whether tool is applicable cross-university


Publication: Enhancing and Automating University Reporting Of R&D Expenditure Data Using Machine Learning Techniques.

Joshua L. Rosenbloom, Rodolfo H. Torres, Joseph St. Amand, and Adrienne Sadovsky

Merrill Advanced Studies Center Report, No. 121, 2017.

https://merrill.ku.edu/sites/merrill.ku.edu/files/docs/2017_whitepaper/University_Research_Planning_in_the_Data_Era_2017.pdf

Software and documentation: https://github.com/jstamand/KUHERD

More Information


Questions?


| Field | Code | New Field (Spring 2016) |
| --- | --- | --- |
| Aerospace / Aeronautical / Astronautical Engineering | A1 | |
| Bioengineering and Biomedical Engineering | A2 | |
| Chemical Engineering | A3 | |
| Civil Engineering | A4 | |
| Electrical, Electronic, and Communications Engineering | A5 | |
| Mechanical Engineering | A6 | |
| Metallurgical & Materials Engineering | A7 | |
| Other Engineering | A8 | |
| Industrial and Manufacturing Engineering | A9 | * |
| Astronomy and Astrophysics | B1 | |
| Chemistry | B2 | |
| Physics | B3 | |
| Other Physical Sciences | B4 | |
| Materials Science | B5 | * |
| Atmospheric Sciences and Meteorology | C1 | |
| Geological and Earth Sciences | C2 | |
| Ocean Sciences and Marine Sciences | C3 | |
| Other Geosciences, Atmospheric, and Ocean Sciences | C4 | |
| Mathematics and Statistics | D | |
| Computer and Information Sciences | E | |
| Agricultural Sciences | F1 | |
| Biological and Biomedical Sciences | F2 | |
| Health Sciences | F3 | |
| Other Life Sciences | F4 | |
| Natural Resources and Conservation | F5 | * |
| Psychology | G | |
| Economics | H1 | |
| Political Science and Government | H2 | |
| Sociology, Demography, and Population Studies | H3 | |
| Other Social Sciences | H4 | |
| Anthropology | H5 | * |
| Other Sciences | I | |
| Education | K | |
| Law | L | |
| Humanities | M | |
| Visual and Performing Arts | N | |
| Business Management and Business Administration | O | |
| Communication and Communications Technologies | P | |
| Social Work | Q | |
| Other Non-S&E Fields | R | |