Bridging Digital Humanities Research and Big Data Repositories of Digital Text


Upload: beth-plale

Post on 02-Dec-2014


DESCRIPTION

Keynote, 2014 Encuentro de Humanistas Digitales, Mexico City

TRANSCRIPT

Page 1: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Bridging Digital Humanities Research and Large Repositories of Digital Text

2nd Encuentro de Humanistas Digitales | 21.May.14 Biblioteca Vasconcelos, Mexico City

Beth Plale

Professor, School of Informatics and Computing; Director, Data To Insight Center

Indiana University

Tweet us - @HathiTrust #HTRC

HATHI TRUST RESEARCH CENTER

Page 2: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Setting the Stage

• “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a given field.

• In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.

• I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher.

Page 3: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.

University of Illinois Library web site, 2014

Page 4: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Digital Humanities activities categorized

• Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.

• Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.

• Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.

Interview with Brett Bobley, NEH, 2009 http://www.hastac.org/node/1934

Page 5: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Why does it matter?

“If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”

2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities

Page 6: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Bobley’s Prediction, cont.

In a world of big, massive scale, he asks:

• “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”

• “What if you are a historian and you now have access to every newspaper around the world?”

• “How might searching and mining that kind of dataset radically change your results?”

Page 7: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Goal of Talk

Introduce technical architectural big data developments around HathiTrust and emerging examples of use,

… to facilitate discussion around whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver on advances for digital humanities

Page 8: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HathiTrust

• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia

http://www.hathitrust.org/htrc

http://www.hathitrust.org

→ Distinguished from

Page 9: Bridging Digital Humanities Research and Big Data Repositories of Digital Text


Page 10: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Content of HathiTrust

• Books and journals
– Plus pilots around images, audio, born-digital

• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
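The percentages on this slide can be reproduced from the three volume counts. The short Python check below is only a sketch: it assumes the percentages are shares of the sum of the three listed counts, which the slide does not state explicitly.

```python
# Volume counts per digitization source, as listed on the slide.
sources = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

# Assumption: the slide's percentages are shares of this sum.
total = sum(sources.values())

# Share of each source, rounded to one decimal place as on the slide.
shares = {name: round(100 * count / total, 1) for name, count in sources.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% of {total:,} volumes")
```

Under that assumption the three counts add up to roughly 10.5 million volumes, and the rounded shares match the figures shown.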

Page 11: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Content Sources

Page 12: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Content distribution

360,000 volumes in Spanish

Page 13: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Motivation for HTRC

→ HathiTrust repository is massive in scale: a latent goldmine for text-based research
→ Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)

Page 14: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HathiTrust Research Center

• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.

• HTRC Executive Committee
– Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
– J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
– Robert McDonald, Indiana University Libraries
– Beth Sandore Namachchivaya, University of Illinois Library
– John Unsworth, CIO, Dean of Library, Brandeis University

Page 15: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HTRC system

[Diagram labels: Request; Complexity-hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]

Page 16: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

   

Complexity-hiding interface

   

Page 17: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Return to categories of DH activity

HTRC in current form best at supporting:

• Access: by narrowing down to essential materials quickly – separating wheat from chaff
“big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”

• Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)

• Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
“The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”

Interview with Brett Bobley, NEH, 2009

Page 18: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Workset manages engagement with texts

Page 19: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE

• Topic modeling
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts

Page 20: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Topic Modeling

• Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
• Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
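To make "topic modeling" concrete, here is a minimal sketch of one common technique, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The slides do not say which algorithm or toolkit HTRC uses, so the choice of LDA, the function names (`lda_gibbs`, `top_words`), the hyperparameters, and the toy "pages" are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab):
    doc_topic[d][k] counts tokens of document d assigned to topic k, and
    topic_word[k][v] counts assignments of vocabulary word v to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Words most often assigned to each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Toy "pages" invented for illustration only.
pages = [
    "letters household house letters family".split(),
    "house letters household family house".split(),
    "illustrations volume literature illustrations edition".split(),
    "volume literature illustrations edition volume".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(pages, num_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

The per-document counts in `doc_topic` are the kind of "underlying data" the slide refers to: they say how strongly each volume or page is associated with each topic, which is more than a plain word count provides.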

Page 21: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Themes for Authors

Two topics with identical centralities (e.g., Dickens) but separate themes

• More strongly focused on book (illustrations, volume, literature)

• More strongly focused on author himself (letters, household, house)

Page 22: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Ted Underwood, Univ of Illinois

Page 23: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES

Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University

Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013

Page 24: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Gender Identification of Text

• Question investigated: Can we use author names in bibliographic records to identify gender?
• Looked at 2.6 million bibliographic records
– Extracted personal author data
– MARC fields 100 and 700, subfields a, b, c, d
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics, dates
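Because the author string is not fielded, the cataloging convention has to be unpacked by parsing. The sketch below shows one way this might be done; it is not the project's actual code. The function name `parse_author`, the date pattern, and the small `TITLES` set (built from honorifics that appear as examples elsewhere in this talk) are assumptions.

```python
import re

# Honorifics taken from examples in this talk; a real run would need a much longer list.
TITLES = {
    "sir", "bart", "baron", "count", "duke", "father", "cardinal",
    "lady", "mrs", "miss", "countess", "duchess", "sister",
}

# Life dates such as "1856-1924" or the open-ended "1856-".
DATE = re.compile(r"^\d{4}-(\d{4})?$")


def parse_author(heading):
    """Split 'Last, First Middle, titles, dates' into its parts."""
    surname, _, rest = heading.partition(",")
    parsed = {"surname": surname.strip(), "forenames": [], "titles": [], "dates": None}
    for token in re.split(r"[,\s]+", rest.strip()):
        if not token:
            continue
        if DATE.match(token):
            parsed["dates"] = token
        elif token.rstrip(".").lower() in TITLES:
            parsed["titles"].append(token)
        else:
            parsed["forenames"].append(token)
    return parsed


print(parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924"))
print(parse_author("Methuen, A. Sir, 1856-1924"))
```

A string with no comma at all is treated entirely as a surname by this sketch, which illustrates why variant forms of the same author are hard to reconcile automatically.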

Page 25: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Authors vs Names

There is the author, then there are the names under which the author is published…

• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924

Page 26: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Sources of Data

• The Virtual International Authority File
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
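A lookup against these sources might be sketched as follows. The honorific sets come from the slide; everything else is hypothetical: the tiny `FIRST_NAMES` table stands in for the harvested name lists, `guess_gender` is an invented name, and the rule "titles first, then first name" is an assumed design choice. The four labels mirror the categories used in the reported results.

```python
# Honorifics listed on the slide.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Hypothetical stand-in for the harvested name lists (census, baby-name sites, patents).
FIRST_NAMES = {
    "algernon": {"male"},
    "charlotte": {"female"},
    "leslie": {"female", "male"},
}


def guess_gender(forenames, titles):
    """Return 'female', 'male', 'ambiguous', or 'unknown' for one author string."""
    evidence = set()
    # Titles are checked first (an assumed priority, not the project's stated rule).
    for title in titles:
        key = title.rstrip(".").lower()
        if key in MALE_TITLES:
            evidence.add("male")
        elif key in FEMALE_TITLES:
            evidence.add("female")
    # Fall back to the first forename only when the titles say nothing.
    if not evidence and forenames:
        evidence |= FIRST_NAMES.get(forenames[0].lower(), set())
    if not evidence:
        return "unknown"
    if len(evidence) > 1:
        return "ambiguous"
    return next(iter(evidence))


print(guess_gender(["Algernon"], ["Sir", "bart."]))
print(guess_gender(["Leslie"], []))
print(guess_gender(["A."], []))
```

Initials such as "A." carry no signal in a scheme like this, which is one plausible reason a share of name strings ends up as unknown.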

Page 27: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Initial Gender Results

• Approximately 80% of name strings have initial gender identification
– Female: 59,365 (10%)
– Male: 425,994 (70%)
– Unknown: 114,204 (19%)
– Ambiguous: 5,965 (less than 1%)
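The percentages can be checked against the 606,437 unique personal author strings reported earlier in the talk. This is a sketch that assumes that figure is the denominator; the slides do not say so directly.

```python
# Denominator assumed to be the unique personal author strings reported earlier.
total_strings = 606_437

counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}

percent = {label: 100 * n / total_strings for label, n in counts.items()}
identified = percent["female"] + percent["male"]

for label, pct in percent.items():
    print(f"{label}: {pct:.1f}%")
print(f"female + male: {identified:.1f}%")
print(f"sum of the four counts: {sum(counts.values()):,}")
```

With that denominator the rounded shares agree with the slide, and female plus male comes to about 80%. The four counts sum to 605,528, slightly fewer than 606,437; the slides do not explain the difference.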

Page 28: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Results by Data Source

Against the whole set of name strings

• VIAF
– 19% hit rate
• Web Names
– 54% hit rate
• Patents Names
– 8%

Page 29: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Colin Allen, Jaimie Murdock
Cognitive Science, Indiana University

Ref talk by Jaimie Murdock, http://www.hathitrust.org/htrc_uncamp2013

Page 30: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Digging into philosophy of science

• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts

• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic

• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures

Page 31: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

The How

• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’

• Set contains lots of uninteresting books: e.g., college course catalogs

• Apply topic modeling on 86-volume subset
• Using IPython Notebook

Page 32: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Volume-level topic modeling on ‘anthropomorphism’ yields set of topics

Page 33: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

… Of the set of topics, choose ‘16’ as best

Page 34: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Volumes most similar to topic 16

Page 35: Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Page 36: Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Page 37: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Repeat topic modeling at page level

Page 38: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Topic model at page level for topics anthropomorphism, animal, and psychology

Page 39: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Pick top 3: topics 16, 10, 26

Page 40: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Show documents of topics 10, 16, 26

Page 41: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Drop to sentence level

• Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis

• Model the three books at the sentence level (uses machine learning)

* Start from 1315 texts, down to 86, then down to the most relevant 3
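The final narrowing step can be sketched as counting topic-relevant pages per volume and keeping the volumes with the most. The volume identifiers, page weights, the 0.5 threshold, and the function name `most_relevant_volumes` are all invented for illustration; the slides do not state how "topic-relevant" was defined.

```python
from collections import Counter


def most_relevant_volumes(page_weights, threshold=0.5, keep=3):
    """Rank volumes by their number of pages at or above the topic-weight threshold.

    page_weights is an iterable of (volume_id, page_number, topic_weight).
    """
    relevant_pages = Counter()
    for volume_id, _page, weight in page_weights:
        if weight >= threshold:
            relevant_pages[volume_id] += 1
    return relevant_pages.most_common(keep)


# Invented page-level weights for a single chosen topic.
pages = [
    ("vol-A", 1, 0.9), ("vol-A", 2, 0.7), ("vol-A", 3, 0.1),
    ("vol-B", 1, 0.6),
    ("vol-C", 1, 0.2), ("vol-C", 2, 0.3),
    ("vol-D", 1, 0.8), ("vol-D", 2, 0.8), ("vol-D", 3, 0.9),
]
print(most_relevant_volumes(pages, keep=2))
# → [('vol-D', 3), ('vol-A', 2)]
```

The same pattern applies at each stage of the funnel: a coarse selection produces candidates, a topic model scores them, and only the top-scoring subset is carried to the next, finer level.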

Page 42: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Promising early results …

Page 43: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Copyright: A Reality

Full text download is limited by both size and copyright

Page 44: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Computing with Copyrighted Materials: HTRC Data Capsule

• Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption

• Needs a computational framework that enables computing while restricting human consumption

• A secure computing framework that:
– Trusts that researcher will not deliberately leak data
– Prevents malware acting on user's behalf from leaking data

• Supports openness: accepts user-contributed analysis
• Supports large scale and low cost: protections can be extended to utilization of public supercomputers

Page 45: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HTRC Data Capsule Architectural Components

[Diagram labels: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Registry Services, worksets]

 

 

Page 46: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HTRC Data Capsule interaction

[Diagram labels: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Registry Services, worksets]

Annotations on the diagram:
• Upon run, Secure Capsule controls I/O behind the scenes
• Researcher requests new VM of type X
• Researcher installs tools onto VM through a window on her desktop
• Final location of results is the registry
• Image instance is created

Page 47: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

HTRC secure data capsule: view from researcher desktop

Page 48: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

Thanks to our sponsors

Page 49: Bridging Digital Humanities Research and Big Data Repositories of Digital Text

2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”

→ Paradigm: computation moves to the data (not vice versa)

2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?

Reality: Full text download is limited by size and copyright