
Empirical methods for knowledge evolution across Knowledge Organization Systems

Evolution and variation of classification systems – KnowEscape workshop, March 4-5, 2015, Amsterdam

Richard P. Smiraglia

* Knowledge Organization Systems (KOSs), including classifications, can be evaluated and explained by reference to concept theory: knowledge expresses concepts, which are represented by terms; knowledge elements represent predicates and referents of specific knowledge units; and knowledge units represent the synthesis of concept characteristics. Classes are large knowledge units that represent groupings of concepts according to prescribed characteristics, often as represented in texts (Dahlberg 1978, 2006; Hjørland 2009).

KOS = Concepts

* The concepts, their representations, and their groupings as represented in texts or other contextual environments (or domains) are derived according to a system known as warrant (see Beghtol 2010).

Warrant

* 1) how well it represents its warranted concepts, both individually and in contextual groupings, and
* 2) how well the individual classes, divisions and subdivisions are populated by target objects.

Two tests

* Eleven empirical approaches (Smiraglia 2015)
  * Subject pathfinders
  * Special classifications and thesauri
  * Empirical user studies
  * Informetric studies
  * Historical studies
  * Document and genre studies
  * Epistemological and critical studies
  * Terminological studies
  * Database semantics
  * Discourse analyses
  * Cognition, expert knowledge and AI

Domain Analysis

Smiraglia, Richard P. 2015. Domain analysis for knowledge organization. London: Chandos-Elsevier. (Forthcoming)

 

Actors, products, processes in a pharmacy

“Z-pack”

Voyant word cloud from Google “Azithromycin”

* An ontological base that reveals an underlying teleology
  * Does the group share a common goal that is implicit or explicit in its knowledge base?
* A set of common hypotheses
  * Is there a theoretical paradigm in operation? If so, it will dictate the hypotheses used in the domain for testing theoretical parameters. In non-scholarly domains, a parallel consideration applies to the means the group employs to contribute to the evolution of its common goal.
* An epistemological consensus on methodological approaches
  * Most domains that embrace a single theoretical paradigm (or a consistent set of such paradigms) will share methodological approaches rooted in different epistemological points of view.
* Social semantics
  * At the simplest level this means that the group should be visibly in conversation, utilizing its common ontology. At higher levels of complexity it means that there should be records of communication and exchange of ideas; in scholarly domains citation, intercitation, and co-citation will be evidence of social semantics.

Operationalizing domains for analysis

WoS search citation map “pharmacy and information systems”

SCOPUS authors “pharmacy and computer and information systems”

Cognitive Work Analysis

Domain analysis: just beginning in KO

* Knowledge Space Lab
  * Studying the collectivity of the UDC as a stable reference classification
* Big Classification
  * Studying the population of the UDC
  * Bibliographic characteristics associated with elements of UDC
    * WorldCat
    * Leuven
  * Data values associated with elements of UDC in both venues

Empirical “Ontogeny”

* 1st ed., 1905
* MRF (Master Reference File): 1994, 1997, 1998, 2005, 2008, 2009

There is no single “UDC”

[Figure: The spectral signature of the UDC (branch weight). Visualization: www.magnaview.nl. Panel labels: 2005, 2008. Branch labels shown: 53 Physics, 54 Chemistry, 55 Earth sciences; 61 Medical sciences, 62 Engineering, 63 Agriculture; 65 Communication and transportation industries, 66 Chemical technology, 67 Various industries.]

[Figure: UDC main classes. Panel labels: 1905, 2008. Classes shown: Science and knowledge; Philosophy. Psychology; Religion. Theology; Social sciences; Mathematics. Natural sciences; Applied sciences. Medicine. Technology; The arts; Language. Linguistics. Literature; Geography. Biography. History.]

Change over time

Changes in the Main Classes

Fig. 1: (a) Distribution of main UDC classes, inner ring 1905, outer ring 1994. (b) Distribution of main UDC classes in 1994 (innermost ring), 1997, 1998, 2005, 2008 and 2009 (outermost ring).

Fig. 2: (a) Distribution of special auxiliaries among the main classes over the years. (b) Percentage of special auxiliaries to main classes, and their changes over time.

Fig. 3: (a) Distribution of common auxiliaries over the years. (b) Changes in the record number of main classes from 1993 to 2009.

* 394.4 is a main UDC number standing for “Public ceremonial, coronations”
* Colon “:” is a connecting symbol representing “simple relation”
* Square brackets are used for subgrouping. Everything within the [...] brackets is a unit. This unit starts with another main UDC class number, 92, standing for “Biographical studies. Genealogy. Heraldry. Flags”
* Parentheses () that start with a non-zero numeric character denote a common auxiliary number of place. (100+437) indicates “(100) All countries in general” AND “(437) Czechoslovakia (1918-1992)”
* 329.15 is for “Political parties with a communist attitude”
* The auxiliary of place “(437) Czechoslovakia (1918-1992)” is intercalated between 329 and .15 to allow for collocation of all Czechoslovakian parties irrespective of their political orientation, which are then ordered by type. The entire number thus represents the topic “Communist party of Czechoslovakia”, which is further specified by the common auxiliary of form (091), denoting presentation in historical form, to express “the history of the communist party of Czechoslovakia”
* Plus “+” is the common auxiliary sign for addition/coordination, introducing the next UDC number combination in the string, which consists of two parts: “327.32 International solidarity of the working class” and “(100) All countries in general”

Data processing 394.4 :[92(100+437):329(437).15(091)+327.32(100)]
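The deconstruction walked through in the bullets above can be sketched in code. This is a minimal illustration under stated assumptions, not the procedure used in the study: the token categories, the regular expression, and the function name `tokenize_udc` are all introduced here for illustration, and the pattern covers only the notational devices that occur in this one example string.

```python
import re

# Hypothetical token categories for this one example; real UDC notation
# has many more devices (language auxiliaries, special auxiliaries, etc.).
TOKEN_RE = re.compile(r"""
    (?P<form>\(0[^)]*\))            # common auxiliary of form, e.g. (091)
  | (?P<place>\([1-9][^)]*\))       # common auxiliary of place, e.g. (100+437)
  | (?P<number>\d+(?:\.\d+)*)       # main number, e.g. 394.4
  | (?P<continuation>\.\d+)         # rest of a number after an intercalated auxiliary
  | (?P<operator>[:+/])             # relation, addition, consecutive extension
  | (?P<bracket>[\[\]])             # subgrouping brackets
""", re.VERBOSE)


def tokenize_udc(udc):
    """Split a UDC string into (category, text) pairs."""
    text = "".join(udc.split())  # drop all whitespace, including non-breaking spaces
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognized notation at position {pos}: {text[pos:]!r}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, value in tokenize_udc("394.4 :[92(100+437):329(437).15(091)+327.32(100)]"):
        print(f"{kind:13} {value}")
```

Note that the “+” inside (100+437) stays part of the place auxiliary, so only the two colons and the final plus are reported as operators between numbers; the intercalated (437) leaves “.15” as a separate continuation token that belongs with 329.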

Data processing

394.4 :[92(100+437):329(437).15(091)+327.32(100)]

Public celebrations and ceremonies with significant biographical and historical elements, or artifacts related to celebrations (e.g., flags, banners), involving historical personalities (both Czechoslovakian and international) linked to the history of the Czechoslovakian Communist Party and the international movement of solidarity of the working class throughout the world.

Matrix

Distribution of UDC numbers across classes, in design as well as in practical use

Length of UDC String

Network of UDC classes in the OCLC dataset by the use of the operations “:” (Simple relation), “/” (Consecutive extension), “+” (Addition)
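A network of this kind could be derived along the following lines. The sketch is an assumption about one possible construction, not a description of how the figure was produced: the helper names `class_links` and `edge_weights` are hypothetical, parenthesized auxiliaries are simply discarded, and square brackets are flattened, so grouping information is lost.

```python
import re
from collections import Counter

AUX_RE = re.compile(r"\([^)]*\)")            # parenthesized common auxiliaries
PART_RE = re.compile(r"(\d[\d.]*)|([:+/])")  # a main number, or an operator


def class_links(udc):
    """Return (main class, operator, main class) triples for one UDC string."""
    flat = AUX_RE.sub("", "".join(udc.split()))       # drop whitespace and auxiliaries
    flat = flat.replace("[", "").replace("]", "")     # simplification: ignore grouping
    links = []
    previous_class = None
    pending_operator = None
    for number, operator in PART_RE.findall(flat):
        if operator:
            pending_operator = operator
            continue
        main_class = number[0]                        # first digit = main class
        if previous_class is not None and pending_operator is not None:
            links.append((previous_class, pending_operator, main_class))
        previous_class, pending_operator = main_class, None
    return links


def edge_weights(udc_strings):
    """Count undirected class-to-class links per operator across many strings."""
    weights = Counter()
    for udc in udc_strings:
        for left, operator, right in class_links(udc):
            weights[(min(left, right), max(left, right), operator)] += 1
    return weights


if __name__ == "__main__":
    sample = ["394.4 :[92(100+437):329(437).15(091)+327.32(100)]", "72/76"]
    for edge, weight in sorted(edge_weights(sample).items()):
        print(edge, weight)
```

Removing the auxiliaries before splitting has the side effect of rejoining an intercalated number such as 329(437).15 into 329.15, so it is counted once as a class 3 node.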

OCLC: Otlet’s Grinder?

facts disassembled and hyperlinked

Bibliographic Data

* Methodology
  * Random sample of 400 from 9 million UDC/OCLC number pairs / 95,000 Leuven strings
  * 95% confidence, ±5%
  * MARC text records located
  * Bibliographic characteristics recorded
  * UDC numbers deconstructed and operators recorded
  * IBM-SPSS used to look for correlations, mostly by cross-tabulation
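The relation between a sample of 400 and “95% confidence, ±5%” can be checked with the standard sample-size formula for a proportion. The sketch below assumes the most conservative proportion (0.5) and an optional finite population correction; the slides do not say which formula was used, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_score(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def required_sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Cochran-style sample size; applies a finite population correction if given."""
    z = z_score(confidence)
    n = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


def margin_of_error(sample_size, confidence=0.95, proportion=0.5):
    """Half-width of the confidence interval for a proportion."""
    return z_score(confidence) * math.sqrt(proportion * (1 - proportion) / sample_size)


if __name__ == "__main__":
    print(required_sample_size())                      # very large population
    print(required_sample_size(population=9_000_000))  # UDC/OCLC number pairs
    print(required_sample_size(population=95_000))     # Leuven strings
    print(round(margin_of_error(400), 4))
```

Under these assumptions roughly 385 records are enough for either source, so a sample of 400 sits slightly inside the ±5% margin.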

Big Classification: Population of the UDC

OCLC Publication dates

Correlated characteristics

                ISBN    Edition  Series  Bibliog.  Linked record
ISBN                                     .003
Edition                          .006
Series                  .006                       .024
Bibliog.        .003
Linked record                    .024
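Values like those in the table above are what a cross-tabulation test of association reports as a significance level. The slides do not name the statistic, so the following is only a sketch under the assumption of a Pearson chi-square test on a 2x2 table of presence/absence counts; the counts are invented for illustration and are not data from the study.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). Assumes every row and column total is positive.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical counts for 400 sampled records:
    # rows = ISBN present / absent, columns = bibliography note present / absent.
    hypothetical = [[120, 80], [90, 110]]
    statistic, p_value = chi_square_2x2(hypothetical)
    print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

A p-value below the chosen threshold (commonly .05) would be read as a statistically significant association between the two characteristics.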

Correlated subject indicators

Columns: name, topic, place, Index term, Genre Form

name:
topic: .046, .018
place: .046
Index term: .063, .008
Genre Form: .018, .008

Correlate bibliographic characteristics and subject indicators

Columns: ISBN, Edition, Series, Bibliog., Linked record

name:
topic: .043
place: .091
Index term: .001, .048
Genre Form:

UDC population

UDC main class  Frequency  Percentage
0               38         .09
1               15         .03
2               15         .03
3               85         .21
5               31         .08
6               70         .18
7               38         .10
8               62         .16
9               34         .09
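A tally of this shape can be produced from a list of UDC strings in a few lines. The sketch assumes that the main class of a string is the first digit of its first number, which the slides do not state, and that each string begins with a main number rather than an auxiliary; the sample strings and function names are illustrative only.

```python
import re
from collections import Counter


def main_class(udc):
    """Return the first digit of a UDC string, or None if it has no digit."""
    match = re.search(r"\d", udc)
    return match.group() if match else None


def class_distribution(udc_strings):
    """Map each main class to (frequency, share of all classified strings)."""
    counts = Counter(main_class(s) for s in udc_strings)
    counts.pop(None, None)  # ignore strings without any digit
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}


if __name__ == "__main__":
    # Illustrative strings only, not drawn from the sampled dataset.
    sample = [
        "394.4 :[92(100+437):329(437).15(091)+327.32(100)]",
        "72/76",
        "663.42",
        "7.03",
        "329(437).15(091)",
    ]
    for cls, (frequency, share) in class_distribution(sample).items():
        print(cls, frequency, f"{share:.2f}")
```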

Auxiliary operators and UDC main classes

     “+”    “:”    “/”
“+”
“:”
“/”

     “+”    “:”    “/”
0           .001
1
2
3                  .006
5
6    .019   .001   .006
7                  .006
8
9    .019          .006

Statistically significant correlations occurred among most of the deconstructed components of the UDC numbers.

Statistically significant correlations were discovered among the MARC-designated elements of the respective bibliographic records.

Also, statistically significant correlations were discovered between the elements of classification and the bibliographic elements.

Correlate Everything

* Classification correlates with collection and catalog characteristics in predictable ways.
* In a classified bibliographic dataset, predictable pathways of association exist through the data.
* But these UDC numbers are recent and not collection-specific; and
* Values (e.g., subject headings, languages, publisher names) were not tested.
* Leuven data show a slightly different picture.

It worked ..., but ...

Some bibliographic characteristics from Leuven data

Language   %
English    28.2
French     21
Dutch      19.5
German     14.8
Italian    3.8
Spanish    1.8
Latin      1.5
Chinese    0.8
Polish     0.8
Japanese   0.5
Korean     0.3

ISBN               39.2%
Edition statement  10.6%
Series statement   38.6%

Population of the UDC

[Pie chart] UDC 0: 20%; UDC 1: 1%; Language 4: 56%; UDC 5: 6%; UDC 6: 8%; UDC 7: 6%; UDC 8: 3%

[Pie chart] UDC 0: 14%; UDC 1: 2%; UDC 2: 18%; UDC 3: 21%; UDC 5: 11%; UDC 6: 16%; UDC 7: 8%; UDC 8: 1%; UDC 9: 9%

[Pie chart] aux 0: 20%; aux 1: 1%; aux 4: 57%; aux 5: 6%; aux 6: 7%; aux 7: 6%; aux 8: 3%

Common auxiliaries linked to UDC classes

UDC classes linked with common auxiliary signs

Correlated characteristics

UDC class ISBN Edition Series “:” “*” “-” “/” “=” “<>”

0 X

1 X X

2 X X X

3 X X

5

6 X X

7 X

8

9

Network of classification interaction

[Chart: Place names in WorldCat; value axis 0 to 0.1]

[Chart: Publisher names in WorldCat; value axis 0 to 5]

Sborníky                     9
Marruecos                    7
Učebnice vysokých škol       7
Energía de la biomasa        4
Křesťansky život             4
Painting                     4
Painting, Modern             4
Sborníky konferenci          4
Asociaciones                 3
Brožury                      3
Česko                        3
Christian life               3
Español (lengua)             3
Katalogy                     3
Literatura española          3
malířstvi                    3
Populárne-naučne publikace   3
Příručky                     3
Textbooks (higher)           3
turizem                      3

“Subject” terms in 65x fields in WorldCat

• Czech Republic $x cultural relations $y to 1918
• Danmark Illerup Jylland Arkeologi Parallelltekst
• Catequesis $v Manuales para animadores
• Unilateral acts (International law)
• Neutrophils
• Jews $x persecutions $y 1939-1945
• Weapons of mass destruction

0  1  2  3  5  6  7  8  9  +  :  /
Madrid  .
Prague
Barcelona
Helsinki
Ljubljana
New York

Correlations: Place names, Publisher names

0  1  2  3  5  6  7  8  9  :  +  /
SPN
Kirjalito

-41   Painting, Czech
Painting, Modern--20th century--Czech Republic
Painting, Modern--21st century--Czech Republic
-437.313   Painting, Modern--19th century--Russia
Painting, Modern--20th century--Russia
Painting--Czech Republic--Náchod   Painting, Russian
7.03   Painting--19th-20th centuries
72/76   Painting
75-x(1/9)   Gothic painting--mural painting--Slovenia
75-x(1/9)   painting--mural painting--frescoes--17th/18th cent.--Italy

“painting”: semantic cluster with UDC strings

* Energía de la biomasa--Aspectos 361 ambientales--Países de la Unión Europea.
* Energías renovables--Aspecto del medio ambiente.
* 35-2â
* Energiförbrukning.
* 663.42

Richard P. Smiraglia, Knowledge Organization Research Group, 2014

small semantic cluster

The iSchool at UWM

* Leuven data
* Semantic clusters from populations
* Logistic regression with structural features

Still to be done

Beghtol, Clare. 2010. Classification theory. Encyclopedia of library and information sciences, 3rd ed., 1 no. 1: 1045-60.

Berry, John W. 1997. Immigration, acculturation, and adaptation. Applied psychology 46 no. 1: 5-34.

Dahlberg, Ingetraut. 1978. A referent-oriented, analytical concept theory for Interconcept. International classification 5: 142-51.

Dahlberg, Ingetraut. 2006. Knowledge organization: a new science? Knowledge organization 33: 11-19.

Hjørland, Birger. 2002. Domain analysis in information science: eleven approaches, traditional as well as innovative. Journal of documentation 58: 422-62.

Hjørland, Birger. 2009. Concept theory. Journal of the American Society for Information Science and Technology 60: 1519-36.

http://arizona.openrepository.com/arizona/bitstream/10150/105762/1/tennis_2007_dlist.pdf

Salah, Almila Akdag, Cheng Gao, Krzysztof Suchecki, Andrea Scharnhorst and Richard P. Smiraglia. 2012. The evolution of classification systems: ontogeny of the UDC. In A. Neelameghan and K.S. Raghavan eds., Categories, contexts, and relations in knowledge organization: Proceedings of the Twelfth International ISKO Conference, 6-9 August 2012, Mysore, India. Advances in knowledge organization 13. Würzburg: Ergon Verlag, pp. 51-57.

Scharnhorst, Andrea and Richard P. Smiraglia. 2012. Evolution of classification systems. In Proceedings of the ASIST SIG/CR Classification Workshop, Baltimore, Md., October 25, 2012.

Smiraglia, Richard P. 2013. Big classification: using the empirical power of classification interaction. In Campbell, D. Grant ed., Proceedings of the ASIST SIG/CR Classification Workshop, Montréal, 2 November 2013.

Smiraglia, Richard P. 2014. Classification interaction demonstrated empirically. In Wiesław Babik ed., Knowledge organization in the 21st century: Between Historical Patterns and Future Prospects, Proceedings of the 13th International ISKO Conference, Krakow, Poland, May 19-22, 2014. Advances in knowledge organization 14. Würzburg: Ergon Verlag, pp. 176-83.

Smiraglia, Richard P. 2015. Domain analysis for knowledge organization. London: Chandos-Elsevier. (Forthcoming)

Smiraglia, Richard, Andrea Scharnhorst, Almila Akdag Salah and Cheng Gao. 2013. UDC in action. In Slavic, Aïda, Almila Akdag Salah and Sylvie Davies eds., Classification and Visualization: Interfaces to Knowledge, Proceedings of the International UDC Seminar, 24-25 October 2013, The Hague, The Netherlands. Würzburg: Ergon Verlag.

Tennis, Joseph T. 2006. Versioning concept schemes for persistent retrieval. Bulletin of the American Society of Information Science and Technology 32 no. 5: 13-16. Also available at: http://www.asis.org/Bulletin/Jun-06/tennis.html.

Tennis, Joseph T. 2007. Diachronic and synchronic indexing: modeling conceptual change in indexing languages. In C. Arsenault and K. Dalkir eds., Information Sharing in a Fragmented World, Crossing Boundaries: Proceedings of the 35th Annual Meeting of the Canadian Association for Information Science/L’Association Canadienne Des Sciences De L'information, Montreal. Montreal: Canadian Association for Information Science.

References