Materials informatics
Evgeny Blokhin
Chelyabinsk SUSU’2013 summer workshop
Max-Planck Institute for Solid State Research
Stuttgart, Germany
Outline
1. Data-mining in materials science
2. Blue Obelisk
3. Python programming language
What is data-mining?
statistics
databases
information theory
machine learning
artificial intelligence
optimization
Data-mining
Tasks of data-mining
1. Classification
2. Forecasting
3. Visualization
4. Reasoning
5. Analysis
6. Expert systems
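As an illustration of the first task (classification), here is a minimal sketch of a nearest-centroid classifier in plain Python. The descriptors, labels and numbers are made-up toy values, not data from the talk, and the function names are hypothetical.

```python
from math import dist  # available in Python 3.8+

def centroids(samples):
    """Average the feature vectors of each class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(features, class_centroids):
    """Return the label whose centroid lies closest to the given features."""
    return min(class_centroids, key=lambda label: dist(features, class_centroids[label]))

# Made-up toy data: (band gap in eV, arbitrary second descriptor) -> label
training = [
    ((0.0, 1.0), "metal"),
    ((0.1, 1.2), "metal"),
    ((3.0, 0.2), "insulator"),
    ((3.4, 0.1), "insulator"),
]
model = centroids(training)
prediction = classify((2.8, 0.3), model)
print(prediction)
```

A real study would use many more descriptors and a validated model; the sketch only shows the shape of the task.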
Big data in materials science
EXAMPLE: over roughly the last 4 years,
my theoretician colleagues and I produced:
over 9000 simulation output files
over 50 articles
1. Accelrys Pipeline Pilot and Materials Studio, http://accelrys.com/products
2. AFLOW framework and Aflowlib.org repository, http://www.aflowlib.org
3. AIDA, Bosch LLC
4. Blue Obelisk Data Repository (XSLT, XML), http://bodr.sourceforge.net
5. CCLib (Python), http://cclib.sf.net
6. CDF (Python), http://kitchingroup.cheme.cmu.edu/cdf
7. CMR (Python), https://wiki.fysik.dtu.dk/cmr
8. Comp. Chem. Comparison and Benchmark Database, http://cccbdb.nist.gov
9. cctbx: Computational Crystallography Toolbox, http://cctbx.sourceforge.net
10. ESTEST (Python, XQuery), http://estest.ucdavis.edu
11. J-ICE online viewer (based on Jmol, Java), http://j-ice.sourceforge.net
12. Materials Project (Python), http://www.materialsproject.org
13. PAULING FILE world largest database for inorganic compounds, http://paulingfile.com
14. Quixote, http://quixote.wikispot.org
15. Scipio (Java), https://scipio.iciq.es
16. WebMO: Web-based interface to computational chemistry packages (Java, Perl), http://webmo.net
New type of modeling software
…and smart codes
ENCUT = 500
IBRION = 2
ISIF = 3
NSW = 20
IDIOT = 3
NELMIN = 5
EDIFF = 1.0e-08
EDIFFG = -1.0e-08
IALGO = 38
ISMEAR = 0
LREAL = .FALSE.
LWAVE = .FALSE.
*** VASP MASTER: I AM SURE YOU KNOW WHAT YOU ARE DOING ***
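Settings in this KEY = value form are easy to read into a program for later mining. The following is a minimal sketch, not a full INCAR parser: it assumes one setting per line and treats "#" and "!" as comment markers, and the function names are hypothetical.

```python
def convert(value):
    """Turn a raw string into bool, int or float where possible."""
    if value.upper() in (".TRUE.", ".FALSE."):
        return value.upper() == ".TRUE."
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave anything else as a string

def parse_incar(text):
    """Parse 'KEY = value' lines into a dict (assumes one setting per line)."""
    params = {}
    for raw in text.splitlines():
        # Drop trailing comments (assumed markers: '#' and '!')
        line = raw.split("#", 1)[0].split("!", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        params[key.upper()] = convert(value)
    return params

# A few of the settings shown on the slide
incar_text = """
ENCUT = 500
IBRION = 2
EDIFF = 1.0e-08
LREAL = .FALSE.
"""
settings = parse_incar(incar_text)
print(settings)
```

Once thousands of such inputs are held as dictionaries, they can be compared, filtered and stored in a database alongside the corresponding results.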
d-metal oxides
band gap problem
standard DFT GGA approach
Hartree-Fock admixing
LCAO approximation
Usage of Gaussian basis sets
good atomization energy
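The Hartree-Fock admixing item refers to hybrid functionals, which replace a fraction of the GGA exchange energy with exact (Hartree-Fock) exchange. A generic one-parameter form is sketched below. The symbol a for the mixing fraction is notation introduced here, and the slide does not state which functional or fraction was used.

```latex
E_{xc}^{\mathrm{hyb}} = a\,E_{x}^{\mathrm{HF}} + (1 - a)\,E_{x}^{\mathrm{GGA}} + E_{c}^{\mathrm{GGA}},
\qquad 0 \le a \le 1
```

In the PBE0 functional, for instance, a = 0.25. Setting a = 0 recovers the standard GGA result, which typically underestimates band gaps.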
Example of inference over an ontology
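The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of one kind of inference over an ontology: deriving implied "is-a" relations by transitive closure. The class names and the function name are hypothetical.

```python
def infer_is_a(facts):
    """Return the transitive closure of (child, parent) 'is-a' pairs."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # If a is-a b and b is-a d, then a is-a d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical stated facts
facts = {
    ("TiO2", "d-metal oxide"),
    ("d-metal oxide", "oxide"),
    ("oxide", "inorganic compound"),
}
inferred = infer_is_a(facts) - facts  # only the newly derived relations
for child, parent in sorted(inferred):
    print(child, "is a", parent)
```

Production ontologies rely on dedicated reasoners and richer relation types; this only illustrates how new statements follow from stated ones.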
Open data, open standards, open source in chemistry
1. Elsevier, Wiley, Springer publishers are “evil”
2. “The right to read is the right to mine”
3. “Jailbreaking” scientific data from PDFs: access, reuse, integrity
4. Why is the level of collaboration so low?
Materials Project
Prof. G. Ceder,
MIT, Boston
Guido van Rossum,
Google, Dropbox
http://goo.gl/FtFS7h
Python programming language
Advantages of Python
Syntax: indentation-based blocks, syntactic sugar, close to natural speech, flexible, expressive
VERY fast prototyping
Great popularity in scientific community
100% cross-platform and portable
Disadvantages of Python
Relatively slow compared to compiled languages such as C++ or Fortran
Global Interpreter Lock (GIL)
Historically less popular in some narrow scientific areas (where Java “reigns”)
Two examples
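squares = [x**2 for x in range(10)]  # list comprehension: squares of 0..9
numbers = [10, 4, 2, -1, 6]
small = list(filter(lambda x: x < 5, numbers))  # keep values below 5; list() is needed on Python 3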
NumPy
1. Multi-dimensional array manipulation (fast!)
2. Discrete Fourier transform
3. Linear Algebra
4. Mathematical functions
5. Matrix library
6. Polynomials
7. Set routines
8. Sorting, searching and counting
9. Statistics
Solving the eigenvalue problem for a dynamical matrix (phonopy code):
eigvals, eigvecs = numpy.linalg.eigh(dynmat)
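To make the eigh call concrete, here is a self-contained sketch on a toy 2x2 dynamical matrix. The matrix is a textbook one-dimensional diatomic chain at the zone centre, with made-up values for the force constant k and the masses m1, m2 in arbitrary units. It is not taken from phonopy, whose matrices are complex Hermitian and much larger.

```python
import numpy as np

# Hypothetical parameters (arbitrary units): force constant and two masses
k, m1, m2 = 1.0, 1.0, 2.0

# Mass-weighted dynamical matrix of a 1D diatomic chain at q = 0
off = -2.0 * k / np.sqrt(m1 * m2)
dynmat = np.array([[2.0 * k / m1, off],
                   [off, 2.0 * k / m2]])

# eigh is meant for symmetric/Hermitian matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(dynmat)

# Eigenvalues are squared frequencies; clip tiny negative round-off before the square root
freqs = np.sqrt(np.clip(eigvals, 0.0, None))
print(eigvals, freqs)
```

The two eigenvalues correspond to the acoustic mode (zero at the zone centre) and the optical mode, 2k(1/m1 + 1/m2).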