SlideShare a Scribd company logo
1 of 55
Download to read offline
Chemical Databases and Open
   Chemistry on the Desktop
5th Meeting on US Government Chemical Databases & Open Chemistry
                       August 25, 2011
                   Dr. Marcus D. Hanwell
                   marcus.hanwell@kitware.com




                                                               1	
  
Outline
•    Background
•    Opening up chemistry
•    Workflows in computational chemistry
•    Avogadro – chemical editor
•    Databases on the desktop
•    Quixote
•    HPC resource integration
•    Advanced visualization


                                            2	
  
My Background
•  Ph.D. (Physics) – University of Sheffield
•  Google Summer of Code – Avogadro
•  Postdoc (Chemistry) – University of Pittsburgh
•  R&D engineer – Kitware, Inc
•  Passionate about physics, chemistry, and the
   growing need to improve computational tools
•  See the need for powerful open source, cross
   platform frameworks and applications
•  Develop(ed): Gentoo, KDE, Kalzium, Avogadro,
   Open Babel, VTK, ParaView, Titan, CMake

                                                    3	
  
Kitware
•  Founded in 1998: 5 former GE Research employees
•  95 employees: 42% PhD
•  Privately held, profitable from creation, no debt
•  Rapidly Growing: >30% in 2010, 7M web-visitors/quarter
•  Offices                                  •  2011 Small Business
   –  Albany, NY                               Administration’s
                                               Tibbetts Award
   –  Carrboro, NC
                                            •  HPCWire Readers
   –  Lyon, France                             and Editor’s Choice

   –  Bangalore, India                      •  Inc’s 5000 List: 2008
                                               to 2010
Kitware: Core Technologies




                CMake




                CDash




                             5	
  
Opening Up Chemistry
•  Computational chemistry is currently one
   of the more closed sciences
•  Lots of black box proprietary codes
  –  Only a few have access to the code
  –  Publishing results from black box codes
  –  Many file formats in use, little agreement
•  More papers should be including data
•  Growing need for open standards

                                                  6	
  
Movements for Open Chemistry
•  Formed an “unorganization” – Blue Obelisk
  –  Published first article in 2005
  –  Open data, open standards and open source
  –  Meet at ACS and other conferences when possible
  –  Follow-up article currently in press



•  Quixote collaboration more recently
  –  Provide meaningful data storage and exchange
  –  Principally targeting computational chemistry

                                                       7	
  
Typical Chemistry Workflow

        Log File                      Input File
                   Edit/Analyze	
  




   Results	
           Data	
         Job	
  Submission	
  




                                        Local
                   Calcula>on	
        Remote
                                                              8	
  
Problem: Pretty Complex/Manual
•    Most steps require user intervention
•    Obtain starting structure (previous work, databases)
•    Edit structure
•    Write input file
•    Move input file to cluster
•    Submit to queue
•    Wait for completion
•    Retrieve input file
•    Analyze output file
•    Extract the relevant data, change formats
•    Store results
•    Repeat

                                                            9	
  
Improved Chemistry Workflow

       Log File                      Input File
                  Edit/Analyze	
  



  Results	
           Data	
         Job	
  Submission	
  




                                       Local
                  Calcula>on	
        Remote

                                                         10	
  
Avogadro
•  Project began 2006
•  Split into library and
application (plugin based)
•  One of very few open source editors
•  Designed to be extensible from the start
•  Generate input & read output from many codes
•  An active and growing community
•  Chemistry needs a free, open framework


                                                  11	
  
Avogadro’s Roots
•  Avogadro projected started in 2006
•  First funded work in 2007 by Marcus Hanwell
  –  Google Summer of Code student
  –  Final year of Ph.D. spent the summer coding
  –  Funded as part of KDE project – Kalzium editor
•  Built on several other open source projects
  –  Qt, Eigen, Open Babel, Blue Obelisk Data Repository
•  Also uses open standards, e.g. OpenGL
•  Cross platform, open source stack

                                                       12	
  
Avogadro Vital Statistics
•    Supports Linux, Windows and Mac OS X
•    Contributions from over 20 developers
•    Over 180,000 downloads over 4 years
•    Translated into 19 languages
•    Used by Kalzium for molecular editor
•    Featured by Trolltech/Nokia,
     –  Qt in use
     –  Qt ambassador program
                                             13	
  
14	
  
Desktop Database
•  Use of “document store” NoSQL
  •  Doesn’t force too much structure
     •  Some entries have experimental data available
     •  Some have computational jobs
  •  Employ a “pile of stuff” approach
     •  Can store both source and derived data
     •  Calculate identifiers, QSAR properties, etc
•  MongoDB is a scalable, open solution
  •  Proven scaling with large web applications

                                                        15	
  
Chemistry Data Explorer
•    Qt application
•    Connects to local or remote database
•    Uses VTK for visual data exploration
•    Can ingest new data
     –  Uses Open Babel to generate descriptors
     –  Standard InChi, SMILES, molecular weight
     –  More could be added
        •  All derived from files stored in the database

                                                           16	
  
Chemistry Data Explorer




                          17	
  
Database Interaction on the Web
•  Avogadro directly accesses some (read-
   only) public databases:
  •  PDB, NIH “fetch by name”
  •  Resolve structure to common name using CIR
  •  More could be added
•  ChemData also uses NIH CIR for data
•  Quixote aims to support both public and
   private sharing models – open framework

                                              18	
  
Quixote Architecture




                       19	
  
Avogadro




           20	
  
OpenQube – Quantum Data
•  Reads in key quantum data
  –  Basis set used in calculation
  –  Eigenvectors for molecular orbitals
  –  Density matrix for electron density
  –  Standard geometry
•  Multithreaded calculation
  –  Produce regular grids of scalar data
  –  Molecular orbitals, electron density…

                                             21	
  
Molecular Orbitals and Electron Density

•  Quantum files store basis sets and
   matrices
                −αr 2
  GTO = ce
  φ i = ∑ c µiφ µ
        µ

  ρ(r) = ∑ ∑ Pµν φ µ φν
            µ   ν
•  Using these equations, and the supplied
   matrices – calculate cubes
                                             22	
  
Calling Stand Alone Programs
•  Many already supported:
  •  GAMESS, GAMESS-UK, Molpro, Q-Chem,
     MOPAC, NWChem, Gaussian, Dalton
  •  Easy to add more
•  Some codes writing Avogadro based
   custom applications,
  •  Q-Chem, Molpro…
•  DLPOLY author approached me:
  •  Open sourced DLPOLY2, want a GUI

                                          23	
  
Job Submission & Management
•  Take input file, submit to queue, monitor,
   retrieve, repeat
•  System tray resident Qt application
  •  Manage both local and remote jobs
•  Interest from developers
  •  Use in other applications
  •  Share development/maintenance burden


                                                24	
  
Open in Avogadro When Complete




                                 25	
  
Advanced Visualization: VTK
•  New Avogadro plugin:
     •  Takes volumetric data from Avogadro
     •  Uses GPU accelerated rendering in VTK
•    Excitement from many in the community
•    Several groups interested in collaborating
•    Google Summer of Code project
•    Leverage significant capabilities in VTK


                                                  26	
  
Volume Rendered With Contours




                                27	
  
Electron Density Volume Render




                                 28	
  
Electron Density Ray Tracing




                               29	
  
Conclusions
•  There is still a lot of work to do
•  Open databases are of critical importance
•  Need tools to make retrieving and
   depositing data easier
•  Improved data exchange is essential to
   improve reproducibility in chemistry
•  Create shared collaboration platforms
  –  Deliver improved workflows, enable research

                                                   30	
  
Extra Background Slides
•  Additional visualization and background
   slides




                                             31	
  
Standard Representations




                           32	
  
Standard Representations




                           33	
  
Biomolecules




               34	
  
Nanomaterials




                35	
  
Simplified Views




                   36	
  
Volumetric Data: Molecular Orbitals




                                  37	
  
Periodic Systems




                   38	
  
Hybrid Views: CPK + MO + Ball & Stick




                                    39	
  
Linked Views of Live Data




                            40	
  
2D: Graphs and Charts




                        41	
  
Informatics




              42	
  
3D Interaction Widgets




                         43	
  
VTK: The Toolkit
•  Collection of C++ libraries
  –  Leveraged by many applications
  –  Divided into logical areas, e.g.
     •  Filtering – data processing in visualization pipeline
     •  InfoVis – informatics visualization
     •  Widgets – 3D interaction widgets
     •  VolumeRendering – 3D volume rendering
•  Cross platform, using OpenGL
•  Wrapped in Python, Tcl and Java

                                                            44	
  
VTK Development Team
 • From Ohloh: Very large, active development team: Over the past twelve
  months, 100 developers contributed new code to VTK. This is one of the largest open-source
  teams in the world, and is in the top 2% of all project teams on Ohloh.




                                                      and many others...
                                                                                               45	
  
ParaView
•    Parallel visualization application
•    Open source, BSD licensed
•    Turn-key application wrapper around VTK
•    Parallel data processing and rendering




                                               46	
  
Large Data Visualization
•  BlueGene/L at LLNL
  –  65,536 compute nodes (32 bit PPC)
  –  1,024 I/O nodes (32 bit PPC)
  –  512 MB of RAM per node
•  Sandia Red Storm
  –  12,960 compute nodes (AMD Opteron dual)
  –  640 service and I/O nodes
  –  40 TB of DDR RAM per node

                                               47	
  
1 Billion Cell Asteroid Simulation




                                     48	
  
Tiled Displays




                 49	
  
Parallel Processing/Rendering




                                50	
  
3D Chemistry Visualization
•  Some existing features specific to chemistry
   –  Gaussian cube, PDB, and a few others
•  Excellent handling of volumetric data:
   –  Marching cubes
   –  Volume rendering
   –  Contouring
•  Advanced rendering:
   –  Point sprites
   –  Manta – real time ray tracing

                                                  51	
  
Titan: VTK and Informatics
•  Led by Sandia National Laboratories
•  Substantial expansion of VTK:
   –  Informatics & analysis
•  Actively developed, growing feature set
•  Improved 2D rendering and API
•  Database connectivity, client-server, pipeline
   based approach
•  Uses web technologies such as ProtoViz
•  Scalable, interactive infoviz

                                                    52	
  
Manta: Real Time Ray Tracing




                               53	
  
New Frontiers
•  New work porting VTK
  –  Use C++ as the common core
    •  iOS port in the early stages
    •  Android port
  –  Use OpenGL ES 2.0 – new rendering code
•  Also ParaViewWeb – delivering over web
  –  Use image delivery and rendering on server
  –  Also using WebGL for rendering (optionally)


                                                   54	
  
Future Directions
•  VTK modularization (in progress)
   –  Developing more agile build systems
   –  Automating more with CMake
•  Using Git more fully to improve stability
   –  Use of master and next
   –  Topic branches - merge when ready
•  Code review using Gerrit
   –  Integration with continuous integration
   –  Test before merge


                                                55	
  

More Related Content

What's hot

Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonJo-fai Chow
 
Introduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesJo-fai Chow
 
Evolution of database access technologies in Java-based software projects
Evolution of database access technologies in Java-based software projectsEvolution of database access technologies in Java-based software projects
Evolution of database access technologies in Java-based software projectsTom Mens
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution vty
 
Survival analysis of database technologies in open source Java projects
Survival analysis of database technologies in open source Java projectsSurvival analysis of database technologies in open source Java projects
Survival analysis of database technologies in open source Java projectsTom Mens
 
Building COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyBuilding COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyvty
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R MeetupJo-fai Chow
 
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekFrom Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekJo-fai Chow
 
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...vty
 
OpenVis Conference Report Part 1 (and Introduction to D3.js)
OpenVis Conference Report Part 1 (and Introduction to D3.js)OpenVis Conference Report Part 1 (and Introduction to D3.js)
OpenVis Conference Report Part 1 (and Introduction to D3.js)Keiichiro Ono
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and FutureKeiichiro Ono
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneJo-fai Chow
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneJo-fai Chow
 
Flexible metadata schemes for research data repositories - Clarin Conference...
Flexible metadata schemes for research data repositories  - Clarin Conference...Flexible metadata schemes for research data repositories  - Clarin Conference...
Flexible metadata schemes for research data repositories - Clarin Conference...Vyacheslav Tykhonov
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarksTilmann Rabl
 
The world of Docker and Kubernetes
The world of Docker and Kubernetes The world of Docker and Kubernetes
The world of Docker and Kubernetes vty
 
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in DataverseClariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataversevty
 
TPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data IntegrationTPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data IntegrationTilmann Rabl
 

What's hot (20)

Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
Introduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use CasesIntroduction to H2O and Model Stacking Use Cases
Introduction to H2O and Model Stacking Use Cases
 
Evolution of database access technologies in Java-based software projects
Evolution of database access technologies in Java-based software projectsEvolution of database access technologies in Java-based software projects
Evolution of database access technologies in Java-based software projects
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
 
Survival analysis of database technologies in open source Java projects
Survival analysis of database technologies in open source Java projectsSurvival analysis of database technologies in open source Java projects
Survival analysis of database technologies in open source Java projects
 
Building COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyBuilding COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhy
 
H2O at Poznan R Meetup
H2O at Poznan R MeetupH2O at Poznan R Meetup
H2O at Poznan R Meetup
 
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data GeekFrom Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek
 
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
Automated CI/CD testing, installation and deployment of Dataverse infrastruct...
 
OpenVis Conference Report Part 1 (and Introduction to D3.js)
OpenVis Conference Report Part 1 (and Introduction to D3.js)OpenVis Conference Report Part 1 (and Introduction to D3.js)
OpenVis Conference Report Part 1 (and Introduction to D3.js)
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
Bioinformatics on Azure
Bioinformatics on AzureBioinformatics on Azure
Bioinformatics on Azure
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
Flexible metadata schemes for research data repositories - Clarin Conference...
Flexible metadata schemes for research data repositories  - Clarin Conference...Flexible metadata schemes for research data repositories  - Clarin Conference...
Flexible metadata schemes for research data repositories - Clarin Conference...
 
ResourceSync tutorial OAI8
ResourceSync tutorial OAI8ResourceSync tutorial OAI8
ResourceSync tutorial OAI8
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarks
 
The world of Docker and Kubernetes
The world of Docker and Kubernetes The world of Docker and Kubernetes
The world of Docker and Kubernetes
 
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in DataverseClariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
 
TPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data IntegrationTPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data Integration
 

Similar to Chemical Databases and Open Chemistry on the Desktop

Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsSimon Cockell
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 
Efficient & effective data management for research projects : ILRI's Data Ma...
Efficient & effective  data management for research projects : ILRI's Data Ma...Efficient & effective  data management for research projects : ILRI's Data Ma...
Efficient & effective data management for research projects : ILRI's Data Ma...CIARD Movement
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Christopher Curtin
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobus
 
Data Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryData Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryCloudera, Inc.
 
How static analysis supports quality over 50 million lines of C++ code
How static analysis supports quality over 50 million lines of C++ codeHow static analysis supports quality over 50 million lines of C++ code
How static analysis supports quality over 50 million lines of C++ codecppfrug
 
IMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Centre of Competence
 
Webinar: How We Evaluated MongoDB as a Relational Database Replacement
Webinar: How We Evaluated MongoDB as a Relational Database ReplacementWebinar: How We Evaluated MongoDB as a Relational Database Replacement
Webinar: How We Evaluated MongoDB as a Relational Database ReplacementMongoDB
 
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingClouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingMatthew Vaughn
 
QuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data SpreadsheetQuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data Spreadsheetinside-BigData.com
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Kento Aoyama
 
COMSODE networking session at ICT Lisbon 2015
COMSODE networking session at ICT Lisbon 2015COMSODE networking session at ICT Lisbon 2015
COMSODE networking session at ICT Lisbon 2015Comsode - FP7 project
 

Similar to Chemical Databases and Open Chemistry on the Desktop (20)

Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformatics
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 
Efficient & effective data management for research projects : ILRI's Data Ma...
Efficient & effective  data management for research projects : ILRI's Data Ma...Efficient & effective  data management for research projects : ILRI's Data Ma...
Efficient & effective data management for research projects : ILRI's Data Ma...
 
Virtualization for HPC at NCI
Virtualization for HPC at NCIVirtualization for HPC at NCI
Virtualization for HPC at NCI
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Data Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryData Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal History
 
How static analysis supports quality over 50 million lines of C++ code
How static analysis supports quality over 50 million lines of C++ codeHow static analysis supports quality over 50 million lines of C++ code
How static analysis supports quality over 50 million lines of C++ code
 
IMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens NeudeckerIMPACT Interoperability Framework - Clemens Neudecker
IMPACT Interoperability Framework - Clemens Neudecker
 
COPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob DaveyCOPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob Davey
 
Webinar: How We Evaluated MongoDB as a Relational Database Replacement
Webinar: How We Evaluated MongoDB as a Relational Database ReplacementWebinar: How We Evaluated MongoDB as a Relational Database Replacement
Webinar: How We Evaluated MongoDB as a Relational Database Replacement
 
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingClouds, Clusters, and Containers: Tools for responsible, collaborative computing
Clouds, Clusters, and Containers: Tools for responsible, collaborative computing
 
QuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data SpreadsheetQuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data Spreadsheet
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...
 
COMSODE networking session at ICT Lisbon 2015
COMSODE networking session at ICT Lisbon 2015COMSODE networking session at ICT Lisbon 2015
COMSODE networking session at ICT Lisbon 2015
 

Recently uploaded

Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 

Recently uploaded (20)

Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 

Chemical Databases and Open Chemistry on the Desktop

  • 1. Chemical Databases and Open Chemistry on the Desktop 5th Meeting on US Government Chemical Databases & Open Chemistry August 25, 2011 Dr. Marcus D. Hanwell marcus.hanwell@kitware.com 1  
  • 2. Outline •  Background •  Opening up chemistry •  Workflows in computational chemistry •  Avogadro – chemical editor •  Databases on the desktop •  Quixote •  HPC resource integration •  Advanced visualization 2  
  • 3. My Background •  Ph.D. (Physics) – University of Sheffield •  Google Summer of Code – Avogadro •  Postdoc (Chemistry) – University of Pittsburgh •  R&D engineer – Kitware, Inc •  Passionate about physics, chemistry, and the growing need to improve computational tools •  See the need for powerful open source, cross platform frameworks and applications •  Develop(ed): Gentoo, KDE, Kalzium, Avogadro, Open Babel, VTK, ParaView, Titan, CMake 3  
  • 4. Kitware •  Founded in 1998: 5 former GE Research employees •  95 employees: 42% PhD •  Privately held, profitable from creation, no debt •  Rapidly Growing: >30% in 2010, 7M web-visitors/quarter •  Offices •  2011 Small Business –  Albany, NY Administration’s Tibbetts Award –  Carrboro, NC •  HPCWire Readers –  Lyon, France and Editor’s Choice –  Bangalore, India •  Inc’s 5000 List: 2008 to 2010
  • 5. Kitware: Core Technologies CMake CDash 5  
  • 6. Opening Up Chemistry •  Computational chemistry is currently one of the more closed sciences •  Lots of black box proprietary codes –  Only a few have access to the code –  Publishing results from black box codes –  Many file formats in use, little agreement •  More papers should be including data •  Growing need for open standards 6  
  • 7. Movements for Open Chemistry •  Formed an “unorganization” – Blue Obelisk –  Published first article in 2005 –  Open data, open standards and open source –  Meet at ACS and other conferences when possible –  Follow-up article currently in press •  Quixote collaboration more recently –  Provide meaningful data storage and exchange –  Principally targeting computational chemistry 7  
  • 8. Typical Chemistry Workflow Log File Input File Edit/Analyze   Results   Data   Job  Submission   Local Calcula>on   Remote 8  
  • 9. Problem: Pretty Complex/Manual •  Most steps require user intervention •  Obtain starting structure (previous work, databases) •  Edit structure •  Write input file •  Move input file to cluster •  Submit to queue •  Wait for completion •  Retrieve input file •  Analyze output file •  Extract the relevant data, change formats •  Store results •  Repeat 9  
  • 10. Improved Chemistry Workflow Log File Input File Edit/Analyze   Results   Data   Job  Submission   Local Calcula>on   Remote 10  
  • 11. Avogadro •  Project began 2006 •  Split into library and application (plugin based) •  One of very few open source editors •  Designed to be extensible from the start •  Generate input & read output from many codes •  An active and growing community •  Chemistry needs a free, open framework 11  
  • 12. Avogadro’s Roots •  Avogadro projected started in 2006 •  First funded work in 2007 by Marcus Hanwell –  Google Summer of Code student –  Final year of Ph.D. spent the summer coding –  Funded as part of KDE project – Kalzium editor •  Built on several other open source projects –  Qt, Eigen, Open Babel, Blue Obelisk Data Repository •  Also uses open standards, e.g. OpenGL •  Cross platform, open source stack 12  
  • 13. Avogadro Vital Statistics •  Supports Linux, Windows and Mac OS X •  Contributions from over 20 developers •  Over 180,000 downloads over 4 years •  Translated into 19 languages •  Used by Kalzium for molecular editor •  Featured by Trolltech/Nokia, –  Qt in use –  Qt ambassador program 13  
  • 14. 14  
  • 15. Desktop Database •  Use of “document store” NoSQL •  Doesn’t force too much structure •  Some entries have experimental data available •  Some have computational jobs •  Employ a “pile of stuff” approach •  Can store both source and derived data •  Calculate identifiers, QSAR properties, etc •  MongoDB is a scalable, open solution •  Proven scaling with large web applications 15  
  • 16. Chemistry Data Explorer •  Qt application •  Connects to local or remote database •  Uses VTK for visual data exploration •  Can ingest new data –  Uses Open Babel to generate descriptors –  Standard InChi, SMILES, molecular weight –  More could be added •  All derived from files stored in the database 16  
  • 18. Database Interaction on the Web •  Avogadro directly accesses some (read- only) public databases: •  PDB, NIH “fetch by name” •  Resolve structure to common name using CIR •  More could be added •  ChemData also uses NIH CIR for data •  Quixote aims to support both public and private sharing models – open framework 18  
  • 20. Avogadro 20  
  • 21. OpenQube – Quantum Data •  Reads in key quantum data –  Basis set used in calculation –  Eigenvectors for molecular orbitals –  Density matrix for electron density –  Standard geometry •  Multithreaded calculation –  Produce regular grids of scalar data –  Molecular orbitals, electron density… 21  
  • 22. Molecular Orbitals and Electron Density •  Quantum files store basis sets and matrices −αr 2 GTO = ce φ i = ∑ c µiφ µ µ ρ(r) = ∑ ∑ Pµν φ µ φν µ ν •  Using these equations, and the supplied matrices – calculate cubes 22  
  • 23. Calling Stand Alone Programs •  Many already supported: •  GAMESS, GAMESS-UK, Molpro, Q-Chem, MOPAC, NWChem, Gaussian, Dalton •  Easy to add more •  Some codes writing Avogadro based custom applications, •  Q-Chem, Molpro… •  DLPOLY author approached me: •  Open sourced DLPOLY2, want a GUI 23  
  • 24. Job Submission & Management •  Take input file, submit to queue, monitor, retrieve, repeat •  System tray resident Qt application •  Manage both local and remote jobs •  Interest from developers •  Use in other applications •  Share development/maintenance burden 24  
  • 25. Open in Avogadro When Complete 25  
  • 26. Advanced Visualization: VTK •  New Avogadro plugin: •  Takes volumetric data from Avogadro •  Uses GPU accelerated rendering in VTK •  Excitement from many in the community •  Several groups interested in collaborating •  Google Summer of Code project •  Leverage significant capabilities in VTK 26  
  • 27. Volume Rendered With Contours 27  
  • 28. Electron Density Volume Render 28  
  • 29. Electron Density Ray Tracing 29  
  • 30. Conclusions •  There is still a lot of work to do •  Open databases are of critical importance •  Need tools to make retrieving and depositing data easier •  Improved data exchange is essential to improve reproducibility in chemistry •  Create shared collaboration platforms –  Deliver improved workflows, enable research 30  
  • 31. Extra Background Slides •  Additional visualization and background slides 31  
  • 34. Biomolecules 34  
  • 35. Nanomaterials 35  
  • 37. Volumetric Data: Molecular Orbitals 37  
  • 39. Hybrid Views: CPK + MO + Ball & Stick 39  
  • 40. Linked Views of Live Data 40  
  • 41. 2D: Graphs and Charts 41  
  • 42. Informatics 42  
  • 44. VTK: The Toolkit •  Collection of C++ libraries –  Leveraged by many applications –  Divided into logical areas, e.g. •  Filtering – data processing in visualization pipeline •  InfoVis – informatics visualization •  Widgets – 3D interaction widgets •  VolumeRendering – 3D volume rendering •  Cross platform, using OpenGL •  Wrapped in Python, Tcl and Java 44  
  • 45. VTK Development Team • From Ohloh: Very large, active development team: Over the past twelve months, 100 developers contributed new code to VTK. This is one of the largest open-source teams in the world, and is in the top 2% of all project teams on Ohloh. and many others... 45  
  • 46. ParaView •  Parallel visualization application •  Open source, BSD licensed •  Turn-key application wrapper around VTK •  Parallel data processing and rendering 46  
  • 47. Large Data Visualization •  BlueGene/L at LLNL –  65,536 compute nodes (32 bit PPC) –  1,024 I/O nodes (32 bit PPC) –  512 MB of RAM per node •  Sandia Red Storm –  12,960 compute nodes (AMD Opteron dual) –  640 service and I/O nodes –  40 TB of DDR RAM per node 47  
  • 48. 1 Billion Cell Asteroid Simulation 48  
  • 49. Tiled Displays 49  
  • 51. 3D Chemistry Visualization •  Some existing features specific to chemistry –  Gaussian cube, PDB, and a few others •  Excellent handling of volumetric data: –  Marching cubes –  Volume rendering –  Contouring •  Advanced rendering: –  Point sprites –  Manta – real time ray tracing 51  
  • 52. Titan: VTK and Informatics •  Led by Sandia National Laboratories •  Substantial expansion of VTK: –  Informatics & analysis •  Actively developed, growing feature set •  Improved 2D rendering and API •  Database connectivity, client-server, pipeline based approach •  Uses web technologies such as ProtoViz •  Scalable, interactive infoviz 52  
  • 53. Manta: Real Time Ray Tracing 53  
  • 54. New Frontiers •  New work porting VTK –  Use C++ as the common core •  iOS port in the early stages •  Android port –  Use OpenGL ES 2.0 – new rendering code •  Also ParaViewWeb – delivering over web –  Use image delivery and rendering on server –  Also using WebGL for rendering (optionally) 54  
  • 55. Future Directions •  VTK modularization (in progress) –  Developing more agile build systems –  Automating more with CMake •  Using Git more fully to improve stability –  Use of master and next –  Topic branches - merge when ready •  Code review using Gerrit –  Integration with continuous integration –  Test before merge 55