Design decisions relating to ChiToolbox, presented at the Kick-off Meeting for OpenVibSpec, 3 February 2020, in Bochum, Germany.
ChiToolbox is an open source MATLAB toolbox for handling data from hyperspectral imaging experiments.
https://bitbucket.org/AlexHenderson/ChiToolbox/
https://openvibspec.org/
1. 2020 VISION
ALEX HENDERSON
UNIVERSITY OF MANCHESTER
THE INTERNATIONAL SOCIETY FOR CLINICAL SPECTROSCOPY
SURFACESPECTRA LIMITED alexhenderson.info
@AlexHenderson00
@ChiToolbox
Kick-off Meeting
and Hackathon
3-6 Feb, 2020
Ruhr-University Bochum
2. DUBIOUS DESIGN DECISIONS
9. CLIRSPEC DATA
Online community for us to share
algorithms, code and ideas
Hosted on Slack
Request an invitation to join
Any member can add anyone else
http://tiny.cc/clirspec-data
11. DESIGN DECISIONS
• What worked?
• What did not work?
• What were the compromises?
• What would I do differently?
12. OBJECT ORIENTED PROGRAMMING (OOP)
• Abstract base classes for spectra, spectral collections and images
• Concrete classes for the above
• ‘Interface’ classes for ‘Raman character’, ‘IR character’ etc.
• Multiple inheritance to define technique-specific classes
• e.g. IRSpectrum, RamanImage, (ToF)MSSpectralCollection
• Separate classes for pictures, RMieS options, PCA or RF models etc.
• Model using classes where possible
• Provides type-identification and bespoke functionality
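The class layout on this slide can be illustrated with a minimal sketch. It is written in Python rather than MATLAB, and every name in it (SpectrumBase, IRCharacter, RamanCharacter, total_signal) is hypothetical, not taken from ChiToolbox. It only shows the pattern: an abstract base class, small 'character' mixins, and multiple inheritance to build technique-specific classes.

```python
from abc import ABC, abstractmethod


class SpectrumBase(ABC):
    """Hypothetical abstract base: x-values and intensities of one spectrum."""

    def __init__(self, xvals, data):
        if len(xvals) != len(data):
            raise ValueError("xvals and data must have the same length")
        self.xvals = list(xvals)
        self.data = list(data)

    @property
    @abstractmethod
    def xlabel(self):
        """Axis label; must be supplied by a technique-specific class."""

    def total_signal(self):
        # Behaviour shared by every technique lives in the base class.
        return sum(self.data)


class IRCharacter:
    """'Interface' mixin carrying IR-specific details (illustrative only)."""
    xlabel = "wavenumber (cm^-1)"


class RamanCharacter:
    """'Interface' mixin carrying Raman-specific details (illustrative only)."""
    xlabel = "Raman shift (cm^-1)"


# Multiple inheritance: the mixin comes first so its xlabel
# satisfies the abstract property declared in SpectrumBase.
class IRSpectrum(IRCharacter, SpectrumBase):
    pass


class RamanSpectrum(RamanCharacter, SpectrumBase):
    pass
```

Because each technique has its own class, isinstance checks give the type-identification mentioned above, and technique-specific methods can be added to the mixins without touching the base class.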
13. FILE FORMATS
• Agilent (FTIR)
• Single FTIR images and mosaicked FTIR images
• Biotof (ToFSIMS)
• Spectra, hyperspectral image files
• Bruker (FTIR)
• Opus files and multiple spectra exported as a MAT file
• Ionoptika (ToFSIMS)
• Hyperspectral image files exported in HDF5 format
• Mettler Toledo (FTIR)
• Spectra exported in ASCII
• Renishaw (Raman)
• WiRE Version 4, spectral and hyperspectral images
• Thermo Fisher Scientific GRAMS SPC (Generic)
• Data stored in spc files
Single files can be read using ChiFile, which works out the file format automatically.
In addition, readable but unreleased:
• Photothermal (FTIR and Raman)
• mIRage spectra and hyperspectral images
• IONTOF (ToFSIMS)
• Hyperspectral images in grd format
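How ChiFile identifies a format is not described on the slide. One common approach is sketched below in Python, under stated assumptions: the reader names, the extension table and the function detect_format are all hypothetical, and the sketch checks only the standard HDF5 file signature plus the file extension.

```python
from pathlib import Path

# The 8-byte signature that starts every HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Assumed mapping from file extension to a (hypothetical) reader name.
READERS_BY_EXTENSION = {
    ".spc": "grams_spc",
    ".wdf": "renishaw",
    ".h5": "hdf5",
}


def detect_format(filename, header=b""):
    """Return a reader name from the first bytes of a file or its extension."""
    # File contents are more trustworthy than the name, so check them first.
    if header.startswith(HDF5_SIGNATURE):
        return "hdf5"
    extension = Path(filename).suffix.lower()
    try:
        return READERS_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"Unrecognised file format: {filename!r}") from None
```

A caller would read the first few bytes of the file, pass them in as header, and dispatch to the reader named by the result.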
14. FILE FORMAT ISSUES
• Some formats were hacked, e.g. Agilent
• What if example files were specific to certain instrumentation?
• Some formats are multi-purpose
• Some formats hold only one data type (spectrum, line scan, image etc.)
• E.g. Agilent single tile format
• Some formats can contain any of these data types
• E.g. Renishaw
• If we read multiple files, what should we do if their contents are of different types?
16. SOFTWARE LICENCE
• ChiToolbox released under GNU General Public License 3.0 (GPL)
• External code is GPL, or more liberal (e.g. MIT)
• GPL ‘infects’ the codebase
• Users who distribute code that intrinsically links to this code must release it under the GPL
• Prefer GNU Lesser General Public License (LGPL)
• Your codebase is not affected, but changes to the toolbox must be shared
• Unfortunately, a codebase containing GPL code cannot be released under the LGPL
17. MATLAB ISSUES
• Tried to make it backward compatible with R2009a
• Too painful
• Roughly compatible with R2016a
• Trying to reduce toolbox dependencies (e.g. Statistics toolbox)
• MATLAB OOP not great
• Variables are passed by value, but handle class objects are passed by reference, which makes copying difficult
• Rolled my own deep copy mechanism (clone)
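The copying problem can be shown with a short Python sketch. Python objects behave like MATLAB handle classes here: assignment creates a second reference, not a copy. The Spectrum class and its clone method are hypothetical stand-ins and do not reproduce ChiToolbox's own clone implementation.

```python
import copy


class Spectrum:
    """Hypothetical reference-type object holding mutable data."""

    def __init__(self, xvals, data):
        self.xvals = list(xvals)
        self.data = list(data)

    def clone(self):
        # Deep copy: the nested lists are duplicated too,
        # so the clone shares nothing with the original.
        return copy.deepcopy(self)


original = Spectrum([1000, 1002], [0.1, 0.2])
alias = original              # second reference to the same object
duplicate = original.clone()  # independent copy

alias.data[0] = 9.9           # also changes 'original'
```

After the last line, original.data[0] is 9.9 because alias and original are the same object, while duplicate.data[0] is still 0.1.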
18. DATA TYPES
• Single spectrum, spectral collection, hyperspectral image
• Continuous data
• Did not consider multispectral data (discrete wavenumber)
• Discontinuous, cannot take first derivative etc.
• Data is a property of the object, not a pointer/function to a data storage type
19. METADATA
• Separate class from data type
• Automatically label plots (e.g. PCA scores)
• Build lists of labels manually
labels = ChiClassMembership('mylabels','beta',1, 'gamma',2, 'beta',3, 'alpha',2);
• Automatically read from specially designed Excel spreadsheet
• Handles logical, category and numeric types
• Need to remove label from metadata if removing spectrum from collection
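The last point, keeping labels in step with the spectra, can be sketched as follows. This is an illustrative Python example, not ChiToolbox code: LabelledCollection and its remove method are hypothetical names, and the labels are plain strings standing in for a category-type metadata column.

```python
class LabelledCollection:
    """Hypothetical collection that stores one label per spectrum."""

    def __init__(self, spectra, labels):
        if len(spectra) != len(labels):
            raise ValueError("need exactly one label per spectrum")
        self.spectra = list(spectra)
        self.labels = list(labels)

    def remove(self, index):
        # Delete the spectrum and its label together so that
        # position i in 'labels' always describes spectrum i.
        del self.spectra[index]
        del self.labels[index]

    def __len__(self):
        return len(self.spectra)
```

Routing every removal through a single method is one way to make it impossible for the data and the metadata to drift apart.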
20. Users are not sure of the difference between numeric and category types when using numbered samples
21. DEFAULTS
• Try to provide ‘reasonable’ default values
• PCA denoising defaults to 30% of PCs retained
• Random Forest defaults to 80% training and 20% test sets
• Should default to 5-fold cross validation, but takes time
• Random Forest defaults to using parallel processing if data set is large
• MATLAB is slow to initialise worker pool
• All parameters are user-configurable
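The idea of 'reasonable defaults that remain user-configurable' can be sketched in Python. The function names below are hypothetical; only the default values (80% training data, 30% of PCs retained) come from the slide, and the shuffling strategy is an assumption.

```python
import random


def split_train_test(n_samples, train_fraction=0.8, seed=None):
    """Return (train, test) index lists; 80/20 unless the caller overrides it."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducible splits
    n_train = round(n_samples * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def n_components_to_keep(n_components, fraction=0.3):
    """Number of principal components retained for denoising (30% by default)."""
    return max(1, round(n_components * fraction))
```

Calling split_train_test(100) uses the default split, while split_train_test(100, train_fraction=0.5) overrides it without any other change to the calling code.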
22. VISUALISATION
• Graphics use perceptually uniform colormaps
• Caters for colour vision deficiency (colour blindness)
• Colour-mapped PCA image scores and loadings plots
• Dialog box for Raman baseline removal
• Asymmetric least squares baseline modelling requires user input
• Confidence limits on PCA/CVA* scores plots
• Default is 95%, but user-configurable
• RMieS iteration change plot
*Canonical variates analysis
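One common way to draw a confidence limit on a two-dimensional scores plot is an ellipse scaled by a chi-squared quantile. The Python sketch below assumes that approach; the slide does not say which method ChiToolbox uses, and the function name is hypothetical. For two degrees of freedom the chi-squared quantile has the closed form -2 ln(1 - p), about 5.99 at p = 0.95. Only the semi-axis lengths are computed, not the orientation.

```python
import math


def confidence_ellipse_axes(xs, ys, confidence=0.95):
    """Semi-axis lengths (major, minor) of a chi-squared confidence ellipse."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    root = math.hypot((sxx - syy) / 2.0, sxy)
    largest = half_trace + root
    smallest = max(half_trace - root, 0.0)
    # Chi-squared quantile for 2 degrees of freedom.
    scale = -2.0 * math.log(1.0 - confidence)
    return math.sqrt(scale * largest), math.sqrt(scale * smallest)
```

Passing confidence=0.99 widens the ellipse, which mirrors the user-configurable default described above.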
26. 2020 VISION
(IF I HAD A TIME MACHINE)
• Developed more tests
• Added support for discrete wavenumber data
• Separated data storage from data manipulation
• Used database (SQLite) to manage metadata
• Considered OOP for data storage, but functional programming for operations
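The SQLite idea can be sketched with Python's built-in sqlite3 module. The table layout, column names and helper functions are assumptions made for illustration; the three columns simply mirror the logical, category and numeric metadata types mentioned earlier.

```python
import sqlite3


def build_metadata_db(rows):
    """Create an in-memory metadata table, one row per spectrum."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        " spectrum_id INTEGER PRIMARY KEY,"
        " treated INTEGER NOT NULL,"  # logical type, stored as 0 or 1
        " tissue TEXT NOT NULL,"      # category type
        " dose REAL NOT NULL)"        # numeric type
    )
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def spectra_matching(conn, tissue):
    """Return the ids of all spectra with the given category label."""
    cursor = conn.execute(
        "SELECT spectrum_id FROM metadata WHERE tissue = ? ORDER BY spectrum_id",
        (tissue,),
    )
    return [row[0] for row in cursor]
```

Storing the column type in the schema would also make the numeric versus category distinction explicit, and deleting a spectrum's row removes all of its labels in one step.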
27. 2020 VISION
(IF I HAD A TIME MACHINE)
Write it all in Python!
…or C++