Machine learning to solve bioinformatics problems

Data-driven advice for applying machine learning to bioinformatics problems
Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik and Jason H. Moore
Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Pac Symp Biocomput. Author manuscript; available in PMC 2018 April 09.
Junaid Ahmed
181706016
M.Sc. Bioinformatics

What is Artificial Intelligence?
“Intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and other
animals”

“Enabling the computers to think”
“Statistical tools to learn from data”
“Multilayer neural network”

The decision tree algorithm belongs to the family of supervised learning algorithms.
Decision trees can be used to solve both regression and classification problems.
Classification and regression trees
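To make the classification/regression distinction concrete, here is a minimal sketch using scikit-learn's two decision tree estimators. The datasets (iris and diabetes, both bundled with scikit-learn), the 70/30 split, and `max_depth=3` are illustrative assumptions, not settings taken from the paper.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is a discrete label (iris species).
X_cls, y_cls = load_iris(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.3, random_state=0, stratify=y_cls
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth is an arbitrary choice
clf.fit(Xc_train, yc_train)
cls_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy on held-out samples

# Regression: the target is a continuous value (disease progression score).
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=0
)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # coefficient of determination (R^2)

print(f"classification accuracy: {cls_accuracy:.3f}")
print(f"regression R^2: {reg_r2:.3f}")
```

The same tree-growing idea serves both tasks; only the split criterion and the value stored in each leaf (a class versus a number) differ.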

• Machine learning techniques such as deep learning enable automatic feature learning: based on the
dataset alone, the algorithm can learn how to combine multiple features of the input data into a
more abstract set of features from which to conduct further learning.
• Machine learning has been applied to six main subfields of bioinformatics:
genomics, proteomics, microarrays, systems biology, evolution, and text mining.
• Machine learning has also been used for multiple sequence alignment, which involves aligning many
DNA or amino acid sequences to determine regions of similarity that could indicate a shared
evolutionary history. It can also be used to detect and visualize genome rearrangements.
• ML algorithms have been applied with great success in genome-wide association studies (GWAS), for
example to detect patterns of epistasis within the human genome.
• Deep learning algorithms have been used to detect cancer metastases on high-resolution pathology
images at levels comparable to human pathologists.

Goals
In this paper, we take a detailed look at 13 popular open source ML
algorithms and analyse their performance across a set of 165 supervised
classification problems in order to provide data-driven advice to
practitioners who wish to apply ML to their datasets.
The results highlight the importance of selecting the right ML algorithm
for each problem, which can improve prediction accuracy significantly on
some problems.
Finally, based on the results of the experiments, we provide a refined set
of recommendations for ML algorithms and parameters as a starting point
for future researchers.

• The authors compared 13 popular ML algorithms from scikit-learn, a widely used ML library
implemented in Python. Each algorithm and its hyperparameters are described in Table 1.
• The algorithms were compared on 165 supervised classification datasets from the Penn
Machine Learning Benchmark (PMLB).
• The algorithms include Naïve Bayes algorithms, common linear classifiers, tree-based
algorithms, distance-based classifiers, ensemble algorithms, and non-linear, kernel-based
strategies.
• For each algorithm, the hyperparameters were tuned using a fixed grid search with 10-fold
cross-validation.
• The entire experimental design consisted of over 5.5 million ML algorithm and parameter
evaluations in total, resulting in a rich set of data that is analysed from several
viewpoints: algorithm performance, tuning and model selection, and algorithm coverage.
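The tuning protocol described above (a fixed grid per algorithm, scored with 10-fold cross-validation) can be sketched with scikit-learn's `GridSearchCV`. This is only an illustration under stated assumptions: the bundled breast cancer dataset stands in for a PMLB dataset, the two algorithms and their small grids are hypothetical rather than the grids in the paper's Table 1, and the feature scaling step is a choice made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption); the paper used 165 datasets from PMLB.
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical, deliberately small grids for two of the algorithm families.
candidates = {
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_features": ["sqrt", "log2"]},
    ),
}

# 10-fold stratified cross-validation, scored by balanced accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="balanced_accuracy", cv=cv)
    search.fit(X, y)  # evaluates every grid point on every fold
    results[name] = (search.best_score_, search.best_params_)
    print(f"{name}: {search.best_score_:.3f} with {search.best_params_}")
```

Each grid point costs ten model fits, which is how a study of 13 algorithms on 165 datasets reaches millions of evaluations.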

Algorithm performance results are viewed through several lenses:
• Performance of each algorithm across all datasets in
terms of best balanced accuracy. (1)
• Effect of tuning and model selection. (2)
• Clustering of algorithms by their performance across the tested problems,
and the set of algorithms that maximizes performance across the datasets. (3)
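Balanced accuracy matters because plain accuracy can look good on imbalanced data even when a minority class is never predicted. The toy labels below are made up for illustration, and scikit-learn's `balanced_accuracy_score` (the average of per-class recall) is assumed to match the metric used in the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced problem: 8 negatives and 2 positives (made-up labels).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A useless classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

plain = accuracy_score(y_true, y_pred)

# Balanced accuracy by hand: recall of each class, then the unweighted mean.
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
manual = float(np.mean(recalls))

library = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy: {plain:.2f}")           # 0.80
print(f"balanced accuracy: {manual:.2f}") # 0.50
```

The majority-class predictor scores 0.80 on plain accuracy but only 0.50 on balanced accuracy, the same as random guessing on a two-class problem.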

(1) Algorithm performance
We plot the mean rankings of the algorithms across all datasets.
To assess the statistical significance of the observed
differences in algorithm performance across all problems, we
use the non-parametric Friedman test.
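As a rough sketch of this analysis, the code below ranks algorithms within each dataset, averages the ranks, and applies SciPy's `friedmanchisquare`. The score matrix is synthetic (generated from assumed offsets and noise), and the three algorithm names are placeholders, so the numbers say nothing about the paper's actual results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
algorithms = ["gradient_boosting", "random_forest", "naive_bayes"]  # placeholder names

# Synthetic balanced-accuracy scores: rows are datasets, columns are algorithms.
n_datasets = 20
base = rng.uniform(0.6, 0.9, size=(n_datasets, 1))  # per-dataset difficulty
offsets = np.array([0.05, 0.03, -0.05])              # assumed algorithm effects
scores = base + offsets + rng.normal(0, 0.01, size=(n_datasets, 3))

# Rank algorithms within each dataset; rank 1 is the highest score.
ranks = rankdata(-scores, axis=1)
mean_ranks = ranks.mean(axis=0)

# Friedman test: each algorithm's scores over the same datasets form one sample.
statistic, p_value = friedmanchisquare(*scores.T)

for name, rank in zip(algorithms, mean_ranks):
    print(f"{name}: mean rank {rank:.2f}")
print(f"Friedman statistic = {statistic:.2f}, p = {p_value:.2g}")
```

The test works on within-dataset ranks rather than raw scores, so it does not assume the scores are normally distributed or comparable in scale across datasets.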

Our experimental results allow us to measure the extent to which
hyperparameter tuning via grid search improves each algorithm’s
performance compared to its baseline settings.
We also measure the effect that model selection has on improving classifier
performance.
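One way to quantify the tuning effect is to compare an algorithm's cross-validated score at its default settings with the best score found by grid search on the same folds. The sketch below assumes a support vector classifier, a small hypothetical grid, and the bundled breast cancer dataset; none of these choices come from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset (assumption)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), SVC())

# Baseline: scikit-learn's default hyperparameters (C=1.0, gamma="scale").
baseline = cross_val_score(model, X, y, scoring="balanced_accuracy", cv=cv).mean()

# Tuned: best mean score over a small hypothetical grid that includes the defaults.
grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(model, grid, scoring="balanced_accuracy", cv=cv)
search.fit(X, y)
tuned = search.best_score_

improvement = tuned - baseline
print(f"baseline {baseline:.3f}, tuned {tuned:.3f}, improvement {improvement:+.3f}")
```

Because the grid contains the default settings and both runs use the same folds, the tuned score cannot fall below the baseline here. The best score is selected on the folds used for evaluation, so it is an optimistic estimate.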

• We perform hierarchical agglomerative clustering on the 10-fold CV
balanced accuracy results, which leads to the clusters shown in the figure.
• We present a list of five recommended algorithms and parameter settings.
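The clustering step can be sketched with SciPy's hierarchical clustering routines. Each algorithm is represented by its vector of scores across datasets, and algorithms with similar vectors are merged first. The scores are synthetic, the four algorithm names are placeholders, and average linkage with Euclidean distance is an assumption rather than the paper's documented setting.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
algorithms = ["gradient_boosting", "random_forest", "logistic_regression", "linear_svm"]

# Synthetic scores: two families of algorithms that rise and fall together.
n_datasets = 30
difficulty = rng.uniform(0.6, 0.9, size=n_datasets)
tree_pattern = rng.normal(0, 0.05, size=n_datasets)
linear_pattern = rng.normal(0, 0.05, size=n_datasets)
scores = np.vstack([
    difficulty + tree_pattern + rng.normal(0, 0.005, size=n_datasets),
    difficulty + tree_pattern + rng.normal(0, 0.005, size=n_datasets),
    difficulty + linear_pattern + rng.normal(0, 0.005, size=n_datasets),
    difficulty + linear_pattern + rng.normal(0, 0.005, size=n_datasets),
])

# Agglomerative clustering: repeatedly merge the two closest clusters.
Z = linkage(scores, method="average", metric="euclidean")

# Cut the resulting tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
for name, label in zip(algorithms, labels):
    print(f"{name}: cluster {label}")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the kind of tree that a clustering figure typically shows.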

Hierarchical clustering of ML algorithms by accuracy rankings across datasets.

• We have empirically assessed 13 supervised classification algorithms on a set of 165 supervised
classification datasets in order to provide a contemporary set of recommendations to
bioinformaticians who wish to apply ML algorithms to their data.
• The analysis demonstrates the strength of state-of-the-art, tree-based ensemble algorithms, while
also showing the problem-dependent nature of ML algorithm performance.
• In addition, the analysis shows that selecting the right ML algorithm and thoroughly tuning its
parameters can lead to a significant improvement in predictive accuracy on most problems.
• Even with a large set of results, it is difficult to recommend specific algorithms or parameter
settings with a strong degree of generality.
• As a starting point, we provided recommendations for 5 different ML algorithms and parameters
based on their collective coverage of the 165 datasets from PMLB.
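The idea of choosing a small set of configurations by their collective coverage can be illustrated with a greedy selection. This is a sketch under assumptions, not the paper's procedure: a dataset is treated as covered when a configuration scores within a tolerance of the best score on that dataset, the 0.01 tolerance is a hypothetical value, and `greedy_coverage` and the configuration names are made up for this example.

```python
import numpy as np

def greedy_coverage(scores, names, k, tolerance=0.01):
    """Greedily pick up to k configurations that together cover the most datasets.

    scores has shape (n_configurations, n_datasets). A configuration covers a
    dataset if its score is within `tolerance` of the best score on that dataset.
    """
    best = scores.max(axis=0)            # best score per dataset
    covers = scores >= best - tolerance  # boolean coverage matrix
    covered = np.zeros(scores.shape[1], dtype=bool)
    chosen = []
    for _ in range(k):
        # Number of not-yet-covered datasets each configuration would add.
        gains = (covers & ~covered).sum(axis=1)
        pick = int(gains.argmax())
        if gains[pick] == 0:
            break  # nothing left to gain
        chosen.append(names[pick])
        covered |= covers[pick]
    return chosen, float(covered.mean())

# Made-up scores for three configurations on six datasets.
names = ["config_a", "config_b", "config_c"]
scores = np.array([
    [0.90, 0.92, 0.88, 0.60, 0.60, 0.60],
    [0.70, 0.70, 0.70, 0.91, 0.89, 0.60],
    [0.70, 0.70, 0.70, 0.70, 0.70, 0.93],
])

chosen, fraction = greedy_coverage(scores, names, k=2)
print(chosen, f"{fraction:.2f}")  # ['config_a', 'config_b'] 0.83
```

Greedy selection is a common heuristic for coverage problems. It does not guarantee the optimal subset, but it is simple and fast.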

