Data-driven advice for applying machine learning to bioinformatics problems
Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik and Jason H. Moore
Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Pac Symp Biocomput. Author manuscript; available in PMC 2018 April 09.
Junaid Ahmed
181706016
M.Sc. Bioinformatics
What is Artificial Intelligence?
“Intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and other
animals”
“Enabling computers to think”
“Statistical tools to learn from data”
“Multilayer neural network”
Unsupervised learning
Reinforcement learning
The decision tree algorithm belongs to the family of supervised learning algorithms.
It can be used for solving both regression and classification problems.
Classification and regression trees
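To illustrate that the same tree idea serves both task types, here is a minimal sketch using scikit-learn's bundled example datasets. The datasets, the `max_depth=3` setting, and the train/test split are assumptions chosen for brevity; they are not taken from the paper.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete label (malignant vs. benign tumour).
X_clf, y_clf = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.25, random_state=0, stratify=y_clf
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth is an arbitrary choice
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # fraction of correctly predicted labels

# Regression: predict a continuous value (a disease progression score).
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # coefficient of determination (R^2)

print(f"classification accuracy: {clf_accuracy:.3f}")
print(f"regression R^2: {reg_r2:.3f}")
```

The only difference between the two halves is the estimator class and the type of target; the fit/score workflow is the same.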
• Machine learning techniques such as deep learning enable automatic feature learning:
based on the dataset alone, the algorithm can learn how to combine multiple features of
the input data into a more abstract set of features from which to conduct further learning.
• Machine learning has been applied to six main subfields of bioinformatics:
genomics, proteomics, microarrays, systems biology, evolution, and text mining.
• Machine learning has also been used for multiple sequence alignment, which involves
aligning many DNA or amino acid sequences in order to determine regions of similarity that could
indicate a shared evolutionary history. It can also be used to detect and visualize genome
rearrangements.
• ML algorithms have been applied with great success in genome-wide association studies (GWAS),
for example to detect patterns of epistasis within the human genome.
• Deep learning algorithms have been used to detect cancer metastases in high-resolution
pathology images at levels comparable to human pathologists.
Goals
In this paper, we take a detailed look at 13 popular open source ML
algorithms and analyse their performance across a set of 165 supervised
classification problems in order to provide data-driven advice to
practitioners who wish to apply ML to their datasets.
The results highlight the importance of selecting the right ML algorithm
for each problem, which can improve prediction accuracy significantly on
some problems.
Finally, based on the results of the experiments, we provide a refined set
of recommendations for ML algorithms and parameters as a starting point
for future researchers.
• They compared 13 popular ML algorithms from scikit-learn, a widely used ML library
implemented in Python. Each algorithm and its hyperparameters are described in Table 1.
• The algorithms were compared on 165 supervised classification datasets from the Penn
Machine Learning Benchmark (PMLB).
• The algorithms include Naïve Bayes algorithms, common linear classifiers, tree-based
algorithms, distance-based classifiers, ensemble algorithms, and non-linear, kernel-based
strategies.
• For each algorithm, the hyperparameters were tuned using a fixed grid search with
10-fold cross-validation.
• The entire experimental design consisted of over 5.5 million ML algorithm and parameter
evaluations in total, resulting in a rich set of data that is analysed from several
viewpoints: algorithm performance, tuning and model selection, and algorithm coverage.
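A fixed grid search with 10-fold cross-validation can be sketched in scikit-learn roughly as follows. The dataset, the classifier, and the deliberately tiny parameter grid are assumptions for illustration only; the grids actually used in the study are the ones listed in its Table 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)  # stand-in for one benchmark dataset

# Hypothetical grid: every combination (2 x 2 = 4) is evaluated.
param_grid = {
    "n_estimators": [10, 50],
    "max_features": ["sqrt", "log2"],
}

# 10 stratified folds, so each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=cv,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV balanced accuracy:", round(search.best_score_, 3))
```

Repeating such a search for every algorithm and every dataset is what makes the number of evaluations grow into the millions.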
Algorithm performance results are analysed through several lenses:
• Performance of each algorithm across all datasets in
terms of best balanced accuracy. (1)
• Effect of tuning and model selection. (2)
• How the algorithms cluster across the tested problems, and which set of
algorithms maximizes performance across the datasets. (3)
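Balanced accuracy is the metric behind these comparisons. As a minimal sketch, assuming the usual definition (the mean of the per-class recall), it can be computed in plain Python; the toy labels below are made up to show why it differs from plain accuracy on imbalanced data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally however rare it is."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 negatives, 2 positives, and a model that always says "negative".
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-negative model looks good on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is the safer choice when class sizes vary widely across datasets.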
(1) Algorithm performance
We plot the mean rankings of the algorithms across all datasets.
In order to assess the statistical significance of the observed
differences in algorithm performance across all problems, we
use the non-parametric Friedman test.
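The ranking and the Friedman statistic can be sketched in plain Python as below. The score table is made up for illustration, and the statistic uses the textbook formula without a tie correction; a p-value would additionally require the chi-square distribution with k - 1 degrees of freedom (for example via `scipy.stats.friedmanchisquare`).

```python
def ranks_with_ties(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """score_table[d][a] is the score of algorithm a on dataset d."""
    n = len(score_table)     # number of datasets
    k = len(score_table[0])  # number of algorithms
    rank_rows = [ranks_with_ties(row) for row in score_table]
    mean_ranks = [sum(row[a] for row in rank_rows) / n for a in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Made-up balanced accuracies: 4 datasets (rows) x 3 algorithms (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.75, 0.80, 0.60],
    [0.95, 0.90, 0.85],
]
mean_ranks, chi2 = friedman_statistic(scores)
print(mean_ranks)  # [1.25, 1.75, 3.0]
print(chi2)        # 6.5
```

A lower mean rank means the algorithm tends to finish nearer the top across datasets; the Friedman statistic grows as the mean ranks move further apart.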
Our experimental results allow us to measure the extent to which
hyperparameter tuning via grid search improves each algorithm’s
performance compared to its baseline settings.
We also measure the effect that model selection has on improving classifier
performance.
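One way to measure the effect of tuning is to score the same classifier with its default settings and after a grid search, on identical folds. The sketch below assumes a scaled SVM, a small hypothetical grid that contains the defaults, and a bundled scikit-learn dataset; none of these choices come from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# A fixed random_state makes both evaluations use exactly the same 10 folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Baseline: the library's default hyperparameters.
baseline_model = make_pipeline(StandardScaler(), SVC())
baseline_score = cross_val_score(
    baseline_model, X, y, cv=cv, scoring="balanced_accuracy"
).mean()

# Tuned: hypothetical grid that includes the defaults (C=1.0, gamma="scale").
grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    grid,
    cv=cv,
    scoring="balanced_accuracy",
)
search.fit(X, y)
tuned_score = search.best_score_

improvement = tuned_score - baseline_score
print(f"baseline: {baseline_score:.3f}")
print(f"tuned:    {tuned_score:.3f}")
print(f"gain:     {improvement:.3f}")
```

Because the best score is selected on the same folds it is measured on, the gain is somewhat optimistic; it is best read as a relative comparison rather than an unbiased estimate.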
• We perform hierarchical agglomerative clustering on the 10-fold CV
balanced accuracy results, which leads to the clusters shown in the figure.
• We present a list of five recommended algorithms and parameter settings.
Hierarchical clustering of ML algorithms by accuracy rankings across datasets.
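Hierarchical agglomerative clustering of algorithms by their per-dataset results can be sketched with SciPy as follows. The score matrix is made up, the algorithm names are only labels, and the average-linkage/Euclidean settings are assumptions rather than the configuration used for the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

algorithms = ["GradientBoosting", "RandomForest", "LogisticRegression", "LinearSVC"]

# Rows: algorithms. Columns: datasets. Values: made-up balanced accuracies.
scores = np.array([
    [0.92, 0.81, 0.75, 0.88],
    [0.91, 0.80, 0.76, 0.87],
    [0.70, 0.85, 0.60, 0.72],
    [0.71, 0.84, 0.61, 0.70],
])

# Bottom-up merging: start with one cluster per algorithm and repeatedly
# join the two closest clusters (average linkage on Euclidean distance).
tree = linkage(scores, method="average", metric="euclidean")

# Cut the tree so that at most two clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")

for name, label in zip(algorithms, labels):
    print(f"{name}: cluster {label}")
```

Algorithms that behave alike across datasets end up in the same cluster; in this toy matrix the two tree ensembles group together and the two linear models group together.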
• We have empirically assessed 13 supervised classification algorithms on a set of 165 supervised
classification datasets in order to provide a contemporary set of recommendations to
bioinformaticians who wish to apply ML algorithms to their data.
• The analysis demonstrates the strength of state-of-the-art, tree-based ensemble algorithms, while
also showing the problem-dependent nature of ML algorithm performance.
• In addition, the analysis shows that selecting the right ML algorithm and thoroughly tuning its
parameters can lead to a significant improvement in predictive accuracy on most problems.
• Even with a large set of results, it is difficult to recommend specific algorithms or parameter
settings with a strong amount of generality.
• As a starting point, we provided recommendations for 5 different ML algorithms and parameters
based on their collective coverage of the 165 datasets from PMLB.
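Choosing a handful of configurations by their collective coverage can be approximated with a greedy set-cover heuristic. This is only a sketch of one plausible approach, not necessarily the procedure used in the paper: the configuration names, the dataset names, and the notion that a configuration "covers" a dataset when it scores close to the best result there are all assumptions.

```python
def greedy_cover(covers, n_pick):
    """Pick up to n_pick candidates, each time taking the one that adds the most new datasets.

    covers maps a candidate name to the set of datasets it performs well on.
    """
    chosen, covered = [], set()
    for _ in range(n_pick):
        best = max(
            (name for name in covers if name not in chosen),
            key=lambda name: len(covers[name] - covered),
            default=None,
        )
        # Stop early when nothing is left or no candidate adds a new dataset.
        if best is None or not covers[best] - covered:
            break
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered


# Hypothetical coverage sets for four algorithm/parameter configurations.
covers = {
    "config_A": {"d1", "d2", "d3", "d4"},
    "config_B": {"d3", "d4", "d5"},
    "config_C": {"d5", "d6"},
    "config_D": {"d1", "d6"},
}

chosen, covered = greedy_cover(covers, n_pick=2)
print(chosen)           # ['config_A', 'config_C']
print(sorted(covered))  # ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']
```

The second pick is the configuration that complements the first one best, not the second-best configuration overall, which is the point of judging a recommendation list by its joint coverage.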

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 

Machine learning to solve bioinformatics problems

  • 1. Data-Driven advice for applying machine learning to bioinformatics problems Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik and Jason H. Moore Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA Pac Symp Biocomput. Author manuscript; available in PMC 2018 April 09. Junaid Ahmed 181706016 M.Sc. Bioinformatics
  • 2. What is Artificial Intelligence? “Intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals”
  • 3.
  • 4. “Enabling the computers to think” “Statistical tools to learn from data” “Multilayer neural network”
  • 5.
  • 6.
  • 9.
  • 10. Decision Tree algorithm belongs to the family of supervised learning algorithms. Decision tree algorithm can be used for solving regression and classification problems. Classification and regression trees
  • 11.
  • 12.
  • 13. • Machine learning techniques such as deep learning enable automatic feature learning: based on the dataset alone, the algorithm can learn how to combine multiple features of the input data into a more abstract set of features from which to conduct further learning. • Machine learning has been applied to six main subfields of bioinformatics: genomics, proteomics, microarrays, systems biology, evolution, and text mining. • Machine learning has also been used for multiple sequence alignment, which involves aligning many DNA or amino acid sequences to determine regions of similarity that could indicate a shared evolutionary history. It can also be used to detect and visualize genome rearrangements. • ML algorithms have been applied with great success in GWAS (detecting patterns of epistasis within the human genome). • Deep learning algorithms have been used to detect cancer metastases on high-resolution pathology images at levels comparable to human pathologists.
  • 14. Goals In this paper, we take a detailed look at 13 popular open source ML algorithms and analyse their performance across a set of 165 supervised classification problems in order to provide data-driven advice to practitioners who wish to apply ML to their datasets. The results highlight the importance of selecting the right ML algorithm for each problem, which can improve prediction accuracy significantly on some problems. Finally, based on the results of the experiments, we provide a refined set of recommendations for ML algorithms and parameters as a starting point for future researchers.
  • 15. • They compared 13 popular ML algorithms from scikit-learn (a widely used ML library implemented in Python). Each algorithm and its hyperparameters are described in Table 1. • The algorithms were compared on 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB). • The algorithms include Naïve Bayes algorithms, common linear classifiers, tree-based algorithms, distance-based classifiers, ensemble algorithms, and non-linear, kernel-based strategies. • For each algorithm, the hyperparameters were tuned using a fixed grid search with 10-fold cross-validation. • The entire experimental design consisted of over 5.5 million ML algorithm and parameter evaluations in total, resulting in a rich set of data that is analysed from several viewpoints (algorithm performance, tuning and model selection, algorithm coverage).
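
The tuning procedure described on this slide (a fixed grid search scored by 10-fold cross-validation) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the paper's actual setup: the dataset, the choice of random forest, and the small parameter grid are assumptions made only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Example dataset bundled with scikit-learn (an assumption; the paper uses PMLB).
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical, deliberately small grid; the paper's grids are listed in its Table 1.
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", 0.5]}

# 10-fold stratified cross-validation, scored by balanced accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="balanced_accuracy",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Every combination in the grid is trained and evaluated ten times, which is why the number of evaluations grows so quickly across 13 algorithms and 165 datasets.
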
  • 16.
  • 17.
  • 18. Algorithm performance results through several lenses: • Performance of each algorithm across all datasets in terms of best balanced accuracy. (1) • Effect of tuning and model selection. (2) • How the algorithms cluster across the tested problems, and which algorithms maximize performance across the datasets. (3)
  • 19. (1) Algorithm performance: we plot the mean rankings of the algorithms across all datasets. To assess the statistical significance of the observed differences in algorithm performance across all problems, we use the non-parametric Friedman test.
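
As a rough sketch of how the Friedman test mentioned above can be run, the example below uses `scipy.stats.friedmanchisquare`. The accuracy values are made up for illustration (they are not results from the paper); each list holds one algorithm's balanced accuracy on the same six hypothetical datasets.

```python
from scipy.stats import friedmanchisquare

# Hypothetical balanced accuracies: one list per algorithm, one entry per dataset.
gradient_boosting = [0.91, 0.85, 0.78, 0.96, 0.88, 0.73]
random_forest = [0.90, 0.84, 0.76, 0.95, 0.87, 0.74]
naive_bayes = [0.80, 0.70, 0.69, 0.85, 0.75, 0.65]

# The test ranks the algorithms within each dataset and asks whether
# the rank sums differ more than chance alone would explain.
statistic, p_value = friedmanchisquare(gradient_boosting, random_forest, naive_bayes)

print(statistic, p_value)
```

A small p-value indicates that at least one algorithm ranks consistently differently from the others; it does not say which one, so a post-hoc comparison is normally needed afterwards.
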
  • 20.
  • 21.
  • 22. Our experimental results allow us to measure the extent to which hyperparameter tuning via grid search improves each algorithm’s performance compared to its baseline settings. We also measure the effect that model selection has on improving classifier performance.
  • 23.
  • 24.
  • 25. • We perform hierarchical agglomerative clustering on the 10-fold CV balanced accuracy results, which leads to the clusters shown in the figure. • We present a list of five recommended algorithms and parameter settings.
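
A minimal sketch of hierarchical agglomerative clustering on per-dataset accuracies, using SciPy. The score matrix is invented for illustration, and the linkage method and distance metric are assumptions rather than the settings used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

algorithms = [
    "GaussianNB", "BernoulliNB",
    "LogisticRegression", "PassiveAggressive",
    "RandomForest", "GradientBoosting",
]

# Hypothetical balanced accuracies: rows are algorithms, columns are datasets.
scores = np.array([
    [0.70, 0.62, 0.81, 0.66],
    [0.69, 0.63, 0.80, 0.65],
    [0.82, 0.75, 0.88, 0.79],
    [0.81, 0.74, 0.87, 0.78],
    [0.93, 0.90, 0.95, 0.91],
    [0.94, 0.91, 0.96, 0.92],
])

# Agglomerative clustering: repeatedly merge the two closest groups.
Z = linkage(scores, method="average", metric="euclidean")

# Cut the resulting tree so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

for name, label in zip(algorithms, labels):
    print(label, name)
```

With these made-up numbers the two Naïve Bayes variants, the two linear models, and the two tree ensembles each end up in their own cluster, mirroring the kind of grouping the slide describes.
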
  • 26. Hierarchical clustering of ML algorithms by accuracy rankings across datasets.
  • 27.
  • 28. • We have empirically assessed 13 supervised classification algorithms on a set of 165 supervised classification datasets in order to provide a contemporary set of recommendations to bioinformaticians who wish to apply ML algorithms to their data. • The analysis demonstrates the strength of state-of-the-art, tree-based ensemble algorithms, while also showing the problem-dependent nature of ML algorithm performance. • In addition, the analysis shows that selecting the right ML algorithm and thoroughly tuning its parameters can lead to a significant improvement in predictive accuracy on most problems. • Even with a large set of results, it is difficult to recommend specific algorithms or parameter settings with a strong amount of generality. • As a starting point, we provided recommendations for 5 different ML algorithms and parameters based on their collective coverage of the 165 datasets from PMLB.

Editor's Notes

  1. Today AI is a very popular subject, widely discussed in technology and business. Contrary to popular perception, AI is not limited to the IT or technology industry; it is used extensively in other areas such as medicine, business, education, law, and manufacturing. At present, medical AI mainly uses computer techniques to perform clinical diagnoses and suggest treatments. So why do we use AI in the medical field? Because huge amounts of data are being generated as a result of technological advances. AI is capable of analysing complex medical data and has the potential to exploit meaningful relationships within a dataset, which can be used in diagnosis, treatment, and outcome prediction in many clinical scenarios.
  2. Types of AI (this is a much broader classification of AI). Deep Blue was a chess-playing computer developed by IBM. It is known for being the first computer chess-playing system to win both a chess game and a chess match against a world champion under regular time controls. Google’s self-driving vehicle has more than 300,000 kilometres of driving experience in city traffic, on busy highways, and on mountainous roads, with only occasional human intervention. It has a laser range-finder mounted on the roof of the car, high-resolution maps of the world, and many other sensors. Types 3 and 4 are still far in the future; the examples are fictional. 3. Theory of mind refers to the ability to attribute mental states such as beliefs, desires, goals, and intentions to others, and to understand that these states are different from one's own. Examples: Chitti the robot; Sophia, a social humanoid robot that uses artificial intelligence to see people, understand conversation, and form relationships. The bot is not a true AI; it is just a fine work of art. 4. This is the most advanced level we can imagine. Examples: TARS in Interstellar; Agent Smith in The Matrix.
  3. Artificial intelligence is a way of making a computer, a computer-controlled robot, or software think intelligently, in a manner similar to how intelligent humans think. Machine learning, at its most basic, is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. Rather than hand-coding software routines with a specific set of instructions to accomplish a particular task, the machine is “trained” using large amounts of data and algorithms that give it the ability to learn how to perform the task. Machine learning came directly from early AI, and the algorithmic approaches over the years have included decision tree learning, inductive logic programming, clustering, reinforcement learning, and Bayesian networks, among others. Deep learning is a technique for implementing machine learning. It uses artificial neural networks, which are inspired by our understanding of the biology of the brain: all those interconnections between neurons. But unlike a biological brain, where any neuron can connect to any other neuron within a certain physical distance, artificial neural networks have discrete layers, connections, and directions of data propagation. If you are wondering what data science is: it is about making sense of data and visualising it, in order to extract knowledge and insights from data in various forms.
  4. So, apart from the fidget spinner, what we can infer from this picture is that machine learning can be classified into three types. I’ll start with supervised learning, as it is the easiest to understand. Supervised learning: most practical machine learning uses supervised learning. You have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (Y) for that data. Supervised learning problems can be further grouped into regression and classification problems. A regression problem is one where the output variable is a real or continuous value, such as “salary” or “weight”. A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”. In short, classification is the task of predicting a discrete class label, and regression is the task of predicting a continuous quantity. Unsupervised learning: the machine aims to find patterns within a dataset without explicit input from a human as to what these patterns might look like. Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. It is considered an unsupervised task because it aims to describe the hidden structure of the objects. Each object is described by a set of characteristics called features. Example algorithm: k-means. Dimensionality reduction: it is useful to apply dimensionality reduction to high-dimensional data. Its purpose is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.
Dimensionality reduction can be carried out in two ways: selecting from the existing features (feature selection), or extracting new features by combining the existing features (feature extraction). Reinforcement learning (RL) is an area of machine learning concerned with how software agents or machines ought to take actions in an environment so as to maximize some notion of cumulative reward: an agent learns how to behave in an environment by performing actions and seeing the results.
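
To make the clustering idea in this note concrete, here is a minimal k-means sketch in plain Python. The points and the starting centroids are invented for illustration, and the function is a simplified version of the algorithm (fixed iteration count, no random restarts), not a production implementation.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Simplified k-means: alternate between assigning points and moving centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D objects: each is described by two features.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centroids)
```

No labels are supplied anywhere: the grouping comes purely from the distances between the feature vectors, which is what makes the task unsupervised.
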
  5. Siri, Alexa, and Google Now are popular examples of virtual personal assistants. As the name suggests, they help find information when asked by voice. All you need to do is activate them and ask “What is my schedule for today?” or a similar question. To answer, the assistant looks up the information, recalls your related queries, or sends a command to other resources (such as phone apps) to collect information. You can also instruct assistants to perform certain tasks, such as “Set an alarm for 6 AM next morning”. Machine learning is an important part of these personal assistants, as they collect and refine information on the basis of your previous interactions with them. This data is later used to render results tailored to your preferences. Turning sounds into bits: the first step in speech recognition is obvious: we need to feed sound waves into a computer. Sound waves are one-dimensional; at every moment in time they have a single value based on the height of the wave. To turn a sound wave into numbers, we record the height of the wave at equally spaced points. This is called sampling.
  6. Unsupervised learning is where you have only input data (X) and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it. It is called unsupervised learning because, unlike supervised learning, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.
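
The sampling step described in this note can be sketched as follows. A pure sine tone stands in for a recorded sound wave, and the frequency, sample rate, and duration are arbitrary values chosen for the example.

```python
import math


def sample_wave(frequency_hz, sample_rate_hz, duration_s):
    """Record the height of a sine wave at equally spaced points in time."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * i / sample_rate_hz)
        for i in range(n_samples)
    ]


# Hypothetical settings: a 440 Hz tone sampled 16,000 times per second for 10 ms.
samples = sample_wave(frequency_hz=440, sample_rate_hz=16000, duration_s=0.01)
print(len(samples), samples[:3])
```

The resulting list of numbers is the one-dimensional signal that a speech recognition system would receive as input.
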
  7. Using this algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually by trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate decisions. Example of reinforcement learning: a Markov decision process. Imagine an agent learning to play Super Mario Bros. The reinforcement learning (RL) process can be modelled as a loop: (1) the agent receives state S0 from the environment (here, the first frame of the game); (2) based on state S0, the agent takes an action A0 (the agent moves right); (3) the environment transitions to a new state S1 (a new frame); (4) the environment gives a reward R1 to the agent (not dead: +1). This loop outputs a sequence of states, actions, and rewards. The goal of the agent is to maximize the expected cumulative reward. There are three approaches to RL: value-based, policy-based, and model-based. The value function tells us the maximum expected future reward the agent will get at each state.
  8. ML algorithms are those that can learn from data and improve with experience, without human intervention. Learning tasks may include learning the function that maps the input to the output; learning the hidden structure in unlabelled data; or ‘instance-based learning’, where a class label is produced for a new instance (row) by comparing it to instances from the training data that were stored in memory. Instance-based learning does not create an abstraction from specific instances. SVR: support vector regression, the regression counterpart of the support vector machine (SVM). The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
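
The state, action, reward loop in this note can be sketched with a toy environment. Everything here is hypothetical: `CorridorEnv` is a made-up stand-in for the game, the policy always moves right, and the discount factor `gamma` is an assumption added to show one common way of summing future rewards.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk from position 0 to the goal on the right."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (move right) or -1 (move left); positions stay within bounds.
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        reward = 1.0  # "not dead: +1" for every step taken
        return self.position, reward, done


def run_episode(env, policy, gamma=0.9):
    """Run the agent-environment loop until the episode ends."""
    state = env.reset()
    trajectory = []
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    total = sum(r for _, _, r in trajectory)
    discounted = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    return trajectory, total, discounted


trajectory, total, discounted = run_episode(CorridorEnv(), policy=lambda state: +1)
print(trajectory)
print(total, discounted)
```

A learning agent would adjust its policy between episodes to increase this cumulative reward; the fixed policy here only illustrates the loop itself.
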
  9. Machine learning algorithm: decision trees. Classification tree: a decision tree, learned from data, whose target variable is a class label. Regression tree: a decision tree whose target variable can take continuous values. Decision trees are an important type of algorithm for predictive modelling. The tree shown is a binary tree (each node has at most two children). Everything starts at the root. Internal nodes: the tree splits into branches based on them. The end of a branch that does not split any further is a decision, or leaf. Each internal node represents a single input variable (X); leaf nodes hold the output variable (Y).
  10. Neural networks are a set of algorithms, modelled loosely on the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data (images, sound, text) must be translated. Neural networks help us cluster and classify. You can think of them as a clustering and classification layer on top of the data you store and manage. They help to group unlabelled data according to similarities among the example inputs, and they classify data when they have a labelled dataset to train on. In the example above, the number of nodes in the input layer is determined by the dimensionality of the data, and the number of nodes in the output layer by the number of classes. The example is a classification task. All classification tasks depend on labelled datasets; that is, humans must transfer their knowledge to the dataset in order for a neural network to learn the correlation between labels and data. This is known as supervised learning.
  11. So what is bioinformatics? There are many definitions; a simple one is: bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It combines biology, computer science, information engineering, mathematics, and statistics to analyse and interpret biological data. Bioinformatics also deals with the collection, classification, storage, and analysis of biological information using computers, especially as applied to molecular genetics and genomics.
  12. Before machine learning, bioinformatics algorithms had to be explicitly programmed by hand, which proves extremely difficult for problems such as protein structure prediction. These are some of the applications of ML algorithms in bioinformatics. 1. Genome-wide association study (GWAS), also called whole genome association study: an observational study of a genome-wide set of genetic variants in different individuals to see whether any variant is associated with a trait. When applied to human data, GWA studies compare the DNA of participants with varying phenotypes for a particular trait or disease. The participants may be people with a disease (cases) and similar people without it (controls), or people with different phenotypes for a particular trait, for example blood pressure. Epistasis: the interaction of genes that are not alleles, in particular the suppression of the effect of one such gene by another. 2. Metastases: the development of secondary malignant growths at a distance from a primary site of cancer.
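
The root, internal node, and leaf structure described in this note can be sketched as a nested dictionary with a small prediction function. The features, thresholds, and labels are invented for illustration; a real tree would be learned from data rather than written by hand.

```python
# Hypothetical hand-built binary classification tree.
# Internal nodes test one input variable; leaves hold the predicted class.
tree = {
    "feature": "age",
    "threshold": 50,
    "left": {"leaf": "no disease"},
    "right": {
        "feature": "blood_pressure",
        "threshold": 140,
        "left": {"leaf": "no disease"},
        "right": {"leaf": "disease"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when the value is <= the threshold."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


print(predict(tree, {"age": 40, "blood_pressure": 150}))
print(predict(tree, {"age": 60, "blood_pressure": 150}))
```

Replacing the class labels in the leaves with numbers would turn the same structure into a regression tree.
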
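
As a small illustration of how layer sizes follow from the data, the sketch below runs one forward pass through a tiny, untrained network in plain Python. The sizes (4 input features, 5 hidden nodes, 3 classes), the random weights, and the ReLU and softmax choices are all assumptions made for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output node plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into class probabilities that sum to 1."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
n_features, n_hidden, n_classes = 4, 5, 3  # input size = data dimensionality, output size = classes

# Randomly initialised (untrained) weights, for illustration only.
w1 = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_classes)]
b2 = [0.0] * n_classes

sample = [5.1, 3.5, 1.4, 0.2]  # one hypothetical object with four features
probabilities = softmax(dense(relu(dense(sample, w1, b1)), w2, b2))
print(probabilities)
```

Training would adjust `w1`, `b1`, `w2`, and `b2` using labelled examples so that the highest probability lands on the correct class.
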
  13. Hyperparameters: the settings that govern the training process itself. PMLB is a collection of publicly available classification problems that have been standardized to the same format and collected in a central location with easy access via Python. PMLB includes many biomedical classification problems, including tasks such as disease diagnosis, post-operative decision making, and exon boundary identification in DNA, among others. A sample of the biomedical classification tasks contained in PMLB is listed in Table 2. 10-fold cross-validation: normally in a machine learning process, data is divided into training and test sets; the training set is used to train the model and the test set to evaluate its performance. However, this approach may lead to variance problems: the accuracy obtained on one test set can be very different from the accuracy obtained on another test set using the same algorithm. The solution is to use K-fold cross-validation for performance evaluation, where K is the chosen number of folds. The process is straightforward. You divide the data into K folds. K-1 folds are used for training while the remaining fold is used for testing. The algorithm is trained and tested K times, each time with a different fold as the test set and the remaining folds as the training set. The result of K-fold cross-validation is the average of the results obtained on each fold. Suppose we want to perform 5-fold cross-validation. The data is divided into 5 sets, say SET A, SET B, SET C, SET D, and SET E. The algorithm is trained and tested 5 times. In the first fold, SET A to SET D are used as the training set and SET E as the test set.
  14. Hyperparameter tuning (optimisation): in machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. The same kind of machine learning model can require different constraints, weights, or learning rates to generalize to different data patterns. Hyperparameter tuning works by running multiple trials in a single training job.
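
The K-fold procedure described in this note can be sketched in plain Python. The helper name `k_fold_indices` is hypothetical, and the folds are taken in order without shuffling, which is a simplification (libraries such as scikit-learn can shuffle and stratify).

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)

# The reported score is the mean of the per-fold results (hypothetical numbers here).
fold_scores = [0.80, 0.85, 0.90, 0.75, 0.95]
mean_score = sum(fold_scores) / len(fold_scores)
print(mean_score)
```

Because every sample is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single train/test division does.
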
  16. The mean ranking is determined by the 10-fold CV balanced accuracy of each algorithm on a given dataset, with a lower ranking indicating higher accuracy. The Friedman test is a non-parametric statistical test. Similar to the parametric repeated-measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row together, then considering the values of ranks by columns.
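
The mean ranking described in this note can be computed as in the sketch below. The accuracy values are hypothetical, and ties are not handled specially (an assumption that keeps the example short).

```python
# Hypothetical balanced accuracies: one entry per dataset for each algorithm.
scores = {
    "GradientBoosting": [0.91, 0.85, 0.73],
    "RandomForest": [0.90, 0.84, 0.74],
    "GaussianNB": [0.80, 0.70, 0.65],
}


def mean_ranks(scores):
    """Rank algorithms within each dataset (1 = most accurate), then average the ranks."""
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0 for name in names}
    for d in range(n_datasets):
        ordered = sorted(names, key=lambda name: scores[name][d], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / n_datasets for name in names}


ranking = mean_ranks(scores)
print(ranking)
```

Ranking within each dataset first, and only then averaging, keeps datasets that are easy or hard for every algorithm from dominating the comparison.
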
  17. The rankings show the strength of ensemble-based tree algorithms in generating accurate models: The first, second, and fourth-ranked algorithms belong to this class of algorithms. The three worst-ranked algorithms also belong to the same class of Naïve Bayes algorithms.
  19. 1. Hyperparameter tuning: the traditional way of performing hyperparameter optimization has been grid search, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. 2. Model: the model is the manifestation of a guess (hypothesis) about the data, used to test that hypothesis. Classifier: a classifier is a special case of a hypothesis (nowadays often learned by a machine learning algorithm). It is a discrete-valued function used to assign (categorical) class labels to particular data points. In an email classification example, the classifier could be a hypothesis for labelling emails as spam or non-spam. Yet a hypothesis is not necessarily synonymous with a classifier. In a different application, our hypothesis could be a function mapping the study time and educational backgrounds of students to their future continuous-valued SAT scores: a continuous target variable, suited for regression analysis.
  22. We find that algorithms with similar underlying assumptions or methodologies cluster in terms of their performance across the datasets. For example, the Naïve Bayes algorithms (i.e., Multinomial, Gaussian, and Bernoulli) perform most similarly to each other, and the linear algorithms (i.e., passive aggressive and logistic regression) also cluster.
  23. The five algorithms and parameters here are those that maximize the coverage of the 165 benchmark datasets, meaning that they perform within 1% of the best 10-fold CV balanced accuracy obtained on the maximum number of datasets in the experiment. For the datasets in PMLB, these five algorithms and associated parameters cover 106 out of 165 datasets to within 1% balanced accuracy. Notably, 163 out of 165 datasets can be covered by tuning the parameters of the five listed algorithms. Based on the available evidence, these recommended algorithms should be a good starting point for achieving reasonable predictive accuracy on a new dataset.
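
The notion of coverage used in this note can be sketched as follows. The configuration names and accuracies are hypothetical, and "within 1%" is interpreted here as an absolute difference of 0.01 in balanced accuracy, which is an assumption about the exact definition.

```python
# Hypothetical balanced accuracies: one entry per dataset for each configuration.
accuracies = {
    "config_A": [0.90, 0.70, 0.800, 0.60],
    "config_B": [0.85, 0.75, 0.795, 0.50],
    "config_C": [0.60, 0.60, 0.600, 0.70],
}


def coverage(accuracies, chosen, tolerance=0.01):
    """Count datasets where some chosen configuration is within `tolerance` of the best."""
    n_datasets = len(next(iter(accuracies.values())))
    best = [max(acc[d] for acc in accuracies.values()) for d in range(n_datasets)]
    return sum(
        1
        for d in range(n_datasets)
        if any(accuracies[name][d] >= best[d] - tolerance for name in chosen)
    )


print(coverage(accuracies, ["config_A"]))
print(coverage(accuracies, ["config_A", "config_B"]))
print(coverage(accuracies, ["config_A", "config_B", "config_C"]))
```

Choosing a small set of configurations that maximizes this count is what produces a shortlist like the five recommended algorithms: each one covers datasets that the others handle poorly.
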