Lecture 3 - Exploratory Data Analytics (EDA), a lecture in subject module Statistical & Machine Learning

DA 5230 – Statistical & Machine Learning
Lecture 3 – Exploratory Data Analytics
Maninda Edirisooriya
manindaw@uom.lk

What is Exploratory Data Analysis (EDA)?
• When an analyst or data scientist is given a dataset, they first have to do
some initial analysis to start the data analysis process
• This involves very basic data filtering, processing and visualization
• The results of this analysis give the intuition needed for further analysis
• Depending on the results, this process can be iterative, driven by newly
discovered patterns and knowledge
• This process is known as Exploratory Data Analysis (EDA)

Why Exploratory Data Analysis?
• To understand the data: to know metadata such as its size, types and structure
• To identify patterns in the data: to identify visible trends and
relationships among data
• To detect anomalies and outliers
• To clean the data: to de-duplicate, and to remove or fill missing and
inconsistent values
• For feature engineering: to discover combinations of features and
create new features to improve model performance
• Visualization and communication
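
A few of the goals above can be sketched in Pandas as follows. This is only a minimal illustration: the toy dataset, its column names, the median fill and the 1.5 × IQR outlier rule are all assumptions chosen for the example, not something prescribed by this lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, 32, 41, np.nan, 29],
    "income": [30000, 42000, 42000, 51000, 38000, 900000],
    "city": ["Colombo", "Kandy", "Kandy", "Galle", "Colombo", "Kandy"],
})

# Understand the data: size and column types
print(df.shape)
print(df.dtypes)

# Clean the data: drop exact duplicate rows, then fill missing ages
# with the median age (one possible choice among many)
clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Detect outliers in income using the 1.5 * IQR rule (an assumed convention)
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]
print(outliers)
```

In practice, whether to drop, fill or keep such rows depends on the dataset and on what the later analysis needs.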

When to do Exploratory Data Analysis?
• If the given dataset fits easily in memory, Pandas and NumPy
(Python libraries) are used
• If the dataset does not fit in memory but fits in secondary storage,
SQL can be used
• If the dataset is so large that it cannot be stored on a single machine,
big data analytics has to be used
• The scope of this lesson covers only the first scenario
• In the other cases, data samples taken from the larger dataset can also be
used to some extent
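
One way to take a sample from a dataset that is too large to load at once is to read it in chunks and keep a random fraction of each chunk. The sketch below assumes the data is a CSV file; an in-memory string stands in for the file, and the chunk size and sampling fraction are arbitrary illustrative values.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical content).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# Read the data 100 rows at a time and keep a 10% random sample of each
# chunk, so the full dataset never has to be in memory at once.
samples = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    samples.append(chunk.sample(frac=0.1, random_state=0))

sample_df = pd.concat(samples, ignore_index=True)
print(sample_df.shape)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.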

Structure of Data
• Structured Data: tabular data or data with feature labels
• E.g.: RDBMS data table
• Unstructured Data: data without feature labels
• E.g.: Image pixel data, video data
• Semi-structured Data: has a certain structure but is not tabular
• E.g.: XML, JSON
• In this subject module we mainly focus on Structured Data
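
Semi-structured data can often be flattened into a structured table before analysis. The following is a small sketch using `pd.json_normalize`; the records and field names are hypothetical.

```python
import pandas as pd

# Hypothetical JSON-like records with a nested "address" object.
records = [
    {"name": "A", "address": {"city": "Colombo", "zip": "00100"}},
    {"name": "B", "address": {"city": "Kandy"}},  # "zip" is missing here
]

# Flatten nested fields into columns such as "address.city"
flat = pd.json_normalize(records)
print(flat)
```

Fields that are absent from a record show up as missing values in the resulting table.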

NumPy
• A Python library whose name stands for “Numerical Python”
• Used to efficiently store and process numerical data
• Numerical data is stored in memory-efficient multi-dimensional arrays (tensors)
• Has efficient array processing capabilities accelerated with hardware
support
• Has a rich API for processing data
• Broadcasting operations that replace many explicit loops
• Interoperable with other languages like C, C++ and Fortran
• Used by most other data-related Python libraries, such as Pandas
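
As an illustration of how broadcasting replaces a loop, the sketch below subtracts each column's mean from every row of a small array. The array values are made up for the example.

```python
import numpy as np

# Hypothetical 3 x 2 array: 3 observations, 2 features
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# Column means have shape (2,)
col_means = data.mean(axis=0)

# Broadcasting: the (2,) vector is applied to each of the 3 rows
centered = data - col_means

# Equivalent explicit loop, shown only for comparison
looped = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, j] = data[i, j] - col_means[j]

print(centered)
```

Both versions give the same result, but the broadcast version is shorter and runs inside NumPy's compiled code.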

Pandas
• A high-performance Python library for data processing
• Works closely with NumPy arrays (tensors)
• Supports typed, rich, tabular, structured data on top of NumPy using
DataFrames
• Rich APIs for:
• Loading data from data sources such as files and storing it back to them
• Transforming data in memory, e.g. sorting, merging, pivoting and aggregation
• Data selection, slicing, filtering and indexing
• Data cleaning, e.g. de-duplication, filling missing values and outlier removal
• Integrates well with visualization libraries like Matplotlib and Seaborn
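
Several of the APIs listed above can be sketched in a few lines. The CSV content, column names and the choice of filling missing marks with the mean are assumptions made for this example only.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file path could be passed to read_csv instead.
csv_text = """student,module,marks
Amal,ML,72
Nimal,ML,58
Amal,Stats,65
Kamala,ML,81
Nimal,Stats,
"""

# Loading
df = pd.read_csv(io.StringIO(csv_text))

# Selection and filtering, then sorting
ml = df[df["module"] == "ML"]
top = ml.sort_values("marks", ascending=False)

# Cleaning: fill the missing mark with the overall mean (one possible choice)
df["marks"] = df["marks"].fillna(df["marks"].mean())

# Aggregation: average marks per module
avg = df.groupby("module")["marks"].mean()

# Pivoting: one row per student, one column per module
wide = df.pivot(index="student", columns="module", values="marks")

print(top)
print(avg)
print(wide)
```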

Data Types
Source: https://medium.com/@simranjeetsingh1497/the-ultimate-guide-to-machine-learning-from-eda-to-model-deployment-part-2-e56ac58785f8

Data Types
• Pandas can represent most of the naturally occurring data types
• Including the types mentioned earlier
• Therefore, data can be very easily loaded into Pandas DataFrames
• Data is stored in a column-major manner
• Each column of data is stored as a contiguous block in memory, and values
within a column are stored consecutively
• Sometimes categorical data has to be encoded as continuous data
• E.g.: date-time as a numeric time value
• Sometimes continuous data has to be encoded as categorical data
• E.g.: income as income levels
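
Both kinds of re-encoding can be sketched with Pandas. The column names, the bin edges for the income levels and the choice of "days since the earliest date" as the numeric time value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "income": [15000, 48000, 120000, 75000],
    "joined": pd.to_datetime(
        ["2021-01-15", "2022-06-01", "2023-03-20", "2020-11-05"]
    ),
})

# Continuous -> categorical: bin income into levels (bin edges are assumed)
df["income_level"] = pd.cut(
    df["income"],
    bins=[0, 30000, 80000, float("inf")],
    labels=["low", "medium", "high"],
)

# Date-time -> numeric: days elapsed since the earliest date in the column
df["days_since_first"] = (df["joined"] - df["joined"].min()).dt.days

print(df.dtypes)
print(df)
```

The printed `dtypes` show that each column carries its own type, for example `category` for the income levels and `datetime64[ns]` for the dates.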

One Hour Homework
• Officially, we have one more hour of work after the end of the lectures
• When it comes to ML, self-learning is very important
• Therefore, for this week’s extra hour you have homework
• After today’s Pandas tutorial, identify all the sections in it that you found difficult
• Try to complete them yourself, and refer to the Internet when needed
• Ask ChatGPT about the more difficult questions
• Play with Pandas and EDA using your own code and get familiar with them
• We need you to be comfortable with EDA before learning the ML and SL topics ahead
• Good Luck!
