Exploratory Data Analytics (EDA) is a discipline concerned with data pre-processing, manual data summarization and visualization, and it forms an early phase of data processing. This was one of the lectures in a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
Lecture 3 - Exploratory Data Analytics (EDA), a lecture in the subject module Statistical & Machine Learning
1. DA 5230 – Statistical & Machine Learning
Lecture 3 – Exploratory Data Analytics
Maninda Edirisooriya
manindaw@uom.lk
2. What is Exploratory Data Analysis (EDA)?
• When an analyst or data scientist is given a dataset, they have to do some
initial analysis to start the data analysis process
• This involves very basic data filtering, processing and visualization
• The results of this analysis give the analyst intuition for further analysis
• Depending on the results, this process can be iterative with newly
discovered patterns and knowledge
• This process is known as Exploratory Data Analytics (EDA)
3. Why Exploratory Data Analysis?
• To understand data: to learn its metadata, such as size, types and structure
• To identify patterns in data: to identify visible trends and
relationships among data
• To detect anomalies and outliers
• To clean data: to de-duplicate, remove/fill missing values and
inconsistent values
• For Feature Engineering: to discover combinations of features and
create new features to improve performance
• Visualization and communication
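The goals above can be sketched in a few lines of Pandas. This is only a minimal illustration: the toy dataset, its column names and the 1.5 * IQR outlier rule are assumptions chosen for the example, not part of the lecture material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 32, np.nan, 51, 29],
    "income": [30000, 42000, 58000, 42000, 39000, 950000, 35000],
    "city": ["Colombo", "Kandy", "Galle", "Kandy", "Colombo", "Galle", None],
})

# Understand the data: size, column types and summary statistics.
n_rows, n_cols = df.shape
column_types = df.dtypes
summary = df.describe()

# Clean the data: count and drop duplicate rows, then fill missing values.
n_duplicates = int(df.duplicated().sum())
clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("Unknown")

# Detect outliers: flag incomes more than 1.5 * IQR beyond the quartiles.
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

# Identify relationships: correlation between the numeric columns.
correlation = clean[["age", "income"]].corr()

# Feature engineering: a new feature that combines two existing ones.
clean["income_per_year_of_age"] = clean["income"] / clean["age"]
```

Printing `summary`, `outliers` and `correlation` in a notebook is usually the first step before any plotting.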
4. When to do Exploratory Data Analysis?
• If the given dataset fits easily in memory, Pandas and Numpy
(Python libraries) are used
• If the dataset does not fit in memory but fits in secondary storage,
SQL can be used
• If the dataset is so large that it cannot be stored on a single machine,
big data analytics has to be used
• The scope of this lesson covers only the first scenario
• In the other cases, samples taken from the larger dataset can also be
used to some extent
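As a rough sketch of the first scenario and of sampling, the snippet below estimates the memory footprint of a DataFrame and draws a small random sample from it. The generated data is a hypothetical stand-in for a larger dataset, and the 1% sampling fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a larger dataset; in practice it would come from a file or database.
rng = np.random.default_rng(seed=0)
big = pd.DataFrame({
    "id": np.arange(100_000),
    "value": rng.normal(size=100_000),
})

# Estimate how much memory the DataFrame occupies, in megabytes.
size_mb = big.memory_usage(deep=True).sum() / 1e6

# Draw a reproducible 1% random sample for quick exploration.
sample = big.sample(frac=0.01, random_state=0)
```

Statistics computed on `sample` only approximate those of the full dataset, so rare values and outliers may be missed.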
5. Structure of Data
• Structured Data: tabular data or data with feature labels
• E.g.: RDBMS data table
• Unstructured Data: data without feature labels
• E.g.: Image pixel data, video data
• Semi-structured Data: has a certain structure but not tabular
• E.g.: XML, JSON
• In this subject module we mainly focus on Structured Data
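Semi-structured data can often be flattened into structured, tabular form before analysis. The following is a minimal sketch using `pd.json_normalize`; the records and their field names are hypothetical.

```python
import pandas as pd

# Hypothetical semi-structured records, e.g. parsed from a JSON file.
records = [
    {"name": "A", "address": {"city": "Colombo", "zip": "00100"}},
    {"name": "B", "address": {"city": "Kandy"}},
]

# Flatten nested fields into columns; fields missing from a record become NaN.
table = pd.json_normalize(records)
```

The nested keys appear as dotted column names such as `address.city`, which can then be handled like any other DataFrame column.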
6. Numpy
• A Python library whose name is short for “Numerical Python”
• Used to efficiently store and process numerical data
• Numerical data is stored in memory-efficient arrays (tensors)
• Has efficient array processing capabilities accelerated with hardware
support
• Has a rich API for processing data
• Broadcasting operations that replace many explicit loops
• Interoperable with other languages like C, C++ and Fortran
• Used by most other data-related Python libraries like Pandas
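To illustrate how broadcasting replaces loops, the sketch below centres each column of a small array around its mean, first with broadcasting and then with the equivalent explicit loop. The array values are made up for the example.

```python
import numpy as np

# Hypothetical measurements: 3 samples (rows) x 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all 3 rows.
col_means = data.mean(axis=0)
centered = data - col_means

# The equivalent explicit loop that broadcasting replaces.
centered_loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        centered_loop[i, j] = data[i, j] - col_means[j]
```

Both versions give the same result, but the broadcast form is shorter and runs in compiled code rather than in the Python interpreter.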
7. Pandas
• A high performance Python library for data processing
• Works closely with Numpy arrays (tensors)
• Supports typed, rich, tabular, structured data over Numpy using
DataFrames
• Rich APIs for:
• Loading data from data sources like files and storing it back to them
• Transforming data in-memory like sorting, merging, pivoting and aggregation
• Data selection, slicing, filtering and indexing
• Data cleaning like de-duplication, filling missing values and outlier removal
• Integrating well with visualization libraries like Matplotlib and Seaborn
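The API areas listed above can be sketched on a tiny example. Everything here is hypothetical: the CSV content, the customer names and the column names are invented, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """order_id,customer,product,amount
1,Nimal,Tea,500
2,Kamala,Rice,
3,Nimal,Rice,1200
3,Nimal,Rice,1200
4,Kamala,Tea,300
"""
orders = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop the duplicated row and fill the missing amount with 0.
orders = orders.drop_duplicates().fillna({"amount": 0})

# Selection and filtering: rows with amount above 400, two columns only.
large = orders.loc[orders["amount"] > 400, ["customer", "amount"]]

# Merging: attach a city to each order from a second table.
customers = pd.DataFrame({"customer": ["Nimal", "Kamala"], "city": ["Galle", "Kandy"]})
merged = orders.merge(customers, on="customer", how="left")

# Aggregation and sorting: total amount per customer, largest first.
totals = merged.groupby("customer")["amount"].sum().sort_values(ascending=False)

# Pivoting: customers as rows, products as columns.
pivot = merged.pivot_table(index="customer", columns="product", values="amount", aggfunc="sum")

# Storing back: write the result as CSV text.
out_text = merged.to_csv(index=False)
```

Filling a missing amount with 0 is only one possible choice; depending on the analysis, dropping the row or using the median may be more appropriate.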
9. Data Types
• Pandas can represent most of the naturally occurring data types
• Types mentioned earlier
• Therefore, data can be very easily loaded into Pandas DataFrames
• Data is stored in a column-major manner
• Each column of data is stored as a contiguous block in memory, and values
within a column are stored consecutively
• Sometimes categorical data has to be encoded as continuous data
• E.g.: date-time as time
• Sometimes continuous data has to be encoded as categorical data
• E.g.: income as income levels
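The two encoding directions can be sketched as follows. The records, the bin edges and the level names are assumptions made for the example; here the date-time is encoded as a number of days since the earliest date, which is one of several possible numeric encodings.

```python
import pandas as pd

# Hypothetical records; the column names and bin edges are assumptions for illustration.
df = pd.DataFrame({
    "joined": ["2023-07-01", "2023-09-15", "2023-12-31"],
    "income": [25000, 80000, 250000],
})

# Parse the date strings, then encode them as days elapsed since the earliest date.
df["joined"] = pd.to_datetime(df["joined"])
df["days_since_first"] = (df["joined"] - df["joined"].min()).dt.days

# Bin the continuous income into ordered categorical income levels.
df["income_level"] = pd.cut(
    df["income"],
    bins=[0, 50000, 100000, float("inf")],
    labels=["low", "medium", "high"],
)
```

Inspecting `df.dtypes` afterwards shows a `datetime64` column, an integer column and a `category` column side by side in the same DataFrame.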
10. One Hour Homework
• Officially, we have one more hour to complete after the end of the lectures
• When it comes to ML, self-learning is very important
• Therefore, for this week’s extra hour you have homework
• After today’s Pandas tutorial, identify all the sections you found difficult
• Try to complete it yourself and refer to the Internet when needed
• Ask ChatGPT about even the difficult questions
• Play with Pandas and EDA using your own code and get familiar with them
• We need you to be comfortable with EDA for learning ML and SL ahead
• Good Luck!