Analysis on Implementation of different
CNN Architectures on FPGAs
UNDERGRADUATE THESIS
Submitted in partial fulfillment of the requirements
of BITS F421T Thesis
By
PRAYAG MOHANTY
ID No. 2020A3PS0566G
Under the supervision of:
Dr. AMALIN PRINCE A.
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, GOA CAMPUS
December 2023
Declaration of Authorship
I, Prayag Mohanty, declare that this Undergraduate Thesis titled, ‘Analysis on
implementation of different CNN Architectures on FPGA’ and the work presented in it
are my own. This was undertaken in the First Semester of 2023-24. I confirm that:
● This research was primarily conducted while I was a candidate for a research
degree at this University.
● Any portions of this thesis previously submitted for a degree or qualification at
this or another institution are explicitly identified.
● I consistently and clearly credit any consulted published works of others.
● All quotations are attributed to their original sources. With the exception of such
quotations, the content of this thesis is entirely my own original work.
● I have expressed my gratitude for all significant sources of assistance.
● If the thesis draws on work I conducted collaboratively with others, I have clearly
outlined each individual's contribution, including my own.
Signed:
Date: 12 / 12 / 23
Certificate
This is to certify that the thesis entitled, “Analysis on implementation of different CNN
Architectures on FPGA” and submitted by Prayag Mohanty ID No. 2020A3PS0566G in
partial fulfillment of the requirements of BITS F421T Thesis embodies the work done by
him under my supervision.
_____________________________
Supervisor
Dr. Amalin Prince A.
Professor, Dept. of EEE
BITS-Pilani K.K.Birla Goa Campus
Date: 12 / 12 / 23
“Knowledge is a tool, best shared. So is my thesis :) ”
-Prayag Mohanty
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, K.K.BIRLA GOA
CAMPUS
Abstract
Bachelor of Engineering (Hons.)
Analysis on implementation of different CNN Architectures on FPGA
by Prayag Mohanty
Convolutional Neural Networks (CNNs) are a special type of neural network that is
exceptionally well suited to grid-structured data such as images and signals. The usage of
Field-Programmable Gate Arrays (FPGAs) in high-performance computing has garnered
significant attention with the advent of Artificial Intelligence. This thesis investigates
the performance and resource utilization of various convolutional neural network (CNN)
models for implementation on Field-Programmable Gate Arrays (FPGAs). The primary
objective is to identify optimal CNN models for FPGA deployment based on their
performance, resource utilization, and other relevant parameters. Two prominent CNN
models, AlexNet and MobileNet, were chosen for analysis, and both were implemented on
an FPGA platform. Resource utilization metrics, including logic slices, memory blocks,
and DSP slices, were monitored to assess the hardware requirements of each model. The
evaluation results demonstrate that MobileNet exhibits significantly lower resource
utilization than AlexNet while maintaining a commendable level of performance. This
suggests that MobileNet is a more efficient option for deploying CNN models on FPGAs
with limited hardware resources. AlexNet, on the other hand, offers superior performance
but at the expense of higher resource consumption, making it a suitable choice for
applications where performance is paramount and resources are less restricted. This
analysis provides valuable insights into the suitability of different CNN models for FPGA
implementation based on their performance and resource utilization characteristics.
Keywords: Convolutional Neural Networks, FPGA, Performance, Resource Utilization,
AlexNet, MobileNet
Acknowledgements
The journey of completing this thesis has been a rewarding but challenging one, and I
would like to express my heartfelt gratitude to those who have supported me throughout
the process. First and foremost, I want to thank my family for their unwavering love and
support. Their constant encouragement and belief in me have been instrumental in
helping me overcome obstacles and persevere through difficulties. I am especially grateful
for the sacrifices they made to enable me to pursue my educational goals. I extend my
sincere thanks to my relatives and friends for their encouragement and understanding. I
owe a debt of immense gratitude to my thesis supervisor, Professor Amalin Prince A.,
whose guidance, expertise, and patience have been invaluable in shaping my research and
helping me refine my work. I am deeply grateful for his insightful feedback,
constructive criticism, and unwavering support throughout the research process. Finally, I
would like to express my sincere appreciation to my institute, BITS Pilani KK Birla Goa
Campus. The institution's excellent academic environment, equipment, and dedicated
faculty have provided me with the foundation and resources necessary to conduct my
research.
Thank you all for your invaluable contributions.
Contents

Declaration of Authorship
Certificate
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
1.1 Motivation
1.2 Scope & Structure
2 Fundamentals
2.1 Current Work
2.1.1 Theoretical Background
2.1.2 FPGAs
2.2 Literature Review
2.2.1 AlexNet
2.2.2 VGGNet
2.2.3 ResNet
2.2.4 MobileNet
2.2.5 Other work
3 Design and Implementation
3.1 Design
3.2 Implementation
4 Hardware Implementation
4.1 Design Methodology
4.2 HLS Methodology
4.3 Design Overview
4.4 Caching Strategy
5 Conclusion
Appendix
Bibliography
List of Figures

Figure 1: Neuron Architecture
Figure 2: Layers in a CNN model
Figure 3: A typical convolutional neural network layer's components
Figure 4: Visual representation of AlexNet architecture
Figure 5: Visual representation of AlexNet layers
Figure 6: Visual representation of VGGNet architecture & AlexNet
Figure 7: Residual Learning: a building block of the ResNet architecture
Figure 8: Representation of the ResNet architecture
Figure 9: Visual representation of MobileNet architecture
Figure 10: Neurocoms work using 6 neurons and 2 receptor units
Figure 11: Chinese Academy logic architecture
Figure 12: Angel-Eye architecture
Figure 13: Xilinx Vivado 2021.3 IDE
Figure 14: Digilent Zedboard Avnet Evaluation Kit Zynq-7000 System-on-Chip (SoC)
Figure 15: Final Top-Level FPGA Design
Figure 16: Convolutional / Affine Layer Virtual Memory Test Bench
Figure 17: Convolutional / Affine Layer Block RAM Test Bench
Figure 18: Max Pool Layer Virtual Memory Test Bench
Figure 19: Max Pool Layer Block RAM Test Bench
List of Tables

Table 1: Specifications of Zedboard
Table 2: Simulation Model vs Hardware Implementation
Table 3: Resource Utilization of Final Design
Table 4: Hardware execution times of each AlexNet Layer
Table 5: Comparing AlexNet vs MobileNet
Table 6: Comparison of other works to this work
Abbreviations
CNN Convolutional Neural Networks
FPGA Field Programmable Gate Arrays
AI Artificial Intelligence
ML Machine Learning
HLS High Level Synthesis
DSP Digital Signal Processing
Dedicated to my family, friends, relatives
and electronics.
1. Introduction
1.1 Motivation
The field of high-performance computing (HPC) has witnessed a significant shift in recent
years, driven by the ever-increasing demand for processing power across diverse application
domains. This growth is fueled by advancements in various fields, including science,
engineering, finance, and healthcare, each requiring the ability to analyze and process
massive datasets in real-time. To address this growing demand, researchers have turned to
Field-Programmable Gate Arrays (FPGAs) as a promising alternative to traditional CPUs
and GPUs.
FPGAs offer several key advantages over traditional computing architectures. Their
reconfigurable nature allows them to be tailored to specific tasks, leading to significant
performance improvements compared to general-purpose CPUs. Additionally, FPGAs excel
in energy efficiency due to their parallel processing capabilities and optimized hardware
design. This combination of performance and efficiency makes FPGAs ideal candidates for
accelerating computationally intensive workloads in HPC.
Over the past few decades, the field of Artificial Intelligence (AI) has experienced
tremendous progress, revolutionizing numerous aspects of our lives. From image and
speech recognition to natural language processing and autonomous vehicles, AI has
demonstrably impacted various industries and scientific domains. This rapid advancement
is fueled by the increasing availability of computing resources and data, enabling the
development and deployment of complex machine learning algorithms and neural
networks.
However, the growing demand for AI applications necessitates the development of efficient
and scalable neural networks. Traditional software-based implementations often struggle to
handle the demands of real-time processing and resource limitations on mobile and
embedded systems. This is where FPGAs present a compelling solution. With their inherent
parallelism and hardware flexibility, FPGAs can be leveraged to implement efficient neural
networks that deliver superior performance and energy savings compared to software-based
approaches.
The motivation for this project stems from the desire to explore the potential of FPGAs in
accelerating Convolutional Neural Networks (CNNs), a class of neural networks widely
used in various AI applications, particularly image and video processing. CNNs excel in
extracting features and identifying patterns in images, making them instrumental for tasks
such as image recognition, object detection, and image segmentation.
My primary objective is to analyze and compare different CNN architectures available for
implementation on FPGAs. This analysis focuses on key performance metrics like resource
utilization, scalability, and real-time processing capabilities. The ultimate goal is to identify
and optimize a CNN model that delivers the best performance on the Zedboard, a popular
FPGA development platform.
Additionally, the potential for deploying CNNs on low-resource systems like smartphones
motivates this project. This enables the processing of sensitive data directly on the device,
eliminating the need for internet data transmission and ensuring data privacy.
Furthermore, integrating CNNs into embedded systems opens up exciting possibilities for
real-time applications in areas like robotics, autonomous vehicles, and smart home
technologies.
By exploring the implementation of various CNNs on FPGAs, this project aims to
contribute to the development of efficient and scalable AI solutions for resource-constrained
environments. The insights and findings will provide valuable knowledge and pave the way
for future research in the field of hardware-accelerated AI.
1.2 Scope & Structure
The prospect of creating a whole framework capable of analyzing data in real time
piqued my interest. However, due to the task's complexity and my limited prior
experience with neural networks, the scope was reduced to the following points.
1. The data set was restricted to handwritten digits. This serves as a simple starting
point before moving on to other forms of information such as written language or signals.
2. Only individual pre-existing images were used for static analysis. The main
reason for this decision is that, while neural networks capable of properly analyzing
video exist, their complexity is considerably higher and their use in embedded
systems has not yet been fully established, which would have added risk to the
project.
The project needs to be broken down into two independent sub-problems that can be tackled
separately. However, when combined, they will provide the desired overall outcome.
1. This work aims to develop a system configured to run as many layers as desired and
test it using a currently defined CNN configuration, AlexNet. This type of system would
allow a developer to scale a design to fit any size of FPGA.
2. Comparing two CNN architectures, AlexNet and MobileNet on the basis of their
measurable parameters like performance, speed, DSP slice, LUTs etc. on a Zedboard.
This would help determine the compatibility of these models on a sample Zedboard.
2. Fundamentals
2.1 Current Work
2.1.1 Theoretical Background
Convolutional Neural Networks (CNNs) are a type of artificial intelligence that fall within
the field of machine learning and are also categorized as a deep learning technique.
Neural networks: Inspired by the human brain, neural networks are computational
structures composed of interconnected nodes called neurons. These neurons receive and
process information from each other, mimicking the way synapses in the brain facilitate
communication. This intricate network of connections, numbering in the millions, underlies
the complex thought processes and behavior observed in humans and other intelligent
beings.
Artificial neural networks mimic the way neurons interact to construct systems in which
each building block (usually referred to as a neuron) receives several inputs, weighs
them using learned weights, and produces an output that is sent to several other
building blocks.
Fig 1. shows the hardware architecture of a neuron.
Fig.1 Neuron Architecture (Reddy, 2019)
A neuron receives multiple inputs, such as pixel values or sound data, depending on the
application. It multiplies the inputs x by suitable weights w, adds a bias b, and applies
an activation function σ to obtain the output σ(w·x + b).
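To make this concrete, here is a minimal C++ sketch (an illustration, not code from this project) of one neuron computing σ(w·x + b), with ReLU standing in for σ; the inputs and weights are hypothetical values.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Minimal sketch of one artificial neuron: weighted sum plus bias,
    // followed by a non-linear activation (ReLU here).
    double neuron(const std::vector<double>& x, const std::vector<double>& w, double b) {
        double sum = b;
        for (std::size_t i = 0; i < x.size(); ++i)
            sum += w[i] * x[i];        // accumulate w . x + b
        return std::max(0.0, sum);     // sigma = ReLU(z) = max(0, z)
    }

    int main() {
        // Hypothetical inputs (e.g., three pixel intensities) and trained weights.
        std::vector<double> x = {0.5, 0.2, 0.8};
        std::vector<double> w = {0.4, -0.6, 0.9};
        std::cout << neuron(x, w, 0.1) << "\n";  // prints 0.9
    }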
Functionality: Neural networks excel at classifying inputs into predetermined categories.
This ability stems from assigned weights to each neuron within the network. A crucial step
called training determines the specific combination of weights that enables accurate
classification. During this phase, the network receives numerous inputs with known
outputs, and the weights are adjusted iteratively until an optimal configuration is achieved.
Topology: To provide all neurons with a suitable structure for analyzing input data, they
can be organized in various ways. In our project, we will focus on networks where neurons
are arranged in ordered layers, only receiving input from the preceding layer and sending
output to the subsequent one. Consequently, the network's topology is defined by how the
layers are interconnected and the operations performed within each layer, often utilizing
previously learned weights.
Convolutional Neural Networks (CNNs) are a special type of neural networks that are
really good at working with 2D data, like images. They are commonly used for tasks like
identifying objects in images or labeling scenes.
Imagine a 256x256 image with three color channels (RGB). Feeding this pixel data into a
conventional neural network would require millions of weights, due to the typical
connectivity between neurons across layers. However, CNNs leverage the inherent spatial
locality of information in images. For instance, to identify a car in an image, analyzing
pixels in the top-right corner isn't crucial. Features like edges, lines, circles, and contours
provide enough context.
This is where convolutional layers come in. These specialized layers replace fully-connected
layers, allowing the network to focus on local information and extract meaningful features.
Each convolutional layer receives a stack of images as input and generates another stack as
output. These layers utilize small filters (kernels) to scan the input and extract features.
These filters, equipped with learned weights, help the network recognize patterns and
objects in the images.
In essence, CNNs employ convolutional layers to efficiently capture key features in images,
facilitating accurate image understanding and classification.
Convolutional Layer Details:
● Each layer receives a stack of ch_in 2D images with dimensions h_in × w_in,
referred to as input feature maps.
● Each layer outputs a stack of ch_out 2D images with dimensions h_out × w_out, called
output feature maps.
● Each layer utilizes a stack of ch_in × ch_out kernels (2D filters) with dimensions k × k
(typically ranging from 1×1 to 11×11) containing the trained weights; these shape
relationships are traced in the sketch following this list.
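A minimal C++ sketch of that bookkeeping, assuming a "valid" convolution with stride 1 and no padding; all sizes are hypothetical.

    #include <iostream>

    int main() {
        // Hypothetical layer: 3 input channels, 12 output channels, 5x5 kernels,
        // applied to a 32x32 input feature map (stride 1, no padding).
        int ch_in = 3, ch_out = 12, k = 5;
        int h_in = 32, w_in = 32;

        // A 'valid' convolution shrinks each spatial dimension by k - 1.
        int h_out = h_in - k + 1;  // 28
        int w_out = w_in - k + 1;  // 28

        // One k x k filter per (input channel, output channel) pair,
        // plus one bias per output channel.
        long long params = 1LL * ch_in * ch_out * k * k + ch_out;  // 912

        std::cout << "output: " << h_out << "x" << w_out << "x" << ch_out
                  << ", parameters: " << params << "\n";
    }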
By focusing on local information and utilizing efficient convolutional layers, CNNs achieve
exceptional performance in image-related tasks, solidifying their position as a powerful tool
for image processing and computer vision applications.
Fig.2 Layers in a CNN model (Goodfellow,2016)
Activation and Pooling
Activation: Each linear activation is then passed through a non-linear activation function.
This stage, also known as the "detector stage," introduces non-linearity into the network,
allowing it to learn complex relationships between features. A popular choice for the
activation function is the rectified linear unit (ReLU), which outputs the input value if it is
positive, and zero otherwise.
Pooling: This stage further modifies the layer's output by applying a pooling function.
Pooling functions summarize the output within a specific neighborhood, often reducing
the dimensionality of the data; a minimal max-pooling sketch follows the list below.
Common pooling functions include:
● Max pooling: Replaces each output with the maximum value within its rectangular
neighborhood.
● Average pooling: Replaces each output with the average value within its rectangular
neighborhood.
● L2-norm pooling: Replaces each output with the L2 norm of the values within its
rectangular neighborhood.
● Weighted average pooling: Replaces each output with a weighted average based on
the distance from the central pixel.
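As an illustration of the most common of these, here is a minimal C++ sketch of 2x2 max pooling with stride 2 on a single-channel feature map; the input values are hypothetical.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    using Map = std::vector<std::vector<float>>;

    // 2x2 max pooling, stride 2: each output element is the maximum of a
    // 2x2 neighborhood, halving both spatial dimensions. Sketch only; real
    // layers handle many channels, other window sizes, and edge cases.
    Map maxPool2x2(const Map& in) {
        std::size_t h = in.size() / 2, w = in[0].size() / 2;
        Map out(h, std::vector<float>(w));
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < w; ++j)
                out[i][j] = std::max({in[2*i][2*j],   in[2*i][2*j+1],
                                      in[2*i+1][2*j], in[2*i+1][2*j+1]});
        return out;
    }

    int main() {
        Map fm = {{1, 3, 2, 0},
                  {4, 2, 1, 5},
                  {0, 1, 8, 2},
                  {3, 6, 4, 7}};
        for (auto& row : maxPool2x2(fm)) {   // 4x4 -> 2x2
            for (float v : row) std::cout << v << " ";
            std::cout << "\n";               // prints: 4 5 / 6 8
        }
    }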
By performing these stages sequentially, CNNs extract and learn features from input data,
enabling them to perform complex tasks like image recognition and natural language
processing [4], as illustrated in Fig. 3 below.
Fig.3: A typical convolutional neural network layer's components (Goodfellow, 2016)
Convolutional networks (ConvNets) can be described using two distinct sets of terminology.
Left-hand View: This perspective treats the ConvNet as a collection of relatively complex layers, each containing
multiple "stages." Each kernel tensor directly corresponds to a network layer in this interpretation.
Right-hand View: This perspective presents the ConvNet as a sequence of simpler layers. Every processing step
within the network is considered its own individual layer. Consequently, not every "layer" possesses learnable
parameters.[4]
Practical Convolution
Convolution in the context of neural networks transcends a singular operation. It involves
the parallel application of multiple convolutions, leveraging the strength of extracting
diverse features across multiple spatial locations. A single kernel can only identify one type
of feature, limiting the richness of extracted information. By employing multiple kernels in
parallel, the network extracts a broader spectrum of features, enhancing its
representational power.
Neural networks often handle data with a richer structure than mere grids of real values.
The input typically consists of "vector-valued observations," where each data point holds
additional information beyond a single value. For instance, a color image presents red,
green, and blue intensity values at each pixel, creating a 3-dimensional tensor. One index
denotes the different channels (red, green, blue), while the other two specify the spatial
coordinates within each channel[4].
Software implementations of convolution often employ "batch mode," processing multiple
data samples simultaneously. This introduces an additional dimension (the "batch axis") to
the tensor, representing different examples within the batch. For clarity, we will disregard
the batch axis in our subsequent discussion [4].
A crucial element of convolutional networks is "multi-channel convolution," where both the
input and output possess multiple channels. This multi-channel nature introduces an
interesting property: the linear operations involved are not guaranteed to be commutative,
even with the implementation of "kernel flipping." Commutativity only holds true when
each operation involves the same number of input and output channels.
To illustrate these concepts, consider a 3-channel color image as the input to a convolutional
layer with multiple kernels. Each kernel extracts a specific type of feature from each
channel, resulting in multiple "feature maps." These feature maps, when combined, form
the output of the convolution operation[4].
Training a Neural Network
Because training is computationally expensive, there are frameworks and tools available
to help with this process; two popular ones are Caffe and TensorFlow.
In this thesis, different frameworks were explored gradually, starting with simpler ones
and moving towards more advanced ones, given our limited prior experience.
There exist two primary forms of training for neural networks:
1. Full training: In situations where an ample amount of data is accessible, it is possible
to train all the network weights to enhance results tailored to the specific application.
2. Transfer learning: Frequently, insufficient data is available to train all the weights
from the ground up. In such instances, a prevalent strategy involves employing a
pre-trained network designed for a distinct application. The majority of layer weights are
repurposed, with only the final layer being adjusted to align with the requirements of the
new application.
2.1.2 FPGAs
Field-Programmable Gate Arrays (FPGAs) are a type of integrated circuit that can be
reprogrammed and reconfigured countless times after they have been manufactured.
These devices form the foundation of reconfigurable computing, a computing approach
that emphasizes splitting applications into parallel, application-specific pipelines. FPGAs
have reconfigurable logic resources like LUTs (Look-Up Tables), DSPs (Digital Signal
Processing slices), and BRAMs (Block RAMs). These resources can be connected and configured
in various ways, allowing the implementation of different electronic circuits. The allure
of reconfigurable computing lies in its ability to merge the rapidity of hardware with the
adaptability of software, essentially bringing together the most advantageous features of
both hardware and software.
Harnessing the computational power of FPGAs takes a leap forward with distributed
computing. This strategy clusters FPGAs, dividing problems into smaller tasks for
parallel processing. By working as a team, this distributed network unlocks significant
performance gains through parallelization.
This approach offers key benefits:
● Scalability: Easily add FPGAs to the cluster as computational demands grow.
● Efficiency: Shared resources and coordinated tasks optimize resource utilization.
● Flexibility: Adapt and optimize the configuration to meet specific needs.
● Performance: Parallelization boosts processing speed for quicker results.
Distributed FPGAs hold promise in various fields:
● HPC: Solve complex scientific and engineering problems faster.
● AI: Train and deploy AI models with the necessary power and scalability.
● Real-Time Applications: Meet the demanding requirements of latency-sensitive
fields like robotics and autonomous systems.
High Level Synthesis and FPGAs
For over three decades, engineers have relied on Hardware Description Languages (HDLs)
to design electronic circuits implemented in FPGAs. This approach, while established,
requires a significant investment of time and expertise. Writing detailed descriptions of
each hardware component can be tedious and demands a deep understanding of the
underlying hardware structure.
However, a fresh and promising paradigm shift has emerged in recent years: High-Level
Synthesis (HLS). This innovative approach leverages the familiarity and convenience of
high-level languages like C to design hardware. Dedicated tools then translate this
high-level code into an equivalent hardware description in a lower-level language, known as
Register Transfer Level (RTL).
Several compelling advantages make HLS an increasingly attractive choice for hardware
design:
● Maturity and Stability: HLS tools have evolved significantly, offering improved
reliability and a clearer understanding of the generated hardware behavior.
● Efficiency and Performance: HLS can often produce hardware that rivals, or even
surpasses, the efficiency achieved by manually crafted HDL code. This efficiency
gain, combined with the significantly faster development cycle, makes HLS a
compelling option.
Given these benefits, HLS was chosen as the technology for this
thesis, paving the way for a more efficient and accessible approach to FPGA design.
2.2 Literature Review
Deep Learning
Deep learning utilizes artificial neural networks, inspired by the human brain, to perform
machine learning tasks. These networks consist of multiple layers organized hierarchically,
enabling them to learn complex patterns from data.
Each layer progressively builds upon the knowledge acquired by the previous layer. The
initial layers extract fundamental features, like edges or lines, from the input data.
Subsequent layers combine these basic features into more complex shapes and objects,
culminating in the identification of the desired target.
Imagine training a deep learning model to recognize cats in images. The initial layers
would learn to detect edges and lines, the building blocks of shapes. Moving up the
hierarchy, the network would combine these basic elements into more complex features, like
ovals and rectangles, which could represent whiskers, paws, and tails. Finally, the topmost
layers would recognize these combined features as specific to cats, allowing the network
to differentiate them from other animals.
While focusing on cat identification, the network simultaneously learns about other
objects present in the training data. This allows it to generalize its knowledge and apply it
to other contexts, recognizing cats in diverse environments and situations. This
hierarchical learning process, where simple features are gradually combined to form
complex representations, is the core of deep learning's success. It allows the network to
effortlessly handle complex tasks, making it a powerful tool for various applications.
ZynqNet, derived from the SqueezeNet topology [8] that was initially designed for
embedded systems, is tailored to be FPGA-friendly through modifications made during
development. The topology comprises an initial convolutional layer, 8 identical fire
modules, and a final classification layer, each fire module containing 3 convolutional
layers. Notably, efforts were made to align hyperparameters with power-of-two values.
Key points of improvement in this thesis include:
1. HW Definition: Zynqnet's original hardware accelerator is only partially implemented
on the Xilinx Zynq board [6], working closely with an ARM processor. In contrast, the
presented accelerator is fully hardware-designed, adapting to runtime layer variations
without software intervention.
2. Fixed Point: To mitigate FPGA overhead, fixed-point computations replace the 32-bit
floating-point implementation used in Zynqnet. The Ristretto tool [7] guides bit width and
fractional bits, applying manual fine-tuning.
3. Data vs. Mem: Significant size and memory reductions occur by reducing classification
items and employing 8-bit fixed-point weights. This optimization simplifies the system,
eliminating external memory access and prioritizing computation speed over memory
volume in the accelerator.
2.2.1 AlexNet
Introduced in 2012, AlexNet is a pioneering deep learning architecture developed using the
ImageNet database. Krizhevsky et al. trained a deep convolutional neural network on 1.2
million high-resolution images, each with dimensions of 224x224 RGB pixels (Li, F., et al.,
2017). Achieving a top-1 error rate of 37.5% and a top-5 error rate of 17.0%, the network
comprised 60 million parameters, 650,000 neurons, five convolutional layers followed by
ReLU and max pool layers, three fully connected layers, and a 1000-way softmax classifier
[9]. The architecture, illustrated below, marked the first prominent use of a rectified
linear unit as an activation layer, deviating from the conventional sigmoid activation
function. This groundbreaking implementation secured victory in the ImageNet
LSVRC-2012 competition. The entire project was conducted using two GTX 580 GPUs.
Fig.4: Visual representation of AlexNet architecture.
Illustration shows the layers used and their interconnectivity.(Krizhevsky, 2012)
Fig.5: Visual representation of AlexNet layers.
Illustration shows the layers used and their interconnectivity. (Li, F., et al., 2017)
2.2.2 VGGNet
In 2014, Karen Simonyan and Andrew Zisserman, researchers at the University of Oxford's
Visual Geometry Group, introduced VGGNet, a groundbreaking architecture that
significantly improved upon the capabilities of its predecessor, AlexNet. VGGNet's key
innovation was its increased depth, achieved by adding more convolutional layers. These
layers utilized smaller receptive fields, primarily 3x3 and 1x1 filters, enabling the network
to extract more detailed and nuanced features from the input images.
Simonyan and Zisserman tested various configurations of their network, all adhering to a
general design but differing in depth. They experimented with 11, 13, and 19 weight layers,
with each depth further divided into sub-configurations. Among these configurations,
VGG16 and VGG19 emerged as the top performers. VGG16 achieved a
top-1 error rate of 27.3% and a top-5 error rate of 8.1%. VGG19, with
its increased depth, further improved upon these results, achieving a top-1 error rate
of 25.5% and a top-5 error rate of 8.0%.[7]
As expected, the increased depth of VGG16 and VGG19 led to a significant rise in the
number of parameters. VGG16 boasts 138 million parameters, while VGG19 possesses an
even more impressive 144 million parameters.
Figure 6 provides a visual comparison of VGG16, VGG19, and their predecessor AlexNet,
highlighting the significant architectural advancements made by VGGNet. This innovative
architecture ultimately earned top honors in the 2014 ImageNet ILSVRC challenge,
solidifying its place as a landmark achievement in the field of deep learning.
Fig. 6: Visual representation of VGGNet architecture & AlexNet (right)
Illustration shows the layers used and their interconnectivity.
(Li, F., et. al, 2017)
2.2.3 ResNet
In 2015, a team from Microsoft, including Kaiming He, Xiangyu Zhang, Shaoqing Ren, and
Jian Sun, developed the ResNet architecture as an enhancement to VGGNet. Recognizing
the importance of network depth for accuracy, they addressed the "vanishing gradient"
problem during backpropagation by introducing "deep residual learning." This novel
framework incorporated "Shortcut Connections," hypothesized to simplify training and
optimization while overcoming the gradient issue (He et al., 2015; Li, F., et al., 2017).
Fig.7: Residual Learning: a building block of the ResNet architecture. (He et al., 2015)
For their experiments they constructed a 34-layer plain network with no shortcut
connections and a 34-layer network with shortcut connections, a ResNet. They also
configured several networks with layer counts increasing incrementally from 34 to 152
layers. Overall, the 34-layer ResNet outperformed the 34-layer plain network, and the
152-layer network achieved a top-5 error rate of 3.57% in the 2015 ImageNet LSVRC
competition. This network architecture won the 2015 ImageNet LSVRC challenge.
(Li, F., et al., 2017)
Fig.8 shows the overall ResNet architecture.
Fig.8: Representation of the ResNet architecture [11]
2.2.4 MobileNet
MobileNets, which were originally developed by Google for mobile and embedded vision
applications [12], are distinguished by their use of depth-wise separable convolutions,
which reduce trainable parameters when compared to networks with regular convolutions
of the same depth. MobileNetV2 introduced linear bottlenecks and inverted residuals,
resulting in lightweight deep neural networks that are ideal for the scenario under
consideration in this work.
Fig. 9: Visual representation of MobileNet architecture
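The parameter savings from depth-wise separable convolutions can be verified with simple arithmetic; the C++ sketch below compares a standard convolution against its depth-wise separable counterpart for one hypothetical layer.

    #include <iostream>

    int main() {
        // Hypothetical layer: 3x3 kernels, 64 input channels, 128 output channels.
        long long k = 3, ch_in = 64, ch_out = 128;

        // Standard convolution: one k x k filter per (input, output) channel pair.
        long long standard = k * k * ch_in * ch_out;           // 73728

        // Depth-wise separable: one k x k depth-wise filter per input channel,
        // followed by a 1x1 point-wise convolution that mixes channels.
        long long separable = k * k * ch_in + ch_in * ch_out;  // 576 + 8192 = 8768

        std::cout << "standard: " << standard << ", separable: " << separable
                  << ", reduction: " << (double)standard / separable << "x\n";  // ~8.4x
    }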
2.2.5 Other work
Several research groups have explored implementing Convolutional Neural Networks
(CNNs) on FPGAs, achieving impressive results in terms of performance and efficiency.
Here's a summary of five notable works:
1. Real-Time Video Object Recognition System (Neurocoms, South Korea, 2015)
● Architecture: Custom 5-layer CNN developed in Matlab.
● Input: Grayscale images (28x28).
● Platform: Xilinx KC705 evaluation board.
● Frequency: 250MHz.
● Power consumption: 3.1 watts.
● Resource utilization: 42,616 LUTs, 32 BRAMs, 326 DSP48s.
● Data format: 16-bit fixed point.
● Performance: Focused on frames per second.
Figure 10: Neurocoms work using 6 neurons and 2 receptor units
(Ahn, B,2015)
This paper describes a real-time video object recognition system implemented on an FPGA.
The system consists of a receiver, a feature map, and a detector. The receiver decodes and
pre-processes the video stream, the feature map extracts features using a CNN, and the
detector identifies objects by comparing the features to a database.
Key takeaways:
● Real-time performance
● Very efficient (3.1 watts power consumption)
● FPGA implementation enables high performance
2. Small CNN Implementation (Institute of Semiconductors, Chinese Academy of
Sciences, Beijing, China, 2015)
● Architecture: 3 convolutional layers with activation, 2 pooling layers, 1 softmax
classifier.
● Input: 32x32 images.
● Platform: Altera Arria V FPGA board.
● Frequency: 50MHz.
● Data format: 8-bit fixed point.
● Performance: Focused on images per second.
Figure 11: Chinese Academy logic architecture
(Li, H et.al., 2015)
This paper presents a small CNN implementation on an FPGA. The CNN consists of three
convolutional layers with activation, two pooling layers, and a softmax classifier. The input
images are 32x32 and the data format is 8-bit fixed point. The CNN is implemented on an
Altera Arria V FPGA board and operates at a frequency of 50MHz.
Key takeaways:
● The CNN achieves a frame rate of 50 frames per second, which is sufficient for real-time
video processing.
● The CNN achieves an accuracy of 93.6% on the MNIST handwritten digit
classification task.
● The CNN uses 118K LUTs, 112K BRAMs, and 13K DSPs.
3. Angel-Eye System (Tsinghua University and Stanford University, 2016)
● Architecture: Array of custom processing elements.
● Platform: Xilinx Zynq XC7Z045.
● Frequency: 150MHz.
● Data format: 16-bit fixed point.
● Power consumption: 9.63 watts.
● Performance: 187.80 GFLOPS (VGG16 ConvNet).
● Custom compiler: Minimizes external memory access.
Figure 12: Angel-Eye (Left) Angel-Eye architecture. (Right) Processing Element
(Guo, K. et. al., 2016)
4. Customized Software Tools for CNN Accelerator (Purdue University, 2016)
● Platform: Xilinx Kintex-7 XC7K325T.
● Performance: 58-115 GFLOPS.
● Architecture: Custom software tools for optimization.
● Data format: Not specified.
5. Scalable FPGA Implementation of CNN (Arizona State University, 2016)
● Platform: Stratix-V GXA7.
● Frequency: 100MHz.
● Data format: 16-bit fixed point.
● Power consumption: 19.5 watts.
● Performance: 114.5 GFLOPS.
● Resource utilization: 256 DSPs, 112K LUTs, 2,330 BRAMs.
● Shared multiplier bank: Optimizes multiplication operations.
So far
One major challenge in deploying Deep Learning (DL) models on FPGAs has been their
limited design size. The inherent trade-off between reconfigurability and density restricts
the implementation of large neural networks on FPGAs. However, advancements in
fabrication technology, particularly the use of smaller feature sizes, are enabling denser
FPGAs. Additionally, the integration of specialized computational units alongside the
general FPGA fabric enhances processing capabilities. These advancements are paving the
way for the implementation of complex DL models on single FPGA systems, opening up
new possibilities for hardware-accelerated AI.
3. Design and Implementation
3.1 Design
Let's delve deeper into the individual layers of a convolutional neural network:
1. Input: The network begins with the input image, typically represented as a 3D matrix
with dimensions representing width, height, and color channels (e.g., RGB). In this case,
the image size is 32x32 pixels with three color channels.
2. Convolutional Layer: This layer applies filters to the input image, extracting features
through localized dot product calculations. Applying 12 filters would result in a new 3D
volume with dimensions 32x32x12, where each element represents the activation of a
specific feature at a specific location.
3. ReLU Layer: The rectified linear unit (ReLU) layer applies a non-linear activation
function, typically max(0,x), to each element in the previous volume. This introduces
non-linearity and sparsity into the feature representation, enhancing the network's ability
to learn complex patterns. The volume size remains unchanged (32x32x12).
4. Max Pooling Layer: This layer performs downsampling by selecting the maximum value
within a predefined neighborhood in the input volume. By reducing the spatial dimensions
(e.g., by a factor of 2), the network can achieve translational invariance and reduce
computational complexity. In this case, the resulting volume would be 16x16x12.
5. Affine/Fully Connected Layer: This layer connects all neurons in the previous volume to
each output neuron, essentially performing a weighted sum followed by a bias addition.
This final step calculates the class scores for each possible category, resulting in a 1x1x10
volume where each element represents the score for a specific class. The shape
transformations across these five stages are traced in the sketch below.
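A minimal C++ sketch of that trace, assuming "same" padding in the convolution so the spatial size is preserved (as the 32x32x12 figure above implies):

    #include <iostream>

    struct Shape { int h, w, c; };

    void show(const char* stage, Shape s) {
        std::cout << stage << ": " << s.h << "x" << s.w << "x" << s.c << "\n";
    }

    int main() {
        Shape x{32, 32, 3};  show("input", x);
        x.c = 12;            show("conv, 12 filters, same padding", x);
        /* ReLU is element-wise */ show("relu", x);
        x.h /= 2; x.w /= 2;  show("max pool 2x2, stride 2", x);
        x = {1, 1, 10};      show("affine / fully connected", x);  // class scores
    }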
Sequential Processing and Parameter Learning:
Convolutional Neural Networks transform the input image through a series of layers,
gradually extracting features and building increasingly complex representations. While
some layers like ReLU and Max Pooling operate with fixed functions, others like
Convolutional and Fully Connected layers involve trainable parameters (weights and
biases). These parameters are adjusted through gradient descent optimization during
training, allowing the network to learn optimal representations based on labeled data.
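Concretely, a single plain gradient-descent update has the following shape; this is a sketch of the update rule only (real training obtains the gradients via backpropagation, and the learning rate is a tunable hyperparameter).

    #include <vector>

    // One gradient-descent step: w <- w - lr * dL/dw, applied element-wise.
    void gradientStep(std::vector<double>& w,
                      const std::vector<double>& grad, double lr) {
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= lr * grad[i];
    }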
3.2 Implementation
Having covered the foundational aspects of Deep Learning and reviewed prominent Deep
Convolutional Neural Network architectures, along with their implementations on FPGA,
let's delve into the specifics of this design. This section outlines the implementation of Deep
Convolutional Neural Networks in FPGA, discussing similarities & distinctions from prior
works, design goals, and the tools employed. Following this, we provide an overview of the
overall architecture intended for implementation on the FPGA. Owing to the focus on
hardware implementation, time constraints, and the availability of pre-existing trained
image data for the CNN, code was sourced from the internet. Finally, we
comprehensively examine four key sub-designs: the Convolutional/Affine Layer, ReLU Layer,
Max Pooling Layer, and Softmax Layer.
Similarities
In scrutinizing previous works where groups implemented DCNNs on FPGAs, numerous
similarities emerge between their implementations and the present work. Certain aspects
of DCNNs are inherently common across designs aimed at accelerating DCNNs.
Consequently, essential elements like required layers (e.g., convolution, ReLu, max pool)
and adder trees for summing channel products will not be explicitly discussed in this
section.
Bus Protocol
Firstly, prior works showcase designs employing sub-module intercommunication. Several
designs that utilized separate sub-modules in their overall architecture employed a
communication bus protocol. This approach leverages existing intellectual property from
FPGA manufacturers such as Intel or AMD, allowing the focus to be on the DCNN portion
of the task rather than the infrastructure. Additionally, hardware microprocessors or
implemented co-processors can communicate with the submodules, providing valuable
insights for both software and hardware developers during debugging and verification. The
drawback, however, is that a bus protocol introduces additional overhead to the design due
to handshaking between sub-modules for reliable communication. Moreover, the presence of
the bus protocol necessitates more signal routing, utilizing overall FPGA resources and
potentially leading to increased dwell time with no task being executed. Despite these
drawbacks, effective management can be achieved by carefully planning the overall design's
concept of operations.
DSP Slices
A prevalent aspect shared among prior works and the present study involves the utilization
of Digital Signal Processing (DSP) Slices. These dedicated hardware components excel in
performing multiply and add operations for both floating-point and fixed-precision
numbers. DSP slices outperform custom designs implemented in hardware description
language (HDL). FPGAs benefit from maximizing available DSP slices, enhancing the
speed of designs, especially in Deep Convolutional Neural Networks (DCNNs).
Data Format
In the software domain, deep learning research typically employs 64-bit double-precision
floating-point numbers for weight data. While some works have employed 32-bit
single-precision numbers, there is mounting evidence suggesting that reducing the bit
width and format can significantly impact overall performance. A common alteration is the
use of 16-bit fixed-precision numbers. Alternatively, truncating the 32-bit single-precision
number to a 16-bit "half" precision number is proposed, presenting a potentially more
effective design.
Scalability
Scalability, a crucial feature in previous works and this study, revolves around navigating
through the CNN. As witnessed in other works, the increasing size of software
implementations of DCNNs, exemplified by the 152-layer ResNet design, poses a challenge
for FPGA implementation. To address this, strategies involve implementing reusable
designs capable of performing the functions of all necessary layers in the DCNN
architecture.
Simple Interface
Unlike many previous works, considerable effort has been invested in creating a custom
compiler to completely describe a Deep Convolutional Neural Network in this design. The
aim is to make the DCNN accessible to both software and hardware designers by making
FPGA hardware programmable. The FPGA can be commanded through function calls in the
microprocessor, performing register writes to the FPGA implementation.
Flexible Design
Unlike prior works where CNN designs are tailored to specific hardware boards, this work
aims for a configurable number of DSPs depending on the FPGA in use. Each layer in the
CNN is modular and can interact through a bus protocol, allowing developers to insert
multiple instances of the Convolutional Layer, Affine Layer, and Max Pooling Layer.
Tools
Throughout the development process of implementing a CNN on an FPGA, various tools
were employed. The choice of utilizing Xilinx chips was influenced by their extensive usage
and the author's prior experience with Xilinx products. Consequently, the tools selected for
this development were drawn from the diverse set offered by AMD Xilinx. The central
design environment was the Xilinx Vivado 2021.3 package (refer to Figure 13), serving as
the primary design hub throughout the developmental phase. Within Vivado, each neural
network layer type was crafted as an AXI-capable submodule. Additionally, Vivado
facilitated integration with pre-existing Xilinx Intellectual Property (IP), such as the Zynq
SoC Co-Processor and Memory Interface Generator (MIG). Lastly, Vivado acted as a
platform for software development, enabling the creation of straightforward software to run
on the Zynq SoC.
Fig.13 Xilinx Vivado 2021.3 IDE
Hardware: The FPGA platform chosen was the Zedboard. Digilent's Zedboard Development
Kit consists of a matrix of programmable logic blocks and programmable interconnections.
The Zedboard is built around a Xilinx Zynq-7000 SoC that combines a dual-core ARM
Cortex-A9 processor with FPGA fabric.
Fig. 14 Digilent Zedboard Avnet AES-series Evaluation Kit Zynq-7000 System-on-Chip (SoC) (www.digilent.com)
Table 1: Specifications of Zedboard (www.digilent.com)
SPECIFICATION          DESCRIPTION
SoC Options            XC7Z020-CLG484-1
Memory                 512 MB DDR3; 256 Mb Quad-SPI Flash
Video Display          1080p HDMI; 8-bit VGA; 128 x 32 OLED
User Inputs            8 user switches and 7 user push buttons
Audio                  I2S Audio CODEC
Analog                 XADC header
Power                  12 VDC
Certification          CE, RoHS
Dimensions             5.3" x 6.3"
Configuration Memory   256 Mb Quad-SPI Flash, SD card, onboard USB JTAG
Ethernet               10/100/1000 Ethernet
USB                    USB 2.0
Communications         USB 2.0, USB-UART, 10/100/1000 Ethernet
User I/O               (see User Inputs)
Other                  PetaLinux BSP
4. Hardware Implementation
4.1 Design Methodology
In extensive code projects with multiple instances and increasing complexity, it is crucial
to define the order and scope of the steps involved and how they will be executed;
together, these constitute the project's methodology.
4.2 HLS Methodology
As detailed in Section 2.2, High Level Synthesis (HLS) is chosen for hardware
implementation due to its suitability. Xilinx® Vivado HLS is employed in this project,
following its three-step methodology:
1. Software Simulation: This involves testing code execution using a regular software
compiler and CPU, aided by a test bench.
2. Synthesis: Generating HDL files crucial for code implementation and HLS pragmas.
This step is critical, executed after successful software simulation.
3. Co-Simulation: The most significant step, testing synthesized code functionality using a
hardware simulation. It leverages the test bench from software simulation, comparing
outputs and ensuring hardware-software consistency.
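For orientation, the sketch below shows the general shape of an HLS kernel: ordinary C++ that a software compiler can execute in step 1, annotated with Vivado HLS pragmas that guide RTL generation in step 2. The function, sizes, and pragma placement are illustrative assumptions, not code from this design.

    // Multiply-accumulate over one kernel window, written in the HLS style.
    // Compiles as plain C++ for software simulation; the pragmas take effect
    // only during synthesis. Window size is illustrative (a 5x5 kernel).
    #define K 25

    float mac_window(const float window[K], const float weights[K], float bias) {
    #pragma HLS PIPELINE II=1
        float acc = bias;
        for (int i = 0; i < K; ++i) {
    #pragma HLS UNROLL
            acc += window[i] * weights[i];  // multiplies map naturally to DSP slices
        }
        return acc;
    }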
Table 2: Simulation Model vs Hardware Implementation

Layer     SIM FOPs     HW FOPs      Diff
CONV1     0.7407 G     0.73530 G    0.74%
CONV2     126.897 M    113.796 M    12.89%
CONV3     35.158 M     29.106 M     27.66%
CONV4     26.645 M     20.830 M     27.91%
CONV5     26.574 M     20.763 M     27.99%
AFFINE1   176.322 M    113.884 M    54.83%
AFFINE2   87.677 M     38.077 M     130.26%
AFFINE3   33.919 M     20.229 M     83.23%
4.3 Design Overview
Fig.15 Final Top-Level FPGA Design
Before delving into the pipelined core and other enhancements, understanding the module's
top-level functionality is vital. The system comprises three modules and a group of
memories:
1. Pipelined Core: This module serves as the computational powerhouse, receiving layer
parameters, weight information, and input data from the Flow Control module. It executes
the necessary calculations and generates the desired outputs.
2. Convolution Flow Control: This module acts as the conductor, ensuring the proper
execution of the network topology. It determines whether update or classification tasks are
required and orchestrates access to all memory units and relevant layer parameters.
3. Memory Controller: This module acts as the memory interface, deciphering read/write
positions for data exchange with the memory units. It receives instructions from both the
Flow Control and Pipelined Core modules, ensuring smooth data flow and efficient memory
utilization.
By understanding the interactions and responsibilities of these modules, we gain a clear
understanding of how the system operates as a whole. This high-level perspective provides
a valuable foundation for delving deeper into the specific details of the individual
components and their contributions to the overall system performance.
4.4 Caching Strategy
Careful loop ordering and data-reuse optimization allow reused information to be stored
locally, avoiding the overhead of repeatedly accessing on-chip memory. Caches are needed
for the kernels and biases, the outputs, and the inputs.
1. Kernel and Bias Caches: Simplest caches loaded at the beginning and updated during
channel changes.
2. Output Cache: More complex due to irregular access pattern, loading bias and
computing ReLU for performance maximization.
3. Input Cache: Most complex, addressing reuse issues with a group of multiple registers
that displace information every iteration.
Memory Controller: Arrays Merging
Adapting access patterns between layers and facilitating simultaneous access to multiple
elements are essential for varying memory requirements.
Fixed-Point Implementation
Following Ristretto's fixed-point analysis of the network, the bit width and number of
fractional bits are defined. Xilinx® Vivado HLS provides a fixed-point arithmetic type for
this (ap_fixed<total width, integer width>, where the fractional width is the difference
between the two). Since Vivado HLS requires a compile-time definition of the fractional
bits, runtime reconfiguration is managed using integers and bit shifts for the fixed-point
operations.
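A minimal sketch of that integer-and-bit-shift approach, using a Q8.8 format (16 bits, 8 fractional) purely as an example; the widths actually used in this design were chosen with Ristretto.

    #include <cstdint>
    #include <iostream>

    // Q8.8 fixed point: stored value = real value * 2^8. Widths illustrative only.
    const int FRAC_BITS = 8;

    int16_t toFixed(double x) { return (int16_t)(x * (1 << FRAC_BITS)); }
    double  toReal(int16_t q) { return (double)q / (1 << FRAC_BITS); }

    // Fixed-point multiply: widen to 32 bits, multiply, shift back down.
    // (Arithmetic right shift of the signed product on typical targets.)
    int16_t mulFixed(int16_t a, int16_t b) {
        return (int16_t)(((int32_t)a * b) >> FRAC_BITS);
    }

    int main() {
        std::cout << toReal(mulFixed(toFixed(1.5), toFixed(-2.25))) << "\n";  // -3.375
    }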
Following is a detailed explanation of the four test benches shown in the diagrams:
1. Convolutional/Affine Layer Virtual Memory Test Bench
This test bench verifies the functionality of the convolutional/affine layer implementation
by comparing its outputs to the expected outputs generated by a reference software model.
The test bench loads the input and kernel data into virtual memory and then performs the
convolutions/affine operations. The outputs are then compared to the expected outputs to
ensure that the implementation is correct.
2. Convolutional/Affine Layer Block RAM Test Bench
This test bench is similar to the virtual memory test bench, but it stores the input and
kernel data in block RAM instead of virtual memory. This test bench is useful for verifying
the performance of the convolutional/affine layer implementation, as it can achieve higher
throughput by avoiding the overhead of accessing virtual memory.
3. Max Pool Layer Virtual Memory Test Bench
This test bench verifies the functionality of the max pool layer implementation by
comparing its outputs to the expected outputs generated by a reference software model. The
test bench loads the input data into virtual memory and then performs the max pooling
operation. The outputs are then compared to the expected outputs to ensure that the
implementation is correct.
4. Max Pool Layer Block RAM Test Bench
This test bench is similar to the virtual memory test bench, but it stores the input data in
block RAM instead of virtual memory. This test bench is useful for verifying the
performance of the max pool layer implementation, as it can achieve higher throughput by
avoiding the overhead of accessing virtual memory.
The diagram shows the four test benches connected to a common input and output
interface. This allows the test benches to be easily swapped in and out, depending on the
layer being tested.
Input and Output Interface: This interface provides a common way to load input data into
the test benches and to read the output data from them. The interface can be
implemented using a variety of methods, such as FIFO buffers or DMA transfers.
Virtual Memory: Virtual memory is used to store the input and kernel data for the
convolutional/affine layer and the max pool layer virtual memory test benches. Virtual
memory allows the test benches to access large amounts of data without having to load it
all into physical memory at once.
Block RAM: Block RAM is used to store the input data for the convolutional/affine layer and
the max pool layer block RAM test benches. Block RAM is a type of on-chip memory that is
faster than virtual memory, but it has a limited capacity.
Test Bench Control Logic:
The test bench control logic is responsible for loading the input and kernel data into the test
benches, performing the convolutions/affine operations or the max pooling operation, and
comparing the outputs to the expected outputs. The test bench control logic can be
implemented using a variety of different methods, such as a finite state machine, a
microcontroller, or a software program.
The four test benches described above are essential tools for verifying the functionality and
performance of convolutional neural network implementations on FPGAs. By using these
test benches, designers can ensure that their implementations are correct and that they
meet the desired performance requirements.
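The pattern shared by all four benches is the same: run the layer under test, run the trusted software reference on identical inputs, and compare outputs element by element within a small tolerance. A minimal C++ sketch of that comparison (function name and tolerance are hypothetical):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Core check of every test bench: compare the hardware (or co-simulated)
    // outputs against the software reference model, within a fixed tolerance
    // that absorbs fixed-point rounding.
    bool outputsMatch(const std::vector<float>& dut,
                      const std::vector<float>& reference, float tol = 1e-3f) {
        if (dut.size() != reference.size()) return false;
        for (std::size_t i = 0; i < dut.size(); ++i)
            if (std::fabs(dut[i] - reference[i]) > tol) {
                std::printf("mismatch at %zu: got %f, expected %f\n",
                            i, dut[i], reference[i]);
                return false;
            }
        return true;
    }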
Performance Evaluation and Analysis
After implementing all optimization techniques, the accelerator was ready to classify
images using trained network weights. To simulate the hardware behavior, Xilinx® Vivado
HLS Co-simulation was employed. Images from the validation dataset, which achieved 73%
accuracy with Ristretto, were evaluated. The simulation process, spanning over 185 hours,
resulted in an overall 58% accuracy, requiring 26 million cycles per image.
With a relatively small critical path, a 100MHz clock can be utilized, enabling the
processing of approximately 4 frames per second. These results are deemed successful, as
the achieved accuracy meets the project's minimum threshold, and the performance
surpasses the lower limit by nearly fourfold. Consequently, no further modifications are
required, and the accelerator is prepared for deployment.
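The frame-rate figure follows directly from the clock frequency and the cycle count:

    100,000,000 cycles/s ÷ 26,000,000 cycles/image ≈ 3.85 ≈ 4 frames per second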
Table 3 Resource Utilization of Final Design (AlexNet)
Resource Utilization Optimization
While the accelerator described in Section 3.2 is functional and implementable, the
pipelined core's low resource footprint (35 DSPs, ~41,000 flip-flops, and ~36,500 LUTs;
see Table 3 below) leaves headroom for modifications or duplications that reduce the
pipeline depth. This situation is particularly suited to HLS optimization, which can
sometimes surpass manual design.
Initially, Vivado generated two core instances with a 4-stage pipeline, requiring 26,596,261
cycles, due to different memory inputs. To improve this design, various configurations were
explored using the function_instantiate pragma, creating four core instances. By sharing
resources effectively, only 15% more DSPs, 27% more flip-flops, and 33% more LUTs were
utilized compared to the double-mode core implementation. This configuration enabled
reducing two out of the four pipelines by one stage each. However, this modification
resulted in a negligible 0.2% performance improvement, ultimately leading to its rejection.
Table 3: Resource Utilization of Final Design (AlexNet)

| Resource | Utilization | Available | Utilization % |
|----------|-------------|-----------|---------------|
| LUT      | 36527       | 53200     | 68.66         |
| LUTRAM   | 2594        | 46200     | 5.61          |
| FF       | 41198       | 106400    | 38.72         |
| BRAM     | 54          | 140       | 38.22         |
| DSP      | 35          | 220       | 16.08         |
| IO       | 69          | 285       | 24.21         |
| BUFG     | 7           | 32        | 21.88         |
| MMCM     | 2           | 10        | 20.00         |
| PLL      | 1           | 10        | 10.00         |

Here are some parameters to compare different CNN implementations on FPGA:
● Throughput: the amount of input data processed per unit time, a key measure of
the performance of a CNN implementation on an FPGA. Here it is measured in FOPS
(floating-point operations per second); see the relations given after this list.
● Latency: the time taken by the CNN to process one input.
● Resource utilization: an important measure of the efficiency of a CNN
implementation on an FPGA.
● Power consumption: a crucial measure of the energy efficiency of a CNN
implementation on an FPGA.
● Accuracy: an important measure of the effectiveness of a CNN implementation on
an FPGA.
● Flexibility: the ability of the implementation to adapt to different CNN models
and configurations.
● Ease of use: the effort required to develop, configure, and deploy the
implementation.
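For reference, throughput and latency are tied together by simple relations. If a layer performs \(N_{\text{ops}}\) floating-point operations in execution time \(t\), and an image takes \(N_{\text{cycles}}\) at clock frequency \(f_{\text{clk}}\), then

\[
\text{FOPS} = \frac{N_{\text{ops}}}{t}, \qquad \text{frames per second} = \frac{f_{\text{clk}}}{N_{\text{cycles}}}
\]

so, for a fixed operation count, FOPS is inversely proportional to execution time.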
Table 4: Hardware execution times of each AlexNet Layer
| Layer   | Start Time | End Time                                      | Total Time    | FOPS      |
|---------|------------|-----------------------------------------------|---------------|-----------|
| CONV1   | 0          | 71198.67 us (Epoch = 0x1161e, Cycle = 0x43)   | 71.19867 ms   | 0.7456 G  |
| CONV2   | 0          | 547753.71 us (Epoch = 0x85BA9, Cycle = 0x47)  | 547.75371 ms  | 108.806 M |
| CONV3   | 0          | 463776.90 us (Epoch = 0x713A0, Cycle = 0x5A)  | 463.77690 ms  | 24.858 M  |
| CONV4   | 0          | 697862.14 us (Epoch = 0xaa606, Cycle = 0x0E)  | 697.86214 ms  | 16.551 M  |
| CONV5   | 0          | 466757.25 us (Epoch = 0x71f45, Cycle = 0x19)  | 466.75725 ms  | 16.543 M  |
| AFFINE1 | 0          | 796440.32 us (Epoch = 0xc2718, Cycle = 0x20)  | 796.44032 ms  | 110.922 M |
| AFFINE2 | 0          | 1018890.52 us (Epoch = 0xf8c0a, Cycle = 0x34) | 1018.89052 ms | 33.446 M  |
| AFFINE3 | 0          | 4682.26 us (Epoch = 0x124A, Cycle = 0x1A)     | 4.68226 ms    | 17.769 M  |
Table 4 shows that the convolutional layers (CONV1-CONV5) together account for the
majority of the total execution time, reflecting the large number of floating-point
operations they perform. The fully connected layers (AFFINE1-AFFINE3) also contribute
substantially; AFFINE2 is in fact the single slowest layer, since fully connected layers
likewise perform many floating-point operations and demand considerable memory
bandwidth. As noted above, for a given operation count the FOPS of a layer is inversely
proportional to its execution time, so the layers that take longest relative to their
workload achieve the lowest FOPS.
Here are some specific observations from the table:
● Among the convolutional layers, CONV1 has the shortest execution time
(71.199 ms) and CONV4 the longest (697.862 ms).
● Among the fully connected layers, AFFINE1 achieves the highest FOPS, at
110.922 million.
● AFFINE3 has the lowest FOPS of the fully connected layers, at 17.769 million,
despite its very short execution time, because it performs comparatively few
operations (see Table 5).
Summing the per-layer times in Table 4 gives a total execution time of approximately
4.07 seconds per image for this layer-by-layer execution, which corresponds to roughly
0.25 frames per second (see the computation below).
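As a check, summing the Total Time column of Table 4:

\[
t_{\text{total}} = 71.20 + 547.75 + 463.78 + 697.86 + 466.76 + 796.44 + 1018.89 + 4.68 \approx 4067.4\ \text{ms},
\]

which gives \(1 / 4.0674\ \text{s} \approx 0.25\) frames per second for this layer-by-layer execution.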
Table 5: AlexNet vs MobileNet

| Layer   | Ops To Perform | AlexNet FOPS | MobileNet FOPS | Difference |
|---------|----------------|--------------|----------------|------------|
| CONV1   | 210249696      | 0.7407 G     | 2.9530 G       | 34878.85   |
| CONV2   | 62332672       | 126.897 M    | 113.796 M      | 287.09     |
| CONV3   | 13498752       | 35.158 M     | 29.106 M       | 42.42      |
| CONV4   | 14537088       | 26.645 M     | 20.830 M       | 30.31      |
| CONV5   | 9691392        | 26.574 M     | 20.763 M       | 30.15      |
| AFFINE1 | 90701824       | 176.322 M    | 113.884 M      | 115.9      |
| AFFINE2 | 38797312       | 87.677 M     | 38.077 M       | 24.67      |
| AFFINE3 | 94720          | 33.919 M     | 20.229 M       | 19.76      |
It is important to note that the performance of a CNN implementation on an FPGA can be
affected by a variety of factors, such as the FPGA platform, the CNN architecture, and
the optimization techniques used. Table 5 only provides a comparison of two specific
CNN implementations on one specific FPGA platform.
Table 6: Comparison of other works to this work (AlexNet)

|             | Guo, K. et al., 2016 | Ma, Y. et al., 2016 | Zhang, C. et al., 2015 | Espinosa, M., 2019 | This Work              |
|-------------|----------------------|---------------------|------------------------|--------------------|------------------------|
| FPGA        | Zynq XC7Z045         | Stratix-V GXA7      | Virtex7 VX485T         | Artix7 XC7A200T    | Zedboard Zynq AES-Z7EV |
| Clock Freq  | 150 MHz              | 100 MHz             | 100 MHz                | 100 MHz            | 100 MHz                |
| Data format | 16-bit fixed         | Fixed (8-16b)       | 32-bit float           | 32-bit float       | 32-bit fixed           |
| Power       | 9.63 W (measured)    | 19.5 W (measured)   | 18.61 W (measured)     | 1.5 W (estimated)  | 0.9 W (estimated)      |
| FF          | 127653               | ?                   | 205704                 | 103610             | 41198                  |
| LUT         | 182616               | 121000              | 186251                 | 91865              | 36527                  |
| BRAM        | 486                  | 1552                | 1024                   | 139.5              | 54                     |
| DSP         | 780                  | 256                 | 2240                   | 119                | 35                     |
| Performance | 187.80 GFOPS         | 114.5 GFOPS         | 61.62 GFOPS            | 2.93 GFOPS         | 0.74 GFOPS             |
Methods of Improvement / Scope
This implementation of a Convolutional Neural Network in an AlexNet configuration is a
first-pass attempt and leaves a lot of room for improvement and optimization. There are
several ways the performance of this implementation could be increased, which would be
areas for future work. Table 6 shows the differences in resource utilization and
performance between other recent works and this one. Although this implementation
achieved lower GFOPS performance, it uses far fewer chip resources than any of the
other implementations, and its estimated power consumption is far lower.
5. Conclusions
5.1 Results
While Deep Learning and Convolutional Neural Networks (CNNs) have traditionally
resided within the realm of Computer Science, with massive computations performed on
GPUs housed in desktop computers, their increasing power demands raise concerns about
efficiency. Existing FPGA implementations for CNNs primarily focus on accelerating the
convolutional layer and often have rigid structures limiting their flexibility.
This work aims to address these limitations by proposing a scalable and modular FPGA
implementation for CNNs. Unlike existing approaches, this design seeks to configure the
system for running an arbitrary number of layers, offering greater flexibility and
adaptability.
The proposed architecture was evaluated on publicly available CNN architectures like
AlexNet, ResNet, and MobileNet on a Zedboard platform. Performance analysis revealed
MobileNet as the fastest among the three, achieving an accuracy of 47.5%. This
demonstrates the system's potential for efficient and adaptable execution of diverse CNN
architectures.
This work paves the way for further research in scalable and flexible FPGA
implementations for CNNs, offering promising avenues for resource-efficient deep learning
beyond traditional computing platforms.
Appendix
// CNN sample layer model: one layer of NN parallel neurons.
// Only neuron 0 is instantiated below; the remaining NN-1 instances
// (each with its own weight/bias .mif files) are omitted for brevity.
module Layer_1
#(
    parameter NN             = 30,      // number of neurons in this layer
    parameter numWeight      = 784,     // weights per neuron (e.g., a 28x28 input)
    parameter dataWidth      = 16,      // width of each activation word
    parameter layerNum       = 1,
    parameter sigmoidSize    = 10,      // address width of the sigmoid lookup table
    parameter weightIntWidth = 4,       // integer bits of the fixed-point weights
    parameter actType        = "relu"   // activation function selector
)
(
    input                      clk,
    input                      rst,
    input                      weightValid,
    input                      biasValid,
    input  [31:0]              weightValue,
    input  [31:0]              biasValue,
    input  [31:0]              config_layer_num,
    input  [31:0]              config_neuron_num,
    input                      x_valid,
    input  [dataWidth-1:0]     x_in,
    output [NN-1:0]            o_valid,            // per-neuron output-valid flags
    output [NN*dataWidth-1:0]  x_out               // concatenated neuron outputs
);

neuron #(
    .numWeight(numWeight), .layerNo(layerNum), .neuronNo(0),
    .dataWidth(dataWidth), .sigmoidSize(sigmoidSize),
    .weightIntWidth(weightIntWidth), .actType(actType),
    .weightFile("w_1_0.mif"), .biasFile("b_1_0.mif")
) n_0 (
    .clk(clk),
    .rst(rst),
    .myinput(x_in),
    .weightValid(weightValid),
    .biasValid(biasValid),
    .weightValue(weightValue),
    .biasValue(biasValue),
    .config_layer_num(config_layer_num),
    .config_neuron_num(config_neuron_num),
    .myinputValid(x_valid),
    .out(x_out[0*dataWidth +: dataWidth]),
    .outvalid(o_valid[0])
);

endmodule
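For context, a hypothetical instantiation of this layer from a parent module could look as follows; the connecting signal names are illustrative:

// Hypothetical instantiation; connecting signal names are illustrative.
Layer_1 #(
    .NN(30), .numWeight(784), .dataWidth(16), .layerNum(1),
    .sigmoidSize(10), .weightIntWidth(4), .actType("relu")
) layer1_inst (
    .clk(clk), .rst(rst),
    .weightValid(weightValid), .biasValid(biasValid),
    .weightValue(weightValue), .biasValue(biasValue),
    .config_layer_num(config_layer_num),
    .config_neuron_num(config_neuron_num),
    .x_valid(x_valid), .x_in(x_in),
    .o_valid(layer1_o_valid), .x_out(layer1_x_out)
);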
Due to space constraints, all the data, references and code can be accessed here: Thesis_Appendix
Bibliography
[1] D. M. Harris and S. L. Harris, Digital Design and Computer Architecture. Elsevier,
(2007)
[2] S.Authors, History of artificial intelligence
[3] Farabet, C., Martini, B., Akselrod, P., Talay, S., LeCun, Y., Culurciello, E.: Hardware
accelerated convolutional neural networks for synthetic vision systems. In: Circuits
and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. pp.
257–260. IEEE (2010)
[4] Goodfellow, I., & Bengio, Y., & Courville, A Convolutional Networks. In Dietterich,
T.,(Ed.), Deep Learning(326-339). Cambridge, Massachusetts: The MIT Press.(2016)
[5] D. Gschwend, Zynqnet: An fpga-accelerated embedded convolutional neural network.
[6] Xilinx (2017). Zynq-7000 All Programmable SoC Family Product Tables and Product
SelectionGuide. Retrieved from
https://www.xilinx.com/support/documentation/selection-guides/zynq-7000-product-se
lection-guide.pdf
[7] Romén Neris, Adrián Rodríguez, Raúl Guerra. FPGA-Based Implementation of a
CNN Architecture for the On-Board Processing of Very High-Resolution Remote
Sensing Images, IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, Vol. 15, 2022.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,
arXiv:1602.07360, (2016)
[9] Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. Advances in Neural Information Processing Systems,
25 (NIPS 2012). Retrieved from
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neu
ral-networks.pdf
[10] Zisserman, A. & Simonyan, K. (2014). Very Deep Convolutional Networks For
Large-Scale Image Recognition. Retrieved from https://arxiv.org/pdf/1409.1556.pdf
[11] Li, F., et. al. CNN Architectures [PDF document]. Retrieved from Lecture Notes Online
Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
[12] Qiao, Y., Shen, J., Xiao, T., Yang, Q., Wen, M., & Zhang, C. FPGA-accelerated
deep convolutional neural networks for high throughput and energy efficiency.
Concurrency and Computation: Practice and Experience. John Wiley & Sons
Ltd. (May 06, 2016).
[13] Lacey, G., & Taylor, G., & Areibi, S. Deep Learning on FPGAs: Past, Present and
Future. Cornell University Library. https://arxiv.org/abs/1602.04283 (Feb. 13, 2016)
[14] Gomez, P. Implementation of a Convolutional Neural Network (CNN) on a FPGA for
Sign Language's Alphabet recognition. Archivo Digital UPM. Retrieved December 6,
2023, from https://oa.upm.es/53784/1/TFG_PABLO_CORREA_GOMEZ.pdf (2018, July)
[15] Espinosa, M. A. Implementation of Convolutional Neural Networks in FPGA for Image
Classification. ScholarWorks. Retrieved December 6, 2023, from
https://scholarworks.calstate.edu/downloads/hd76s209r (2019, Spring)
[16] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image
Recognition.Retrieved from https://arxiv.org/pdf/1512.03385.pdf
[17] Reddy, G. (2019, January 1). FPGA Implementation of Multiplier-Accumulator Unit
using Vedic multiplier and Reversible gates. Semantic Scholar.
https://www.semanticscholar.org/paper/FPGA-Implementation-of-Multiplier-Accumul
ator-Unit-Rajesh-Reddy/edab41b3600b2b51d6887042487bac32c80182b5
[18] Guo, K., & Sui, L., & Qiu, J., & Yao, S., & Han, S., & Wang, Y., & Yang, H. (July. 13,
2016). Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized
Hardware. IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016,
pp.24-29. doi:10.1109/ISVLSI.2016.129
[19] Ahn, B. (Oct. 01, 2015). Real-time video object recognition using convolutional neural
networks. International Joint Conference on Neural Networks (IJCNN), 2015.
doi:10.1109/IJCNN.2015.7280718
39

More Related Content

Similar to Analysis on Implementation of different CNN Architectures on FPGAs | Undergrad Thesis - BITS F421T Thesis | Author: Prayag Mohanty |BITS Pilani KK Birla Goa Campus

Thesies_Cheng_Guo_2015_fina_signed
Thesies_Cheng_Guo_2015_fina_signedThesies_Cheng_Guo_2015_fina_signed
Thesies_Cheng_Guo_2015_fina_signedCheng Guo
 
Master Arbeit_Chand _Piyush
Master Arbeit_Chand _PiyushMaster Arbeit_Chand _Piyush
Master Arbeit_Chand _PiyushPiyush Chand
 
Design and Development of a Knowledge Community System
Design and Development of a Knowledge Community SystemDesign and Development of a Knowledge Community System
Design and Development of a Knowledge Community SystemHuu Bang Le Phan
 
Design and Simulation of Local Area Network Using Cisco Packet Tracer
Design and Simulation of Local Area Network Using Cisco Packet TracerDesign and Simulation of Local Area Network Using Cisco Packet Tracer
Design and Simulation of Local Area Network Using Cisco Packet TracerAbhi abhishek
 
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFT
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFTCS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFT
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFTJosephat Julius
 
An investigation into the physical build and psychological aspects of an inte...
An investigation into the physical build and psychological aspects of an inte...An investigation into the physical build and psychological aspects of an inte...
An investigation into the physical build and psychological aspects of an inte...Jessica Navarro
 
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...Makgopa Gareth Setati
 
A Software Approach for Lower Power Consumption.pdf
A Software Approach for Lower Power Consumption.pdfA Software Approach for Lower Power Consumption.pdf
A Software Approach for Lower Power Consumption.pdfHanaTiti
 
Accelerated Prototyping of Cyber Physical Systems in an Incubator Context
Accelerated Prototyping of Cyber Physical Systems in an Incubator ContextAccelerated Prototyping of Cyber Physical Systems in an Incubator Context
Accelerated Prototyping of Cyber Physical Systems in an Incubator ContextSreyas Sriram
 
complete_project
complete_projectcomplete_project
complete_projectAnirban Roy
 
An investigation into the critical factors involved in developing a Knowledge...
An investigation into the critical factors involved in developing a Knowledge...An investigation into the critical factors involved in developing a Knowledge...
An investigation into the critical factors involved in developing a Knowledge...Gowri Shankar
 
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...Dejan Ilic
 
HEC Project Proposal_v1.0
HEC Project Proposal_v1.0HEC Project Proposal_v1.0
HEC Project Proposal_v1.0Awais Shibli
 
E.Leute: Learning the impact of Learning Analytics with an authentic dataset
E.Leute: Learning the impact of Learning Analytics with an authentic datasetE.Leute: Learning the impact of Learning Analytics with an authentic dataset
E.Leute: Learning the impact of Learning Analytics with an authentic datasetHendrik Drachsler
 
Project_ReportTBelle(1)
Project_ReportTBelle(1)Project_ReportTBelle(1)
Project_ReportTBelle(1)Tyler Belle
 
Integration of technical development within complex project environment
Integration of technical development within complex project environmentIntegration of technical development within complex project environment
Integration of technical development within complex project environmentJacobs Engineering
 

Similar to Analysis on Implementation of different CNN Architectures on FPGAs | Undergrad Thesis - BITS F421T Thesis | Author: Prayag Mohanty |BITS Pilani KK Birla Goa Campus (20)

Tr1546
Tr1546Tr1546
Tr1546
 
Thesies_Cheng_Guo_2015_fina_signed
Thesies_Cheng_Guo_2015_fina_signedThesies_Cheng_Guo_2015_fina_signed
Thesies_Cheng_Guo_2015_fina_signed
 
Master Arbeit_Chand _Piyush
Master Arbeit_Chand _PiyushMaster Arbeit_Chand _Piyush
Master Arbeit_Chand _Piyush
 
Masters_Raghu
Masters_RaghuMasters_Raghu
Masters_Raghu
 
Design and Development of a Knowledge Community System
Design and Development of a Knowledge Community SystemDesign and Development of a Knowledge Community System
Design and Development of a Knowledge Community System
 
Design and Simulation of Local Area Network Using Cisco Packet Tracer
Design and Simulation of Local Area Network Using Cisco Packet TracerDesign and Simulation of Local Area Network Using Cisco Packet Tracer
Design and Simulation of Local Area Network Using Cisco Packet Tracer
 
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFT
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFTCS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFT
CS499_JULIUS_J_FINAL_YEAR_PROJETCT_L_DRAFT
 
An investigation into the physical build and psychological aspects of an inte...
An investigation into the physical build and psychological aspects of an inte...An investigation into the physical build and psychological aspects of an inte...
An investigation into the physical build and psychological aspects of an inte...
 
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...
Makgopa Setati_Machine Learning for Decision Support in Distributed Systems_M...
 
A Software Approach for Lower Power Consumption.pdf
A Software Approach for Lower Power Consumption.pdfA Software Approach for Lower Power Consumption.pdf
A Software Approach for Lower Power Consumption.pdf
 
Accelerated Prototyping of Cyber Physical Systems in an Incubator Context
Accelerated Prototyping of Cyber Physical Systems in an Incubator ContextAccelerated Prototyping of Cyber Physical Systems in an Incubator Context
Accelerated Prototyping of Cyber Physical Systems in an Incubator Context
 
fac_alahari001_planczhaov1
fac_alahari001_planczhaov1fac_alahari001_planczhaov1
fac_alahari001_planczhaov1
 
complete_project
complete_projectcomplete_project
complete_project
 
An investigation into the critical factors involved in developing a Knowledge...
An investigation into the critical factors involved in developing a Knowledge...An investigation into the critical factors involved in developing a Knowledge...
An investigation into the critical factors involved in developing a Knowledge...
 
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...
ILIC Dejan - MSc: Secure Business Computation by using Garbled Circuits in a ...
 
HEC Project Proposal_v1.0
HEC Project Proposal_v1.0HEC Project Proposal_v1.0
HEC Project Proposal_v1.0
 
E.Leute: Learning the impact of Learning Analytics with an authentic dataset
E.Leute: Learning the impact of Learning Analytics with an authentic datasetE.Leute: Learning the impact of Learning Analytics with an authentic dataset
E.Leute: Learning the impact of Learning Analytics with an authentic dataset
 
Project_ReportTBelle(1)
Project_ReportTBelle(1)Project_ReportTBelle(1)
Project_ReportTBelle(1)
 
NMacgearailt Sumit_thesis
NMacgearailt Sumit_thesisNMacgearailt Sumit_thesis
NMacgearailt Sumit_thesis
 
Integration of technical development within complex project environment
Integration of technical development within complex project environmentIntegration of technical development within complex project environment
Integration of technical development within complex project environment
 

More from Prayag Mohanty

"Touch-Me-Not" by Ismat Chughtai: A Critical Analysis
"Touch-Me-Not"  by Ismat Chughtai: A Critical Analysis"Touch-Me-Not"  by Ismat Chughtai: A Critical Analysis
"Touch-Me-Not" by Ismat Chughtai: A Critical AnalysisPrayag Mohanty
 
Periodic Styles in Indian Traditional Art - Mughal, Kangra, Miniature
Periodic Styles in Indian Traditional Art - Mughal, Kangra, MiniaturePeriodic Styles in Indian Traditional Art - Mughal, Kangra, Miniature
Periodic Styles in Indian Traditional Art - Mughal, Kangra, MiniaturePrayag Mohanty
 
Pattachitra - Elements of Design | Intro to Contemporary Arts Presentation
Pattachitra - Elements of Design | Intro to Contemporary Arts PresentationPattachitra - Elements of Design | Intro to Contemporary Arts Presentation
Pattachitra - Elements of Design | Intro to Contemporary Arts PresentationPrayag Mohanty
 
Modern Indian Art - Principles of design
Modern Indian Art - Principles of designModern Indian Art - Principles of design
Modern Indian Art - Principles of designPrayag Mohanty
 
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...Prayag Mohanty
 
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies Prayag Mohanty
 
Quizzinga- Biz & Inno Quiz | Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...
Quizzinga- Biz & Inno Quiz |  Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...Quizzinga- Biz & Inno Quiz |  Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...
Quizzinga- Biz & Inno Quiz | Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...Prayag Mohanty
 
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...Prayag Mohanty
 
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...Prayag Mohanty
 
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...Prayag Mohanty
 

More from Prayag Mohanty (10)

"Touch-Me-Not" by Ismat Chughtai: A Critical Analysis
"Touch-Me-Not"  by Ismat Chughtai: A Critical Analysis"Touch-Me-Not"  by Ismat Chughtai: A Critical Analysis
"Touch-Me-Not" by Ismat Chughtai: A Critical Analysis
 
Periodic Styles in Indian Traditional Art - Mughal, Kangra, Miniature
Periodic Styles in Indian Traditional Art - Mughal, Kangra, MiniaturePeriodic Styles in Indian Traditional Art - Mughal, Kangra, Miniature
Periodic Styles in Indian Traditional Art - Mughal, Kangra, Miniature
 
Pattachitra - Elements of Design | Intro to Contemporary Arts Presentation
Pattachitra - Elements of Design | Intro to Contemporary Arts PresentationPattachitra - Elements of Design | Intro to Contemporary Arts Presentation
Pattachitra - Elements of Design | Intro to Contemporary Arts Presentation
 
Modern Indian Art - Principles of design
Modern Indian Art - Principles of designModern Indian Art - Principles of design
Modern Indian Art - Principles of design
 
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...
PYAAR KA PUNCHNAMA 2024 | BITS Goa Quiz Club | Modern Love Quiz | QM: Prayag ...
 
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies
AN ANALYSIS ON SEABORNE VESSEL TRAFFIC & ECONOMY | Maritime Studies
 
Quizzinga- Biz & Inno Quiz | Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...
Quizzinga- Biz & Inno Quiz |  Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...Quizzinga- Biz & Inno Quiz |  Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...
Quizzinga- Biz & Inno Quiz | Coalescence'23 | BITS Goa Quiz Club | QM:Prayag...
 
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...
Potluck - Food & Culinary Quiz 2022 | QM Prayag Mohanty | BITS Pilani KK Birl...
 
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...
Aero Quiz 2022 | QM: Prayag Mohanty | BITS Pilani KK Birla Goa Campus | BITS ...
 
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...
Development Economics Assignment (NAFTA + MERCOSUR): A look at the Industrial...
 

Recently uploaded

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 

Recently uploaded (20)

College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 

Analysis on Implementation of different CNN Architectures on FPGAs | Undergrad Thesis - BITS F421T Thesis | Author: Prayag Mohanty |BITS Pilani KK Birla Goa Campus

  • 1. Analysis on Implementation of different CNN Architectures on FPGAs UNDERGRADUATE THESIS Submitted in partial fulfillment of the requirements of BITS F421T Thesis By PRAYAG MOHANTY ID No. 2020A3PS0566G Under the supervision of: Dr. AMALIN PRINCE A. BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, GOA CAMPUS December 2023 1
  • 2. Declaration of Authorship I, Prayag Mohanty, declare that this Undergraduate Thesis titled, ‘Analysis on implementation of different CNN Architectures on FPGA’ and the work presented in it are my own. This was undertaken in the First Semester of 2023-24. I confirm that: ● This research was primarily conducted while I was a candidate for a research degree at this University. ● Any portions of this thesis previously submitted for a degree or qualification at this or another institution are explicitly identified. ● I consistently and clearly credit any consulted published works of others. ● All quotations are attributed to their original sources. With the exception of such quotations, the content of this thesis is entirely my own original work. ● I have expressed my gratitude for all significant sources of assistance. ● If the thesis draws on work I conducted collaboratively with others, I have clearly outlined each individual's contribution, including my own. Signed: Date: 12 / 12 / 23 i
  • 3. Certificate This is to certify that the thesis entitled, “Analysis on implementation of different CNN Architectures on FPGA” and submitted by Prayag Mohanty ID No. 2020A3PS0566G in partial fulfillment of the requirements of BITS F421T Thesis embodies the work done by him under my supervision. _____________________________ Supervisor Dr. Amalin Prince A. Professor, Dept. of EEE BITS-Pilani K.K.Birla Goa Campus Date: 12 / 12 / 23 ii
  • 4. “Knowledge is a tool, best shared. So is my thesis :) ” -Prayag Mohanty iii
  • 5. BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, K.K.BIRLA GOA CAMPUS Abstract Bachelor of Engineering (Hons.) Analysis on implementation of different CNN Architectures on FPGA by Prayag Mohanty Convolutional Neural Networks (CNNs) are a special type of neural networks that are exceptionally good at working with data, like images, signals etc. The usage of Field-Programmable Gate Arrays (FPGAs) in high-performance computing has garnered significant attention with the advent of Artificial Intelligence. This thesis investigates the performance and resource utilization of various convolutional neural network (CNN) models for implementation on Field-Programmable Gate Arrays (FPGAs). The primary objective is to identify optimal CNN models for FPGA deployment based on their performance, resource utilization, and other relevant parameters. Two prominent CNN models, AlexNet and MobileNet, were chosen for analysis. Both models were implemented on an FPGA platform. Performance metrics such as resource utilization metrics, including logic slices, memory blocks, and DSP slices, were monitored to assess the hardware requirements of each model. The evaluation results demonstrate that MobileNet exhibits significantly lower resource utilization compared to AlexNet while maintaining a commendable level of performance. This suggests that MobileNet is a more efficient option for deploying CNN models on FPGAs with limited hardware resources. AlexNet, on the other hand, offers superior performance but at the expense of higher resource consumption. This makes it a suitable choice for applications where performance is paramount and resources are less restricted.This analysis provides valuable insights into the suitability of different CNN models for FPGA implementation based on their performance and resource utilization characteristics. Keywords: Convolutional Neural Networks, FPGA, Performance, Resource Utilization, AlexNet, MobileNet iv
  • 6. Acknowledgements The journey of completing this thesis has been a rewarding but challenging one, and I would like to express my heartfelt gratitude to those who have supported me throughout the process. First and foremost, I want to thank my family for their unwavering love and support. Their constant encouragement and belief in me have been instrumental in helping me overcome obstacles and persevere through difficulties. I am especially grateful for the sacrifices they made to enable me to pursue my educational goals.I extend my sincere thanks to my relatives and friends for their encouragement and understanding. I owe a debt of immense gratitude to my thesis supervisor, Professor Amalin Prince A. whose guidance, expertise, and patience have been invaluable in shaping my research and helping me refine my work. I am deeply grateful for their insightful feedback, constructive criticism, and unwavering support throughout the research process. Finally, I would like to express my sincere appreciation to my institute, BITS Pilani KK Birla Goa Campus. The institution's excellent academic environment, equipment, and dedicated faculty have provided me with the foundation and resources necessary to conduct my research. Thank you all for your invaluable contributions. v
  • 7. Contents Declaration of Authorship i Certificate ii Abstract iv Acknowledgements v Contents vi List of Figures viii List of Tables ix Abbreviations x 1 Introduction 1 1.1 Motivation.................................................................................................................. 1 1.2 Scope & Structure......................................................................................................2 2 Fundamentals 3 2.1 Current Work............................................................................................................ 3 2.1.1 Theoretical Background...................................................................................3 2.1.2 FPGA................................................................................................................ 3 2.2 Literature Review..................................................................................................... 4 1.2.1 AlexNet.........................................................................................................10 1.2.2 ResNet...........................................................................................................11 1.2.3 MobileNet..................................................................................................... 14 3 Design and Implementation 15 3.1 Design......................................................................................................................15 3.2 Implementation.......................................................................................................16 4 Hardware Implementation 20 4.1 Design Methodology................................................................................................20 4.2 HLS Methodology....................................................................................................20 4.3 Design Overview .................................................................................................... 20 4.4 Caching Strategy.....................................................................................................21 vi
  • 8. 5 Conclusion 24 Appendix 26 Bibliography 30 vii
  • 9. List of Figures 1.1 Figure 1: Neuron Architecture…………………………….................................................................................. 4 1.2 Figure 2: Layers in a CNN model. .................................................................................................................. 6 1.3 Figure 3: A typical convolutional neural network layer's components.......................................................... 6 1.4 Figure 4: Visual representation of AlexNet architecture................................................................................ 10 1.5 Figure 5: Visual representation of AlexNet layers……...................................................................................11 1.6 Figure 6: Residual Learning: a building block of the ResNet architecture...................................................11 1.7 Figure 7: Representation of the ResNet architecture.….................................................................................13 1.8 Figure 8: Xilinx Vivado 2021.3 IDE………………………...............................................................................17 1.9 Figure 9: Digilent Zedboard Avnet Evaluation Kit Zynq-7000 System-on-Chip (SoC)...............................18 1.10 Figure 10: AlexNet Lite……………………......................................................................................................18 1.11 Figure 11: Chinese Academy logic architecture............................................................................................ 19 1.12 Figure 12: Angel-Eye architecture................................................................................................................. 20 1.13 Figure 13: MobileNet Lite……..............................................................................................……………..…. 20 1.14 Figure 14: Xilinx Vivado IDE..........................................................................................…………………..... 21 1.15 Figure 15: Final Top-Level FPGA Design..................................................................................................... 22 1.16 Figure 16: Convolutional / Affine Layer Virtual Memory Test Bench…………………………………….….. 23 1.17 Figure 17: Convolutional / Affine Layer Block RAM Test Bench……………………………..…………….… 23 1.18 Figure 18: Max Pool Layer Virtual Memory Test Bench………………………………………….……….….… 23 1.19 Figure 19: Max Pool Layer Block RAM Test Bench……………………………………………….……….….… 23 viii
  • 10. viii
  • 11. List of Tables 1.20 Table 1: Specifications of Zedboard……………………………................................................................ 19 1.21 Table 2: Resource Utilization of Final Design. ...................................................................................... 21 1.22 Table 3: Hardware execution times of each AlexNet Layer…………..................................................... 24 1.23 Table 4: Simulation Model vs Hardware Implementation.................................................................... 25 1.24 Table 5: Comparing AlexNet vs MobileNet……..................................................................................... 25 1.25 Table 6: Comparison of other works to this work………………………….............................................. 26 ix
  • 12. Abbreviations CNN Convolutional Neural Networks FPGA Field Programmable Gate Arrays AI Artificial Intelligence ML Machine Learning HLS High Level Design DSP Digital Signal Processing
  • 13. Dedicate this to my family, friends, relatives and electronics. 14
  • 14. 1.Introduction 1.1 Motivation The field of high-performance computing (HPC) has witnessed a significant shift in recent years, driven by the ever-increasing demand for processing power across diverse application domains. This growth is fueled by advancements in various fields, including science, engineering, finance, and healthcare, each requiring the ability to analyze and process massive datasets in real-time. To address this growing demand, researchers have turned to Field-Programmable Gate Arrays (FPGAs) as a promising alternative to traditional CPUs and GPUs. FPGAs offer several key advantages over traditional computing architectures. Their reconfigurable nature allows them to be tailored to specific tasks, leading to significant performance improvements compared to general-purpose CPUs. Additionally, FPGAs excel in energy efficiency due to their parallel processing capabilities and optimized hardware design. This combination of performance and efficiency makes FPGAs ideal candidates for accelerating computationally intensive workloads in HPC. Over the past few decades, the field of Artificial Intelligence (AI) has experienced tremendous progress, revolutionizing numerous aspects of our lives. From image and speech recognition to natural language processing and autonomous vehicles, AI has demonstrably impacted various industries and scientific domains. This rapid advancement is fueled by the increasing availability of computing resources and data, enabling the development and deployment of complex machine learning algorithms and neural networks. However, the growing demand for AI applications necessitates the development of efficient and scalable neural networks. Traditional software-based implementations often struggle to handle the demands of real-time processing and resource limitations on mobile and embedded systems. This is where FPGAs present a compelling solution. With their inherent 1
  • 15. parallelism and hardware flexibility, FPGAs can be leveraged to implement efficient neural networks that deliver superior performance and energy savings compared to software-based approaches. The motivation for this project stems from the desire to explore the potential of FPGAs in accelerating Convolutional Neural Networks (CNNs), a class of neural networks widely used in various AI applications, particularly image and video processing. CNNs excel in extracting features and identifying patterns in images, making them instrumental for tasks such as image recognition, object detection, and image segmentation. My primary objective is to analyze and compare different CNN architectures available for implementation on FPGAs. This analysis focuses on key performance metrics like resource utilization, scalability, and real-time processing capabilities. The ultimate goal is to identify and optimize a CNN model that delivers the best performance on the Zedboard, a popular FPGA development platform. Additionally, the potential for deploying CNNs on low-resource systems like smartphones motivates this project. This enables the processing of sensitive data directly on the device, eliminating the need for internet data transmission and ensuring data privacy. Furthermore, integrating CNNs into embedded systems opens up exciting possibilities for real-time applications in areas like robotics, autonomous vehicles, and smart home technologies. By exploring the implementation of various CNNs on FPGAs, this project aims to contribute to the development of efficient and scalable AI solutions for resource-constrained environments. The insights and findings will provide valuable knowledge and pave the way for future research in the field of hardware-accelerated AI. 1.2 Scope & Structure The prospect of creating a whole framework capable of analyzing data in real time piqued my interest. However, due to the task's complexity and the lack of intensive experience with Neural Networks earlier, the scope was reduced to the following points. 2
  • 16. 1. The data set was restricted to numbers. This would be a simple & good starter for other forms of information like written language, signals etc. 2. Only individual pre-existing images were used for static analysis. The main reason for this decision is because, while there exist Neural Networks capable of properly analyzing video, their complexity has risen and the analysis for their use in embedded systems has not yet been fully established, which would add an additional risk to the project. The project needs to be broken down into two independent sub-problems that can be tackled separately. However, when combined, they will provide the desired overall outcome. 1. This work aims to develop a system configured to run as many layers as desired and test it using a currently defined CNN configuration, AlexNet. This type of system would allow a developer to scale a design to fit any size of FPGA. 2. Comparing two CNN architectures, AlexNet and MobileNet on the basis of their measurable parameters like performance, speed, DSP slice, LUTs etc. on a Zedboard. This would help determine the compatibility of these models on a sample Zedboard. 3
  • 17. 2.Fundamentals 2.1 Current Work 2.1.1 Background Convolutional Neural Networks (CNNs) are a type of artificial intelligence that fall within the field of machine learning and are also categorized as a deep learning technique. Neural networks: Inspired by the human brain, neural networks are computational structures composed of interconnected nodes called neurons. These neurons receive and process information from each other, mimicking the way synapses in the brain facilitate communication. This intricate network of connections, numbering in the millions, underlies the complex thought processes and behavior observed in humans and other intelligent beings. Artificial Neural Networks use the way neurons interact - to construct systems in which each of the building blocks (usually referred to as neurons) receives several inputs that are weighed using weights and produces an output that is sent to several other building blocks. Fig 1. shows the hardware architecture of a neuron. Fig.1 Neuron Architecture (Reddy, 2019) 4
  • 18. A neuron receives multiple inputs, such as pixel values or sound data, depending on the application. It multiplies the inputs (say x) with suitable weights (w) and adds bias (b) . The function σ(w⋅x+b)is obtained. Functionality: Neural networks excel at classifying inputs into predetermined categories. This ability stems from assigned weights to each neuron within the network. A crucial step called training determines the specific combination of weights that enables accurate classification. During this phase, the network receives numerous inputs with known outputs, and the weights are adjusted iteratively until an optimal configuration is achieved. Topology: To provide all neurons with a suitable structure for analyzing input data, they can be organized in various ways. In our project, we will focus on networks where neurons are arranged in ordered layers, only receiving input from the preceding layer and sending output to the subsequent one. Consequently, the network's topology is defined by how the layers are interconnected and the operations performed within each layer, often utilizing previously learned weights. Convolutional Neural Networks (CNNs) are a special type of neural networks that are really good at working with 2D data, like images. They are commonly used for tasks like identifying objects in images or labeling scenes. Imagine a 256x256 image with three color channels (RGB). Feeding this pixel data into a conventional neural network would require millions of weights, due to the typical connectivity between neurons across layers. However, CNNs leverage the inherent spatial locality of information in images. For instance, to identify a car in an image, analyzing pixels in the top-right corner isn't crucial. Features like edges, lines, circles, and contours provide enough context. This is where convolutional layers come in. These specialized layers replace fully-connected layers, allowing the network to focus on local information and extract meaningful features. Each convolutional layer receives a stack of images as input and generates another stack as output. These layers utilize small filters (kernels) to scan the input and extract features. These filters, equipped with learned weights, help the network recognize patterns and objects in the images. 5
  • 19. In essence, CNNs employ convolutional layers to efficiently capture key features in images, facilitating accurate image understanding and classification. Convolutional Layer Details: ● Each input layer receives a stack of 2D images (chin) with dimensions hin×win, referred to as input feature maps. ● Each layer outputs a stack of 2D images (chout) with dimensions hout×wout, called output feature maps. ● Each layer utilizes a stack of chin×chout kernels (or 2D filters) with dimensions k×k (typically ranging from 1x1 to 11x1) containing the trained weights. By focusing on local information and utilizing efficient convolutional layers, CNNs achieve exceptional performance in image-related tasks, solidifying their position as a powerful tool for image processing and computer vision applications. Fig.2 Layers in a CNN model (Goodfellow,2016) Activation and Pooling Activation: Each linear activation is then passed through a non-linear activation function. This stage, also known as the "detector stage," introduces non-linearity into the network, allowing it to learn complex relationships between features. A popular choice for the activation function is the rectified linear unit (ReLU), which outputs the input value if it is positive, and zero otherwise. Pooling: This stage further modifies the layer's output by applying a pooling function. Pooling functions summarize the output within a specific neighborhood, often reducing the 6
Activation and Pooling

Activation: Each linear activation is then passed through a non-linear activation function. This stage, also known as the "detector stage," introduces non-linearity into the network, allowing it to learn complex relationships between features. A popular choice for the activation function is the rectified linear unit (ReLU), which outputs the input value if it is positive, and zero otherwise.

Pooling: This stage further modifies the layer's output by applying a pooling function. Pooling functions summarize the output within a specific neighborhood, often reducing the dimensionality of the data. Common pooling functions include:
● Max pooling: Replaces each output with the maximum value within its rectangular neighborhood.
● Average pooling: Replaces each output with the average value within its rectangular neighborhood.
● L2-norm pooling: Replaces each output with the L2 norm of the values within its rectangular neighborhood.
● Weighted average pooling: Replaces each output with a weighted average based on the distance from the central pixel.

By performing these stages sequentially, CNNs extract and learn features from input data, enabling them to perform complex tasks like image recognition and natural language processing [4] (shown in Fig.3 below).

Fig.3: A typical convolutional neural network layer's components (Goodfellow, 2016)
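As a concrete example of the pooling stage, the sketch below applies max pooling with a 2×2 window and stride 2 to a single feature map; the window size is an illustrative choice, not a fixed property of the design.

#include <algorithm>
#include <cstdio>
#include <vector>

// Max pooling with a 2x2 window and stride 2: each output element is the
// maximum of a 2x2 neighborhood, halving both spatial dimensions.
std::vector<std::vector<float>> maxPool2x2(const std::vector<std::vector<float>>& in) {
    int h = in.size() / 2, w = in[0].size() / 2;
    std::vector<std::vector<float>> out(h, std::vector<float>(w));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y][x] = std::max({in[2*y][2*x],   in[2*y][2*x+1],
                                  in[2*y+1][2*x], in[2*y+1][2*x+1]});
    return out;
}

int main() {
    std::vector<std::vector<float>> fm = {{1, 3, 2, 0},
                                          {4, 2, 1, 1},
                                          {0, 1, 5, 6},
                                          {2, 2, 7, 8}};
    for (const auto& row : maxPool2x2(fm)) {
        for (float v : row) std::printf("%g ", v);
        std::printf("\n"); // prints: 4 2 / 2 8
    }
}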
Convolutional networks (ConvNets) can be described using two distinct sets of terminology.

Left-hand View: This perspective treats the ConvNet as a collection of relatively complex layers, each containing multiple "stages." Each kernel tensor directly corresponds to a network layer in this interpretation.

Right-hand View: This perspective presents the ConvNet as a sequence of simpler layers. Every processing step within the network is considered its own individual layer. Consequently, not every "layer" possesses learnable parameters [4].

Practical Convolution

Convolution in the context of neural networks transcends a singular operation. It involves the parallel application of multiple convolutions, leveraging the strength of extracting diverse features across multiple spatial locations. A single kernel can only identify one type of feature, limiting the richness of the extracted information. By employing multiple kernels in parallel, the network extracts a broader spectrum of features, enhancing its representational power.

Neural networks often handle data with a richer structure than mere grids of real values. The input typically consists of "vector-valued observations," where each data point holds additional information beyond a single value. For instance, a color image presents red, green, and blue intensity values at each pixel, creating a 3-dimensional tensor. One index denotes the different channels (red, green, blue), while the other two specify the spatial coordinates within each channel [4].

Software implementations of convolution often employ "batch mode," processing multiple data samples simultaneously. This introduces an additional dimension (the "batch axis") to the tensor, representing different examples within the batch. For clarity, we will disregard the batch axis in our subsequent discussion [4].

A crucial element of convolutional networks is "multi-channel convolution," where both the input and output possess multiple channels. This multi-channel nature introduces an interesting property: the linear operations involved are not guaranteed to be commutative, even with the implementation of "kernel flipping." Commutativity only holds true when each operation involves the same number of input and output channels.

To illustrate these concepts, consider a 3-channel color image as the input to a convolutional layer with multiple kernels. Each kernel extracts a specific type of feature from each channel, resulting in multiple "feature maps." These feature maps, when combined, form the output of the convolution operation [4].

Training a Neural Network

Because training is computationally expensive, there are frameworks and tools available to help with this process. Two popular ones are Caffe and TensorFlow. In this thesis, different frameworks were explored gradually, starting with simpler ones and moving towards more advanced ones, as we had limited prior knowledge. There exist two primary forms of training for neural networks:

1. Full training: In situations where an ample amount of data is accessible, it is possible to train all the network weights to enhance results tailored to the specific application.

2. Transfer learning: Frequently, insufficient data is available to train all the weights from the ground up. In such instances, a prevalent strategy involves employing a pre-trained network designed for a distinct application. The majority of layer weights are repurposed, with only the final layer being adjusted to align with the requirements of the new application.
2.1.2 FPGAs

Field-Programmable Gate Arrays (FPGAs) are a type of integrated circuit that can be reprogrammed and reconfigured countless times after they have been manufactured. These devices form the foundation of reconfigurable computing, a computing approach that emphasizes splitting applications into parallel, application-specific pipelines. FPGAs have reconfigurable logic resources like LUTs (Look-Up Tables), DSP (Digital Signal Processing) slices, and BRAMs (Block RAMs). These resources can be connected and configured in various ways, allowing the implementation of different electronic circuits. The allure of reconfigurable computing lies in its ability to merge the rapidity of hardware with the adaptability of software, essentially bringing together the most advantageous features of both.

Harnessing the computational power of FPGAs takes a leap forward with distributed computing. This strategy clusters FPGAs, dividing problems into smaller tasks for parallel processing. Working as a team, this distributed network unlocks significant performance gains through parallelization. This approach offers key benefits:

● Scalability: Easily add FPGAs to the cluster as computational demands grow.
● Efficiency: Shared resources and coordinated tasks optimize resource utilization.
● Flexibility: Adapt and optimize the configuration to meet specific needs.
● Performance: Parallelization boosts processing speed for quicker results.

Distributed FPGAs hold promise in various fields:

● HPC: Solve complex scientific and engineering problems faster.
● AI: Train and deploy AI models with the necessary power and scalability.
● Real-Time Applications: Meet the demanding requirements of latency-sensitive fields like robotics and autonomous systems.
High-Level Synthesis and FPGAs

For over three decades, engineers have relied on Hardware Description Languages (HDLs) to design electronic circuits implemented in FPGAs. This approach, while established, requires a significant investment of time and expertise. Writing detailed descriptions of each hardware component can be tedious and demands a deep understanding of the underlying hardware structure.

However, a promising paradigm shift has emerged in recent years: High-Level Synthesis (HLS). This approach leverages the familiarity and convenience of high-level languages like C to design hardware. Dedicated tools then translate this high-level code into an equivalent hardware description at a lower level of abstraction, known as Register Transfer Level (RTL).

Several compelling advantages make HLS an increasingly attractive choice for hardware design:

● Maturity and Stability: HLS tools have evolved significantly, offering improved reliability and a clearer understanding of the generated hardware behavior.
● Efficiency and Performance: HLS can often produce hardware that rivals, or even surpasses, the efficiency achieved by manually crafted HDL code. This efficiency gain, combined with the significantly faster development cycle, makes HLS a compelling option.

Given these benefits, HLS has been chosen as the technology of choice for this thesis, paving the way for a more efficient and accessible approach to FPGA design.

2.2 Literature Review

Deep Learning

Deep learning utilizes artificial neural networks, inspired by the human brain, to perform machine learning tasks.
These networks consist of multiple layers organized hierarchically, enabling them to learn complex patterns from data. Each layer progressively builds upon the knowledge acquired by the previous layer. The initial layers extract fundamental features, like edges or lines, from the input data. Subsequent layers combine these basic features into more complex shapes and objects, culminating in the identification of the desired target.

Imagine training a deep learning model to recognize hands in images. The initial layers would learn to detect edges and lines, the building blocks of shapes. Moving up the hierarchy, the network would combine these basic elements into more complex features, like ovals and rectangles, which could represent fingers, palms, and wrists. Finally, the topmost layers would recognize these combined features as specific to hands, allowing the network to differentiate them from other objects.

While focusing on hand identification, the network simultaneously learns about other objects present in the training data. This allows it to generalize its knowledge and apply it to other contexts, recognizing hands in diverse environments and situations. This hierarchical learning process, where simple features are gradually combined to form complex representations, is the core of deep learning's success. It allows the network to handle complex tasks, making it a powerful tool for various applications.

Derived from the SqueezeNet topology [8], initially designed for embedded systems, ZynqNet is tailored to be FPGA-friendly through modifications made during development. The topology comprises an initial convolutional layer, 8 identical fire modules (each containing 3 convolutional layers), and a final classification layer. Notably, efforts were made to align hyperparameters with power-of-two values. Key points of improvement in this thesis include:

1. HW Definition: ZynqNet's original hardware accelerator is only partially implemented on the Xilinx Zynq board [6], working closely with an ARM processor. In contrast, the presented accelerator is fully hardware-designed, adapting to runtime layer variations without software intervention.

2. Fixed Point: To mitigate FPGA overhead, fixed-point computations replace the 32-bit floating-point implementation used in ZynqNet.
The Ristretto tool [7] guides the choice of bit width and fractional bits, with manual fine-tuning applied afterwards.

3. Data vs. Mem: Significant size and memory reductions are obtained by reducing the number of classification items and employing 8-bit fixed-point weights. This optimization simplifies the system, eliminating external memory access and prioritizing computation speed over memory volume in the accelerator.

2.2.1 AlexNet

Introduced in 2012, AlexNet is a pioneering Deep Learning architecture developed using the ImageNet database. Its authors trained a deep convolutional neural network on 1.2 million high-resolution images, each with dimensions of 224x224 RGB pixels (Li, F., et al., 2017). Achieving a top-1 error rate of 37.5% and a top-5 error rate of 17.0%, the network comprised 60 million parameters, 650,000 neurons, five convolutional layers followed by ReLU and max pool layers, three fully connected layers, and a 1000-way softmax classifier [9]. The architecture, illustrated below, marked the first use of a rectified linear unit as an activation layer, deviating from the conventional sigmoid activation function. This groundbreaking implementation secured victory in the ImageNet LSVRC-2012 competition. The entire network was trained on two GTX 580 GPUs.

Fig.4: Visual representation of AlexNet architecture. The illustration shows the layers used and their interconnectivity. (Krizhevsky, 2012)
Fig.5: Visual representation of AlexNet architecture. The illustration shows the layers used and their interconnectivity. (Li, F., et al., 2017)

2.2.2 VGGNet

In 2014, Karen Simonyan and Andrew Zisserman, researchers at the University of Oxford's Visual Geometry Group, introduced VGGNet, a groundbreaking architecture that significantly improved upon the capabilities of its predecessor, AlexNet. VGGNet's key innovation was its increased depth, achieved by adding more convolutional layers. These layers utilized smaller receptive fields, primarily 3x3 and 1x1 filters, enabling the network to extract more detailed and nuanced features from the input images.

Simonyan and Zisserman tested various configurations of their network, all adhering to a general design but differing in depth. They experimented with 11, 13, 16, and 19 weight layers, with each depth further divided into sub-configurations. Among these configurations, VGG16 and VGG19 emerged as the top performers. VGG16 achieved a top-1 error rate of 27.3% and a top-5 error rate of 8.1%. VGG19, with its increased depth, further improved upon these results, achieving a top-1 error rate of 25.5% and a top-5 error rate of 8.0% [10].

As expected, the increased depth of VGG16 and VGG19 led to a significant rise in the number of parameters. VGG16 has 138 million parameters, while VGG19 possesses an even larger 144 million parameters.
Figure 6 provides a visual comparison of VGG16, VGG19, and their predecessor AlexNet, highlighting the significant architectural advancements made by VGGNet. This innovative architecture ultimately led to its victory in the 2014 ImageNet LSVRC challenge, solidifying its place as a landmark achievement in the field of deep learning.

Fig.6: Visual representation of VGGNet architecture and AlexNet (right). The illustration shows the layers used and their interconnectivity. (Li, F., et al., 2017)

2.2.3 ResNet

In 2015, a team from Microsoft, including Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, developed the ResNet architecture as an enhancement to VGGNet. Recognizing the importance of network depth for accuracy, they addressed the "vanishing gradient" problem during backpropagation by introducing "deep residual learning." This novel framework incorporated "shortcut connections," hypothesized to simplify training and optimization while overcoming the gradient issue (He et al., 2015; Li, F., et al., 2017).
Fig.7: Residual learning: a building block of the ResNet architecture. (He et al., 2015)

For their experimentation, they constructed a 34-layer plain network with no shortcut connections and a 34-layer network with shortcut connections, a ResNet. They also configured several networks with incrementally increasing layer counts, from 34 layers up to 152 layers. Overall, the 34-layer ResNet outperformed the 34-layer plain network, and the average error rate achieved by the 152-layer network in the 2015 ImageNet LSVRC competition was 3.57%. This network architecture won the 2015 ImageNet LSVRC challenge (Li, F., et al., 2017). Fig.8 represents the building block of the ResNet architecture.
Fig.8: Residual learning: a building block of the ResNet architecture. [11]
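The shortcut connection in Fig.7 and Fig.8 amounts to computing F(x) + x. The sketch below illustrates only that structure; F is a toy placeholder standing in for the block's convolutional layers.

#include <algorithm>
#include <cstdio>
#include <vector>

// Residual block: output = ReLU(F(x) + x), where the shortcut connection
// adds the block input x to the output of the weight layers F.
std::vector<float> relu(std::vector<float> v) {
    for (float& e : v) e = std::max(0.0f, e);
    return v;
}

// Toy placeholder for F(x); in ResNet this is two or three conv layers.
std::vector<float> F(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = 0.5f * x[i] - 0.1f;
    return y;
}

std::vector<float> residualBlock(const std::vector<float>& x) {
    std::vector<float> fx = F(x);
    for (std::size_t i = 0; i < x.size(); ++i) fx[i] += x[i]; // shortcut connection
    return relu(fx); // final ReLU after the addition
}

int main() {
    std::vector<float> x = {1.0f, -2.0f, 0.5f};
    for (float v : residualBlock(x)) std::printf("%g ", v); // 1.4 0 0.65
}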
2.2.4 MobileNet

MobileNets, originally developed by Google for mobile and embedded vision applications [12], are distinguished by their use of depth-wise separable convolutions, which reduce the number of trainable parameters compared to networks with regular convolutions of the same depth. MobileNetV2 introduced linear bottlenecks and inverted residuals, resulting in lightweight deep neural networks that are ideal for the scenario under consideration in this work.

Fig.9: Visual representation of MobileNet architecture
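To see why depth-wise separable convolutions reduce trainable parameters, the sketch below compares the weight counts (ignoring biases) of a standard k×k convolution and its depth-wise separable counterpart, a k×k depthwise filter per input channel followed by a 1×1 pointwise convolution; the layer sizes are hypothetical.

#include <cstdio>

// Standard conv: every output channel has a k x k filter per input channel.
long standardParams(long k, long ch_in, long ch_out) {
    return k * k * ch_in * ch_out;
}

// Depthwise separable: one k x k filter per input channel (depthwise),
// then a 1x1 convolution mixing channels (pointwise).
long separableParams(long k, long ch_in, long ch_out) {
    return k * k * ch_in + ch_in * ch_out;
}

int main() {
    long k = 3, ch_in = 128, ch_out = 128; // illustrative layer sizes
    std::printf("standard:  %ld\n", standardParams(k, ch_in, ch_out));  // 147456
    std::printf("separable: %ld\n", separableParams(k, ch_in, ch_out)); // 17536
}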
2.2.5 Other work

Several research groups have explored implementing Convolutional Neural Networks (CNNs) on FPGAs, achieving impressive results in terms of performance and efficiency. Here is a summary of five notable works:

1. Real-Time Video Object Recognition System (Neurocoms, South Korea, 2015)
● Architecture: Custom 5-layer CNN developed in Matlab.
● Input: Grayscale images (28x28).
● Platform: Xilinx KC705 evaluation board.
● Frequency: 250MHz.
● Power consumption: 3.1 watts.
● Resource utilization: 42,616 LUTs, 32 BRAMs, 326 DSP48s.
● Data format: 16-bit fixed point.
● Performance: Focused on frames per second.

Figure 10: Neurocoms work using 6 neurons and 2 receptor units (Ahn, B., 2015)

This paper describes a real-time video object recognition system implemented on an FPGA. The system consists of a receiver, a feature map, and a detector. The receiver decodes and pre-processes the video stream, the feature map extracts features using a CNN, and the detector identifies objects by comparing the features to a database.

Key takeaways:
● Real-time performance
● Very efficient (3.1 watts power consumption)
● FPGA implementation enables high performance

2. Small CNN Implementation (Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China, 2015)
● Architecture: 3 convolutional layers with activation, 2 pooling layers, 1 softmax classifier.
● Input: 32x32 images.
● Platform: Altera Arria V FPGA board.
● Frequency: 50MHz.
● Data format: 8-bit fixed point.
● Performance: Focused on images per second.
Figure 11: Chinese Academy logic architecture (Li, H. et al., 2015)

This paper presents a small CNN implementation on an FPGA. The CNN consists of three convolutional layers with activation, two pooling layers, and a softmax classifier. The input images are 32x32 and the data format is 8-bit fixed point. The CNN is implemented on an Altera Arria V FPGA board and operates at a frequency of 50MHz.

Key takeaways:
● The CNN achieves a rate of 50 frames per second, which is sufficient for real-time video processing.
● The CNN achieves an accuracy of 93.6% on the MNIST handwritten digit classification task.
● The CNN uses 118K LUTs, 112K BRAMs, and 13K DSPs.

3. Angel-Eye System (Tsinghua University and Stanford University, 2016)
● Architecture: Array of custom processing elements.
● Platform: Xilinx Zynq XC7Z045.
● Frequency: 150MHz.
● Data format: 16-bit fixed point.
● Power consumption: 9.63 watts.
● Performance: 187.80 GFLOPS (VGG16 ConvNet).
● Custom compiler: Minimizes external memory access.
Figure 12: Angel-Eye. (Left) Angel-Eye architecture. (Right) Processing element. (Guo, K. et al., 2016)

4. Customized Software Tools for CNN Accelerator (Purdue University, 2016)
● Platform: Xilinx Kintex-7 XC7K325T.
● Performance: 58-115 GFLOPS.
● Architecture: Custom software tools for optimization.
● Data format: Not specified.

5. Scalable FPGA Implementation of CNN (Arizona State University, 2016)
● Platform: Stratix-V GXA7.
● Frequency: 100MHz.
● Data format: 16-bit fixed point.
● Power consumption: 19.5 watts.
● Performance: 114.5 GFLOPS.
● Resource utilization: 256 DSPs, 112K LUTs, 2,330 BRAMs.
● Shared multiplier bank: Optimizes multiplication operations.

So far, one major challenge in deploying Deep Learning (DL) models on FPGAs has been their limited design size. The inherent trade-off between reconfigurability and density restricts the implementation of large neural networks on FPGAs. However, advancements in fabrication technology, particularly the use of smaller feature sizes, are enabling denser FPGAs. Additionally, the integration of specialized computational units alongside the general FPGA fabric enhances processing capabilities. These advancements are paving the way for the implementation of complex DL models on single FPGA systems, opening up new possibilities for hardware-accelerated AI.
3. Design and Implementation

3.1 Design

Let's delve deeper into the individual layers of a convolutional neural network:

1. Input: The network begins with the input image, typically represented as a 3D matrix with dimensions representing width, height, and color channels (e.g., RGB). In this case, the image size is 32x32 pixels with three color channels.

2. Convolutional Layer: This layer applies filters to the input image, extracting features through localized dot product calculations. Applying 12 filters results in a new 3D volume with dimensions 32x32x12, where each element represents the activation of a specific feature at a specific location.

3. ReLU Layer: The rectified linear unit (ReLU) layer applies a non-linear activation function, typically max(0, x), to each element in the previous volume. This introduces non-linearity and sparsity into the feature representation, enhancing the network's ability to learn complex patterns. The volume size remains unchanged (32x32x12).

4. Max Pooling Layer: This layer performs downsampling by selecting the maximum value within a predefined neighborhood in the input volume. By reducing the spatial dimensions (e.g., by a factor of 2), the network can achieve translational invariance and reduce computational complexity. In this case, the resulting volume is 16x16x12.

5. Affine/Fully Connected Layer: This layer connects all neurons in the previous volume to each output neuron, essentially performing a weighted sum followed by a bias addition. This final step calculates the class scores for each possible category, resulting in a 1x1x10 volume where each element represents the score for a specific class.

Sequential Processing and Parameter Learning: Convolutional Neural Networks transform the input image through a series of layers, gradually extracting features and building increasingly complex representations. While some layers like ReLU and Max Pooling operate with fixed functions, others like Convolutional and Fully Connected layers involve trainable parameters (weights and biases). These parameters are adjusted through gradient descent optimization during training, allowing the network to learn optimal representations based on labeled data.
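A dimension-only sketch of the five layers just described is shown below; the convolutional layer is assumed to use "same" padding so that the 32x32 spatial size is preserved, as in the text.

#include <cstdio>

// Dimension bookkeeping for the example pipeline: 32x32x3 input, 12 conv
// filters ("same" padding assumed), ReLU, 2x2 max pool, 10-class affine layer.
struct Shape { int h, w, c; };

Shape convLayer(Shape s, int filters)   { return {s.h, s.w, filters}; } // padding keeps h, w
Shape reluLayer(Shape s)                { return s; }                   // elementwise: shape unchanged
Shape maxPoolLayer(Shape s, int factor) { return {s.h / factor, s.w / factor, s.c}; }
Shape affineLayer(Shape, int classes)   { return {1, 1, classes}; }     // flattens to class scores

int main() {
    Shape s{32, 32, 3};
    s = convLayer(s, 12);   // 32x32x12
    s = reluLayer(s);       // 32x32x12
    s = maxPoolLayer(s, 2); // 16x16x12
    s = affineLayer(s, 10); // 1x1x10
    std::printf("%dx%dx%d\n", s.h, s.w, s.c);
}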
3.2 Implementation

Having covered the foundational aspects of Deep Learning and reviewed prominent Deep Convolutional Neural Network architectures, along with their implementations on FPGA, let's delve into the specifics of this design. This section outlines the implementation of Deep Convolutional Neural Networks in FPGA, discussing similarities to and distinctions from prior works, design goals, and the tools employed. Following this, we provide an overview of the overall architecture intended for implementation on the FPGA. Due to a focus on hardware implementation and constraints in time, along with the availability of pre-existing, trained image system data for CNN, code was sourced from the internet. Finally, we comprehensively examine four key sub-designs: the Convolutional/Affine Layer, ReLU Layer, Max Pooling Layer, and Softmax Layer.

Similarities

In scrutinizing previous works where groups implemented DCNNs on FPGAs, numerous similarities emerge between their implementations and the present work. Certain aspects of DCNNs are inherently common across designs aimed at accelerating DCNNs. Consequently, essential elements like required layers (e.g., convolution, ReLU, max pool) and adder trees for summing channel products will not be explicitly discussed in this section.

Bus Protocol

Firstly, prior works showcase designs employing sub-module intercommunication. Several designs that utilized separate sub-modules in their overall architecture employed a communication bus protocol. This approach leverages existing intellectual property from FPGA manufacturers such as Intel or AMD, allowing the focus to be on the DCNN portion of the task rather than the infrastructure. Additionally, hardware microprocessors or implemented co-processors can communicate with the submodules, providing valuable insights for both software and hardware developers during debugging and verification.
The drawback, however, is that a bus protocol introduces additional overhead to the design due to handshaking between sub-modules for reliable communication. Moreover, the presence of the bus protocol necessitates more signal routing, consuming overall FPGA resources and potentially leading to increased dwell time with no task being executed. Despite these drawbacks, effective management can be achieved by carefully planning the overall design's concept of operations.

DSP Slices

A prevalent aspect shared among prior works and the present study involves the utilization of Digital Signal Processing (DSP) slices. These dedicated hardware components excel at performing multiply and add operations for both floating-point and fixed-precision numbers, outperforming custom designs implemented in a hardware description language (HDL). FPGAs benefit from maximizing the available DSP slices, enhancing the speed of designs, especially in Deep Convolutional Neural Networks (DCNNs).

Data Format

In the software domain, Deep Learning research employs 64-bit double-precision floating-point signed numbers for weight data. While some works have employed 32-bit single-precision numbers, there is mounting evidence suggesting that reducing the bit size and format can significantly impact overall performance. A common alteration is the use of 16-bit fixed-precision numbers. Alternatively, truncating the 32-bit single-precision number to a 16-bit "half" precision number is proposed, presenting a potentially more effective design.

Scalability

Scalability, a crucial feature in previous works and this study, revolves around navigating through the CNN. As witnessed in other works, the increasing size of software implementations of DCNNs, exemplified by the 152-layer ResNet design, poses a challenge for FPGA implementation. To address this, strategies involve implementing reusable designs capable of performing the functions of all necessary layers in the DCNN architecture.

Simple Interface

Unlike many previous works, considerable effort has been invested in creating a custom compiler to completely describe a Deep Convolutional Neural Network in this design. The aim is to make the DCNN accessible to both software and hardware designers by making FPGA hardware programmable.
The FPGA can be commanded through function calls in the microprocessor, performing register writes to the FPGA implementation.

Flexible Design

Unlike prior works where CNN designs are tailored to specific hardware boards, this work aims for a configurable number of DSPs depending on the FPGA in use. Each layer in the CNN is modular and can interact through a bus protocol, allowing developers to insert multiple instances of the Convolutional Layer, Affine Layer, and Max Pooling Layer.

Tools

Throughout the development process of implementing a CNN on an FPGA, various tools were employed. The choice of Xilinx chips was influenced by their extensive usage and the author's prior experience with Xilinx products. Consequently, the tools selected for this development were drawn from the diverse set offered by AMD Xilinx. The central design environment was the Xilinx Vivado 2021.3 package (refer to Figure 13), serving as the primary design hub throughout the development phase. Within Vivado, each neural network layer type was crafted as an AXI-capable submodule. Additionally, Vivado facilitated integration with pre-existing Xilinx Intellectual Property (IP), such as the Zynq SoC co-processor and the Memory Interface Generator (MIG). Lastly, Vivado acted as a platform for software development, enabling the creation of straightforward software to run on the Zynq SoC.

Fig.13: Xilinx Vivado 2021.3 IDE

Hardware: The FPGA chosen was a Zedboard. Digilent's Zedboard Development Kit consists of a matrix of programmable logic blocks and programmable interconnections. The Zedboard is built around a Xilinx Zynq-7000 SoC, which combines a dual-core ARM Cortex-A9 processor with FPGA fabric.
Fig.14: Digilent Zedboard Avnet AES-series Evaluation Kit, Zynq-7000 System-on-Chip (SoC) (www.digilent.com)

Table 1: Specifications for Zedboard (www.digilent.com)

SPECIFICATION        | DESCRIPTION
SoC options          | XC7Z020-CLG484-1
Memory               | 512 MB DDR3; 256 Mb Quad-SPI Flash
Video display        | 1080p HDMI; 8-bit VGA; 128x32 OLED
User inputs          | 8 user switches and 7 user push buttons
Audio                | I2S audio CODEC
Analog               | XADC header
Configuration        | Onboard USB JTAG
Configuration memory | 256 Mb Quad-SPI Flash; SD card
Power                | 12 VDC
Certification        | CE; RoHS
Dimensions           | 5.3" x 6.3"
Ethernet             | 10/100/1000 Ethernet
USB                  | USB 2.0
Communications       | USB 2.0; USB-UART; 10/100/1000 Ethernet
User I/O             | (see User Inputs)
Other                | PetaLinux BSP
4. Hardware Implementation

4.1 Design Methodology

In extensive code projects with multiple instances and increasing complexity, defining the order and scope of steps, along with how they will be executed, is crucial; together these comprise the project's methodology.

4.2 HLS Methodology

As detailed in Section 2.2, High-Level Synthesis (HLS) is chosen for hardware implementation due to its suitability. Xilinx® Vivado HLS is employed in this project, following its three-step methodology:

1. Software Simulation: Testing code execution using a regular software compiler and CPU, aided by a test bench.

2. Synthesis: Generating the HDL files from the code and its HLS pragmas. This critical step is executed after a successful software simulation.

3. Co-Simulation: The most significant step, testing the synthesized code's functionality using a hardware simulation. It leverages the test bench from the software simulation, comparing outputs and ensuring hardware-software consistency.

Table 2: Simulation Model vs Hardware Implementation

Layer   | SIM FOPs  | HW FOPs   | Diff
CONV1   | 0.7407 G  | 0.73530 G | 0.74%
CONV2   | 126.897 M | 113.796 M | 12.89%
CONV3   | 35.158 M  | 29.106 M  | 27.66%
CONV4   | 26.645 M  | 20.830 M  | 27.91%
CONV5   | 26.574 M  | 20.763 M  | 27.99%
AFFINE1 | 176.322 M | 113.884 M | 54.83%
AFFINE2 | 87.677 M  | 38.077 M  | 130.26%
AFFINE3 | 33.919 M  | 20.229 M  | 83.23%
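As a toy illustration of step 1, the sketch below shows a small multiply-accumulate kernel written in synthesizable C style with an HLS pipeline pragma, exercised by a plain software test bench. It is not the thesis accelerator; an ordinary C++ compiler simply ignores the pragma, which is exactly what makes software simulation on a CPU possible.

#include <cstdio>

// Toy multiply-accumulate kernel in synthesizable C style. The pragma asks
// Vivado HLS to pipeline the loop; a regular compiler ignores it.
void vecMac(const float a[8], const float b[8], float* result) {
    float acc = 0.0f;
    for (int i = 0; i < 8; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i]; // multiply-accumulate, mapped to DSP slices in hardware
    }
    *result = acc;
}

int main() { // software test bench: compare against a golden result
    float a[8], b[8], golden = 0.0f, hw = 0.0f;
    for (int i = 0; i < 8; ++i) { a[i] = (float)i; b[i] = 2.0f * i; golden += a[i] * b[i]; }
    vecMac(a, b, &hw);
    std::printf(hw == golden ? "PASS\n" : "FAIL\n");
}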
4.3 Design Overview

Fig.15: Final Top-Level FPGA Design

Before delving into the pipelined core and other enhancements, understanding the module's top-level functionality is vital. The system comprises three modules and a group of memories:

1. Pipelined Core: This module serves as the computational powerhouse, receiving layer parameters, weight information, and input data from the Flow Control module. It executes the necessary calculations and generates the desired outputs.

2. Convolution Flow Control: This module acts as the conductor, ensuring the proper execution of the network topology. It determines whether update or classification tasks are required and orchestrates access to all memory units and relevant layer parameters.

3. Memory Controller: This module acts as the memory interface, deciphering read/write positions for data exchange with the memory units. It receives instructions from both the Flow Control and Pipelined Core modules, ensuring smooth data flow and efficient memory utilization.
By understanding the interactions and responsibilities of these modules, we gain a clear picture of how the system operates as a whole. This high-level perspective provides a valuable foundation for delving deeper into the specific details of the individual components and their contributions to the overall system performance.

4.4 Caching Strategy

An organized loop order and data-reuse optimization allow reused information to be stored locally, avoiding the overhead of repeated on-chip memory accesses. Caches are needed for the kernels and biases, the outputs, and the inputs.

1. Kernel and Bias Caches: The simplest caches, loaded at the beginning and updated during channel changes.

2. Output Cache: More complex due to its irregular access pattern; it loads the bias and computes the ReLU to maximize performance.

3. Input Cache: The most complex, addressing reuse issues with a group of multiple registers that shift the information every iteration.

Memory Controller: Arrays Merging

Adapting access patterns between layers and facilitating simultaneous access to multiple elements are essential for varying memory requirements.

Fixed-Point Implementation

Following Ristretto's fixed-point analysis of the network, the bit width and fractional bits are defined, and Xilinx® Vivado HLS's fixed-point arithmetic type definition (ap_fixed<bit width, frac bits>) is used. Since Vivado HLS requires a compile-time definition of the fractional bits, runtime reconfiguration is managed using integers and bit shifts for the fixed-point operations.
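A minimal sketch of this integer-and-bit-shift approach is shown below; the Q8.8 split (8 integer bits, 8 fractional bits) is an illustrative assumption, as the actual widths follow Ristretto's analysis.

#include <cstdint>
#include <cstdio>

// Fixed-point arithmetic with integers and bit shifts: a value x is stored
// as round(x * 2^frac) in a 16-bit word. An arithmetic right shift is
// assumed for negative products, which holds on common toolchains.
int16_t toFixed(float x, int frac) {
    return (int16_t)(x * (1 << frac) + (x >= 0 ? 0.5f : -0.5f)); // round to nearest
}
float toFloat(int16_t q, int frac) { return (float)q / (1 << frac); }

// The 32-bit product of two fixed-point numbers carries 2*frac fractional
// bits; shifting right by frac restores the original format.
int16_t mulFixed(int16_t a, int16_t b, int frac) {
    int32_t prod = (int32_t)a * (int32_t)b;
    return (int16_t)(prod >> frac);
}

int main() {
    int frac = 8; // Q8.8: 8 integer bits, 8 fractional bits
    int16_t a = toFixed(1.5f, frac), b = toFixed(-2.25f, frac);
    std::printf("%f\n", toFloat(mulFixed(a, b, frac), frac)); // -3.375
}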
Following is a detailed explanation of the four test benches shown in the diagrams:

1. Convolutional/Affine Layer Virtual Memory Test Bench

This test bench verifies the functionality of the convolutional/affine layer implementation by comparing its outputs to the expected outputs generated by a reference software model. The test bench loads the input and kernel data into virtual memory and then performs the convolution/affine operations. The outputs are then compared to the expected outputs to ensure that the implementation is correct.

2. Convolutional/Affine Layer Block RAM Test Bench

This test bench is similar to the virtual memory test bench, but it stores the input and kernel data in block RAM instead of virtual memory. This test bench is useful for verifying the performance of the convolutional/affine layer implementation, as it can achieve higher throughput by avoiding the overhead of accessing virtual memory.

3. Max Pool Layer Virtual Memory Test Bench

This test bench verifies the functionality of the max pool layer implementation by comparing its outputs to the expected outputs generated by a reference software model. The test bench loads the input data into virtual memory and then performs the max pooling operation. The outputs are then compared to the expected outputs to ensure that the implementation is correct.

4. Max Pool Layer Block RAM Test Bench

This test bench is similar to the virtual memory test bench, but it stores the input data in block RAM instead of virtual memory. This test bench is useful for verifying the performance of the max pool layer implementation, as it can achieve higher throughput by avoiding the overhead of accessing virtual memory.

The diagram shows the four test benches connected to a common input and output interface. This allows the test benches to be easily swapped in and out, depending on the layer being tested.
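The comparison logic shared by all four test benches can be sketched as follows, before describing each interface component in turn; the two functions below are placeholders standing in for the reference software model and the layer under test.

#include <cmath>
#include <cstdio>

// Shared test-bench pattern: run the layer under test and a reference
// software model on the same input, then check every output element.
const int N = 16;

void referenceModel(const float in[], float out[]) { for (int i = 0; i < N; ++i) out[i] = 2.0f * in[i]; }
void layerUnderTest(const float in[], float out[]) { for (int i = 0; i < N; ++i) out[i] = 2.0f * in[i]; }

int main() {
    float in[N], expected[N], got[N];
    for (int i = 0; i < N; ++i) in[i] = 0.1f * i; // load input data
    referenceModel(in, expected);                 // expected outputs
    layerUnderTest(in, got);                      // outputs of the implementation
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (std::fabs(expected[i] - got[i]) > 1e-5f) ++mismatches;
    if (mismatches == 0) std::printf("PASS\n");
    else std::printf("FAIL: %d mismatches\n", mismatches);
}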
Input and Output Interface: This interface provides a common way to load input data into the test benches and to read the output data from them. The interface can be implemented using a variety of methods, such as FIFO buffers, DMA transfers, or direct memory access.

Virtual Memory: Virtual memory is used to store the input and kernel data for the convolutional/affine layer and max pool layer virtual memory test benches. Virtual memory allows the test benches to access large amounts of data without having to load it all into physical memory at once.

Block RAM: Block RAM is used to store the input data for the convolutional/affine layer and max pool layer block RAM test benches. Block RAM is a type of on-chip memory that is faster than virtual memory, but it has a limited capacity.

Test Bench Control Logic: The test bench control logic is responsible for loading the input and kernel data into the test benches, performing the convolution/affine or max pooling operations, and comparing the outputs to the expected outputs. The control logic can be implemented using a variety of methods, such as a finite state machine, a microcontroller, or a software program.

The four test benches described above are essential tools for verifying the functionality and performance of convolutional neural network implementations on FPGAs. By using these test benches, designers can ensure that their implementations are correct and that they meet the desired performance requirements.

Performance Evaluation and Analysis

After implementing all optimization techniques, the accelerator was ready to classify images using trained network weights. To simulate the hardware behavior, Xilinx® Vivado HLS co-simulation was employed. Images from the validation dataset, which achieved 73% accuracy with Ristretto, were evaluated. The simulation process, spanning over 185 hours, resulted in an overall 58% accuracy, requiring 26 million cycles per image.
With a relatively small critical path, a 100MHz clock can be utilized; at 26 million cycles per image, this corresponds to 0.26 seconds per image, enabling the processing of approximately 4 frames per second. These results are deemed successful, as the achieved accuracy meets the project's minimum threshold, and the performance surpasses the lower limit by nearly fourfold. Consequently, no further modifications are required, and the accelerator is prepared for deployment.

Table 3: Resource Utilization of Final Design (AlexNet)

Resource | Utilization | Available | Utilization %
LUT      | 36527       | 53200     | 68.66
LUTRAM   | 2594        | 46200     | 5.61
FF       | 41198       | 106400    | 38.72
BRAM     | 54          | 140       | 38.22
DSP      | 35          | 220       | 16.08
IO       | 69          | 285       | 24.21
BUFG     | 7           | 32        | 21.88
MMCM     | 2           | 10        | 20
PLL      | 1           | 10        | 10

Resource Utilization Optimization

While the accelerator described in Section 3.2 is functional and implementable, the pipelined core's low resource footprint (35 DSPs, 41,000 flip-flops, and 36,500 LUTs) allows for potential modifications or duplications to reduce the pipeline depth. This situation is particularly suited to HLS optimization, as it can sometimes surpass human design capabilities (see Table 3).

Initially, Vivado generated two core instances with a 4-stage pipeline, requiring 26,596,261 cycles, due to the different memory inputs. To improve this design, various configurations were explored using the function_instantiate pragma, creating four core instances. By sharing resources effectively, only 15% more DSPs, 27% more flip-flops, and 33% more LUTs were utilized compared to the double-mode core implementation. This configuration enabled reducing two of the four pipelines by one stage each. However, the modification resulted in a negligible 0.2% performance improvement, ultimately leading to its rejection.

Here are some parameters for comparing different CNN implementations on FPGA:
● Throughput: The number of inputs that can be processed per unit time. It is an important measure of the performance of a CNN implementation on an FPGA, measured here through FOPS (floating-point operations per second).
● Latency: The time taken by the CNN to process one input.
● Resource utilization: An important measure of the efficiency of a CNN implementation on an FPGA.
● Power consumption: A crucial measure of the energy efficiency of a CNN implementation on an FPGA.
● Accuracy: An important measure of the effectiveness of a CNN implementation on an FPGA.
● Flexibility: The ability of the CNN implementation to adapt to different CNN models and configurations.
● Ease of use

Table 4: Hardware execution times of each AlexNet layer

Layer   | Start Time | End Time                                  | Total Time    | FOPS
CONV1   | 0          | 71198.67 us (epoch 0x1161e, cycle 0x43)   | 71.19867 ms   | 0.7456 G
CONV2   | 0          | 547753.71 us (epoch 0x85ba9, cycle 0x47)  | 547.75371 ms  | 108.806 M
CONV3   | 0          | 463776.90 us (epoch 0x713a0, cycle 0x5a)  | 463.77690 ms  | 24.858 M
CONV4   | 0          | 697862.14 us (epoch 0xaa606, cycle 0x0e)  | 697.86214 ms  | 16.551 M
CONV5   | 0          | 466757.25 us (epoch 0x71f45, cycle 0x19)  | 466.75725 ms  | 16.543 M
AFFINE1 | 0          | 796440.32 us (epoch 0xc2718, cycle 0x20)  | 796.44032 ms  | 110.922 M
AFFINE2 | 0          | 1018890.52 us (epoch 0xf8c0a, cycle 0x34) | 1018.89052 ms | 33.446 M
AFFINE3 | 0          | 4682.26 us (epoch 0x124a, cycle 0x1a)     | 4.68226 ms    | 17.769 M
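The first two metrics relate directly: FOPS divides a layer's floating-point operation count by its execution time, while latency is that execution time itself. A small sketch, with hypothetical numbers:

#include <cstdio>

// Relating the metrics above: latency is the time to process one input;
// FOPS divides the floating-point operation count by the execution time.
// Both numbers below are hypothetical, for illustration only.
int main() {
    double ops = 2.1e8;       // floating-point operations in one pass (assumed)
    double latency = 71.2e-3; // seconds per input (assumed)
    std::printf("FOPS: %.3g\n", ops / latency);                // ~2.95e9
    std::printf("Throughput: %.2f inputs/s\n", 1.0 / latency); // ~14.04
}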
Table 4 shows that the convolutional layers (CONV1-CONV5) are the most time-consuming group of layers in the network, accounting for more than half of the total execution time. This is because convolutional layers perform a large number of floating-point operations. The fully connected layers (AFFINE1-AFFINE3) also account for a significant portion of the total execution time: they too perform a large number of floating-point operations, and they require more memory bandwidth.

The table also shows that, broadly, the FOPS of each layer is inversely related to its execution time: the layers with the longest execution times tend to have the lowest FOPS.

Overall, the table provides insight into the performance of the AlexNet CNN when implemented on an FPGA.

Here are some specific observations from the table:
● Among the convolutional layers, CONV1 has the shortest execution time (71.19867 ms) and the highest FOPS (0.7456 G), while CONV4 has the longest execution time (697.86214 ms).
● AFFINE2 has the longest execution time of any layer, at 1018.89052 ms.
● CONV5 has the lowest FOPS, at 16.543 M.
● AFFINE3 has by far the shortest execution time (4.68226 ms), as it performs the fewest operations (see Table 5).

Summing the per-layer times in Table 4 gives a total execution time of roughly 4.07 seconds when the layers are executed back to back, i.e., approximately 0.25 frames per second for this layer-by-layer measurement.

Table 5: AlexNet vs MobileNet

Layer   | Ops To Perform | AlexNet FOPS | MobileNet FOPS | Difference
CONV1   | 210249696      | 0.7407 G     | 2.9530 G       | 34878.85
CONV2   | 62332672       | 126.897 M    | 113.796 M      | 287.09
CONV3   | 13498752       | 35.158 M     | 29.106 M       | 42.42
CONV4   | 14537088       | 26.645 M     | 20.830 M       | 30.31
CONV5   | 9691392        | 26.574 M     | 20.763 M       | 30.15
AFFINE1 | 90701824       | 176.322 M    | 113.884 M      | 115.9
AFFINE2 | 38797312       | 87.677 M     | 38.077 M       | 24.67
AFFINE3 | 94720          | 33.919 M     | 20.229 M       | 19.76
It is important to note that the performance of a CNN implementation on an FPGA can be affected by a variety of factors, such as the FPGA platform, the CNN architecture, and the optimization techniques used. Table 5 only provides a comparison of two specific CNN implementations on a specific FPGA platform.

Table 6: Comparison of other works to this work (AlexNet)

            | Guo, K. et al., 2016 | Ma, Y. et al., 2016 | Zhang, C. et al., 2015 | Espinosa, M., 2019 | This Work
FPGA        | Zynq XC7Z045         | Stratix-V GXA7      | Virtex7 VX485T         | Artix7 XC7A200T    | Zedboard Zynq AES-Z7EV
Clock Freq  | 150 MHz              | 100 MHz             | 100 MHz                | 100 MHz            | 100 MHz
Data format | 16-bit fixed         | Fixed (8-16b)       | 32-bit float           | 32-bit float       | 32-bit fixed
Power       | 9.63 W (measured)    | 19.5 W (measured)   | 18.61 W (measured)     | 1.5 W (estimated)  | 0.9 W (estimated)
FF          | 127653               | ?                   | 205704                 | 103610             | 41198
LUT         | 182616               | 121000              | 186251                 | 91865              | 36527
BRAM        | 486                  | 1552                | 1024                   | 139.5              | 54
DSP         | 780                  | 256                 | 2240                   | 119                | 35
Performance | 187.80 GFOPS         | 114.5 GFOPS         | 61.62 GFOPS            | 2.93 GFOPS         | 0.74 GFOPS

Methods of Improvement / Scope

This implementation of a Convolutional Neural Network in an AlexNet configuration is a first-pass attempt and leaves a lot of room for improvement and optimization. There are a few ways the performance of this implementation could be increased, which would be areas for future work. Looking at Table 6, we can see the differences in resource utilization and performance between other recent works and this one. Although this implementation achieved lower GFOPS performance, its chip resource usage is far lower than any of the other implementations, as is its estimated power consumption.
5. Conclusions

5.1 Results

While Deep Learning and Convolutional Neural Networks (CNNs) have traditionally resided within the realm of Computer Science, with massive computations performed on GPUs housed in desktop computers, their increasing power demands raise concerns about efficiency. Existing FPGA implementations for CNNs primarily focus on accelerating the convolutional layer and often have rigid structures limiting their flexibility.

This work aims to address these limitations by proposing a scalable and modular FPGA implementation for CNNs. Unlike existing approaches, this design seeks to configure the system for running an arbitrary number of layers, offering greater flexibility and adaptability. The proposed architecture was evaluated on publicly available CNN architectures like AlexNet, ResNet, and MobileNet on a Zedboard platform. Performance analysis revealed MobileNet to be the fastest among the three, achieving an accuracy of 47.5%. This demonstrates the system's potential for efficient and adaptable execution of diverse CNN architectures.

This work paves the way for further research in scalable and flexible FPGA implementations for CNNs, offering promising avenues for resource-efficient deep learning beyond traditional computing platforms.
Appendix

// CNN sample layer model: a layer of NN neurons; a single neuron
// instance (n_0) is shown.
module Layer_1 #(
    parameter NN             = 30,
    parameter numWeight      = 784,
    parameter dataWidth      = 16,
    parameter layerNum       = 1,
    parameter sigmoidSize    = 10,
    parameter weightIntWidth = 4,
    parameter actType        = "relu"
) (
    input                     clk,
    input                     rst,
    input                     weightValid,
    input                     biasValid,
    input  [31:0]             weightValue,
    input  [31:0]             biasValue,
    input  [31:0]             config_layer_num,
    input  [31:0]             config_neuron_num,
    input                     x_valid,
    input  [dataWidth-1:0]    x_in,
    output [NN-1:0]           o_valid,
    output [NN*dataWidth-1:0] x_out
);

    neuron #(
        .numWeight(numWeight),
        .layerNo(layerNum),
        .neuronNo(0),
        .dataWidth(dataWidth),
        .sigmoidSize(sigmoidSize),
        .weightIntWidth(weightIntWidth),
        .actType(actType),
        .weightFile("w_1_0.mif"),
        .biasFile("b_1_0.mif")
    ) n_0 (
        .clk(clk),
        .rst(rst),
        .myinput(x_in),
        .weightValid(weightValid),
        .biasValid(biasValid),
        .weightValue(weightValue),
        .biasValue(biasValue),
        .config_layer_num(config_layer_num),
        .config_neuron_num(config_neuron_num),
        .myinputValid(x_valid),
        .out(x_out[0*dataWidth +: dataWidth]),
        .outvalid(o_valid[0])
    );

endmodule

Due to space constraints, all the data, references and code can be accessed here: Thesis_Appendix
Bibliography

[1] D. M. Harris and S. L. Harris, Digital Design and Computer Architecture. Elsevier, 2007.
[2] S. Authors, History of Artificial Intelligence.
[3] Farabet, C., Martini, B., Akselrod, P., Talay, S., LeCun, Y., & Culurciello, E. Hardware accelerated convolutional neural networks for synthetic vision systems. In: Circuits and Systems (ISCAS), Proceedings of the 2010 IEEE International Symposium on, pp. 257-260. IEEE, 2010.
[4] Goodfellow, I., Bengio, Y., & Courville, A. Convolutional Networks. In Dietterich, T. (Ed.), Deep Learning (pp. 326-339). Cambridge, Massachusetts: The MIT Press, 2016.
[5] D. Gschwend, ZynqNet: An FPGA-accelerated embedded convolutional neural network.
[6] Xilinx (2017). Zynq-7000 All Programmable SoC Family Product Tables and Product Selection Guide. Retrieved from https://www.xilinx.com/support/documentation/selection-guides/zynq-7000-product-selection-guide.pdf
[7] Neris, R., Rodríguez, A., & Guerra, R. FPGA-Based Implementation of a CNN Architecture for the On-Board Processing of Very High-Resolution Remote Sensing Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 15, 2022.
[8] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
[9] Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25 (NIPS 2012). Retrieved from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[10] Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. Retrieved from https://arxiv.org/pdf/1409.1556.pdf
[11] Li, F., et al. CNN Architectures [PDF document]. Retrieved from Lecture Notes Online Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
[12] Qiao, Y., Shen, J., Xiao, T., Yang, Q., Wen, M., & Zhang, C. FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency. Concurrency and Computation: Practice and Experience. John Wiley & Sons Ltd., May 6, 2016.
[13] Lacey, G., Taylor, G., & Areibi, S. Deep Learning on FPGAs: Past, Present and Future. Cornell University Library. https://arxiv.org/abs/1602.04283, Feb. 13, 2016.
[14] Gomez, P. Implementation of a Convolutional Neural Network (CNN) on a FPGA for Sign Language's Alphabet Recognition. Archivo Digital UPM. Retrieved December 6, 2023, from https://oa.upm.es/53784/1/TFG_PABLO_CORREA_GOMEZ.pdf (July 2018).
[15] Espinosa, M. A. Implementation of Convolutional Neural Networks in FPGA for Image Classification. ScholarWorks. Retrieved December 6, 2023, from https://scholarworks.calstate.edu/downloads/hd76s209r (Spring 2019).
[16] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved from https://arxiv.org/pdf/1512.03385.pdf
[17] Reddy, G. (2019, January 1). FPGA Implementation of Multiplier-Accumulator Unit Using Vedic Multiplier and Reversible Gates. Semantic Scholar. https://www.semanticscholar.org/paper/FPGA-Implementation-of-Multiplier-Accumulator-Unit-Rajesh-Reddy/edab41b3600b2b51d6887042487bac32c80182b5
[18] Guo, K., Sui, L., Qiu, J., Yao, S., Han, S., Wang, Y., & Yang, H. (July 13, 2016). Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware. IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016, pp. 24-29. doi:10.1109/ISVLSI.2016.129
[19] Ahn, B. (Oct. 1, 2015). Real-time video object recognition using convolutional neural networks. International Joint Conference on Neural Networks (IJCNN), 2015. doi:10.1109/IJCNN.2015.7280718