My notes on shared memory parallelism.
Shared Memory Parallelism (NOTES)
Subhajit Sahu
Abstract — These notes cover shared memory as a means of communication among
threads and processes, multi-process versus multi-threaded parallelism, and heterogeneous
multiprocessing (symmetric vs asymmetric multiprocessing).
Index terms — Symmetric multiprocessing, Asymmetric multiprocessing.
1. Shared memory
Shared memory is memory that may be simultaneously accessed by multiple programs
with an intent to provide communication among them or avoid redundant copies. Shared
memory is an efficient means of passing data between programs. Using memory for
communication inside a single program, e.g. among its multiple threads, is also referred to
as shared memory [REF].
Shared memory systems may use [REF]:
- Uniform Memory Access (UMA): all the processors share the physical memory uniformly;
- Non-Uniform Memory Access (NUMA): memory access time depends on the memory
location relative to a processor (a brief allocation sketch follows this list);
- Cache-Only Memory Architecture (COMA): the local memories for the processors at each
node are used as cache instead of as actual main memory.
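Of these, NUMA is the case a Linux programmer is most likely to meet directly. A minimal
sketch using libnuma (an assumption: Linux with libnuma installed, compiled with -lnuma;
the buffer size and node number are illustrative):

#include <numa.h>
#include <stdio.h>

int main(void) {
  if (numa_available() < 0) {              // kernel built without NUMA support
    fprintf(stderr, "NUMA not available\n");
    return 1;
  }
  size_t size = 1 << 20;                   // 1 MiB, illustrative
  // Allocate memory physically placed on NUMA node 0: threads running on
  // that node's processors get local (faster) access to it.
  void *buf = numa_alloc_onnode(size, 0);
  if (!buf) return 1;
  printf("highest NUMA node: %d\n", numa_max_node());
  numa_free(buf, size);
  return 0;
}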
A shared memory system is relatively easy to program since all processors share a single
view of data and the communication between processors can be as fast as memory
accesses to the same location. The issue with shared memory systems is that many CPUs
need fast access to memory and will likely cache memory, which has two complications
[REF]:
- Access time degradation: when several processors try to access the same memory
location it causes contention. Trying to access nearby memory locations may cause false
sharing (see the sketch after this list). Shared memory computers cannot scale very well. Most of them have ten or
fewer processors;
- Lack of data coherence: whenever one cache is updated with information that may be
used by other processors, the change needs to be reflected to the other processors,
otherwise the different processors will be working with incoherent data. Such cache
coherence protocols can, when they work well, provide extremely high-performance
access to shared information between multiple processors. On the other hand, they can
sometimes become overloaded and become a bottleneck to performance.
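A minimal C sketch of the false-sharing case, assuming 64-byte cache lines (the counter
names and loop count are illustrative): two threads update logically independent counters,
but if the counters share a cache line, the coherence protocol bounces that line between
the cores on every write.

#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define N 100000000L

// Both counters in one cache line: each write by one thread invalidates the
// other core's cached copy, even though the data is logically independent.
struct { long a, b; } shared_line;

// Padding each counter to its own 64-byte line removes the contention.
struct { alignas(64) long a; alignas(64) long b; } padded;

void *bump(void *p) {
  long *c = p;
  for (long i = 0; i < N; i++) (*c)++;
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  // Time this once with &shared_line.a / &shared_line.b to see the slowdown.
  pthread_create(&t1, NULL, bump, &padded.a);
  pthread_create(&t2, NULL, bump, &padded.b);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("%ld %ld\n", padded.a, padded.b);
  return 0;
}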
In case of a Heterogeneous System Architecture (processor architecture that integrates
different types of processors, such as CPUs and GPUs, with shared memory), the memory
management unit (MMU) of the CPU and the input–output memory management unit
(IOMMU) of the GPU have to share certain characteristics, like a common address space.
Compared to multiple address space operating systems, memory sharing -- especially
sharing of procedures or pointer-based structures -- is simpler in single address space
operating systems. The alternatives to shared memory are distributed memory and
distributed shared memory, each having a similar set of issues [REF].
2. Multi-process parallelism
A crash in a thread takes down the whole process. And you probably wouldn't want it any
other way since a crash signal (like SIGSEGV, SIGBUS, SIGABRT) means that you lost
control over the behavior of the process and anything could have happened to its memory.
So if you want to isolate things, spawning separate processes is definitely better [REF]. Two
or more processes can share data between each other efficiently through the use of
interprocess shared memory. In most cases, you need to use interprocess synchronization
mechanisms (such as a semaphore, or a mutex) [REF]. For iterative algorithms such as
PageRank (barrier-free), this synchronization may not be necessary [SAHU]. When a
process dies, the shared memory is left as it is. It is mapped as a file under the /dev/shm/
directory. It is removed either when the system reboots, or when all processes unmap
the shared memory file and shm_unlink() is called [REF]. What if the cause of the crash
was a write to a rogue pointer stomping through the shared memory? You'll never be 100%
safe [REF].
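A minimal sketch of interprocess shared memory on Linux using the POSIX shm API (the
object name /demo_shm is hypothetical; older glibc needs -lrt at link time):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  // Creates the shared memory object; it shows up as /dev/shm/demo_shm.
  int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
  if (fd < 0) return 1;
  ftruncate(fd, sizeof(int));
  int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
  if (fork() == 0) {           // child process writes into the region
    *shared = 42;
    _exit(0);
  }
  wait(NULL);                  // parent reads after the child has exited
  printf("child wrote: %d\n", *shared);
  munmap(shared, sizeof(int));
  shm_unlink("/demo_shm");     // removes the /dev/shm entry
  return 0;
}

Here the parent waits for the child to exit before reading, so no semaphore or mutex is
needed; concurrent readers and writers would need one of the synchronization mechanisms
mentioned above.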
Interprocess synchronization can work differently on different operating systems. What
happens when a process that has locked the shared memory crashes? Windows frees the
locked named mutex after a process crash whereas Linux doesn't free it [REF]. The
advantage of Windows is that the waiting thread is freed to continue. The disadvantage is
that it has no idea what the state of the shared memory is—the crashed process may have
been part way through updates. In practice the only safe thing to do is to reset the shared
memory in some way (assuming that can be done meaningfully) or to fail [REF].
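On Linux, POSIX robust mutexes offer a middle ground: a process-shared mutex can be
marked robust, so a waiter whose owner died receives EOWNERDEAD instead of blocking
forever, and is then expected to repair or reset the shared state. A sketch (the function
names are illustrative; the mutex itself must live in the interprocess shared memory):

#include <errno.h>
#include <pthread.h>

// Initialize a robust mutex that lives in interprocess shared memory.
void init_robust(pthread_mutex_t *mtx) {
  pthread_mutexattr_t a;
  pthread_mutexattr_init(&a);
  pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
  pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
  pthread_mutex_init(mtx, &a);
  pthread_mutexattr_destroy(&a);
}

int lock_robust(pthread_mutex_t *mtx) {
  int r = pthread_mutex_lock(mtx);
  if (r == EOWNERDEAD) {
    // The previous owner died holding the lock: shared data may be half
    // updated. Reset it meaningfully here, then mark the mutex consistent.
    pthread_mutex_consistent(mtx);
    r = 0;
  }
  return r;                    // nonzero: mutex unusable, caller should fail
}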
Threads require significantly less overhead for switching and for communicating. A process
context switch requires changing the system’s current view of virtual memory, which is a
time-consuming operation. Switching from one thread to another is a lot faster. For two
processes to exchange data, they have to initiate interprocess communication (IPC), which
requires asking the kernel for help. IPC generally involves performing at least one context
switch. Since all threads in a process share the heap, data, and code, there is no need to get
the kernel involved. The threads can communicate directly by reading and writing global
variables [REF].
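A minimal pthreads sketch of that direct communication (the counter name and thread
count are illustrative; compile with -pthread): the threads exchange data through an
ordinary global variable, guarded by a mutex, with no separate IPC channel.

#include <pthread.h>
#include <stdio.h>

long total = 0;                  // visible to all threads: shared data segment
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
  for (int i = 0; i < 1000; i++) {
    pthread_mutex_lock(&lock);   // plain memory access, no kernel IPC channel
    total += 1;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  pthread_t t[4];
  for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
  printf("total = %ld\n", total);  // always 4000
  return 0;
}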
At the same time, the lack of isolation between threads in a single process can also lead to
some disadvantages. If one thread crashes due to a segmentation fault or other error, all
other threads and the entire process are killed. This situation can lead to data and system
corruption if the other threads were in the middle of some important task. As an example,
assume that the application is a server program with one thread responsible for logging all
requests and a separate thread for handling the requests. If the request handler thread
crashes before the logging thread has a chance to write the request to its log file, there
would be no record of the request. The system administrators left to determine what went
wrong would have no information about the request, so they may waste a lot of time
validating other requests that were all good [REF].
3. Heterogeneous Multiprocessing
Why do we have CPUs with all the cores at the same speed, and not combinations of
different speeds [REF]? This is known as heterogeneous multiprocessing (HMP) and is
widely adopted by mobile devices. In ARM-based devices which implement big.LITTLE, the
processor contains cores with different performance and power profiles, e.g. some cores run
fast but draw lots of power (faster architecture and/or higher clocks) while others are
energy-efficient but slow (slower architecture and/or lower clocks). This is useful because
power usage tends to increase disproportionately as you increase performance once you
get past a certain point. The idea here is to get performance when you need it and battery
life when you don't [REF].
On desktop platforms, power consumption is much less of an issue so this is not truly
necessary. Most applications expect each core to have similar performance characteristics,
and scheduling processes for HMP systems is much more complex than scheduling for
traditional SMP systems. (Windows 10 technically has support for HMP, but it's mainly
intended for mobile devices that use ARM big.LITTLE.) Also, most desktop and laptop
processors today are not thermally or electrically limited to the point where some cores
need to run faster than others even for short bursts. We've basically hit a wall on how fast
we can make individual cores, so replacing some cores with slower ones won't allow the
remaining cores to run faster [REF].
While there are a few desktop processors that have one or two cores capable of running
faster than the others, this capability is currently limited to certain very high-end Intel
processors (branded as Turbo Boost Max Technology 3.0) and only involves a slight gain in
performance for those cores that can run faster. While it is certainly possible to design a
traditional x86 processor with both large, fast cores and smaller, slower cores to optimize
for heavily-threaded workloads, this would add considerable complexity to the processor
design and applications are unlikely to properly support it [REF].
Take a hypothetical processor with two fast Kaby Lake (7th-generation Core) cores and
eight slow Goldmont (Atom) cores. You'd have a total of 10 cores, and heavily-threaded
workloads optimized for this kind of processor may see a gain in performance and efficiency
over a normal quad-core Kaby Lake processor. However, the different types of cores have
wildly different performance levels, and the slow cores don't even support some of the
instructions the fast cores support, like AVX (ARM avoids this issue by requiring both the
big and LITTLE cores to support the same instructions) [REF].
Again, most Windows-based multithreaded applications assume that every core has the
same or nearly the same level of performance and can execute the same instructions, so
this kind of asymmetry is likely to result in less-than-ideal performance, perhaps even
crashes if an application uses instructions not supported by the slow cores. While Intel could modify the
slow cores to add advanced instruction support so that all cores can execute all
instructions, this would not resolve issues with software support for heterogeneous
processors [REF].
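Portable software copes with this by detecting instruction-set features at run time rather
than assuming uniform cores. A minimal sketch using the GCC/Clang builtins (with a
caveat: the check reports the features of whichever core happens to execute it, which is
exactly why asymmetric instruction sets break the model):

#include <stdio.h>

int main(void) {
  __builtin_cpu_init();                    // populate the CPU feature flags
  if (__builtin_cpu_supports("avx")) {
    puts("AVX available: dispatch to the AVX code path");
  } else {
    puts("AVX missing: fall back to a baseline (e.g. SSE2) path");
  }
  return 0;
}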
A different approach to application design, closer to what you're probably thinking about in
your question, would use the GPU for acceleration of highly parallel portions of
applications. This can be done using APIs like OpenCL and CUDA. As for a single-chip
solution, AMD promotes hardware support for GPU acceleration in its APUs, which
combine a traditional CPU and a high-performance integrated GPU onto the same chip, as
Heterogeneous System Architecture, though this has not seen much industry uptake
outside of a few specialized applications [REF].