Parallel Organization Notes | COM SPPU 2024

Introduction to Parallel Processing

For decades, computer engineers improved processor performance by increasing clock speeds (the frequency at which a processor executes instructions). However, physical limits such as heat dissipation and leakage power made it impractical to keep raising clock speeds indefinitely. To continue making computers faster, engineers shifted their focus from making a single processor run faster to making multiple processors work together simultaneously. This approach is known as parallel processing or parallel organization.

Parallel organization divides a large, complex task into smaller sub-tasks. The sub-tasks are then processed at the same time using multiple execution units. These notes cover multiprocessor architectures, classification models, memory distribution, and the core mechanics of Reduced Instruction Set Computer (RISC) architectures.
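The divide-and-combine idea can be sketched in a few lines of Python. This is only an illustration: the function names `split_into_chunks` and `parallel_sum` are made up for this example, and in standard CPython threads show the structure of the approach rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Divide a list into num_chunks nearly equal contiguous sub-lists."""
    size, extra = divmod(len(values), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def parallel_sum(values, num_workers=4):
    """Sum each chunk on a separate worker, then combine the partial sums."""
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


data = list(range(1, 101))
print(parallel_sum(data))  # 5050
```

Each worker handles one sub-task (a chunk), and the final `sum` combines the partial results, which mirrors the split, process, and merge pattern described above.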

Multiprocessor Systems and Clusters

When building high-performance computing systems, engineers scale up hardware systems using two main approaches: Multiprocessor Systems and Clusters.

Multiprocessor Systems

A multiprocessor system is a single computer system that contains two or more independent processing units (CPUs) that share access to a common memory space and peripheral devices.

Working Principle: All processors inside the system are managed by a single centralized operating system. The operating system allocates separate threads or tasks to different CPUs to balance the workload.
Applications: High-end engineering workstations, enterprise database servers, and modern personal computers running multi-core chips.

Clusters

A cluster is a collection of independent, standalone computers connected via a high-speed network that work together as a single computing resource. Each computer within the cluster is called a node.

Working Principle: Each node runs its own independent operating system, has its own dedicated processor, and maintains its own private main memory. Special cluster management software coordinates work distribution across these nodes.
Applications: Large-scale search engines, weather forecasting centers, cloud data centers, and supercomputing research centers.

Comparison: Multiprocessors vs. Clusters

Feature	Multiprocessor Systems	Clusters
Memory Architecture	Shared memory space among all CPUs.	Distributed memory; each node has private RAM.
Connection Method	High-speed internal system bus or crossbar switches.	External high-speed network cables (LAN/InfiniBand).
Operating System	Managed by a single central operating system.	Each individual node runs its own operating system.
Scalability	Limited; adding too many CPUs overloads the bus.	High scalability; new standalone nodes can easily be added.
Fault Tolerance	If the main board or memory fails, the whole system crashes.	High; if one node crashes, other nodes take over the work.

Flynn's Taxonomy for Multiple Processor Organizations

In 1966, computer scientist Michael J. Flynn introduced a classification system for computer architectures based on how they handle instruction streams and data streams. An Instruction Stream is the sequence of commands executed by a processor, while a Data Stream is the sequence of data operands manipulated by those commands.

Flynn's Taxonomy divides all computers into four distinct architectural categories:

                      +---------------------------------------+
                      |           FLYNN'S TAXONOMY            |
                      +---------------------------------------+
                      |  Single Data Stream  |  Multiple Data |
+---------------------+----------------------+----------------+
| Single Instruction  |        SISD          |      SIMD      |
+---------------------+----------------------+----------------+
| Multiple Instruction|        MISD          |      MIMD      |
+---------------------+----------------------+----------------+

1. SISD (Single Instruction Stream, Single Data Stream)

This represents the classic, traditional computer architecture. The processor executes one instruction at a time on a single data value stored in memory.

Working: A control unit fetches a single command, and an execution unit processes a single piece of data for that command during each cycle.
Real-Life Example: Early desktop computers, simple embedded microcontrollers, and older processors like the Intel 8085.

2. SIMD (Single Instruction Stream, Multiple Data Stream)

A single control unit broadcasts the same instruction to multiple processing units simultaneously. Each processing unit executes that instruction on different data values.

Working: This architecture uses a single instruction to manipulate an entire array or vector of data values at once.
Real-Life Example: Graphics Processing Units (GPUs) rendering image pixels, and vector processors used for video processing.

3. MISD (Multiple Instruction Stream, Single Data Stream)

Multiple independent processing units execute different instructions on the exact same data stream simultaneously.

Working: The output of one processor serves as the input for the next, or all units evaluate the same input data using different algorithms to verify correctness.
Real-Life Example: MISD is largely a theoretical category. It is rarely used in commercial computing, but redundant aerospace flight control computers, which pass the same input through several units for safety, are sometimes cited as an example.

4. MIMD (Multiple Instruction Stream, Multiple Data Stream)

Multiple independent processors execute completely different programs on completely different sets of data at the same time.

Working: Each processor contains its own control unit and its own arithmetic logic unit. This allows them to run independent tasks asynchronously.
Real-Life Example: Modern multi-core laptops, distributed cloud servers, and cluster computer networks.

In Simple Terms: Flynn's Taxonomy is a way of classifying computers based on instructions (commands) and data (values). SISD machines are the older, conventional computers that do one thing at a time. In SIMD, a single command (such as "add") is applied to many different data values at once, as in graphics rendering. MIMD describes modern multi-core computers, where each processor handles its own task and its own data.
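The four categories, and the difference between SISD and SIMD execution, can be modelled roughly as follows. The function names and the `lanes` parameter are assumptions made for this sketch; real SIMD hardware performs the lane-wise work in a single instruction, which plain Python can only imitate.

```python
def classify_flynn(instruction_streams, data_streams):
    """Return the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


def sisd_add(a, b):
    """SISD style: each step adds exactly one pair of values."""
    result, steps = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        steps += 1
    return result, steps


def simd_add(a, b, lanes=4):
    """SIMD style: each step applies one add to up to `lanes` pairs at once."""
    result, steps = [], 0
    for start in range(0, len(a), lanes):
        # Conceptually one vector instruction covering this slice.
        pairs = zip(a[start:start + lanes], b[start:start + lanes])
        result.extend(x + y for x, y in pairs)
        steps += 1
    return result, steps


print(classify_flynn(1, 8))                 # SIMD
print(sisd_add([1, 2, 3, 4], [10] * 4)[1])  # 4
print(simd_add([1, 2, 3, 4], [10] * 4)[1])  # 1
```

Both add functions return the same values; only the number of steps differs, which is the point of broadcasting one instruction over many data elements.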

Closely Coupled and Loosely Coupled Multiprocessors

Multiprocessor systems are further classified based on how closely their processors interact and how they share memory resources.

Closely Coupled Multiprocessor Systems (Shared Memory)

In a closely coupled system, all processing units share a single centralized main memory space through an internal system bus or a switch network.

Data Exchange: Processors communicate with one another by reading from and writing to shared memory locations. When one CPU modifies a variable in shared RAM, other CPUs can observe the change without any explicit message, once the hardware has propagated the update to them.
Latency: Very low data access latency because data transfers occur over high-speed internal hardware channels.
Disadvantage: As more processors are added, the shared system bus becomes congested, creating a performance bottleneck that limits total system size.

Loosely Coupled Multiprocessor Systems (Distributed Memory)

In a loosely coupled system, each processor has its own dedicated local memory. These independent processor-memory units are connected via a high-speed communication network.

Data Exchange: Processors do not share memory addresses directly. Instead, they exchange data by passing messages across the network lines using message-passing protocols.
Latency: Higher latency than closely coupled systems because data must travel through network interfaces.
Advantage: High scalability. Systems can expand to include hundreds of processors because there is no single shared bus to overload.
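The two data-exchange styles can be contrasted with a small sketch. Threads stand in for processors here, a lock-protected dictionary stands in for shared RAM, and a `queue.Queue` stands in for the network; all of these stand-ins and the function names are assumptions for illustration only.

```python
import queue
import threading


def _run_workers(worker, values, num_workers):
    """Start one thread per worker on an interleaved slice and wait for all."""
    threads = [
        threading.Thread(target=worker, args=(values[i::num_workers],))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_memory_sum(values, num_workers=2):
    """Closely coupled style: workers update one shared location under a lock."""
    shared = {"total": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # serialize access to the shared location
            shared["total"] += partial

    _run_workers(worker, values, num_workers)
    return shared["total"]


def message_passing_sum(values, num_workers=2):
    """Loosely coupled style: workers keep private data and send a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the only interaction is a sent message

    _run_workers(worker, values, num_workers)
    return sum(mailbox.get() for _ in range(num_workers))


print(shared_memory_sum(list(range(10))))    # 45
print(message_passing_sum(list(range(10))))  # 45
```

The shared-memory version needs a lock because every worker touches the same location, while the message-passing version avoids shared state at the cost of an explicit send and receive.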

Symmetric Multiprocessor (SMP) Organization

A Symmetric Multiprocessor (SMP) is a specific type of closely coupled shared-memory system that follows a peer-to-peer processor model.

Definition and Key Characteristics

In an SMP organization, two or more identical processors share access to a single common main memory space and input-output systems. The term Symmetric means that all processors share equal responsibility and have equal capabilities.

There is no permanent "master-slave" relationship. Any processor can execute any software thread, handle any hardware device interrupt, and run operating system kernel tasks.

+-------------------------------------------------------+
|              SMP HARDWARE ARCHITECTURE                |
+-------------------------------------------------------+
|  +-------------+  +-------------+  +-------------+    |
|  | Processor 1 |  | Processor 2 |  | Processor 3 |    |
|  +------+------+  +------+------+  +------+------+    |
|         |                |                |           |
|  +------v------+  +------v------+  +------v------+    |
|  | Private L1  |  | Private L1  |  | Private L1  |    |
|  |   Cache     |  |   Cache     |  |   Cache     |    |
|  +------+------+  +------+------+  +------+------+    |
|         |                |                |           |
+---------v----------------v----------------v-----------+
|              SHARED SYSTEM INTERCONNECT BUS           |
+---------------------------+---------------------------+
                            |
                   +--------v--------+
                   |  Shared Main    |
                   |  Memory (RAM)   |
                   +-----------------+

Architectural Advantages of SMP

Performance Scaling: Multiple program threads run simultaneously across different processors, accelerating total application execution.
High Availability: Because all processors can perform the same functions, the failure of a single processor does not have to halt the machine. The operating system can reschedule work on the remaining processors, and the system keeps running at reduced performance.
Incremental Growth: System performance can be upgraded simply by plugging an additional processor into an empty motherboard socket.
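The symmetric property (any processor can take any task) can be illustrated with a simple load-balancing sketch. The function `assign_tasks`, the least-loaded policy, and the task costs are all hypothetical; a real operating system scheduler is considerably more involved.

```python
def assign_tasks(tasks, processors):
    """Assign each (name, cost) task to the currently least-loaded processor.

    Every processor is eligible for every task, which is the symmetric property.
    """
    if not processors:
        raise ValueError("at least one processor is required")
    load = {p: 0 for p in processors}
    placement = {p: [] for p in processors}
    for name, cost in tasks:
        # min() returns the first processor with the lowest load on ties.
        target = min(processors, key=lambda p: load[p])
        placement[target].append(name)
        load[target] += cost
    return placement


tasks = [("A", 4), ("B", 2), ("C", 3), ("D", 1)]

# Three healthy processors share the work.
print(assign_tasks(tasks, ["cpu0", "cpu1", "cpu2"]))

# If cpu2 is taken out of service, the same tasks still fit on the survivors.
print(assign_tasks(tasks, ["cpu0", "cpu1"]))
```

Because no task is tied to a particular processor, removing a processor from the list or adding one changes only the placement, not whether the work can run.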

UMA vs. NUMA Memory Architectures

Shared-memory multiprocessors follow one of two primary memory access models, which differ in how they scale and handle memory latency: UMA and NUMA.

UMA (Uniform Memory Access)

In a UMA architecture, all processors share access to a centralized pool of main memory through a shared system bus.

Working Principle: Every processor experiences the exact same access time (latency) when reading or writing data to any location in main memory, regardless of which CPU issues the request.
Cache Management: Processors use private L1/L2 caches, which requires a hardware mechanism called Cache Coherency to ensure that all local caches show the same updated data values.
Limitation: It scales poorly beyond 8 to 16 processors because the single memory bus becomes congested with requests.
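To see why private caches need a coherency mechanism, consider the toy model below. It assumes a write-invalidate policy with write-through to memory, which is a simplification chosen for this sketch; the notes do not name a specific protocol, and real hardware uses more elaborate schemes such as MESI. The class name `CoherentSystem` is hypothetical.

```python
class CoherentSystem:
    """Toy write-invalidate model: one shared memory, one private cache per CPU."""

    def __init__(self, num_cpus):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cpus)]

    def read(self, cpu, address):
        """Return the value at address, filling this CPU's cache on a miss."""
        cache = self.caches[cpu]
        if address not in cache:
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        """Write a value and invalidate every other cached copy of the address."""
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(address, None)  # drop the stale copy, if any
        self.caches[cpu][address] = value
        self.memory[address] = value      # write-through, for simplicity


system = CoherentSystem(num_cpus=2)
system.write(0, 0x10, 5)
print(system.read(1, 0x10))  # 5
system.write(1, 0x10, 9)     # invalidates CPU 0's cached copy
print(system.read(0, 0x10))  # 9
```

Without the invalidation loop in `write`, CPU 0 would keep returning its old cached value of 5 after CPU 1 wrote 9, which is exactly the inconsistency cache coherency is meant to prevent.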

NUMA (Non-Uniform Memory Access)

In a NUMA architecture, physical main memory is divided into separate, local segments and distributed directly to individual processors or processor clusters.

Working Principle: A processor can access its own local memory segment very quickly. However, if it needs data stored in a remote memory segment (attached to a different processor), it must send a request across a specialized interconnect bus, which takes longer.
Memory Latency: Memory access times are non-uniform because accessing local memory is significantly faster than fetching data from a remote memory module.
Benefits: Highly scalable, enabling enterprise servers to link dozens of processors together efficiently.

+-----------------------------------------------------------------+
|                    NUMA SYSTEM LAYOUT                           |
+-----------------------------------------------------------------+
|  +---------------+                    +---------------+         |
|  |  Processor A  |                    |  Processor B  |         |
|  +-------+-------+                    +-------+-------+         |
|          | (Fast Local Access)                | (Fast Local     |
|  +-------v-------+                    +-------v-------+  Access)|
|  | Memory Bank 1 |                    | Memory Bank 2 |         |
|  +-------+-------+                    +-------+-------+         |
|          |                                    |                 |
+----------v------------------------------------v-----------------+
|              HIGH-SPEED SYSTEM INTERCONNECT FABRIC              |
|             (Slower Remote Cross-Over Access Line)              |
+-----------------------------------------------------------------+
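The cost of remote accesses can be estimated with a weighted average of local and remote latency. The function name and the latency figures below are illustrative assumptions, not measurements from any particular machine.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected memory latency when a fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed example: 100 ns local, 300 ns remote, 75% of accesses are local.
print(average_access_time(0.75, 100, 300))  # 150.0
```

The closer `local_fraction` is to 1, the closer the average gets to the local latency, which is why NUMA-aware software tries to keep each thread's data in the memory bank attached to its own processor.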

Comparison Summary: UMA vs. NUMA

Feature	Uniform Memory Access (UMA)	Non-Uniform Memory Access (NUMA)
Memory Distribution	Single centralized memory module.	Divided into local and remote memory banks.
Access Latency	Uniform; identical speed for all memory addresses.	Non-uniform; local access is fast, remote access is slow.
System Scalability	Low scalability (typically under 16 processors).	High scalability (can support dozens of processors).
Interconnect Setup	Simple shared system bus layout.	Complex high-speed interconnection networks.

RISC (Reduced Instruction Set Computer) Architecture

Processors can also be organized based on their Instruction Set Architecture (ISA). The two primary philosophies are RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer).

Instruction Execution Characteristics

RISC architectures focus on simplifying instructions so they can execute very quickly. Research into program execution characteristics revealed that:

A small number of simple instructions contribute to the vast majority of executed code.
Memory access operations are slow and create performance bottlenecks.
Complex instructions are difficult to optimize using hardware pipeline systems.

As a result, RISC design prioritizes a small, optimized set of simple instructions, most of which can execute in a single clock cycle.

Use of a Large Register File

Because memory operations are slow, RISC processors minimize RAM access by including a large number of internal registers (often 32 to 128 general-purpose registers). This allows variables to stay inside fast CPU registers longer, reducing the need to read and write to slower external RAM.

Compiler-Based Register Optimization

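RISC architectures rely on optimizing compilers to make good use of the register file. The compiler analyzes how long each variable is in use and assigns the most frequently used variables to CPU registers, commonly with a graph-coloring algorithm, so that fewer values have to be loaded from and stored to memory. Register windowing is a related but separate hardware technique, used in designs such as Berkeley RISC and SPARC: the register file is split into overlapping windows, and each procedure call switches to a new window. This minimizes the need to save and restore registers to memory during function calls, accelerating subroutine performance.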
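Compiler register allocation is commonly framed as graph coloring: variables that are in use at the same time "interfere" and must not share a register. The greedy sketch below is a simplified illustration under that framing; the function name, the ordering heuristic, and the example graph are assumptions, and production compilers use more refined algorithms.

```python
def allocate_registers(interference, num_registers):
    """Greedy graph coloring of an interference graph.

    interference maps each variable to the set of variables live at the same
    time. Returns (assignment, spilled); spilled variables stay in memory.
    """
    assignment, spilled = {}, []
    # Handle the most constrained variables first; break ties by name.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for var in order:
        used = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in used]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # no register left: keep this one in memory
    return assignment, spilled


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(allocate_registers(graph, 3))  # every variable gets a register
print(allocate_registers(graph, 2))  # one variable has to be spilled
```

With three registers, `d` can reuse the register of `a` because the two are never live together. With only two, one variable is spilled to memory, which shows why a larger register file reduces memory traffic.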

RISC Architecture and Pipelining

A core feature of RISC processors is their ability to run simple, uniform instructions through an execution pipeline efficiently.

Execution Pipeline Mechanics

Instruction pipelining is a technique where multiple instructions are overlapped during execution. The process is broken down into sequential steps, similar to an assembly line. A standard RISC pipeline typically uses five stages:

Instruction Fetch (IF): Pulls the next instruction from memory.
Instruction Decode (ID): Translates the instruction and reads values from registers.
Execute (EX): The Arithmetic Logic Unit (ALU) performs the operation.
Memory Access (MEM): Reads or writes data to memory if necessary.
Write Back (WB): Saves the final result back into a register.

Clock Cycles:      1     2     3     4     5     6     7
Instruction 1:   [IF]  [ID]  [EX]  [MEM] [WB]
Instruction 2:         [IF]  [ID]  [EX]  [MEM] [WB]
Instruction 3:               [IF]  [ID]  [EX]  [MEM] [WB]

Because every RISC instruction has the same size (typically 32 bits) and follows the same execution steps, a new instruction can enter the pipeline on every clock cycle. This allows the processor to complete close to one instruction per clock cycle once the pipeline is full.
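The timing in the diagram above can be reproduced with a short calculation. The sketch assumes an ideal pipeline with no stalls or hazards, and the function names are made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles for an ideal pipeline: fill it once, then finish one per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles when each instruction must finish before the next one starts."""
    return num_instructions * num_stages


def speedup(num_instructions, num_stages=len(STAGES)):
    """Ratio of unpipelined to pipelined execution time."""
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


def stage_in_cycle(instruction_index, cycle):
    """Stage occupied by an instruction (0-based) in a cycle (1-based), or None."""
    position = cycle - 1 - instruction_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


print(pipelined_cycles(3))         # 7
print(stage_in_cycle(2, 7))        # WB
print(round(speedup(1000), 2))     # 4.98
```

Three instructions finish in 7 cycles, matching the diagram, and for long instruction sequences the speedup approaches the number of stages (5), which is the sense in which the pipeline completes close to one instruction per cycle.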

In Simple Terms: The main goal of RISC architecture is to make instructions simple enough that each one can execute in a single clock cycle. To support this, a RISC processor has a large register file, so it does not have to go to slow RAM repeatedly. It also uses pipelining: just as work proceeds on a factory assembly line, parts of different instructions run in parallel at different stages (Fetch, Decode, Execute), which improves performance.

Comparison: RISC vs. CISC

CISC (Complex Instruction Set Computer) architectures use a large library of complex, variable-length instructions designed to accomplish tasks in fewer lines of assembly code. This contrasts with the simple, uniform instruction approach used by RISC.

Feature	RISC Architecture	CISC Architecture
Instruction Size	Fixed size (typically 32 bits).	Variable size (for example, 1 to 15 bytes on x86).
Instruction Set Size	Small set of simple instructions.	Large set of complex instructions.
Execution Speed	Most instructions execute in a single clock cycle.	Instructions take multiple clock cycles to complete.
Memory Access	Limited to dedicated `LOAD` and `STORE` commands.	Many instructions can access memory operands directly.
Internal Registers	Large register file (32 or more).	Fewer general-purpose registers.
Control Unit Design	Hardwired control logic (fast and simple).	Microprogrammed control unit (complex).
Pipelining Efficiency	High; easy to optimize due to uniform steps.	Low; difficult to pipeline variable-length steps.
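The load/store rule in the table can be made concrete with a tiny interpreter. The three-instruction machine below is hypothetical and only meant to show that arithmetic works on registers while `LOAD` and `STORE` are the only instructions that touch memory.

```python
def run_load_store(program, memory, num_registers=8):
    """Execute a tiny hypothetical RISC-style program and return the registers."""
    regs = [0] * num_registers
    for op, *args in program:
        if op == "LOAD":       # LOAD rd, address: memory -> register
            rd, address = args
            regs[rd] = memory[address]
        elif op == "STORE":    # STORE rs, address: register -> memory
            rs, address = args
            memory[address] = regs[rs]
        elif op == "ADD":      # ADD rd, rs1, rs2: registers only
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


memory = {"x": 7, "y": 5, "z": 0}
program = [
    ("LOAD", 1, "x"),
    ("LOAD", 2, "y"),
    ("ADD", 3, 1, 2),
    ("STORE", 3, "z"),
]
run_load_store(program, memory)
print(memory["z"])  # 12
```

Computing z = x + y takes four simple instructions here. A CISC machine with memory operands could express the same work in fewer, more complex instructions, which is the trade-off the table summarizes.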

Technical Summary and Key Takeaways

Scalable Architecture Options: Multiprocessor platforms use shared memory spaces to coordinate tasks efficiently across multiple processors, while clusters link independent computer nodes across high-speed networks to scale processing power.
Execution Stream Categories: Flynn's Taxonomy categorizes computer systems into four distinct types (SISD, SIMD, MISD, and MIMD) by analyzing how they route instruction and data streams.
Memory Integration Models: UMA systems provide predictable, uniform access times across a shared system bus, whereas NUMA architectures partition memory locally to reduce latency and improve scalability for high-core systems.
Instruction Set Philosophy: RISC architectures prioritize performance using simple, fixed-length instructions, most of which execute in a single clock cycle, supported by large register files and highly efficient execution pipelines.

Unit V – Parallel Organization | COM SPPU 2024