A Proposed High-Speed Computer Design
The designs of several high performance general purpose computers built during
the last two decades are examined. The performance limitations that
they encountered and their approaches to overcoming these problems are discussed.
An alternate design approach is described which avoids these problems.
This is to have the computer support the simultaneous execution of
several independent programs by multiprogramming its various components at
the sub-instruction level. Estimates of its performance are made, and
the implications for the Operating System are examined.
The principles discussed are generally applicable, but to make the proposals
more concrete the hardware implications are worked out in some detail for
the IBM System/370 architecture.
1.0 Limitations on Computer Performance
The modern high-performance computer consists of several components. One
can identify registers, storage, instruction decode units, instruction execute
units (including adders, multipliers, shifters, etc) and data channels for
communication with I/O devices. These components may not be physically
distinct, but their functions are, and control circuitry must be designed for
each function supported. In faster machines the different functions have
dedicated circuitry.
Technology improvements have speeded up these circuits enormously, but a
problem remains in that each instruction must go through a series of these
functions one at a time. For instance, to perform an add, the computer
will have to fetch the instruction from storage, decode it, generate the
operand address, fetch the operand, and route the operands through the adder
to the destination register. If these operations are performed strictly
in sequence, each individual function is utilized only a small fraction
of the time and overall system throughput is poor.
The following table illustrates the problem. Some common CPUs are shown
with their approximate instruction processing rate in millions of instructions
per second (MIPS), and major machine cycle times. The average number
of cycles per instruction completed is also presented. Each of these
machines is capable of decoding and issuing an instruction every cycle, but
their execution rates generally fall far short of this.
Note that the MIPS rates quoted are unauthenticated numbers drawn from the
literature. MIPS rates cannot be equated directly to performance, since
the amount of work done by each instruction varies widely from one machine
architecture to another.
[Table: approximate MIPS rates, major machine cycle times, and average cycles per instruction for several common CPUs, including the CDC Cyber 175 and CDC Cyber 176]
2.0 Common Solutions to Performance Problems
The problem of low utilization of CPU components is very similar to the situation
which arose with early computers. Every time the CPU required I/O, it
would initiate the operation and then wait for it to complete before proceeding,
even though the I/O unit sat idle while the CPU processed earlier data.
This problem was largely overcome by overlapping
CPU and I/O operations. I/O buffers were provided in CPU storage to
allow input devices to read in data ahead of the program's needs, and to allow
program output to stack up and await the services of the output devices. This
allowed the CPU, input and output devices to operate simultaneously instead
of serially, and substantial performance improvements resulted.
2.1 Instruction Fetch and Execute Overlap
This same technique of buffering has been used with considerable success
within CPUs. On most medium to high speed machines, instruction fetch
and execute are separate and concurrent functions. Instructions are
fetched into instruction buffers ahead of need, then executed
when the execution unit becomes available. Since main storage is considerably
slower than the CPU execution units, this approach reduces the amount of time
that the CPU has to wait for storage. Execution units have speeded up
over the years, and to keep them reasonably busy now demands that several
instructions be prefetched, and also any operands that they might need to
access in storage. Unfortunately this technique cannot be extended indefinitely
to provide more and more performance. Instruction streams contain conditional
branches, and only once they are executed does the CPU know which instruction
stream to prefetch. IBM produced a high-speed computer in the early
1960s called the 7030, or Stretch, which attempted to achieve high performance
by extended instruction prefetch. Many jobs were found to have a high
conditional branch content, and these prevented the machine from achieving
its target performance.
The IBM 3033 processor design tackles this problem by an improved ability
to handle conditional branches. It predicts more accurately the outcome
of each branch, and can prefetch instructions for up to two alternate branch
paths, by having three sets of instruction buffers. A major hardware
investment is made to try to ensure that there is always a queue of instructions
waiting for the execution unit to work on, with their operands prefetched.
Despite this, and the fact that the execution unit completes most instructions
in one or two cycles, the 3033 completes on average only one instruction
every few machine cycles.
2.2 Core Storage Buffering
On larger CPUs, the relative slowness of main storage is a major limiting
factor on performance. A sophisticated system of storage buffering was
developed for the System/360 Model 85 to alleviate this problem.
High speed buffers are inserted between the CPU and main storage so that the
instruction execution units can operate directly to and from these buffers,
providing they contain the instructions and operands that the CPU requires.
The CPU cannot know which areas of main storage to buffer in advance.
It relies on fetching on demand. This results in occasional delays,
but almost all programs iterate in confined storage areas, so subsequent iterations
have their data immediately available in the buffers. In practice more
than 90% of storage references are satisfied by the high speed buffers. This
approach has been widely used on subsequent machines.
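To illustrate why demand fetching performs so well, the following sketch (with illustrative parameters, not those of the Model 85 or any particular machine) models a demand-fetched direct-mapped buffer run against a program iterating over a confined storage area:

    LINE = 64                  # bytes per buffer line (assumed)
    LINES = 1024               # lines in the high speed buffer (assumed)

    def hit_ratio(addresses):
        cache = [None] * LINES             # tag store, empty at start
        hits = misses = 0
        for addr in addresses:
            index = (addr // LINE) % LINES
            tag = addr // (LINE * LINES)
            if cache[index] == tag:
                hits += 1
            else:
                misses += 1                # demand fetch from main storage
                cache[index] = tag
        return hits / (hits + misses)

    # A program iterating 50 times over a confined 32KB storage area: the
    # first pass warms the buffer, and subsequent passes hit nearly always.
    loop = [a for _ in range(50) for a in range(0, 32 * 1024, 4)]
    print(f"hit ratio: {hit_ratio(loop):.3f}")   # prints 0.999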
2.3 Multiple Instruction Overlap
Other attempts to improve processor throughput relied on the simultaneous
execution of several successive instructions.
2.3.1 The CDC 6600
In the mid 1960s CDC developed a new approach to the problem of CPU throughput
for their scientific processor, the 6600. This machine incorporated
an instruction preparation unit and nine autonomous execution units, each
specialized to handle only a subset of instruction types. The instruction
preparation unit was able to decode and issue one instruction every machine
cycle. Each instruction was routed to the execution unit appropriate
to its needs. Execution of these instructions then proceeded independently
in each execution unit. A high degree of execution overlap was possible.
This same design approach has been continued in the CDC Cyber 170 series.
Most of the execution units are pipelined so that they can accept a
new instruction every machine cycle. The execution units have the combined
capacity to execute about 7.5 operations per machine cycle, even though the
supply of operations can never exceed one per cycle.
On selected hand-coded programs the CDC 6600 could approach one instruction
completed per machine cycle, but on general jobstreams including online work
the throughput dropped to one instruction every four to six cycles. This
implied that the instruction preparation unit was only 20% utilized, and the
combined execution unit capacity only about 3% utilized. The reason
why this architecture could not achieve a consistent one instruction per
machine cycle rate was that, in practice, successive instructions in a program
are not independent. Instructions require operands which are produced
by preceding instructions. Unrelated operations have to make use of
the same limited number of machine registers, which introduces apparent
dependencies. Conditional branches cause potential forks in the instruction
stream, which can only be resolved when the branch has completed. The
6600 computer incorporated a "reservation control" station, which arrested
the decode and issue of instructions whenever a conditional branch or operand
interdependence was detected, until the dependence condition was resolved.
Another seminal idea was introduced with the 6600. It had ten peripheral
processors (PP) to manage all its I/O operations. To keep their costs
down these machines had very simple logic, and they shared a common arithmetic
unit (AU). Each PP had access to the AU once every ten cycles, when
it could input an arithmetic operation with its operands. The result
would be made available to the requestor ten cycles later. The AU was
pipelined so that each PP could have an operation proceeding simultaneously.
2.3.2 The IBM 360/91 and 370/195
During the late 1960s IBM developed a new model in the 360 range to cater
for high performance scientific applications. This machine, the model
91, extended the principles developed for the CDC 6600 (Reference 1). Instruction
decode, fixed point arithmetic, floating point add/subtract, and floating
point multiply/divide were all separate autonomous units. Also separate
were main storage read/write control and instruction prefetch. If no
overlap was realized in this machine, overall performance would have been
poor. However, on the Model 91 several instructions could be executed
simultaneously. The machine did not wait for one instruction to complete
before decoding and scheduling the next. Sixteen words of instructions
could be prefetched and executed at the same time. Although each individual
instruction did not process very quickly, the high degree of overlap allowed
fair utilization of all the instruction units in the Model 91 and total systems
performance was good with suitable applications.
Performance limitations arose with the Model 91 for the same reason that they
did with the CDC 6600. In general, instructions in System/360 depend
upon one another. The operands of one instruction are frequently derived
from a previous instruction. This forced the Model 91 to observe some
ordering of instruction execution. Individual instructions would be
delayed until the operands that they needed became available, even though
the instructions following them might be executing. Extensive and sophisticated
circuitry was required to detect these interdependencies and keep track of
which operands became available when, and where they were required. Independent
code sequences frequently made use of the same machine registers one after
the other, since there are a limited number of these. An ingenious algorithm
was derived by Tomasulo to resolve these apparent dependencies and eliminate them
in the floating point execution units. Even so, the performance of
the Model 91 dropped from one instruction per 1.5 cycles on scientific work
to one instruction per 6 cycles when it had to handle online workloads with
a high interrupt and conditional branching content.
The problem of handling conditional branches was just as severe for the Model
91 as for the 6600. When a branch was met in the instruction stream,
none of the instructions following it could be completed until the CPU knew
whether the branch would be taken or not. Exactly the same situation
arose when the Model 91 had to take machine interrupts. These are equivalent
to conditional branches occurring at unpredictable moments. The Model
91 attempted to reduce this disruption by hedging its bets. It prefetched
instructions past the branch, but also prefetched four words of instructions
from the branch target to use in case the branch was taken (a two-way Stretch
action). Even so, if the branch were taken the new instruction stream
had to be decoded and issued to the execution units, which idled in the interim.
The System/360 Model 195 was essentially a combination of the Models 85 and
91, having both memory buffering and concurrent instruction execution.
The concurrent instruction execution in the Models 91 and 195 was achieved
at the cost of extensive control and checking circuitry. An estimated
60% of the Model 91 circuitry existed only to detect and resolve interdependence
problems. In a normal operating environment these CPUs lost a lot of
performance to instruction interdependence holdups, conditional branches,
and interrupts. The Model 195 execution units had enough power to execute
70 million instructions per second (MIPS), but the instruction preparation
unit could issue only 18 MIPS. On selected scientific applications the
machine achieved 14 MIPS, but on general jobstreams this dropped substantially.
When running the Airline Control Program (ACP), which supports online
transaction processing, its throughput dropped to about 3 MIPS.
The problems encountered with conditional branching and interrupt handling
described above have become more severe with the passing of time. Computer
workloads have become more and more online and demand driven. This has
resulted in an increasing percentage of conditional branch execution, and
the attempts of the Models 91 and 195 to achieve concurrent execution of
many instructions have become increasingly ineffective.
2.4 Parallel and Pipeline (Array) Processing
Other high-speed computers are designed to perform instructions over complete
arrays at a time. Examples are the ILLIAC IV, CDC STAR and CRAY-1. An
Attached Vector Processor is available on IBM machines. These devices
are well suited to applications involving array operations, but are not cost-effective
when normal scalar oriented workloads are involved. For example, the
CRAY-1 processing a scalar workload completes about one instruction every
four cycles, while it has enough execution power to complete 12 operations
every machine cycle. However given a suitable program of vector arithmetic
it can get multiple execution units busy concurrently, and complete 1.75 operations
per cycle (Reference 5).
The conventional approach to multiprocessing is to connect more than one CPU
to a common memory, for instance the Burroughs D825, Univac 1100/82 or IBM
3033MP. Since both cost and performance go up together, no net improvement
in the performance/cost ratio is achieved. Indeed the overall throughput
of such systems is less than the sum of their parts, since some contention
and mutual interference must occur. Multiprocessing in this way is
usually employed to ensure systems availability by redundancy in all critical
components.
3.0 Characteristics Required for Maximum Throughput
We have discussed the problems encountered in achieving high performance on
a uniprocessor CPU. These CPUs are able to decode and issue one instruction
per major machine cycle, and have enough fire-power in the supporting execution
hardware to process about four to ten times as many instructions as they
can issue. However, when confronted with an online workload, they complete
about one instruction every four major machine cycles. This is far
short of their one instruction per cycle issue rate, and the net utilization
of their execution units is plainly poor - these units have the capacity to
process 25 or more times the operations that they actually do. This applies
to the System/360 Model 91 and 195, and the CDC Cyber 175 and 176. Clearly
it is a general problem.
What is needed is an architecture where the CPU can process various types
of workload including online work, and still get close to the target of completing
one instruction per major machine cycle, without requiring engineering overkill
in the execution units. The aggregate fire-power of the execution units should
more nearly approximate the effective instruction execution rate. Likewise the
extensive instruction prefetch of the IBM 3033, with its complex logic when
conditional branches and interrupts are encountered, entails hardware complexity
and cost which would be better avoided.
An architecture that approaches this ideal is described in the following section.
4.0 Proposed Approach - the Virtual Multiprocessor (VMP)
The preceding sections have described the designs of several high performance
computers. Since individual instructions take a relatively long time
to execute, these machines work on several at once. To do this they
contain many units that operate in parallel. Even on the workloads for
which they are optimized, these units are poorly utilized. The general
trend in computer workloads is towards more online transaction driven applications,
with a higher conditional branch and interrupt content. This frustrates
the high performance computers' attempts to organize their workload efficiently.
A better approach is to accept the fact that each individual instruction
takes a relatively long time to complete, and achieve the required throughput
by executing instructions from many independent programs concurrently. To
do this, the computing system has to have enough resources to sustain several
concurrent programs. It has to emulate multiple independent conventional
CPUs, one per active program.
The benefit of this approach is that all the instructions being worked on
at any moment will belong to separate independent programs. No effort
need be invested in controlling the sequence in which they execute. If
one particular program is held up by, for example, a buffer fault, the execution
units can proceed with the processing of other programs, and so keep themselves
well utilized, and overall systems throughput high. The proposed system
need have less execution unit capacity than current high-performance uniprocessors,
but these units can achieve far better utilization and greater throughput.
This approach can be likened to the IBM VM/370 operating system which simulates
many independent computers on one uniprocessor, so that each user appears
to have his own system. The proposed CPU is multiprogrammed at the sub-instruction
level, to emulate multiple virtual CPUs.
Another benefit can be gained by executing several programs concurrently.
Individual programs tend to concentrate on only a subset of instruction
types, such as floating point arithmetic, or character manipulation. This
results in a subset of the execution logic being heavily used while the rest
is underutilized. This problem can be solved by running a mix of program
types on VMP.
4.1 General Operation of the VMP
VMP consists of a number of interconnected components, each performing specialized
functions. This system emulates multiple CPUs, each of which executes
a separate program one instruction at a time. The instructions undergoing
execution move from one component of VMP to another, obtaining the services
that they need to complete execution. As each instruction completes,
it triggers the fetch, decode and issue of the next instruction of its program.
No attempt is made to prefetch instructions or operands for any of the
virtual CPUs.
The functional components that VMP requires are:
- Instruction Fetch, Decode and Issue unit
- Integer Arithmetic and Logic unit
- Floating Point Arithmetic unit
- Storage-to-Storage unit, for instructions with multiword operands
- Main Storage Control unit
- I/O Channels
When an instruction is decoded, its opcode is converted to a string of routing
and control bits, and an instruction parcel is issued by the Instruction Fetch
and Decode unit. This parcel contains:
- a tag to identify which virtual CPU the parcel belongs to
- a series of routing and control bits. The routing bits determine
the path of the parcel through the network, and the control bits determine
the operation performed by each component that the parcel passes through
- the program interrupt mask and condition code bits of the associated
virtual CPU, since most components read or modify these
- the rest of the parcel, which changes as it gets operated on, eg:
- in the initial parcel issued by Instruction Fetch, the instruction
minus its opcode
- after address arithmetic, a storage address
- after Main Store Control operand fetch, an operand
- after a register store operation, the register content and its target address
Every instruction parcel issued progresses independently through the network
of components in the sequence determined by its routing bits. The components
operate on the parcels to perform the services selected by their control
bits. The parcels are modified by each component before emerging. Routing
and control bits are stripped off as they are acted upon, to expose the subsequent
routing and control bits. Once an instruction has received all the
services it needs to complete, its routing bits will take it back to the
Instruction Fetch and Decode unit, so the next instruction of that program
can be issued.
A component may modify the routing and control bits of an instruction parcel
to alter its pre-planned routing. For example, a Floating Add instruction
obtaining its operand from Main Store Control may cause a storage fetch violation.
In this case Main Store Control would reroute the parcel back to Instruction
Fetch, with control bits set to direct a program check PSW load, whereas
the original routing would have taken the parcel to the Floating Point unit.
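The parcel mechanism can be sketched as a small data structure. The field layout, component names and helper below are illustrative assumptions, since the paper does not fix an encoding:

    from dataclasses import dataclass

    @dataclass
    class Parcel:
        vcpu: int                # tag identifying the owning virtual CPU
        route: list              # remaining (component, control bits) steps
        psw_bits: int            # interrupt mask and condition code bits
        payload: object = None   # instruction remainder, address, or operand

    def step(parcel, components):
        # Deliver the parcel to its next component; the used routing entry
        # is stripped off to expose the subsequent one. A component may
        # rewrite parcel.route, e.g. to redirect a storage fetch violation
        # back to Instruction Fetch for a program check PSW load.
        component, control = parcel.route.pop(0)
        components[component](parcel, control)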
The idea of multiprogramming a CPU at the sub-instruction level is very similar
to the principle of multiprogramming a byte multiplexor I/O channel, as is
common on many computers. The channel executes many channel programs
concurrently, one for each active I/O device attached to it. The channel
multiplexes itself from one channel program to the next to meet the demands
of the devices, resuming channel commands (instructions) part-way through
their execution. This approach has proven very viable for channels,
and the same approach can be applied to program instruction execution.
4.2 Registers to Support Virtual CPUs
To support each virtual CPU emulated the base machine would need to provide
distinct sets of addressable registers, instruction address registers, and
processor state registers and flags. The details depend on the machine
architecture selected. By way of example, the resources needed to implement
the System/370 architecture are detailed in this section. For each virtual
CPU, the machine would provide:
- a Program Status Word - instruction address and state flags
- 16 General Purpose Registers
- 4 Floating Point Registers
- 16 Control Registers
- 1 Instruction Buffer, one memory word wide
- 2 Storage-to-Storage Operand Address Registers
- 1 word buffer for Storage-to-Storage Operands
Figure 1. Organization of Major VMP Components for System/370 Architecture
Figure 2. VMP Main Store Control Unit for System/370 Architecture
4.3 Organization of the Major Components of VMP
A possible layout of the major components of VMP and their connections is
shown in Figures 1 and 2. Each major component includes the logic to
provide some instruction service, and the registers associated with this service.
These major components are interconnected by data paths which allow the movement
of instructions from component to component so that each instruction can obtain
the services that it needs.
- The Instruction Fetch and Decode unit includes the Instruction Address
Register of each virtual CPU
- The Floating Point unit executes Floating Point arithmetic, and includes
the Floating Point registers for each virtual CPU
- The Integer and Logic unit performs integer arithmetic, logical operations
and general instructions, and includes the General Registers of each virtual
CPU, plus an operand buffer.
- The Storage-to-Storage unit manages instructions with multi-word operands,
and includes Operand Address and Length Registers for each virtual CPU, which
it updates as the instruction proceeds.
- The Main Store Control Unit (SCU) services storage requests, manages
buffer storage and, for computers with virtual storage, it performs address
translation. For systems where storage protect keys are implemented,
the SCU contains the Protect Key of each virtual CPU. Each Storage Module
that it controls includes multiple pairs of storage address registers and
operand buffers to queue access requests.
4.4 Execution of Storage-to-Storage Instructions
System/370 architecture includes many instructions which process multi-word
operands, such as multiple-register loads and stores, and character vector
moves and compares. These are handled in VMP by the Storage-to-Storage
unit (SSU). This unit uses the Integer and Logic unit to calculate the
operand starting addresses, then uses the same unit iteratively to process
the multi-word operands one word at a time.
The SSU reconstitutes the routing and control bits of the instruction parcel
on each iteration to ensure that the parcel keeps recirculating through the
functional units until its operands have been completely processed. It
contains two operand address registers and a length register for each virtual
CPU, and an adder to update them as the operands are processed. The
Integer and Logic unit contains an operand buffer for each virtual CPU to
support the processing of multi-word operands.
During the execution of the interruptible instructions MVCL and CLCL, the
SSU will detect a hardware interrupt pending condition should one arise, and
will route these instruction parcels back to instruction fetch and decode
to allow interrupts to be taken.
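The word-at-a-time recirculation can be sketched as follows; storage.read and storage.write stand in for requests to Main Store Control, and the helper name is an illustrative assumption:

    def ssu_move(storage, src, dst, length, word=8):
        # Move `length` bytes one storage word at a time, as the SSU would
        # by reconstituting the parcel's routing bits on every iteration.
        moved = 0
        while moved < length:
            data = storage.read(src + moved)    # fetch one source word
            storage.write(dst + moved, data)    # store it at the target
            moved += word                       # SSU adder updates addresses
            # For interruptible instructions (MVCL, CLCL) the SSU would
            # check for a pending interrupt here and, if one is found,
            # route the parcel back to Instruction Fetch and Decode.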
4.5 Examples of Instruction Routes through VMP
To illustrate how VMP would function, some examples are given to show the
routes which various instructions would take through the network of major
components. Each example starts with issue from Instruction Fetch and
Decode. The lettering in parentheses in the following examples indicates the major
component involved in each operation, using the lettering of Figure 1. IAR
denotes Instruction Address Register.
- AER - Floating Point add to register
(A) Decode Instruction, issue parcel
(D) Floating Point add
(A) add 2 to IAR
- AE - Floating Point add, storage to register
(A) Decode Instruction, issue parcel
(C) calculate operand address
(E) fetch operand
(D) Floating Point add
(A) add 4 to IAR
- STE - Floating Point store
(A) Decode Instruction, issue parcel
(C) calculate store address
(D) get register contents, issue with address
(E) store register contents in Main Store
(A) add 4 to IAR
- BC - Branch on Condition
(A) Fetch and decode Instruction, issue parcel
(C) calculate branch target address
If branch is taken, route to:
(A) place branch target in IAR
Else branch is not taken, route to:
(A) add 4 to IAR
- SVC - Supervisor Call
(A) Fetch and decode Instruction,
issue SVC old PSW address
(E) store old PSW
(A) issue SVC new PSW address
(E) load new PSW
(A) replace IAR
- MVC - Move two words storage to storage
(A) Decode Instruction, issue parcel
(B) save operand length(s)
(C) calculate target operand address
(B) save target operand address
(A) get source operand address
(C) calculate source operand address
(B) save source operand address
(E) fetch first word of source operand
(B) output source operand, and target address
(E) store first word at target address
(B) increment and output source operand address
(E) fetch second word of source operand
(B) output source operand, and new target address
(E) store second word at target address
(A) add 6 to IAR
- BAL - Branch and Link
(save return address and branch)
(A) Decode Instruction, issue IAR in parcel
(C) save IAR in nominated register
(A) re-issue parcel with branch address
(C) calculate branch target address
(A) place branch target in IAR
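The examples above can be summarized as a routing table, mapping each opcode to the sequence of components (lettered as in Figure 1) that its parcel visits; control bits are omitted for brevity:

    ROUTES = {
        "AER": ["A", "D", "A"],            # register-register floating add
        "AE":  ["A", "C", "E", "D", "A"],  # operand fetched from storage first
        "STE": ["A", "C", "D", "E", "A"],  # register contents stored last
        "SVC": ["A", "E", "A", "E", "A"],  # old PSW stored, new PSW loaded
        "BAL": ["A", "C", "A", "C", "A"],  # link saved, then target computed
    }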
5.0 Performance Constraints and Proposed Solutions for VMP
We have seen in Section 1 that current high performance computers execute
on average only one instruction every three to six machine cycles, despite
large investments in logic circuits to try to speed them up. The performance
goal for VMP is to approach one instruction completion per cycle with minimal
investment in circuitry to detect and resolve instruction interdependencies.
This target imposes certain design constraints, which are described below.
5.1 Implementation of Register Storage
The most straightforward way of implementing registers in VMP would be as
a single high speed storage array. This would however significantly
limit the performance of VMP. Every instruction execution requires access
to its instruction counter for readout and update, access to registers for
address arithmetic, and access to registers for operands. A single store
array would only be able to service one such request per cycle, and so would
limit throughput to one instruction every three to five cycles.
The approach proposed for VMP is to segregate the registers by type, and incorporate
each type within the functional unit which references it most frequently.
This would have several advantages:-
- Different register types could be accessed simultaneously by the various
functional units which contain them.
- Transmission of data between functional units would be minimised, thus
reducing the contention for data paths.
- Functional Units usually operate on an internal cycle which is much
faster than that of the machine as a whole (Reference 1). Registers
integrated within functional units would be physically adjacent to the circuits
which access them, and could be accessed each minor cycle of the functional
unit. Hence an add instruction referencing two registers for operands
and updating a third would consume only three minor cycles of register access time.
5.2 Interconnection of VMP Components
The components of VMP cannot be interconnected via a common bus. Although
a common bus leads to a very simple design, the average instruction in System/370
architecture would need to be transported three to four times. Each
transport would take a full cycle of bus time, so overall VMP throughput would
be limited to one instruction per three or four cycles, which is far
below the performance target. For this reason, separate bussing is
needed between the components which communicate with each other.
The concept of VMP is that no major component can accept more than one parcel
per cycle, and that those components which are heavily used such as main store
control, and instruction fetch and decode, should be able to accept a new
parcel every machine cycle. Should they require longer than one cycle
to service a request, pipelining and storage interleaving is assumed. Since
each component has multiple possible predecessors, it may be presented with
more than one parcel per cycle. It will only be able to accept one,
so the other parcels will have to wait. This implies that each major
component may have to buffer parcels it has completed, before being able
to issue them. The components will be able to continue accepting and
processing new parcels as long as their output buffers are not all full (or
committed to work in progress in their pipelines).
For some operations the amount of data in a parcel may exceed the width of
the data paths of VMP, for example a store operation with an operand and its
address. When this occurs, transmission of the instruction parcel takes
place over two or more consecutive cycles.
5.3 Performance Prediction for VMP
Given the above design approach, the maximum throughput rate which VMP could
sustain would be one instruction every major machine cycle. However
with normally encountered workloads the design shown in Figure 1 will not
sustain this rate. Given assumptions on the instruction mix to be executed
by VMP, one can model the system as a Markov process and use the state transition
probabilities to deduce how busy each component would be. With the arrangement
shown in Figure 1 the storage unit would probably be a bottleneck, as illustrated
by the following assumptions, taken from (Reference 1):
- eight byte storage word
- average instruction 0.45 words long
- 80% of instructions reference an operand in Main Store
Then for each instruction executed there will be 0.45 Main Store references
for instruction fetch, and 0.8 for operand reference, totalling 1.25 per instruction
executed. But of the 80% of instructions which reference storage operands,
some will be branches - say 25%. Their storage references will actually
be to their branch targets. We can account for this by decreasing the
operand reference rate to 0.6 per instruction, and increasing the instruction
fetch rate to about 0.56. This leads to the conclusion that VMP will average
1.16 storage references per instruction completed.
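The arithmetic above, and the throughput estimates in the next paragraph, can be checked with a few lines; the only assumption is that the SCU accepts one storage request per machine cycle:

    refs_per_instruction = 0.56 + 0.60   # instruction fetch + operand refs
    assert abs(refs_per_instruction - 1.16) < 1e-9

    for machine, cycle_ns in [("360/195", 54.0), ("Cyber 175", 25.0),
                              ("CRAY-1", 12.5)]:
        requests_per_second = 1e9 / cycle_ns    # one SCU request per cycle
        mips = requests_per_second / refs_per_instruction / 1e6
        print(f"{machine}: about {mips:.0f} MIPS")   # ~16, ~34, ~69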
If the SCU is the performance bottleneck, then with the 54 nanosecond (ns)
cycle of the 195 VMP would process 16 million instructions per second (MIPS);
with the 25 ns cycle of the Cyber 175, 35 MIPS; and with the 12.5 ns cycle of
the CRAY-1, 70 MIPS. Another potential limitation arises from the assumption
that 80% of the instructions reference storage, and hence require the services
of the integer and logic unit for address arithmetic. If many of the
instructions (re-)use the integer and logic unit to access a general register,
then this unit could become a bottleneck.
In order to make a proper evaluation an accurate model would have to be set
up for each machine architecture of interest.
5.4 Number of Virtual CPUs in VMP
The number of CPUs that VMP emulates determines the depth of multiprogramming
its components support. This number should somewhat exceed the number
of major machine cycles that the average instruction takes to complete in
VMP. For a machine using components similar to those in a System/360
Model 195, about ten virtual CPUs would be adequate.
5.5 VMP Storage Control Unit
The VMP Storage Control Unit (SCU) performance is just as crucial to VMP's
throughput as it is in a conventional uniprocessor design. The SCU supporting
VMP must be able to accept a new storage access request every machine cycle.
However the SCU need not respond after only one cycle, as is the case
with the IBM 3033. Address translate, buffer directory lookup, and
buffer access may take place on consecutive cycles providing these functions
are pipelined to support one request per cycle. This would result in
individual requests taking longer, but VMP could still achieve its target
throughput if its multiprogramming depth is adequate. This process
of reducing the amount of work done in each machine cycle allows the cycle
to be shortened, thus achieving greater throughput.
The SCU in Figure 2 follows the conventional IBM 303x design, except a buffer
miss condition does not tie up the buffer during main store access - the failing
request is sent to the appropriate (interleaved) storage module for service,
and the buffer continues to service other requests.
5.5.1 Management of Paging by the Storage Control Unit
Computer workloads are tending more and more towards online processing. The
demand for main storage to contain these applications is growing rapidly.
Common solutions to this problem are program overlays, swapping programs
to disk, and virtual storage systems. Control Data have long used Extended
Core Storage (ECS) as a swapping device. Programs resident in ECS cannot
execute; they are swapped into main store when needed by a block data move.
operating systems a substantial overhead is incurred by each page fault serviced.
Scheduling the I/O, taking the I/O interrupt and task switching cost
time. Now that charge-coupled devices (CCD) are commercially available,
demand paging could be brought under the control of the SCU. The CCD
could be directly accessible to the SCU as ECS is on Cyber machines. VMP
would detect page faults, and suspend the virtual CPU involved until it had
swapped in the needed page(s). During this time the other virtual CPUs
would proceed with their programs, so the overall throughput could still
be maintained.
so the overheads currently incurred by managing page exceptions and page I/O
would be avoided. This process would be very similar to the hardware-managed
main storage buffering described in section 2.2.
Hardware managed paging also has implications for the operation of the I/O
channels. They generally operate out of real storage addresses provided
to them by the operating system. If the SCU managed address translation
then the channels would have to operate with virtual storage addresses. Each
ongoing I/O operation would need a pointer to the address translate tables
current when it was started.
Given the above VMP-managed paging system, there would be no need for
a two-level storage hierarchy between the CPU and the paging device. One
level of storage, of intermediate size, speed and page size would suffice.
For example, a machine with a 50 nanosecond cycle time, executing 17
MIPS, should cope with 2MB of buffer storage. This storage should be
capable of accepting one storage request per cycle, but need only respond
after four to five cycles. A 256 byte page would be suitable. This
setup should result in about 20000 page swaps per second. Four interleaved
paging CCD devices with a transfer rate of 2MB per second would be able to
support this paging rate. The total amount of CCD storage would have
to be sufficient to contain the swapped-in programs. For a 17 MIP machine,
30MB should suffice.
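These figures can be checked as follows, counting each page swap as a single page transfer:

    cycle_ns = 50.0
    mips = 1e9 / cycle_ns / 1.16 / 1e6       # about 17 MIPS, as above
    page_bytes = 256
    swaps_per_second = 20_000

    demand = swaps_per_second * page_bytes   # about 5.1 MB/s of page traffic
    supply = 4 * 2_000_000                   # four interleaved CCDs at 2 MB/s
    print(round(mips), demand <= supply)     # prints: 17 True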
6.0 VMP Compared to Other Virtual Multiprocessor Proposals
Other publications have suggested approaches to multiprocessing which have
similarities to the VMP architecture described in this paper.
Reference 2 proposes the time-division multiprogramming of the circuits of
a minicomputer. Each instruction follows a fixed routing through the
instruction decode and execute steps. This approach would not meet the
needs of a large general purpose computer architecture, since the variance
in instruction types and the execution resources that they require could not
be satisfied by a fixed sequence of services.
Reference 3 proposes an extension of the principles which the CDC 6600
Peripheral Processors use, as described in Section 2.3.1. A collection
of simple autonomous processors share a set of central execution resources
for the more complex instructions that they have to execute. The simple
functions of instruction fetch and decode, branching, and operand fetch and
store are handled by the autonomous processors. The central facilities
are shared on a fairly rigid time-division basis. In practice, a program
will often not want the central resource when its "turn" comes around. When
it does take its "turn", the program will have to transmit its instructions
plus one or two operands and some process state data (eg program interrupt
masks) to the central execution resource, in one machine cycle. This
implies a very wide data path, which would necessitate a relatively long machine cycle.
The major virtues of the proposed VMP approach compared with the above are:
- Each process supported by VMP contends for only the resources that it
needs, in the sequence that it needs them. Hence optimum resource usage
should be achieved.
- The state registers required to support each process are distributed
over the network of components which implement VMP. Each component will
have very rapid and uncontested access to the state registers it most often
has to access.
7.0 VMP Interrupt Handling Philosophy
This section details how VMP could handle hardware interrupts in a way compatible
with IBM System/370 architecture. Much of the discussion will be applicable
to other interrupt driven machine architectures.
7.1 Response to Maskable Interrupts
Supervisor code to handle interrupts usually runs disabled to the type of
interrupt it is handling until it has fully recorded the details of the interrupt.
To ensure the correct operation of such code, VMP will have to ensure
that only one virtual CPU at a time can disable itself to any particular interrupt
type. For all maskable System interrupts (e.g. I/O interrupts, External
interrupts, and Machine Check interrupts) VMP will maintain an overall system-wide
interrupt mask which is the logical AND of the interrupt masks in each virtual
CPU's state registers. Thus, should an External Interrupt condition become
pending and any one of the virtual CPUs is disabled to External Interrupts,
VMP will not honour the interrupt until the virtual CPU re-enables itself
to External Interrupts by changing its interrupt mask.
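A sketch of the system-wide mask follows; the bit assignments are illustrative, and a set bit means the interrupt type is enabled:

    EXTERNAL, IO, MACHINE_CHECK = 0b001, 0b010, 0b100   # illustrative bits

    def system_mask(vcpu_masks):
        mask = EXTERNAL | IO | MACHINE_CHECK
        for m in vcpu_masks:
            mask &= m        # any one disabled virtual CPU blocks that type
        return mask

    # One virtual CPU disabled to External Interrupts holds them pending
    # system-wide until it re-enables itself:
    print(bool(system_mask([0b111, 0b110, 0b111]) & EXTERNAL))   # False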
If a virtual CPU attempts to disable itself to an interrupt type already seized
by another, VMP will suspend execution of this virtual CPU until the one
in possession of the interrupt type re-enables itself to that interrupt type
again. Hence code executing disabled to any interrupt type will automatically
be serialized by VMP. Virtual CPUs awaiting disabled state would use
no resources and cause no interference. The remaining enabled virtual
CPUs will continue to process instructions at the normal rate.
7.2 Supervisor Call and Program Check Interrupts
Supervisor Call and Program Check interrupts are usually private to the virtual
CPU causing them, and result in a branch; no other virtual CPU is affected.
Once the interrupt has been handled by the Supervisor, control is usually
returned to the invoking program. VMP regards the whole sequence as private
to the virtual CPU involved. However, the Interrupt Handling code may update
Supervisor information over which data integrity must be ensured. See
Section 8, "Software Support of VMP".
7.3 I/O and External Interrupts
When an I/O or External interrupt occurs, the program expecting the interrupt
may or may not be executing. In response to such interrupts, VMP suspends
any one of its active virtual CPUs, stores its Program Status Word (PSW) in
the appropriate storage location, and loads the appropriate new PSW, much
as a conventional System/370 now does. All the other active virtual
CPUs continue with their own programs. There is no need for a total
cycle-down of work as with the System/360 Model 91, nor any start-up delay
in fetching a new instruction stream. However, the integrity of System
Software control blocks and data must be ensured, see Section 8, "Software
Support of VMP".
Once the interrupt has been processed, a Task Switch may be necessary. One
of the virtual CPUs will suspend its program and restart the program needing
attention. The other virtual CPUs will not be affected.
7.4 Machine Check Interrupts
This is obviously a very complex area. When such a check occurs, VMP
must determine whether the fault is transient and can be recovered from, or,
if not, then whether it can be localised to a particular virtual CPU. If
it can, then that CPU takes the machine check interrupt. Depending on
the severity of the error, VMP may suspend all other processing activity.
However, if the fault is unlikely to affect other virtual CPUs (e.g.
a parity error in main storage) then VMP may allow them to continue while
the afflicted virtual CPU attempts hardware or software recovery.
8.0 Software Support of VMP
The techniques of software support for VMP are highly dependent upon the machine
architecture and operating system chosen, and a detailed treatment of the
topic is beyond the scope of this paper. Most major computer lines
and their operating systems support multiprocessing, and so could support
the VMP approach without much change. Some general observations are
made below.
8.1 Uniprocessor Compatible Mode
A Machine Reset should put all of the VMP virtual CPUs into a stopped state.
IPL (initial program load by hardware) would activate only one of the
virtual CPUs, which could complete the IPL and initialize the system. Unless
specifically activated, no other virtual CPU would run - the VMP would run
in uniprocessor mode. Hence, all existing programs and software systems
which run on a uniprocessor would run on VMP, with the usual timing dependencies.
8.2 Initiating Multiprocessing
A new privileged instruction, Exchange Program Status Word (EPSW) must be
introduced with the VMP. This instruction's operand would designate
a virtual CPU number. The effect of the instruction would be to interrupt
the designated CPU and cause it to enter a Supervisory routine to save the
status of the program it was executing (if any), and start executing another.
If a virtual CPU finds no task ready to run, it will load a Wait State PSW,
which will render it inactive. The Supervisor will maintain a table
in storage indicating for each virtual CPU either that it is inactive, or
else the address of the Task which it is currently executing.
8.3 Ensuring the Integrity of Supervisor Data
The integrity of much supervisor data would be lost if multiple CPUs were
allowed simultaneous access. Such data is usually protected by two mechanisms.
When running on a uniprocessor, serial access to interrupt handling
code and tables is ensured by running disabled to that interrupt type. On
a multiprocessor, integrity is ensured by use of a set of exclusive locks.
From the description of VMP given, it is clear that the technique of running
disabled to an interrupt type while handling it will work as well on VMP
as it does on a conventional uniprocessor. In addition exclusive locks
in storage can be used to ensure system integrity. With the current
multi-CPU approach to multiprocessing, contention for locks often results
in CPUs idling while they wait for control of the lock that they need. To
reduce this contention for locks, VMP could offer (and even standard multiprocessors
should offer) a small associative memory to contain the most recently contended-for
locks. The CPUs waiting for these locks would be suspended until the
Storage Control Unit detected that the lock had been modified.
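One possible shape for this mechanism is sketched below; compare_and_swap, suspend and resume are illustrative stand-ins for the hardware operations involved:

    class LockWatch:
        def __init__(self):
            self.waiters = {}    # lock address -> virtual CPUs suspended on it

        def try_lock(self, vcpu, storage, addr):
            if storage.compare_and_swap(addr, old=0, new=vcpu.number):
                return True                    # lock seized
            self.waiters.setdefault(addr, []).append(vcpu)
            vcpu.suspend()                     # uses no cycles while parked
            return False

        def on_store(self, addr):
            # Called by the Storage Control Unit whenever a watched lock
            # word is modified; the parked virtual CPUs retry the lock.
            for vcpu in self.waiters.pop(addr, []):
                vcpu.resume()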
8.4 Determining the Current Program
At any moment many different programs may be executing on VMP. Each virtual
CPU must be able to identify which program it is executing. To do this
requires an instruction which returns to the virtual CPU its processor number.
This number can then be used to index over an array of program identifiers.
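A sketch of the lookup, where vcpu_number stands in for the value returned by the proposed instruction:

    def current_program(program_ids, vcpu_number):
        # Index the per-virtual-CPU array of program identifiers with the
        # number returned by the proposed instruction.
        return program_ids[vcpu_number]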
8.5 Accounting for Machine Time
A very strong requirement for a computer is that it must account for the time
that it spends executing problem programs. Since VMP executes many
problem programs simultaneously, and they compete with one another for resources,
the amount of CPU time that they spend to execute a fixed amount of work
will vary each time that they run. To provide a measure of CPU work
done that does not vary from one execution to another, VMP could maintain
the path length executed by each program as it runs. This requires less
work than counting instructions, since VMP would only have to do arithmetic
when a program branched or was interrupted - subtract the branch/interrupt
target address, and add the old Instruction Address Register content to that
program's path-length total.
A more accurate but expensive solution would be for VMP to maintain a counter
for each virtual CPU, and increment it for each instruction executed by that
virtual CPU. The increment would depend on the instruction opcode, so
long instructions would be more heavily weighted. For storage-to-storage
instructions, the operand length could be taken as the instruction weighting.
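The first scheme can be sketched as follows; arithmetic occurs only when a sequential run of instructions is closed off by a branch or interrupt:

    class PathLength:
        def __init__(self, start_iar=0):
            self.total = 0
            self.run_start = start_iar   # IAR where the current run began

        def branch_or_interrupt(self, old_iar, target):
            # Add the old IAR and subtract the new target, as described
            # above: the difference closes off the bytes executed in the
            # sequential run just ended.
            self.total += old_iar - self.run_start
            self.run_start = target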
9.0 Conclusions
This paper describes a way of achieving a high instruction throughput rate
on a relatively modest amount of hardware, by multiprogramming the CPU at
the sub-instruction level. The design will cope well with current computer
workloads in online environments, where conditional branches and interrupts
are frequent.
This paper has shown that when current high performance computers execute
jobstreams that include online work, they complete only one instruction every
four machine cycles, while VMP should achieve close to one instruction per
cycle. This level of performance can be realized without the large investments
in complexity which current high performance computers make in order to reduce
the number of cycles per instruction.
The paper has also shown how the new rapid access CCD and bubble memory devices
can be incorporated into a hardware managed storage hierarchy. The software
overheads currently associated with handling I/O would otherwise preclude
the full exploitation of these devices for paging in virtual storage systems.
- "The IBM System/360 Model 91 : Machine Philosophy", IBM Journal of R&D
11 : No 1 (Jan 1967)
- LE Shar and ES Davidson, "A Multiminiprocessor System Implemented through
Pipelining", Computer Feb 1974
- MJ Flynn et al, "A Multiple Instruction Stream Processor with Shared
Resources", Parallel Processor Systems, Technologies and Applications
- MJ Flynn, "Towards More Efficient Computer Organizations", Spring Joint
Computer Conference 1972
- RM Russell, "The CRAY-1 Computer System", Communications of the ACM
Vol 21 no. 1 January 1978.