CDC 6600 explained

The CDC 6600 was a mainframe computer from Control Data Corporation, first (first in the United States) delivered in 1964 to the Lawrence Radiation Laboratory, part of the University of California at Berkeley. It was used primarily for high energy nuclear physics research, particularly for the analysis of nuclear events photographed inside the Alvarez bubble chamber. The very first CDC 6600 was delivered about one year earlier to Conseil Européen pour la Recherche Nucléaire (CERN) near Geneva, Switzerland also for use in high energy nuclear physics research. It is generally considered to be the first successful supercomputer, outperforming its fastest predecessor, IBM 7030 Stretch, by about three times. With performance of about 1 megaFLOPS, it remained the world's fastest computer from 1964–69, when it relinquished that status to its successor, the CDC 7600.

The system organization of the CDC 6600 was used for the simpler (and slower) CDC 6400, and later a version containing two 6400 processors known as the CDC 6500. These machines were instruction-compatible with the 6600, but ran slower due to a much simpler and more sequential processor design. The entire family is now referred to as the CDC 6000 series. The CDC 7600 was originally to be compatible as well, starting its life as the CDC 6800, but during the design compatibility was dropped in favor of outright performance. While the 7600 CPU remained compatible with the 6600, allowing portable user code, the peripheral processor units (PPUs) were different, requiring a different operating system.

A CDC 6600 is on display at the Computer History Museum in Mountain View, California.

History and impact

See main article: Control Data Corporation.

CDC's first products were based on the machines designed at ERA, which Seymour Cray had been asked to update after moving to CDC. After an experimental machine known as the Little Character, they delivered the CDC 1604, one of the first commercial transistor-based computers, and one of the fastest machines on the market. Management was delighted, and made plans for a new series of machines that were more tailored to business use; they would include instructions for character handling and record keeping for instance. Cray was not interested in such a project, and set himself the goal of producing a new machine that would be 50 times faster than the 1604. When asked to complete a detailed report on plans at one and five years into the future, he wrote back that his five year goal was "to produce the largest computer in the world", "largest" at that time being synonymous with "fastest", and that his one year plan was "to be one-fifth of the way".

Taking his core team to new offices nearby the original CDC headquarters, they started to experiment with higher quality versions of the "cheap" transistors Cray had used in the 1604. After much experimentation, they found that there was simply no way the germanium-based transistors could be run much faster than those used in the 1604. The "business machine" that management had originally wanted, now forming as the CDC 3000 series, pushed them about as far as they could go. Cray then decided the solution was to work with the then-new silicon-based transistors from Fairchild Semiconductor, which were just coming onto the market and offered dramatically improved switching performance.

During this period, CDC grew from a startup to a large company and Cray became increasingly frustrated with what he saw as ridiculous management requirements. Things became considerably more tense in 1962 when the new CDC 3600 started to near production quality, and appeared to be exactly what management wanted, when they wanted it. Cray eventually told CDC's CEO, William Norris that something had to change, or he would leave the company. Norris felt he was too important to lose, and gave Cray the green light to set up a new lab wherever he wanted.

After a short search, Cray decided to return to his home town of Chippewa Falls, WI, where he purchased a block of land and started up a new lab. Although this process introduced a fairly lengthy delay in the design of his new machine, once in the new lab things started to progress quickly. By this time, the new transistors were becoming quite reliable, and modules built with them tended to work properly on the first try. Working with Jim Thornton, who was the system architect and the 'hidden genius' behind the 6600, the machine soon took form.

More than 100 CDC 6600s were sold over the machine's lifetime. Many of these went to various nuclear bomb-related labs, and quite a few found their way into university computing labs. Cray immediately turned his attention to its replacement, this time setting a goal of 10 times the performance of the 6600, delivered as the CDC 7600. The later CDC Cyber 70 and 170 computers were very similar to the CDC 6600 in overall design.

Description

Typical machines of the era used a single complex CPU to drive the entire system. A typical program would first load data into memory (often using pre-rolled library code), process it, and then write it back out. This required the CPUs to be fairly complex in order to handle the complete set of instructions they would be called on to perform. A complex CPU implied a large CPU, introducing signalling delays while information flowed between the individual modules making it up. These delays set a maximum upper limit on performance, the machine could only operate at a cycle speed that allowed the signals time to arrive at the next module.

Cray took another approach. At the time, CPUs generally ran slower than the main memory they were attached to. For instance, a processor might take 15 cycles to multiply two numbers, while each memory access took only one or two. This meant there was a significant time where the main memory was idle. It was this idle time that the 6600 exploited.

Instead of trying to make the CPU handle all the tasks, the 6600 CPUs handled arithmetic and logic only. This resulted in a much smaller CPU which could operate at a higher clock speed. Combined with the faster switching speeds of the silicon transistors, the new CPU design easily outperformed everything then available. The new design ran at 10 MHz (100 ns cycle), about ten times faster than other machines on the market. In addition to the clock being faster, the simple processor executed instructions in fewer clock cycles; for instance, the CPU could complete a multiplication in ten cycles.

However, the CPU could only execute a limited number of simple instructions. A typical CPU of the era had a complex instruction set, which included instructions to handle all the normal "housekeeping" tasks such as memory access and input/output. Cray instead implemented these instructions in separate, simpler processors dedicated solely to these tasks, leaving the CPU with a much smaller instruction set. (This was the first of what later came to be called reduced instruction set computer (RISC) design.) By allowing the CPU, peripheral processors (PPs) and I/O to operate in parallel, the design considerably improved the performance of the machine. Under normal conditions a machine with several processors would also cost a great deal more. Key to the 6600's design was to make the I/O processors, known as peripheral processors (PPs), as simple as possible. The PPs were based on the simple 12-bit CDC 160A, which ran much slower than the CPU, gathering up data and "squirting" it into main memory at high speed via dedicated hardware.

The machine as a whole operated in a fashion known as barrel and slot, the "barrel" referring to the ten PPs, and the "slot" the main CPU. For any given slice of time, one PP was given control of the CPU, asking it to complete some task (if required). Control was then handed off to the next PP in the barrel. Programs were written, with some difficulty, to take advantage of the exact timing of the machine to avoid any "dead time" on the CPU. With the CPU running much faster than on other computers, each memory access took ten CPU clock cycles to complete, so by using ten PPs, each PP was guaranteed one memory access per machine cycle.

The 10 PPs were implemented virtually; there was CPU hardware only for a single PP. This CPU hardware was shared and operated on 10 PP register sets which represented each of the 10 PP states (similar to modern multithreading processors). The PP register barrel would "rotate", with each PP register set presented to the "slot" which the actual PP CPU occupied. The shared CPU would execute all or some portion of a PP's instruction whereupon the barrel would "rotate" again, presenting the next PP's register set (state). Multiple "rotations" of the barrel were needed to complete an instruction. A complete barrel "rotation" occurred in 1000 nanoseconds (100 nanoseconds per PP), and an instruction could take from 1 to 5 "rotations" of the barrel to be completed, or more if it was a data transfer instruction.

The basis for the 6600 CPU is what would today be referred to as a RISC system, one in which the processor is tuned to do instructions which are comparatively simple and have limited and well-defined access to memory. The philosophy of many other machines was toward using instructions which were complicated — for example, a single instruction which would fetch an operand from memory and add it to a value in a register. In the 6600, loading the value from memory would require one instruction, and adding it would require a second. While slower in theory due to the additional memory accesses, the PPs offloaded this expense. This simplification also forced programmers to be very aware of their memory accesses, and therefore code deliberately to reduce them as much as possible.

Central Processor (CP)

The Central Processor (CP) and main memory of the 6400, 6500, and 6600 machines had a 60-bit word length. The Central Processor had eight general purpose 60-bit registers X0 through X7, eight 18-bit address registers A0 through A7, and eight 18-bit scratchpad registers B0 through B7. B0 was held permanently at zero by the hardware; many programmers found it useful to set B1 to 1 and then treat it as similarly inviolate.

The CP had no instructions for input and output, which are accomplished through Peripheral Processors (below). No opcodes were specifically dedicated to loading or storing memory; this occurred as a side effect of assignment to certain A registers. Setting A1 through A5 loaded the word at that address into X1 through X5 respectively; setting A6 or A7 stored a word from X6 or X7. No side effects were associated with A0.A separate hardware load/store unit handled the actual data movement independently of the operation of the instruction stream, allowing other operations to complete while memory was being accessed, which required (best case) eight cycles.

The 6600 CP included 10 parallel functional units, allowing multiple instructions to be worked on at the same time. Today, this is known as a superscalar design, but it was unique for its time. Unlike most modern CPU designs, functional units were not pipelined; the functional unit would become busy when an instruction was "issued" to it and would remain busy for the entire time required to execute that instruction. (By contrast, the CDC 7600 introduced pipelining into its functional units.) In the best case, an instruction could be issued to a functional unit every 100 ns clock cycle. The system read and decoded instructions from memory as fast as possible, generally faster than they could be completed, and fed them off to the units for processing. The units were:

Floating-point operations were given pride of place in this architecture: the CDC 6600 (and kin) stand virtually alone in being able to execute a 60-bit floating point multiplication in time comparable to that for a program branch.

Previously executed instructions were saved in an eight-word cache, called the "stack". In-stack jumps were quicker than out-of-stack jumps because no memory fetch was required. The stack was flushed by an unconditional jump instruction, so unconditional jumps at the ends of loops were conventionally written as conditional jumps that would always succeed.

The system used a 10-megahertz clock, but used a four-phase signal, so the system could at times effectively operate at 40 MHz. A floating-point multiplication took ten cycles, a division took 29, and the overall performance, taking into account memory delays and other issues, was about 3 MFLOPS. Using the best available compilers, late in the machine's history, FORTRAN programs could expect to maintain about 0.5 MFLOPS.

Memory organization

User programs are restricted to use only a contiguous area of main memory. The portion of memory to which an executing program has access is controlled by the RA (Relative Address) and FL (Field Length) registers which are not accessible to the user program. When a user program tries to read or write a word in central memory at address a, the processor will first verify that a is between 0 and FL-1. If it is, the processor accesses the word in central memory at address RA+a. This process is known as base-bound relocation; each user program sees core memory as a contiguous block words with length FL, starting with address 0; in fact the program may be anywhere in the physical memory. Using this technique, each user program can be moved ("relocated") in main memory by the operating system, as long as the RA register reflects its position in memory. A user program which attempts to access memory outside the allowed range (that is, with an address which is not less than FL) will trigger an interrupt, and will be terminated by the operating system. When this happens, the operating system may create a core dump which records the contents of the program's memory and registers in a file, allowing the developer of the program a means to know what happened. Note the distinction with virtual memory systems; in this case, the entirety of a process's addressable space must be in core memory, must be contiguous, and its size cannot be larger than the real memory capacity.

All but the first seven CDC 6000 series machines could be configured with an optional Extended Core Storage (ECS) system. ECS was built from a different variety of core memory than was used in the central memory. This made it economical for it to be both larger and slower. The primary reason was that ECS memory was wired with only two wires per core (contrast with 5 for central memory), Because it performed very wide transfers, its sequential transfer rate was the same as that of the small core memory. A 6000 CPU could directly perform block memory transfers between a user's program (or operating system) and the ECS unit. Wide data paths were used, so this was a very fast operation. Memory bounds were maintained in a similar manner as central memory — with an RA/FL mechanism maintained by the operating system. ECS could be used for a variety of purposes, including containing user data arrays that were too large for central memory, holding often-used files, swapping, and even as a communication path in a multi-mainframe complex.

Peripheral Processors (PPs)

To handle the 'household' tasks which other designs put in the CPU, Cray included ten other processors, based partly on his earlier computer, the CDC 160A. These machines, called Peripheral Processors, or PPs, were full computers in their own right, but were tuned to performing I/O tasks and running the operating system. One of the PPs was in overall control of the machine, including control of the program running on the main CPU, while the others would be dedicated to various I/O tasks — quite similarly to I/O channels in IBM mainframes of the time. When the program needed to perform some sort of I/O, it instead loaded a small program into one of these other machines and let it do the work. The PP would then inform the CPU when the task was complete with an interrupt.

Each PP included its own memory of 4096 12-bit words. This memory served for both for I/O buffering and program storage, but the execution units were shared by 10 PPs, in a configration called the Barrel and slot. This meant that the execution units (the "slot") would execute one instruction cycle from the first PP, then one instruction cycle from the second PP, etc. in a round robin fashion. This was done both to reduce costs, and because access to CP memory required 10 PP clock cycles: when a PP accesses CP memory, the data is available next time the PP receives its slot time.

Wordlengths, characters

The central processor had 60-bit words, whilst the peripheral processors had 12-bit words. CDC used the term "byte" to refer to 12-bit entities used by peripheral processors; characters were 6-bit, and central processor instructions were either 15 bits, or 30 bits with a signed 18-bit address field, the latter allowing for a directly addressable memory space of 128K words of central memory (converted to modern terms, with 8-bit bytes, this is 0.94 MB). The signed nature of the address registers limited an individual program to 128K words. (Later CDC 6000-compatible machines could have 256K or more words of central memory, budget permitting, but individual user programs were still limited to 128K words of CM.) Central processor instructions started on a word boundary when they were the target of a jump statement or subroutine return jump instruction, so no-operations were sometimes required to fill out the last 15, 30 or 45 bits of a word.

The 6-bit characters, in an encoding called display code, could be used to store up to 10 characters in a word. They permitted a character set of 64 characters, which is enough for all upper case letters, digits, and some punctuation. Certainly, enough to write FORTRAN, or print financial or scientific reports. There were actually two variations of the display code character sets in use, 64-character and 63-character. The 64-character set had the disadvantage that two consecutive ':' (colon) characters might be interpreted as the end of a line if they fell at the end of a 10-byte word. A later variant, called 6/12 display code, was also used in the Kronos and NOS timesharing systems to allow full use of the ASCII character set in a manner somewhat compatible with older software.

With no byte addressing instructions at all, code had to be written to pack and shift characters into words. The very large words, and comparatively small amount of memory, meant that programmers would frequently economize on memory by packing data into words at the bit level.

It is interesting to note that due to the large word size, and with 10 characters per word, it was often faster to process words full of characters at a time — rather than unpacking/processing/repacking them. For example, the CDC COBOL compiler was actually quite good at processing decimal fields using this technique. These sorts of techniques are now commonly used in the 'multi-media' instructions of current processors.

Physical design

The machine was built in a plus-sign-shaped cabinet with a pump and heat exchanger in the outermost 18inches of each of the four arms. Cooling was done with Freon circulating within the machine and exchanging heat to an external chilled water supply. Each arm could hold four chassis, each about 8inches thick, hinged near the center, and opening a bit like a book. The intersection of the "plus" was filled with cables which interconnected the chassis. The chassis were numbered from 1 (containing all 10 PPUs and their memories, as well as the 12 rather minimal I/O channels) to 16. The main memory for the CPU was spread over many of the chassis. In a system with only 64K words of main memory, one of the arms of the "plus" was omitted.

The logic of the machine was packaged into modules about 2.5inches square and about 1inches thick. Each module had a connector (30 pins, two vertical rows of 15) on one edge, and six test points on the opposite edge. The module was placed between two aluminum cold plates to remove heat. The module itself consisted of two parallel printed circuit boards, with components mounted either on one of the boards or between the two boards. This provided a very dense, if somewhat difficult to repair, package with good heat removal that was known as cordwood construction.

Operating system and programming

There was a sore point with the 6600 operating system support — slipping timelines. The machines originally ran a very simple job-control system known as COS (Chippewa Operating System), which was quickly "thrown together" based on the earlier CDC 3000 operating system in order to have something running to test the systems for delivery. However the machines were intended to be delivered with a much more powerful system known as SIPROS (for Simultaneous Processing Operating System), which was being developed at the company's System Sciences Division in Los Angeles. Customers were impressed with SIPROS's feature list, and many had SIPROS written into their delivery contracts.

SIPROS turned out to be a major fiasco. Development timelines continued to slip, costing CDC major amounts of profit in the form of delivery delay penalties. After several months of waiting with the machines ready to be shipped, the project was eventually cancelled. The programmers who had worked on COS had little faith in SIPROS (likely due largely to not invented here syndrome) and had continued working on improving COS.

Operating system development then split into two camps. The CDC-sanctioned evolution of COS was undertaken at the Sunnyvale, California software development lab. Many customers eventually took delivery of their systems with this software, then known as SCOPE (Supervisory Control Of Program Execution). (Some Control Data Field Engineers used to refer to SCOPE as Sunnyvale's Collection Of Programming Errors). SCOPE version 1 was, essentially, dis-assembled COS; SCOPE version 2 included new device and file system support; SCOPE version 3 included permanent file support, EI/200 remote batch support, and INTERCOM time sharing support. SCOPE always had significant reliability and maintainability issues.

The underground evolution of COS took place at the Arden Hills, Minnesota assembly plant. MACE ([Greg] Mansfield And [Dave] Cahlander Executive) was written largely by a single programmer in the off-hours when machines were available. Its feature set was essentially the same as COS and SCOPE 1. It retained the earlier COS file system, but made significant advances in code modularity to improve system reliability and adaptiveness to new storage devices. MACE was never an official product, although many customers were able to wrangle a copy from CDC.

MACE was later used as the basis of Kronos, named after the Greek god of time. The main marketing reason for its adoption was the development of its TELEX time sharing feature and its BATCHIO remote batch feature. Kronos continued to use the COS/SCOPE 1 file system with the addition of a permanent file feature.

An attempt to unify the SCOPE and Kronos operating system products produced NOS, (Network Operating System). NOS was intended to be the sole operating system for all CDC machines, a fact CDC promoted heavily. Many SCOPE customers remained software-dependent on the SCOPE architecture, so CDC simply renamed it NOS/BE (Batch Environment), and were able to claim that everyone was thus running NOS. In practice, it was far easier to modify the Kronos code base to add SCOPE features than the reverse.

The assembly plant environment also produced other operating systems which were never intended for customer use. These included the engineering tools SMM for hardware testing, and KALEIDOSCOPE, for software smoke testing. Another commonly used tool for CDC Field Engineers during testing was MALET (Maintenance Application Language for Equipment Testing), which was used to stress test components and devices after repairs and/or servicing by engineers. Testing conditions often used hard disk packs and magnetic tapes which were deliberately marked with errors to determine if the errors would be detected by MALET and the engineer.

See also

References

External links