Nov. 1993, Nov.1994-Dec.1995: The Numerical Wind Tunnel (NWT) at the National Aerospace Laboratory of Japan (NAL). The NWT was an early implementation of the vector parallel architecture developed jointly by NAL and Fujitsu. Key stats: 140 vector processors, upgraded by 1995 to 167; 124.2Gflop/s upgraded 170Gflop/s. It was the first supercomputer with a sustained performance of close to 100Gflop/s for a wide range of fluid dynamics application programs. Processor: Gate delay of as low as 60ps in GaAs chips; cycle time 9.5ns; each processor had 4 independent pipelines, each executing 2 Multiply-Add instructions in parallel; each processor board had 256MB of central memory.

Every year since 1986, Hans Meuer had been publishing system counts of the most powerful supercomputers, well before the first Top 500 list was assembled in 1993. Here's a history of the most powerful systems since 1993, courtesy of Top 500 and various labs and vendors. In June 1993, the most powerful supercomputer was the CM-5 at the Los Alamos National Lab, made by Thinking Machines Corp. Key stats: 1,024 processors (16,000 nodes in its largest configuration), each node a 22MIPS RISC Sun SPARC microprocessor with 4 vector pipes, capable of 128Mflop/s. The staircase-like shape and large panels of red blinking LEDs earned it a role in the movie Jurassic Park (in the control room for the island). Sculptor/architect Maya Lin contributed to the CM-5 design.

June 1994: Intel XP/S 140 Paragon: Sandia National Labs. The first massively parallel processor supercomputer to be the fastest system in the world. Key stats: 3,680 Intel i860 RISC microprocessors connected in a 2D grid; 143.4Gflop/s. The OS from Intel, OSF-1, did not scale well, so Sandia engineers ported SUNMOS, their lightweight kernel, to the Paragon. A second-generation kernel, PUMA, later replaced SUNMOS.

June 1996: Hitachi's SR2201 took the top spot. Key stats: 1,024 processors, each a 150MHz HARP-1E based on the PA-RISC 1.1 architecture; 232.4Gflop/s running the Linpack benchmark (source: Hitachi). Hitachi used pseudo-vector processing (PVP). While conventional RISC processors ran into difficulty when data did not fit in the cache, the SR2201's PVP fetched data (operands) from main memory directly into floating-point registers, bypassing the cache, without holding up execution of subsequent instructions. A new trend was also apparent: in June 1993, 66% of the installed systems were based on emitter-coupled logic (ECL), but by June 1996 only 20% of the top 500 were ECL-based.

Nov. 1996: CP-PACS at the University of Tsukuba. Another Hitachi supercomputer, its key stats were: 2,048 processors; 368.2Gflop/s. Development involved close collaboration between the computer scientists and physicists of the CP-PACS Project.

June 1997-June 2000: ASCI Red at the Sandia National Laboratory. Intel's supercomputer was the first teraflop/s computer; in June 1997 it delivered a Linpack performance of 1.068Tflop/s. It was a mesh-based (38 x 32 x 2) MIMD massively parallel machine, which started out with 7,264 compute nodes of Intel Pentium Pro processors, each running at 200MHz, 1,212GB of distributed memory and 12.5TB of disk storage. ASCI Red later used 9,632 Pentium II Over-Drive processors, clocked at 333MHz, delivering 3.1Tflop/s.
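The upgraded figures are mutually consistent if each Pentium II Over-Drive retires roughly one floating-point result per cycle (an assumption on our part; the text gives no per-processor rating):

```python
# Rough peak for the upgraded ASCI Red.
# Assumption: 1 flop per cycle per processor.
processors = 9_632
clock_hz = 333e6
peak_tflops = processors * clock_hz / 1e12   # ~3.21 Tflop/s
```

That puts the reported 3.1Tflop/s just under the estimated ~3.2Tflop/s peak.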

VP Rick Stulen and Intel designer Stephen Wheat look inside an ASCI Red rack. The supercomputer's easy accessibility made it possible to upgrade the processors. It was retired in September 2005, after appearing on 17 TOP500 lists over 8 years. It was also the last supercomputer designed and assembled by Intel; Intel's Supercomputer Division had been shut down by the time ASCI Red was launched. Photo: Sandia Corp.

Nov. 2000-Nov. 2001: ASCI White: Lawrence Livermore National Laboratory (LLNL). Reaching 4.9Tflop/s Linpack performance, IBM's machine had 512 nodes, each with 16 IBM Power3 processors (8,192 microprocessors in all) using shared memory. The nodes were interconnected with nearly 79km of cable. According to an LLNL announcement at the time, ASCI White had 160TB of storage in 7,133 disk drives, which could 'hold the equivalent of 300,000,000 books—six times the holdings of the Library of Congress.' By June 2001, upgrades had improved the Linpack performance to 7.2Tflop/s. Top 500 notes that around this time, the type of hierarchical architecture used by ASCI White was becoming common for systems used in HPC.

June 2002-June 2004: The Earth Simulator (ES) at the Earth Simulator Center, Yokohama, Japan. Its performance of 35.86Tflop/s was almost 5x higher than that of the IBM ASCI White--an unparalleled jump in performance in the history of the TOP500. The performance gap also kept the machine at the top spot for 5 consecutive lists. Built by NEC, the ES was a highly parallel vector supercomputer system that used distributed memory and had 640 processor nodes (PNs) connected by a single-stage 640x640 crossbar switch. Each PN had shared memory, 8 vector-type arithmetic processors (APs), a 16GB main memory system (MS), a remote access control unit (RCU) and an I/O processor. The peak performance of each AP was 8Gflop/s. The entire machine had 5,120 APs with 10TB of main memory and a theoretical performance of 40Tflop/s. Shown above is the AP block diagram. Source: Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
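The Earth Simulator's headline totals follow directly from the per-node figures quoted above:

```python
# Aggregate Earth Simulator figures from the per-node numbers.
nodes = 640
aps_per_node = 8          # vector arithmetic processors per node
ap_peak_gflops = 8        # peak per AP
mem_per_node_gb = 16      # main memory per node

total_aps = nodes * aps_per_node                 # 5,120 APs
peak_tflops = total_aps * ap_peak_gflops / 1000  # 40.96 Tflop/s theoretical
total_mem_tb = nodes * mem_per_node_gb / 1000    # 10.24 TB main memory
```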

Nov. 2004-Nov. 2007: BlueGene/L at the Lawrence Livermore National Laboratory (LLNL). The slanted cabinets were a necessary design element to keep cooled air flowing properly around each cabinet's processors. The DOE/IBM BlueGene/L beta system recorded a Linpack performance of 70.72Tflop/s in 2004. It was a $100 million, 5-year development effort by IBM. In June 2005, the system was doubled in size and reached 136.8Tflop/s. By November 2005, it was again doubled in size and reached 280.6Tflop/s with 131,000 processors. Each rack contained 1,024 dual-processor nodes. In Nov. 2007, after a further upgrade, it was listed with a performance of 478.2Tflop/s.
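The "131,000 processors" figure is consistent with a 64-rack machine; the rack count is our assumption here, since the text gives only the per-rack layout and the two doublings:

```python
# BlueGene/L processor count for the Nov. 2005 system.
racks = 64                # assumed; not stated in the text
nodes_per_rack = 1_024    # dual-processor nodes per rack
procs_per_node = 2
total_processors = racks * nodes_per_rack * procs_per_node  # 131,072
```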

June 2008-June 2009: IBM's Roadrunner at Los Alamos National Laboratory breaks the petaflop/s Linpack barrier with 1.026Pflop/s. It was also unique in combining two different kinds of processors at supercomputer scale. It had 6,563 dual-core general-purpose processors (AMD Opterons), with each core linked to a PowerXCell 8i processor (a Cell), an enhanced version of the chip originally designed for the Sony PlayStation 3.

The Roadrunner was based on IBM QS22 blades, which were built around the PowerXCell 8i, an advanced version of the processor in the Sony PlayStation 3. The nodes were connected with a commodity InfiniBand network. The QS22 blades were considered highly efficient, delivering up to 536Mflop/W. Source: IBM.

Nov. 2009-June 2010: Cray XT5 (Jaguar) at the Oak Ridge National Laboratory delivered a Linpack performance of 1.759Pflop/s. The Jaguar system, a 25Tflop/s Cray XT3 in 2005, went through upgrades and by 2009 had over 200,000 processing cores connected internally with Cray's Seastar2+ network. The XT4 and XT5 parts of Jaguar were combined into a single system using an InfiniBand network. The Nov. 2009 configuration included 37,376 AMD 6-core Istanbul Opteron 2.6GHz processors (224,256 compute cores) in the XT5 part; 7,832 AMD 4-core Budapest Opteron 2.1GHz processors (31,328 compute cores) in the XT4 part; 362TB of system memory (~3x the closest competitor); 284Gbps of I/O bandwidth; and a 10PB Lustre-based shared file system. And it had a unique cooling system (next slide).
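The Nov. 2009 core counts add up as stated:

```python
# Jaguar core totals for the Nov. 2009 configuration.
xt5_cores = 37_376 * 6   # 6-core Istanbul Opterons in the XT5 part
xt4_cores = 7_832 * 4    # 4-core Budapest Opterons in the XT4 part
total_cores = xt5_cores + xt4_cores   # 255,584 -- the "over 200,000" cores
```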

The Jaguar's Cray XT5 portion had a power density of over 2kW per square foot. To cool the system, Cray worked with its partner, Liebert, to develop a new cooling technology called ECOphlex. ECOphlex employed phase-change cooling--hot air was blown through an evaporator containing refrigerant--instead of a standard coil. Computer-room air conditioners could be dispensed with because the air entering the system was at the same temperature as the air exiting it.

Nov. 2010: Tianhe-1A at the National Supercomputing Center, Tianjin. China broke into the top slot, which until then had been held only by the United States and Japan. Tianhe-1A achieved 2.57Pflop/s. Designed by the National University of Defense Technology (NUDT) in China, Tianhe-1A was a hybrid design with 14,336 Intel Xeon processors and 7,168 NVIDIA Tesla GPUs used as accelerators. Each node consisted of two GPUs attached to two Xeon processors. Chinese researchers developed a custom interconnect about 2x faster than commercial alternatives.

June-Nov. 2011: K Computer at the Riken Advanced Institute for Computational Science, Kobe. The K computer, built by Fujitsu, originally combined 68,544 SPARC64 VIIIfx CPUs, each with eight cores, for a total of 548,352 cores--almost twice as many as any other system at the time. The K computer was also more powerful than the next five systems on the list combined, delivering 8.16Pflop/s. The Japanese word 'kei' means 10 quadrillion, a barrier the machine crossed before Nov. 2011, clocking 10.51Pflop/s using 705,024 SPARC64 processing cores. Source: RIKEN Advanced Institute for Computational Science.

Each system board in the K Computer was equipped with four SPARC64 VIIIfx CPUs of 8 cores each. The heat generated was removed by water cooling. The K computer also used a custom interconnect, called Tofu, with a '6-dimensional mesh/torus' topology. The computer was configured to maximise performance despite failures, with alternate routes between CPUs and a mechanism to bypass failed CPUs. Source: Fujitsu.

June 2012: IBM's Sequoia at the Lawrence Livermore National Laboratory (LLNL). Sequoia, an IBM BlueGene/Q system, achieved 16.32Pflop/s using 1,572,864 cores, making it the first system with over 1 million cores. The primarily water-cooled system had 96 racks, 98,304 compute nodes, and 1.6PB of memory. Sequoia was about 8x more power efficient (relative to peak speed) than the BlueGene/L that ruled Nov. 2004-Nov. 2007 (slide 10). In the photo (source: LLNL), Kim Cupps, Livermore Computing Division Leader, and Adam Bertsch, BlueGene Team Lead, discuss the Sequoia project in October 2012.

Nov. 2012: Cray's XK7 (Titan) at the Oak Ridge National Laboratory (ORNL) had 552,960 processors and 700TB of memory for its 17.6Pflop/s Linpack performance. Each of Titan's 18,688 nodes paired an Nvidia Tesla K20 GPU with a 16-core AMD Opteron 6274 CPU, giving a peak performance of ~27Pflop/s. According to the Oak Ridge Leadership Computing Facility (OLCF), Titan's capability 'is on par with each of the world's 7 billion people being able to carry out 3 million calculations per second.' Image source: ORNL.

June 2013-Nov. 2015: Chinese dominance of the Top 500 list started with the Tianhe-2 (Milkyway-2), developed by the National University of Defense Technology (NUDT) and the Chinese company Inspur. Tianhe-2, located at the National Super Computer Center in Guangzhou, delivered a Linpack performance of 33.86Pflop/s with 16,000 compute nodes, each comprising 2 Intel Ivy Bridge Xeon processors and 3 Xeon Phi chips (a total of 3,120,000 cores). Each node had 88GB of memory (64GB used by the Ivy Bridge processors and 8GB by each Xeon Phi). CPU plus coprocessor memory totalled ~1.34PiB.
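Tianhe-2's core and memory totals are internally consistent, assuming 12-core Ivy Bridge Xeons and the 57-core Xeon Phi variant (the per-chip core counts are our assumption; the text does not give them):

```python
# Tianhe-2 totals from the per-node composition.
nodes = 16_000
cores_per_node = 2 * 12 + 3 * 57   # 2 Ivy Bridge Xeons + 3 Xeon Phis (assumed core counts)
total_cores = nodes * cores_per_node            # 3,120,000 cores
mem_per_node_gib = 64 + 3 * 8                   # 88 GiB per node
total_mem_pib = nodes * mem_per_node_gib / 1024**2   # ~1.34 PiB
```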

Tianhe-2 used several China-developed components, including the TH Express-2 interconnect, front-end processors, the Kylin Linux OS and software tools. The front-end system had 4,096 Galaxy FT-1500 CPUs, designed and developed at NUDT. The FT-1500 had 16 cores and was based on the SPARC V9 architecture. Delivering 144Gflop/s, the FT-1500 consumed 65W; by comparison, the Intel Ivy Bridge had 12 cores for 211Gflop/s. In the block diagram above (source: NUDT), the blue blocks at the top are the processor cores and the yellow data pipeline is the interconnect.

June 2016: China continued to occupy the top spot of the Top 500 list with the Sunway TaihuLight delivering 93Pflop/s on the Linpack (photo source: The State Council Information Office of the People's Republic of China). The system was developed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is installed at the National Supercomputing Center, Wuxi. With 10,649,600 computing cores across 40,960 nodes, it is 2x as fast and 3x as efficient as the Intel-based Tianhe-2. It consumes 15.37MW (peak) under load (the HPL benchmark), which works out to 6Gflops/W. It uses the China-designed and -manufactured SW26010 processor, which uses a 64-bit RISC architecture and may be somewhat similar to the Intel Xeon Phi (the latter is not confirmed). Each node has 260 cores (four core groups of 65 cores each), and the SW26010 is clocked at 1.45GHz with 32GB of primary memory per node (1.31PB in total).
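TaihuLight's totals and efficiency figure also check out against the per-node numbers:

```python
# Sunway TaihuLight totals from the per-node figures.
nodes = 40_960
cores_per_node = 260                     # cores per SW26010 node
total_cores = nodes * cores_per_node     # 10,649,600 cores
total_mem_pb = nodes * 32 / 1e6          # 32 GB/node -> 1.31 PB total
gflops_per_watt = 93e15 / 15.37e6 / 1e9  # ~6.05 Gflop/W under HPL
```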