The future of computers: Multicore and the Memory Wall
23 Nov 2011 | Russel Fish III
After the industry spent nearly 40 years wandering in the silicon wilderness searching for the promised land of CPU performance and power, the computer deity, Berkeley's Dr. David Patterson, handed down his famous "Three Walls." They were not etched in stone, but they may as well have been. These three immovable impediments defined the end times of increased computing performance. They would prevent computer users from ever reaching the land of milk and honey and 10GHz Pentiums. There may be a hole in the Walls, but for now we know them as:
- The Power Wall means faster computers get really hot.
- The Memory Wall means 1,000 pins on a CPU package is way too many.
- The ILP Wall means a deeper instruction pipeline really amounts to digging a deeper power hole. (ILP stands for instruction-level parallelism.)
Taken together, they mean that computers will stop getting faster. Furthermore, if an engineer optimizes one wall he aggravates the other two. That is exactly what Intel did.
Intel's Tejas hits the walls – hard
Intel engineers went pedal to the metal straight into the Power Wall, backed up, gunned the gas, and went hard into the Memory Wall.
The industry was stunned when Intel canceled not one but two premier processor designs in May 2004. Intel's Tejas CPU (Sanskrit for "fire") dissipated a stupendous 150 watts at 2.8GHz, more than Hasbro's Easy-Bake Oven.
The Tejas had been projected to run at 7GHz. It never did. When microprocessors get too hot, they quit working and sometimes blow up.
So, Intel quickly changed direction, slowed down their processors, and announced dual-core/multicore. Craig Barrett, Intel's CEO at the time, used a Q&A session at the 2005 Intel Developer Forum to explain the shift:
Question: "How should a consumer relate to the importance of dual core?"
Answer: "Fair question. I would only tell the average consumer... it's the way that the industry is going to continue to follow Moore's Law going forward—to increase the processing power in an exponential fashion over time... Dual core is really important because that's how it's happening. Multicore is tomorrow... Those are the magic ingredients that the average consumer will never see, but they will experience [them] through the performance of the machine."
Trust me, it's magic.
When an engineer, or anyone else for that matter, explains how to "increase the processing power in an exponential fashion" with "magic ingredients," it is wise to double check his math.
Multicore means that two or more complete microprocessors are built on the same chip and attached to a shared memory bus. If one microprocessor is good, two must be twice as good, and four must be... even better!
Barrett's statement was of course marketing hyperbole, an attempt to rally developers, customers, and stockholders behind the badly stumbling technology icon.
In small amounts, multicore can have an effect. Two or even four cores can improve performance. However, doubling the number of cores does not double the performance. The news is particularly alarming for popular data-intensive cloud computing applications such as managing unstructured data.
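To see why doubling cores does not double performance, the standard back-of-the-envelope check is Amdahl's law (my illustration, not from the article): if only a fraction p of a program can run in parallel, n cores can never deliver more than 1 / ((1 - p) + p/n) speedup. A minimal sketch in Python, with the 90%-parallel figure chosen purely for illustration:

```python
# Amdahl's law: speedup is capped by the serial fraction of the work.
# The 90% parallel fraction below is an illustrative assumption.

def amdahl_speedup(p, n):
    """Speedup on n cores when fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    # Even a respectable 90%-parallel program tops out well below n.
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} cores -> {amdahl_speedup(0.9, n):.2f}x")
```

With 90% of the work parallel, 32 cores buy less than an 8x speedup, and no number of cores can ever exceed 10x.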
Sandia Labs is an 8,000-person US government facility that dates back to the Manhattan Project. Besides nuclear weapons work, Sandia hosts several supercomputers and is a major computer research center. Sandia Labs performed an analysis of multicore microprocessors running just such data-intensive applications.
Sandia checked Barrett's math and found it wanting. They reported that as the number of cores increased, processor performance improved at a substantially less-than-linear rate, then decreased exponentially. Sandia explained the decrease: "The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor."
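That shape of curve, sub-linear gains followed by an exponential decline, falls out of even a toy model of a shared memory bus. The sketch below is my illustration, not Sandia's simulation; the bus bandwidth, per-core demand, and the per-core arbitration overhead are all assumed numbers:

```python
# A toy model (mine, not Sandia's simulation): n cores share one memory bus.
# Each core wants `need` GB/s; the bus supplies `bus` GB/s; every extra core
# also adds a small arbitration/coherence overhead that eats into the bus.

def effective_speedup(n, need=1.0, bus=8.0, overhead=0.05):
    usable = bus * (1.0 - overhead) ** (n - 1)   # contention erodes the bus
    per_core = min(need, usable / n)             # bandwidth actually delivered
    return n * per_core / need                   # throughput relative to 1 core

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:2d} cores -> {effective_speedup(n):5.2f}x")
```

With these assumed numbers, throughput climbs to about 6x around 8 cores, then collapses: 64 cores run slower than one.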
Multicore meets Memory Wall
Intel acknowledged the problem as "a critical concern" and underscored the criticality with this understatement: "Engineers at Sandia National Laboratories in New Mexico have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing."
Breakfast at Barrett's
To help understand the problem, imagine the "Make Breakfast" application:
1. Start with a single cook in a small kitchen to scramble eggs.
2. Add another cook to fry bacon in parallel with the first cook's eggs, and making breakfast goes a bit faster.
3. Add another to brown, butter, and slice the toast, and you might gain a bit more speed.
4. Yet another person can set the table, pour the juice, and serve.
5. It is still not fast enough, so you add a dozen more cooks to get that "magic," "exponential" increase in performance.
There are problems, however.
The first four cooks all need to use the refrigerator. In our culinary world, each would twiddle his thumbs waiting for his turn. This is the physical world equivalent of "contention between processors over the memory bus."
Now how much faster do you think breakfast will be ready when those additional dozen cooks are added? As Borat would say, "Not so much."
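For the skeptical, here is a toy scheduler for Barrett's kitchen. It is a sketch with made-up times (two minutes of exclusive refrigerator access and six minutes of cooking per dish), not a benchmark:

```python
# Toy kitchen simulation (illustrative numbers, mine): the one refrigerator
# door serializes part of every dish, so extra cooks stop helping.
import heapq

def breakfast_makespan(cooks, dishes=12, fridge_min=2.0, cook_min=6.0):
    """Each dish needs exclusive fridge time, then independent cooking time."""
    fridge_free = 0.0                       # when the single fridge frees up
    cook_free = [0.0] * cooks               # when each cook frees up
    heapq.heapify(cook_free)
    done = 0.0
    for _ in range(dishes):
        start = max(heapq.heappop(cook_free), fridge_free)
        fridge_free = start + fridge_min    # fridge access is serialized
        finish = fridge_free + cook_min     # cooking proceeds in parallel
        heapq.heappush(cook_free, finish)
        done = max(done, finish)
    return done

for c in (1, 2, 4, 8, 16):
    print(f"{c:2d} cooks -> breakfast in {breakfast_makespan(c):5.1f} min")
```

One cook takes 96 minutes and two cooks take about 50, but past four cooks the kitchen is pinned at 30 minutes: the refrigerator door, like the memory bus, sets the floor.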
Allocating multiple cores to do useful work is as challenging as allocating cooks. Just how many cores can usefully accelerate Microsoft Word, Excel, or PowerPoint? The answer is, not many. Legacy PC applications do not factor nicely into many pieces.
No need to be embarrassed
Fortunately, multicore is really about enabling the future rather than accelerating the past. Many of the really interesting and commercially valuable future computing opportunities are of the type known as "embarrassingly parallel." This means that the problems may be divided into many independent pieces and worked on separately.
Imagine the 1,000-page New York City phone book. Your task is to find a specific number in those thousand pages. The phone book is ordered by name, but the phone numbers are essentially random. You would begin a linear search at page 1 and continue until you found the desired number, probably several days later.
On the other hand, if you have a big Facebook community, you can give a page to each of your 1,000 friends and ask all of them to search at the same time. You will find the number approximately 1,000 times faster than if you searched by yourself.
The phone number search is an embarrassingly parallel problem. It can be divided into many independent tasks which can execute in parallel, and the results of all the tasks can be combined to produce a result. This is also the description of one of the most important applications in the massively parallel computer world. It is called MapReduce, and it runs the search engine on Google's million-server network.
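Here is the phone-book search written map/reduce style, as a minimal sketch using Python's multiprocessing. It mimics the idea, not Google's actual MapReduce; the random "phone book" and the 100-way split are purely illustrative:

```python
# Embarrassingly parallel search, map/reduce style (illustrative sketch,
# not Google's implementation). Each "friend" scans one chunk of pages in
# parallel; the reduce step just collects whichever chunks found the number.
from multiprocessing import Pool

def search_chunk(args):                      # the "map" step
    pages, target = args
    return [page for page, number in pages if number == target]

if __name__ == "__main__":
    import random
    book = [(page, random.randint(0, 10**7)) for page in range(1, 1001)]
    target = book[777][1]                    # a number we know is in the book
    chunks = [(book[i::100], target) for i in range(100)]  # 100 "friends"
    with Pool() as pool:
        # the "reduce" step: flatten the per-friend results
        hits = [p for found in pool.map(search_chunk, chunks) for p in found]
    print("found on page(s):", hits)
```

Because no chunk depends on any other, the work divides cleanly and the speedup scales with the number of searchers, exactly the property that makes the problem "embarrassing."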
Intel has similarly described their future view of embarrassingly parallel problems as "Recognition, Mining, and Synthesis." In other words, future applications will manipulate and manage patterns of information. The pattern might be a sentence of text, a face in a crowd, or a phrase of spoken speech. The datasets containing the patterns are immense: terabytes, petabytes, and eventually exabytes.
These enormous datasets factor nicely across embarrassingly parallel, many-CPU systems, and that was Intel's plan with multicore until it ran into the Memory Wall. Faced with the Wall, Intel has staved off total destruction with a series of architectural tricks.
Cache for clunkers
One attempt to mitigate the Memory Wall was the cache. A cache is a small, dedicated local memory that sits between a CPU core and main memory. It takes advantage of the fact that both instructions and data are often reused, and if they are already present in the local memory, the CPU core need not fetch them from main memory.
Back to the kitchen example. Each cook has a plate containing the breakfast ingredients retrieved from the refrigerator. Instead of going back to the refrigerator to get two more slices of bread each time the toaster dings, the toast cook reaches for the stack of bread on his ingredient plate. The ingredient plate will eliminate some contention for the refrigerator, but the refrigerator still has one door, and that door is a choke point.
If the kitchen is cooking for 2 people, the ingredient plates will be small. As the kitchen scales up to serve a larger family, the ingredient plates will get bigger. This is similar to increasing a computer's cache size. Despite the ingredient plates, the refrigerator will be getting a lot of use loading up those plates.
When the kitchen decides to expand to become a commercial restaurant, management may decide to invest in a dual-door refrigerator. However, there is no dual-door refrigerator for CPUs.
Caches are of limited use for data-intensive applications such as MapReduce unless the entire dataset can fit in the cache, thereby duplicating the main memory. Caches already occupy over half the silicon area of some CPUs and consume much of the power.
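The arithmetic behind that limitation is the standard average memory access time formula from Hennessy and Patterson: AMAT = hit time + miss rate times miss penalty. The latencies below are illustrative round numbers, not any specific chip:

```python
# Average memory access time (AMAT): the standard cache arithmetic.
# AMAT = hit_time + miss_rate * miss_penalty. Latencies are illustrative.

def amat(hit_ns, miss_rate, miss_penalty_ns):
    return hit_ns + miss_rate * miss_penalty_ns

# A loop that reuses its data: 95% of accesses hit the cache.
print(f"reuse-friendly code: {amat(1.0, 0.05, 100.0):6.1f} ns per access")

# A MapReduce-style streaming pass touches each byte once, so almost
# every access misses and the cache buys essentially nothing.
print(f"streaming pass:      {amat(1.0, 0.95, 100.0):6.1f} ns per access")
```

A 95% hit rate turns a 100ns trip to main memory into 6ns on average; a streaming workload that never reuses data pays nearly the full 100ns anyway, cache or no cache.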
Since caches are nearing their practical limits in size, the obvious question from computer users is, "Why can't you just increase the memory bus bandwidth?"
There are three ways to increase memory bus bandwidth:
1. Increase memory transfer speed.
2. Increase memory transfer size.
3. Move data closer to CPUs.
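Some rough arithmetic shows how quickly the first two options run out of headroom. The numbers below are illustrative (a DDR3-1600 channel is a real-world reference point; the per-core demand is assumed):

```python
# Rough bus arithmetic (illustrative numbers, mine): peak bandwidth is
# roughly transfer rate x bus width, and the share available to each core
# shrinks as cores multiply.

def bus_gbps(mtps, width_bits):
    """Peak bandwidth in GB/s for a bus doing `mtps` megatransfers/s."""
    return mtps * (width_bits / 8) / 1000.0

peak = bus_gbps(1600, 64)        # e.g. one DDR3-1600 channel: 12.8 GB/s
for cores in (1, 4, 16, 64):
    print(f"{cores:2d} cores -> {peak / cores:5.2f} GB/s each")
```

Raising the transfer rate runs into heat, and widening the bus runs into pins; option 3, moving data closer to the CPUs, is the one the stacked-memory approaches below are chasing.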
Memory transfer speed is limited by the Power Wall. The faster data is pushed on the memory bus, the more heat is generated. Several attempts have been made to increase speed other than just pushing harder and getting hotter. About a decade ago, Intel designed some processors using a high-speed memory technology called Rambus. The technology was fast but more expensive than the alternatives.
More recently, Intel has revisited the memory transfer problem with a similar but updated technology called the Hybrid Memory Cube (HMC). Few details have been publicly released, but it appears that HMC may suffer a cost problem similar to Rambus's.