Arik Hesseldahl

Nvidia Chips to Power World’s Most Powerful Supercomputer

Oak Ridge National Lab's "Jaguar" computer

It has been about a year since the United States lost its title as the home of the world’s most powerful publicly known supercomputer. Last November, the “Jaguar” computer based at the U.S. government’s Oak Ridge National Laboratory found itself supplanted by a computer in China in the top spot on the closely watched Top 500 list of the world’s most muscular supercomputers.

Despite the fact that the Chinese system was built largely with American-made or American-designed components, the news came as a bit of a blow to American pride, and even caught the attention of President Obama, who kvetched about it in January’s State of the Union address.

By June (the list is updated twice a year) the Chinese machine had fallen to second place, its crown seized by a supercomputer in Japan, relegating the top supercomputer in the U.S. to third place.

Today, the Oak Ridge National Lab in Tennessee, part of the U.S. Department of Energy, will announce plans to build a system that has a good shot at reclaiming the top spot. The machine will be named “Titan,” and its primary computing engine will be the Tesla chip from Nvidia, the company best known for turning out chips that enhance the graphics of games on personal computers.

Nvidia has been making inroads in high-performance computing for some time. Earlier this year I wrote about how the Tesla chips were helping Lucasfilm make movies faster.

I talked with Steve Scott, the CTO of Nvidia’s Tesla business unit, who told me that the Titan machine will be 10 times more powerful than the current Jaguar machine, and that 85 percent of its computing power will come from Nvidia chips, while the remaining portion will come from conventional CPU chips from Advanced Micro Devices.

Why GPUs and not CPUs? It turns out that graphics chips are really good at a certain kind of math known as floating point operations, performing them much faster than a typical CPU chip from Intel or AMD found inside a PC or server.

It’s also an issue of power. For years, as chips and the transistors on them have shrunk, the amount of power required to drive electrical pulses through them has dropped as well. Scott says that is no longer the case. “We’ve reached the point where processors have become power constrained. If you pack all the transistors that you can onto a chip and run it as fast as you can, the chip will melt. We’ve entered a time where performance is constrained by power, and it’s only going to get worse, so you need processors that are power efficient,” he says. “It’s a fundamental sea change in the underlying technology of high performance computing.”
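Scott’s point can be sketched with the standard formula for dynamic power in CMOS chips, P = C × V² × f (capacitance times voltage squared times clock frequency). The numbers below are illustrative stand-ins, not figures from Nvidia or Oak Ridge:

```python
# Why chips became power constrained: dynamic power in CMOS scales
# roughly as P = C * V**2 * f. For years, shrinking transistors let
# voltage drop alongside size, so clock speeds could rise without
# raising power. Once voltage stopped scaling, cranking the clock
# meant cranking power almost linearly -- the "chip will melt" problem.
def dynamic_power(capacitance, voltage, frequency):
    # All inputs are illustrative, not real chip specifications.
    return capacitance * voltage ** 2 * frequency

# Same chip, clock doubled, voltage unchanged: power doubles too.
base = dynamic_power(1e-9, 1.0, 2e9)  # 2.0 watts
fast = dynamic_power(1e-9, 1.0, 4e9)  # 4.0 watts
assert fast == 2 * base
```

Power-efficient designs attack the V² term and spread work across many slower cores instead of one hot, fast one, which is exactly the trade a GPU makes.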

GPUs, originally designed for gaming and professional graphics applications like editing movies and visualizing complex problems for engineers and scientists, are inherently designed to perform several repetitive tasks at once. In explaining this, I always think back to the old saying “many hands make light work,” though here it’s applied to computing. Two people who divide up the task of folding a pile of laundry get it done faster than one. And four people will get it done faster than two.

Basically, a GPU chip is designed to render what happens to every pixel of a computer screen 50 times a second or even faster. Essentially, lots of small computational jobs are carried out at once. It’s called parallel computing, and, fundamentally, CPU chips aren’t as good at it as GPU chips. CPUs are better at doing one job at a time, getting it done really fast, and then moving on to the next one. Generally speaking, Scott says, GPUs are about eight times faster at floating point operations than CPUs.
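The laundry analogy maps directly onto code. Here is a minimal Python sketch of that divide-the-pile pattern; the function names are my own, and Python threads stand in for the thousands of hardware cores a real GPU has, so this is an analogy, not how CUDA actually works:

```python
# A toy version of "many hands make light work": one big pile of
# floating point work is split into chunks that workers handle in
# parallel, the way a GPU spreads pixels across many small cores.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker folds its own share of the laundry.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Divide the pile into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Run all the chunks at once, then combine the results.
        return sum(pool.map(partial_sum, chunks))
```

Doubling the workers roughly halves each worker’s share, which is why adding cores speeds up this kind of job so well.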

For Nvidia it will be a return trip to the top spot. China’s supercomputing champ, the Tianhe-1A at the National Supercomputing Center in Tianjin, which is now ranked No. 2 in the world, uses Nvidia GPUs. Its debut certainly got the world’s attention and demonstrated the potential of GPUs in high-performance computing.

The plan at Oak Ridge calls for Titan to have 18,000 nodes, each pairing an AMD CPU chip with an Nvidia Tesla GPU. Most of the heavy lifting will be done by the GPUs, Scott says. Its total computing capacity will top out at 20 petaflops. FLOPS are floating point operations per second, and the “peta” prefix means the system can do quadrillions of them every second: in this case, 20 quadrillion. Just because I can, and because it’s one of the rare cases where I get to use a number that’s larger than the national debt, I’m going to write that number out: 20,000,000,000,000,000.
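For anyone who wants to check the arithmetic, the prefixes unpack like this:

```python
# One petaflop is 10**15 floating point operations per second,
# so Titan's 20-petaflop target is the 17-digit number above.
PETA = 10 ** 15
titan_flops = 20 * PETA
print(f"{titan_flops:,}")  # prints 20,000,000,000,000,000
```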

And what will it be used for? While many of the Department of Energy’s computers are used to simulate nuclear explosions that are no longer allowed thanks to the Test Ban Treaty, this one won’t be. The mission at Oak Ridge, Scott says, is to advance the boundaries of science. Scientists will use it to model climate change, and to predict the results of different methods of mitigating it. They’ll also use it to design engines, study biology and genetics, and explore the possibilities of using nuclear fusion for energy. If you have interesting scientific work to do that requires this kind of computing oomph, you can even write a proposal explaining how you’d use it.

In the first phase of Titan’s deployment, which is already under way, Oak Ridge will upgrade its existing Jaguar supercomputer with 960 new Tesla chips. In a second phase, expected to start next year, Oak Ridge plans to deploy the 18,000-node Tesla-based system.

Down the road, the hope within supercomputing circles is that performance improves to the point where we’re no longer talking petaflops, but exaflops, or quintillions of floating point operations every second. The government is already working on that, and earlier this year President Obama asked Congress for $126 million in the federal budget to begin research to work on ways to get there by 2018. The biggest problem: How to supply enough electrical power while delivering the computing muscle. Today’s announcement by Oak Ridge is a big step in that direction, but with an exaflop equal to 1,000 petaflops, there are still 980 more petaflops to conquer.
