The largest AI models can take months to train on today’s computing platforms. NVIDIA’s new offerings aim to address that issue.
At its annual GTC conference, NVIDIA announced a bevy of new AI-specific GPUs and CPUs, including the Hopper H100 GPU, which it claims will dramatically speed up how companies deploy the most sophisticated AI applications, like BERT and GPT-3, which are built on transformer models. But the announcement raises the question: can every challenge in AI simply be solved with more computing power?
See also: NVIDIA and Partners Target Artificial Intelligence at GTC
NVIDIA takes aim at the transformer problem
In NVIDIA’s announcement, Dave Salvator writes, “The largest AI models can require months to train on today’s computing platforms. That’s too slow for businesses.”
The big reason behind this huge lag time in training? The sheer complexity of these transformer models, which were originally developed for natural language processing (NLP) but are now applied to other demanding tasks, like computer vision for self-driving cars. These models can reach billions of parameters, and their training sets billions of examples, all of which must be churned through repeatedly to turn seemingly random data into computer intelligence.
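To get a feel for why these parameter counts balloon, a rough back-of-the-envelope estimate helps. The Python sketch below counts only the attention and feed-forward weight matrices of a decoder-only transformer (ignoring embeddings, biases, and layer norms), and the 96-layer, 12,288-wide configuration is simply GPT-3's publicly reported shape, used here for illustration.

```python
# Back-of-the-envelope parameter count for a decoder-only transformer.
# Simplification: embeddings, biases, and layer norms are ignored.
def transformer_params(num_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    attention = 4 * d_model * d_model                 # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # up- and down-projection matrices
    return num_layers * (attention + feed_forward)

# GPT-3's reported shape (96 layers, hidden size 12,288) lands near 175 billion:
print(f"{transformer_params(96, 12288):,}")  # 173,946,175,488
```

Every one of those weights has to be read, multiplied, and updated on every training step, which is why raw floating-point throughput matters so much.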
NVIDIA’s new chips pack 80 billion transistors and are built on TSMC’s 4nm process, but NVIDIA says the biggest change in this generation is actually how it leverages a new 8-bit floating-point data format, called FP8. Because AI training speed depends on how quickly the hardware can churn through floating-point numbers (values with fractional components), being able to mix 8-bit precision with 16-bit “half” precision (FP16) is a huge advantage. The chips can also fall back to 32-bit “single” precision (FP32) and 64-bit “double” precision (FP64) for the situations that demand it.
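For a sense of what mixed-precision training looks like in code, here is a minimal sketch using PyTorch’s automatic mixed precision. One caveat: FP8 itself is generally exposed through vendor libraries such as NVIDIA’s Transformer Engine rather than stock PyTorch, so the sketch uses FP16 “half” precision to illustrate the same idea, and the tiny model, data, and hyperparameters are placeholders rather than anything NVIDIA ships.

```python
# Minimal mixed-precision training loop in PyTorch (requires a CUDA GPU).
# The model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so small FP16 gradients don't underflow
loss_fn = nn.MSELoss()

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, matrix multiplies run in FP16 while numerically
    # sensitive operations are kept in FP32 automatically.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The design trade-off is the same one FP8 pushes further: lower-precision math is faster and moves less data, at the cost of extra bookkeeping to keep training numerically stable.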
Combine those precision gains with new data center hardware for linking many Hopper H100s together, and NVIDIA seems confident that it will lead the parameter race well into the trillions.
Salvator writes: “When coupled with other new features in the Hopper architecture — like the NVLink Switch system, which provides a direct high-speed interconnect between nodes — H100-accelerated server clusters will be able to train enormous networks that were nearly impossible to train at the speed necessary for enterprises.”
NVIDIA’s tests on a Mixture of Experts (MoE) Transformer Switch-XXL variant with 395 billion parameters showed “higher throughput and a 9x reduction in time to train, from seven days to just 20 hours.”
See also: NVIDIA’s New Grace CPU Enables Giant AI
Is bigger AI always better?
Not everyone agrees. A 2019 study from researchers at the University of Massachusetts found that training a transformer AI model with 213 million parameters took 84 hours to complete and produced 626,155 pounds of CO2e, roughly equivalent to the yearly carbon footprint of 17 average Americans.
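As a quick sanity check on that comparison, divide the study’s emissions figure by an average American’s yearly carbon footprint. The roughly 36,156 lbs CO2e per-person baseline used below is the figure that study relies on; treat it as an assumption here.

```python
# Rough check of the "17 Americans" comparison (per-person baseline assumed, see above).
TRAINING_CO2E_LBS = 626_155              # emissions reported for the training run
AVG_AMERICAN_CO2E_LBS_PER_YEAR = 36_156  # approximate yearly per-person footprint

print(round(TRAINING_CO2E_LBS / AVG_AMERICAN_CO2E_LBS_PER_YEAR, 1))  # -> 17.3
```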
And while that might not seem like much at first, keep in mind that GPT-3 used a whopping 175 billion parameters. Google has already trained a new language model using 1.4 trillion parameters, and when talking to Wired, Andrew Feldman, founder and CEO of Cerebras, shared a rumor that the next iteration from OpenAI, GPT-4, will have more than 100 trillion parameters.
We’ll spare the rest of the arithmetic here, but it’s easy to see how AI applications can carry an enormous environmental impact, one that is only exacerbated by the speed and accessibility of the processors doing the work. And for those who are more conscious of costs than greenhouse gases, the same University of Massachusetts study found that the same transformer training run also cost between $942,973 and $3,201,722 in cloud computing alone.
There’s no telling how these numbers will change with a few hundred H100 GPUs leading the charge, but overall computational usage for AI training will almost certainly keep expanding for years to come. NVIDIA is promoting its new chip architecture as the go-to solution for new use cases, like omics (biological studies such as genomics and drug discovery), route optimization for autonomous robots, and even tuning SQL queries for shorter execution times.
On the other hand, the University of Massachusetts researchers call for more cost-benefit (accuracy) analysis, more equitable access to computing resources, and a larger industry push toward algorithms optimized to use as little computing power as possible.
But in a battle between environmentally conscious university researchers and tech companies that don rose-colored glasses when throwing billions of dollars (not to mention trillions of parameters) into AI research, we’re likely to continue the cycle of bigger chips, bigger algorithms, and bigger promises.