Intel wants to challenge Nvidia as a "one-stop GPU shop" with a full line of GPUs ranging from laptops to gaming to the data center.
At Intel Architecture Day 2020, most of the focus and enthusiasm centered on the company's upcoming 10 nm Tiger Lake laptop CPUs. But Intel also announced advances in Xe GPU technology, strategy, and planning that could shake up the industry over the next several years.
Integrated Xe graphics is probably the single most compelling feature of the Tiger Lake laptop CPUs. While we don't have any officially sanctioned test results yet, let alone third-party testing, leaked benchmarks suggest that Tiger Lake's integrated graphics outperform the Vega 11 graphics in Ryzen 4000 Mobile by a whopping 35 percent.
Assuming those leaked benchmarks hold up in the real world, they're a much-needed shot in the arm for Intel's flagging reputation in the laptop space. But Xe has more to offer.
A new challenger appears
Intel's 7 nm Xe architecture is intended to cover the entire range of GPU applications. Ponte Vecchio – the first Xe product – is aimed specifically at high-end deep learning training in data-center and supercomputing environments.
It's been a long time since anyone seriously challenged the two-vendor lock on high-end graphics cards. For roughly 20 years, your only real choices have been Nvidia or Radeon. We first heard of Intel's plans to change that in 2019 – but at the time, Intel was really only talking about the upcoming Xe GPU architecture in the context of Ponte Vecchio, a product designed for HPC supercomputing and data-center use.
The company wasn't really ready to talk about the rest then, but we spotted a slide in Intel's Supercomputing 2019 deck mentioning plans to expand the Xe architecture into workstation, gaming, and laptop lines. We still haven't seen a desktop gaming card from Intel – but Xe has already replaced both the old UHD line and the more powerful Iris+ line in integrated graphics, and Intel is far more willing to talk about near-term expansion now than it was last year.
When we asked Intel executives about that "gaming" slide in 2019, they seemed decidedly noncommittal. When we asked again at Architecture Day 2020, the shyness was gone. Intel still hasn't given a date for a desktop gaming card (Xe HPG), but its executives expressed confidence that it will soon achieve "market-leading performance" in that segment – including onboard hardware ray tracing.
A closer look at Xe LP
If you've read our Tiger Lake CPU coverage, this graphic should look familiar – Xe LP integrated graphics get the same improvements in voltage range and frequency scaling from Intel's newly improved FinFET transistors and SuperMIM capacitors under the hood.
Parallelism is the key to GPU performance. The 96 execution units of this Xe LP GPU can process 1,536 floating-point operations, 48 texels, and 24 pixels per clock cycle.
Each Xe LP execution unit contains an 8-wide floating-point/integer ALU and a 2-wide extended-math ALU. EUs are paired, with each pair sharing thread-control hardware.
The Xe LP integrated GPU gets up to 16MB of its own L3 cache – not shared with the CPU – plus an L1 data cache attached to each 16-EU subslice.
Xe LP is designed for maximum efficiency across a wide range of data types. Dropping precision from 32-bit to 16-bit doubles the operations per cycle, and dropping to 8-bit doubles them again.
Xe LP's media engine is designed for high performance environments, including 8K video at 60 fps.
Xe LP's display engine is designed to drive multiple high-resolution, high-refresh-rate display outputs.
If you've followed our earlier coverage of the Tiger Lake architecture, the first diagram in the gallery above should look very familiar. Like the Tiger Lake CPU, the Xe LP GPU benefits from Intel's newly designed FinFET transistors and SuperMIM capacitors. Specifically, that means stability across a wider voltage range and higher frequency uplift across the board compared to Gen11 (Ice Lake Iris+) GPUs.
With a wider dynamic voltage range, Xe LP can run at significantly lower power than Iris+ – and it can also scale to higher frequencies. The improved frequency curve also means higher clock speeds at the same voltages Iris+ could handle. It's difficult to overstate the importance of this curve, which improves power efficiency and performance for not just some, but all workloads.
The improvements don't end with voltage and frequency, however. The high-end Xe LP part has 96 execution units (versus Iris+ G7's 64), and each of those EUs has a floating-point/integer arithmetic logic unit twice as wide as Iris+ G7's. Add a new L1 data cache for each 16-EU subslice and an increase in L3 cache from 3MB to 16MB, and you can get an idea of just how big an improvement Xe LP really is.
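Intel's per-clock throughput figure can be sanity-checked with back-of-the-envelope arithmetic. Here's a minimal sketch, assuming the usual convention that a fused multiply-add (FMA) counts as two operations; the 1.35 GHz clock is purely illustrative on our part, not an Intel figure:

```python
# Peak-throughput math for Xe LP from the published figures:
# 96 EUs, each with an 8-wide FP/INT ALU. Counting an FMA as two
# operations is the standard convention for peak-FLOPS numbers.
EUS = 96
SIMD_WIDTH = 8        # 8-wide FP/INT ALU per EU
OPS_PER_FMA = 2       # one multiply + one add

fp32_ops_per_clock = EUS * SIMD_WIDTH * OPS_PER_FMA
print(fp32_ops_per_clock)        # 1536, matching Intel's per-clock figure

# Halving precision doubles throughput, per Intel's slide:
fp16_ops_per_clock = fp32_ops_per_clock * 2   # 3072
int8_ops_per_clock = fp16_ops_per_clock * 2   # 6144

# At a hypothetical ~1.35 GHz clock, peak FP32 throughput would be:
tflops = fp32_ops_per_clock * 1.35e9 / 1e12
print(round(tflops, 2))          # ~2.07 TFLOPS
```

The headline takeaway is that the per-clock figures in Intel's slides are internally consistent with the EU count and ALU width it disclosed.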
The 96-EU version of Xe LP is rated for 50 percent more 32-bit floating-point operations (FLOPS) per clock cycle than Iris+ G7, and it runs at higher frequencies to boot. That squares pretty well with the leaked Time Spy benchmarks mentioned earlier – the i7-1165G7 managed a Time Spy GPU score of 1,482 versus the i7-1065G7's 806 (and the Ryzen 7 4700U's 1,093).
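The 50 percent figure falls straight out of the EU counts if per-EU FP32 throughput is unchanged at 8 FMAs per clock – our reading of the block diagrams, with Gen11 getting there via two 4-wide ALUs and Xe LP via one 8-wide ALU. A quick check, along with the ratios in the leaked Time Spy scores:

```python
# "50 percent more FP32 ops per clock": assuming both generations
# sustain 8 FP32 FMAs per EU per clock (2 ops each), the ratio is
# simply the EU-count ratio.
xe_lp_ops = 96 * 8 * 2     # 1536 FP32 ops/clock
gen11_ops = 64 * 8 * 2     # 1024 FP32 ops/clock
print(xe_lp_ops / gen11_ops)            # 1.5 -> "50 percent more"

# The leaked Time Spy GPU scores show an even bigger gap, since
# Xe LP also clocks higher than Iris+ G7:
i7_1165g7, i7_1065g7, r7_4700u = 1482, 806, 1093
print(round(i7_1165g7 / i7_1065g7, 2))  # ~1.84x over Ice Lake
print(round(i7_1165g7 / r7_4700u, 2))   # ~1.36x over Ryzen 4000
```

Note that the ~1.36x figure over the Ryzen 7 4700U is the same "35 percent" lead the leaks pointed to.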
Improving the buy-in with OneAPI
One of the biggest keys to business success in the GPU market is reducing costs and increasing sales volume by addressing multiple markets. The first part of Intel's strategy for broad appeal and low design and manufacturing costs with Xe is scalability: rather than maintaining entirely separate designs for laptop, desktop, and data-center parts, Xe is meant to scale relatively simply by adding more subslices and more EUs as the SKUs move upmarket.
There's one more important differentiator Intel needs if it's really going to break into the market. AMD's Radeon line, however attractive to gamers, has suffered from leaving AI practitioners cold. That's not necessarily because Radeon GPUs couldn't be used for AI computation – the problem is simpler: there's an entire ecosystem of libraries and models built specifically for Nvidia's CUDA architecture and no other.
A competing deep-learning GPU architecture that requires massive code rewrites is unlikely to succeed unless it offers something far more enticing than somewhat cheaper or slightly faster hardware. Intel's answer is instead to offer a write-once, run-anywhere environment – specifically, the OneAPI framework, which is expected to reach production status later this year.
Many people expect all "serious" AI/deep-learning workloads to run on GPUs, which generally offer massively higher throughput than CPUs – even CPUs featuring Intel's AVX-512 "Deep Learning Boost" instruction set. In the data center, where it's easy to order whatever configuration you like with few space, power, or cooling constraints, that's at least nearly true.
When it comes to inference workloads, however, GPU execution isn't always the best answer. While a GPU's massively parallel architecture offers potentially higher throughput than a CPU, the latency involved in setting up and tearing down short workloads can often make the CPU an acceptable or even superior alternative.
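That setup-and-teardown tradeoff can be illustrated with a toy cost model. Every number below is invented purely for illustration – real crossover points depend entirely on the hardware, drivers, and model in question:

```python
# Toy latency model: total time = fixed setup/teardown overhead
# plus work divided by throughput. All numbers are hypothetical.
def total_time_ms(batch, setup_ms, items_per_ms):
    return setup_ms + batch / items_per_ms

cpu = dict(setup_ms=0.1, items_per_ms=2)    # low overhead, modest throughput
gpu = dict(setup_ms=5.0, items_per_ms=50)   # high overhead, huge throughput

for batch in (1, 10, 100, 1000):
    cpu_t = total_time_ms(batch, **cpu)
    gpu_t = total_time_ms(batch, **gpu)
    winner = "CPU" if cpu_t < gpu_t else "GPU"
    print(f"batch={batch:5d}: CPU {cpu_t:7.2f} ms, GPU {gpu_t:7.2f} ms -> {winner}")
```

With these made-up numbers, single-item inference finishes on the CPU before the GPU has even finished spinning up, while large batches favor the GPU by a wide margin – which is the shape of the tradeoff described above.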
Increasingly, inference isn't performed in the data center at all – it happens at the edge, where power, space, heat, and cost constraints can often rule GPUs out entirely. The problem is that you can't simply port code written for Nvidia's CUDA to an x86 CPU, so developers must make difficult decisions about which architectures to target and support – decisions that affect code maintainability as well as performance down the road.
Although Intel's OneAPI framework is genuinely open – Intel invites hardware developers to write their own libraries for non-Intel parts – Xe graphics are obviously a first-class citizen there, as are Intel CPUs. The siren call of deep-learning libraries written once and maintained once, yet able to run on discrete GPUs, integrated GPUs, and x86 CPUs, may be enough to spark serious interest in Xe graphics among AI developers in a way that simple performance competition would not.
As always, a healthy level of skepticism is in order when vendors make claims about unreleased hardware. That said, we've seen enough detail from Intel to keep our eyes on the GPU front, especially with the (strategically?) leaked Xe LP benchmarks backing up its claims so far.
We think the most important thing to watch here is Intel's holistic strategy. Intel executives have been telling us for several years that the company is no longer a "CPU company" – it invests as heavily in software as it does in hardware. In a world where it's easier to buy more hardware than to hire (and manage) more developers, that seems like a smart strategy.
High-quality drivers have long been a hallmark of Intel's integrated graphics. While gaming on UHD graphics may not have been stellar, the user experience has overwhelmingly been one of "it just works," across all platforms. If Intel can manage to extend that "just works" expectation to deep-learning development with OneAPI, we believe it has a real shot at breaking Nvidia's current lock on the deep-learning GPU market.
In the meantime, we're really looking forward to the Xe LP graphics debut in the real world when Tiger Lake launches in September.