28-06-2024 | Semidynamics | Semiconductors
Semidynamics has announced Tensor Unit efficiency data for its ‘All-In-One’ AI IP running a LlaMA-2 7B-parameter Large Language Model (LLM).
Roger Espasa, Semidynamics’ CEO, explained, “The traditional AI design uses three separate computing elements: a CPU, a GPU and an NPU connected through a bus. This traditional architecture requires DMA-intensive programming, which is error-prone, slow, and energy-hungry, plus the challenge of having to integrate three different software stacks and architectures. In addition, NPUs are fixed-function hardware that cannot adapt to future AI algorithms yet to be invented.
“In contrast, Semidynamics has re-invented AI architecture and integrates the three elements into a single, scalable processing element. We combine a RISC-V core, a Tensor Unit that handles matrix multiplication (playing the role of the NPU) and a Vector Unit that handles activation-like computations (playing the role of the GPU) into a fully integrated, all-in-one compute element. Our new architecture is DMA-free, uses a single software stack based on ONNX and RISC-V and offers direct, zero-latency connectivity between the three elements. The result is higher performance, lower power, better area and a much easier-to-program environment, lowering overall development costs. In addition, because the Tensor and Vector Units are under the direct control of a flexible CPU, we can deploy any existing or future AI algorithm, providing great protection to our customer’s investments.”
Large Language Models (LLMs) have emerged as a key element of AI applications. LLMs are computationally dominated by self-attention layers. These layers comprise five matrix multiplications (MatMul), a matrix Transpose and a SoftMax activation function. In the All-In-One solution, the Tensor Unit (TU) handles matrix multiplication, whereas the Vector Unit (VU) can efficiently handle Transpose and SoftMax. Since the Tensor and Vector Units share the vector registers, expensive memory copies can be largely avoided. Hence, there is zero latency and zero energy spent transferring data from the MatMul layers to the activation layers and vice versa. To keep the TU and the VU continuously busy, weights and inputs must be efficiently fetched from memory into the vector registers. To this end, the company’s Gazzillion Misses technology provides unprecedented ability to move data. By supporting many in-flight cache misses, data can be fetched ahead of time yielding high resource utilization. Furthermore, its custom tensor extension includes new vector instructions optimized for fetching and transposing 2D tiles, greatly improving tensor processing.
The company has run the full LlaMA-2 7B-parameter model (BF16 weights) on its All-In-One element, using its ONNX Run Time Execution Provider, and calculated the utilization of the Tensor Unit for all the MatMul layers in the model. The results are aggregated and presented, and organized by the A-tensor shape. There are six different shapes in LlaMA-2. As it can be seen, utilisation is above 80% for most shapes, in sharp contrast with other architectures. Results are collected in the most challenging conditions, i.e., with a batch of one and for the first-token computation. To complement this data the TU efficiency for large matrix sizes demonstrating the combined efficiency of the Tensor Unit and the Gazzillion technology. As the number of elements in the N, M, P dimensions of the matrix increase, the total size in MBs quickly exceeds any possible cache/scratchpad available. The noteworthy aspect of the chart is that the performance is stable, slightly above 70%, irrespective of the total size of the matrices. This quite surprising result is thanks to the Gazzillion technology being capable of sustaining a high streaming data rate between main memory and the Tensor Unit.
Espasa concluded, “Our new All-In-One AI IP not only delivers outstanding AI performance but is also so much easier to program as there is now just one software stack instead of three. Developers can use the RISC-V stack they already know and they do not have to worry about software-managed local SRAMs, or DMAs. Furthermore, Semidynamics provides an ONNX runtime optimized for the All-In-One AI IP, which allows programmers to easily run their ML models. Therefore, our solution represents a big step forward in programmer friendliness and ease of integration into new SOC designs. Our customers using All-In-One will be able to pass on to their customers, developers, and users all these benefits in the form of better and easier-to-program silicon.
“Moreover, our All-In-One design is completely resilient to future changes in AI/ML algorithms and workloads. This is a huge risk protection for customers starting a silicon project that will not hit the market for several years. Knowing that your AI IP will still be relevant when your silicon enters volume production is a unique advantage of our technology.”