# Foreword

This chapter describes Huawei's Ascend AI chips, their hardware and software architectures, and the full-stack, all-scenario solutions built on them.

# Objectives

On completion of this chapter, you will be able to:

- Get an overview of AI chips.
- Understand the hardware and software architectures of Huawei Ascend chips.
- Learn about the Huawei Atlas AI computing platform.
- Understand industry applications of Atlas.

Huawei Confidential

# Overview and Objectives

This section gives an overview of AI chips: their definition, classification, and current status, a design comparison of GPUs and CPUs, and an introduction to Ascend AI processors.

## Definition

- Four elements of AI: data, algorithm, scenario, and computing power
- AI chips, also known as AI accelerators, are function modules that process massive computing tasks in AI applications.

# Contents

1. Overview of AI Chips
   - Classification of AI Chips
   - Current Status of AI Chips
   - Design Comparison of GPUs and CPUs
   - Ascend AI Processors
2. Hardware Architecture of Ascend Chips
3. Software Architecture of Ascend Chips
4. Huawei Atlas AI Computing Platform
5. Industry Applications of Atlas

## Classification of AI Chips

AI chips can be divided into four types by technical architecture:

- Central processing unit (CPU): a very-large-scale integrated circuit that serves as the computing core and control unit of a computer. It interprets computer instructions and processes the data used by computer software.
- Graphics processing unit (GPU): also called a display core, visual processor, or display chip. It is a microprocessor that processes images on personal computers, workstations, game consoles, and mobile devices such as tablets and smartphones.
- Application specific integrated circuit (ASIC): an integrated circuit designed for a specific purpose.
- Field programmable gate array (FPGA): designed to implement the functions of a semi-custom chip. Its hardware structure can be flexibly configured and changed in real time as requirements change.

## Current Status of AI Chips - CPU

Central processing unit (CPU):

- Computer performance has improved steadily in line with Moore's Law.
- Adding CPU cores to raise performance also raises power consumption and cost.
- Extra instructions have been introduced and the architecture has been modified to improve AI performance.
  - Intel processors (CISC architecture) have gained instructions such as AVX-512, and vector computing modules such as FMA have been added to the ALU computing module.
  - ARM (RISC architecture) has extended the instruction sets of its Cortex-A series, which will continue to be upgraded.
- Although raising the processor frequency improves performance, a high frequency causes huge power consumption and chip overheating as the frequency approaches its ceiling.

## Current Status of AI Chips - GPU

Graphics processing unit (GPU):

- GPUs perform remarkably well in matrix computing and parallel computing, and play a key role in heterogeneous computing. They were first introduced to the AI field as acceleration chips for deep learning, and the GPU ecosystem is now mature.
- Using the GPU architecture, NVIDIA focuses on the following two aspects of deep learning:
  - Diversifying the ecosystem: It has launched the cuDNN optimization library for neural networks to improve usability and optimize the underlying GPU architecture.
  - Improving customization: It supports various data types, including int8 in addition to float32, and introduces modules dedicated to deep learning, such as the Tensor Cores of the V100.
- Existing problems include high cost, high latency, and low energy efficiency.

## Current Status of AI Chips - TPU

Tensor processing unit (TPU):

- Since 2006, Google has sought to apply the design concept of application specific integrated circuits (ASICs) to the neural network field. It released the TPU, a customized AI chip that supports TensorFlow, an open-source deep learning framework.
- Massive systolic arrays and large-capacity on-chip storage are adopted to accelerate convolution, the most common operation in deep neural networks.
- Systolic arrays optimize matrix multiplication and convolution operations to raise computing power and lower energy consumption.

## Current Status of AI Chips - FPGA

Field programmable gate array (FPGA):

- Programmed through a hardware description language (HDL), FPGAs are highly flexible, reconfigurable, reprogrammable, and customizable.
- Multiple FPGAs can be used to load a deep neural network (DNN) model onto the chips to lower computing latency. FPGAs outperform GPUs in computing performance, but they cannot reach optimal performance because of continuous erasing and programming. In addition, because of redundant transistors and wiring, logic circuits with the same functions occupy a larger chip area.
- The reconfigurable structure lowers supply and R&D risks. The cost is relatively flexible, depending on the purchase quantity.
- The design and tapeout processes are decoupled. The development period is long, generally half a year, and the entry barrier is high.
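The TPU slide above credits systolic arrays with speeding up matrix multiplication. As a rough illustration of the idea, the sketch below simulates a small output-stationary systolic array in plain Python. The function name `systolic_matmul` and the output-stationary dataflow are assumptions chosen for simplicity; this is not a description of how the TPU itself schedules data.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Processing element (i, j) keeps the running sum for c[i][j]. Row i of
    `a` enters from the left delayed by i cycles and column j of `b` enters
    from the top delayed by j cycles, so at cycle t element (i, j) sees the
    operand pair with inner index p = t - i - j.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    cycles = m + n + k - 2  # cycle at which the last element finishes
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                p = t - i - j
                if 0 <= p < k:  # operands have reached this element
                    c[i][j] += a[i][p] * b[p][j]
    return c, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
print(cycles)   # 4
```

Each operand is passed from one processing element to its neighbor instead of being fetched from memory again, which is the source of the energy saving the slide mentions.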
## Design Comparison of GPUs and CPUs

- GPUs are designed for massive amounts of same-type, mutually independent data and for pure computing environments that do not need to be interrupted.
  - Each GPU comprises several large-scale parallel computing architectures with thousands of smaller cores designed to handle multiple tasks simultaneously.
  - Throughput-oriented design
    - Many ALUs and few caches. Unlike a CPU cache, the GPU cache serves the threads: it merges their accesses to DRAM, which introduces latency.
    - The control unit performs combined access.
    - A large number of ALUs process numerous threads concurrently to hide the latency.
  - Specialized in computing-intensive, easily parallelized programs

## Design Comparison of GPUs and CPUs (cont.)

- CPUs need to process different data types in a general-purpose manner, perform logic judgment, and handle large numbers of branch jumps and interrupts.
  - Composed of several cores optimized for sequential serial processing
  - Low-latency design
    - The powerful ALU can complete a calculation in very few clock cycles.
    - The large cache lowers latency.
    - High clock frequency
    - Complex logic control unit: multi-branch programs can reduce latency through branch prediction.
    - For instructions that depend on the result of a previous instruction, the logic unit determines their position in the pipeline to speed up data forwarding.
  - Specialized in logic control and serial operation

[Figure: block diagram showing ALUs, control unit, cache, and DRAM]
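The contrast between the two design targets can be illustrated with two toy workloads. This is only a sketch in plain Python: the function names are hypothetical, and nothing here runs on a GPU. The first workload is "same-type, mutually independent" data, so its chunks can be processed in any order (or concurrently). The second is a chain of data-dependent branches in which each step needs the previous result.

```python
def scale_add(a, x, y):
    """Elementwise a * x[i] + y[i]; each output depends only on its own inputs."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def scale_add_chunked(a, x, y, chunk, reverse=False):
    """Same computation, split into chunks that are processed in any order."""
    out = [0] * len(x)
    starts = list(range(0, len(x), chunk))
    if reverse:
        starts.reverse()
    for s in starts:
        out[s:s + chunk] = scale_add(a, x[s:s + chunk], y[s:s + chunk])
    return out


def collatz_steps(n):
    """Branch-heavy serial loop: every step depends on the previous value."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


x, y = list(range(10)), [1] * 10
# Chunk order does not matter for the independent workload.
print(scale_add_chunked(2, x, y, chunk=4, reverse=True) == scale_add(2, x, y))  # True
print(collatz_steps(6))  # 8
```

The first kind of workload suits many simple ALUs working side by side. The second benefits from the branch prediction and low-latency design described for CPUs.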
## Ascend AI Processors

- Neural-network processing unit (NPU): uses a deep learning instruction set to process a large number of human neurons and synapses simulated at the circuit layer. One instruction processes a group of neurons.
- Typical NPUs: Huawei Ascend AI chips, Cambricon chips, and IBM TrueNorth

Ascend-Mini:

- Architecture: Da Vinci
- Half precision (FP16): 8 TFLOPS
- Integer precision (INT8): 16 TOPS
- 16-channel full-HD video decoder: H.264/H.265
- 1-channel full-HD video encoder: H.264/H.265
- Max. power: 8 W
- Process: 12 nm FFC

Ascend-Max:

- Architecture: Da Vinci
- Half precision (FP16): 256 TFLOPS
- Integer precision (INT8): 512 TOPS
- 128-channel full-HD video decoder: H.264/H.265
- Max. power: 350 W
- Process: 7 nm

## Logic Architecture of Ascend AI Processors

An Ascend AI processor consists of:

- Control CPU
- AI computing engine, including the AI core and AI CPU
- Multi-layer system-on-chip (SoC) caches or buffers
- Digital vision pre-processing (DVPP) module
## Ascend AI Computing Engine - Da Vinci Architecture

The AI computing engine is one of the four major components of an Ascend AI processor. It consists of the AI core (Da Vinci architecture) and the AI CPU. The Da Vinci architecture, developed to improve AI computing power, is the core of both the Ascend AI computing engine and the AI processor.

## Da Vinci Architecture (AI Core)

Main components of the Da Vinci architecture:

- Computing unit: consists of the cube unit, vector unit, and scalar unit.
- Storage system: consists of the on-chip storage unit of the AI core and the data channels.
- Control unit: provides instruction control for the entire computing process. It acts as the command center of the AI core and is responsible for the running of the entire AI core.

## Da Vinci Architecture (AI Core) - Computing Unit

The AI core has three types of basic computing units: cube, vector, and scalar units, which correspond to the matrix, vector, and scalar computing modes respectively.

- Cube computing unit: uses the matrix computing unit and accumulator to perform matrix-related operations. In a single operation it multiplies a 16x16 matrix by a 16x16 matrix for FP16 input (4096 multiply-accumulate operations), or a 16x32 matrix by a 32x16 matrix for INT8 input (8192 multiply-accumulate operations).
- Vector computing unit: implements computing between vectors and scalars or between vectors. It covers various basic computing types and many customized computing types, for data types such as FP16, FP32, INT32, and INT8.
- Scalar computing unit: equivalent to a micro CPU, the scalar unit controls the running of the entire AI core. It implements loop control and branch judgment for the entire program, computes data addresses and related parameters for the cube and vector units, and performs basic arithmetic operations.
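The figures 4096 and 8192 quoted for the cube unit follow from counting multiply-accumulate operations: an (m x k) by (k x n) product needs m·k·n of them. The sketch below checks that arithmetic and shows, in plain Python, how a larger matrix product can be broken into tile-sized block products. The names `cube_macs` and `tiled_matmul` are hypothetical, and the tiling loop is a generic blocked matrix multiply, not Huawei's actual scheduling.

```python
def cube_macs(m, k, n):
    """Multiply-accumulate count for an (m x k) by (k x n) matrix product."""
    return m * k * n


def tiled_matmul(a, b, tile=16):
    """Multiply a (m x k) by b (k x n) one tile-sized block at a time.

    Returns the product and the number of block products issued, i.e. how
    many times a hypothetical tile x tile x tile cube operation would run.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    block_ops = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                block_ops += 1
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c, block_ops


print(cube_macs(16, 16, 16))  # 4096, the FP16 case
print(cube_macs(16, 32, 16))  # 8192, the INT8 case
```

For example, a 32x32 by 32x32 product with `tile=16` issues 2 x 2 x 2 = 8 block products, each the size of one FP16 cube operation.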
## Da Vinci Architecture (AI Core) - Storage System

- The storage system of the AI core is composed of the storage unit and the corresponding data channel.
- The storage unit consists of the storage control unit, buffers, and registers:
  - Storage control unit: Through the bus interface, it can directly access the cache at the level below the AI core, and it can also directly access memory through DDR or HBM. A storage conversion unit acts as the transmission controller of the AI core's internal data channel, managing reads and writes of internal data between the different buffers. It also performs format conversions such as zero padding, Img2Col, transposing, and decompression.
  - Input buffer: temporarily stores frequently used data so that the data does not have to be read from outside the AI core through the bus interface each time. This reduces the frequency of data access on the bus and the risk of bus congestion, thereby lowering power consumption and improving performance.
  - Output buffer: stores the intermediate results of each layer of the neural network so that the data is readily available for next-layer computing. Reading data through the bus offers low bandwidth and long latency, whereas the output buffer greatly improves computing efficiency.
  - Register: The registers in the AI core are mainly used by the scalar unit.

## Da Vinci Architecture (AI Core) - Storage System (cont.)

- Data channel: the path along which data flows in the AI core while computing tasks execute
  - A data channel of the Da Vinci architecture is characterized by multiple inputs and a single output. Because neural network computing involves many types and a large quantity of input data, parallel inputs improve data inflow efficiency.
    By contrast, processing multiple types of input data produces only one output feature matrix, so a single-output data channel reduces the use of chip hardware resources.

## Da Vinci Architecture (AI Core) - Control Unit

The control unit consists of the system control module, instruction cache, scalar instruction processing queue, instruction transmitting module, matrix operation queue, vector operation queue, storage conversion queue, and event synchronization module.

- System control module: controls the execution of a task block, the minimum task computing granularity of the AI core. After the task block is executed, the system control module handles the interrupt and reports the status. If an error occurs during execution, the error status is reported to the task scheduler.
- Instruction cache: prefetches subsequent instructions during instruction execution and reads multiple instructions into the cache at a time, improving instruction execution efficiency.
- Scalar instruction processing queue: After being decoded, instructions are imported into a scalar queue for address decoding and operation control. The instructions include matrix computing instructions, vector computing instructions, and storage conversion instructions.
- Instruction transmitting module: reads the configured instruction addresses and decoded parameters from the scalar instruction queue and sends them to the corresponding instruction execution queue according to the instruction type. Scalar instructions stay in the scalar instruction processing queue for subsequent execution.

## Da Vinci Architecture (AI Core) - Control Unit (cont.)

- Instruction execution queue: includes a matrix operation queue, a vector operation queue, and a storage conversion queue. Each instruction enters the queue that matches its type, and the instructions in a queue are executed in the order in which they entered.
- Event synchronization module: monitors the execution status of each instruction pipeline in real time and analyzes the dependencies between pipelines to resolve data dependence and synchronization problems between them.

## Logic Architecture of Ascend AI Processor Software Stack

- L3 application enabling layer: an application-level encapsulation layer that provides processing algorithms for specific application fields. L3 provides computing and processing engines for various fields. It can directly use the framework scheduling capability provided by L2 to generate the corresponding neural networks and implement specific engine functions.
  - Generic engine: provides the generic neural network inference capability.
  - Computer vision engine: encapsulates video and image processing algorithms.
  - Language and text engine: encapsulates basic processing algorithms for voice and text data.

## Logic Architecture of Ascend AI Processor Software Stack (cont.)

- L2 execution framework layer: encapsulates the framework calling capability and the offline model generation capability. After an application algorithm is developed and encapsulated into an engine at L3, L2 calls the appropriate deep learning framework, such as Caffe or TensorFlow, based on the features of the algorithm to obtain a neural network with the corresponding function, and generates an offline model through the framework manager. After L2 converts the original neural network model into an offline model that can be executed on Ascend AI chips, the offline model executor (OME) transfers the offline model to L1 for task allocation.
- L1 chip enabling layer: bridges the offline model to Ascend AI chips. L1 uses libraries to accelerate the offline model for different computing tasks. As the layer nearest to the bottom-layer computing resources, L1 outputs operator-layer tasks to the hardware.
- L0 computing resource layer: provides computing resources and executes specific computing tasks. It is the hardware computing basis of the Ascend AI chip.

## Neural Network Software Flow of Ascend AI Processors

The neural network software flow of Ascend AI processors is a bridge between the deep learning framework and Ascend AI chips. It realizes and executes a neural network application and integrates the following functional modules:

- Process orchestrator: implements the neural network on Ascend AI chips, coordinates the whole process of bringing the neural network into effect, and controls the loading and execution of offline models.
- Digital vision pre-processing (DVPP) module: processes and cleans data before input so that it meets the format requirements for computing.
- Tensor boosting engine (TBE): functions as a neural network operator factory that provides powerful computing operators for neural network models.
- Framework manager: builds an original neural network model into a form supported by Ascend AI chips and integrates the new model into Ascend AI chips to ensure that the neural network runs efficiently.
- Runtime manager: provides various resource management paths for task delivery and allocation of the neural network.
## Atlas Accelerates AI Inference

All products are based on the Ascend 310 AI processor.

| Positioning | Product | Model |
| --- | --- | --- |
| Performance improved 7x for terminal devices | Atlas 200 Developer Kit (DK), AI developer kit | 3000 |
| Performance improved 7x for terminal devices | Atlas 200 AI accelerator module | 3000 |
| Highest density in the industry (64-channel) for video inference | Atlas 300 AI accelerator card | 3000 |
| Edge intelligence and cloud-edge collaboration | Atlas 500 AI edge station | 3000 |
| Powerful computing platform for AI inference | Atlas 800 AI server | 3000/3010 |

## Atlas 200 DK: Strong Computing Power and Ease-of-Use

Full-stack AI development on and off the cloud:

- 16 TOPS INT8, 24 W
- 1 USB Type-C port, 2 camera ports, 1 GE port, 1 SD card slot
- 8 GB memory
- Operating temperature: 0°C to 45°C
- Dimensions (H x W x D): 24 mm x 125 mm x 80 mm

- Developers: Set up a development environment with one laptop. Ultra-low cost for a local, independent environment, with multiple functions and interfaces to meet basic requirements.
- Researchers: Collaboration. The same protocol stack is used for Huawei Cloud and the developer kit, so models can be trained on the cloud and deployed locally with no modification required.
- Startups: Code-level demo. Implement the algorithm function by modifying 10% of the code based on the reference architecture, interact with the Developer Community, and migrate seamlessly to commercial products.

## An efficient inference platform powered by Kunpeng (Model: 3000)

Key functions:

- 2 Kunpeng 920 processors in a 2U space
- 8 PCIe slots, supporting up to 8 Atlas 300 AI accelerator cards
- Up to 512-channel HD video real-time analytics
- Air-cooled, stable at 5°C to 40°C

## A flexible inference platform powered by Intel (Model: 3010)

Key functions:

- 2 Intel® Xeon® SP Skylake or Cascade Lake processors in a 2U space
- 8 PCIe slots, supporting up to 7 Atlas 300/NVIDIA T4 AI accelerator
  cards
- Up to 448-channel HD video real-time analytics
- Air-cooled, stable at 5°C to 35°C

## Atlas 800 Training Server: Industry's Most Powerful Server for AI Training

Atlas 800, Model: 9000

| Metric | Improvement | Value |
| --- | --- | --- |
| Density of computing power | 2.5x | 2 PFLOPS/4U |
| Hardware decoder | 25x | 16384 images/second (1080p decoding) |
| Perf./Watt | 1.8x | 2 PFLOPS/5.6 kW |

# Summary

This chapter describes the products of the Huawei Atlas computing platform and helps you understand the working principles of Huawei Ascend chips. It focuses on the hardware and software architectures of Ascend chips and the application scenarios of the Atlas AI computing platform.

# Quiz

1. What are the main applications of Ascend 310? ( )
   - A. Model inference
   - B. Model training
   - C. Model building

# Recommendations

- Ascend community:
  - https://ascend.huawei.com

# Thank you.

Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.