ABSTRACT

Unified virtual memory greatly simplifies GPU programming, but it introduces substantial address translation overhead. To reduce this overhead, modern GPUs use the translation lookaside buffer (TLB) to accelerate the address translation process. However, the benefit of the TLB falls far short of optimal performance. In this work, we find that the performance deficiency mainly stems from the private nature of the L1 TLBs. First, there are many duplicate page table entries among the L1 TLBs, which leads to poor space utilization. Second, the miss rate of the L2 TLB is high because of the massive number of requests generated by L1 TLB misses, which causes significant GPU performance degradation. To reduce L1 TLB misses and improve the address translation performance of GPUs, we propose a hardware scheme based on an Intra-Group Sharing approach, named IGS-TLB. In IGS-TLB, the L1 TLBs are decoupled from the compute units and aggregated into groups. Within a group, only shared L1 TLB entries exist, and each L1 TLB is responsible for a non-overlapping address range. This eliminates duplicate page table entries in the L1 TLBs and significantly reduces the number of requests caused by L1 TLB misses. Our evaluation on a wide set of GPU workloads shows that IGS-TLB effectively reduces the L1 TLB miss rate and the L2 TLB traffic, improving GPU performance by 20.5% on average.

As one of the current mainstream accelerators, the Graphics Processing Unit (GPU) is widely used in various fields such as artificial intelligence, cloud computing, and data mining. On the one hand, a GPU is composed of thousands of processor cores, providing powerful parallel processing capability and high throughput [29]. On the other hand, the GPU programming model is becoming more convenient [12, 23]. In recent years, GPU manufacturers introduced Unified Virtual Memory (UVM) [19, 27, 34] to simplify the GPU programming model. However, UVM also brings substantial address translation overhead [4]: to support it, the GPU has to translate a virtual address to a physical address on every access. This translation is on the critical path and has a significant impact on overall GPU performance. Modern GPUs use the TLB to accelerate the address translation process [7, 31]. However, recent studies have found that GPU performance degrades severely on large-scale workloads with irregular memory accesses, and the insufficient performance of the TLB is the main bottleneck behind this problem [22, 24].

In this paper, we conduct a detailed experimental analysis of TLB performance and access behavior under existing architectures, and we derive two key observations. On the one hand, both the L1 and L2 TLBs exhibit high miss rates for most workloads. On the other hand, there are many duplicate page table entries among the L1 TLBs. This duplication shows that data sharing among GPU application threads is very high; that is, the L1 TLBs exhibit a strong sharing characteristic. The duplication greatly reduces the number of distinct page table entries the L1 TLBs can store, so their capacity is not fully utilized. If the duplication is reduced, the L1 TLBs can effectively provide more capacity for page table entries, leading to a higher hit rate and significantly relieving the burden on the L2 TLB.

Based on the above observations, we propose a hardware optimization scheme called IGS-TLB, which exploits the data sharing characteristic to reduce duplication among L1 TLBs. Specifically, we decouple the originally private L1 TLBs from the compute units and aggregate them into groups. To achieve intra-group sharing, we propose an address partitioning scheme that divides the address range across the L1 TLBs: each L1 TLB within a group is responsible for an exclusive address range. Furthermore, since each L1 TLB is shared by multiple CUs, we add a request aggregation module to reduce the number of requests. Under IGS-TLB, the duplication among the L1 TLBs within a group is eliminated, which makes better use of the limited TLB capacity and improves the overall address translation performance of the GPU.

To implement the decoupling-based L1 TLB design, we leverage a crossbar interconnection network for the communication between CUs and L1 TLB groups. Our experimental results verify the effectiveness of this interconnection network.
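To make the intra-group address partitioning concrete, the following is a minimal sketch of how a coalesced translation request could be routed to the L1 TLB responsible for its address range within a group. This is not the paper's implementation: the group size, the interleaving on the virtual page number, and names such as `tlb_index_in_group` are our own illustrative assumptions.

```cpp
#include <cstdint>

// Illustrative parameters only; the paper does not specify these values.
constexpr unsigned kCusPerGroup  = 4;   // CUs sharing one L1 TLB group (assumption)
constexpr unsigned kTlbsPerGroup = 4;   // decoupled L1 TLBs inside a group (assumption)
constexpr unsigned kPageShift    = 12;  // 4 KiB base pages

// Each CU belongs to exactly one TLB group.
unsigned group_of_cu(unsigned cu_id) {
    return cu_id / kCusPerGroup;
}

// Intra-group address partitioning: each L1 TLB in the group serves a
// non-overlapping slice of the virtual page number (VPN) space, so the same
// page table entry can never reside in two TLBs of the same group.
unsigned tlb_index_in_group(uint64_t vaddr) {
    const uint64_t vpn = vaddr >> kPageShift;
    return static_cast<unsigned>(vpn % kTlbsPerGroup);  // simple interleaving (assumption)
}

// A CU forwards a coalesced translation request to the (group, tlb) pair below.
struct TlbTarget { unsigned group, tlb; };

TlbTarget route_request(unsigned cu_id, uint64_t vaddr) {
    return { group_of_cu(cu_id), tlb_index_in_group(vaddr) };
}
```

Because the partition is a pure function of the virtual page number, every CU in a group deterministically reaches the same TLB for the same page, which is what removes intra-group duplication.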

In summary, this paper makes the following contributions:

- We perform an in-depth experimental analysis of the access characteristics of the existing GPU TLB architecture.
- We propose IGS-TLB, a hardware scheme based on intra-group sharing. This design decouples the private L1 TLBs from the CUs and aggregates them into groups, allowing the CUs within a group to access multiple shared L1 TLBs.
- We conduct a series of detailed experiments in gem5 [11] to evaluate IGS-TLB. Our experimental results show that, compared with two state-of-the-art TLB sharing schemes, IGS-TLB provides an additional 19% and 22% speedup, respectively.

2 BACKGROUND AND MOTIVATION

In this section, we first review the modern GPU architecture and present the address translation architecture in GPUs. Then, in a preliminary study, we investigate the TLB miss rate and the inter-TLB sharing characteristic. Finally, we discuss how to exploit the sharing characteristic to improve GPU address translation performance.

2.1 Background

2.1.1 GPU Architecture. The essential details of the GPU architecture are shown in Figure 1. An AMD GPU, for example, is composed of multiple compute units (CUs) [2]. Each CU can execute a large number of GPU threads and contains many cores (i.e., Shader Processors, or SPs), which are the functional units and can each execute one instruction per clock cycle. All SPs within a CU share an instruction fetch/decode unit, an L1 cache, and a large register file that can be scheduled flexibly. Each thread in a CU can be allocated a specific number of registers depending on the number of executed threads. The L2 cache is a banked, unified last-level cache shared by all CUs in the GPU.
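The organizational relationships described above can be summarized in a small structural sketch. This is illustrative only; the field names are ours and no sizes from the paper are implied.

```cpp
#include <cstddef>
#include <vector>

// Illustrative structural model of the GPU organization described above.
struct ShaderProcessor { /* executes one instruction per clock cycle */ };

struct ComputeUnit {
    std::vector<ShaderProcessor> sps;   // many SPs per CU
    std::size_t l1_cache_bytes;         // L1 cache shared by the SPs of this CU
    std::size_t register_file_bytes;    // large, flexibly scheduled register file
    // The instruction fetch/decode unit shared by all SPs is not modeled further here.
};

struct Gpu {
    std::vector<ComputeUnit> cus;       // multiple CUs (sixteen in our simulated setup)
    std::size_t l2_cache_bytes;         // banked, unified last-level cache shared by all CUs
};
```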

2.1.2 Address Translation in GPU. To reduce the complexity of programming models, Unified Virtual Memory (UVM) [18, 28, 32] is introduced in modern GPUs. UVM means that, in a heterogeneous system, all processors share the same address space: programmers can directly read and write memory on different processors without additional work. This technology greatly reduces the burden on programmers and improves the efficiency of GPU parallel programming. However, while UVM brings benefits, it also causes expensive address translation overhead: the GPU needs to translate a virtual address into a physical address before accessing the cache, which is very time-consuming. To reduce this overhead, modern GPUs use the TLB to accelerate the address translation process. As shown in Figure 2, a modern GPU contains a multi-level address translation hardware architecture [8, 30, 31, 37] similar to the cache hierarchy. Specifically, each CU has one private L1 TLB and is equipped with a TLB coalescer. Since the L1 TLBs are on the critical path of the GPU, they are often designed as fully associative structures to reduce conflicts, and their capacity is usually small so as to provide low access latency. All L1 TLBs are backed by a shared L2 TLB with a much larger capacity. To speed up the page table walk for requests that miss in the L2 TLB, the GPU adds a Page Table Walker (PTW). Since TLB misses are often bursty in GPUs, the PTW is highly threaded to execute multiple requests in parallel. In addition, the PTW is equipped with a Page Walk Cache (PWC) to cache page table entries. Similar to the CPU, the GPU uses a four-level page table stored in main memory, and it takes advantage of the shared L2 cache to cache page table indices.

In the GPU address translation architecture, translation proceeds in three main steps. First, the CU issues a coalesced address translation request to the L1 TLB. Second, if the L1 TLB can serve the request, the physical address is returned directly; if an L1 TLB miss occurs but the L2 TLB can serve the request, the physical address is returned and also stored in the L1 TLB. Third, on a miss in both the L1 and L2 TLBs, the PTW consults the page walk cache or walks the full page table, then returns the address mapping, which is stored in both the L1 and L2 TLBs.
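The following is a minimal sketch of this three-step flow. The in-memory tables and names (`l1_tlb`, `walk_page_table`, and so on) are our own stand-ins, not the simulator's or the hardware's actual interfaces, and capacity and replacement are ignored.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Minimal stand-ins for the two TLB levels and the page table walker.
std::unordered_map<uint64_t, uint64_t> l1_tlb, l2_tlb;   // VPN -> PFN (capacity ignored)
std::unordered_map<uint64_t, uint64_t> page_table;       // filled by the OS/driver

std::optional<uint64_t> lookup(const std::unordered_map<uint64_t, uint64_t>& tlb,
                               uint64_t vpn) {
    auto it = tlb.find(vpn);
    if (it == tlb.end()) return std::nullopt;
    return it->second;
}

// Stand-in for the PTW: a PWC lookup or a full four-level walk in the real design.
uint64_t walk_page_table(uint64_t vpn) { return page_table.at(vpn); }

// Three-step translation of a coalesced request, following the flow described above.
uint64_t translate(uint64_t vaddr) {
    const uint64_t vpn = vaddr >> 12, offset = vaddr & 0xFFF;

    if (auto pfn = lookup(l1_tlb, vpn))          // Step 2a: L1 TLB hit
        return (*pfn << 12) | offset;

    if (auto pfn = lookup(l2_tlb, vpn)) {        // Step 2b: L1 miss, L2 TLB hit
        l1_tlb[vpn] = *pfn;                      // returned mapping cached in the L1 TLB
        return (*pfn << 12) | offset;
    }

    const uint64_t pfn = walk_page_table(vpn);   // Step 3: both miss, PTW resolves it
    l1_tlb[vpn] = pfn;                           // mapping stored in both TLB levels
    l2_tlb[vpn] = pfn;
    return (pfn << 12) | offset;
}
```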
2.2 Preliminary Study

To obtain detailed information about GPU address translation, we run eight different workloads and simulate sixteen CUs using the gem5 simulator (details in Section 4). The workloads are drawn from the Rodinia [14], Pannotia [13], DNNMark [15], and HIP-Examples [5] benchmark suites and include KM, NW, DWT, Mis, Softmax, Pool, MM, and MT. We mainly collect statistics on the TLB miss rate and inter-TLB sharing.

2.2.1 TLB Miss Rate Analysis. Figure 3 shows the miss rates of the L1 and L2 TLBs for various workloads. From Figure 3, we obtain two critical observations.

First, many workloads suffer from high L1 TLB miss rates. The average L1 TLB miss rate reaches 47% across all workloads, and for workloads with poor temporal locality it reaches as high as 88% (NW) and 80% (Softmax). The main reason for the high miss rate is the limited L1 TLB capacity, which makes it difficult to retain page table entries with long reuse distances. A high L1 TLB miss rate causes many threads to stall on the GPU, which greatly hurts the parallel efficiency of applications. Moreover, since the L2 TLB is shared by all L1 TLBs, the large number of L1 TLB miss requests causes heavy contention on the communication between the L1 TLBs and the L2 TLB, resulting in high request latency.

Second, the average L2 TLB miss rate across all workloads reaches 34%, which is lower than that of the L1 TLBs. However, each L2 TLB miss leads to a time-consuming four-level page table traversal (unless it hits in the page walk cache). We also found that some workloads exhibit a high L1 TLB miss rate but a low L2 TLB miss rate. For instance, Softmax has a high L1 TLB miss rate (80%) while its L2 TLB miss rate is less than 4%. This indicates that many address translations that miss in the private L1 TLBs can be served by the shared L2 TLB.

In summary, the performance of the L1 and L2 TLBs in GPUs is poor, especially for the L1 TLBs. Considering the high parallelism of GPU workloads, we infer that there exist common address translations shared by many L1 TLBs. To explore the inter-L1-TLB access characteristics, we conduct further experiments.

2.2.2 Inter-TLB Sharing Characteristic. To explore the sharing characteristics among L1 TLBs, we analyze the proportion and distribution of inter-TLB sharing in detail.

Inter-TLB Sharing Proportion. To analyze the sharing behavior, we calculate the proportion of duplicate page table entries among L1 TLBs. The results are shown in Figure 4, where the x-axis shows the individual workloads and the y-axis is divided into multiple series indicating the proportion of page table entries with different levels of duplication. There are seven possible intervals: "0", "1", "2-4", "5-8", "9-12", "13-15", and "16". Adjacent ranges with relatively small overall proportions are merged; for example, "2-4" represents entries with two to four duplicates. Figure 4 shows that workloads other than NW have 60% duplication on average. The highest sharing proportion occurs in the range 1 to 4, with an average of 37% (excluding NW). For DWT, less than 4% of the page table entries are private to a single L1 TLB, and more than 90% are duplicated. For MT, about 50% of the address translation requests hit in the L1 TLBs, and almost all of these hits are to duplicated entries. In summary, a large proportion of page table entries are duplicated among L1 TLBs, which indicates a strong sharing characteristic. The duplication wastes L1 TLB capacity and reduces the overall translated address range (i.e., the L1 TLB reach). As a result, requests that could have been served by the L1 TLBs are forwarded to the L2 TLB, placing a significant burden on it. Therefore, if duplicate page table entries among L1 TLBs are reduced, the traffic between the L1 TLBs and the L2 TLB decreases, relieving the burden on the L2 TLB and improving the overall GPU address translation performance.
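For clarity, here is a minimal sketch of how such a duplication histogram could be computed from snapshots of the L1 TLB contents. The snapshot representation, the function name, and the exact binning semantics (e.g., what the "0" bin of Figure 4 counts) are our own assumptions, not the paper's measurement code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// One snapshot per L1 TLB: the set of virtual page numbers (VPNs) currently resident.
// How the snapshots are taken (e.g., sampled from the simulator) is omitted here.
using TlbSnapshot = std::unordered_set<uint64_t>;

// For each distinct VPN found in any snapshot, count how many of the sixteen L1 TLBs
// hold it, then bin the counts into the intervals used by Figure 4 (assumption).
std::unordered_map<std::string, std::size_t>
duplication_histogram(const std::vector<TlbSnapshot>& tlbs) {
    std::unordered_map<uint64_t, unsigned> copies;   // VPN -> number of TLBs holding it
    for (const auto& tlb : tlbs)
        for (uint64_t vpn : tlb)
            ++copies[vpn];

    std::unordered_map<std::string, std::size_t> hist;
    for (const auto& entry : copies) {
        const unsigned k = entry.second;
        if (k == 1)        ++hist["1"];
        else if (k <= 4)   ++hist["2-4"];
        else if (k <= 8)   ++hist["5-8"];
        else if (k <= 12)  ++hist["9-12"];
        else if (k <= 15)  ++hist["13-15"];
        else               ++hist["16"];
    }
    return hist;
}
```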
Inter-TLB Sharing Distribution. To take advantage of the sharing characteristic, we need to study the distribution of duplicate page table entries in more detail. Therefore, we conduct experiments to break down the physical distribution of shared page table entries among L1 TLBs. Figure 5 shows four representative workloads with different characteristics. Each subgraph contains a heatmap of page table entries that miss in the local TLB but can be found in a remote TLB, where the x-axis represents the local L1 TLB and the y-axis represents the remote L1 TLB. Dark spots in the heatmaps indicate high sharing between the corresponding pair of TLBs. For instance, KMeans shows a gradual decline in sharing degree as the distance from the local TLB grows, NW shows strong sharing between two adjacent L1 TLBs, Mis shows evenly distributed sharing between any two L1 TLBs, and Softmax shows a random, discrete sharing pattern across all L1 TLBs. Although the sharing distribution pattern differs across workloads, we found that most evaluated workloads exhibit a characteristic similar to NW or KMeans. This indicates that, for most workloads, physically closer L1 TLBs share more entries.

2.3 Motivation

Based on the preliminary study, the existing GPU address translation architecture has two shortcomings that limit performance. On the one hand, the miss rates of both the L1 and L2 TLBs are high for most GPU workloads, which leads to many time-consuming page table walks. On the other hand, there are many duplicate page table entries among the L1 TLBs, which wastes L1 TLB capacity and aggravates the communication contention between the L1 TLBs and the L2 TLB.

Prior works focused on improving private L1 TLB space utilization by exploiting inter-TLB locality and enabling inter-TLB communication, either through a bidirectional ring network connecting all L1 TLBs [10] or through a neighboring-TLB sharing scheme [16]. However, these works cannot effectively reduce the duplicate L1 TLB entries.

To improve the address translation performance of GPUs, the key idea of this paper is to further reduce the duplication among the L1 TLBs. A straightforward approach is to design the L1 TLBs like the shared L2 TLB: if a single L1 TLB is shared by all CUs, there is no duplication. However, this design causes serious communication contention and high request queuing latency between the CUs and the L1 TLBs. Therefore, we propose an intra-group sharing scheme that limits the range of sharing, thus keeping the L1 TLB access latency low. Since each L1 TLB is shared by multiple CUs in the intra-group sharing scheme, we further reduce the number of requests by adding a request aggregation module.
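As a rough illustration of what such a request aggregation module could look like, the sketch below merges concurrent translation requests from different CUs for the same virtual page, so that only one request per page is outstanding toward the shared L1 TLB. The MSHR-like structure, the class name, and the interface are our own assumptions rather than the paper's design.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical aggregation table in front of a shared L1 TLB: concurrent requests
// from the CUs of a group that target the same virtual page number (VPN) are merged,
// so the TLB (and, on a miss, the L2 TLB) sees only one request per page.
class RequestAggregator {
public:
    // Returns true if this VPN must actually be sent to the L1 TLB,
    // false if an identical request is already in flight and was merged.
    bool add_request(uint64_t vpn, unsigned cu_id) {
        auto [it, first] = pending_.try_emplace(vpn);
        it->second.push_back(cu_id);           // remember which CU waits for this page
        return first;
    }

    // Called when the translation for 'vpn' returns; yields all waiting CUs so the
    // single response can be broadcast back to each of them.
    std::vector<unsigned> complete(uint64_t vpn) {
        auto it = pending_.find(vpn);
        if (it == pending_.end()) return {};
        std::vector<unsigned> waiters = std::move(it->second);
        pending_.erase(it);
        return waiters;
    }

private:
    std::unordered_map<uint64_t, std::vector<unsigned>> pending_;  // VPN -> waiting CUs
};
```

With such a structure, the per-TLB request rate scales with the number of distinct pages touched by the group rather than with the number of CUs sharing the TLB.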
