5 RELATED WORK

To leverage inter-TLB locality, Baruah et al. [10] proposed Valkyrie, which prefetches into the L2 TLB using a Locality Detection Table (LDT) and probes remote L1 TLBs through an on-chip ring network. IGS-TLB similarly exploits the sharing characteristics among L1 TLBs, but it additionally eliminates duplicate page table entries among the L1 TLBs within a group, improving L1 TLB space utilization and ultimately the address translation performance of the GPU.

There are many other studies on GPU address translation. Power et al. [31] evaluated the main components of the GPU MMU and showed that modest hardware changes can improve GPU address translation at low overhead. Pichai et al. [30] also designed a basic GPU MMU model but focused on warp scheduling. Yoon et al. [39] proposed a software-transparent virtual cache hierarchy for GPUs, in which applications access data directly from caches, bypassing the TLB unless a cache miss occurs. Shahar et al. [33] designed software address translation, comprising a software TLB and a translation aggregation algorithm, to support memory-mapped files. Shin et al. [35] proposed a SIMT-aware scheduler that reorders page table walks using shortest-job-first scheduling and batches requests, reducing stalls and letting the page table walker work more efficiently. Ausavarungnirun et al. [8] proposed an application-transparent GPU memory manager. Ausavarungnirun et al. [9] designed a memory hierarchy for multi-application workloads that improves the performance of the page table walker cache and the shared TLB through TLB-fill tokens, L2-TLB bypassing, and an effective DRAM scheduler. Yan et al. [38] proposed Translation Ranger, which provides operating system support for contiguity-aware TLBs by migrating physical pages in a background daemon process. Jaleel et al. [20] proposed DUCATI, which stores address translations in the last-level cache and memory to increase TLB reach. Skarlatos et al. [36] proposed a page table design based on elastic cuckoo hashing. Achermann et al. [1] proposed Mitosis, which transparently replicates and migrates page tables across sockets. Tang et al. [37] designed compression mechanisms for the TLB to increase TLB reach. Kotra et al. [24] proposed a mechanism that leverages under-utilized instruction cache and shared memory to improve TLB reach without additional hardware modification. Li et al. [25] proposed a TLB hierarchy (least-TLB) that employs a least-inclusive policy and uses a Cuckoo filter to track the translations held in multiple GPUs' L2 TLBs, increasing TLB reach.

6 CONCLUSION

In this paper, we presented a detailed experimental analysis of the performance and sharing characteristics of TLBs during GPU address translation. Exploiting the observed data sharing characteristics, we proposed IGS-TLB, a hardware solution that eliminates duplicate page table entries within each group of L1 TLBs, and added an aggregation module to reduce the number of requests issued by multiple CUs. Our scheme is efficient to implement with low hardware overhead and provides a 20.5% performance speedup across various GPU workloads.
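To make the batching-and-reordering idea attributed to Shin et al. [35] concrete, the following is a minimal Python sketch, not the paper's actual scheduler: it coalesces page-table-walk requests that target the same virtual page into batches, then services the batches shortest-job-first, using batch size as a simple stand-in for job length. The function name and request format are hypothetical.

```python
from collections import defaultdict

def schedule_walks(requests):
    """Coalesce page-table-walk requests to the same virtual page,
    then order the resulting batches shortest-job-first.

    `requests` is a list of (warp_id, virtual_page) pairs; the names
    and the batch-size proxy for job length are illustrative only.
    """
    batches = defaultdict(list)
    for warp_id, vpage in requests:
        # Walks to the same page are merged: one walk serves all waiters.
        batches[vpage].append(warp_id)
    # Shortest-job-first: issue the walk with the fewest waiting warps
    # first (ties broken by page number for determinism).
    return sorted(batches, key=lambda p: (len(batches[p]), p))
```

The coalescing step alone already reduces the number of walks the page table walker must perform; the SJF ordering on top of it shortens the average wait of the remaining walks.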
