CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Source: Journal of Computer Science and Technology (English edition)
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is common for a thread in a parallel program, such as a GPU kernel in a CUDA program, to contain both sequential code and parallel loops. To exploit such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from the CPU. However, with dynamic parallelism, a parent thread can communicate with its child threads only through global memory, and the overhead of launching GPU kernels is non-trivial even from within the GPU. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops and show that these benchmarks have neither very high loop counts nor a high degree of TLP. Consequently, the benefits of exploiting such parallel loops with dynamic parallelism are too limited to offset its overhead. We then present our solution for exploiting nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we enable a large number of threads when a GPU program starts and use control flow to activate different numbers of threads for different code sections. We implement the CUDA-NP framework using a directive-based compiler approach: for a GPU kernel, an application developer only needs to add OpenMP-like pragmas to parallelizable code sections, and the CUDA-NP compiler then automatically generates the optimized GPU kernels. The compiler supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations among threads, and efficiently manages on-chip resources. Our experiments show that, for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, CUDA-NP further improves performance by up to 6.69 times, and by 2.01 times on average.
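The idea described in the abstract (start with more threads than the original kernel needs, then use control flow to decide how many are active in each code section) can be illustrated with a small hand-written sketch. This is not the output of the CUDA-NP compiler and not code from the paper. The kernel names, the constants `MASTERS`, `SLAVES`, and `LOOP_N`, the row-sum workload, and the shared-memory reduction are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names and sizes below are hypothetical, chosen only for this sketch.
constexpr int MASTERS = 32;  // original threads per block
constexpr int SLAVES  = 8;   // threads sharing one original thread's loop (power of two)
constexpr int LOOP_N  = 64;  // trip count of the inner parallel loop

// Baseline: each thread runs its sequential code and the whole inner loop.
__global__ void row_sum_baseline(const float *in, float *out, int rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float scale = 0.5f;                   // sequential section
    float sum = 0.0f;
    // An OpenMP-like pragma would annotate this loop as a sum reduction.
    for (int i = 0; i < LOOP_N; ++i) sum += in[row * LOOP_N + i];
    out[row] = scale * sum;               // sequential section
}

// Sketch of the nested version: SLAVES threads cooperate per original thread.
__global__ void row_sum_nested(const float *in, float *out, int rows) {
    __shared__ float partial[MASTERS][SLAVES];
    int master = threadIdx.y;             // which original thread
    int slave  = threadIdx.x;             // helper index within that thread's group
    int row    = blockIdx.x * MASTERS + master;
    bool valid = row < rows;              // no early return: all threads reach the barriers

    float scale = 0.5f;                   // cheap uniform value, recomputed by every thread here
    float sum = 0.0f;
    if (valid)                            // parallel section: cyclic distribution of iterations
        for (int i = slave; i < LOOP_N; i += SLAVES) sum += in[row * LOOP_N + i];

    partial[master][slave] = sum;         // tree reduction across the slave dimension
    __syncthreads();
    for (int stride = SLAVES / 2; stride > 0; stride /= 2) {
        if (slave < stride) partial[master][slave] += partial[master][slave + stride];
        __syncthreads();
    }
    // Sequential section: control flow leaves only one thread per group active.
    if (valid && slave == 0) out[row] = scale * partial[master][0];
}

int main() {
    const int rows = 100;
    float *in = nullptr, *out_base = nullptr, *out_np = nullptr;
    if (cudaMallocManaged(&in, rows * LOOP_N * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out_base, rows * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out_np, rows * sizeof(float)) != cudaSuccess) return 1;
    // Small integer inputs keep both summation orders exact in float.
    for (int i = 0; i < rows * LOOP_N; ++i) in[i] = static_cast<float>(i % 7);

    int blocks = (rows + MASTERS - 1) / MASTERS;
    row_sum_baseline<<<blocks, MASTERS>>>(in, out_base, rows);
    row_sum_nested<<<blocks, dim3(SLAVES, MASTERS)>>>(in, out_np, rows);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    int mismatches = 0;
    for (int r = 0; r < rows; ++r)
        if (out_base[r] != out_np[r]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(in); cudaFree(out_base); cudaFree(out_np);
    return mismatches != 0;
}
```

In this sketch the extra threads exist for the whole kernel lifetime, so no child kernel is launched and partial results are exchanged through on-chip shared memory rather than global memory. Those are the two costs of dynamic parallelism that the abstract identifies. The choice of 8 helper threads and a cyclic iteration distribution is arbitrary; the abstract notes that the actual compiler explores several distribution strategies.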