Data packing before and after communication can account for as much as 90% of the communication time on modern computers. Despite MPI's well-defined datatype interface for non-contiguous data access, many codes use manual pack loops for performance reasons. Programmers write access-pattern-specific pack loops (e.g., with manual unrolling) for which compilers emit optimized code. In contrast, MPI implementations in use today interpret datatypes at pack time, resulting in high overheads. In this work we explore the effectiveness of using runtime compilation techniques to generate efficient, optimized pack code for MPI datatypes at commit time. Thus, none of the overhead of datatype interpretation is incurred at pack time, and initiating a pack is as cheap as calling a function pointer. We have implemented a library called libpack that can be used to compile and (un)pack MPI datatypes. The library optimizes the datatype representation and uses the LLVM framework to produce vectorized machine code for each datatype at commit time. We show several examples of how MPI datatype pack functions benefit from runtime compilation and analyze the performance of compiled pack functions for the data access patterns found in many applications. We show that the pack/unpack functions generated by our packing library are seven times faster than those of prevalent MPI implementations for 73% of the datatypes used in MILC, and in many cases outperform manual pack loops.
There is increasing interest in the HPC community in using C++ for applications. This stems from a growing need for abstractions, libraries, code reuse, and, more generally, access to modern programming-language features. The result, however, is that HPC programmers need to change how they think about which aspects of their code matter most for performance: it is no longer just about the actual computation. They also need to adapt to a different toolchain, such as the Clang/LLVM C++ compiler, and to different patterns of structuring and organizing their code.
In this talk I will give an overview of how LLVM understands and optimizes C++ code. I will show, at a fundamental level rather than in pass-specific detail, what kinds of optimizations can be expected from LLVM today and in the future. Finally, I will provide guidance on how to write code in a way that helps the optimizer rather than hindering it.
Julia is a modern general-purpose programming language with a focus on scientific computing, bringing together high expressiveness, productivity, and performance in a single language. Julia uses the LLVM JIT as its code-generation backend and as a result can achieve performance comparable to that of C or C++ code.
This talk will explore Julia's interaction and integration with LLVM itself and with other LLVM-based projects. With LLVM as the foundation, Julia can make use of the large set of tools written on top of the LLVM platform to provide new and powerful features to its users.
As a particular example, this talk will demonstrate Julia's integration with Clang for directly interoperating with C++ code. This approach allows Julia and C++ code to make calls in both directions with virtually no overhead, automatically translating types between the two languages and providing the ability to embed either language inside the other. In addition, being able to use C++ libraries in a dynamic and interactive fashion can speed up the development cycle considerably.
Using this capability, this talk will then explore the way Julia itself uses LLVM. By calling LLVM's C++ APIs interactively from within the program, one can modify and introspect the entire code-generation stack, taking advantage of Julia's interactivity and visualization features.
Current semiconductor trends show a major shift in computer system architectures towards heterogeneous systems that combine different processors, such as CPUs, GPUs, and DSPs, which work together and perform many different kinds of tasks in parallel.
This has increased developers' demand for simpler, more accessible, more powerful, and performance-portable parallel programming languages and models. In particular, many developers have been requesting C++ as a programming language for heterogeneous programming.
One solution to this is provided by SYCL: a royalty-free, cross-platform C++ abstraction layer that combines the underlying concepts, portability, and efficiency of OpenCL with the flexibility and ease of use of C++.
One of the most powerful features of SYCL is its shared-source programming model, which allows both host and device code to live in the same source file. This gives developers a simple, accessible interface for sharing complex templated algorithms across the host and a range of heterogeneous devices.
This talk will describe how the LLVM and Clang compiler framework and OpenCL SPIR infrastructure were used to develop the shared source programming model that makes SYCL possible. It will look at the design and development of the standard programming model, as well as how Codeplay are implementing a version of it with Clang/LLVM. The talk will cover the various techniques that were used as well as the issues that were encountered along the way and how they were addressed.
There will be discussion of many common issues in developing programming models for heterogeneous programming, including kernel function construction, call-graph traversal, address-space deduction, device code analysis, and diagnostics. Innovative ideas presented will include how multiple compilers can extract different parts of the source code from a single source file, and how the compiler interacts with the host-side runtime in SYCL to make the shared-source programming model seamless and transparent to the developer.
The approaches presented will be applicable to any Clang/LLVM developers interested in targeting C++ software to heterogeneous processors.
We present PACXX, a unified programming model for many-core systems that comprise accelerators such as Graphics Processing Units (GPUs). One of the main difficulties of current GPU programming is that two distinct programming models are required: the host code for the CPU is written in C/C++ with a restricted, C-like API for memory management, while the device code for the GPU has to be written in a device-dependent, explicitly parallel programming model, e.g., OpenCL or CUDA. This leads to long, poorly structured, and error-prone codes. In PACXX, both host and device programs are written in the same programming language, the newest C++14 standard, with all modern features including type inference (auto), variadic templates, generic lambda expressions, as well as STL containers and algorithms. We implement PACXX with a custom compiler (based on the Clang front end and LLVM IR) and a runtime system that together perform the major tasks of memory management and data synchronization automatically and transparently for the programmer. We evaluate our approach by comparing it to CUDA and OpenCL with respect to program size and target performance.
GPU devices are becoming critical building blocks of high-performance platforms for performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP have been extended to support offloading code to such devices. OpenMP compilers are thus faced with providing an efficient implementation of device-targeting constructs.
One main issue in implementing OpenMP on a GPU is efficiently supporting sequential and parallel regions, as GPUs are optimized only for highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads on an NVIDIA GPU that is both efficient and easily integrated into a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and compared their performance. We show that a scheme based on dynamic parallelism performs poorly compared to the inspector-executor schemes that we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.
Profile-guided optimizations (PGO) offer optimization opportunities that are typically hard to obtain with static heuristics and techniques. In several application domains, significant performance gains can be obtained by using runtime profiles to guide optimization. However, traditional PGO techniques that rely on compiler instrumentation are difficult enough to use that they have not become very popular.
This paper describes SamplePGO, an LLVM implementation of a profile-guided technique that addresses the usability problem. Instead of using compiler-generated instrumentation to generate profiles, it relies on profile information gathered by external sampling profilers. These profilers use commonly available hardware counters to inspect the execution of a program.
In our experience with large applications in video/image processing, log processing, web search, and ads, our GCC implementation has obtained performance improvements of up to 30% using SamplePGO over statically optimized code.
While the LLVM implementation is fully functional as of LLVM 3.5, not many optimization passes in LLVM make use of profile information. Therefore, the speedups we have been able to obtain using LLVM are much more modest (up to 5% over statically optimized code).
A primary concern for future high-performance systems is how data movement is managed; the sheer scale of the data to be processed directly affects the performance these systems can attain. However, the increasingly complex but inherently symbiotic relationships between upcoming scientific applications and high-performance architectures necessitate increasingly informative and flexible tools to ensure that performance goals are met.
In this work we develop a memory-hierarchy model that quantifies a given application’s cache behavior. What makes this work unique is that we instrument code at compile time, gather architecture-independent data at run time using a generic memory-hierarchy model, and delay selecting a particular cache hierarchy (levels, sizes, and associativities) to a post-processing step, where cache performance can be derived rapidly without having to re-run a slow cache simulator. We show that this approach is capable of predicting cache misses to within 13% of what is predicted by a traditional, high-fidelity, but slow cache simulator.