Performance Optimization Of GPU ELF-Codes

Artico, Fausto

GPUs (Graphic Processing Units) are of interest for their favorable ratio $\frac{GF/s}{price}$. Compared to the beginning - early 1980's - nowadays GPU architectures are more similar to general purpose architectures but with (much) larger numbers of cores - the GF100 architecture released by NVIDIA in 2009-2010, for example, has a true hardware cache hierarchy, a unified memory address space, double precision performance and has a maximum of 512 cores. Exploiting the computational power of GPUs for non-graphics applications - past or present - has, however, always been hard. Initially, in the early 2000's, the way to program GPUs was by using graphic libraries API's (exclusively), which made writing non-graphics codes non-trivial and tedious at best, and virtually impossible in the worst case. In 2003, the Brook compiler and runtime system was introduced, giving users the ability to generate GPU code from a high level programming language. In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture). CUDA, a parallel computing platform and programming model specifically developed by NVIDIA for its GPUs, attempts to further facilitate general purpose programming of GPUs. Code edited using CUDA is portable between different NVIDIA GPU architectures and this is one of the reasons because NVIDIA claims that the user's productivity is much higher than previous solutions, however optimizing GPU code for utmost performance remains very hard, especially for NVIDIA GPUs using the GF100 architecture - e.g., Fermi GPUs and some Tesla GPUs - because a) the real instruction set architecture (ISA) is not publicly available, b) the code of the NVIDIA compiler - nvcc - is not open and c) users can not edit code using the real assembly - ELF in NVIDIA parlance. Compilers, while enabling immense increases in programmer productivity, by eliminating the need to code at the (tedious) assembly level, are incapable of achieving, to date, performance similar to that of an expert assembly programmer with good knowledge of the underlying architecture. In fact, it is widely accepted that high-level language programming and compiling even with a state-of-the-art compilers loose, on average, a factor of 3 in performance - and sometimes much more - over what a good assembly programmer could achieve, and that even on a conventional, simple, single-core machine. Compilers for more complex machines, such as NVIDIA GPUs, are likely to do much worse because among other things, they face (even more) complex trade-offs between often undecidable and NP-hard problems. However, because NVIDIA a) makes it virtually impossible to gain access to the actual assembly language used by its GF100 architecture, b) does not publicly explain many of the internal mechanisms implemented in its compiler - nvcc - and c) makes it virtually impossible to learn the details of its very complex GF100 architecture in sufficient detail to be able to exploit them, obtaining an estimate of the performance difference between CUDA programming and machine-level programming for NVIDIA GPUs using the GF100 architecture - let alone achieving some a priori performance guarantees of shortest execution time - has been, prior to this current work, impossible. To optimize GPU code, users have to use CUDA or PTX (Parallel Thread Execution) - a virtual instruction set architecture. The CUDA or PTX files are given in input to nvcc that produces as output fatbin files. The fatbin files are produced considering the target GPU architecture selected by the user - this is done setting a flag used by nvcc. In a fatbin file, zero or more parts of the fatbin file will be executed by the CPU - think of these parts as the C/C++ parts - while the remaining parts of the fatbin file - think of these parts as the ELF parts - will be executed by the specific model of the GPU for which the CUDA or PTX file has been compiled. The fatbin files are usually very different from the corresponding CUDA or PTX files and this lack of control can completely ruin any effort made at CUDA or PTX level to optimize the ELF part/parts of the fatbin file that will be executed by the target GPU for which the fatbin file has been compiled. We therefore reverse engineer the real ISA used by the GF100 architecture and generate a set of editing guidelines to force nvcc to generate fatbin files with at least the minimum number of resources later necessary to modify them to get the wanted ELF algorithmic implementations - this gives control on the ELF code that is executed by any GPU using the GF100 architecture. During the process of reverse engineering we also discover all the correspondences between PTX instructions and ELF instructions - a single PTX instruction can be transformed in one or more ELF instructions - and the correspondences between PTX registers and ELF registers. Our procedure is completely repeatable for any NVIDIA Kepler GPU - we do not need to rewrite our code. Being able to get the wanted ELF algorithmic implementations is not enough to optimize the ELF code of a fatbin file, we need in fact also to discover, understand, and quantify some not disclosed GPU behaviors that could slow down the execution of ELF code. This is necessary to understand how to execute the optimization process and while we can not report here all the results we have got, we can however say that we will explain to the reader a) how to force even distributions of the GPU thread blocks to the streaming multiprocessors, b) how we have discovered and quantified several warp scheduling phenomenons, c) how to avoid phenomenons of warp scheduling load unbalancing, that it is not possible to control, in the streaming multiprocessors, d) how we have determined, for each ELF instruction, the minimum quantity of time that it is necessary to wait before a warp scheduler can schedule again a warp - yes, the quantity of time can be different for different ELF instructions - e) how we have determined the time that it is necessary to wait before to be able to read again the data in a register previously read or written - this too can be different for different ELF instructions and different whether the data has been previously read or written - and f) how we have discovered the presence of an overhead time for the management of the warps that does not grow linearly to a liner increase of the number of residents warps in a streaming multiprocessor. Next we explain a) the procedures of transformation that it is necessary to apply to the ELF code of a fatbin file to optimize the ELF code and so making its execution time as short as possible, b) why we need to classify the fatbin files generated from the original fatbin file during the process of optimization and how we do this using several criteria that as final result allow us to determine the positions, occupied by each one of the fatbin files generated, in a taxonomy that we have created, c) how using the position of a fatbin file in the taxonomy we determine whether the fatbin file is eligible for an empirical analysis - that we explain - a theoretical analysis or both, and d) how - if the fatbin file is eligible for a theoretical analysis - we execute the theoretical analysis that we have devised and give an a priori - without any previous execution of the fatbin file - shortest ELF code execution time guarantee - this if the fatbin file satisfies all the requirements of the theoretical analysis - for the ELF code of the fatbin file that will be executed by the target GPU for which the fatbin file has been compiled.

GPUs (Graphic Processing Units) sono di interesse per il loro favorevole rapporto $\frac{GF/s}{price}$. Rispetto all'inizio - primi anni 70 - oggigiorno le architectture GPU sono più simili ad architectture general purpose ma hanno un numero (molto) più grande di cores - la architecttura GF100 rilasciata da NVIDIA durante il 2009-2010, per esempio, ha una vera gerarchia di memoria cache, uno spazio unificato per l'indirizzamento in memoria, è in grado di eseguire calcoli in doppia precisione ed ha un massimo 512 core. Sfruttare la potenza computazionale delle GPU per applicazioni non grafiche - passate o presenti - è, comunque, sempre stato difficile. Inizialmente, nei primi anni 2000, la programmazione su GPU avveniva (esclusivamente) attraverso l'uso librerie grafiche, le quali rendevano la scrittura di codici non grafici non triviale e tediosa al meglio, e virtualmente impossibile al peggio. Nel 2003, furono introdotti il compilatore e il sistema runtime Brook che diedero agli utenti l'abilità di generare codice GPU da un linguaggio di programmazione ad alto livello. Nel 2006 NVIDIA introdusse CUDA (Compute Unified Device Architecture). CUDA, un modello di programmazione e computazione parallela specificamente sviluppato da NVIDIA per le sue GPUs, tenta di facilitare ulteriormente la programmazione general purpose di GPU. Codice scritto in CUDA è portabile tra differenti architectture GPU della NVIDIA e questa è una delle ragioni perché NVIDIA afferma che la produttività degli utenti è molto più alta di precedenti soluzioni, tuttavia ottimizare codice GPU con l'obbiettivo di ottenere le massime prestazioni rimane molto difficile, specialmente per NVIDIA GPUs che usano l'architecttura GF100 - per esempio, Fermi GPUs e delle Tesla GPUs - perché a) il vero instruction set architecture (ISA) è non pubblicamente disponibile, b) il codice del compilatore NVIDIA - nvcc - è non aperto e c) gli utenti non possono scrivere codice usando il vero assembly - ELF nel gergo della NVIDIA. I compilatori, mentre permettono un immenso incremento della produttività di un programmatore, eliminando la necessità di codificare al (tedioso) livello assembly, sono incapaci di ottenere, a questa data, prestazioni simili a quelle di un programmatore che è esperto in assembly ed ha una buona conoscenza dell'architettura sottostante. Infatti, è largamente accettato che programmazione ad alto livello e compilazione perfino con compilatori che sono considerati allo stato dell'arte perdono, in media, un fattore 3 in prestazione - e a volte molto di più - nei confronti di cosa un buon programmatore assembly potrebbe ottenere, e questo perfino su una macchina convenzionale, semplice, a singolo core. Compilatori per macchine più complesse, come le GPU NVIDIA, sono propensi a fare molto peggio perché tra le altre cose, essi devono determinare (persino più) complessi trade-offs durante la ricerca di soluzioni a problemi spesso indecidibili e NP-hard. Peraltro, perché NVIDIA a) rende virtualmente impossibile guadagnare accesso all'attuale linguaggio assembly usato dalla architettura GF100, b) non spiega pubblicamente molti dei meccanismi interni implementati nel suo compilatore - nvcc - e c) rende virtualmente impossible imparare i dettagli della molto complessa architecttura GF100 ad un sufficiente livello di dettaglio che permetta di sfruttarli, ottenere una stima delle differenze prestazionali tra programmazione in CUDA e programmazione a livello macchina per GPU NVIDIA che usano la architecttura GF100 - per non parlare dell'ottenimento a priori di garanzie di tempo di esecuzione più breve - è stato, prima di questo corrente lavoro, impossbile. Per ottimizare codice GPU, gli utenti devono usare CUDA or PTX (Parallel Thread Execution) - un instruction set architecture virtuale. I file CUDA or PTX sono dati in input a nvcc che produce come output fatbin file. I fatbin file sono prodotti considerando l'architecttura GPU selezionata dall'utente - questo è fatto settando un flag usato da nvcc. In un fatbin file, zero o più parti del fatbin file saranno eseguite dalla CPU - pensa a queste parti come le parti C/C++ - mentre le rimanenti parti del fatbin file - pensa a queste parti come le parti ELF - saranno eseguite dallo specifico modello GPU per il quale i file CUDA or PTX sono stati compilati. I fatbin file sono normalmente molto differenti dai corrispodenti file CUDA o PTX e questa assenza di controllo può completamente rovinare qualsiasi sforzo fatto a livello CUDA o PTX per otimizzare la parte o le parti ELF del fatbin file che sarà eseguita / saranno eseguite dalla GPU per la quale il fatbin file è stato compilato. Noi quindi scopriamo quale è il vero ISA usato dalla architettura GF100 e generiamo un insieme di linea guida per scrivere codice in modo tale da forzare nvcc a generare fatbin file con almeno il minimo numero di risorse successivamente necessario per modificare i fatbin file per ottenere le volute implementazioni algoritmiche in ELF - questo da controllo sul codice ELF che è eseguito da qualsiasi GPU che usa l'architettura GF100. Durante il processo di scoperata del vero ISA scopriamo anche le corrispondenze tra istruzioni PTX e istruzioni ELF - una singola istructione PTX può essere transformata in one o più istruzioni ELF - e le corrispondenze tra registri PTX e registri ELF. La nostra procedura è completamente ripetibile per ogni NVIDIA Kepler GPU - non occorre che riscrivamo il nostro codice. Essere in grado di ottenere le volute implementazioni algoritmiche in ELF non è abbastanza per ottimizzare il codice ELF di un fatbin file, ci occorre infatti anche scoprire, comprendere e quantificare dei comportamenti GPU che non sono divulgati e che potrebbero rallentare l'esecuzione di codice ELF. Questo è necessario per comprendere come eseguire il processo di ottimizzazione e mentre noi non possiamo riportare qui tutti i risultati che abbiamo ottenuto, noi possiamo comunque dire che spiegheremo al lettore a) come forzare una distribuzione uniforme dei GPU thread blocks agli streaming multiprocessors, b) come abbiamo scoperto e quantificato diversi fenomeni riguardanti il warp scheduling, c) come evitare fenomeni di warp scheduling load unblanacing, che è non possible controllare, negli streaming multiprocessors, d) come abbiamo determinato, per ogni istruzione ELF, la minima quantità di tempo che è necessario attendere prima che un warp scheduler possa schedulare ancora un warp - si, la quantità di tempo può essere differente per differenti istruzioni ELF - e) come abbiamo determinato il tempo che è necessario attendere prima di essere in grado di leggere ancora un dato in un registro precedentemente letto o scritto - questo pure può essere differente per differnti istruzioni ELF e differente se il dato è stato precedentemente letto o scritto - e f) come abbiamo scoperto la presenza di un tempo di overhead per la gestione dei warp che non cresce linearmente ad un incremento lineare del numero di warp residenti in uno streaming multiprocessor. Successivamente, noi spiegamo a) le procedure di trasformazione che è necessario applicare al codice ELF di un fatbin file per ottimizzare il codice ELF e così rendere il suo tempo di esecuzione il più corto possibile, b) perché occorre classificare i fatbin file generati dal fatbin file originale durante il processo di ottimizzazione e come noi facciamo questo usando diversi criteri che come risultato finale permettono a noi di determinare le posizioni, occupate da ogni fatbin file generato, in una tassonomia che noi abbiamo creato, c) come usando la posizione di un fatbin file nella tassonomia noi determiniamo se il fatbin file è qualificato per una analisi empirica - che noi spieghiamo - una analisi teorica o entrambe and d) come - supponendo il fatbin file sia qualificato per una analisi teorica - noi eseguiamo l'analisi teorica che abbiamo ideato e diamo a priori - senza alcuna precedente esecuzione del fatbin file - la garanzia - questo supponendo il fatbin file soddisfi tutti i requisiti dell'analisi teorica - che l'esecuzione del codice ELF del fatbin file, quando il fatbin file sarà eseguito sulla architettura GPU per cui è stato generato, sarà la più breve possibile.

Performance Optimization Of GPU ELF-Codes / Artico, Fausto. - (2014 Jan 27).