Abstract
Ever since Babbage and Lovelace's Analytical Engine, the machine language executed by processors has typically consisted of a sequence of instructions executed one after another. Yet an efficient hardware implementation requires instructions to be executed in parallel. General-purpose processors and graphics processors follow two different approaches to bridge this gap.
A modern processor core maintains the illusion of sequential execution, but in reality keeps several hundred instructions in flight and executes them out of order. This balancing act relies on numerous hardware mechanisms, including branch prediction and register renaming.
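As a minimal sketch of why renaming matters (illustrative only; the function and variable names are placeholders, not taken from this work), the fragment below shows the kind of false dependency that register renaming removes, letting an out-of-order core overlap two independent instruction chains:

```cuda
// Host-side C++ sketch: two dependency chains sharing one variable.
float two_chains(const float* x) {
    float t = x[0] * x[1];   // (1)
    float a = t + x[2];      // (2) truly depends on (1)
    t = x[3] * x[4];         // (3) rewrites t: only a false (write-after-read)
                             //     hazard; renaming gives this t a fresh
                             //     physical register, so (3) need not wait
                             //     for (2) to finish
    float b = t + x[5];      // (4) truly depends on (3)
    return a + b;            // chains (1)-(2) and (3)-(4) overlap in flight
}
```

At the machine-code level, an out-of-order core applies the same idea to architectural registers, so that program order constrains only true data dependencies.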
Symmetrically, graphics processors, or GPUs, expose a massively parallel programming model with hundreds of thousands of independent threads. A GPU core maintains the illusion of a large number of independent threads, but in reality groups them into a smaller number of bundles of threads executing in lockstep, called warps, to amortize their management cost.
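For instance, in CUDA (a sketch under the usual SIMT assumptions; the saxpy kernel is a generic textbook example, not code from this work), the programmer writes one logical thread while the hardware schedules warps of 32:

```cuda
// Each logical thread computes one element; the hardware fetches and
// decodes one instruction per warp and applies it to 32 lanes at once.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // logical thread index
    // Threads i..i+31 of the same warp share a single instruction stream;
    // lanes where the branch outcome differs are simply masked off.
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launching 1 << 20 logical threads leaves the scheduler only
// (1 << 20) / 32 = 32768 warps to manage:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```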
These two worlds, general-purpose processors and GPUs, are notoriously incompatible: neither branch prediction nor register renaming works as-is when threads are grouped into warps. We will show, however, that it is possible to overcome this incompatibility by generalizing the prediction and renaming mechanisms, in order to design GPUs with out-of-order execution that combine the sequential-computing strengths of the former with the parallel-computing strengths of the latter.