Automated Design of Finite State Machine Predictors
Timothy Sherwood and Brad Calder
Department of Computer Science and Engineering, University of ...
A stride value predictor [7, 8, 33] keeps track of not only the last value brought in by an instruction, but also the difference between that value and the previous value. This difference is called the stride. The predictor speculates that the new value seen by the instruction will be the sum of the last value seen and the stride. We chose to use the two-delta stride predictor [8, 33], which only replaces the predicted stride with a new stride if that new stride has been seen twice in a row. Each entry contains a tag, the predicted value, the predicted stride, the last stride seen, and a saturating up and down conﬁdence counter. We use a table size of 2K entries for the stride value predictor, which means there are also 2K conﬁdence counters used (one for each table entry). We performed value prediction for only load instructions.
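The two-delta update rule can be made concrete with a short sketch. This is a minimal Python model of a single table entry, assuming the behavior described above (the predicted stride is only replaced once a new stride has been seen twice in a row); class and field names are illustrative, not from the paper, and the tag and confidence counter are omitted for brevity.

```python
class TwoDeltaStrideEntry:
    """One entry of a two-delta stride value predictor (illustrative sketch)."""

    def __init__(self):
        self.last_value = 0
        self.pred_stride = 0   # stride actually used for prediction
        self.last_stride = 0   # most recently observed stride

    def predict(self):
        # Speculate: next value = last value + predicted stride.
        return self.last_value + self.pred_stride

    def update(self, actual_value):
        new_stride = actual_value - self.last_value
        # Promote the new stride only if it matches the previous stride,
        # i.e., it has now been seen twice in a row (the two-delta rule).
        if new_stride == self.last_stride:
            self.pred_stride = new_stride
        self.last_stride = new_stride
        self.last_value = actual_value
```

For example, after observing the load values 10, 12, 14, 16, the entry has locked onto a stride of 2 and predicts 18 next.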
6.2 Value Prediction Conﬁdence There have been several papers that point out the beneﬁt of applying value prediction to various troublesome instructions. It has further been shown that actually predicting the values of these instructions is error prone. It was noted both by Calder et al. [5, 4] and by Burtscher and Zorn that, for value prediction to be a viable option, an improved conﬁdence estimation technique is needed. Conﬁdence estimation allows the processor to allocate resources more prudently based on whether it believes the predicted values are correct. In effect, two predictions are made: a value prediction, and a meta-prediction as to whether the value prediction is correct.
In [5, 4], Calder et al. examined Saturating Up and Down (SUD) counters with several different saturation thresholds and wrong-decrement values to vary the accuracy and coverage; this tuning was done by trial and error.
They showed that no one SUD counter worked best for all programs. A very accurate SUD counter was needed for mispredicted values when using squash recovery to obtain increases in performance, but this resulted in low coverage of potential value predictions. In contrast, when value prediction used re-execution recovery, it did not have to be as accurate, since the miss penalty is small, and the SUD counter could instead concentrate on achieving a high coverage.
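The SUD counter design space swept above can be sketched as a small Python class. The three knobs mirror the configurations explored in [5, 4] (maximum value, amount decremented on a misprediction, and confidence threshold); the parameter names and defaults here are illustrative assumptions, not the paper's.

```python
class SUDConfidence:
    """Saturating up/down confidence counter (illustrative sketch).

    max_value: saturation ceiling (number of counter states)
    wrong_dec: amount subtracted on a misprediction
    threshold: fraction of max_value at or above which we are 'confident'
    """

    def __init__(self, max_value=10, wrong_dec=2, threshold=0.8):
        self.max_value = max_value
        self.wrong_dec = wrong_dec
        self.thresh = threshold * max_value
        self.count = 0

    def confident(self):
        # Mark the value prediction as confident only above threshold.
        return self.count >= self.thresh

    def update(self, prediction_was_correct):
        if prediction_was_correct:
            self.count = min(self.count + 1, self.max_value)
        else:
            self.count = max(self.count - self.wrong_dec, 0)
```

A large `wrong_dec` (up to a full reset) and high `threshold` give the very accurate but low-coverage counters needed for squash recovery; gentler settings trade accuracy for the higher coverage that suits re-execution recovery.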
6.3 Automatically Designing Conﬁdence Predictors Instead of using a Saturating Up/Down counter for estimating the conﬁdence of the value predictions we propose using our automatically designed FSM predictors to generate the conﬁdence estimation hardware.
Each time a load was executed, we recorded in the trace, which was used to create the Markov model, whether the load was correctly value predicted (1) or not (0). We then applied the techniques in Section 4 to automatically generate an FSM predictor. This automatically generated FSM provides a conﬁdence prediction for each executed load, predicting whether it will be correctly value predicted.
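The Section 4 construction is not reproduced here, but the flavor of building a predictor from a Markov model of the 0/1 trace can be sketched as follows: count, for each recent history of outcomes, how often the next prediction is correct, and make each history an FSM state whose output is the majority outcome. This is a simplified stand-in under that assumption, not the paper's actual algorithm; the function name and history length are illustrative.

```python
from collections import defaultdict

def train_history_predictor(trace, n=4):
    """Build a confidence table from a 0/1 trace (1 = value prediction
    correct). Each length-n history of outcomes becomes a state; the
    state outputs 'confident' when correct outcomes dominate.
    Simplified stand-in for the Markov-model-based FSM construction."""
    counts = defaultdict(lambda: [0, 0])   # history -> [incorrect, correct]
    hist = tuple(trace[:n])
    for outcome in trace[n:]:
        counts[hist][outcome] += 1
        hist = hist[1:] + (outcome,)       # slide the history window
    return {h: c[1] > c[0] for h, c in counts.items()}
```

At run time, the last `n` outcomes for a load index this table to decide whether its next value prediction should be trusted.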
We use the same design techniques as described in Section 4, except that instead of targeting a single application we target a suite of general purpose applications. In order to make sure that the predictors generated are not specialized to a speciﬁc application, but rather to a suite of applications, we provide cross-trained results. For each application in our suite, we combine the traces from all of the other programs excluding the application to be used for reporting results. So for example, when we show results for gcc, the predictor has been trained on all of the other programs in the suite, but not on gcc.
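The leave-one-out trace combination described above is mechanically simple; as a hedged sketch (the function name and data layout are assumptions), it amounts to:

```python
def cross_trained_traces(traces_by_app):
    """Leave-one-out cross-training: the predictor reported for each
    application is trained on the concatenated 0/1 traces of every
    *other* application in the suite, never its own."""
    return {
        app: [bit for other, t in traces_by_app.items()
              if other != app for bit in t]
        for app in traces_by_app
    }
```

So the training trace used when reporting gcc contains every program's outcomes except gcc's.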
6.4 Results Figure 2 shows value prediction conﬁdence for saturating up/down counters versus state machine predictors that were designed automatically with cross-training across all benchmarks. The x-axis is the accuracy of the value predictor when using the various conﬁdence schemes. The accuracy is the percent of value predictions marked as conﬁdent that were in fact correct predictions. An accuracy of 100% means that every time the conﬁdence predictor marked a prediction as conﬁdent, it was a correct prediction. The y-axis is the coverage, which is deﬁned as the percent of correct value predictions that were allowed through by the conﬁdence predictor.
A coverage of 100% means that all of the times that the value predictor predicted a value correctly, it was marked as conﬁdent by the conﬁdence predictor.
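These two definitions can be computed directly from a stream of (confident, correct) pairs, one per value prediction. The following sketch just restates the definitions above in code; the function name is illustrative.

```python
def accuracy_and_coverage(events):
    """events: iterable of (confident, correct) pairs per value prediction.
    Accuracy  = fraction of confident predictions that were correct.
    Coverage  = fraction of correct predictions marked confident."""
    events = list(events)
    confident_correct = sum(1 for conf, corr in events if conf and corr)
    confident_total = sum(1 for conf, _ in events if conf)
    correct_total = sum(1 for _, corr in events if corr)
    accuracy = confident_correct / confident_total if confident_total else 0.0
    coverage = confident_correct / correct_total if correct_total else 0.0
    return accuracy, coverage
```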
Accuracy and coverage trade off with each other. It is possible to get higher accuracy by being conservative and marking as conﬁdent only those value predictions that you are very sure are correct. On the other hand, higher coverage can be had by being more trusting of the value predictor. This tradeoff can be clearly seen in Figure 2. The points for the up/down counters are the accuracy and coverage points for a variety of counter conﬁgurations. The points shown are for counters with a maximum value (number of states) of 5, 10, 20, and 40, miss penalties of 1, 2, 5, 10, and full, and for thresholds of 50%, 80%, and 90%.
This is to be compared to the lines in Figure 2 that show the coverage and accuracy tradeoffs that can be made by using automatically designed state machines with history lengths of 2 to 10. These results indicate that there is a signiﬁcant amount of coverage to be gained at any given accuracy. Take gcc for example: at a target accuracy of 80%, the best conﬁguration of saturating up/down counter gets a coverage of less than 10%. In comparison, our automatically designed state machine (derived from cross-training) is capable of covering 20% of correct predictions, which results in more than twice as many correct value predictions at the same accuracy.
It is interesting to note that our automatically generated FSM predictors converge with the saturating up/down counter results for extremely high accuracy requirements. This is because, in order to achieve the highest accuracies possible, the best solution is to simply wait for very long sequences of correct predictions before becoming conﬁdent. Being this conservative comes at a price, though: very low coverage.
7 Branch Predictors for Customized Processors In this section we examine applying our automated approach to designing FSM predictors to the customization of branch predictors for customized processors. We begin with a discussion of customized processors, describe our customized branch prediction architecture, and then present the training approach we use for building the customized FSM branch predictors. We then evaluate the performance and area of the customized predictor and compare it to a range of prior branch predictors.
Figure 2: Value Prediction Conﬁdence for saturating up/down counters versus state machine predictors that were designed automatically from cross-training across benchmarks. The x-axis is the Accuracy of the value predictor when using the various conﬁdence schemes. The y-axis is the percent of correct value predictions that were allowed through by the conﬁdence predictor.

7.1 Customized Processors The rapidly growing embedded electronics industry demands high performance, low cost systems with increased pressure on design time. The gap between the two current methods for dealing with such systems, ASICs and off-the-shelf processors, leaves many unsatisﬁed with the tradeoffs between long design cycles and lower performance.
The designer could choose to design a custom ASIC, which has the advantage of high performance at the cost of long design and veriﬁcation times. An alternative choice is an off-the-shelf embedded processor. Embedded processors allow rapid development at low cost in terms of both time and money, but with performance typically lagging behind ASICs.
To address this problem, an emerging technique adds a new and important point to the spectrum of solutions. Automatically generated customized processors promise to deliver the needed performance with only slightly inﬂated design time.
A customized processor is a processor tailored to an individual application or a set of applications. The idea became a common research subject in the late eighties and early nineties with projects such as SCARCE and The Architect’s Workbench (AWB). SCARCE was a ﬂexible VLSI core for building application speciﬁc processors. AWB was a system built to help designers evaluate design tradeoffs for building embedded processors.
Currently there is a resurgence of interest in customized processors, with many research projects and companies working in this domain. A great deal of the interest comes from increased system-level integration, where it is increasingly common to place different parts of a system all onto one chip. To support this, it is now common for vendors to sell descriptions of processor cores rather than the actual silicon itself. These processor core descriptions can then be customized for a given application.
There are two major approaches to customized processors, one working towards pre-silicon customization and the other pressing for post-silicon reconﬁguration or adaptability. The ﬁrst approach, which is being adopted by such systems as Hewlett Packard’s PICO project [29, 1, 34] and Tensilica’s Xtensa [15, 10], is intended to produce low-cost high-speed ﬁxed hardware for embedded systems. The other approach attempts to take advantage of post-silicon customization through reconﬁgurability. Examples of systems that support reconﬁgurability are Altera’s NIOS processor core and the CHIMEARA chip.
There is also another degree of freedom for these chips: the design level at which customization occurs. Some systems, such as CHIMEARA and PRISC, concentrate their application tailoring at the level internal to a functional unit.
These systems work by tailoring the processor’s functional units to the application running on it. For example, they might merge commonly executed expressions into a single instruction. Other options are to perform the customization at the architectural level, over the number of registers and functional units, as in Xtensa, SCARCE, and Lx. This allows the system designer to make high-level architectural tradeoffs for the application, such as relieving register pressure or removing unused functional units. Still other designs provide co-processors for a given application or type of application. Xtensa supports the tight integration of co-processors into an architecture, while PICO [29, 1, 34] automatically generates co-processors in the form of a custom-designed systolic array.
Even though there is a long history and many different projects, all of the projects share a similar high-level structure: begin with a conﬁgurable or parameterizable architectural template and customize it to ﬁt the application at hand. Because all of these systems target a very speciﬁc application or suite of applications, and these applications are under the designer’s control, any one of these approaches could beneﬁt from the techniques we present.
7.2 Customized Branch Prediction Architecture It is increasingly common for current embedded processors to have branch prediction. For example, Intel’s XScale (StrongARM-2) processor has a 128-entry Branch Target Buffer (BTB), and each entry in the BTB has a 2-bit saturating counter which is used for branch prediction. Our goal is to build a branch predictor for a given application that has the performance of a large general purpose predictor, with about the same area that is already being used by embedded branch predictors.
Our approach is to take a baseline predictor, such as the local 2-bit counters from the XScale architecture, and extend it with custom designed FSM predictors for the hard-to-predict branches. We use the standard two-bit counters for most branches and use the custom FSM predictors for branches that do not work well with the default predictor. In doing this, we limit both the amount of additional area we have to add to get good performance and the amount of code we have to hard-wire into the processor.
The custom branch architecture we propose is shown in Figure 3. We extend XScale’s coupled BTB branch prediction architecture with a set of custom predictors that are hard-wired to particular branches. These custom predictors have the addresses associated with them locked down by the system software. While the state machines are ﬁxed in hardware and target speciﬁc branches, we wish to allow some software conﬁgurability should a re-compile of the software be needed after fabrication. This allows the branches to move around in the address space but still use their custom state machines, as long as the branch prediction patterns do not change.

Figure 3: Customized branch prediction architecture. The architecture contains a traditional coupled BTB with 2-bit saturating up-down counters for conditional branch prediction. In addition, customized branch predictors are added for individual branches. A customized branch entry contains a tag (address), a target, and a custom FSM predictor for predicting the direction of the branch. The tag is associated with the branch that the FSM predictor was built for and is locked down by the system software.
The address of the branch is used to index into the BTB as well as the custom predictors. The custom branch entries perform a fully associative tag lookup to search for a match. If there is a match in the standard table, the two-bit saturating up-down counter is used to predict the bias of the branch. If instead there is a match in the custom table, the output of the current state of the corresponding state machine is used for the prediction. In the next section we describe how we generate the FSM predictors hard-wired into this architecture.
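The lookup flow just described can be summarized in a short sketch. The dictionaries here are illustrative stand-ins for the hardware tables (a real BTB is indexed by low-order address bits, and the custom entries are matched in parallel by a CAM); the function and parameter names are assumptions for this sketch.

```python
def predict_branch(pc, custom_table, btb_counters, btb_entries=128):
    """Predict taken/not-taken for the branch at address pc.

    custom_table: pc -> (current_state, per-state output bits), the
                  hard-wired FSM entries, matched fully associatively.
    btb_counters: BTB index -> 2-bit saturating counter value (0..3).
    """
    if pc in custom_table:                 # fully associative tag match
        state, output_bits = custom_table[pc]
        return output_bits[state]          # output of the current FSM state
    index = pc % btb_entries               # standard BTB indexing by address
    return btb_counters.get(index, 0) >= 2  # counter values 2,3 mean taken
```

A branch with a custom entry bypasses its 2-bit counter entirely; all other branches fall back to the baseline XScale-style predictor.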