Automated Design of Finite State Machine Predictors
Timothy Sherwood    Brad Calder
Department of Computer Science and Engineering, University of ...
7.3 Generating Traces to Train FSM Branch Predictors There are many different types of branch traces one could use to generate FSM predictors; we focus on one in this paper. For the custom branch prediction architecture in Figure 3, traces are generated on a global basis and then used to generate a FSM predictor for individual branches. To generate a FSM for a specific branch, the traces used to train the FSM generator could include the local history of the branch, the global history, or a combination of the two. We examined all of these and found it better to concentrate on capturing global correlation rather than local history patterns, because global correlations tend to be more repeatable across different inputs, and our approach is efficient at finding and taking advantage of global correlations between branches.
The first step in building our custom branch prediction architecture is to profile the application with our baseline predictor, in this case the XScale predictor, which is a bimodal table of 2-bit counters. This identifies the branches causing the greatest number of mispredictions. For each of these branches we generate a Markov Model as discussed in Section 4. To generate the Markov Models for the branches we are concentrating on, we keep track of a single global history register of length N. When a branch is encountered in the trace, we update that branch's Markov Model with the outcome of the branch, given the history in the global history register. The Markov table is then fed into the FSM generator, and a customized FSM predictor is generated for that branch. Since the number of distinct global histories a given branch sees before being predicted is small compared to the 2^N possible histories, the Markov Models can be compressed significantly by storing only non-zero entries. For all the custom branch prediction results in this paper we use a history of length 9.
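The profiling step above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the trace format, function names, and use of Python are our assumptions.

```python
from collections import defaultdict

HISTORY_LEN = 9  # history length used in the paper

def build_markov_models(trace, hard_branches):
    """Build per-branch Markov models from a profiled branch trace.

    `trace` is a sequence of (branch_pc, outcome) pairs, outcome in {0, 1}.
    `hard_branches` is the set of PCs identified as highly mispredicted
    by the baseline profile. Only histories actually observed get an
    entry, so the table stays far smaller than the 2^N possibilities.
    """
    history = 0  # global history register; low bit = most recent outcome
    models = {pc: defaultdict(lambda: [0, 0]) for pc in hard_branches}
    for pc, outcome in trace:
        if pc in models:
            # count how often this global history led to not-taken/taken
            models[pc][history][outcome] += 1
        # shift the outcome into the global history register
        history = ((history << 1) | outcome) & ((1 << HISTORY_LEN) - 1)
    return models
```

Each tracked branch ends up with a sparse map from observed N-bit histories to taken/not-taken counts, which is what the FSM generator consumes.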
For this design, only the branch that the custom predictor was generated for uses the FSM for prediction, based on a tag match as described earlier. We update the custom finite state machines in a different manner than the standard local 2-bit counters: rather than updating only on matching branches, we update all of the custom predictors in parallel on every branch. On an update, the branch predictor moves each FSM predictor from its current state to the next state based upon the outcome of the branch.
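The tag-match prediction and update-all-in-parallel policy can be sketched as follows; the class and function names and the transition-table encoding are hypothetical, chosen only to illustrate the mechanism.

```python
class CustomFSMPredictor:
    """A custom FSM predictor: each state carries a prediction, and
    edges are labeled by branch outcome (0 = not-taken, 1 = taken)."""
    def __init__(self, tag, transitions, predictions, start=0):
        self.tag = tag            # PC of the branch this FSM targets
        self.trans = transitions  # trans[state][outcome] -> next state
        self.pred = predictions   # pred[state] -> 0 or 1
        self.state = start

    def predict(self):
        return self.pred[self.state]

    def update(self, outcome):
        self.state = self.trans[self.state][outcome]

def on_branch(fsms, pc, outcome, base_predict):
    """Predict pc with its matching FSM if the tag hits, otherwise fall
    back to the baseline; then update EVERY FSM with the outcome,
    per the update-on-every-branch policy."""
    matched = next((f for f in fsms if f.tag == pc), None)
    prediction = matched.predict() if matched else base_predict(pc)
    for f in fsms:  # all custom FSMs step in parallel on every branch
        f.update(outcome)
    return prediction
```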
7.4 Estimating Area of FSM Predictors To make high-level design choices before synthesis, we need to be able to estimate the area that our automatically generated state machines will take up. To establish a relationship between the state machine descriptions and their area, we took a random sample of the custom FSM predictors generated across all of the benchmarks we examined, accounting for 10% of all of the FSM predictors generated. We then synthesized these state machines with Synopsys.
Figure 4 shows the number of states in the state machine versus the area of the implementation. Each triangle represents one FSM predictor. The x-axis is the number of states in a given state machine, while the y-axis is the area as reported by Synopsys.
Figure 4: Area of a random sample of the custom FSM predictors taken from all of the benchmarks we examined. The area was determined by Synopsys and plotted against the number of states in each machine. The strong linear bound allows us to estimate area for design tradeoffs and evaluation.

The results show that, for most state machines, the area is linearly proportional to the number of states in the machine. This trend line is drawn in with a dashed line. However, this is not the case for all of the FSM predictors. For the state machines with a large number of states, the area is much less than this approximation would predict. These FSM predictors have a large number of states, but the machines are highly regular.
Because of this regularity, the state machine can be optimized down to a size much smaller than even some of the more chaotic state machines with fewer states.
We use the trend line in Figure 4 to estimate the area for the rest of the FSM predictors in the results that follow.
Even though the approximation does not hold for all of the predictors, it does bound the area of the predictors by the number of states in the state machine. We can use this approximation to make conservative estimates of area.
For the rest of this paper we use this approximation to quantify area rather than performing synthesis on each state machine we wish to examine.
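The estimation procedure amounts to a least-squares line fit over the synthesized samples. The sketch below is illustrative only: the function names, sample format, and units are hypothetical, and the paper's actual coefficients come from the Synopsys data in Figure 4.

```python
def fit_area_model(samples):
    """Least-squares fit of area ~ a * states + b.
    `samples` is a list of (num_states, area) pairs (hypothetical units)."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(a for _, a in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * a for s, a in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def estimate_area(num_states, a, b):
    # Linear in the number of states; this over-estimates for large,
    # highly regular machines, so the estimate is conservative.
    return a * num_states + b
```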
7.5 Branch Prediction Results We compare our customized predictor against three other branch predictors. The first predictor we compare against is the gshare predictor of McFarling. The second is a meta-chooser predictor that contains a two-level local history branch prediction table, a global history table, and a meta chooser table that determines whether to use the local or global prediction for predicting the current branch. We call this the Local Global Chooser (LGC) predictor; it is similar to the predictor found in the Alpha 21264. The final predictor we compare against is a set of per-branch two-bit counters, as found in the XScale processor. The XScale processor has a 2-bit counter associated with every branch target buffer entry, and not-taken is predicted on a BTB miss.
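For reference, McFarling's gshare indexes a table of 2-bit saturating counters with the branch PC XORed with the global history. A minimal sketch, with our own encoding rather than the evaluated implementation:

```python
class Gshare:
    """Minimal gshare sketch: 2-bit saturating counters indexed by
    PC XOR global history (McFarling's scheme)."""
    def __init__(self, log_size, hist_len):
        self.size = 1 << log_size
        self.table = [1] * self.size   # initialize weakly not-taken
        self.hist_len = hist_len
        self.history = 0

    def index(self, pc):
        return (pc ^ self.history) & (self.size - 1)

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2  # counter's high bit

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the outcome into the global history register
        mask = (1 << self.hist_len) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

The XScale baseline is simpler still: the same 2-bit counters, but indexed per branch through the BTB with no history.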
Figure 5 contains the results for the six programs, comparing our custom FSM predictors to the gshare, LGC, and XScale predictors. The x-axis is the total area of the predictor, including the BTB structure, while the y-axis is the misprediction rate. The results for the LGC and gshare predictors were gathered over a range of sizes. We present results for custom predictors trained on inputs different from the ones used for measuring performance; these results are denoted custom-diff. The custom-same results use the same input for training and evaluation, and they provide a limit on how well the FSM predictors can perform for that input.
The curve for the custom FSM predictors is generated by increasing the number of branches to be customized.
The top-leftmost point on the curve is the XScale architecture with one custom branch predictor. The points immediately to the right use 2 custom branch predictors, then 3, and so on. As we add more FSM predictors, the number of mispredictions is reduced while the area to implement the predictor grows. The results show that, for all programs, the misprediction rate decreases as we devote more and more chip area to the prediction of branches.
The first thing that is very noticeable is that there is little to no difference between custom-diff and custom-same. This implies that our training input has done a good job of capturing the behavior that is inherent to the program. One could certainly use more than one input for training, and indeed this would be a good idea for verifying that the generated models correctly capture the behavior of the program.
For all of the programs, the custom predictors achieve the lowest misprediction rate of any predictor for their size. To beat the performance of these small predictors you would need a general purpose predictor that is 2 to 5 times larger. For some of the programs, even these large table sizes cannot perform better than our custom predictors.
For the program compress, all of the benefit comes from the state machine for one branch. The misprediction rate is reduced from 23% to 16.5% by adding one custom FSM predictor. Adding more FSM predictors simply increases the area with little to no improvement in misprediction rate. Moderate table sizes of an LGC can outperform our customized predictors because the branch causing the most mispredictions benefits from having local history. This branch would benefit from having a loop count instruction in an embedded processor, or
Figure 5: Misprediction rate versus estimated area for the six benchmarks examined. Results are shown for the baseline XScale predictor, gshare, a meta predictor with a chooser between local and global history (LGC), and the customized branch predictor. The customized branch predictor results are for training on the same input used for gathering results (custom-same) and for training on a different input from the one used for gathering results (custom-diff).
could easily be captured by customizing the branch predictor to perform loop termination prediction for the branch.
For g721, we can see that the XScale predictor does a very good job of capturing the behavior of most of the branches in the program. However, with a little extra area we can reduce the misprediction rate from 8% to just over 7%. For vortex and gs, the misprediction rate is reduced significantly from XScale's default local 2-bit counters. For gs it is reduced from just under 5% to just over 4%, and for vortex the miss rate drops from 13% to 3%. For these two programs, a local-global chooser over twice the size of the custom predictor is needed to achieve a lower misprediction rate.
The best results are seen for ijpeg and gsm. For both of these programs the misprediction rate is far below that of even the largest table we examined. These programs do not beneﬁt from local history. This can be seen by comparing how LGC performs against gshare for these programs. Since our scheme captures the global correlation so efﬁciently we can outperform the largest gshare by half a percent for ijpeg and over a full percent for gsm while only using a ﬁfth of the area.
7.6 Custom Finite State Machine Examples Figures 6 and 7 are two examples of simple ﬁnite state machines that were generated using the techniques presented. Figure 6 serves as a good starting point for understanding how the state machines are used to generate predictions.
The state machine in Figure 6 was automatically generated to target a particular branch in ijpeg. Each state is annotated with a prediction, shown in brackets. The state machine is updated by traversing an edge: an update with taken traverses the edge labeled 1, while not-taken traverses the edge labeled 0. This particular state machine was generated to capture a branch that is highly correlated with the branch two branches back in the history. For example, we would like to predict 1 if the history pattern is "10" or "11" and predict 0 in all other cases. As can be seen in the figure, this is successfully captured by the finite state machine.
If you start in any state of the machine and follow two transitions, first a 1 and then either a 0 or a 1, you will end up in a state labeled 1. The converse is also true: if you follow an edge labeled 0, followed by either a 1 or a 0, the state you end up in will predict 0. This property is maintained in even the most complex state machines we generated.
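The two-back correlation can be checked with a small sketch. Here the states are encoded as the last two outcomes, an idealized form of the generated machine; the encoding and names are ours.

```python
# States encode the last two outcomes as (older, newer).
# The prediction for the upcoming branch is the outcome two back,
# i.e. the "older" bit: histories "10" and "11" predict 1.
STATES = [(o, n) for o in (0, 1) for n in (0, 1)]

def step(state, outcome):
    older, newer = state
    return (newer, outcome)   # slide the two-outcome window by one

def predict(state):
    return state[0]           # correlated with the branch two back

# From ANY start state, two transitions fully determine the state,
# so the prediction always equals the outcome two branches back.
for start in STATES:
    for first in (0, 1):
        for second in (0, 1):
            s = step(step(start, first), second)
            assert predict(s) == first
```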
Figure 7: A slightly more complex state machine generated from gs that captures two patterns. Patterns that match 0x1x or 0xx1x predict taken while all others predict not-taken.
Figure 7 is a slightly more complex state machine, this time generated for one of the branches in gs. This state machine captures two different patterns, 0x1x and 0xx1x where x is a “don’t care”. If, as above, you traverse any set of edges that match either of these patterns you will end up at a state predicting a 1. If the edges you traverse do not match these patterns you will end up in a state predicting 0.
For this branch there are four global history patterns that are seen the majority of the time: 001001010, which is biased taken; 010011010, which is biased not-taken; 010101010, which is biased taken; and 110010010, which is biased taken. Other patterns contribute, but these represent over 90% of the patterns seen by this state machine. If you trace through these patterns, or just the last five bits of them, starting from any state, you will end up in a state that predicts correctly. Because of this, it does not matter that we are updating the branch predictors with branches they may never have been trained on. No matter what state the machine has gotten itself into, as long as the final sequence of branches leading up to the branch is captured by one of the patterns, the state machine will make the correct prediction.
These results show that our automated steps can be used to accurately identify branches that are highly correlated, and the nature of the correlation. This information could then be used to potentially guide additional compiler and hardware optimizations.
Note that for a given custom FSM predictor, only a Markov table with histories of length H leading up to the branch to be predicted is used to generate the FSM predictor. Therefore, under our update-on-every-branch policy, we could be updating a custom FSM predictor with branches that were never seen in the histories used to generate it. This is fine because the generated FSM predictor guarantees that it will be in the desired prediction state when we fetch the candidate branch. The FSM predictor will be updated with state transitions for the H branches that come right before the branch assigned to it. Therefore, no matter what state the FSM predictor was in before performing the H branch updates, after the updates it will be in the desired prediction state.
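This self-synchronizing property is easiest to see for an idealized window-style FSM whose state is simply the last H outcomes (the generated machines are minimized versions of such machines). The sketch below, with a hypothetical H = 3 and our own encoding, verifies that after H updates every starting state converges to the same state:

```python
import itertools

H = 3  # hypothetical history depth for the sketch

def step(state, outcome):
    # State is the last H outcomes packed into an integer window.
    return ((state << 1) | outcome) & ((1 << H) - 1)

def run(start, outcomes):
    s = start
    for o in outcomes:
        s = step(s, o)
    return s

# After H updates the state depends only on the last H outcomes,
# never on the (possibly garbage) starting state.
for outcomes in itertools.product((0, 1), repeat=H):
    final_states = {run(start, outcomes) for start in range(1 << H)}
    assert len(final_states) == 1
```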
8 Summary Predictive ﬁnite state machines are a fundamental building block in computer architecture and are used to control and optimize all types of prediction and speculation. In this paper we present an automated approach to generating FSM predictors using proﬁle information. This approach can be used to develop FSM predictors that perform well over a suite of applications, tailored to a speciﬁc application, or even a speciﬁc instruction.
The algorithm we present uses profiling to build a Markov model of the correlated behavior present in a target application. From this model we define a set of patterns to capture. These patterns are compressed into a usable form and then translated into a regular expression. From the regular expression we use subset construction to build a finite state machine. This state machine is then optimized using Hopcroft's Partitioning Algorithm and Start State Reduction.
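As an illustration of the subset-construction step, the sketch below determinizes a toy NFA. The function name and the dictionary encoding of the NFA are ours, and the later steps of the pipeline (Hopcroft minimization and Start State Reduction) are omitted for brevity.

```python
def subset_construction(nfa, start, alphabet):
    """Determinize an NFA given as nfa[state][symbol] -> set of states.
    Returns DFA transitions keyed by frozensets of NFA states, plus
    the DFA start state."""
    start_set = frozenset([start])
    dfa, work = {}, [start_set]
    while work:
        cur = work.pop()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for sym in alphabet:
            # The DFA successor is the union of NFA successors.
            nxt = frozenset(s for q in cur
                            for s in nfa.get(q, {}).get(sym, ()))
            dfa[cur][sym] = nxt
            if nxt not in dfa:
                work.append(nxt)
    return dfa, start_set
```

For example, an NFA that remembers whether a 1 has been seen determinizes into a two-state DFA; in the full pipeline each resulting DFA state would then carry a taken/not-taken prediction before minimization.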