# Automated Design of Finite State Machine Predictors
Timothy Sherwood, Brad Calder
Department of Computer Science and Engineering, University of ...

In this way, the problem of creating a FSM predictor reduces to that of finding a regular expression that describes L in a compact manner. Once we have the regular expression that recognizes L, we can use established methods to convert it into a finite state machine that recognizes L. If we do this translation correctly, the finite state machine will indicate that it has recognized a member of L exactly when it sees an input string in L. Because we picked L to be only the subset of input strings on which we want to predict 1, we know that when we recognize such an input sequence we should predict 1.

To ﬁnd the language L we use proﬁle information from the application. From the proﬁle a model of the data is built, and this model can then be analyzed to ﬁnd histories that are biased toward predicting 1. This set of histories is then compressed to a usable form and converted into a regular expression.

4.2 Modeling

To design a FSM predictor, we start by tracing the target application suite to create a representative sequence of predictions for the applications.

An issue is determining what to trace, which depends on the intended use of the predictor. For branch prediction we have two symbols, 0 and 1, and the trace consists of a series of branch outcomes. We will use the following trace for the rest of the examples in this section:

The next step after trace generation is to build a statistical model of the data. For this we use an Nth-order Markov model: a table of size 2^N which contains P[1 | last N inputs] for each of the 2^N possible length-N histories in the trace. N serves as a limit on the amount of history that we may use in making our predictions.

**For our example trace, we build up the following probabilities for a second order Markov table (N=2):**

P[1|00] is generated by finding all the times that 00 is followed by a 1: in this case there are 5 occurrences of the pattern 00, and 2 of them are followed by a 1, giving P[1|00] = 2/5.

The corresponding predict-0 probabilities are calculated by subtracting the predict-1 probability from 1, because we have only 2 symbols in our alphabet. Note that while the table scales exponentially with N, it is still very reasonable for even the largest values of N we have examined. Beyond a certain point, more knowledge of history does not improve accuracy, and we found that point to be well within the reach of the techniques described.

For the predictors we are generating we did not see the need to go beyond N = 10.
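The table-building step can be sketched in a few lines of Python. The function name and the short example trace are ours (invented for illustration; the paper's trace t is not reproduced here):

```python
from collections import defaultdict

def build_markov_table(trace, n):
    """Nth-order Markov model: for every length-n history in the trace,
    estimate P[next outcome is 1 | history]."""
    counts = defaultdict(lambda: [0, 0])  # history -> [times seen, times followed by 1]
    for i in range(n, len(trace)):
        hist = trace[i - n:i]
        counts[hist][0] += 1
        if trace[i] == "1":
            counts[hist][1] += 1
    return {h: ones / seen for h, (seen, ones) in counts.items()}

# An invented example trace; maps each 2-bit history to P[1|history].
table = build_markov_table("0110111001101110", 2)
```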

4.3 Pattern Definition

Now that we have the probabilities we need, we can continue with the next step: picking the histories from which we will eventually build the language L. In the case of a branch predictor, where we simply wish to minimize the number of mispredictions, the task is straightforward. We simply pick all the histories that precede a 1 with probability greater than or equal to 1/2 to form the language “predict 1”. In our example above, that is the set of histories {01, 10, 11}. Based on our profile, we would like to predict a 1 whenever one of these histories appears in the input, since they had a probability greater than 1/2. Of course, histories with probability exactly 1/2 can go either way. (The trace shown earlier is divided into groups of four only for readability; the grouping has nothing to do with the trace itself.)

We can make further design trade-offs at this point in the form of don’t-care patterns. Some patterns occur only very rarely, and their inclusion in the set of histories will have almost no effect on the performance of the predictor but will make the job of minimization more complicated. These history patterns can be placed into a third set, separate from the “predict 1” and “predict 0” sets, called the “don’t care” set. We have found that placing only the 1% least-seen histories in the “don’t care” set can reduce the size of the predictor by a factor of two with negligible impact on prediction accuracy.
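One plausible reading of this partitioning step in Python; treating the “1% least seen” rule as a budget over total occurrences is our assumption:

```python
def partition_histories(prob, seen, dont_care_frac=0.01):
    """Split histories into predict-1, predict-0, and don't-care sets.

    prob maps history -> P[1|history]; seen maps history -> occurrence count.
    The rarest histories, covering at most dont_care_frac of all occurrences,
    become don't-cares, giving the logic minimizer extra freedom.
    """
    budget = dont_care_frac * sum(seen.values())
    dont_care, used = set(), 0
    for h in sorted(seen, key=seen.get):  # rarest histories first
        if used + seen[h] > budget:
            break
        dont_care.add(h)
        used += seen[h]
    predict1 = {h for h in prob if h not in dont_care and prob[h] >= 0.5}
    predict0 = {h for h in prob if h not in dont_care and prob[h] < 0.5}
    return predict1, predict0, dont_care
```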

Once we have passed this stage of the design ﬂow, the function of the predictor is set and will not change signiﬁcantly. The ﬁnite state machine that will be designed will return a 1 when histories in the “predict 1” set are seen, and 0 when histories in the “predict 0” set are seen. The output of the “don’t care” set is still undecided at this point, and the next stage will take advantage of this to compress the description of the sets.

4.4 Pattern Compression

Now that we have our three sets, “predict 1”, “predict 0”, and “don’t care”, the next step is to compress our description of the sets into a usable form. This is done with a standard logic minimization tool whose input is a truth table. The input side of the truth table is the history patterns captured by our Markov model, and the output side is the set to which each history pattern belongs. If we continue with our example, we have the set of histories partitioned up as follows:

“predict 1” = {01, 10, 11}, “predict 0” = {00}

From this we generate a truth table, where all the histories contained in the “predict 1” set have their output defined as 1, and likewise for the “predict 0” set.

{00 → 0, 01 → 1, 10 → 1, 11 → 1}

We then use the logic minimization tool Espresso [32] to minimize this truth table down to a set of logic functions that describe the set of inputs that produce a 1. The logic function that is output is our compact description of the “predict 1” set, and it distinguishes the histories on which we wish to predict 1 from those on which we predict 0. This step also merges the “don’t care” set into the “predict 1” and “predict 0” sets in such a way as to minimize the number of unique terms. The generated logic function is a sum-of-products representation: the “predict 1” set satisfies this function and the “predict 0” set does not.
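Espresso reads truth tables in the Berkeley PLA format. A small sketch of emitting that format for the running example; the function name, and the assumption that an `espresso` binary is available on the command line, are ours:

```python
def to_pla(n, predict1, predict0, dont_care):
    """Serialize the three history sets as a PLA truth table for Espresso.

    Each row is an n-bit input pattern followed by its output; a '-' output
    marks a don't-care row that the minimizer may assign either way.
    """
    rows = [f".i {n}", ".o 1"]
    rows += [f"{h} 1" for h in sorted(predict1)]
    rows += [f"{h} 0" for h in sorted(predict0)]
    rows += [f"{h} -" for h in sorted(dont_care)]
    rows.append(".e")
    return "\n".join(rows)

pla = to_pla(2, {"01", "10", "11"}, {"00"}, set())
# Minimize with, e.g.:  espresso table.pla
```

For this table, Espresso returns the two-term cover corresponding to (1 x) and (x 1).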

(1 x) (x 1)

Here we see the sum-of-products representation of our truth table, where an x represents an input that is unimportant to the outcome of the function. From this description we can now build the regular expression that will capture the language L.

4.5 Regular Expression Building

Once we have the minimized representation of the set of histories, we need to build a regular expression that will match these history patterns whenever they come up in the input. More specifically, whenever we see an input string whose trailing N bits match the minimized patterns for the “predict 1” set, we return true.

We can build a regular expression as follows. Each term becomes a clause: each 0 matches a 0, each 1 matches a 1, and each don’t care, denoted x, matches either a 0 or a 1. Let us start with a single term from above, (1 x). This translates to the regular expression 1{0|1}, because it is a 1 followed by either a 0 or a 1. Similarly, (x 1) translates to {0|1}1.

Since a match can be caused by either term, we write the whole expression as {1{0|1}} | {{0|1}1}. However, this is not quite complete.

Remember that the language L must describe all possible inputs for which we need to return 1, not just the last two bits of the input. We overcome this problem by specifying the language to be any string that ends in the desired histories, which can be done by allowing any number of 1s or 0s in front of the histories. Thus, our final regular expression for the above history is:

{0|1}* { 1{0|1} | {0|1}1 }

Once we have the desired regular expression, there is still the matter of converting it to an efficient FSM.
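Written in standard regular-expression syntax (parentheses in place of the braces above), the language can be checked directly; a sketch using Python's `re` module:

```python
import re

# (0|1)* allows any prefix; the final group requires the trailing two bits
# to be 1x or x1, i.e. one of 01, 10, 11.
predict1_re = re.compile(r"(0|1)*(1(0|1)|(0|1)1)")

def predict(history):
    """Return 1 if the whole history string is in L, else 0."""
    return 1 if predict1_re.fullmatch(history) else 0
```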

Figure 1: The state machine on the left was generated from the sequence t shown in section 4.2 and optimized with Hopcroft’s partitioning algorithm. The state machine on the right is what results from removing the start-up states and re-numbering. This is the ﬁnal state machine for t. Note that the patterns ending in 01, 10, and 11 are still captured correctly. The number in brackets shows the prediction produced by each state in the state machine.

4.6 FSM Creation

The first step in building a FSM from a regular expression is the construction of a non-deterministic finite state machine, which is a state machine that can be in more than one state at a time. Building a non-deterministic FSM is a fairly straightforward process of enumerating paths. Once the non-deterministic FSM is complete, it is converted to a deterministic state machine using subset construction [18]. Subset construction can sometimes lead to an exponential blow-up in the number of states, and better algorithms are known that take advantage of reduced alphabets, but we have found subset construction to be more than sufficient for the predictors we examine in this paper.
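Subset construction itself fits in a few lines. The NFA below (state names q0–q3 are ours) is a hand-built recognizer for the running example's language, strings ending in 01, 10, or 11:

```python
def subset_construction(nfa, start, accepting, alphabet=("0", "1")):
    """Convert an NFA (state -> symbol -> set of next states) into a DFA
    whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa, accept, worklist = {}, set(), [start_set]
    while worklist:
        s = worklist.pop()
        if s in dfa:
            continue  # already expanded
        dfa[s] = {}
        if s & accepting:
            accept.add(s)
        for a in alphabet:
            nxt = frozenset(n for q in s for n in nfa.get(q, {}).get(a, ()))
            dfa[s][a] = nxt
            worklist.append(nxt)
    return dfa, start_set, accept

# q0 loops on any input and non-deterministically guesses where the
# final two bits begin; q3 is the accepting state.
nfa = {
    "q0": {"0": {"q0", "q2"}, "1": {"q0", "q1", "q2"}},
    "q1": {"0": {"q3"}, "1": {"q3"}},  # saw a 1; accept after one more bit
    "q2": {"1": {"q3"}},               # accept if the next bit is a 1
}

def accepts(s):
    dfa, state, accept = subset_construction(nfa, "q0", {"q3"})
    for ch in s:
        state = dfa[state][ch]
    return state in accept
```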

At this point we have a fully implementable finite state machine; however, there are two more steps we take to reduce the number of states it uses. We start by applying Hopcroft’s partitioning algorithm [18], which removes both unreachable and redundant states, although unneeded states still remain in the machine.

4.7 Start State Reduction

Since the state machine must recognize all strings in the language, there are unnecessary start-up states that are used only at the beginning of execution, where history bits are undefined. Since we are interested only in the steady-state operation of the state machine, i.e. those strings with length greater than N, we can cut out the nodes that are used only in parsing strings of length less than N. This goes against what was said earlier about not changing the semantics of the state machine, but the optimization only affects the behavior of the machine on a small, constant number of strings. There can be up to 2^N start-up states, and they typically account for around half of all states in the machine.

If we look at Figure 1, the problem can be clearly seen. The figure shows the state machines generated from our original sequence t. In the state machine on the left, the states S1 and S3 are not needed after just the first couple of accesses to the state machine. We would like to remove them to save both state and transition-logic complexity. This can be done by removing all nodes unreachable from the steady-state operation of the state machine. All start-up states are unreachable from steady-state operation because history can never be undefined past the start-up phase. We exploit this fact by simply removing all nodes unreachable from steady state. In Figure 1, the start-up states S1 and S3 have been removed and the remaining states renumbered. Note that the behavior of the finite state machine is unchanged past the removed start-up states.
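The reduced machine can be simulated directly. The table below is our reconstruction from Figure 1's description, with state names of our own choosing: A effectively remembers history 00, C remembers 10, and B is the merge of the equivalent histories 01 and 11:

```python
# Transition table: state -> (next state on 0, next state on 1),
# plus the prediction each state produces (the bracketed number in Figure 1).
TRANS = {"A": ("A", "B"), "B": ("C", "B"), "C": ("A", "B")}
PREDICT = {"A": 0, "B": 1, "C": 1}

def predict_sequence(outcomes, start="A"):
    """Feed a string of branch outcomes through the FSM, collecting the
    prediction made before each outcome is consumed."""
    state, preds = start, []
    for bit in outcomes:
        preds.append(PREDICT[state])
        state = TRANS[state][int(bit)]
    return preds
```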

4.8 Synthesis

At this point we have almost reached our final destination. We have a finite state machine that produces a prediction based on the input. The predictions are governed by the last N bits of the input string, and that information is efficiently encoded into a finite state machine, as can be seen in Figure 1. Starting from any node, traversing the edges of the FSM following the patterns (01, 10, 11) in our “predict 1” set will end at a node with a prediction of 1. Similarly, starting from any node, we will predict 0 for the pattern (00) in our “predict 0” set.

The final step is to actually perform synthesis. Since the finite state machine is a well-understood and commonly used primitive, every synthesis tool has some form of finite state machine input format. The job of synthesis is to find an efficient hardware implementation of the state machine, including finding a good encoding for the states and their transitions. We translate our description of the finite state machine to VHDL, which is then read and analyzed by the Synopsys design tool.

5 Methodology

We used two sets of benchmarks to gather the results for this paper. The first set of benchmarks was chosen because of its interesting confidence estimation behavior for value prediction, as shown in [4]. These programs are groff, gcc, li, go, and perl.

The other set of benchmarks was chosen based on its applicability to an embedded environment and because the programs had interesting branch behavior. Three of the applications, compress, ijpeg, and vortex, are from the SPEC95 benchmark suite. We also include three programs from the MediaBench application suite: gsm decode does full-rate speech transcoding, g721 decode does voice decompression, and gs is an implementation of a PostScript interpreter.

We used ATOM [43] to profile the applications and generate the Markov models over the full execution of the benchmarks. All applications were compiled on a DEC Alpha AXP-21264 architecture using the DEC C compiler under the OSF/1 V4 operating system with full optimizations (-O4). Traces were gathered for 300 million instructions from the SimPoints recommended in [37, 38]. After the Markov models were generated, generating all of the FSM predictors for each program with our automated approach took from 20 seconds to 2 minutes on a 500 MHz Alpha 21264.

6 FSM Predictors in General Purpose Processors

Structures in general purpose processors are customized to a suite of common applications used to evaluate the architecture during its design; this can include SPEC2000 and other benchmark suites. Any customization applied to a general purpose processor needs to be general purpose, so specializing to individual programs is not practical. Therefore, we use an aggregate trace of several benchmarks together to represent the common behavior of the suite of programs. We then build a predictor customized to this aggregate trace, which represents the general purpose behavior of the workload the architecture is designed for.

6.1 Value Prediction

Several architectures have been proposed for value prediction, including last value prediction [22, 23], stride prediction [13, 14], context predictors [33], and hybrid approaches [48, 31]. In this study we focus on using a stride-based value predictor, since it provides the most performance for a reasonable amount of area devoted to value prediction [3].
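The paper's predictors are hardware structures, but the stride idea itself can be sketched in software. This is an illustrative model only, not the paper's implementation; the class and field names are ours:

```python
class StridePredictor:
    """Per-instruction stride value predictor: predict last value + stride."""

    def __init__(self):
        self.last = {}    # pc -> last value produced
        self.stride = {}  # pc -> most recently observed stride

    def predict(self, pc):
        """Return the predicted value, or None before any value is seen."""
        if pc not in self.last:
            return None
        return self.last[pc] + self.stride.get(pc, 0)

    def update(self, pc, value):
        """Train the predictor with the actual value produced at pc."""
        if pc in self.last:
            self.stride[pc] = value - self.last[pc]
        self.last[pc] = value
```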