Computation of WCET using Program Slicing and Real-Time Model-Checking
On this page you will find the output of the tool we have designed to compute WCET on the ARM920T. The methodology is described in [Bech11] (see References below). A long version of a paper submitted to RTSS'2012 is available here: submitted RTSS'2012.
Abstract
Computing accurate WCET on modern complex architectures is a challenging task. This problem has been devoted a lot of attention in the last decade but there are still some open issues. First, the control flow graph (CFG) of a binary program is needed to compute the WCET and this CFG is built using some internal knowledge of the compiler that generated the binary code; moreover once constructed the CFG has to be manually annotated with loop bounds.Second, the algorithms to compute the WCET (combining Abstract Interpretation and Integer Linear Programming) are tailored for specific architectures: changing the architecture (e.g., replacing an ARM7 by an ARM9) requires the design of a new ad hoc algorithm. Third, the tightness of the computed results (obtained using the available tools) are not compared to actual execution times measured on the real hardware.
In the paper [Bech11] we address the above mentioned problems. We first describe a fully automatic method to compute a CFG based solely on the binary program to analyse. Second, we describe the model of the hardware as a product of timed automata, and this model is independent from the program description. The model of a program running on a hardware is obtained by synchronizing (the automaton of) the program with the (timed automata) model of the hardware. Computing the WCET is reduced to a reachability problem on the synchronised model and solved using the model-checker Uppaal. Finally, we present a rigorous methodology that enables us to compare our computed results to actual execution times measured on a real platform, the ARM920T. The tools we use are: Uppaal and the GNU-ARM tool suite from CodeSourcery.
Interships Proposals
Master's intership proposals are available here. For any enquiry you can send me an email here (remove no-spam).Architecture of the ARM920T
The development board we model and use is an Armadeus APF9328 board which bears a 200MHz Freescale MC9328MXL micro-controller with an ARM920T processor. The processor embeds an ARM9TDMI core that implements the ARM v4T architecture. An overview of the architecture is given in Fig. 2. Gray arrows are address buses/connections. White arrows are data buses/connections. Both are 32 bits wide. Coprocessor 15 hosts control registers for the caches and the MMUs. Actually Register R13 (which should not be confused with the ARM9TDMI r13 register) is not duplicated and is located in Coprocessor 15. It hosts a process ID used for virtual address to physical address translation. Some blocks like the Write back Physical TAG RAM and various debug and/or coprocessor interfaces are not shown. The timing specifications of the components are:
Tool Chain
The architecture of our tool is given in Fig. 1.The starting point is a binary program (that can be obtained with objdump using the CodeSourcery suite). This file is transformed into a Uppaal timed automaton by performing the following steps:
- generate the CFG of the program (Compute CFG)
- compute a reduced program (Compute WCET-equiv)
- generate the Uppaal description of the reduced program.
In the result file we embed models of the pipeline and the caches of the ARM920T. This produces a ready-to-analyse Uppaal file that generates the runs of the program on an ARM920T.
The WCET is computed by computing the largest value a clock (never reset in the program) can take in a final state. This amounts to checking a reachability property.
We have implemented the construction of the CFG (Compute CFG) and the computation of the WCET-equivalent program (Compute WCET-equiv). Together with a parser of ARM binary programs this makes a few thousand C++ lines of code.
Our tool produces a bundle of files: a ready-to-analyse file containing the UPPAAL timed automata models of the program a binary P and the hardware models (the layout of the CFG is produced using dot), a dot file with the graph of the program and a ready-to-compile C++ file that contains a simulator of the program. This last file can be compiled and used to compute useful information like the ranges of registers. Notice that during the first phase ComputeCFG we compute the range of the stack pointer and thus the tool can also be used as a stack analyser.
Results and UPPAAL files
Original C sources files are available from the Mälardalen WCET Research Group . We provide the files (first column of Table 1) as we have slightly modified some of them: the WCET for most of the files is very small, and measurement errors are relatively large; this is why we have changed (when necessary and possible) the number of iterations in the C program. For example, the original C program for fib computes the Fibonacci number 30 and in our example we have computed fib(300) so that the WCET is larger and measurement errors are less than 1%.The program P to analyse is wrapped in a template function main: an example for program fib(300) is given in Listing 1.
Given P, we let t(P) be the wrapped program. Measuring the execution-time of P consists in (1) reading a hardware timer (timerGetValue) into a start variable, (2) calling the program P, and (3) reading the timer again in a stop variable and (4) printing1 the difference stop − start. The function timerGetValue (assembly code) has been designed to read a hardware timer (See next paragraph). The measurement error is plus or minus 12 processor cycles. The program t(P) is compiled and linked. Running it on the ARM9 will print out the number of cycles taken by the program P: this figure is given in column “Measured WCET” in Table 1.
To faithfully compute the WCET of P using our method, we take as
input of our tool chain t(P). t(P) is transformed (using Compute
CFG and Slice) into a Uppaal automaton. In this automaton a dedicated clock
GBL_CLK is reset when the instruction2 of t(P) that
reads the hardware timer flows out of the M stage (reading the timer
in function timerGetValue is done using a load instruction).
The final state of the automaton is reached when the second occurrence
of the instruction that reads the timer flows out of the W stage. The
computed WCET is given in column “Computed WCET” in
Table 1. Column “Uppaal”
in Table 1
gives the time Uppaal takes to check the property
sup {WriteBackStage.DONE} : GBL\_CLK clock that gives the
maximal value of a clock in a designated set of states. To do this we
used a recent version of UPPAAL (64-bit, UPPAAL 4.1.4 (rev. 4857)),
July 2011, non-commercial). For some of the files with subscript u
this version of UPPAAL could not parse the input (the reason why is
still unknown) and we used an earlier version of UPPAAL (4.0.6) to
check the reachability property “Is it possible to reach a final
states with GBL_CLK ≥ K+1 ?”when this property is
false and was true for K. In this case K is the computed
WCET. This means we made a binary search for the WCET.
UPPAAL models of some programs with the computed WCET given on Table 1: Each link (first column) points to a zipped directory that contains: the .xml file in UPPAAL (and the query), the .arm file which is the assembly code we analyse and is obtained using objdump, the .dot file that gives some graphical representation of the CFG of the program (you should install GraphViz) to display it.
Note 1: The de-assembled files file.arm have been stripped of from non useful code. The caches are ON only after init_mmu has completed. Thus some functions (printbin,readstate,timerInit,TimerSetPrescaler) are ignored and replaced by a return statement (bx lr). Moreover, the last instruction in the file is the one right after the second call of timerGetValue. The signal prog_completed! is sent in the UPPAAL file at this point. The files also contain an odd instruction: mov lr, #3 followed by bx lr. This branching corresponds to a return to the caller of main and is only used in the UPPAAL models.
Note 2: Results in Table 1 are obtained with the latest hardware models we have designed anf may differ from the published results in [Bech11] (see References below).
Note 3: The caches are 64-way associative caches i.e., 8 sets of 64 lines (the paper [Bech11] erroneously mentioned 8-way associative caches).
Note 4: Column Measured WCET (M) is the most frequently obtained execution-time out of 10 runs of the programs. Slight differences in the measures may appear because the main memory is a dynamic RAM (DRAM) and is refreshed every few mseconds. During a refreshing phase, the main memory is unavailable and this incurs some extra delay in the execution-time which accounts for the differences between two measures.
Note 5: In the UPPAAL models, the variables that describe the state of the caches (instruction and data) are declared as meta variables. This reduces the encoding needed for a state and is correct because it is never the case that two states are equal (even without the cache description). Hence the whole state space is explored even with this reduced state encoding.
Program
loc†
UPPAAL Time/States Explored¶
Computed WCET (C)
Measured WCET (M)
(C−M)/M × 100 Abs§ fib-O0 74 2.32s/74181 8098 8064 0.42% 47/131 fib-O1 74 0.79s/22333 2597 2544 2.0% 18/72 fib-O2 74 0.43s/9711 1209 1164 3.8% 22/71 janne-complex-O0∗ 65 2.05s/38014 4264 4164 2.4% 78/173 janne-complex-O1∗ 65 0.67s/14600 1715 1680 2.0% 30/89 janne-complex-O2∗ 65 0.59s/13004 1557 1536 1.3% 32/78 fdct-O1 238 25.48s/60418 4245 4092 3.7% 100/363 fdct-O2u 238 3.24s/55285 19231 18984 1.3% 166/3543 Single-Path Programs‡ with MUL/MLA/SMULL instructions (instructions durations depend on data) fdct-O0 238 155.96s/85008 [11242,11800] 11448 3.0% 253/831 matmult-O0∗,u 162 5m9s/10531230 [502850,529250] [511584,528684] 0.1% 158/314 matmult-O1∗ 162 32.58s/1112527 [130001,156367] [127356,153000] 2.2% 71/172 matmult-O2∗ 162 31.23s/1001580 [122046,148299] [116844,140664] 5.4% 75/288 jfdcint-O0 374 109.35s/100861 [12699,12918] 12588 2.6% 159/792 jfdcint-O1 374 14.82s/35419 [4897,5072] 4668 8.6% 25/325 jfdcint-O2u 374 5.38s/175661 [16746,16938] 16380 3.4% 56/2512 Multiple-Path Programs bs-O0 174 37.59s/1421274 1068 1056 1.1% 75/151 bs-O1 174 29.61s/1214673 738 720 2.5% 28/82 bs-O2 174 15.8s/655870 628 600 4.6% 28/65 cnt-O0∗ 115 5.21s/77002 9027 8836 2.1% 99/235 cnt-O1∗ 115 1.75s/27146 4123 3996 3.1% 42/129 cnt-O2∗ 115 11.55s/11490 3067 2928 4.6% 39/263 insertsort-O0∗ 91 598.98s/24250738 3133 3108 0.8% 79/175 insertsort-O1∗ 91 353.80/11455296 1533 1500 2.2% 40/115 insertsort-O2∗ 91 11.68s/387292 1326 1344 0.4% 43/108 ns-O0∗ 497 78.89s/3064316 30968 30732 0.8% 132/215 ns-O1∗ 497 10.29s/368720 11701 11568 1.1% 61/124 ns-O2∗ 497 70.93s/1030746 7280 7236 0.6% 566/863
†lines of code in the C source file
‡ (C−M)/M × 100 computed using the upper bound for C.
§Non Abstracted instructions/Instructions
∗Program selected for the WCET Challenge 2006
¶Time in min/seconds on Intel Dual Core i3 3.2Ghz 8GB RAM
u UPPAAL 64-bit could not analyse the file. Used earlier 32-bit version of UPPAAL 4.0.6
Table 1: file-ox indicates that file was compiled using gcc -ox (optimization option). Comments on the Results
For the Single-Path programs the results show that our hardware formal model (and sliced program) are not too coarse.The second section Single-Path with MUL/SMULL requires some more explainations. When we compute the WCET using the UPPAAL models, we use an interval for durations of the instructions MUL/MLA and SMULL: indeed the time it takes to perform one of this instruction depends on the data in the operands. For MUL/MLA execution-time in the execution stage of the pipeline is within [3,6] cycles and for SMULL within [4,7]. The computed WCET is an interval. Notice that there is no need for assuming that no timing anomalies occur as we do not assume that the duration of MUL/MLA or SMULL is the upper bound of the interval. The exploration of the state-space encompasses all the cases for all the possible durations of MUL/MLA and SMULL within their respective intervals. For the matmult program, we have measured the execution-time for two different input data: these data were intended to yield respectively the shortest and largest execution-time. We cannot guarantee that they actually yield the WCET. For instance for the O2 version, the binary program is optimized and it is rather difficult to guess which input data produces the WCET.
The last section Multiple-Path programs contains results for programs with data dependent decisions during the course of execution. The computed WCET takes into account all the possible choices. For measurements, we have again tried to choose the input that produces the WCET (for some of the programs this is difficult to assert that this choice is correct).
Overall, the results are very tight even on large execution-times (matmult-O0).
References
[Bec12] Jean-Luc Béchennec and Franck Cassez. Computation of WCET using Program Slicing and Real-Time Model-Checking. Extended version of Submitted paper to RTSS 2012, May 2012. 16 pages.
[PDF File ][Bec11] Jean-Luc Béchennec and Franck Cassez. Computation of WCET using Program Slicing and Real-Time Model-Checking. Research report, IRCCyN, CNRS, Nantes, France, May 2011. 20 pages, arXiv:1105.1633v3 [cs.SE]. Updated October 2011.
[ bib | arXiv | PDF File ][Cas11] Franck Cassez. Timed Games for Computing WCET for Pipelined Processors with Caches. To appear in ACSD'2011, Newcastle upon Tyne, UK, June 2011
[ bib | PDF ][Cas10] Franck Cassez. Timed Games for Computing Worst-Case Execution-Times. Research report, National ICT Australia, Sydney, June 2010. 31 pages, arXiv:1006.1951 [cs.SE].
[ bib | arXiv | PDF File ]Last updated May 14, 2012 by Franck Cassez