In a previous post (here) we discussed what are and the types of real-time systems. We also gave some examples to highlight their importance and clarify some of the confusion regarding the real-time and run-time performance tradeoff. In this post, we will focus exclusively on “hard” real-time systems and how they are statically analyzed (i.e. before the program is run) to guarantee that deadlines are never missed.

NOTE: You can find a great survey of hard real-time analysis methods and tools (both commercial and open-source) here. I will only discuss one method in this post, but there are alternatives.

Overview

Our goal is to ensure that the hard real-time system never misses a deadline. To achieve that, we are going to count how many clock cycles the processor will take to run our given hard real-time program through any possible execution path. Then we will select the code path that takes the longest to run and declare that our Worst-Case Execution Time (WCET). The hard real-time system will meet all its deadlines if the WCET is smaller than the deadline, otherwise the system needs to be re-designed so that it does meet its deadlines. For example, the WCET to draw a frame must be less than 16 ms if we would like the Chrome browser to operate at 60 fps although that is not a hard real-time application!

NOTE: I do mean counting how many clock cycles it takes to execute each add, sub, ld, st, etc instruction. For this reason, I have sometimes heard people call this method cycle-counting analysis.

The worst-case idea is actually very important and perhaps the most troublesome in this analysis. For example, the WCET of an if-statement is given by the branch with the longest run-time. And this is regardless of how often that particularly long-running branch of the if-statement is executed because in practice it is hard to know how many times this branch is run. So our estimated program run-time can add up in unrealistic ways when the code has widely different best-case and worst-case execution times!

Lets consider the simple pseudo-code example below to nail down this idea. The for-loop executes the if-statement N times and the run-times of statements A and B are 2 ms and 10 ms respectively. So the WCET for this code is 10 * 10 ms = 100 ms because we assume that the program is invariably going to execute the worst-case branch B from the if-statement. However, it might be the case that B is only executed 50% of the time on average, so a more realistic run-time for this code would be 5 * 2 ms + 5 * 10 ms = 60 ms. Even in this simple example we are over-estimating the WCET by about 40%! But we cannot improve our WCET with the information provided because we have no idea how frequently x(i) evaluates to true or false. If we heavy-handedly assumed that x(i) is true say 50% of the time, but its actually 60% or 70% in practice, we are at risk of missing deadlines.

for (i = 0; i < N; ++i)
    if x(i)
        A
    else
        B

It is for these reasons that we can often hear engineers saying that caching and other hardware optimization features are incompatible with hard real-time analysis. The idea of caching is to use smaller and faster memories close to the processor to reduce the time it takes to load or store data. However, caching makes the run-time of load and store instructions difficult to estimate because we cannot tell whether a memory access will result in a cache hit, so it has short run-time, or a miss so we need to check slower memories which takes longer. Therefore, the real-time analysis must often assume that the caches will miss, since this is the worst-case, and the overall WCET estimate will be far detached from the real run-time.

NOTE: As far as I know, hard real-time systems often rely on small processors, like the ARM Cortex-M0, or processors with relatively simple pipelines and caches because these are easier to analyze. For example, you can see this list of targets supported by a commercial real-time analysis tool. There is also extensive research to try to account for more complex processors, but I have not looked much into it.

In summary, we would like to make WCET analysis easier for ourselves to check whether the system will meet its deadlines. We can do so by ensuring that the program’s worst-case and best-case run-times are not outrageously different. We can also use simpler processors that do not have complicated features, like caching, that introduce unpredictability in the run-time of instructions. And without further ado, lets analyze a simple program to see how WCET works. We will perform the analysis in two steps:

Control Flow Graph (CFG) construction
Implicit Path Enumeration (IPET)

Control Flow Graph

We will analyze the following C program that implements bubblesort to sort 10 integers. I took this code from the BEEBS benchmarks mostly because it is relatively straight-forward to analyze.

#define FALSE 0
#define TRUE 1
#define NUMELEMS 10

void BubbleSort(int Array[]) {
    int Sorted = FALSE;
    int Temp, Index, i;

    for (i = 0; i < NUMELEMS; i++) {
        Sorted = TRUE;
        for (Index = 0; Index < NUMELEMS; Index++) {
            if (Index >= NUMELEMS - i) {
                break;
            }
            if (Array[Index] > Array[Index + 1]) {
                Temp = Array[Index];
                Array[Index] = Array[Index + 1];
                Array[Index + 1] = Temp;
                Sorted = FALSE;
            }
        }

        if (Sorted) {
            break;
        }
    }
}

The C code above is there mostly for illustration purposes because we are actually interested in the disassembly of that program’s binary. Analysing the disassembly, as opposed to the C code, takes into account any optimizations and transformations introduced by the compiler, so we eliminate potential errors in our estimation. In this case, it looks like the compiler inlined the BubbleSort function into another function called benchmark to produce the code shown below. The program was compiled with LLVM 10 for the ARMv6-M architecture.

00000074 <benchmark>:
  74:	b5f0      	push	{r4, r5, r6, r7, lr}
  76:	2000      	movs	r0, #0
  78:	2164      	movs	r1, #100	; 0x64
  7a:	4a0d      	ldr	r2, [pc, #52]	; (b0 <benchmark+0x3c>)
  7c:	6813      	ldr	r3, [r2, #0]
  7e:	2601      	movs	r6, #1
  80:	460c      	mov	r4, r1
  82:	4615      	mov	r5, r2
  84:	e006      	b.n	94 <benchmark+0x20>
  86:	602f      	str	r7, [r5, #0]
  88:	606b      	str	r3, [r5, #4]
  8a:	2600      	movs	r6, #0
  8c:	1e64      	subs	r4, r4, #1
  8e:	1d2d      	adds	r5, r5, #4
  90:	2c00      	cmp	r4, #0
  92:	d004      	beq.n	9e <benchmark+0x2a>
  94:	686f      	ldr	r7, [r5, #4]
  96:	42bb      	cmp	r3, r7
  98:	dcf5      	bgt.n	86 <benchmark+0x12>
  9a:	463b      	mov	r3, r7
  9c:	e7f6      	b.n	8c <benchmark+0x18>
  9e:	2e00      	cmp	r6, #0
  a0:	d103      	bne.n	aa <benchmark+0x36>
  a2:	1e49      	subs	r1, r1, #1
  a4:	1c40      	adds	r0, r0, #1
  a6:	2864      	cmp	r0, #100	; 0x64
  a8:	d1e8      	bne.n	7c <benchmark+0x8>
  aa:	2000      	movs	r0, #0
  ac:	bdf0      	pop	{r4, r5, r6, r7, pc}
  ae:	46c0      	nop			; (mov r8, r8)
  b0:	00000164 	.word	0x00000164

The first step in the analysis is to construct a Control Flow Graph (CFG) out of this disassembly. The CFG is a directed graph representation that captures the control-flow information, i.e. branches, conditionals, loops, etc, from the program. The nodes of a CFG represent blocks of code without branches or branch destinations, while the edges are branches as shown in the figure below. Each node and edge is given an identifier (e.g. block0 or b0 for short, e0, b1, etc) for convenience and a cost coefficient.

The cost coefficient represents the worst-case cost, in clock cycles, of executing the sequence of instructions in a node’s block of code. And this is where the choice of processor is very important! A cost model of the chosen processor is used to estimate the time that the microarchitecture takes to execute each instructions. From this, we calculate each node and edge’s cost coefficient. If the processor has unpredictable features like complex caching then its hard to approximate the worst-case run-time of instructions, and as a result, the cost cofficients could be way off-board.

In our case, the cost model is very simple because hard real-time analysis targets a processor similar to an ARM Cortex-M0 which has a very simple microarchitecture with the very predictable execution times listed here. For example, node b6 has a compare (cmp) instruction followed by a conditional branch (bne) with 1 and 3 clock cycles of run-time respectively, so the cost coefficient for b6 is 1 + 3 = 4.

NOTE: I generated this CFG using a tool I jointly developed with a colleague during my PhD. It takes an ARMv6-M program binary as an input along with a configuration file to produce the WCET estimate (among other things like the CFG!). The tool is open-source and is available here.

NOTE: CFG’s are not unique to real-time analysis. They are used in many other applications like compilers.

Control Flow Graph (CFG) for bubblesort.

Control Flow Graph for the bubblesort program. block0 is the start of the function. block10 is a placeholder exit node to ensure that the CFG has a single exit point. The exit node has no "cost" because it does not include any real instructions.

Integer Linear Programming

The second part of the analysis uses Implict Path Enumeration (IPET) to calculate the WCET. IPET derives an Integer Linear Programming (ILP) problem by combining the flow information and coefficients encoded in the CFG. It consists of an objective function, that encodes the run-time of the code when solved, and a set of constraints over the variables in that function.

The ILP problem for our bubblesort CFG is shown below. It tells us that the run-time of the code will be 11 clock cycles (i.e. b0’s cost coefficient) multiplied by the number of times that node b0 in the CFG is executed plus 9 clock cycles multiplied by the number of times that node b1 is executed and so on. But the ILP also gives us a few restrictions or contraints when calculating a solution for the variables b0, b1, e4, e6, etc. For example, there is an input constraint b0 = 1 meaning that the code at node b0 can only be executed once on entry to the function which makes sense because that code is not within a loop. There is also an output constraint b0 = e0 restricting edge e0 to be “executed” as many times as the code at b0 is executed.

The ILP formulation is constructed from a simple graph traversal of the CFG. Each node and edge has a cost coefficient (or 0) and a unique identifier; the objective function is a summation of these terms. The majority of the constraints relate the number of times that the edges are followed with how often the node is executed. Loop bounds are the only constraints that are difficult to construct as the information is not readily available in the CFG, so these bounds are either input manually or estimated automatically using a tool like SWEET.

/* Objective function */
max: 11 b0 + 9 b1 + 5 b2 + 6 b3 + 7 b4 + 4 b5 + 4 b6 + 6 b7 + 11 b8 + 0 b10
     - 2 e4 - 2 e6 - 2 e9 - 2 e11;

/* Output constraints */
b0 = e0;
b1 = e1;
b2 = e2;
b3 = e3 + e4;
b4 = e5 + e6;
b5 = e7;
b6 = e8 + e9;
b7 = e10 + e11;
b8 = e12;
b10 = 1;

/* Input constraints */
b0 = 1;
b1 = e0 + e10;
b2 = e5;
b3 = e2 + e7;
b4 = e1 + e4;
b5 = e6;
b6 = e3;
b7 = e9;
b8 = e8 + e11;
b10 = e12;

/* Loop constraints */
10 b0 <= b1;
b1 <= 10 b0;
10 b1 <= b4;
b4 <= 10 b1;

Finally, we need to solve the ILP problem by maximizing the objective function as we would like to calculate the worst-case execution time. We can do just that with a nice tool called lp_solve that you probably have installed in your Linux computer without realizing. The lp_solve output for our bubblesort ILP is shown below, so the WCET estimate for this code is 1810 clock cycles!

Value of objective function: 1810.00000000

Actual values of the variables:
b0                              1
b1                             10
b2                            100
b3                            100
b4                            100
b5                              0
b6                             10
b7                             10
b8                              1
b10                             1
e4                             90
e6                              0
e9                             10
e11                             1
e0                              1
e1                             10
e2                            100
e3                             10
e5                            100
e7                              0
e8                              0
e10                             9
e12                             1

Conclusion

And that is it! We statically analyzed a hard real-time implementation of bubblesort and estimated a WCET. It is worth pointing out that this is frustratingly difficult in practice as the complexity of the problem increases very quickly when programs are larger or the processor has more features. Also, there is a lot of manual labour involved in getting the right parameters, like look bounds, and constantly re-designing bits of your system to try to fulfil the real-time constraints. But hopefully this was a helpful introduction to hard real-time systems!