Hello, I will briefly present two breakthroughs we made for formally verified compilers in Coq. A formally verified compiler is a compiler with a proof of correctness: every transformation has to be proven correct.

Our first breakthrough is to provide scalable scheduling optimizations. A scheduling optimization is one where instructions are reordered. This kind of transformation is very hard to prove structurally, so we split it: an untrusted oracle proposes a schedule, and we then execute the returned schedule symbolically to check its correctness. Our second contribution is that we are the first to target an interlocked VLIW processor, that is, a processor with explicit instruction-level parallelism.

Here is a tiny example to introduce the impact of scheduling. In this example, we assume that ADD and SUB have a latency of 1, and LOAD has a latency of 2. We consider a simple model, a 3-stage pipeline: the DECODE stage, the EXEC1 stage where ADD and SUB finish, and EXEC2 where LOAD finishes. If we run the instructions in order, there will be a stall because the R3 register is not yet available. However, if we permute instructions 2 and 1, we gain one cycle because there is no longer a stall.
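The stall described above can be reproduced with a tiny simulator. This is a sketch under the talk's stated latencies (ADD/SUB take 1 cycle, LOAD takes 2); the concrete instructions and the single-issue, stall-on-hazard machine model are assumptions for illustration, not the Kalray pipeline itself.

```python
def cycles(program):
    """Cycle count on a toy in-order, single-issue machine that stalls
    until every source operand is ready.
    Each instruction is (dst_reg, src_regs, latency)."""
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest issue cycle for the next instruction
    finish = 0
    for dst, srcs, latency in program:
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])  # stall on hazards
        ready[dst] = issue + latency
        finish = max(finish, ready[dst])
        cycle = issue + 1   # at most one instruction issued per cycle
    return finish

load = ("r3", [], 2)        # LOAD r3  (latency 2)
add  = ("r4", ["r3"], 1)    # ADD  r4 <- uses r3, so it must wait for the LOAD
sub  = ("r5", [], 1)        # SUB  r5  (independent)

original  = [load, add, sub]  # ADD stalls one cycle waiting for r3
scheduled = [load, sub, add]  # the independent SUB fills the stall cycle
```

Running `cycles` on both orders shows the permuted sequence finishing one cycle earlier, exactly the effect described in the example.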
In practice, our untrusted oracle handles a set of resource constraints, as well as latency constraints that depend on both the processor latencies and the original program. Then, to ensure that the proposed schedule is actually correct, we execute both the original code and the scheduled code symbolically, and we check that they are the same.

On architectures such as the Kalray processor, a bundle is a group of instructions that are to be executed in parallel. In our work, we defined semantics for bundle execution, and then used a scheduling pass to generate bundles in CompCert. Our verifier works in two steps: as mentioned earlier, the first step verifies the equivalence of the unordered and reordered basic blocks, and the second step verifies that each bundle has the same execution under sequential and parallel semantics.

Finally, we give some performance comparisons. The top chart compares the performance of the original code to the scheduled code. Compared to the original version without any scheduling we gain 40% on average, and compared to the greedy bundler we gain 20% on average. The bottom chart compares CompCert to Kalray's GCC. Most safety-critical applications use GCC without any optimization, and we can see that we always beat GCC -O0 by a large margin. We also beat GCC -O1 most of the time.
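The symbolic-execution check described above can be sketched in miniature. The real verifier works on CompCert's symbolic terms (with hash-consing and memory modeling); this toy version only handles straight-line register code and compares final states by syntactic equality, so every name and instruction format here is an assumption for illustration.

```python
def symbolic_exec(block, nregs=8):
    """Execute straight-line code over symbolic values.
    Each instruction is (op, dst_reg, src_regs); memory is ignored.
    The final state maps each register to a symbolic expression tree."""
    state = {r: ("init", r) for r in range(nregs)}   # fresh initial symbols
    for op, dst, srcs in block:
        # the destination now holds the op applied to its operands' symbols
        state[dst] = (op,) + tuple(state[s] for s in srcs)
    return state

def equivalent(block_a, block_b):
    """Accept a reordering only if both blocks yield the same final state."""
    return symbolic_exec(block_a) == symbolic_exec(block_b)

# Two independent instructions commute, so the schedule is accepted...
a = [("add", 2, [0, 1]), ("sub", 3, [0, 1])]
b = [("sub", 3, [0, 1]), ("add", 2, [0, 1])]
# ...but swapping a dependent pair changes the final state, so it is rejected.
c = [("load", 3, [1]), ("add", 4, [3, 2])]
d = [("add", 4, [3, 2]), ("load", 3, [1])]
```

Because the check runs after the untrusted oracle, a wrong schedule is simply rejected at compile time; only the checker, not the oracle, has to be proven correct in Coq.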
As for -O2 and -O3, we are not there yet, but we are getting closer. To further bridge the gap, we are currently working on superblock scheduling before register allocation.

I am now going to demonstrate the CompCert compiler and our optimizations on a simple example: a matrix multiplication. This is the very simple code that performs the matrix multiplication. We are going to compile it, then execute it and measure the number of cycles.

First, we run CompCert on the example without any postpass optimization, so you will see the actual original code. Here is the code of the innermost loop. That code is not optimal at all, because we only have one instruction per bundle, and we also have stalls in the pipeline. For example, this multiply-add takes two cycles, yet we issue a load on the very next cycle. That makes for very suboptimal performance. If we run it, we get this number of cycles.

Now we run CompCert with the first version of the scheduler, which just packs instructions together into bundles. On this example, the oracle packed some instructions together, and the verifier checked that this is actually correct.
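The demo's actual C source is not shown in the transcript; as a hedged sketch, the "very simple code" is presumably the classic triple nested loop, whose innermost iteration is the multiply-add/load pattern discussed above. In Python, the same algorithm reads:

```python
def matmul(A, B):
    """Naive triple-loop matrix multiply, the algorithm the demo
    compiles (the demo itself uses a C version, not shown here)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # innermost loop: load two elements, then multiply-add;
                # this is the instruction pattern the scheduler reorders
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The inner loop body carries a load-to-multiply-add dependency on every iteration, which is why the unscheduled code stalls and why scheduling and bundling pay off so visibly on this example.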
This gives better performance; however, we still have issues with stalls, as you can see here. We can run it, and we see that we are slightly better, but it could still be improved.

So now let's reorder the instructions with our list scheduler. Here is the result. We can see that many instructions have been moved, which results in much more efficient code. In particular, we can now execute four instructions at the same time in the same bundle. If we run that example, we can see that we get much better performance than before.

Thanks for your attention.