Hello, I will briefly present two breakthroughs we made for formally verified compilers in Coq. A formally verified compiler is a compiler with a proof of correctness: every transformation has to be proven correct.

Our first breakthrough is to provide scalable scheduling optimizations. A scheduling optimization is one where instructions are reordered. This kind of transformation is very hard to prove structurally, so we split it: an untrusted oracle proposes a schedule, and we then execute the returned schedule symbolically to check its correctness. Our second contribution is that we are the first to target an interlocked VLIW processor, that is, a processor with explicit instruction-level parallelism.

Here is a tiny example to introduce the impact of scheduling. In this example, we assume that ADD and SUB have a latency of 1, and LOAD has a latency of 2. We consider a simple model, a 3-stage pipeline: the DECODE stage, the EXEC1 stage where ADD and SUB finish, and EXEC2 where LOAD finishes. If we run the instructions in order, there will be a stall because the R3 register is not yet available. However, if we permute instructions 2 and 1, we gain one cycle because there is no longer a stall.
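The stall described above can be reproduced with a tiny simulator. This is a sketch under the talk's stated latencies (ADD/SUB take 1 cycle, LOAD takes 2); the concrete instructions and the single-issue, stall-on-hazard machine model are assumptions for illustration, not the Kalray pipeline itself.

```python
def cycles(program):
    """Cycle count on a toy in-order, single-issue machine that stalls
    until every source operand is ready.
    Each instruction is (dst_reg, src_regs, latency)."""
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest issue cycle for the next instruction
    finish = 0
    for dst, srcs, latency in program:
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])  # stall on hazards
        ready[dst] = issue + latency
        finish = max(finish, ready[dst])
        cycle = issue + 1   # at most one instruction issued per cycle
    return finish

load = ("r3", [], 2)        # LOAD r3  (latency 2)
add  = ("r4", ["r3"], 1)    # ADD  r4 <- uses r3, so it must wait for the LOAD
sub  = ("r5", [], 1)        # SUB  r5  (independent)

original  = [load, add, sub]  # ADD stalls one cycle waiting for r3
scheduled = [load, sub, add]  # the independent SUB fills the stall cycle
```

Running `cycles` on both orders shows the permuted sequence finishing one cycle earlier, exactly the effect described in the example.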
In practice, our untrusted oracle handles a set of resource constraints, as well as latency constraints that depend on both the processor latencies and the original program. Then, to ensure that the proposed schedule is actually correct, we execute both the original code and the scheduled code symbolically, and we check that they are the same.

On architectures such as the Kalray processor, a bundle is a group of instructions that are to be executed in parallel. In our work, we defined semantics for bundle execution, and then used a scheduling pass to generate bundles in CompCert. Our verifier works in two steps: as mentioned earlier, the first step verifies the equivalence of the unordered and reordered basic blocks, and the second step verifies that each bundle has the same execution under sequential and parallel semantics.

Finally, we give some performance comparisons. The top chart compares the performance of the original code to the scheduled code. Compared to the original version without any scheduling we gain 40% on average, and compared to the greedy bundler we gain 20% on average. The bottom chart compares CompCert to Kalray's GCC. Most safety-critical applications use GCC without any optimization, and we can see that we always beat GCC -O0 by a large margin. We also beat GCC -O1 most of the time.
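The symbolic-execution check described above can be sketched in miniature. The real verifier works on CompCert's symbolic terms (with hash-consing and memory modeling); this toy version only handles straight-line register code and compares final states by syntactic equality, so every name and instruction format here is an assumption for illustration.

```python
def symbolic_exec(block, nregs=8):
    """Execute straight-line code over symbolic values.
    Each instruction is (op, dst_reg, src_regs); memory is ignored.
    The final state maps each register to a symbolic expression tree."""
    state = {r: ("init", r) for r in range(nregs)}   # fresh initial symbols
    for op, dst, srcs in block:
        # the destination now holds the op applied to its operands' symbols
        state[dst] = (op,) + tuple(state[s] for s in srcs)
    return state

def equivalent(block_a, block_b):
    """Accept a reordering only if both blocks yield the same final state."""
    return symbolic_exec(block_a) == symbolic_exec(block_b)

# Two independent instructions commute, so the schedule is accepted...
a = [("add", 2, [0, 1]), ("sub", 3, [0, 1])]
b = [("sub", 3, [0, 1]), ("add", 2, [0, 1])]
# ...but swapping a dependent pair changes the final state, so it is rejected.
c = [("load", 3, [1]), ("add", 4, [3, 2])]
d = [("add", 4, [3, 2]), ("load", 3, [1])]
```

Because the check runs after the untrusted oracle, a wrong schedule is simply rejected at compile time; only the checker, not the oracle, has to be proven correct in Coq.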
As for -O2 and -O3, we are not there yet, but we are getting closer. To further bridge the gap, we are currently working on superblock scheduling before register allocation.

I am now going to demonstrate the CompCert compiler and our optimizations on a simple example: a matrix multiplication. This is the very simple code that performs the matrix multiplication. We are going to compile it, then execute it and measure the number of cycles.

First, we run CompCert on the example without any postpass optimization, so you will see the actual original code. Here is the code of the innermost loop. That code is not optimal at all, because we only have one instruction per bundle, and we also have stalls in the pipeline. For example, this multiply-add takes two cycles, yet we issue a load on the very next cycle. That makes for very suboptimal performance. If we run it, we get this number of cycles.

Now we run CompCert with the first version of the scheduler, which just packs instructions together into bundles. On this example, the oracle packed some instructions together, and the verifier checked that this is actually correct.
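The demo's actual C source is not shown in the transcript; as a hedged sketch, the "very simple code" is presumably the classic triple nested loop, whose innermost iteration is the multiply-add/load pattern discussed above. In Python, the same algorithm reads:

```python
def matmul(A, B):
    """Naive triple-loop matrix multiply, the algorithm the demo
    compiles (the demo itself uses a C version, not shown here)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # innermost loop: load two elements, then multiply-add;
                # this is the instruction pattern the scheduler reorders
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The inner loop body carries a load-to-multiply-add dependency on every iteration, which is why the unscheduled code stalls and why scheduling and bundling pay off so visibly on this example.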
This gives better performance; however, we still have issues with stalls, as you can see here. We can run it, and we see that we are slightly better, but it could still be improved.

So now let's reorder the instructions with our list scheduler. Here is the result. We can see that many instructions have been moved, which results in much more efficient code. In particular, we can now execute four instructions at the same time in the same bundle. If we run that example, we can see that we get much better performance than before.

Thanks for your attention.