CBMC as proof
Think of CBMC as a tool for doing mathematical proof. CBMC does an exhaustive search for issues affecting functional correctness and memory safety. If CBMC finds an issue, as we have seen, then CBMC has likely found a bug. But if CBMC does not find any issues, then you can rest assured that the issues like memory safety that you are asking CBMC to check for will not occur in your program with the inputs you are asking CBMC to check.
We note, however, that using CBMC may require bounding the size of the inputs you are asking CBMC to check. It may be necessary, for example, to assume that packets coming off the network are of size at most 1024 bytes in order to fit the problem into CBMC. We refer to assumptions like this as proof assumptions. It is important to remember that proof assumptions exists, and to communicate them clearly to the reader or consumer of any proof you write with CBMC.
Let's look again at memory-safety.c
#define SIZE 20
char buffer[SIZE];
char read_buffer(int i) { return buffer[i]; }
char read_pointer(int i) { return *(buffer + i); }
int main() {
int index;
read_buffer(index);
read_pointer(index);
}
that we found to have memory safety issues in the last section. The problem was obviously the lack of bounds checking in the code. Let's fix that.
Memory safety proof
Let's try adding bounds checking. Consider memory-safety1.c
#define SIZE 20
char buffer[SIZE];
char read_buffer(int i) {
if (i < SIZE) return buffer[i];
return '\0';
}
char read_pointer(int i) {
if (i < SIZE) return *(buffer + i);
return '\0';
}
int main() {
int index;
read_buffer(index);
read_pointer(index);
}
We probably want better error handling than just returning NULL
,
but let's run CBMC to prove memory safety:
cbmc memory-safety1.c --bounds-check --pointer-check
** Results:
memory-safety1.c function read_buffer
[read_buffer.array_bounds.1] line 10
array 'buffer' lower bound in buffer[(signed long int)i]: FAILURE
[read_buffer.array_bounds.2] line 10
array 'buffer' upper bound in buffer[(signed long int)i]: SUCCESS
memory-safety1.c function read_pointer
[read_pointer.pointer_dereference.1] line 14
dereference failure: pointer outside object bounds in buffer[(signed long int)i]:
FAILURE
** 2 of 3 failed (2 iterations)
VERIFICATION FAILED
Oh, nuts. We failed. We added bounds checking only for the upper bound.
Memory safety proof 2.0
Let's try adding bounds checking for both the upper and lower bounds. Consider memory-safety2.c
#define SIZE 20
char buffer[SIZE];
char read_buffer(int i) {
if (0 <= i && i < SIZE) return buffer[i];
return '\0';
}
char read_pointer(int i) {
if (0 <= i && i < SIZE) return *(buffer + i);
return '\0';
}
int main() {
int index;
read_buffer(index);
read_pointer(index);
}
and run CBMC
cbmc memory-safety2.c --bounds-check --pointer-check
** 0 of 3 failed (1 iterations)
VERIFICATION SUCCESSFUL
and declare victory.
Memory safety proof harness
Let's pivot back to the notion of a proof harness before we declare complete victory.
Back when we introduced CBMC as a form of unit testing,
we had a function quartile under test
and a unit test for quartile
.
The unit test set up the software environment for the function under test
and invoked the function.
The function quartile
was so simple that all the unit test had to do
was initialize the argument x
passed to quartile
and invoke
quartile
on x
.
Then we removed the initialization of x
so that CBMC could consider
all possible integer values for the argument x
.
We transformed the unit test into what we call a proof harness for
the function quartile
and used that with CBMC.
We routinely find ourselves with a library and wanting to use CBMC to prove that all of the public entry points in the library's API are memory safe. Our style is to write a memory safety proof for each entry point independently. We write one proof harness for each entry point. The purpose of the proof harness is to model the software environment in which the entry point will be called. The proof harness will always have to model the inputs to the entry point. But the proof harness might also have to model some aspects of the global state if, for example, there is a system table accessed by the entry point.
In the ideal world, a proof harness will construct arbitrary, unconstrained values for every variable and data structure in the software environment of an entry point, and invoke the entry point. In the real world, the entry point makes some assumptions about its inputs. It may, at the very least, assume that the system table is well-formed in the sense that it satisfies some data structure invariant. In the ideal world, the proof author will talk with the code author and convince the developer to write more bullet-proof code. In the real world, the proof author will model these assumptions in the proof harness to prove memory safety. But now these assumptions become part of the proof: CBMC has demonstrated that the entry point is memory safe, but only when the assumptions described in the proof harness are true when the entry point is invoked. A developer might want to include in the testing framework some checks that these assumptions really are true at all call sites calling the entry point. Or a developer might want to use CBMC to prove this...
But let's return to our memory safety example and transform it into something you are more likely to encounter in the real world.
The library
First, the buffer and the accessor functions are likely to be part of a library. Here is library.h
char read_buffer(int i);
char read_pointer(int i);
and library.c
#include "library.h"
#define SIZE 20
char buffer[SIZE];
char read_buffer(int i) {
if (0 <= i && i < SIZE) return buffer[i];
return '\0';
}
char read_pointer(int i) {
if (0 <= i && i < SIZE) return *(buffer + i);
return '\0';
}
The proof harnesses
Second, we write one proof harness for each entry point. Here is read_buffer_harness.c
#include "library.h"
harness() {
int index;
read_buffer(index);
}
#include "library.h"
harness() {
int index;
read_pointer(index);
}
Notice that we have changed the function name from main
to harness
in
these proof harnesses. This is not essential, it is just a matter of style,
but it does avoid name conflicts with a real main
that might exist in
the real environment.
The proof
Now we can run the proofs just as before, giving both the library
and the proof harness on the command line, and also indicating to CBMC
that the function to test is now called harness
and not main
:
cbmc library.c read_buffer_harness.c --bounds-check --pointer-check --function harness
cbmc library.c read_pointer_harness.c --bounds-check --pointer-check --function harness
In both cases:
** 0 of 3 failed (1 iterations)
VERIFICATION SUCCESSFUL
Success!!
Building code for proof
One last thing.
In the real world, the library is not this simple.
There is probably a build process to compile the source code into a library
using a compiler like gcc
.
One approach is to build the library using goto-cc
in place of gcc
.
The compiler goto-cc
is intended to be a drop-in replacement for gcc
.
Using goto-cc
should be as easy as building with CC=goto-cc
in place
of CC=gcc
.
The difference is that goto-cc
produces an intermediate representation for
the program called
a "goto program" or "goto binary" that is used by CBMC for model checking.
When we run cbmc
directly on a source file, CBMC is invoking goto-cc
on the source file
before model checking, but we can run goto-cc
ourselves to build the
goto program, and run CBMC on that.
Our approach, however, is not to build the entire library. There are performance and complexity issues that we won't go into here related to symbolic execution and constraint solving. We usually build only the portions of the library that are required to prove that a particular entry point is memory safe.
Using our simple library above to illustrate, we would build and prove
memory safe the entry point read_buffer
as follows:
goto-cc library.c -o library.goto
goto-cc read_buffer_harness.c -o read_buffer_harness.goto
goto-cc library.goto read_buffer_harness.goto -o read_buffer.goto --function harness
cbmc read_buffer.goto --bounds-check --pointer-check
The first two lines build the source files, the third links them
together (and identifies harness
as the entry point), and the fourth line
runs CBMC on the result. The process for read_pointer
is similar.
Now, finally, and forever:
** 0 of 3 failed (1 iterations)
VERIFICATION SUCCESSFUL
Success!!