Compiler says no!

Wrapping SQLite's UTF-8 strings using Rust's FFI

Wrapping raw pointers to UTF-8 encoded text from SQLite allows offloading a lot of safety checks to the Rust compiler, but requires detailed insight into the workings of both SQLite and memory allocation.

With a way to open an SQLite database in place, we can look towards running a query. The sqlite3_exec function offers a quick start, bundling all steps required to run a query into a single function at the cost of flexibility:

int sqlite3_exec(
  sqlite3*,                                  /* An open database */
  const char *sql,                           /* SQL to be evaluated */
  int (*callback)(void*,int,char**,char**),  /* Callback function */
  void *,                                    /* 1st argument to callback */
  char **errmsg                              /* Error msg written here */
);

One type in the signature sticks out: char** (or *mut *mut c_char in Rust), an array of C-style[1] string values. These parameters to the callback function contain the column values and column headers for a given result row.

When CStr is not an Option

The double pointer (char** / *mut *mut c_char) is conceptually similar to what in Rust would be a &[&CStr], a slice of zero-terminated string references, with the core difference that the C type does not contain the length. In Rust parlance, a char** is called a thin pointer, which is the size of a single word (i.e. usize), whereas fat pointers are larger, typically two words, and contain the length or other information needed to determine the total size of the object they point at. Fat pointers are important for memory safety if the pointee (i.e. the object being pointed at) is variably sized: from a thin pointer alone, it is impossible to determine whether or not a read is out of bounds.

The length is present in the C interface as well, in the form of the int parameter passed to the callback function. For a safe wrapper, we should wrap these SQLite-allocated strings using native Rust types, allowing the compiler to check for memory safety. Here is a first attempt:
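The difference between thin and fat pointers can be observed directly through their sizes. The following is a minimal sketch; the two function names are made up for illustration and are not part of any library.

```rust
use std::mem::size_of;
use std::os::raw::c_char;

/// Size in bytes of the thin pointer type corresponding to C's `char**`.
pub fn thin_pointer_size() -> usize {
    size_of::<*mut *mut c_char>()
}

/// Size in bytes of a slice reference, a fat pointer that carries a length
/// next to the address.
pub fn fat_pointer_size() -> usize {
    size_of::<&[*mut c_char]>()
}
```

On common targets the first function returns one machine word and the second returns two, the extra word being the element count.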

// Naming our arguments for the sake of example.
let arg_1_count: c_int = todo!();
let arg_3_columns: *mut *mut c_char = todo!();

// !!! DO NOT DO THIS, UNDEFINED BEHAVIOR !!!
let columns: &[&CStr] = unsafe { slice::from_raw_parts(
        arg_3_columns as *const &CStr,
        arg_1_count as usize
) };

The slice::from_raw_parts function is used to construct a slice, a fat pointer, from a thin pointer and a length. It does come with a set of requirements (condition numbers $C_{a..g}$ have been added):

  • $C_a:$ data must be valid for reads for len * mem::size_of::<T>() many bytes, and it must be properly aligned. […]
    • $C_b:$ The entire memory range of this slice must be contained within a single allocated object!
    • $C_c:$ data must be non-null and aligned even for zero-length slices.
  • $C_d:$ data must point to len consecutive properly initialized values of type T.
  • $C_e:$ The memory referenced by the returned slice must not be mutated for the duration of lifetime 'a, except inside an UnsafeCell.
  • $C_f:$ The total size len * mem::size_of::<T>() of the slice must be no larger than isize::MAX, and $C_g:$ adding that size to data must not “wrap around” the address space. See the safety documentation of pointer::offset.

The SQLite documentation states that it uses the system memory allocator by default, which on the majority of Linux systems will be glibc’s malloc implementation[2]. From the documentation of its internals, we know that the memory it returns is aligned, taking care of $C_a$, and nothing leads us to believe that SQLite is performing alchemy on memory fragments that would violate $C_b$. However, $C_c$ is an issue: we have no guarantee that something returning zero result columns would supply a non-NULL pointer for arg_3_columns, so we will need to explicitly guard against this case later on. malloc will also not wrap around, satisfying $C_g$, and since our returned types are all immutable and not modified by SQLite itself, $C_e$ will not be an issue.
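A guard for the $C_c$ case could look roughly like the sketch below. The helper name raw_columns is hypothetical, and the function deliberately stays at the level of raw pointers, so it says nothing yet about what the individual strings contain.

```rust
use std::os::raw::{c_char, c_int};
use std::slice;

/// Hypothetical helper: views the callback's column array as a slice of raw
/// pointers, falling back to an empty slice when the array pointer is NULL
/// or the count is not positive.
///
/// # Safety
///
/// If `columns` is non-null, it must point to `count` initialized pointers
/// that stay valid and unmodified for the lifetime `'a`.
pub unsafe fn raw_columns<'a>(count: c_int, columns: *mut *mut c_char) -> &'a [*mut c_char] {
    if columns.is_null() || count <= 0 {
        // Never hand a NULL pointer to `slice::from_raw_parts`.
        return &[];
    }
    unsafe { slice::from_raw_parts(columns as *const *mut c_char, count as usize) }
}
```

Returning an empty slice keeps the caller's code path uniform: a row without columns simply yields nothing to iterate over.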

For $C_f$, we can determine the maximum size by looking at the SQLite limits: SQLITE_MAX_COLUMN, set at compile time and defaulting to 2000, results in 2000 * mem::size_of::<*mut c_char>() being roughly 16 KB on a 64-bit target, which is well below the roughly 9 exabytes of address space that isize::MAX affords us.
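The arithmetic behind $C_f$ can be spelled out in a few lines. This is only a sketch: the constant mirrors the documented default of SQLITE_MAX_COLUMN, and the names are invented for this example.

```rust
use std::mem::size_of;
use std::os::raw::c_char;

/// Documented default of SQLite's compile-time `SQLITE_MAX_COLUMN` limit.
pub const DEFAULT_MAX_COLUMN: usize = 2000;

/// Upper bound in bytes for the pointer array of a single result row.
pub fn max_row_array_bytes() -> usize {
    DEFAULT_MAX_COLUMN * size_of::<*mut c_char>()
}

/// Checks the size condition of `slice::from_raw_parts` against `isize::MAX`.
pub fn fits_size_limit() -> bool {
    max_row_array_bytes() <= isize::MAX as usize
}
```

A build of SQLite with a raised SQLITE_MAX_COLUMN would change the constant, but not the conclusion, as long as the product stays far below isize::MAX.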

The biggest problem with this code is the explicit cast, which leads to undefined behavior by violating $C_d$: a &CStr cannot be constructed by casting a *const c_char, since CStr lacks a #[repr(C)] that would ensure ABI compatibility, as the documentation for CStr explicitly states:

Note that this structure is not repr(C) and is not recommended to be placed in the signatures of FFI functions. Instead, safe wrappers of FFI functions may leverage the unsafe CStr::from_ptr constructor to provide a safe interface to other consumers.
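The mismatch can be made visible by comparing sizes. The sketch below uses made-up helper names, and the size of &CStr is an implementation detail of the standard library that may change; at the time of writing it is a fat pointer that carries a length, so it does not even have the same size as a C string pointer.

```rust
use std::ffi::CStr;
use std::mem::size_of;
use std::os::raw::c_char;

/// Size in bytes of a `&CStr` on the current toolchain.
pub fn cstr_ref_size() -> usize {
    size_of::<&CStr>()
}

/// Size in bytes of a raw C string pointer.
pub fn raw_char_ptr_size() -> usize {
    size_of::<*const c_char>()
}

/// True if reinterpreting an array of `*const c_char` as an array of `&CStr`
/// would change the element size, and with it the position of every element.
pub fn cast_would_change_element_size() -> bool {
    cstr_ref_size() != raw_char_ptr_size()
}
```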

Unfortunately CStr::from_ptr is a bad fit for our array of strings: we cannot call it on every element of the slice without creating a new container to hold the resulting &CStr instances, which is wasteful. Another issue is that CStr::from_ptr may be expensive to call, as it is actually $O(n)$ in complexity, scanning the entire string for a NUL byte to determine its length.
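For comparison, this is roughly what the rejected approach would look like. The function name collect_cstrs is hypothetical; the point of the sketch is the extra Vec allocation and the per-string scan, both of which the custom type is meant to avoid.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper showing the costly variant: one length scan per
/// string, plus a freshly allocated `Vec` to hold the references.
///
/// # Safety
///
/// Every pointer in `raw` must be non-null and point to a NUL-terminated
/// string that stays valid and unmodified for the lifetime `'a`.
pub unsafe fn collect_cstrs<'a>(raw: &[*const c_char]) -> Vec<&'a CStr> {
    raw.iter()
        .map(|&ptr| unsafe { CStr::from_ptr(ptr) })
        .collect()
}
```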

A custom type: SQLiteTextStr

We will need to come up with our own type without CStr’s drawbacks. In the simplest case it could look like this:

/// A NUL-terminated UTF-8 bytestring.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct SQLiteTextStr(*const c_char);

// Note: Do not use, still not well defined.
let columns: &[SQLiteTextStr] = unsafe { slice::from_raw_parts(
        arg_3_columns as *const SQLiteTextStr,
        arg_1_count as usize
) };

The cast from *mut c_char to this type is perfectly fine, since it only goes from mutable to constant, and the #[repr(transparent)] attribute ensures that the binary representation is the same.

The value of our wrapping newtype is in its added invariants: according to the docs, the strings created by sqlite3_column_text are guaranteed to be valid UTF-8. SQLiteTextStr reflects this, as well as the strings’ NUL-termination. Copy is added since the type is just a read-only pointer, just as any &T in Rust is Copy.
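The layout claim can be exercised in isolation. In this sketch, as_ptr and wrap_pointers are hypothetical helpers added for illustration; the conversion is marked unsafe because the caller still has to vouch for the contents of the strings.

```rust
use std::os::raw::c_char;
use std::slice;

/// Simplified newtype with the same layout as a raw C string pointer.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr(*const c_char);

impl SQLiteTextStr {
    /// Returns the wrapped pointer (hypothetical accessor).
    pub fn as_ptr(self) -> *const c_char {
        self.0
    }
}

/// Reinterprets a slice of raw pointers as a slice of the newtype without
/// copying anything.
///
/// # Safety
///
/// Every pointer must uphold the invariants of the newtype, i.e. point to
/// NUL-terminated, valid UTF-8 text.
pub unsafe fn wrap_pointers(raw: &[*const c_char]) -> &[SQLiteTextStr] {
    // `#[repr(transparent)]` guarantees identical size and alignment.
    unsafe { slice::from_raw_parts(raw.as_ptr() as *const SQLiteTextStr, raw.len()) }
}
```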

Converting to str

A well-defined conversion function fn to_str(self) -> &str would seem useful, except that the function is invalid due to self not having a lifetime[3]. This is fixed by adding a lifetime parameter 'a to the type, which in turn requires PhantomData:

#[repr(transparent)]
struct SQLiteTextStr<'a>{
    ptr: *const c_char,
    _phantom: PhantomData<&'a str>,
}

This type is still ABI-compatible with *const c_char due to the fact that PhantomData is guaranteed to be zero-sized — otherwise repr(transparent) would cause it to not compile.
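That the marker really costs nothing can be checked with size_of. A small sketch, with both function names invented for this example:

```rust
use std::marker::PhantomData;
use std::mem::size_of;
use std::os::raw::c_char;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: *const c_char,
    _phantom: PhantomData<&'a str>,
}

/// Size of the lifetime marker on its own.
pub fn marker_size() -> usize {
    size_of::<PhantomData<&'static str>>()
}

/// Size of the wrapper including the marker.
pub fn wrapper_size() -> usize {
    size_of::<SQLiteTextStr<'static>>()
}
```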

Handling the NULL case

Our type is not handling NULL pointers correctly: the planned to_str function not returning Option<&str> means we did not properly plan for this case. At this point, we can either change the function signature to return an Option or choose a neater variant: due to niche optimization[4], we know that we can build a type that will let us convert to Option<SQLiteTextStr> instead, with the Option<_> having the same size as a pointer. The standard library offers std::ptr::NonNull<T> for this:

#[repr(transparent)]
struct SQLiteTextStr<'a>{
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}
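The behavior of NonNull on its own can be sketched without any casting, using its checked constructor. The function names below are made up for illustration.

```rust
use std::mem::size_of;
use std::os::raw::c_char;
use std::ptr::NonNull;

/// Maps a possibly-NULL C pointer to an `Option`, with `NULL` becoming `None`.
pub fn non_null_text(ptr: *const c_char) -> Option<NonNull<c_char>> {
    NonNull::new(ptr as *mut c_char)
}

/// True if the `Option` wrapper takes no more space than a raw pointer.
pub fn option_is_pointer_sized() -> bool {
    size_of::<Option<NonNull<c_char>>>() == size_of::<*const c_char>()
}
```

NonNull::new performs the NULL check explicitly; the cast-based approach relies on the same mapping being baked into the memory representation.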

We need a guarantee that Option<SQLiteTextStr>::None is represented as a NULL pointer[5] and that niche optimization took place, i.e. that the Option is the same size as a bare SQLiteTextStr. We can add a test to be sure:

#[test]
fn ensure_size_and_repr_assertions_hold_for_sqlite_cstr() {
    // Size matches.
    assert_eq!(mem::size_of::<Option<SQLiteTextStr<'static>>>(),
               mem::size_of::<*const c_char>());

    // `NULL/0` is `None`.
    let val = <Option<SQLiteTextStr<'static>>>::None;
    assert_eq!(unsafe { *(&val as *const Option<SQLiteTextStr<'static>> as *const usize) }, 0);
}

An issue remains: if a user compiles our crate or application on a different architecture and does not run the tests, they could accidentally end up with undefined behavior. Using const assertions through the static_assertions crate[6], we can express them as compile-time checks instead:

// Ensure niche optimization worked as intended.
const_assert_eq!(
    mem::size_of::<Option<SQLiteTextStr<'static>>>(),
    mem::size_of::<*const c_char>()
);

// Ensure that a `None` value is represented as `NULL`.
const_assert_eq!(
    unsafe {
        *(&(<Option<SQLiteTextStr<'static>>>::None) as *const Option<SQLiteTextStr<'static>>
            as *const usize)
    },
    0
);
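A size check of this kind can also be sketched with the standard library alone, assuming a toolchain on which assert! may be used in const contexts (Rust 1.57 or later). The function name niche_applies is hypothetical, and NonNull<c_char> stands in for the wrapper type to keep the example self-contained.

```rust
use std::mem::size_of;
use std::os::raw::c_char;
use std::ptr::NonNull;

/// Compares the size of the `Option` wrapper with that of a raw pointer.
pub const fn niche_applies() -> bool {
    size_of::<Option<NonNull<c_char>>>() == size_of::<*const c_char>()
}

// Evaluated during compilation: if the condition were false, the build
// would fail instead of producing a binary with a wrong layout assumption.
const _: () = assert!(niche_applies());
```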

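According to the docs, NonNull<T> is covariant[7] over T, and our type should be covariant in its lifetime as well, which is a fancy way of saying that if two lifetimes 'a : 'b (meaning 'a lives as long as 'b or longer) are given, a SQLiteTextStr<'a> can be used wherever a SQLiteTextStr<'b> is expected. The PhantomData<&'a str> marker provides this, since &'a str is itself covariant in 'a.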
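Covariance can be demonstrated with a function that only type-checks if a longer-lived value may stand in for a shorter-lived one. In this sketch, from_cstr, as_ptr and shorten are hypothetical names; from_cstr validates UTF-8 so that it can stay a safe function.

```rust
use std::ffi::CStr;
use std::marker::PhantomData;
use std::os::raw::c_char;
use std::ptr::NonNull;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

impl<'a> SQLiteTextStr<'a> {
    /// Hypothetical constructor; returns `None` for non-UTF-8 content.
    pub fn from_cstr(text: &'a CStr) -> Option<Self> {
        text.to_str().ok()?;
        Some(SQLiteTextStr {
            ptr: NonNull::new(text.as_ptr() as *mut c_char)?,
            _phantom: PhantomData,
        })
    }

    /// Returns the wrapped pointer.
    pub fn as_ptr(self) -> *const c_char {
        self.ptr.as_ptr()
    }
}

/// Type-checks only because `SQLiteTextStr<'a>` is covariant in `'a`:
/// a value with the longer lifetime is accepted where the shorter one
/// is expected.
pub fn shorten<'long: 'short, 'short>(text: SQLiteTextStr<'long>) -> SQLiteTextStr<'short> {
    text
}
```

Had the marker been PhantomData<&'a mut &'a str> or a similar invariant type, the body of shorten would be rejected by the compiler.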

Our final consideration is Send and Sync: due to it containing a pointer, the compiler will not derive these traits automatically for SQLiteTextStr. Our type is plain old data that is not manipulated (there’s no mutating our SQLiteTextStr in any form), thus it is safe to send to other threads (Send) and to share between them (Sync)[8], so we can add the unsafe trait impls required.
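Whether the trait impls are in place can be verified with a generic function that only accepts Send + Sync types. The helper names require_send_sync and text_str_is_thread_safe are invented for this sketch; removing either unsafe impl makes it fail to compile.

```rust
use std::marker::PhantomData;
use std::os::raw::c_char;
use std::ptr::NonNull;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

// Safety (assumption carried over from the text): the pointee is read-only
// text that is neither mutated nor freed for the duration of `'a`.
unsafe impl<'a> Send for SQLiteTextStr<'a> {}
unsafe impl<'a> Sync for SQLiteTextStr<'a> {}

/// Accepts only types that are both `Send` and `Sync`.
pub fn require_send_sync<T: Send + Sync>() -> bool {
    true
}

/// Instantiates the check for our type.
pub fn text_str_is_thread_safe() -> bool {
    require_send_sync::<SQLiteTextStr<'static>>()
}
```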

We can finally write out the entire type, along with the documentation of its guarantees:

/// A reference to a NUL-terminated, UTF-8 encoded bytestring.
///
/// Typically allocated and managed by SQLite internally.
///
/// Guaranteed to be valid UTF-8 encoded data. Any pointer to a `c_char`
/// that is returned by SQLite as valid UTF-8 encoded, zero terminated
/// text can safely be cast into an `Option<SQLiteTextStr<'_>>`, a value of
/// `NULL` will result in `None`.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    /// Pointer to the allocated string.
    ptr: NonNull<c_char>,
    /// Reference tag.
    _phantom: PhantomData<&'a str>,
}

unsafe impl<'a> Send for SQLiteTextStr<'a> {}
unsafe impl<'a> Sync for SQLiteTextStr<'a> {}

impl<'a> SQLiteTextStr<'a> {
    /// Converts to a string reference.
    ///
    /// This function is not fallible since `SQLiteTextStr` is guaranteed to
    /// be valid UTF-8, however it requires scanning the entire string to
    /// determine its length.
    ///
    /// If the string contains any `NUL` bytes, the returned reference will
    /// end up shortened.
    #[inline(always)]
    pub fn to_str(self) -> &'a str {
        // Safety: We know `self.ptr` is both valid and `NUL`-terminated.
        let byte_slice = unsafe {
                // No cast to `i8`: `c_char` is unsigned on some platforms.
                CStr::from_ptr(self.ptr.as_ptr())
        }.to_bytes();

        // Safety: SQLite guarantees correct UTF-8, no need to check again.
        unsafe { str::from_utf8_unchecked(byte_slice) }
    }

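    /// Converts to a string reference with a known length.
    ///
    /// # Safety
    ///
    /// `len` must be the exact length of the string in bytes, excluding the
    /// terminating `NUL`.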
    #[inline(always)]
    pub unsafe fn to_str_with_len(self, len: usize) -> &'a str {
        // Safety: We have to rely on `len` being correct here.
        //         Even a zero-length string should be at least 1 byte long,
        //         due to the `NUL` byte, thus `ptr.as_ptr()` is valid.
        let byte_slice: &[u8] = slice::from_raw_parts(self.ptr.as_ptr() as *const u8, len);

        // Safety: SQLite guarantees correct UTF-8, no need to check again.
        str::from_utf8_unchecked(byte_slice)
    }
}

// Ensure niche optimization worked as intended.
const_assert_eq!(
    mem::size_of::<Option<SQLiteTextStr<'static>>>(),
    mem::size_of::<usize>()
);

// Ensure that a `None` value is represented as `NULL`.
const_assert_eq!(
    unsafe {
        *(&(<Option<SQLiteTextStr<'static>>>::None) as *const Option<SQLiteTextStr<'static>>
            as *const usize)
    },
    0
);

The to_str implementation is safe, as it relies entirely on assumptions baked into the type itself — an SQLiteTextStr<'_> that does not satisfy these should not exist. However, to_str_with_len relies on external information that must be checked by the caller, namely the length of the string. It is still a useful function to have, in the case where a user of the type already knows the length and wants to avoid a potentially expensive scan of the entire string. All of these are barely more than pointer casts, so #[inline(always)] is added to ensure they just “melt away” in most cases.
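The two conversion paths can be compared side by side. The sketch below repeats a compact version of the type so that it is self-contained; from_cstr is a hypothetical constructor that checks UTF-8 up front and only exists to obtain a value without a real SQLite connection.

```rust
use std::ffi::CStr;
use std::marker::PhantomData;
use std::os::raw::c_char;
use std::ptr::NonNull;
use std::{slice, str};

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

impl<'a> SQLiteTextStr<'a> {
    /// Hypothetical constructor; returns `None` for non-UTF-8 content.
    pub fn from_cstr(text: &'a CStr) -> Option<Self> {
        text.to_str().ok()?;
        let ptr = NonNull::new(text.as_ptr() as *mut c_char)?;
        Some(SQLiteTextStr { ptr, _phantom: PhantomData })
    }

    /// Scans for the terminating `NUL` to find the length.
    pub fn to_str(self) -> &'a str {
        let bytes = unsafe { CStr::from_ptr(self.ptr.as_ptr()) }.to_bytes();
        unsafe { str::from_utf8_unchecked(bytes) }
    }

    /// Skips the scan and trusts the caller.
    ///
    /// # Safety
    ///
    /// `len` must be the exact byte length of the text, excluding the `NUL`.
    pub unsafe fn to_str_with_len(self, len: usize) -> &'a str {
        let bytes = unsafe { slice::from_raw_parts(self.ptr.as_ptr() as *const u8, len) };
        unsafe { str::from_utf8_unchecked(bytes) }
    }
}
```

Both calls yield the same string when the supplied length is correct; only the second one needs an unsafe block at the call site.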

When casting pointers to references, it pays to keep an eye out for unbounded lifetimes. In the case of SQLiteTextStr<'_>, the burden is shifted to the creator of the instance, who must construct it with the correct lifetime 'a. The str reference returned by to_str carries that same 'a and therefore cannot outlive the lifetime the SQLiteTextStr was created with, which is the only thing the type itself can ensure.
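One possible way for the creator to keep the lifetime bounded is to hand the row to a closure, so that no reference can escape the call. This is a sketch under assumptions, not the approach the series necessarily takes: with_row is a hypothetical name, and the cast relies on Option<SQLiteTextStr> having the same layout as a nullable pointer.

```rust
use std::marker::PhantomData;
use std::os::raw::{c_char, c_int};
use std::ptr::NonNull;
use std::slice;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

impl<'a> SQLiteTextStr<'a> {
    /// Returns the wrapped pointer.
    pub fn as_ptr(self) -> *const c_char {
        self.ptr.as_ptr()
    }
}

/// Hypothetical helper: exposes one result row to `f`. The higher-ranked
/// `'row` lifetime prevents `f` from smuggling any reference out.
///
/// # Safety
///
/// If `columns` is non-null, it must point to `count` pointers, each either
/// NULL or pointing to NUL-terminated UTF-8 text valid during the call.
pub unsafe fn with_row<R>(
    count: c_int,
    columns: *mut *mut c_char,
    f: impl for<'row> FnOnce(&'row [Option<SQLiteTextStr<'row>>]) -> R,
) -> R {
    let row: &[Option<SQLiteTextStr<'_>>] = if columns.is_null() || count <= 0 {
        &[]
    } else {
        unsafe {
            slice::from_raw_parts(columns as *const Option<SQLiteTextStr<'_>>, count as usize)
        }
    };
    f(row)
}
```

Because the return type R cannot mention 'row, the closure can only return owned data or data that was already valid outside the call.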

Conclusion

Creating a safe wrapper for SQLite-provided UTF-8 encoded, NUL-terminated strings is more work than anticipated, but rewards us with a type that turns numerous safety guarantees, which in C are only present in the documentation, into actual compiler-enforced types. With these in place, our next step is to finally run our first query.


  1. As before, “C-style string” refers to zero- or NUL-terminated strings.

  2. For extremely strict requirements we could leverage that SQLite allows swapping out the memory allocator, thus we could fit one in ourselves with the same behavior on all platforms that satisfies all the requirements. For this article, we will stick to assuming that either glibc’s is present, or the alternative is similar enough to not violate the required invariants. Should we ever consider switching on memory-mapped IO, we will need to reexamine this section.

  3. If we specified the function as fn to_str(&self) -> &str, lifetime elision would turn it into fn to_str<'a>(&'a self) -> &'a str, which works out fine. However, we made the type Copy earlier, so we want to use owned receivers.

  4. Admittedly the link does not do the best job of explaining niche optimization; I may write a post about it in the future. Roughly summed up, it means that if the binary representation of a type has unused bit patterns, the compiler may use these to represent other enum variants to save space. A typical example is Option<NonZeroU32>: since its Some values can only be Some(value) without value ever being 0, the compiler can declare the bit pattern 0u32 equal to None, and any other value v as u32 equal to Some(v). In the end, the Option<NonZeroU32> has the same size as a regular u32.

  5. C guarantees at least 8 bits per byte, but makes no guarantees of it being exactly 8 bits. The jury is still out on whether NULL must be exactly the value 0, but our assertions will keep us covered. If we wanted to be extra safe, we should assert that std::ptr::null is 0 and the same as NULL in the C headers, but that sounds like overkill at this point.

  6. Proper support via rustc is still unstable.

  7. It would be a little lazy to just link the Rustonomicon’s subtyping section here, which is not easily digestible, so I will add an observation: while Rust does not prominently feature inheritance like some object-oriented languages, there is a concept of subtyping for lifetimes (only), which puts covariance back on the menu.

  8. This may change based on the memory allocator, as the memory holding the text must allow this to happen, but since we looked at the malloc internals earlier, we know it is multithread-aware.