With a way to open an SQLite database in place we can look towards running a query. The sqlite3_exec
function offers a quick start, bundling all steps required to run a query into a single function at the cost of flexibility:
int sqlite3_exec(
sqlite3*, /* An open database */
const char *sql, /* SQL to be evaluated */
int (*callback)(void*,int,char**,char**), /* Callback function */
void *, /* 1st argument to callback */
char **errmsg /* Error msg written here */
);
One type sticks out from the signature: char** (*mut *mut c_char in Rust), an array of C-style1 string values. The two parameters of this type passed to the callback function contain the column values and column names of a given result row.
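To make the shape of the callback concrete, here is a minimal sketch of what a matching Rust function could look like. The name count_columns and the use of the user-data pointer as a *mut usize are assumptions made purely for illustration; only the parameter types follow from the C signature above.

```rust
use std::ffi::c_void;
use std::os::raw::{c_char, c_int};

/// Hypothetical callback matching the C signature
/// `int (*callback)(void*, int, char**, char**)`.
///
/// It treats the user-data pointer as a `*mut usize` and adds the
/// number of columns of each row to it. Returning 0 tells SQLite
/// to continue with the next row.
pub extern "C" fn count_columns(
    user_data: *mut c_void,
    column_count: c_int,
    _values: *mut *mut c_char,
    _names: *mut *mut c_char,
) -> c_int {
    if user_data.is_null() || column_count < 0 {
        // A non-zero return value makes `sqlite3_exec` abort the query.
        return 1;
    }
    // Safety (assumption): the caller passed a valid `*mut usize` as user data.
    let total = unsafe { &mut *(user_data as *mut usize) };
    *total += column_count as usize;
    0
}
```

The sketch deliberately ignores both char** arguments; wrapping them safely is what the rest of this post is about.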
When CStr is not an Option
The double pointer (char**
/ *mut *mut c_char
) is conceptually similar to what in Rust would be a &[&CStr]
, a slice of zero-terminated string references, with the core difference that the C type does not contain the length. In Rust parlance, a char** is called a thin pointer, which is the size of a single word (i.e. usize), whereas fat pointers are larger at typically two words, containing the length or other information to determine the total size of the object they point at. Fat pointers are important for memory safety: if the pointee (i.e. the object being pointed at) is variably sized, it is impossible to determine from a thin pointer alone whether or not a read is out of bounds.
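The size difference between thin and fat pointers can be observed directly. The following sketch uses two helper functions whose names are made up for this example; they simply report the sizes the compiler assigns.

```rust
use std::ffi::CStr;
use std::mem::size_of;
use std::os::raw::c_char;

/// Size of a thin pointer: one machine word.
pub fn thin_pointer_size() -> usize {
    size_of::<*mut *mut c_char>()
}

/// Size of a slice reference, a fat pointer made of address and length.
pub fn slice_pointer_size() -> usize {
    size_of::<&[&CStr]>()
}
```

On any target, the first function returns the size of a usize and the second returns twice that.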
The length is present in the C interface as well, in the form of the int
parameter passed to the callback
function. For a safe wrapper, we should wrap these SQLite-allocated strings using native Rust types, allowing the compiler to check for memory safety. Here is a first attempt:
// Naming our arguments for the sake of example.
let arg_1_count: c_int = todo!();
let arg_3_columns: *mut *mut c_char = todo!();
// !!! DO NOT DO THIS, UNDEFINED BEHAVIOR !!!
let columns: &[&CStr] = unsafe { slice::from_raw_parts(
arg_3_columns as *const &CStr,
arg_1_count as usize
) };
The slice::from_raw_parts
function is used to construct a slice, a fat pointer, from a thin pointer and a length. It does come with a set of requirements (condition numbers $C_{a..g}$ have been added):
- $C_a:$ data must be valid for reads for len * mem::size_of::<T>() many bytes, and it must be properly aligned. […]
- $C_b:$ The entire memory range of this slice must be contained within a single allocated object!
- $C_c:$ data must be non-null and aligned even for zero-length slices.
- $C_d:$ data must point to len consecutive properly initialized values of type T.
- $C_e:$ The memory referenced by the returned slice must not be mutated for the duration of lifetime 'a, except inside an UnsafeCell.
- $C_f:$ The total size len * mem::size_of::<T>() of the slice must be no larger than isize::MAX, and $C_g:$ adding that size to data must not “wrap around” the address space. See the safety documentation of pointer::offset.
The SQLite documentation states that it uses the system memory allocator by default, which on the majority of Linux systems will be glibc’s malloc
implementation2. From the documentation of its internals, we know that the memory it returns is suitably aligned, taking care of $C_a$, and nothing leads us to believe that SQLite is performing alchemy on memory fragments that would violate $C_b$. However, $C_c$ is an issue: we have no guarantee that something returning zero result columns would supply a non-NULL pointer for arg_3_columns, so we will need to explicitly guard against this case later on. malloc will also not wrap around, satisfying $C_g$, and since our returned types are all immutable and not modified by SQLite itself, $C_e$ will not be an issue.
For $C_f$, we can determine the maximum size by looking at the SQLite limits: SQLITE_MAX_COLUMN, set at compile time and defaulting to 2000, results in 2000 * mem::size_of::<*mut c_char>() being roughly 16 KB on a 64-bit system, which is well below the roughly 9 exabytes of address space that isize::MAX affords us.
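The arithmetic for $C_f$ can be written down as a small check. The constant and function names below are invented for this sketch; the value 2000 is the default limit mentioned above, and a build of SQLite with a different limit would need a different constant.

```rust
use std::mem::size_of;
use std::os::raw::c_char;

/// Default value of SQLite's compile-time `SQLITE_MAX_COLUMN` limit.
const SQLITE_MAX_COLUMN_DEFAULT: usize = 2000;

/// Upper bound in bytes for the pointer array handed to the callback.
pub fn max_column_array_bytes() -> usize {
    SQLITE_MAX_COLUMN_DEFAULT * size_of::<*mut c_char>()
}

/// Checks condition C_f: the slice must not exceed `isize::MAX` bytes.
pub fn satisfies_size_limit() -> bool {
    max_column_array_bytes() <= isize::MAX as usize
}
```

With 8-byte pointers the bound works out to 16,000 bytes, several orders of magnitude below the limit.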
The biggest problem with this code is the explicit cast, which leads to undefined behavior by violating $C_d$: a *const CStr cannot be constructed by casting a *const c_char, due to the lack of a #[repr(C)] that would ensure ABI compatibility, as the documentation for CStr explicitly states:
Note that this structure is not repr(C) and is not recommended to be placed in the signatures of FFI functions. Instead, safe wrappers of FFI functions may leverage the unsafe CStr::from_ptr constructor to provide a safe interface to other consumers.
Unfortunately CStr::from_ptr is a bad fit for our array of strings: we cannot call it on every element of the array without creating a new container to hold the resulting &CStr instances, which is wasteful. Another issue is that CStr::from_ptr may be expensive to call, as it is actually $\mathcal{O}(n)$ in complexity, scanning the entire string for a NUL-byte to determine its length.
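For comparison, this is roughly what the CStr::from_ptr route would look like. It is a sketch of the approach being rejected here, not a recommendation; the helper name collect_columns is hypothetical, and it assumes that none of the column pointers is NULL.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};
use std::slice;

/// Hypothetical helper: builds a `Vec<&CStr>` from the callback arguments.
///
/// # Safety
///
/// `columns` must point to `count` valid, non-null, NUL-terminated strings
/// that stay alive and unmodified for the lifetime `'a`.
pub unsafe fn collect_columns<'a>(
    count: c_int,
    columns: *const *const c_char,
) -> Vec<&'a CStr> {
    if columns.is_null() || count <= 0 {
        return Vec::new();
    }
    // Raw pointers have a defined layout, so this slice is well defined.
    let raw: &[*const c_char] = unsafe { slice::from_raw_parts(columns, count as usize) };
    // Each `from_ptr` call scans one string for its NUL byte, and the
    // resulting fat references need a freshly allocated container.
    raw.iter().map(|&ptr| unsafe { CStr::from_ptr(ptr) }).collect()
}
```

Every row would pay for one heap allocation plus one scan per column, whether or not the caller ever looks at the values.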
A custom type: SQLiteTextStr
We will need to come up with our own type without CStr
’s drawbacks. In the simplest case it could look like this:
/// A NUL-terminated UTF-8 bytestring.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct SQLiteTextStr(*const c_char);
// Note: Do not use, still not well defined.
let columns: &[SQLiteTextStr] = unsafe { slice::from_raw_parts(
arg_3_columns as *const SQLiteTextStr,
arg_1_count as usize
) };
The cast of *mut c_char
to this type is perfectly fine, going from mutable to constant, and the #[repr(transparent)]
decoration ensures the binary representation is the same.
The value of our wrapping newtype is in its added invariants: according to the docs, the strings created by sqlite3_column_text are guaranteed to be valid UTF-8. SQLiteTextStr reflects this, as well as the strings' NUL-termination. Copy is added since the type is just a read-only pointer, just as any &T in Rust is Copy as well.
Converting to str
A well-defined conversion function fn to_str(self) -> &str would seem useful, except that the function is invalid because self does not carry a lifetime3. This is fixed by adding a lifetime parameter 'a to the type, which in turn requires PhantomData:
#[repr(transparent)]
struct SQLiteTextStr<'a>{
ptr: *const c_char,
_phantom: PhantomData<&'a str>,
}
This type is still ABI-compatible with *const c_char
due to the fact that PhantomData
is guaranteed to be zero-sized — otherwise repr(transparent)
would cause it to not compile.
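The layout claim can be checked with a few size and alignment comparisons. The struct below repeats the definition from the text so that the sketch is self-contained; the function name is an assumption for this example.

```rust
use std::marker::PhantomData;
use std::mem::{align_of, size_of};
use std::os::raw::c_char;

/// Same shape as the type from the text, repeated to keep the sketch self-contained.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: *const c_char,
    _phantom: PhantomData<&'a str>,
}

/// True if the wrapper has the same size and alignment as a raw pointer.
pub fn layout_matches_raw_pointer() -> bool {
    size_of::<PhantomData<&'static str>>() == 0
        && size_of::<SQLiteTextStr<'static>>() == size_of::<*const c_char>()
        && align_of::<SQLiteTextStr<'static>>() == align_of::<*const c_char>()
}
```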
Handling the NULL case
Our type does not handle NULL-pointers correctly: the planned to_str function not returning Option<&str> means we did not properly plan for this case. At this point, we can either change the function signature to return an Option or choose a neater variant: due to niche optimization4 we know that we can build a type that will let us convert to Option<SQLiteTextStr> instead, with the Option<_> having the same size as a pointer. The standard library offers std::ptr::NonNull<T> for this:
#[repr(transparent)]
struct SQLiteTextStr<'a>{
ptr: NonNull<c_char>,
_phantom: PhantomData<&'a str>,
}
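As a small illustration of how NonNull maps the NULL case onto Option, consider the following sketch. The helper name checked is made up for this example; NonNull::new is the standard library constructor that performs the null check.

```rust
use std::os::raw::c_char;
use std::ptr::NonNull;

/// Hypothetical helper: turns a raw column pointer into an `Option`.
pub fn checked(ptr: *mut c_char) -> Option<NonNull<c_char>> {
    // `NonNull::new` returns `None` for a null pointer.
    NonNull::new(ptr)
}
```

Because NULL is the one bit pattern a NonNull can never hold, the resulting Option needs no extra space for its discriminant.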
We need a guarantee that Option<SQLiteTextStr>::None
will be represented as a NULL
-pointer5 and that niche optimization took place, i.e. that it is the same size as just SQLiteTextStr
. We can add a test to be sure:
#[test]
fn ensure_size_and_repr_assertions_hold_for_sqlite_cstr() {
// Size matches.
assert_eq!(mem::size_of::<Option<SQLiteTextStr<'static>>>(),
mem::size_of::<*const c_char>());
// `NULL/0` is `None`.
let val = <Option<SQLiteTextStr<'static>>>::None;
assert_eq!(unsafe { *(&val as *const Option<SQLiteTextStr<'static>> as *const usize) }, 0);
}
An issue remains: If a user compiles our crate or application on a different architecture and does not run the tests, they could accidentally end up with undefined behavior. Using const assertions through the static_assertions
crate6, we can express them as compile time checks instead:
// Ensure niche optimization worked as intended.
const_assert_eq!(
mem::size_of::<Option<SQLiteTextStr<'static>>>(),
mem::size_of::<*const c_char>()
);
// Ensure that a `None` value is represented as `NULL`.
const_assert_eq!(
unsafe {
*(&(<Option<SQLiteTextStr<'static>>>::None) as *const Option<SQLiteTextStr<'static>>
as *const usize)
},
0
);
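As an alternative sketch that avoids the extra crate, the size check can also be written with a plain assert! in a const item, which should work on compilers from Rust 1.57 onwards. The NULL check is kept as an ordinary function here; its name is an assumption, and the struct is repeated so that the example stands on its own.

```rust
use std::marker::PhantomData;
use std::mem;
use std::os::raw::c_char;
use std::ptr::NonNull;

#[repr(transparent)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

// Compile-time check: fails the build if the `Option` is not pointer-sized.
const _: () = assert!(
    mem::size_of::<Option<SQLiteTextStr<'static>>>() == mem::size_of::<*const c_char>()
);

/// Run-time check that `None` is stored as a null pointer.
pub fn none_is_null() -> bool {
    let none: Option<SQLiteTextStr<'static>> = None;
    // Safety: both types are pointer-sized, as asserted above.
    let raw: *const c_char = unsafe { mem::transmute(none) };
    raw.is_null()
}
```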
According to the docs, NonNull<T> is covariant7 over T and must only be used where that is correct. For the lifetime of our type, covariance is a fancy way of saying that if two lifetimes 'a : 'b (meaning 'a lives as long as 'b or longer) are given, a SQLiteTextStr<'a> may be used wherever a SQLiteTextStr<'b> is expected.
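Covariance over the lifetime can be demonstrated with a function that does nothing but return its argument under a shorter lifetime. The names shorten, from_raw and as_ptr are hypothetical helpers added for this sketch only.

```rust
use std::marker::PhantomData;
use std::os::raw::c_char;
use std::ptr::NonNull;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

impl<'a> SQLiteTextStr<'a> {
    /// Hypothetical constructor for this sketch.
    ///
    /// # Safety
    ///
    /// `ptr` must point to NUL-terminated UTF-8 data valid for `'a`.
    pub unsafe fn from_raw(ptr: NonNull<c_char>) -> Self {
        SQLiteTextStr { ptr, _phantom: PhantomData }
    }

    /// Returns the wrapped pointer without dereferencing it.
    pub fn as_ptr(self) -> *const c_char {
        self.ptr.as_ptr()
    }
}

/// Compiles only because `SQLiteTextStr<'a>` is covariant over `'a`:
/// a value with the longer lifetime `'long` is accepted where the
/// shorter lifetime `'short` is expected.
pub fn shorten<'long: 'short, 'short>(text: SQLiteTextStr<'long>) -> SQLiteTextStr<'short> {
    text
}
```

Had the marker been PhantomData<&'a mut &'a str> or similar, the type would be invariant and shorten would be rejected by the compiler.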
Our final consideration is Send and Sync: because it contains a pointer, the compiler will not derive these traits automatically for SQLiteTextStr. Our type is plain old data that is not manipulated (there’s no mutating our SQLiteTextStr in any form), thus it is safe to send to other threads (Send) and shareable (Sync)8, so we can add the unsafe trait impls required.
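A common way to make sure such impls stay in place is a generic function with the desired trait bounds. The following is a sketch; assert_send_sync and check are names chosen for this example, and the safety comment restates the assumption made in the text.

```rust
use std::marker::PhantomData;
use std::os::raw::c_char;
use std::ptr::NonNull;

#[repr(transparent)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

// Safety (assumption from the text): the pointee is never mutated while
// a `SQLiteTextStr` referring to it exists.
unsafe impl<'a> Send for SQLiteTextStr<'a> {}
unsafe impl<'a> Sync for SQLiteTextStr<'a> {}

/// Hypothetical helper that only compiles for `Send + Sync` types.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Fails to compile if either `unsafe impl` above is removed.
pub fn check() {
    assert_send_sync::<SQLiteTextStr<'static>>();
}
```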
We can finally write out the entire type, along with the documentation of its guarantees:
/// A reference to a NUL-terminated, UTF-8 encoded bytestring.
///
/// Typically allocated and managed by SQLite internally.
///
/// Guaranteed to be valid UTF-8 encoded data. Any pointer to a `c_char`
/// that is returned by SQLite as valid UTF-8 encoded, zero terminated
/// text can safely be cast into an `Option<SQLiteTextStr<'_>>`, a value of
/// `NULL` will result in `None`.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
/// Pointer to the allocated string.
ptr: NonNull<c_char>,
/// Reference tag.
_phantom: PhantomData<&'a str>,
}
unsafe impl<'a> Send for SQLiteTextStr<'a> {}
unsafe impl<'a> Sync for SQLiteTextStr<'a> {}
impl<'a> SQLiteTextStr<'a> {
/// Converts to a string reference.
///
/// This function is not fallible since `SQLiteTextStr` is guaranteed to
/// be valid UTF-8, however it requires scanning the entire string to
/// determine its length.
///
/// If the string contains any `NUL` bytes, the returned reference will
/// end up shortened.
#[inline(always)]
pub fn to_str(self) -> &'a str {
// Safety: We know `self.ptr` is both valid and `NUL`-terminated.
let byte_slice = unsafe {
CStr::from_ptr(self.ptr.as_ptr())
}.to_bytes();
// Safety: SQLite guarantees correct UTF-8, no need to check again.
unsafe { str::from_utf8_unchecked(byte_slice) }
}
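/// Converts to a string reference with a known length.
///
/// # Safety
///
/// `len` must not exceed the length of the string in bytes (excluding
/// the terminating `NUL`) and must fall on a UTF-8 character boundary.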
#[inline(always)]
pub unsafe fn to_str_with_len(self, len: usize) -> &'a str {
// Safety: We have to rely on `len` being correct here.
// Even a zero-length string should be at least 1 byte long,
// due to the `NUL` byte, thus `ptr.as_ptr()` is valid.
let byte_slice: &[u8] = slice::from_raw_parts(self.ptr.as_ptr() as *const u8, len);
// Safety: SQLite guarantees correct UTF-8, no need to check again.
str::from_utf8_unchecked(byte_slice)
}
}
// Ensure niche optimization worked as intended.
const_assert_eq!(
mem::size_of::<Option<SQLiteTextStr<'static>>>(),
mem::size_of::<usize>()
);
// Ensure that a `None` value is represented as `NULL`.
const_assert_eq!(
unsafe {
*(&(<Option<SQLiteTextStr<'static>>>::None) as *const Option<SQLiteTextStr<'static>>
as *const usize)
},
0
);
The to_str
implementation is safe, as it relies entirely on assumptions baked into the type itself — an SQLiteTextStr<'_>
that does not satisfy these should not exist. However, to_str_with_len
relies on external information that must be checked by the caller, namely the length of the string. It is still a useful function to have, in the case where a user of the type already knows the length and wants to avoid a potentially expensive scan of the entire string. All of these are barely more than pointer casts, so #[inline(always)]
is added to ensure they just “melt away” in most cases.
When casting pointers to references, it pays to keep an eye out for unbounded lifetimes. In the case of SQLiteTextStr<'_>, the burden is shifted to the creator of the instance, who must pick the correct lifetime 'a. The returned str reference is bound to that same lifetime 'a and thus never outlives it, which is the only thing the type itself can ensure.
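Putting the pieces together, the callback's column array could be wrapped along the following lines. This is a sketch under the assumptions stated in its safety comment: wrap_columns is a hypothetical helper, the struct and to_str are repeated from the listing above to keep the example self-contained, and the explicit guard covers the $C_c$ case of a NULL array pointer.

```rust
use std::ffi::CStr;
use std::marker::PhantomData;
use std::os::raw::{c_char, c_int};
use std::ptr::NonNull;
use std::slice;

#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct SQLiteTextStr<'a> {
    ptr: NonNull<c_char>,
    _phantom: PhantomData<&'a str>,
}

impl<'a> SQLiteTextStr<'a> {
    /// Same conversion as `to_str` in the text.
    pub fn to_str(self) -> &'a str {
        // Safety: the type guarantees a valid, NUL-terminated pointer.
        let bytes = unsafe { CStr::from_ptr(self.ptr.as_ptr()) }.to_bytes();
        // Safety: the type guarantees valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}

/// Hypothetical helper wrapping the callback's column array.
///
/// # Safety
///
/// If `columns` is non-null it must point to `count` pointers, each either
/// `NULL` or a NUL-terminated UTF-8 string, all valid and unmodified for `'a`.
pub unsafe fn wrap_columns<'a>(
    count: c_int,
    columns: *mut *mut c_char,
) -> &'a [Option<SQLiteTextStr<'a>>] {
    // Guard for condition C_c: never build a slice from a null pointer.
    if columns.is_null() || count <= 0 {
        return &[];
    }
    // The cast relies on `Option<SQLiteTextStr>` having the layout of a
    // nullable pointer, as checked by the assertions in the text.
    unsafe {
        slice::from_raw_parts(columns as *const Option<SQLiteTextStr<'a>>, count as usize)
    }
}
```

The caller still has to choose 'a so that it does not outlive the callback invocation, since SQLite owns the underlying memory.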
Conclusion
Creating a safe wrapper for SQLite-provided UTF-8 encoded, NUL-terminated strings is more work than anticipated, but it rewards us with a type that encodes numerous safety guarantees, which in C code are only present in the documentation, as actual compiler-enforced types. With these in place, our next step is to finally run our first query.
1. As before, “C-style string” refers to zero- or NUL-terminated strings.
2. For extremely strict requirements we could leverage that SQLite allows swapping out the memory allocator, thus we could fit one in ourselves with the same behavior on all platforms that satisfies all the requirements. For this article, we will stick to assuming that either glibc’s is present, or the alternative is similar enough to not violate the required invariants. Should we ever consider switching on memory mapped IO, we will need to reexamine this section.
3. If we specified the function as fn to_str(&self) -> &str, lifetime elision would turn it into fn to_str<'a>(&'a self) -> &'a str, which works out fine. However, we made the type Copy earlier, so we want to use owned receivers.
4. Admittedly the link does not do the best job of explaining niche optimization; I may write a post about it in the future. Roughly summed up, it means that if the binary representation of a type has unused bit patterns, the compiler may use these to represent various enum variants to save space. A typical example is an Option<NonZeroU32>, which is the same size as a regular u32. Since its Some values can only be Some(value) without value ever being 0, the compiler can declare the bit pattern of just 0u32 equal to None, and any other value v as u32 equal to Some(v). In the end, the Option<NonZeroU32> has the same size as a regular u32.
5. C guarantees at least 8 bits per byte, but makes no guarantees of it being exactly 8 bits. The jury is still out on whether NULL must be exactly the value 0, but our assertions will keep us covered. If we wanted to be extra safe, we should assert that std::ptr::null is 0 and the same as NULL in the C headers, but that sounds like overkill at this point.
6. Proper support via rustc is still unstable.
7. It would be a little lazy to just link the Rust nomicon’s subtyping section here, which is not easily digestible, so I will add an observation: while Rust does not prominently feature inheritance like some object oriented languages, there is a concept of subtyping for lifetimes (only), which puts covariance back on the menu.
8. This may change based on the memory allocator, as the memory holding the text must allow this to happen, but since we looked at the malloc internals earlier, we know it is multithread-aware.