In Place String Mapping
Date: Message-Id: https://www.5snb.club/posts/2021/in-place-string-mapping/
A while ago, I saw a post about how C did in-place string modification better than Rust. So, for example, you could take a string that is percent encoded, and decode it, perfectly in place, without external allocations.
This seemed rather neat, so I went out to implement it in Rust, so this can be done safely.
The C version that inspired this post is:
#include <stdlib.h>
#include <stdio.h>
void decode_inplace(char *s) {
char *in = s, *out = s;
char cur[3] = { 0 };
while (*in) {
if (*in == '%') {
cur[0] = *(in + 1);
cur[1] = *(in + 2);
*out = strtol(cur, NULL, 16);
in += 2;
} else {
*out = *in;
}
++in;
++out;
}
*out = '\0';
}
int main() {
char s[] = "abc%64%65fg";
decode_inplace(s);
printf("%s\n", s);
}
Let’s see what that looks like using my library.
use in_place_string_map::MapInPlace;
fn decode_percent(s: &mut str) -> &mut str {
let mut m = MapInPlace::new(s);
while let Some(c) = m.pop() {
match c {
'%' => {
let num = m.pop_chars(2).expect("not enough chars");
let n = u8::from_str_radix(num, 16).expect("invalid hex");
m.push(n as char).expect("no more capacity");
}
_ => {
m.push(c).expect("no more capacity");
}
}
}
m.into_mapped()
}
fn main() {
let mut percent = "abc%64%65fg".to_string();
let mapped_percent = decode_percent(&mut percent);
assert_eq!(mapped_percent, "abcdefg");
}
Pretty neat!
The MapInPlace
takes a &mut str
of arbitrary lifetime, and lets you do
operations such as pop_chars
and push
on it.
It works by internally converting the &mut str
into a &mut [u8]
and doing
operations on that byte array. It needs to be careful to ensure that as soon as
it gives control back to you, it’s valid UTF-8, since you cannot rely on
Drop
being called in order to ensure safety, since you could leak the object
and regain access to the backing &mut str
.
There are two offsets internally, mapped_head
and unmapped_head
. Anything
from the start of the string to mapped_head
is data that’s been push
ed to
the string. Anything from unmapped_head
to the end of the string is data that
is yet to be popped. And the in-between zone is unspecified but must still be
valid UTF-8 (But the current implementation zeros it out after every operation,
conservatively.)
If you try to push more data than you are popping, then you will get an error, as there’s no space. So this technique only works when the output string is always as long as or shorter than the input string.
After you do your string modifications, you can get a reference to the part of
the slice that has been mapped using into_mapped()
. Continuing to use the
original slice that was passed in is a footgun, as it will also contain the
unspecified section, as well as any unmapped characters. Originally the design
took a &mut String
to avoid that, and truncated the buffer on Drop
, but
taking a reference to a String
means this library can’t be used in no_std
crates.
The library is on my github at
https://github.com/5225225/in-place-string-map. There’s definitely room
for performance optimisations (in push_str
, the zeroing step isn’t ideal,
since it makes push_str
O(n) where n is the number of characters in the
in-between zone, but other solutions seemed to allow invalid UTF-8 to slip
through).
I haven’t proven this code safe to myself yet, but it’s probably fine.