What is Datafix?
Datafix is a serialization & deserialization framework for Rust, built around `Codec` and `CodecOps`. It also provides facilities for fixing up the schemas of old data with `TypeRewriteRule`s. Datafix has its roots in Mojang's DataFixerUpper library, reimplemented in Rust and made more ergonomic for the language. The library has also been slightly simplified relative to DataFixerUpper, especially in the `TypeRewriteRule`s.
The reason I originally created it was that the original DataFixerUpper is implemented in Java. While it is a very good library to work with in my opinion, it falls flat due to Java's limitations & type erasure. These also introduce a lot of overhead, with things such as `Codec`s requiring lots of pointer chasing. I felt these issues could be solved very easily in Rust, and I believe I was correct.
Please note that currently, this project is unmaintained.
Definitions
Datafix has a few core types for serialization:
`Codec`s are structures that allow you to transform types into each other. For example, you could turn a user's data into JSON, and vice versa. `CodecOps` are helper types that `Codec`s use; the trait defines an interface that facilitates converting between the different formats.
Code Example
Let’s say you have a struct of UserData:
#[derive(Debug, Clone, PartialEq)]
struct UserData {
    username: String,
    id: i32
}
impl UserData {
    pub fn new(username: String, id: i32) -> Self { ... }
    pub fn username(&self) -> &String { ... }
    pub fn id(&self) -> &i32 { ... }
}
You want to be able to serialize & deserialize this to different in-memory formats. This can be accomplished by giving it a DefaultCodec implementation.
impl<OT, O: CodecOps<OT>> DefaultCodec<OT, O> for UserData {
    fn codec() -> impl Codec<Self, OT, O> {
        MapCodecBuilder::new()
            .field(String::codec().field_of("username", UserData::username))
            .field(i32::codec().field_of("id", UserData::id))
            .build(UserData::new)
    }
}
Now, you can use it with a CodecOps to serialize & deserialize.
let data = UserData::new("Endistic".to_string(), 19);
let encoded = UserData::codec().encode(&JsonOps, &data);
let decoded = UserData::codec().decode(&JsonOps, &encoded);
assert_eq!(data, decoded);
In this example, JsonOps is provided by datafix as the primary CodecOps for transforming values to and from JSON.
`encoded` here holds the same information as `data`, but represented as JSON, because `JsonOps` was passed into `encode`.
`decoded` should be equal to `data`, since `encode` and `decode` are required to round-trip: decoding an encoded value must return the original.
Given a type `X`, a value `x` of that type, and a `CodecOps` named `ops`:
assert_eq!(X::codec().decode(&ops, &X::codec().encode(&ops, &x)), x);
should always hold true. This is important since it ensures purity between encoding & decoding. It also means that even when you change formats, your types should be represented relatively the same across formats.
Using this principle, you can create a test to ensure your Codecs work as you want using the following patterns:
// Test data encoding & decoding
let data = <sample value>;
let encoded = <type>::codec().encode(&JsonOps, &data);
let decoded = <type>::codec().decode(&JsonOps, &encoded);
assert_eq!(data, decoded);
// test encoding
let data = <sample value>;
let encoded = <type>::codec().encode(&JsonOps, &data);
assert_eq!(encoded, <sample encoded value in Json expected>);
// test decoding
let encoded = <sample encoded value in Json>;
let decoded = <type>::codec().decode(&JsonOps, &encoded);
assert_eq!(<expected value>, decoded);
Make sure to replace the placeholders above appropriately.
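To see why the round-trip law matters, here is a self-contained sketch of it using a toy codec built only on the standard library. The `ToyCodec` trait and `Value` enum are hypothetical stand-ins for illustration, not datafix's actual API:

```rust
use std::collections::BTreeMap;

// A toy JSON-like data model, standing in for the output of a CodecOps.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
    Map(BTreeMap<String, Value>),
}

// A minimal stand-in for a Codec: encode into a Value, decode back out.
trait ToyCodec: Sized {
    fn encode(&self) -> Value;
    fn decode(value: &Value) -> Option<Self>;
}

#[derive(Debug, Clone, PartialEq)]
struct UserData {
    username: String,
    id: i32,
}

impl ToyCodec for UserData {
    fn encode(&self) -> Value {
        let mut map = BTreeMap::new();
        map.insert("username".to_string(), Value::Str(self.username.clone()));
        map.insert("id".to_string(), Value::Int(self.id));
        Value::Map(map)
    }

    fn decode(value: &Value) -> Option<Self> {
        let Value::Map(map) = value else { return None };
        let Value::Str(username) = map.get("username")? else { return None };
        let Value::Int(id) = map.get("id")? else { return None };
        Some(UserData { username: username.clone(), id: *id })
    }
}

fn main() {
    let data = UserData { username: "Endistic".to_string(), id: 19 };
    // The round-trip law: decode(encode(x)) == x.
    let decoded = UserData::decode(&data.encode()).unwrap();
    assert_eq!(decoded, data);
    println!("round-trip ok");
}
```

If `encode` and `decode` ever disagree (say, `encode` writes a field that `decode` never reads), this assertion catches it immediately, which is exactly what the test patterns above check.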
Built-in Codec Types
However, this alone is too simplistic for what you may want to do. All of the above are plain transformations that you could probably write yourself in a few minutes. `Codec`s can be fed into adapters to do more interesting things.
Adapters
All adapters are available as methods on `Codec`s.
- To get a codec over a list rather than a single instance (i.e. `T` → `Vec<T>`), use the `list_of` method. For example, `f64::codec().list_of()` gives a `Codec<Vec<f64>>`.
- To add a mapping function for serializing & deserializing, use `xmap`. `xmap` lets you convert values from `T` to `U` and back again on deserialization. This allows you to store a value of type `T` but expose it as type `U`.
- To turn `(T, U)` into a codec, use `T::codec().pair(U::codec())`. This may later be expanded into a general tuple codec.
- If a codec fails to encode, use the `try_else` adapter to try another codec and return its result. You can also use `or_else` to fall back to a default value, but `or_else` only works for deserialization.
For example, `try_else` allows you to combine different `Codec`s for one type. If you have a dynamic map type, you could represent it either as `[(key1, value1), (key2, value2)]` or as `{ key1: value1, key2: value2 }`. Using `try_else` lets you handle both of these cases, preferring the first codec passed to it.
- To make a `Codec` dynamically dispatched, use the `DynamicCodec` type, obtained by calling the `dynamic` method on a `Codec`. You can also wrap it in a `Box<T>` using `boxed`, and in an `Arc<T>` using `arc`.
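As a rough illustration of what an `xmap`-style adapter does, the sketch below stores a `Duration` as its whole seconds but exposes it as a `Duration`. The `XMap` struct here is a hypothetical stand-in written against the standard library, not datafix's actual type:

```rust
use std::time::Duration;

// A hypothetical xmap-style pair of conversions: one function per direction.
struct XMap<T, U> {
    to: fn(&T) -> U,
    from: fn(&U) -> T,
}

fn main() {
    // Store a Duration as whole seconds (u64), but expose it as a Duration.
    let seconds_codec = XMap::<Duration, u64> {
        to: |d| d.as_secs(),
        from: |secs| Duration::from_secs(*secs),
    };

    let original = Duration::from_secs(90);
    let stored = (seconds_codec.to)(&original);   // what would be serialized
    let restored = (seconds_codec.from)(&stored); // what you get back
    assert_eq!(stored, 90);
    assert_eq!(restored, original);
    println!("stored as {stored} seconds");
}
```

The point is that the serialized representation (`u64`) and the in-memory type (`Duration`) no longer have to match; the adapter carries the two conversion functions.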
Dedicated types
There are also special ways to make `Codec`s using dedicated types & methods.
- `MapCodecBuilder` can be used to make a `Codec` for a `struct`. Adapting a codec with `field_of` turns it into a field of a map; use `optional_field_of` if the field is not required, and `default_field_of` if the field is not required and has a default value.
- Use `Codecs::recursive` to define a codec that contains itself. This is useful for things like recursive trees & linked lists.
- Use `Codecs::either` to attempt serializing & deserializing with two codecs. It will try the first codec and, if that fails, try the second. Unlike `try_else`, this returns an `Either<T, U>`, so the two codecs can produce different types.
- Use `Codec::dispatch` to delegate behavior to different codecs depending on the value passed into the encoding & decoding steps. The function generally takes the form `fn(T) -> Codec`.
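The try-the-first-then-the-second behavior of `Codecs::either` can be sketched in plain Rust. The `Either` enum below is a hypothetical stand-in for the one datafix returns:

```rust
// A hypothetical Either type, like the one an either-style combinator returns.
#[derive(Debug, PartialEq)]
enum Either<L, R> {
    Left(L),
    Right(R),
}

// Try the first interpretation (an integer); if that fails, fall back to the
// second (keep the raw string). The two arms can have different types.
fn int_or_string(input: &str) -> Either<i32, String> {
    match input.parse::<i32>() {
        Ok(n) => Either::Left(n),
        Err(_) => Either::Right(input.to_string()),
    }
}

fn main() {
    assert_eq!(int_or_string("42"), Either::Left(42));
    assert_eq!(int_or_string("hello"), Either::Right("hello".to_string()));
    println!("either ok");
}
```

This is the key difference from `try_else`: because the result is an `Either<T, U>`, the caller knows which codec succeeded and gets a differently typed value for each case.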
Full list
You can see the current full list of available codecs in `CodecAdapters` and `Codecs` in the source code.
Datafixing
Datafix, as its name implies, also allows you to fix up data. What does this mean?
Let's say you're updating your UserData struct from above, and you want to give it a volume field storing the user's volume level.
Before:
#[derive(Debug, Clone, PartialEq)]
struct UserData {
    username: String,
    id: i32
}
impl UserData {
    pub fn new(username: String, id: i32) -> Self { ... }
    pub fn username(&self) -> &String { ... }
    pub fn id(&self) -> &i32 { ... }
}
And after:
#[derive(Debug, Clone, PartialEq)]
struct UserData {
    username: String,
    id: i32,
    volume: i32
}
impl UserData {
    pub fn new(username: String, id: i32, volume: i32) -> Self { ... }
    pub fn username(&self) -> &String { ... }
    pub fn id(&self) -> &i32 { ... }
    pub fn volume(&self) -> &i32 { ... }
}
You can update your Codec to account for this too:
impl<OT, O: CodecOps<OT>> DefaultCodec<OT, O> for UserData {
    fn codec() -> impl Codec<Self, OT, O> {
        MapCodecBuilder::new()
            .field(String::codec().field_of("username", UserData::username))
            .field(i32::codec().field_of("id", UserData::id))
            // this bounded call restricts the value to 0..100
            // in serialization & deserialization
            .field(i32::codec().bounded(0..100).field_of("volume", UserData::volume))
            .build(UserData::new)
    }
}
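The effect of that `bounded(0..100)` call can be sketched as a plain range check. The `bounded` function here is a hypothetical illustration of the idea, not datafix's implementation:

```rust
use std::ops::Range;

// A hypothetical bounded check: accept a value only if it lies in the range.
// A bounded codec would run a check like this during both encoding and decoding.
fn bounded(value: i32, range: Range<i32>) -> Result<i32, String> {
    if range.contains(&value) {
        Ok(value)
    } else {
        Err(format!("{value} is outside of {range:?}"))
    }
}

fn main() {
    assert_eq!(bounded(50, 0..100), Ok(50));
    assert!(bounded(150, 0..100).is_err());
    println!("bounds ok");
}
```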
However, old data files will still look like this:
{
    "username": "Endistic",
    "id": 19
}
So how do you automatically upgrade these old data files? You can use TypeRewriteRules.
fn volume_rule<OT: Clone, O: CodecOps<OT>>() -> impl TypeRewriteRule<OT, O> {
    Rules::new_field(
        "volume",
        // This creates a new i32 value that will be inserted into the data structure.
        |ctx: CodecOps<OT>| ctx.create_int(100),
        |_ctx| Type::Int,
    )
}
Now, before deserializing your data, you can apply this rule to your data:
let decoded: JsonValue = unimplemented!();
let fixed = JsonOps.repair(decoded, volume_rule());
let final_value = UserData::codec().decode(&JsonOps, &fixed);
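Conceptually, the rule above performs something like the following sketch, which inserts a defaulted field into untyped map data before it is decoded. The `Value` type and `add_field_with_default` helper are hypothetical stand-ins for datafix's data model and rules:

```rust
use std::collections::BTreeMap;

// A toy value type, standing in for datafix's serialized data model.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
    Map(BTreeMap<String, Value>),
}

// A hypothetical "add field with default" fix: if the map lacks `key`,
// insert `default`. Newer data that already has the field is left untouched.
fn add_field_with_default(value: &mut Value, key: &str, default: Value) {
    if let Value::Map(map) = value {
        map.entry(key.to_string()).or_insert(default);
    }
}

fn main() {
    // Old data on disk: no "volume" field yet.
    let mut old = Value::Map(BTreeMap::from([
        ("username".to_string(), Value::Str("Endistic".to_string())),
        ("id".to_string(), Value::Int(19)),
    ]));

    add_field_with_default(&mut old, "volume", Value::Int(100));

    if let Value::Map(map) = &old {
        assert_eq!(map.get("volume"), Some(&Value::Int(100)));
    }
    println!("migrated: {old:?}");
}
```

After the fix runs, the old data has the same shape as data written by the new codec, so decoding proceeds as if the field had always existed.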
Built-in Datafixers
There are a few built-in datafixers:
- `Rules::add_field` lets you add a new field to a map
- `Rules::remove_field` lets you remove a field from a map
- `Rules::apply_to_field` applies a rule to a field inside a map
- `TypeRewriteRule::and_then` lets you chain `TypeRewriteRule`s together
The Future
More will be added for more streamlined manipulation of scalar values & lists. The plan is that, eventually, `TypeRewriteRule`s will be abstracted enough that types can be inferred. You might then be able to rewrite the `volume_rule` above as:
fn volume_rule<OT: Clone, O: CodecOps<OT>>() -> impl TypeRewriteRule<OT, O> {
    Rules::new_field(
        "volume",
        Vals::new_int(100)
    )
}
Notice how there is no longer a need to specify a type; instead it will be inferred. This future model will also be chainable and able to use its context:
fn complex_rule<OT: Clone, O: CodecOps<OT>>() -> impl TypeRewriteRule<OT, O> {
    Rules::new_field(
        "xp",
        Vals::read_int_from(
            // this would apply to the current object being read
            Vals::field_of("level")
        ).multiply(10)
    )
}
Advantages
In my opinion, this framework does have some advantages compared to alternatives like serde:
`serde` is very imperative. While this can be helpful, it isn't always desired. The `derive` macro system is declarative, but limited. `Codec`s try to blend the two together, remaining as declarative as the `derive` macro system, but with more power.
One example I personally encountered was with WyvernMC's `Id` type. Minecraft needs to be able to serialize it from a single string, e.g. `minecraft:my_id` into `Id::new("minecraft", "my_id")`. However, `serde` does not make this easy and forces you to implement the `Serialize` trait. While datafix does a similar thing by forcing you to implement the `DefaultCodec` trait, I would argue it's much easier to reason about structure-based declarative code than `serde`'s visitor-based imperative code.
- Due to the heavy use of generics, everything should boil down to code equivalent to transforming the data yourself.
- `datafix`'s data model is much simpler than `serde`'s. While `serde` does have more depth in its data model, even differentiating between newtype structs, tuple structs, normal structs, etc., this is usually not necessary. Keeping it simple, in my opinion, is a big advantage. `CodecOps` seems much simpler to implement than `Serializer`, even though it has more types associated with it.

I would argue this simplicity is an advantage. When implementing `CodecOps`, you only need to handle a few fundamental operations. Meanwhile, `serde`'s larger data model can be more expressive, but most formats, such as JSON, YAML, and NBT, do not need that expressivity.
Disadvantages
There are some downsides:
- Datafix is definitely not mature. `serde` is still very good for most projects due to its massive ecosystem, its documentation, and the projects already using it. Datafix has none of this, since it is new. (Seriously, do not underestimate the power of a library having an ecosystem around it.)
- Since this is not `serde`, compatibility layers will need to be written to integrate it into codebases already using `serde`. Utilities for this will be provided by `datafix` itself in the near future.
- Currently, there is no derive macro support for auto-generating `Codec`s. While this isn't necessarily a bad thing, it adds a lot of boilerplate. Additionally, derive macro support is challenging, since a type can have multiple `Codec`s and the default behavior is not always desired.
- Due to the heavy use of generics, `datafix` has the potential to explode compile times. Using a `DynamicCodec` does not get around this, since the generic types are still computed before the `Codec` becomes dynamically dispatched. A `DynamicCodec` only becomes dynamically dispatched once compilation reaches LLVM, and does not help with generic type computation in any stage before that.
Testing
If you are interested in testing and playing with it for yourself, for the moment, you should import it as a git dependency:
datafix = { git = "https://github.com/akarahdev/datafix.git" }
For the moment, do not use datafix in production. It is unstable; please use it only for experiments for now.