Serialization format

Labels: area/serialization kind/feature

Timeline

Michael Muré (MichaelMure) opened

A bug's data is stored using git commits, trees and blobs. Each blob holds a serialized OperationPack, which is an array of edit operations on the bug's state.

This OperationPack is currently serialized using golang's gob, which is neat because it just works. However, it might not be the best option for interoperability with other tools in the future.

How should it be serialized? JSON? In any case, git will compress the data using zlib, so a text format might not be that terrible.

Feel free to argue a case here.

Michael Muré (MichaelMure) added label RFC

daurnimator (daurnimator) commented

I just found the project via hackernews. I'd love to give this sort of thing a try, and integrate it into other tools. However using go serialisation rules out all my languages of choice.

I'd say use something JSON based, or if that's not enough, CBOR.

Luke Champine (lukechampine) commented

My two cents: gob is convenient and efficient, but not a great choice if you want interop with other languages. Unfortunately there just aren't many binary formats that are widely supported, except perhaps protobufs.

JSON is probably your best bet. As you noted, it will be compressed anyway, and if performance is an issue you can always switch to a faster JSON encoder. The only big downside to JSON that I'm aware of is poor support for encoding binary blobs (encoding/json encodes []byte as a base-64 string). But if OperationPack is almost entirely textual data anyway, there's little reason to worry about that.
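To illustrate the base-64 point, here is a minimal Go sketch. The `attachment` type and its field names are made up for the example and are not taken from git-bug; only the `encoding/json` behaviour for `[]byte` is the standard library's.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// attachment is a hypothetical payload type, not one of git-bug's types.
type attachment struct {
	Name string `json:"name"`
	Data []byte `json:"data"` // encoding/json writes []byte as a base-64 string
}

// encode marshals an attachment to JSON text.
func encode(a attachment) (string, error) {
	out, err := json.Marshal(a)
	return string(out), err
}

func main() {
	s, err := encode(attachment{Name: "note", Data: []byte("hi")})
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // {"name":"note","data":"aGk="}
}
```

The two bytes of "hi" become the four characters "aGk=", so binary fields grow by roughly a third before compression, while plain string fields are unaffected.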

Ævar Arnfjörð Bjarmason (avar) commented

Git does its own delta-compression on top of zlib. You should decide this using a combination of whatever format needs you have (can you add more fields, is it extensible etc.), and how git manages to compress this using both delta compression and zlib, which you can figure out using a large enough set of realistic test data.

Michael Muré (MichaelMure) commented

To give more details about the requirements: an OperationPack currently holds very simple data (string, int, array...), and that is likely to stay the same even when new operations are added. For instance, embedded files are stored in git blobs and then linked in the git tree.

The only tricky part is that an OperationPack is a mixed array of Operations, so the decoder needs to support that and match each operation to the correct Go struct.
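One common way to decode such a mixed array with JSON is to tag each element with its type and decode in two passes. The sketch below assumes a `"type"` field and two illustrative structs (`CreateOp`, `AddCommentOp`); these names and field layouts are hypothetical and not necessarily what git-bug uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Operation is a stand-in interface; the real git-bug types may differ.
type Operation interface {
	Kind() string
}

// CreateOp and AddCommentOp are illustrative operation structs.
type CreateOp struct {
	Title   string `json:"title"`
	Message string `json:"message"`
}

type AddCommentOp struct {
	Message string `json:"message"`
}

func (CreateOp) Kind() string     { return "create" }
func (AddCommentOp) Kind() string { return "add_comment" }

// decodePack decodes a JSON array whose elements carry a "type" tag.
func decodePack(data []byte) ([]Operation, error) {
	// Keep each element as raw bytes until its type is known.
	var raws []json.RawMessage
	if err := json.Unmarshal(data, &raws); err != nil {
		return nil, err
	}
	ops := make([]Operation, 0, len(raws))
	for _, raw := range raws {
		// First pass: read only the tag.
		var tag struct {
			Type string `json:"type"`
		}
		if err := json.Unmarshal(raw, &tag); err != nil {
			return nil, err
		}
		// Second pass: decode into the matching struct.
		switch tag.Type {
		case "create":
			var op CreateOp
			if err := json.Unmarshal(raw, &op); err != nil {
				return nil, err
			}
			ops = append(ops, op)
		case "add_comment":
			var op AddCommentOp
			if err := json.Unmarshal(raw, &op); err != nil {
				return nil, err
			}
			ops = append(ops, op)
		default:
			return nil, fmt.Errorf("unknown operation type %q", tag.Type)
		}
	}
	return ops, nil
}

func main() {
	data := []byte(`[{"type":"create","title":"Crash","message":"boom"},` +
		`{"type":"add_comment","message":"me too"}]`)
	ops, err := decodePack(data)
	if err != nil {
		panic(err)
	}
	for _, op := range ops {
		fmt.Printf("%s: %+v\n", op.Kind(), op)
	}
}
```

The same two-pass idea should carry over to CBOR or MessagePack, as long as the library offers an equivalent of `json.RawMessage`.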

Michael Muré (MichaelMure) commented

With all these formats that could fit the bill, the best way to choose would be a benchmark of both performance and blob size for several formats (at least JSON and CBOR). Who knows how git compression behaves on something that is already binary.

Maybe the git people could make an educated guess.

Jed Fox (j-f1) commented

MessagePack is another option, but I feel like MessagePack and CBOR are both designed for getting the smallest possible representation of data, whereas JSON is designed to be human-readable, ASCII-compatible, and simple to parse. Compare JSON’s spec (the sidebar) with the CBOR and MessagePack specs.

Michael Muré (MichaelMure) added label Core

Michael Muré (MichaelMure) commented

I wrote some throwaway code to test the resulting blob size for various formats. Here is one run:

Creating repo: /tmp/512275589

GOB
raw: 5210, git: 2216, ratio: 42.53359%
raw: 5536, git: 2320, ratio: 41.907513%
raw: 3987, git: 1768, ratio: 44.34412%
raw: 4407, git: 1893, ratio: 42.95439%
raw: 6368, git: 2593, ratio: 40.71922%
raw: 4905, git: 2143, ratio: 43.690113%
raw: 6524, git: 2660, ratio: 40.772533%
raw: 3315, git: 1549, ratio: 46.726997%
raw: 4116, git: 1780, ratio: 43.24587%
raw: 3928, git: 1751, ratio: 44.577393%
total: 20673

JSON
raw: 4862, git: 1966, ratio: 40.436035%
raw: 5188, git: 2072, ratio: 39.93832%
raw: 3633, git: 1528, ratio: 42.058907%
raw: 4055, git: 1657, ratio: 40.863132%
raw: 6026, git: 2339, ratio: 38.815136%
raw: 4555, git: 1903, ratio: 41.778267%
raw: 6184, git: 2411, ratio: 38.987713%
raw: 2965, git: 1314, ratio: 44.31703%
raw: 3764, git: 1551, ratio: 41.20616%
raw: 3568, git: 1515, ratio: 42.460762%
total: 18256

CBOR
raw: 4746, git: 1961, ratio: 41.319008%
raw: 5071, git: 2065, ratio: 40.72175%
raw: 3524, git: 1527, ratio: 43.33144%
raw: 3944, git: 1656, ratio: 41.98783%
raw: 5902, git: 2337, ratio: 39.59675%
raw: 4440, git: 1899, ratio: 42.77027%
raw: 6062, git: 2410, ratio: 39.755856%
raw: 2852, git: 1308, ratio: 45.862553%
raw: 3652, git: 1543, ratio: 42.25082%
raw: 3463, git: 1507, ratio: 43.51718%
total: 18213

MsgPack
raw: 4746, git: 1980, ratio: 41.71934%
raw: 5072, git: 2087, ratio: 41.147476%
raw: 3521, git: 1541, ratio: 43.765976%
raw: 3941, git: 1665, ratio: 42.24816%
raw: 5902, git: 2357, ratio: 39.935616%
raw: 4439, git: 1914, ratio: 43.117817%
raw: 6060, git: 2425, ratio: 40.016502%
raw: 2853, git: 1323, ratio: 46.372242%
raw: 3654, git: 1558, ratio: 42.638203%
raw: 3464, git: 1526, ratio: 44.053116%
total: 18376

As expected, there is not much difference after encoding + compression. CBOR consistently wins the size contest, though.
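For reference, the "git" column in a run like this is the size of a loose object, which can be approximated without git at all: a loose blob is the content prefixed with a `blob <length>` header and a NUL byte, compressed with zlib. The sketch below is an approximation under that assumption and is not the throwaway benchmark code itself. Go's zlib and git's zlib are different implementations, so exact byte counts may differ slightly. The `looseBlobSize` and `ratio` helpers and the sample payload are made up for the example.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
)

// looseBlobSize returns the zlib-compressed size of content stored as a
// git loose blob: a "blob <len>\x00" header followed by the content.
func looseBlobSize(content []byte) (int, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := fmt.Fprintf(w, "blob %d\x00", len(content)); err != nil {
		return 0, err
	}
	if _, err := w.Write(content); err != nil {
		return 0, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// ratio returns the compressed size as a percentage of the raw size.
func ratio(raw, packed int) float32 {
	return float32(packed) / float32(raw) * 100
}

func main() {
	// Artificial, highly repetitive payload; real OperationPacks compress less well.
	payload := bytes.Repeat([]byte(`{"type":"add_comment","message":"hello"}`), 50)
	packed, err := looseBlobSize(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw: %d, git: %d, ratio: %v%%\n", len(payload), packed, ratio(len(payload), packed))
}
```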

Michael Muré (MichaelMure) commented

Note: each serialization format is tested on the same set of randomly generated OperationPacks, each with one Create and four AddComment ops.

Ævar Arnfjörð Bjarmason (avar) commented

@MichaelMure This test case really isn't meaningful. You're just testing how a given payload compresses with zlib when creating loose objects, since when you add a new object it's compressed, a header is added to it, and it's added to the object store.

Instead, you should after every addition do git add && git commit && git gc. Then measure the total size of the now-packed .git/objects directory, not individual objects.

At that point, these objects will be delta-compressed, so you can see how the size of the repo grows as they're added.

The size of individual objects is pretty much irrelevant. You can have 10 objects that are all 1GB, but delta-compress down to 1GB + 1MB or whatever, or 10GB if they don't delta-compress at all.
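To measure the repository rather than individual objects, one option is to sum the file sizes under `.git/objects` after each `git gc`. The Go sketch below does only the summing. `dirSize` and `sampleTree` are hypothetical helpers, and the sample tree merely stands in for a real objects directory so the example runs anywhere.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the sizes of all regular files below root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	return total, err
}

// sampleTree builds a small temporary directory (11 bytes in two files)
// that stands in for a .git/objects directory.
func sampleTree() (string, error) {
	root, err := os.MkdirTemp("", "objects")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Join(root, "pack"), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(root, "a"), []byte("hello"), 0o644); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(root, "pack", "b"), []byte("world!"), 0o644); err != nil {
		return "", err
	}
	return root, nil
}

func main() {
	// Pass a real path such as /path/to/repo/.git/objects as the first
	// argument; without one, the temporary sample tree is measured.
	var root string
	if len(os.Args) > 1 {
		root = os.Args[1]
	} else {
		var err error
		if root, err = sampleTree(); err != nil {
			panic(err)
		}
		defer os.RemoveAll(root)
	}
	size, err := dirSize(root)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %d bytes\n", root, size)
}
```

Calling this before and after `git gc` gives the kind of unpacked vs. packed comparison described above.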

Ævar Arnfjörð Bjarmason (avar) commented

@MichaelMure Also in reply to:

Who knows how git compression behaves on something that is already binary.

I'm sure there's some obscure edge case where the compression is tweaked for textual content in some way that'll prove me wrong, but in general this doesn't matter at all.

Git's just as good at delta-compressing binary data and non-binary data. What it's not good at compressing (and this goes for any compression), is data that's wildly different from one object to the next.

It just so happens that binary data is generally less delta-compressible: think of, say, two *.mp3s with different songs vs. a change to a *.txt file of lyrics.

But I wouldn't expect these sorts of pack formats to delta-compress any worse than, say, JSON. It's going to be other things that matter. For example, if you use a JSON encoder that doesn't sort the keys of the payload, so that their order differs every time, the output will compress worse than if the keys were sorted. The same goes for any binary key-value format.

I do think that for UI purposes it makes sense to pick a widely implemented & used text format like JSON for introspection purposes and the availability of tooling (e.g. jq), if the compression numbers for it aren't much worse that is.
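On the key-ordering point: Go's `encoding/json` writes map keys in sorted order and struct fields in declaration order, so the same value always encodes to the same bytes. A small sketch, with a made-up field map and a hypothetical `encodeFields` helper:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeFields marshals a free-form field map. encoding/json sorts map keys,
// so the output does not depend on Go's randomized map iteration order.
func encodeFields(fields map[string]string) (string, error) {
	out, err := json.Marshal(fields)
	return string(out), err
}

func main() {
	fields := map[string]string{
		"type":    "add_comment",
		"message": "hello",
		"author":  "alice",
	}
	for i := 0; i < 3; i++ {
		s, err := encodeFields(fields)
		if err != nil {
			panic(err)
		}
		fmt.Println(s) // {"author":"alice","message":"hello","type":"add_comment"}
	}
}
```

Encoders in other languages, or for other formats, may not give this guarantee by default, so it is worth checking before relying on it.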

Michael Muré (MichaelMure) commented

That's a good point, I'll check the repo size as well, before and after a git gc.

Michael Muré (MichaelMure) commented

Alright, another run with the size of the repo before and after a git gc (initial empty size subtracted), with 1000 OperationPacks serialized:

GOB
Creating repo: /tmp/272689118
raw: 4446, git: 1944, ratio: 43.724697%
raw: 4774, git: 2075, ratio: 43.4646%
raw: 5075, git: 2203, ratio: 43.408867%
raw: 4135, git: 1795, ratio: 43.409916%
raw: 5901, git: 2437, ratio: 41.298084%
raw: 2919, git: 1372, ratio: 47.0024%
raw: 4974, git: 2098, ratio: 42.179333%
raw: 5074, git: 2153, ratio: 42.432007%
raw: 3600, git: 1613, ratio: 44.805557%
raw: 4663, git: 2016, ratio: 43.23397%
...
Unpacked: 1926463
GC packed: 1926510
Packing diff: 47
GC packed aggressive: 1926510
Packing diff: 0

JSON
Creating repo: /tmp/263735205
raw: 4094, git: 1706, ratio: 41.67074%
raw: 4428, git: 1837, ratio: 41.485996%
raw: 4731, git: 1968, ratio: 41.597973%
raw: 3776, git: 1547, ratio: 40.96928%
raw: 5554, git: 2192, ratio: 39.467052%
raw: 2566, git: 1136, ratio: 44.27124%
raw: 4628, git: 1863, ratio: 40.254967%
raw: 4732, git: 1921, ratio: 40.595943%
raw: 3242, git: 1377, ratio: 42.47378%
raw: 4320, git: 1773, ratio: 41.041668%
...
Unpacked: 1687200
GC packed: 1687247
Packing diff: 47
GC packed aggressive: 1687247
Packing diff: 0

CBOR
Creating repo: /tmp/701783232
raw: 3984, git: 1705, ratio: 42.796185%
raw: 4311, git: 1838, ratio: 42.63512%
raw: 4613, git: 1965, ratio: 42.597008%
raw: 3674, git: 1550, ratio: 42.18835%
raw: 5438, git: 2192, ratio: 40.308937%
raw: 2462, git: 1134, ratio: 46.060116%
raw: 4514, git: 1863, ratio: 41.2716%
raw: 4613, git: 1916, ratio: 41.534794%
raw: 3137, git: 1376, ratio: 43.863564%
raw: 4202, git: 1766, ratio: 42.027603%
...
Unpacked: 1685158
GC packed: 1685205
Packing diff: 47
GC packed aggressive: 1685205
Packing diff: 0

MsgPack
Creating repo: /tmp/132917535
raw: 3984, git: 1723, ratio: 43.247993%
raw: 4310, git: 1854, ratio: 43.01624%
raw: 4611, git: 1985, ratio: 43.049232%
raw: 3672, git: 1562, ratio: 42.538128%
raw: 5436, git: 2204, ratio: 40.544518%
raw: 2460, git: 1152, ratio: 46.82927%
raw: 4512, git: 1875, ratio: 41.55585%
raw: 4614, git: 1932, ratio: 41.872562%
raw: 3138, git: 1395, ratio: 44.455067%
raw: 4202, git: 1783, ratio: 42.432175%
...
Unpacked: 1700178
GC packed: 1700225
Packing diff: 47
GC packed aggressive: 1700225
Packing diff: 0

Whatever the format, no compression takes advantage of the similarity between OperationPacks. The packed repo is actually 47 bytes bigger, and git gc --aggressive does nothing.

Michael Muré (MichaelMure) commented

Another run with 100k OperationPacks (so 500k operations), just for the sake of it:

GOB
Creating repo: /tmp/235087672
raw: 3231, git: 1492, ratio: 46.177654%
raw: 4688, git: 2097, ratio: 44.731228%
raw: 3611, git: 1625, ratio: 45.001385%
raw: 3566, git: 1620, ratio: 45.42905%
raw: 3911, git: 1718, ratio: 43.927383%
raw: 6047, git: 2526, ratio: 41.77278%
raw: 3487, git: 1595, ratio: 45.741325%
raw: 5425, git: 2267, ratio: 41.788017%
raw: 3013, git: 1341, ratio: 44.507137%
raw: 6101, git: 2549, ratio: 41.780037%
...
Unpacked: 194 MB
GC packed: 194 MB
Packing diff: 47
GC packed aggressive: 194 MB
Packing diff: 0

JSON
Creating repo: /tmp/145768759
raw: 2870, git: 1261, ratio: 43.937283%
raw: 4332, git: 1842, ratio: 42.520775%
raw: 3248, git: 1398, ratio: 43.041874%
raw: 3215, git: 1392, ratio: 43.297047%
raw: 3553, git: 1485, ratio: 41.795666%
raw: 5699, git: 2280, ratio: 40.00702%
raw: 3130, git: 1356, ratio: 43.32268%
raw: 5083, git: 2032, ratio: 39.97639%
raw: 2660, git: 1119, ratio: 42.06767%
raw: 5753, git: 2301, ratio: 39.99652%
...
Unpacked: 170 MB
GC packed: 170 MB
Packing diff: 47
GC packed aggressive: 170 MB
Packing diff: 0

CBOR
Creating repo: /tmp/170025770
raw: 2773, git: 1255, ratio: 45.257843%
raw: 4227, git: 1851, ratio: 43.789925%
raw: 3149, git: 1395, ratio: 44.299778%
raw: 3107, git: 1395, ratio: 44.898617%
raw: 3448, git: 1480, ratio: 42.92343%
raw: 5587, git: 2284, ratio: 40.880615%
raw: 3027, git: 1363, ratio: 45.02808%
raw: 4964, git: 2030, ratio: 40.89444%
raw: 2558, git: 1113, ratio: 43.510555%
raw: 5641, git: 2300, ratio: 40.77291%
...
Unpacked: 170 MB
GC packed: 170 MB
Packing diff: 47
GC packed aggressive: 170 MB
Packing diff: 0

MsgPack
Creating repo: /tmp/418211457
raw: 2778, git: 1272, ratio: 45.788338%
raw: 4228, git: 1868, ratio: 44.181644%
raw: 3150, git: 1409, ratio: 44.73016%
raw: 3109, git: 1408, ratio: 45.287876%
raw: 3447, git: 1495, ratio: 43.371048%
raw: 5587, git: 2302, ratio: 41.202793%
raw: 3026, git: 1379, ratio: 45.571712%
raw: 4965, git: 2041, ratio: 41.107754%
raw: 2562, git: 1125, ratio: 43.911007%
raw: 5641, git: 2316, ratio: 41.05655%
...
Unpacked: 171 MB
GC packed: 171 MB
Packing diff: 47
GC packed aggressive: 171 MB
Packing diff: 0

Jed Fox (j-f1) commented

Interesting that JSON and CBOR end up almost the same size.

Ævar Arnfjörð Bjarmason (avar) commented

In a lot of cases --aggressive does nothing, since e.g. if you have files that keep growing they'll already be in the --window and --depth described in the git-repack manpage; --aggressive just tweaks those values from the default of 10/50 to 250/50. I wouldn't be surprised if for such an artificial testcase you got similar/the same results with --window=1 --depth=1 or whatever.

Is the history this go tool produces accessible somewhere?

Michael Muré (MichaelMure) commented

Each OperationPack is independent; the only similarity between them would be the structure of the serialization format. There is no growing file.

It's not that surprising that git doesn't compress that.

Is the history this go tool produces accessible somewhere?

I'm not sure it answers your question, but have a look at https://github.com/MichaelMure/git-bug/blob/master/doc/model.md.

Ævar Arnfjörð Bjarmason (avar) commented

I mean you're producing some git data during the benchmark in a repo, is the result available somewhere? I could run it myself, but then I have to figure out how to run/install go etc.

Michael Muré (MichaelMure) commented

@avar These blobs are not tied to a branch, so it's rather impractical to push them somewhere. Please install go (probably just a package), check out the branch and run go run misc/serial_format_research/main.go.

Michael Muré (MichaelMure) commented

With 60fcfcdcb0e89741528cfc99a94a48f204d48e6b, I changed the serialization format to JSON.

Here are a few measurements with 10k random bugs and 10 ops/bug (100k ops total, same as the previous test):

generation & writing 61s
repo size 161M
git gc 4s
repo size 21M
cache building 40s
cache size 1.5M
bug query 0.04s

Quite happy with these results! Note that the cache building currently runs on a single processor, so there is still performance to gain.

Also, now that the blobs are connected in a chain of commits, git gc actually starts to compress them. 21 MB for 10k bugs is nice.

Michael Muré (MichaelMure) commented

With no sign of trouble after various tests, let's consider the matter resolved :-)

Michael Muré (MichaelMure) closed the bug

andyl (andyl) commented

Is there a CLI command to generate a JSON dump from the issues?

Michael Muré (MichaelMure) commented

@andyl there is not. What's your use case?

andyl (andyl) commented

@MichaelMure I'm working on a project that allows people to post auction-style bids for issues (see bugmark.net). We'd very much like to integrate with git-bug. To do this, we need to be able to poll the issue repository and grab a JSON-like representation. JSON would be simple for us, but if there is another way to integrate, we're open to that too.

Michael Muré (MichaelMure) commented

@andyl that's certainly doable and should be supported by the CLI tools.

Could you open a new issue where we can discuss that?

andyl (andyl) commented

Could you open a new issue where we can discuss that ?

@MichaelMure see #45

sudoforge added label kind/feature

sudoforge added label area/serialization

sudoforge removed label RFC

sudoforge removed label Core