1# git-bug's reusable entity data model
2
3This document explains how git-bug's reusable distributed data structure in git
4is working. This data structure is capable of:
5
6- storing an entity (bug, pull-request, config...) and its complete history in
7 git
8- carry signed authorship of editions
9- use git remotes as a medium for synchronisation and collaboration
10- merge conflicts
11- respect the rules you define as to what edition are possible
12- carry attached media
13
14If you are looking for a different writing format or to see how you can easily
15make your own, checkout [the example code](../../entity/dag/example_test.go).
16
17If you are not familiar with
18[git internals](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), you
19might first want to read about them, as the `git-bug` data model is built on top
20of them.
21
22## Entities (bug, author, ...) are a series of edit operations
23
24As entities are stored and edited in multiple processes at the same time, it's
25not possible to store the current state like it would be done in a normal
26application. If two processes change the same entity and later try to merge the
27states, we wouldn't know which change takes precedence or how to merge those
28states.
29
30To deal with this problem, you need a way to merge these changes in a meaningful
31way. Instead of storing the final bug data directly, we store a series of edit
32`Operation`s. This is a common idea, notably with
33[Operation-based CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#Operation-based_CRDTs).
34
35
36
37To get the final state of an entity, we apply these `Operation`s in the correct
38order on an empty state, to compute (aka "compile") our view.
39
40## Entities are stored in git objects
41
42An `Operation` is a piece of data, including:
43
44- a type identifier
45- an author (a reference to another entity)
46- a timestamp (there is also one or two [Lamport time](#time-is-unreliable))
47- all the data required by that operation type (a message, a status ...)
48- a random [nonce](https://en.wikipedia.org/wiki/Cryptographic_nonce) to ensure
49 we have enough entropy, as the operation identifier is a hash of that data
50 (more on that later)
51
52These `Operation`s are aggregated in an `OperationPack`, a simple array. An
53`OperationPack` represents an edit session of the entity. As the operation's
54author is the same for all the `OperationPack` we only store it once.
55
56We store this pack in git as a git `Blob`; that consists of a string containing
57a JSON array of operations. One such pack -- here with two operations -- might
58look like this:
59
60```json
61{
62 "author": {
63 "id": "04bf6c1a69bb8e9679644874c85f82e337b40d92df9d8d4176f1c5e5c6627058"
64 },
65 "ops": [
66 {
67 "type": 3,
68 "timestamp": 1647377254,
69 "nonce": "SRQwUWTJCXAmQBIS+1ctKgOcbF0=",
70 "message": "Adding a comment",
71 "files": null
72 },
73 {
74 "type": 4,
75 "timestamp": 1647377257,
76 "nonce": "la/HaRPMvD77/cJSJOUzKWuJdY8=",
77 "status": 1
78 }
79 ]
80}
81```
82
83To reference our `OperationPack`, we create a git `Tree`; it references our
84`OperationPack` `Blob` under `"/ops"`. If any edit operation includes a media
85(for instance in a text message), we can store that media as a `Blob` and
86reference it here under `"/media"`.
87
88To complete the picture, we create a git `Commit` that references our `Tree`.
89Each time we add more `Operation`s to our bug, we add a new `Commit` with the
90same data-structure to form a chain of `Commit`s.
91
92This chain of `Commit`s is made available as a git `Reference` under
93`refs/<namespace>/<id>`. We can later use this reference to push our data to a
94git remote. As git will push any data needed as well, everything will be pushed
95to the remote, including the media.
96
97Here is the complete picture:
98
99
100
101## Time is unreliable
102
103Before being able to merge conflicts, let's start with some building blocks.
104
105It would be very tempting to use the `Operation`'s timestamp to give us the
106order to compile the final state. However, you can't rely on the time provided
107by other people (their clock might be off) for anything other than just display.
108This is a fundamental limitation of distributed system, and even more so when
109actors might want to game the system.
110
111Instead, we are going to use
112[Lamport logical clock](https://en.wikipedia.org/wiki/Lamport_timestamps). A
113Lamport clock is a simple counter of events. This logical clock gives us a
114partial ordering:
115
116- if L1 \< L2, L1 happened before L2
117- if L1 > L2, L1 happened after L2
118- if L1 == L2, we can't tell which happened first: it's a concurrent edition
119
120Each time we are appending something to the data (create an `Entity`, add an
121`Operation`) a logical time will be attached, with the highest time value we are
122aware of, plus one. This declares a causality in the events and allows ordering
123entities and operations.
124
125The first commit of an `Entity` will have both a creation time and edit time
126clock, while a later commit will only have an edit time clock. These clocks
127value are serialized directly in the `Tree` entry name (for example:
128`"create-clock-4"`). As a `Tree` entry needs to reference something, we
129reference the git `Blob` with an empty content. As all of these entries will
130reference the same `Blob`, no network transfer is needed as long as you already
131have any entity in your repository.
132
133Example of a `Tree` of the first commit of an entity:
134
135```
136100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 create-clock-14
137100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 edit-clock-137
138100644 blob a020a85baa788e12699a4d83dd735578f0d78c75 ops
139```
140
141Example of a `Tree` of a later commit of an entity:
142
143```
144100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 edit-clock-154
145100644 blob 68383346c1a9503f28eec888efd300e9fc179ca0 ops
146```
147
148## Entities and Operation's ID
149
150`Operation`s can be referenced - in the data model or by users - with an
151identifier. This identifier is computed from the `Operation`'s data itself, with
152a hash of that data: `id = hash(json(op))`
153
154For entities, `git-bug` uses as identifier the hash of the first `Operation` of
155the entity, as serialized on disk.
156
157The same way as git does, this hash is displayed truncated to a 7 characters
158string to a human user. Note that when specifying a bug id in a command, you can
159enter as few characters as you want, as long as there is no ambiguity. If
160multiple entities match your prefix, `git-bug` will complain and display the
161potential matches.
162
163## Entities support conflict resolution
164
165Now that we have all that, we can finally merge our entities without conflict,
166and collaborate with other users. Let's start by getting rid of two simple
167scenarios:
168
169- if we simply pull updates, we move forward our local reference. We get an
170 update of our graph that we read as usual.
171- if we push fast-forward updates, we move forward the remote reference and
172 other users can update their reference as well.
173
174The tricky part happens when we have concurrent editions. If we pull updates
175while we have local changes (non-straightforward in git term), `git-bug` creates
176the equivalent of a merge commit to merge both branches into a DAG. This DAG has
177a single root containing the first operation, but can have branches that get
178merged back into a single head pointed by the reference.
179
180As we don't have a purely linear series of commits/`Operations`s, we need a
181deterministic ordering to always apply operations in the same order.
182
183`git-bug` applies the following algorithm:
184
1851. load and read all the commits and the associated `OperationPack`s
1862. make sure that the Lamport clocks respect the DAG structure: a parent
187 commit/`OperationPack` (that is, towards the head) cannot have a clock that
188 is higher or equal than its direct child. If such a problem happens, the
189 commit is refused/discarded.
1903. individual `Operation`s are assembled together and ordered given the
191 following priorities:
192 1. the edition's lamport clock if not concurrent
193 2. the lexicographic order of the `OperationPack`'s identifier
194
195Step 2 is providing and enforcing a constraint over the `Operation`'s logical
196clocks. What that means, is that **we inherit the implicit ordering given by the
197DAG**. Later, logical clocks refine that ordering. This - coupled with signed
198commits - has the nice property of limiting how this data model can be abused.
199
200Here is an example of such an ordering:
201
202
203
204We can see that:
205
206- Lamport clocks respect the DAG structure
207- the final `Operation` order is \[A,B,C,D,E,F\], according to those clocks
208
209When we have concurrent editions, we apply a secondary ordering, based on the
210`OperationPack`'s identifier:
211
212
213
214This secondary ordering doesn't carry much meaning, but it's unbiased and hard
215to abuse.