data-model.md

  1# git-bug's reusable entity data model
  2
  3This document explains how git-bug's reusable distributed data structure in git
  4is working. This data structure is capable of:
  5
  6- storing an entity (bug, pull-request, config...) and its complete history in
  7  git
  8- carry signed authorship of editions
  9- use git remotes as a medium for synchronisation and collaboration
 10- merge conflicts
 11- respect the rules you define as to what edition are possible
 12- carry attached media
 13
 14If you are looking for a different writing format or to see how you can easily
 15make your own, checkout [the example code](../../entity/dag/example_test.go).
 16
 17If you are not familiar with
 18[git internals](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), you
 19might first want to read about them, as the `git-bug` data model is built on top
 20of them.
 21
 22## Entities (bug, author, ...) are a series of edit operations
 23
 24As entities are stored and edited in multiple processes at the same time, it's
 25not possible to store the current state like it would be done in a normal
 26application. If two processes change the same entity and later try to merge the
 27states, we wouldn't know which change takes precedence or how to merge those
 28states.
 29
 30To deal with this problem, you need a way to merge these changes in a meaningful
 31way. Instead of storing the final bug data directly, we store a series of edit
 32`Operation`s. This is a common idea, notably with
 33[Operation-based CRDTs](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#Operation-based_CRDTs).
 34
 35![ordered operations](../assets/operations.png)
 36
 37To get the final state of an entity, we apply these `Operation`s in the correct
 38order on an empty state, to compute (aka "compile") our view.
 39
 40## Entities are stored in git objects
 41
 42An `Operation` is a piece of data, including:
 43
 44- a type identifier
 45- an author (a reference to another entity)
 46- a timestamp (there is also one or two [Lamport time](#time-is-unreliable))
 47- all the data required by that operation type (a message, a status ...)
 48- a random [nonce](https://en.wikipedia.org/wiki/Cryptographic_nonce) to ensure
 49  we have enough entropy, as the operation identifier is a hash of that data
 50  (more on that later)
 51
 52These `Operation`s are aggregated in an `OperationPack`, a simple array. An
 53`OperationPack` represents an edit session of the entity. As the operation's
 54author is the same for all the `OperationPack` we only store it once.
 55
 56We store this pack in git as a git `Blob`; that consists of a string containing
 57a JSON array of operations. One such pack -- here with two operations -- might
 58look like this:
 59
 60```json
 61{
 62  "author": {
 63    "id": "04bf6c1a69bb8e9679644874c85f82e337b40d92df9d8d4176f1c5e5c6627058"
 64  },
 65  "ops": [
 66    {
 67      "type": 3,
 68      "timestamp": 1647377254,
 69      "nonce": "SRQwUWTJCXAmQBIS+1ctKgOcbF0=",
 70      "message": "Adding a comment",
 71      "files": null
 72    },
 73    {
 74      "type": 4,
 75      "timestamp": 1647377257,
 76      "nonce": "la/HaRPMvD77/cJSJOUzKWuJdY8=",
 77      "status": 1
 78    }
 79  ]
 80}
 81```
 82
 83To reference our `OperationPack`, we create a git `Tree`; it references our
 84`OperationPack` `Blob` under `"/ops"`. If any edit operation includes a media
 85(for instance in a text message), we can store that media as a `Blob` and
 86reference it here under `"/media"`.
 87
 88To complete the picture, we create a git `Commit` that references our `Tree`.
 89Each time we add more `Operation`s to our bug, we add a new `Commit` with the
 90same data-structure to form a chain of `Commit`s.
 91
 92This chain of `Commit`s is made available as a git `Reference` under
 93`refs/<namespace>/<id>`. We can later use this reference to push our data to a
 94git remote. As git will push any data needed as well, everything will be pushed
 95to the remote, including the media.
 96
 97Here is the complete picture:
 98
 99![git graph of a simple bug](../assets/bug-graph.png)
100
101## Time is unreliable
102
103Before being able to merge conflicts, let's start with some building blocks.
104
105It would be very tempting to use the `Operation`'s timestamp to give us the
106order to compile the final state. However, you can't rely on the time provided
107by other people (their clock might be off) for anything other than just display.
108This is a fundamental limitation of distributed system, and even more so when
109actors might want to game the system.
110
111Instead, we are going to use
112[Lamport logical clock](https://en.wikipedia.org/wiki/Lamport_timestamps). A
113Lamport clock is a simple counter of events. This logical clock gives us a
114partial ordering:
115
116- if L1 \< L2, L1 happened before L2
117- if L1 > L2, L1 happened after L2
118- if L1 == L2, we can't tell which happened first: it's a concurrent edition
119
120Each time we are appending something to the data (create an `Entity`, add an
121`Operation`) a logical time will be attached, with the highest time value we are
122aware of, plus one. This declares a causality in the events and allows ordering
123entities and operations.
124
125The first commit of an `Entity` will have both a creation time and edit time
126clock, while a later commit will only have an edit time clock. These clocks
127value are serialized directly in the `Tree` entry name (for example:
128`"create-clock-4"`). As a `Tree` entry needs to reference something, we
129reference the git `Blob` with an empty content. As all of these entries will
130reference the same `Blob`, no network transfer is needed as long as you already
131have any entity in your repository.
132
133Example of a `Tree` of the first commit of an entity:
134
135```
136100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	create-clock-14
137100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	edit-clock-137
138100644 blob a020a85baa788e12699a4d83dd735578f0d78c75	ops
139```
140
141Example of a `Tree` of a later commit of an entity:
142
143```
144100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	edit-clock-154
145100644 blob 68383346c1a9503f28eec888efd300e9fc179ca0	ops
146```
147
148## Entities and Operation's ID
149
150`Operation`s can be referenced - in the data model or by users - with an
151identifier. This identifier is computed from the `Operation`'s data itself, with
152a hash of that data: `id = hash(json(op))`
153
154For entities, `git-bug` uses as identifier the hash of the first `Operation` of
155the entity, as serialized on disk.
156
157The same way as git does, this hash is displayed truncated to a 7 characters
158string to a human user. Note that when specifying a bug id in a command, you can
159enter as few characters as you want, as long as there is no ambiguity. If
160multiple entities match your prefix, `git-bug` will complain and display the
161potential matches.
162
163## Entities support conflict resolution
164
165Now that we have all that, we can finally merge our entities without conflict,
166and collaborate with other users. Let's start by getting rid of two simple
167scenarios:
168
169- if we simply pull updates, we move forward our local reference. We get an
170  update of our graph that we read as usual.
171- if we push fast-forward updates, we move forward the remote reference and
172  other users can update their reference as well.
173
174The tricky part happens when we have concurrent editions. If we pull updates
175while we have local changes (non-straightforward in git term), `git-bug` creates
176the equivalent of a merge commit to merge both branches into a DAG. This DAG has
177a single root containing the first operation, but can have branches that get
178merged back into a single head pointed by the reference.
179
180As we don't have a purely linear series of commits/`Operations`s, we need a
181deterministic ordering to always apply operations in the same order.
182
183`git-bug` applies the following algorithm:
184
1851. load and read all the commits and the associated `OperationPack`s
1862. make sure that the Lamport clocks respect the DAG structure: a parent
187   commit/`OperationPack` (that is, towards the head) cannot have a clock that
188   is higher or equal than its direct child. If such a problem happens, the
189   commit is refused/discarded.
1903. individual `Operation`s are assembled together and ordered given the
191   following priorities:
192   1. the edition's lamport clock if not concurrent
193   2. the lexicographic order of the `OperationPack`'s identifier
194
195Step 2 is providing and enforcing a constraint over the `Operation`'s logical
196clocks. What that means, is that **we inherit the implicit ordering given by the
197DAG**. Later, logical clocks refine that ordering. This - coupled with signed
198commits - has the nice property of limiting how this data model can be abused.
199
200Here is an example of such an ordering:
201
202![merge scenario 1](../assets/merge-1.png)
203
204We can see that:
205
206- Lamport clocks respect the DAG structure
207- the final `Operation` order is \[A,B,C,D,E,F\], according to those clocks
208
209When we have concurrent editions, we apply a secondary ordering, based on the
210`OperationPack`'s identifier:
211
212![merge scenario 2](../assets/merge-2.png)
213
214This secondary ordering doesn't carry much meaning, but it's unbiased and hard
215to abuse.