Reflections on the change needed in the data model

Michael Muré (MichaelMure) opened 6 years ago

It appears to me that some changes are needed to fix some current problems or support more use cases.

To help structure the reflection, let's start with a ...

Threat model

Level 1:

all actors behave in good faith and run the normal software
the goal is to prevent accidental tampering or corruption of the data, and in particular, changes or suppression in the semantics of a bug.

In essence, this threat model represents the normal workflow of git-bug, where all the actors are identified and validated, either by being authorised to push to a repository, by being a source repository that is pulled from, or by being a bot of some sort with the same level of credentials.

Being resistant to this threat model is the minimal acceptable protection that git-bug should provide.

Level 2:

bad actor(s) actively try to tamper with or corrupt the data, and in particular, try to change, inject or suppress some semantic information of a bug.
they can run a custom version of the software

This threat model represents a fully decentralized workflow for git-bug where the actors are not fully identified and validated.

Being resistant to this threat model is not a goal of git-bug at the moment. This threat model exists to identify where git-bug might fail in this scenario, and possibly to make educated design decisions towards supporting it at some point. In any case, being resistant to this threat model automatically implies being resistant to the first one, and might lead to more elegant solutions.

Point of attention

ordering of the operations inside a bug
- reordering of existing operations inside an OperationPack
- reordering of existing OperationPacks
insertion of an extra operation

Problem#1: operation's ID collision

As of now, it's fairly easy to create a collision between Operation IDs, especially for simple Operations like SetStatusOperation. Create two "close" operations within the same second and you have a collision. Essentially, there is either not enough entropy or no mechanism to ensure uniqueness (or really, enough uniqueness, as we are dealing with a distributed system).
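
To make the collision concrete, here is a minimal sketch. It assumes the ID is a SHA-256 hash of the JSON-serialized operation; `setStatusOp` and `opID` are hypothetical, simplified stand-ins and not the actual git-bug types.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// setStatusOp is a hypothetical, simplified stand-in for SetStatusOperation.
type setStatusOp struct {
	Type     string `json:"type"`
	Author   string `json:"author"`
	UnixTime int64  `json:"timestamp"` // second precision
	Status   string `json:"status"`
}

// opID derives an ID by hashing the serialized operation (assumed scheme).
func opID(op setStatusOp) string {
	data, err := json.Marshal(op)
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	first := setStatusOp{Type: "set-status", Author: "alice", UnixTime: 1560000000, Status: "closed"}
	second := first // a second "close" issued within the same second

	// Identical content serializes to identical bytes, hence the same ID.
	fmt.Println("collision:", opID(first) == opID(second))
}
```

Nothing in the serialized content distinguishes the two operations, so any content-derived ID will be the same for both.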

Possible solutions:

add some sort of arbitrary counter
- not great as it doesn't give a uniqueness guarantee (two actors can select the same next index, as they don't have complete knowledge of the system)
increase the resolution of the time field
- makes accidental collisions far less likely
add a "random data" field like in identity.Version
(?) add a Vector clock per bug with a clock value for each operation, making them unique, or another solution for Problem#2
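
As an illustration of the "random data" option, the sketch below adds a random nonce to an operation before hashing. The `noncedOp` type, the 20-byte nonce length and the SHA-256-over-JSON ID scheme are all assumptions made for the example, not the layout of identity.Version.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// noncedOp is a hypothetical operation carrying extra random data.
type noncedOp struct {
	Type     string `json:"type"`
	UnixTime int64  `json:"timestamp"`
	Status   string `json:"status"`
	Nonce    []byte `json:"nonce"` // random bytes, only there to add entropy
}

// newCloseOp builds a "close" operation with a fresh random nonce.
func newCloseOp(unixTime int64) (noncedOp, error) {
	nonce := make([]byte, 20) // length chosen arbitrarily for this sketch
	if _, err := rand.Read(nonce); err != nil {
		return noncedOp{}, err
	}
	return noncedOp{Type: "set-status", UnixTime: unixTime, Status: "closed", Nonce: nonce}, nil
}

// noncedID hashes the serialized operation, nonce included (assumed scheme).
func noncedID(op noncedOp) string {
	data, err := json.Marshal(op)
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	a, errA := newCloseOp(1560000000)
	b, errB := newCloseOp(1560000000) // same second, same semantic content
	if errA != nil || errB != nil {
		panic("could not read random data")
	}
	fmt.Println("collision:", noncedID(a) == noncedID(b))
}
```

With enough random bytes, an accidental collision becomes negligibly unlikely, but this does nothing against a Level 2 actor who copies an existing nonce on purpose.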

Problem#2: support decentralized workflow & operation ordering

As of now, git-bug enforces a single linear chain of operations. Those operations are interpreted in the order defined by this chain, which means that the oldest operation takes precedence. If an operation's semantics don't make sense after the action of another, it is ignored.
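
A rough sketch of this "interpret in chain order, ignore what no longer makes sense" rule follows. The `statusOp` and `snapshot` types are hypothetical, and treating "close an already closed bug" as the ignored case is an assumption chosen for illustration.

```go
package main

import "fmt"

// statusOp is a hypothetical operation that opens or closes a bug.
type statusOp struct {
	Author string
	Close  bool
}

// snapshot is the state obtained by replaying a chain of operations.
type snapshot struct {
	Closed  bool
	Applied int
	Ignored int
}

// compile replays the chain in order. An operation that would not change
// the state (for example closing an already closed bug) is ignored.
func compile(chain []statusOp) snapshot {
	var s snapshot
	for _, op := range chain {
		if op.Close == s.Closed {
			s.Ignored++
			continue
		}
		s.Closed = op.Close
		s.Applied++
	}
	return s
}

func main() {
	chain := []statusOp{
		{Author: "alice", Close: true},
		{Author: "bob", Close: true}, // no effect: alice's older operation wins
		{Author: "alice", Close: false},
	}
	fmt.Printf("%+v\n", compile(chain))
}
```

Because the result depends on the order of the chain, anyone able to reorder operations can change which ones end up ignored.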

When someone wants to push new operations, git-bug enforces this single chain of operations and a fast-forward-only update. This means that if a conflict exists, the sender has to retrieve the remote chain (pull) and apply their new operations on top before pushing.

This method works well with a centralized sharing topology (which I expect would be 90% of the actual usage), but it's much less clear to me what happens with multiple repositories pushing to and pulling from each other. It would be very valuable to ensure that such a scenario works correctly, as it would allow for a fully decentralized workflow.

At the same time, a solution to this problem has to ensure that a malicious actor cannot inject or reorder operations, as that would effectively allow them to disable other operations or tamper with the semantics of a bug.

Potential solutions:

add a Lamport clock value on each operation
add a vector clock value on each operation, with the process ID being the user's ID
add on each operation the ID of the previous operation
allow a non-linear DAG of operations, with a computed merge. Inspiration: https://github.com/matrix-org/matrix-doc/blob/erikj/state_res_msc/proposals/1442-state-resolution.md
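
For reference, the Lamport clock option can be sketched as below. This is a minimal, non-thread-safe illustration; the `lamportClock` type and its method names are made up for the example and are not an existing git-bug API.

```go
package main

import "fmt"

// lamportClock is a minimal logical clock. It is not safe for concurrent use.
type lamportClock struct {
	counter uint64
}

// Tick advances the clock for a local event and returns the new value,
// which would be stored on the operation being created.
func (c *lamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Witness records a clock value seen on a remote operation, so that
// operations created afterwards get a strictly greater value.
func (c *lamportClock) Witness(remote uint64) {
	if remote > c.counter {
		c.counter = remote
	}
}

func main() {
	var alice, bob lamportClock

	opA := alice.Tick() // alice creates an operation
	bob.Witness(opA)    // bob pulls and sees alice's operation
	opB := bob.Tick()   // bob's next operation is ordered after it

	fmt.Println(opA, opB, opA < opB)
}
```

A Lamport clock only gives an ordering consistent with causality: two concurrent operations can still carry the same value, so a deterministic tie-breaker is needed, and a Level 2 actor is free to pick whatever value they like.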

Selected solutions (WIP)

data validation

A bug's operations are checked to make sure that no two operations have the same ID. If they do, the bug is considered invalid, which prevents 1) adding operations that would create a collision and 2) merging a remote bug that contains a collision. This will mechanically prevent a bad bug from propagating in the network.
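
The check itself can be as small as the following sketch. `validateUniqueIDs` and `errDuplicateID` are hypothetical names, and IDs are represented as plain strings for simplicity.

```go
package main

import (
	"errors"
	"fmt"
)

// errDuplicateID is a hypothetical sentinel error for this sketch.
var errDuplicateID = errors.New("duplicate operation ID")

// validateUniqueIDs returns an error as soon as the same ID appears twice
// among the operations of a bug.
func validateUniqueIDs(ids []string) error {
	seen := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		if _, dup := seen[id]; dup {
			return fmt.Errorf("%w: %s", errDuplicateID, id)
		}
		seen[id] = struct{}{}
	}
	return nil
}

func main() {
	fmt.Println(validateUniqueIDs([]string{"a1", "b2", "c3"}))
	fmt.Println(validateUniqueIDs([]string{"a1", "b2", "a1"}))
}
```

Running the same validation both before appending a local operation and before accepting a merge covers the two cases described above.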

millisecond timestamps

At the moment, timestamps are classic Unix timestamps, that is, with one-second precision. Switching to milliseconds could add more entropy to each operation and help prevent accidental collisions. Not a silver bullet, but fairly easy to do and low cost.
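
A quick sketch of the difference, using the standard `time` package (`UnixMilli` needs Go 1.17 or later). The `stamps` helper and the chosen instants are arbitrary and only there for the demonstration.

```go
package main

import (
	"fmt"
	"time"
)

// stamps returns the second-precision and millisecond-precision Unix
// timestamps for an instant given in nanoseconds since the epoch.
func stamps(unixNano int64) (sec, ms int64) {
	t := time.Unix(0, unixNano)
	return t.Unix(), t.UnixMilli()
}

func main() {
	// Two instants 300ms apart, inside the same second.
	const first = 1_559_390_400_100_000_000
	const second = 1_559_390_400_400_000_000

	s1, m1 := stamps(first)
	s2, m2 := stamps(second)

	fmt.Println("same second:     ", s1 == s2)
	fmt.Println("same millisecond:", m1 == m2)
}
```

Two operations created within the same millisecond would still collide, which is why this is only a mitigation next to the validation step.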

Michael Muré (MichaelMure) commented 6 years ago

Obviously, any help on this would be super valuable. @sandhose maybe ?

hoijui (hoijui) commented 6 years ago

Have you thought about versioning the data model? .. or is it already versioned? I only know what is written in model.md

hoijui (hoijui) commented 6 years ago

I would like to integrate the answer to this and other questions into model.md.

I like your project and your way of managing it a lot, and I think highly of your technical skills. I think we would all be best off if all devs interested in a distributed bug tracker would join here rather than create their own version. The data model is key for that though, so I would want it as well documented as possible, including reasoning for why it is as it is, and.. I think it is great you started a discussion about it now! :-)

Michael Muré (MichaelMure) commented 6 years ago

The data model is versioned in several places:

https://github.com/MichaelMure/git-bug/blob/master/bug/operation_pack.go#L12 (for Bug)
https://github.com/MichaelMure/git-bug/blob/master/identity/version.go#L16 (for Identity)
https://github.com/MichaelMure/git-bug/blob/master/cache/repo_cache.go#L30 (for the indexing cache)

model.md is really just an introduction into how things work under the hood, the reasoning of those choices so that people can collaborate with a common understanding. Eventually it would need to be upgraded into a more formal specification so that more interoperability can be created with other systems, but that's a ton of work ...

Please note an important point though: this document is outdated. With the 0.5.0 release, the Identities have been separated into their own data structure inside git, similar to what has been done for Bug. The reasoning is explained in the release log. I've meant to update the documentation, but well ... Help is super super welcome here.

I think we would all be best off if all devs interested in a distributed bug tracker would join here rather than create their own version.

That'd be awesome :-) I expect the next release to have a decent amount of attention considering that the bridges with Github and Gitlab are now functional, thanks to @A-Hilaly . Maybe that will be enough to grow the contributor list ?

I like your project and your way of managing it a lot, and I think highly of your technical skills.

Thanks :-)

hoijui (hoijui) commented 6 years ago

Ahh yeah I have read about this.. Identity separation stuff! I also very much liked when you said it could be a separate project! It would be very annoying to have multiple systems of identity management for a single project (one for bugs, one for the wiki, ...). On the other hand, it would be a project even more meaningful than git-bug itself to have a well done distributed identity management. It could be useful well beyond git projects.

I totally agree that having the model documentation more formal would be great! Do you have any ideas how you would like this done? The thing that came to my mind now is to use OWL/RDF. I am just learning about this for two other projects I got involved in, but it seems to fit somehow.

Michael Muré (MichaelMure) commented 6 years ago

Hmm, I don't have a preference. In any case, any doc is better than no docs ;)

hoijui (hoijui) commented 6 years ago

ok :-) yeah true.. whatever format.. once it is somewhat well documented, conversion between formats should be pretty easy anyway.

Michael Muré (MichaelMure) commented 6 years ago

I merged https://github.com/MichaelMure/git-bug/pull/213 as a first step to address this. It's a light change to ensure that hash collisions are detected and rejected, both when manipulating a bug and when merging a remote state (meaning, it stops the propagation of bad data in the network).

To go further than that, I'm leaning towards supporting a full DAG of operations instead of enforcing the linear chain, but I haven't fully made up my mind yet.

http://archagon.net/blog/2018/03/24/data-laced-with-history/ is still on my read list.

Paul "Joey" Clark (joeytwiddle) commented 5 years ago

Not sure if this 60 minute conference talk helps directly, but it's in a similar space, so it might provide some inspiration. The presenter reimplements (or rather demonstrates) the major components of a blockchain using Git.

He identifies various attacks and then implements a solution to each one. He runs a validator on each commit, and cherry-picks rather than merging. He adds new data in new immutable files (rather than updating any common files, which frequently leads to merge conflicts). He uses git subtree so consumers don't need the entire chain at once (only miners do).

Git as Blockchain - Michael Perry - NDC Sydney 2018

Michael Muré (MichaelMure) commented 5 years ago

@joeytwiddle thanks, it's a cool talk.

However it doesn't apply here. The whole point of this talk is to present the blockchain concept from the ground up. So, when reaching the point of merging conflicts, the proposed solution is to accept both versions, let them grow and consider only the longest chain valid. It's a valid way to do things but has a major drawback: it means that the losing chain gets entirely discarded. We lose information, and we don't want to do that here :)

Mark Hughes (happybeing) (happybeing) commented 5 years ago

Hi Michael, I've only just read the OP and skimmed the comments but recognise the issue as it is key to my use case of a p2p hosted git portal based on git-bug.

It's an area I believe can be solved, although I've not gone into detail yet. With Safe Network, which is my primary target, some of these issues are solved for us (identity, for example). One issue we're left with is how to manage things like spamming of submissions. I have some ideas for how that could be handled, but there is much to do first.

Michael Muré (MichaelMure) commented 3 years ago

FYI, this has finally been tackled. The result is the entity and entity/dag package: https://github.com/MichaelMure/git-bug/tree/master/entity/dag

This also makes it quite easy to implement a new distributed data structure like Bug. An example is available here: https://github.com/MichaelMure/git-bug/blob/master/entity/dag/example_test.go

Michael Muré (MichaelMure) closed the bug 3 years ago

Michael Muré (MichaelMure) commented 3 years ago

Also, documentation: https://github.com/MichaelMure/git-bug/blob/master/doc/model.md
