Data model: why deltas and not snapshots?

Labels: area/data-model resolution/closed

xaur (xaur) opened 7 years ago

Hello! Thanks for your amazing work. I have a big need for a decent distributed issue tracker, and git-bug looks very promising. I'm also looking for a way to grab all GitHub issue and pull request data, and the recent exporter work sounds like what I need.

I've read the data model and have some nagging questions. Apologies if this was discussed previously; I would appreciate any pointers to earlier discussions.

1. Why store deltas instead of snapshots?

Anybody can create and edit bugs at the same time as you. To deal with this problem, you need a way to merge these changes in a meaningful way. Instead of storing directly the final bug data, we store a series of edit Operation.

First thought as I read this: didn't Git solve this?

One of the major breakthroughs of Git was to store snapshots of files instead of deltas like SVN did. The hope was that tools would evolve to compute better and better deltas from two snapshots. And it happened: modern diff and merge tools are pretty smart. In Git anybody can create and edit files at the same time as me, and to deal with this problem we have merge tools and strategies.

From this perspective, I'm puzzled by the choice to store Operations (deltas) over snapshots. At any given moment you don't have complete versions (snapshots) of bugs - to get the state of bugs you need to compute deltas from the very beginning. This means you can't edit bugs as simple files and commit them with Git, you need to always use special software that talks the language of Operations' storage format.

Bugs and comments could be simple Markdown files with YAML metadata on top (compatible with the popular "front matter" convention). Alternatively, bugs could be metadata files without a body, with an optional .md description file (or the description could be implemented as the first comment). In any case I'd suggest YAML over JSON.

2. Rebase is not the best merge strategy

Now that we have this, we can easily merge our bugs without conflict. When pulling bug's update from a remote, we will simply add our new operations (that is, new Commit), if any, at the end of the chain. In git terms, it's just a rebase.

This reads like it will apply remote changes on top of mine and always override what I had. If state snapshots are stored instead of deltas (following a common Git way), I could fetch remote changes and decide myself how to integrate them with my state (using rebase or merge or whatever). This would allow standard Git ways to be used, e.g. a ton of people have cloned the repo but a select few of them are considered "upstreams" by everyone else for their better curation work.

Michael Muré (MichaelMure) commented 7 years ago

Why store deltas instead of snapshots?

Essentially, it boils down to this: git-bug is a bug tracker and git is a code/content tracker. Those are two different problems and should be considered individually to see what properties are desired and what tradeoff can or has to be made.

On the subject you are mentioning, there is a key distinction in the desired properties: with a content tracker you want absolute control over the result and are willing to deal with manual merges and conflict resolution to achieve that, whereas with a bug tracker you care about the intent of the participating users. Said otherwise, it's not a huge problem if merging two bug states doesn't go exactly as expected, as long as what each actor wanted is clear and visible (that is, each change is clearly mentioned in the timeline).

That means that a different tradeoff can/should be made. In this case, it's more interesting (at least imho) to have a low-friction tool that does the merging autonomously, where we can collaborate freely without even thinking about the internals of the tool as you have to with git ("should I rebase or merge here?"), and that anybody, even non-developers (that is, users), can use.

So what happens if git-bug does get a merge wrong? Let's say Alice changes the status of a bug to some value and Bob changes it another way. If the merge fails somehow, we end up with the wrong status, but we see who wanted what. It's a simple matter of changing the status again to fix the situation. The conflict resolution is explicit (that's a feature) in the timeline, while it would have been hidden in a git-like process.
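To illustrate the Alice/Bob case, here is a minimal sketch of replaying an operation log. The `SetStatus` class and the `replay` function are hypothetical names for illustration, not git-bug's actual types. The point is only that the final status depends on the order of operations, while the timeline keeps every intent visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetStatus:
    """Hypothetical operation: one author sets the bug status."""
    author: str
    status: str


def replay(ops):
    """Fold an ordered operation log into a final status and a timeline."""
    status = "open"  # assumed initial state of a new bug
    timeline = []
    for op in ops:
        # Every operation stays visible, even if a later one overrides it.
        timeline.append(f"{op.author} set status to {op.status}")
        status = op.status
    return status, timeline


# Alice and Bob both change the status; after the merge Bob's operation is last.
ops = [SetStatus("alice", "closed"), SetStatus("bob", "open")]
status, timeline = replay(ops)
```

If the resulting status is not the one wanted, fixing it is one more `SetStatus` appended to the log, and the timeline still shows who wanted what.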

Also, note one thing: bug editing is a slow process (there are usually at least hours between events), and the events where a bad merge could be a problem (say labels or status) are usually made by a single actor, the maintainer or the developer working on the problem. I expect this merging problem to be quite rare in practice.

If you accept that this is the behavior you want for a bug tracker, then storing deltas instead of snapshots makes more sense.

I'll explain another part of that tradeoff below.

This reads like it will apply remote changes on top of mine and always override what I had.

It's the other way around. When pulling the latest state from the remote, you apply your newly made changes on top.

That's part of the design/tradeoff explained above. When you make some changes offline or without being up to date with the discussion, git-bug will tell you that there was a potential conflict. The nice thing is that you are also the best person to figure out whether it was merged properly, so you are able to fix a bad merge if needed.
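The pull behavior described here can be sketched as follows. This is a simplified model under stated assumptions: each chain is a plain list of operation identifiers, and `merge_on_pull` is a hypothetical helper, not git-bug's real implementation.

```python
def merge_on_pull(local, remote):
    """Rebase-style merge: keep the remote chain, replay local-only ops on top.

    Returns the merged chain and a flag that is True when both sides added
    operations after their common history (a potential conflict).
    """
    # Length of the history shared by both chains.
    common = 0
    for l_op, r_op in zip(local, remote):
        if l_op != r_op:
            break
        common += 1

    local_new = local[common:]
    remote_new = remote[common:]

    merged = remote + local_new
    potential_conflict = bool(local_new) and bool(remote_new)
    return merged, potential_conflict


# Local has one unpushed op; the remote gained two ops in the meantime.
merged, conflict = merge_on_pull(
    ["create", "alice-edit"],
    ["create", "bob-label", "bob-comment"],
)
```

In this model the local operation ends up last, so it is the local user's change that is applied on top of the remote state, not the reverse.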

Michael Muré (MichaelMure) added label Non-actionable 7 years ago

xaur (xaur) commented 7 years ago

I see how storing deltas is justified here to minimize the merge burden. I have trained many people to use Git, and merging state is indeed a huge barrier: there are a ton of concepts and techniques to learn, and even once you master them it still takes time.

For state such as status, title, labels or assignee it is indeed trivial to recover from any merge that has gone wrong. Initially I thought there would be a problem with merging comment edits.

To clarify my use case, I use comment editing a lot. Sometimes I have "long running" issues where multiple users make multiple edits to the description to accumulate enough information (and it can grow into a large piece of text). Another case is when a contributor makes a very good comment: I can edit it to improve the structure and fix typos, because that comment can potentially be a good reference for years.

I guess that, as you noted, the low frequency of events and the small number of people doing edits (often one) apply to comment editing as well. And if participants coordinate through a centrally hosted repository, they would do their best to push early, just like we do with Git. It would be even better if it were possible to discard or adjust local changes after a pull, to avoid spawning edits that correct your previous edits made before the pull.

Some background: my initial research was to find an incremental GitHub issue backup/archival tool that can grab all new comments and edits as simply as git fetch. I was very excited to read about the v0.4.0 release with an incremental GitHub importer that also supports comment edits. With a potential exporter, an offline client for GitHub issues with a full local copy of the data is much cooler than just a data backup, and allows less use of browsers.

Thanks for the answers! I think this can be closed now.

Michael Muré (MichaelMure) commented 7 years ago

You touch on a few good points though:

Currently, each comment edit operation carries the full new version of the comment. This means that data is likely to be lost when a merge occurs. As explained, I don't think it would happen often, as comment editing will be limited to a few people (author, admin ...), but still. We could at some point leverage git to actually merge the changes and reduce the problem even further. As you might guess, it's not ideal for storage size either, but git's delta compression should really limit that problem.
I expect an auto-push feature to exist at some point, to reduce the risk of having to merge anything.
It would be nice to have multiple non-pushed edits automatically merged into one locally.
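A small sketch of why full-text edit operations can lose data on merge, and of one possible reading of "merged into one locally". `EditComment`, `apply_edits` and `squash_unpushed` are hypothetical names for illustration and do not reflect git-bug's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditComment:
    """Hypothetical operation carrying the full new text of a comment."""
    author: str
    text: str


def apply_edits(original, edits):
    """Replay edits in order; each one replaces the whole comment text."""
    text = original
    for edit in edits:
        text = edit.text
    return text


def squash_unpushed(edits):
    """Collapse unpushed edits of one comment into the last one.

    Safe in this model because every edit already contains the full text.
    """
    return edits[-1:]


original = "Steps: run the tool"
bob = EditComment("bob", "Steps: run the tool with --verbose")
alice = EditComment("alice", "Steps to reproduce: run the tool")

# After a pull, Alice's local edit is replayed on top of Bob's remote edit.
final_text = apply_edits(original, [bob, alice])
```

Here `final_text` is Alice's version, and Bob's `--verbose` addition no longer appears in the comment, although his edit is still recorded in the operation log. A text-level merge of the two versions, as suggested above, would be needed to keep both changes.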

Michael Muré (MichaelMure) commented 4 years ago

Closing as it's not an issue.

Michael Muré (MichaelMure) closed the bug 4 years ago

sudoforge removed label Non-actionable 2 years ago

sudoforge added label resolution/closed 2 years ago

sudoforge added label area/data-model 2 years ago

sudoforge removed label resolution/closed 2 years ago

sudoforge added label resolution/closed 2 years ago
