Skip to content

WIP: Reftables support #5462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

pks-t
Copy link
Member

@pks-t pks-t commented Mar 25, 2020

So this is a very rough reftables refdb implementation for libgit2. It's main intent is to start a discussion, see what prerequisites we have in libgit2 to properly implement refdbs other than refdb_fs and whether I'm using the reftable API correctly. Thoughts:

  • While we theoretically support reference databases other than the filesystem-based one, our support for that is... lacking. We have dozens of places where we assume $GITDIR/HEAD exists, we write references like MERGE_HEAD, ORIG_HEAD manually without consulting the refdb, lack support to create new repos with a different refdb as we create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD directly in repository.c without consulting the refdb at all. We'll have to carefully vet our code to remove all accesses to anything refish outside of the refdb and instead call into the refdb layer.

  • It seems to me that reftables currently don't support a few operations that we expect from any given refdb implementation. Most importantly I couldn't figure out how to delete and rename a reflog and how to implement explicit locking (e.g. what the file backend does when creating .git/refs/heads/master.lock).

  • I found it quite hard to get into reftables as the interface exposes about quite a lot of the low-level details. This is probably my own fault as I usually tend to dive heads-deep into anything new without looking left and right first. But one resource that would've helped quite a lot is something like high-level examples that represent common git operations. I feel like I've got it right by now, but I'd be happy to get feedback on this to see how wrong I am.

  • Are there any testvectors around? Like a repository that one may embed into libgit2 itself that uses reftables to verify we behave as expected. I couldn't get a hold of any and as it's currently non-trivial to use reftables anywhere I wasn't able to come up with tests yet. As a result, my current code will probably fail very quickly.

  • At some places I feel like there's high potential for namespacing issues. E.g. struct iterator is highly generic and may trivially clash with any preexisting types. It would be nice to use namespace prefixes for all exported symbols.

  • I think it would help if the C files were layed out in a more "standard" library way. Most importantly, separate out the "reftable.h" header file from the rest into its own folder so that one can add it to the include path without having to care about accidentally overriding "tree.h" that's part of libgit2 itself.

  • I didn't find any way to override memory allocation functions. I think we've discussed that at the Contributor's summit already.

Anyway, thanks a lot for working on reftables @hanwen, this is really exciting! I'd be happy to get feedback on my current implementation, which basically represents how I think it should work. Chances are high I've got it completely wrong, though ;)

@pks-t pks-t force-pushed the pks/reftables-support branch from 8f53ad6 to 83ceed9 Compare March 25, 2020 14:24
&entry->oid_old, &entry->oid_cur,
NULL, entry->msg)) < 0)
goto out;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably need to deduplicate pre-existing reflog entries. With filesystem-based reflog, we're rewriting the complete reflog every time, I think, but with reftables I'll need to figure out which entries are new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think you are right. (I don't understand why you get duplicate entries, though.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly about git_reflog semantics in libgit2. When you pass it to the filesystem-based reflog, you need to have the complete reflog history in order to re-write it in full. So with reftable I'll need to see which entry is the last one that we've already written to disk and only start writing after that one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. In reftable, the log entries are stored newest-first, so a simple seek to the refname will tell you the last entry written.

Copy link
Contributor

@hanwen hanwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filesystem-based one, our support for that is... lacking. We have
dozens of places where we assume $GITDIR/HEAD exists, we write
references like MERGE_HEAD, ORIG_HEAD manually without consulting the
refdb, lack support to create new repos with a different refdb as we
create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD
directly in repository.c without consulting the refdb at all. We'll
have to carefully vet our code to remove all accesses to anything
refish outside of the refdb and instead call into the refdb layer.

Yes, I think git-core also suffers from this.

It seems to me that reftables currently don't support a few operations
that we expect from any given refdb implementation. Most importantly I
couldn't figure out how to delete and rename a reflog and how to
implement explicit locking (e.g. what the file backend does when
creating .git/refs/heads/master.lock).

For deleting a complete reflog, you'd have to delete all the entries
individually. For renaming a reflog, the suggestion is to record a
deletion + creation. This would let you retractively understand that
the rename took place.

Why do you want explicit locking? The API is written as if you'd be
doing transactions, i.e. you try to make the change, and if you get a
LOCK_ERROR, you'll have to try again.

I found it quite hard to get into reftables as the interface exposes
about quite a lot of the low-level details. This is probably my own
fault as I usually tend to dive heads-deep into anything new without
looking left and right first. But one resource that would've helped
quite a lot is something like high-level examples that represent
common git operations. I feel like I've got it right by now, but I'd
be happy to get feedback on this to see how wrong I am.

the best resource is probably code
gitgitgadget/git#539, which adds support to
git-core.

Are there any testvectors around? Like a repository that one may embed
into libgit2 itself that uses reftables to verify we behave as
expected. I couldn't get a hold of any and as it's currently
non-trivial to use reftables anywhere I wasn't able to come up with
tests yet. As a result, my current code will probably fail very
quickly.

The easiest is to compile JGit,

bazel build org.eclipse.jgit.pgm:pgm

this leaves a binary in

bazel-bin/org.eclipse.jgit.pgm/jgit

then you can do

jgit init
jgit convert-ref-storage --format reftable

There currently aren't test vectors, but it's a good suggestion. I
could add some to the repository.

At some places I feel like there's high potential for namespacing
issues. E.g. struct iterator is highly generic and may trivially clash
with any preexisting types. It would be nice to use namespace prefixes
for all exported symbols.

I want to do this, but I've been putting it off, because I don't have
an IDE that will help me with renaming, and doing it by hand is
somewhat labor intensive.

I think it would help if the C files were layed out in a more
"standard" library way. Most importantly, separate out the
"reftable.h" header file from the rest into its own folder so that one
can add it to the include path without having to care about
accidentally overriding "tree.h" that's part of libgit2 itself.

I'll get to that.

I didn't find any way to override memory allocation functions. I think
we've discussed that at the Contributor's summit already.
Anyway, thanks a lot for working on reftables @hanwen, this is really
exciting! I'd be happy to get feedback on my current implementation,
which basically represents how I think it should work. Chances are
high I've got it completely wrong, though ;)

you're welcome! I am glad to see it get uptake :)

&entry->oid_old, &entry->oid_cur,
NULL, entry->msg)) < 0)
goto out;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think you are right. (I don't understand why you get duplicate entries, though.)

@pks-t
Copy link
Member Author

pks-t commented Mar 26, 2020

Why do you want explicit locking? The API is written as if you'd be
doing transactions, i.e. you try to make the change, and if you get a
LOCK_ERROR, you'll have to try again.

Because we're kind of in a special place with libgit2. With filesystem based refs, you can explicitly lock references to keep other applications from updating them by themselves. As libgit2 may be part of a long-running application (e.g. an IDE, but it could be anything) where the author of that application may have the need to lock references for a longer time, we support a transactional API that keeps other Git applications from updating them. While it's true that you would get an error anyway with the reftable API, this kind of opportunistic locking wouldn't help in such a usecase.

Thanks for your answers, I'll make sure to update the PR soonish with your feedback!

@pks-t pks-t force-pushed the pks/reftables-support branch 2 times, most recently from ee17225 to 0bfe212 Compare March 26, 2020 09:10
@hanwen
Copy link
Contributor

hanwen commented Mar 26, 2020 via email

@pks-t pks-t force-pushed the pks/reftables-support branch from 5944c49 to 65c0d65 Compare March 27, 2020 13:30
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <zlib.h>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently doesn't compile on Windows due to <sys/time.h> not being found. I think it should work just fine if changing this to <time.h>. Instead, we could probably have the equivalent of REFTABLE_IN_GITCORE, but I think I'd rather prefer the header to be agnostic of libgit2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On linux, this would yield:

c/stack.c:226:12: error: implicit declaration of function 'gettimeofday' [-Werror=implicit-function-declaration]
226 | int err = gettimeofday(&deadline, NULL);
| ^~~~~~~~~~~~

@pks-t
Copy link
Member Author

pks-t commented Mar 27, 2020

Thanks a lot for addressing my concerns, @hanwen! I've updated the PR to include all of them.

@pks-t
Copy link
Member Author

pks-t commented Mar 28, 2020 via email

@hanwen
Copy link
Contributor

hanwen commented Mar 28, 2020

For the ref-like objects, this is what JGit does
https://eclipse.googlesource.com/jgit/jgit/+/459fc28cf6a22992b7d168cfd52286cd76ca68bc/org.eclipse.jgit/src/org/eclipse/jgit/lib/RefDatabase.java#511

(https://eclipse.googlesource.com/jgit/jgit/+/459fc28cf6a22992b7d168cfd52286cd76ca68bc/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/RefDirectory.java#113)

In JGit, HEAD is expected to be managed by RefDatabase. Unfortunately, nobody has been able to articulate to me what the expectations/abstractions for this in git-core should do. Can you do transactional updates with HEAD and another branch together in libgit2?

@pks-t
Copy link
Member Author

pks-t commented Mar 28, 2020 via email

@hanwen
Copy link
Contributor

hanwen commented Mar 30, 2020

Hey Patrick,

can you check out

https://godoc.org/github.com/google/reftable#Addition

and

https://godoc.org/github.com/google/reftable#Stack.NewAddition

would that work as an API for you?

@pks-t
Copy link
Member Author

pks-t commented Mar 31, 2020

Just to make sure I understand it, this is how I'd imagine it to look like in C:

/* Create a new transactional addition for the stack */
reftable_stack_new_addition(&addition, stack);
/* Write a set of ref/log records via the usual writer callback */
reftable_addition_add(addition, reftable_backend_write_records, &args1);
/* Writing multiple times is okay for a single addition */
reftable_addition_add(addition, reftable_backend_write_records, &args2);
/* Now commit changes to persist it to disk and update the stack */
reftable_addition_commit(addition);

If I understand it correctly, then Close() will not commit any data but just release the Addition, right? So it's kind of a rollback/abort function.

If my above assumptions are correct then the API should work for us, thanks a lot!

/* 4-byte identifier ("sha1", "s256") of the hash.
* Defaults to SHA1 if unset
*/
uint32_t hash_id;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't quite work as advertised. I've set up reftable_write_options opts ={0}, where I'd expect it to just default to "sha1". But reftable_new_merged_table() fails when comparing r->hash_id != hash_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be helpful to have some macros available so that one may set up the expected values without having to define them locally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't repro your complaint, see

google/reftable@8a04e7f

can you provide a repro testcase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to check in for comments. The following fails if I do not set up options.hash_id:

const char *dir, *table;
struct reftable_write_options options = {0};
struct reftable_ref_record ref = {0};
struct reftable_stack *stack;

dir = "/path/to/testrepo-reftable/.git/reftable";
table = "/path/to/testrepo-reftable/.git/reftable/tables.list";

/* This line is required, otherwise `reftable_new_stack` fails. */
options.hash_id = 0x73686131;

cl_must_pass(reftable_new_stack(&stack, dir, table, options));
cl_must_pass(reftable_stack_reload(stack));
cl_must_pass(reftable_stack_read_ref(stack, "refs/heads/master", &ref));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm currently not using an up to date version of the reftable library as I first want to try to get what I have completely functioning (which it mostly is already, but there's still some issues with writing for preexisting branches).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which version are you using?

btw, if you are writing existing branches, make sure you sure you set an update_index that is beyond the update_index of the existing branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed in google/reftable@f781daf

@pks-t pks-t force-pushed the pks/reftables-support branch from 65c0d65 to 7b53cfc Compare March 31, 2020 09:13
@pks-t
Copy link
Member Author

pks-t commented Mar 31, 2020

I've added a test vector and was able to get a first test to pass with the reftable backend, which reads a reference and then renames it. More tests and bug fixes will follow.

I've fixed the refdb detection to now consult "extensions.refStorage" as mandated by the documentation.

I've noticed that another requirement is to get repository format version 1 implemented. There's a temporary commit in here that just removes the error check.

@hanwen
Copy link
Contributor

hanwen commented Mar 31, 2020

google/reftable@7aabfe0 for the transaction api in C.

@hanwen
Copy link
Contributor

hanwen commented Mar 31, 2020

BTW, - I would be grateful for any feedback about reftable on windows. I don't have windows, and I'd like to keep it that way.

@pks-t pks-t force-pushed the pks/reftables-support branch from 7b53cfc to 4c2ab8f Compare April 2, 2020 11:38
@pks-t
Copy link
Member Author

pks-t commented Apr 2, 2020

Rebased on current master, which now includes support for repository format v1 and extensions.

BTW, - I would be grateful for any feedback about reftable on windows. I don't have windows, and I'd like to keep it that way.

I may dig up my Windows VM as soon as I've got the backend working on my own machine. Until then there's currently only the CI that provides any kind of feedback on Windows.

@pks-t pks-t force-pushed the pks/reftables-support branch from 4c2ab8f to 4ab39b1 Compare April 3, 2020 13:34
@pks-t pks-t force-pushed the pks/reftables-support branch 2 times, most recently from 65fc25f to 70fd8b1 Compare April 19, 2020 12:04
@hanwen
Copy link
Contributor

hanwen commented Apr 27, 2020

@pks-t - does libgit2 already have code to detect file / directory conflicts? Looks like I may need to write and/or refactor that to check transactions, and wondering if that makes sense to put into the reftable library.

@pks-t
Copy link
Member Author

pks-t commented Apr 27, 2020 via email

@pks-t pks-t force-pushed the pks/reftables-support branch from 70fd8b1 to 6276b81 Compare April 30, 2020 12:49
@pks-t
Copy link
Member Author

pks-t commented Apr 30, 2020

Rebased on latest master, updated the reftable library to 66d38fe (C: move block_iter_next comment to header, 2020-04-28), fixed a set of bugs in the reflog implementation and improved error handling in most places. All in all the PR is close to being finished, but I've stumbled over one weird bug where compressing of the reftable stack seems to fail after writing symbolic references. No idea yet whether that's a fault of mine or in the reftable library, but I'll investigate.

@pks-t pks-t force-pushed the pks/reftables-support branch 2 times, most recently from efcc341 to d3195c3 Compare July 4, 2025 14:32
@pks-t
Copy link
Member Author

pks-t commented Jul 4, 2025

I've updated the pull request a bit:

  • It's now based on the latest main branch.
  • The bundled reftable library has been updated to the latest version of Git.
  • The way we handle reflogs has been adapted to work better.
  • Various other small fixes.

What is still missing:

  • There still is a bug with reflogs somewhere, which causes 16 tests to fail.
  • There is no conflict detection yet for D/F-style conflicts, which causes another three test failures.
  • The integration bits in "reftable/system.c" need to be made actually portable.

@pks-t pks-t force-pushed the pks/reftables-support branch from d3195c3 to fcc2bd3 Compare July 7, 2025 12:36
@pks-gitlab
Copy link

There is no conflict detection yet for D/F-style conflicts, which causes another three test failures.

This has been addressed now, there is proper conflict detection in the reftable backend now. Still remaining is:

  • "reftable/system.{c,h}" portability fixes.
  • Reflogs are still somewhat broken.

Overall, 13 tests are currently failing, all due to reflogs.

@pks-t pks-t force-pushed the pks/reftables-support branch 9 times, most recently from 953b45b to 98a686f Compare July 7, 2025 15:53
pks-t added 14 commits July 11, 2025 17:42
To support multiple different reference backend implementations,
Git introduced a "refStorage" extension that stores the reference
storage format a Git client should try to use.

Wire up the logic to read this new extension when we open a repository
from disk. For now, only the "files" backend is supported by us. When
trying to open a repository that has a refstorage format that we don't
understand we now error out.

There are two functions that create a new repository that doesn't really
have references. While those are mostly non-functional when it comes to
references, we do expect that you can access the refdb, even if it's not
yielding any refs. For now we mark those to use the "files" backend, so
that the status quo is retained. Eventually though it might not be the
worst idea to introduce an explicit "in-memory" reference database. But
that is outside the scope of this patch series.
While we only support initializing repositories with the "files"
reference backend right now, we are in the process of implementing a
second backend with the "reftable" format. And while we already have the
infrastructure to decide which format a repository should use when we
open it, we do not have infrastructure yet to create new repositories
with a different reference format.

Introduce a new field `git_repository_init_options::refdb_type`. If
unset, we'll default to the "files" backend. Otherwise though, if set to
a valid `git_refdb_t`, we will use that new format to initialize the
repostiory.

Note that for now the only thing we do is to write the "refStorage"
extension accordingly. What we explicitly don't yet do is to also handle
the backend-specific logic to initialize the refdb on disk. This will be
implemented in subsequent commits.
In our tests for "onbranch" config conditionals we set HEAD to point to
various different branches via `git_repository_create_head()`. This
function circumvents the refdb though and directly writes to the "HEAD"
file. While this works now, it will create problems once we have
multiple refdb backends.

Furthermore, the function is about to go away in the next commit. So
let's prepare for that and use `git_reference_symbolic_create()`
instead.
The initialization of the on-disk state of refdbs is currently not
handled by the actual refdb backend, but it's implemented ad-hoc where
needed. This is problematic once we have multiple different refdbs as
the filesystem structure is of course not the same.

Introduce a new callback function `git_refdb_backend::init()`. If set,
this callback can be invoked via `git_refdb_init()` to initialize the
on-disk state of a refdb. Like this, each backend can decide for itself
how exactly to do this.

Note that the initialization of the refdb is a bit intricate. A
repository is only recognized as such when it has a "HEAD" file as well
as a "refs/" directory. Consequently, regardless of which refdb format
we use, those files must always be present. This also proves to be
problematic for us, as we cannot access the repository and thus don't
have access to the refdb if those files didn't exist.

To work around the issue we thus handle the creation of those files
outside of the refdb-specific logic. We actually use the same strategy
as Git does, and write the invalid reference "ref: refs/heads/.invalid"
into "HEAD". This looks almost like a ref, but the name of that ref
is not valid and should thus trip up Git clients that try to read that
ref in a repository that really uses a different format.

So while that invalid "HEAD" reference will of course get rewritten by
the "files" backend, other backends should just retain it as-is.
Import the reftable library from commit f1228cd12c1 (reftable: make
REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of
the reftable library.

The library will be wired into libgit2 over the next couple of commits.
While the reftable library is mostly decoupled from the Git codebase,
some specific functionality intentionally taps into Git subsystems. To
make porting of the reftable library easier all of these are contained
in "system.h" and "system.c".

Reimplement those compatibility shims so that they work for libgit2.
The reftable writer accidentally uses the Git-specific `QSORT()` macro.
This macro removes the need for the caller to provide the element size,
but other than that it's mostly equivalent to `qsort()`.

Replace the macro accordingly. This incompatibility is a bug in Git that
needs to be fixed upstream.
While perfectly legal, older compiler toolchains complain when
zero-initializing structs that contain nested structs with `{0}`:

	/home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
		struct reftable_addition empty = REFTABLE_ADDITION_INIT;
						 ^~~~~~~~~~~~~~~~~~~~~~
	/home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT'
	#define REFTABLE_ADDITION_INIT {0}
					^

Silence this warning by using `{{0}}` instead.
Wire up the reftable library with CMake. At the current point in time
the library is not yet plugged into libgit2 itself.
Implement the reftable backend that is used to read and write reftables.
The backend is not yet used anywhere after this commit.
@pks-t pks-t force-pushed the pks/reftables-support branch from 98a686f to 392cffb Compare July 14, 2025 05:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants