-
Notifications
You must be signed in to change notification settings - Fork 2.5k
WIP: Reftables support #5462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
WIP: Reftables support #5462
Conversation
8f53ad6
to
83ceed9
Compare
src/refdb_reftable.c
Outdated
&entry->oid_old, &entry->oid_cur, | ||
NULL, entry->msg)) < 0) | ||
goto out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll probably need to deduplicate pre-existing reflog entries. With filesystem-based reflog, we're rewriting the complete reflog every time, I think, but with reftables I'll need to figure out which entries are new.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think you are right. (I don't understand why you get duplicate entries, though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly about git_reflog
semantics in libgit2. When you pass it to the filesystem-based reflog, you need to have the complete reflog history in order to re-write it in full. So with reftable
I'll need to see which entry is the last one that we've already written to disk and only start writing after that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. In reftable, the log entries are stored newest-first, so a simple seek to the refname will tell you the last entry written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filesystem-based one, our support for that is... lacking. We have
dozens of places where we assume $GITDIR/HEAD exists, we write
references like MERGE_HEAD, ORIG_HEAD manually without consulting the
refdb, lack support to create new repos with a different refdb as we
create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD
directly in repository.c without consulting the refdb at all. We'll
have to carefully vet our code to remove all accesses to anything
refish outside of the refdb and instead call into the refdb layer.
Yes, I think git-core also suffers from this.
It seems to me that reftables currently don't support a few operations
that we expect from any given refdb implementation. Most importantly I
couldn't figure out how to delete and rename a reflog and how to
implement explicit locking (e.g. what the file backend does when
creating .git/refs/heads/master.lock).
For deleting a complete reflog, you'd have to delete all the entries
individually. For renaming a reflog, the suggestion is to record a
deletion + creation. This would let you retractively understand that
the rename took place.
Why do you want explicit locking? The API is written as if you'd be
doing transactions, i.e. you try to make the change, and if you get a
LOCK_ERROR, you'll have to try again.
I found it quite hard to get into reftables as the interface exposes
about quite a lot of the low-level details. This is probably my own
fault as I usually tend to dive heads-deep into anything new without
looking left and right first. But one resource that would've helped
quite a lot is something like high-level examples that represent
common git operations. I feel like I've got it right by now, but I'd
be happy to get feedback on this to see how wrong I am.
the best resource is probably code
gitgitgadget/git#539, which adds support to
git-core.
Are there any testvectors around? Like a repository that one may embed
into libgit2 itself that uses reftables to verify we behave as
expected. I couldn't get a hold of any and as it's currently
non-trivial to use reftables anywhere I wasn't able to come up with
tests yet. As a result, my current code will probably fail very
quickly.
The easiest is to compile JGit,
bazel build org.eclipse.jgit.pgm:pgm
this leaves a binary in
bazel-bin/org.eclipse.jgit.pgm/jgit
then you can do
jgit init
jgit convert-ref-storage --format reftable
There currently aren't test vectors, but it's a good suggestion. I
could add some to the repository.
At some places I feel like there's high potential for namespacing
issues. E.g. struct iterator is highly generic and may trivially clash
with any preexisting types. It would be nice to use namespace prefixes
for all exported symbols.
I want to do this, but I've been putting it off, because I don't have
an IDE that will help me with renaming, and doing it by hand is
somewhat labor intensive.
I think it would help if the C files were layed out in a more
"standard" library way. Most importantly, separate out the
"reftable.h" header file from the rest into its own folder so that one
can add it to the include path without having to care about
accidentally overriding "tree.h" that's part of libgit2 itself.
I'll get to that.
I didn't find any way to override memory allocation functions. I think
we've discussed that at the Contributor's summit already.
Anyway, thanks a lot for working on reftables @hanwen, this is really
exciting! I'd be happy to get feedback on my current implementation,
which basically represents how I think it should work. Chances are
high I've got it completely wrong, though ;)
you're welcome! I am glad to see it get uptake :)
src/refdb_reftable.c
Outdated
&entry->oid_old, &entry->oid_cur, | ||
NULL, entry->msg)) < 0) | ||
goto out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think you are right. (I don't understand why you get duplicate entries, though.)
Because we're kind of in a special place with libgit2. With filesystem based refs, you can explicitly lock references to keep other applications from updating them by themselves. As libgit2 may be part of a long-running application (e.g. an IDE, but it could be anything) where the author of that application may have the need to lock references for a longer time, we support a transactional API that keeps other Git applications from updating them. While it's true that you would get an error anyway with the reftable API, this kind of opportunistic locking wouldn't help in such a usecase. Thanks for your answers, I'll make sure to update the PR soonish with your feedback! |
ee17225
to
0bfe212
Compare
On Wed, Mar 25, 2020 at 3:09 PM Patrick Steinhardt <notifications@github.com>
wrote
-
At some places I feel like there's high potential for namespacing
issues. E.g. struct iterator is highly generic and may trivially clash
with any preexisting types. It would be nice to use namespace prefixes for
all exported symbols.
I fixed this.
-
-
I think it would help if the C files were layed out in a more
"standard" library way. Most importantly, separate out the "reftable.h"
header file from the rest into its own folder so that one can add it to the
include path without having to care about accidentally overriding "tree.h"
that's part of libgit2 itself.
I moved reftable.h in a different directory.
-
I didn't find any way to override memory allocation functions. I think
we've discussed that at the Contributor's summit already.
I added this. reftable_set_malloc().
…--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
|
5944c49
to
65c0d65
Compare
#include <sys/time.h> | ||
#include <sys/types.h> | ||
#include <unistd.h> | ||
#include <zlib.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently doesn't compile on Windows due to <sys/time.h> not being found. I think it should work just fine if changing this to <time.h>
. Instead, we could probably have the equivalent of REFTABLE_IN_GITCORE
, but I think I'd rather prefer the header to be agnostic of libgit2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On linux, this would yield:
c/stack.c:226:12: error: implicit declaration of function 'gettimeofday' [-Werror=implicit-function-declaration]
226 | int err = gettimeofday(&deadline, NULL);
| ^~~~~~~~~~~~
Thanks a lot for addressing my concerns, @hanwen! I've updated the PR to include all of them. |
On Fri, Mar 27, 2020 at 07:05:28AM -0700, Han-Wen Nienhuys wrote:
@hanwen commented on this pull request.
On linux, this would yield:
c/stack.c:226:12: error: implicit declaration of function 'gettimeofday' [-Werror=implicit-function-declaration]
226 | int err = gettimeofday(&deadline, NULL);
| ^~~~~~~~~~~~
Too bad. So the reftable would need to check for some Windows-specific
defines and adjust accordingly.
By the way, getting back to a different topic: `HEAD` and its various
other incarnations `*_HEAD`. Are those expected to be managed by
reftable, too? It would make sense for some of them, but on the other
hand we have some that are references, but have a special format. The
most common one that comes to my mind is `FETCH_HEAD`, but there are
others, too.
|
For the ref-like objects, this is what JGit does In JGit, HEAD is expected to be managed by RefDatabase. Unfortunately, nobody has been able to articulate to me what the expectations/abstractions for this in git-core should do. Can you do transactional updates with HEAD and another branch together in libgit2? |
On Sat, Mar 28, 2020 at 08:18:02AM -0700, Han-Wen Nienhuys wrote:
In JGit, HEAD is expected to be managed by RefDatabase. Unfortunately,
nobody has been able to articulate to me what the
expectations/abstractions for this in git-core should do. Can you do
transactional updates with HEAD and another branch together in
libgit2?
Yes, this can be done via our transaction API. It does require the
reference backend to support locking though, which is not currently
possible with the reftable refdb.
I don't know if the issue is that reftables do not support locking or
whether our own assumptions were too centered on file refs and not
really on the transactional part. Anyway, this is nothing we can easily
change as it's part of our public interface.
|
Hey Patrick, can you check out https://godoc.org/github.com/google/reftable#Addition and https://godoc.org/github.com/google/reftable#Stack.NewAddition would that work as an API for you? |
Just to make sure I understand it, this is how I'd imagine it to look like in C: /* Create a new transactional addition for the stack */
reftable_stack_new_addition(&addition, stack);
/* Write a set of ref/log records via the usual writer callback */
reftable_addition_add(addition, reftable_backend_write_records, &args1);
/* Writing multiple times is okay for a single addition */
reftable_addition_add(addition, reftable_backend_write_records, &args2);
/* Now commit changes to persist it to disk and update the stack */
reftable_addition_commit(addition); If I understand it correctly, then If my above assumptions are correct then the API should work for us, thanks a lot! |
deps/reftable/include/reftable.h
Outdated
/* 4-byte identifier ("sha1", "s256") of the hash. | ||
* Defaults to SHA1 if unset | ||
*/ | ||
uint32_t hash_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this doesn't quite work as advertised. I've set up reftable_write_options opts ={0}
, where I'd expect it to just default to "sha1". But reftable_new_merged_table()
fails when comparing r->hash_id != hash_id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be helpful to have some macros available so that one may set up the expected values without having to define them locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, forgot to check in for comments. The following fails if I do not set up options.hash_id
:
const char *dir, *table;
struct reftable_write_options options = {0};
struct reftable_ref_record ref = {0};
struct reftable_stack *stack;
dir = "/path/to/testrepo-reftable/.git/reftable";
table = "/path/to/testrepo-reftable/.git/reftable/tables.list";
/* This line is required, otherwise `reftable_new_stack` fails. */
options.hash_id = 0x73686131;
cl_must_pass(reftable_new_stack(&stack, dir, table, options));
cl_must_pass(reftable_stack_reload(stack));
cl_must_pass(reftable_stack_read_ref(stack, "refs/heads/master", &ref));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I'm currently not using an up to date version of the reftable library as I first want to try to get what I have completely functioning (which it mostly is already, but there's still some issues with writing for preexisting branches).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which version are you using?
btw, if you are writing existing branches, make sure you sure you set an update_index that is beyond the update_index of the existing branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed in google/reftable@f781daf
65c0d65
to
7b53cfc
Compare
I've added a test vector and was able to get a first test to pass with the reftable backend, which reads a reference and then renames it. More tests and bug fixes will follow. I've fixed the refdb detection to now consult "extensions.refStorage" as mandated by the documentation. I've noticed that another requirement is to get repository format version 1 implemented. There's a temporary commit in here that just removes the error check. |
google/reftable@7aabfe0 for the transaction api in C. |
BTW, - I would be grateful for any feedback about reftable on windows. I don't have windows, and I'd like to keep it that way. |
7b53cfc
to
4c2ab8f
Compare
Rebased on current master, which now includes support for repository format v1 and extensions.
I may dig up my Windows VM as soon as I've got the backend working on my own machine. Until then there's currently only the CI that provides any kind of feedback on Windows. |
4c2ab8f
to
4ab39b1
Compare
65fc25f
to
70fd8b1
Compare
@pks-t - does libgit2 already have code to detect file / directory conflicts? Looks like I may need to write and/or refactor that to check transactions, and wondering if that makes sense to put into the reftable library. |
On Mon, Apr 27, 2020 at 11:19:38AM -0700, Han-Wen Nienhuys wrote:
@pks-t - does libgit2 already have code to detect file / directory
conflicts? Looks like I may need to write and/or refactor that to
check transactions, and wondering if that makes sense to put into the
reftable library.
No, not yet, but it was on my todo list. I wouldn't complain if it was
moved into reftable, though.
Didn't have much time lately due to the world being even crazier than it
used to be, but I hope to tackle the missing pieces soonish again.
|
70fd8b1
to
6276b81
Compare
Rebased on latest master, updated the reftable library to 66d38fe (C: move block_iter_next comment to header, 2020-04-28), fixed a set of bugs in the reflog implementation and improved error handling in most places. All in all the PR is close to being finished, but I've stumbled over one weird bug where compressing of the reftable stack seems to fail after writing symbolic references. No idea yet whether that's a fault of mine or in the reftable library, but I'll investigate. |
efcc341
to
d3195c3
Compare
I've updated the pull request a bit:
What is still missing:
|
d3195c3
to
fcc2bd3
Compare
This has been addressed now, there is proper conflict detection in the reftable backend now. Still remaining is:
Overall, 13 tests are currently failing, all due to reflogs. |
953b45b
to
98a686f
Compare
To support multiple different reference backend implementations, Git introduced a "refStorage" extension that stores the reference storage format a Git client should try to use. Wire up the logic to read this new extension when we open a repository from disk. For now, only the "files" backend is supported by us. When trying to open a repository that has a refstorage format that we don't understand we now error out. There are two functions that create a new repository that doesn't really have references. While those are mostly non-functional when it comes to references, we do expect that you can access the refdb, even if it's not yielding any refs. For now we mark those to use the "files" backend, so that the status quo is retained. Eventually though it might not be the worst idea to introduce an explicit "in-memory" reference database. But that is outside the scope of this patch series.
While we only support initializing repositories with the "files" reference backend right now, we are in the process of implementing a second backend with the "reftable" format. And while we already have the infrastructure to decide which format a repository should use when we open it, we do not have infrastructure yet to create new repositories with a different reference format. Introduce a new field `git_repository_init_options::refdb_type`. If unset, we'll default to the "files" backend. Otherwise though, if set to a valid `git_refdb_t`, we will use that new format to initialize the repostiory. Note that for now the only thing we do is to write the "refStorage" extension accordingly. What we explicitly don't yet do is to also handle the backend-specific logic to initialize the refdb on disk. This will be implemented in subsequent commits.
In our tests for "onbranch" config conditionals we set HEAD to point to various different branches via `git_repository_create_head()`. This function circumvents the refdb though and directly writes to the "HEAD" file. While this works now, it will create problems once we have multiple refdb backends. Furthermore, the function is about to go away in the next commit. So let's prepare for that and use `git_reference_symbolic_create()` instead.
The initialization of the on-disk state of refdbs is currently not handled by the actual refdb backend, but it's implemented ad-hoc where needed. This is problematic once we have multiple different refdbs as the filesystem structure is of course not the same. Introduce a new callback function `git_refdb_backend::init()`. If set, this callback can be invoked via `git_refdb_init()` to initialize the on-disk state of a refdb. Like this, each backend can decide for itself how exactly to do this. Note that the initialization of the refdb is a bit intricate. A repository is only recognized as such when it has a "HEAD" file as well as a "refs/" directory. Consequently, regardless of which refdb format we use, those files must always be present. This also proves to be problematic for us, as we cannot access the repository and thus don't have access to the refdb if those files didn't exist. To work around the issue we thus handle the creation of those files outside of the refdb-specific logic. We actually use the same strategy as Git does, and write the invalid reference "ref: refs/heads/.invalid" into "HEAD". This looks almost like a ref, but the name of that ref is not valid and should thus trip up Git clients that try to read that ref in a repository that really uses a different format. So while that invalid "HEAD" reference will of course get rewritten by the "files" backend, other backends should just retain it as-is.
Import the reftable library from commit f1228cd12c1 (reftable: make REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of the reftable library. The library will be wired into libgit2 over the next couple of commits.
While the reftable library is mostly decoupled from the Git codebase, some specific functionality intentionally taps into Git subsystems. To make porting of the reftable library easier all of these are contained in "system.h" and "system.c". Reimplement those compatibility shims so that they work for libgit2.
The reftable writer accidentally uses the Git-specific `QSORT()` macro. This macro removes the need for the caller to provide the element size, but other than that it's mostly equivalent to `qsort()`. Replace the macro accordingly. This incompatibility is a bug in Git that needs to be fixed upstream.
While perfectly legal, older compiler toolchains complain when zero-initializing structs that contain nested structs with `{0}`: /home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] struct reftable_addition empty = REFTABLE_ADDITION_INIT; ^~~~~~~~~~~~~~~~~~~~~~ /home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT' #define REFTABLE_ADDITION_INIT {0} ^ Silence this warning by using `{{0}}` instead.
Wire up the reftable library with CMake. At the current point in time the library is not yet plugged into libgit2 itself.
Implement the reftable backend that is used to read and write reftables. The backend is not yet used anywhere after this commit.
98a686f
to
392cffb
Compare
So this is a very rough reftables refdb implementation for libgit2. It's main intent is to start a discussion, see what prerequisites we have in libgit2 to properly implement refdbs other than refdb_fs and whether I'm using the reftable API correctly. Thoughts:
While we theoretically support reference databases other than the filesystem-based one, our support for that is... lacking. We have dozens of places where we assume $GITDIR/HEAD exists, we write references like MERGE_HEAD, ORIG_HEAD manually without consulting the refdb, lack support to create new repos with a different refdb as we create the filesystem structures like $GITDIR/refs and $GITDIR/HEAD directly in repository.c without consulting the refdb at all. We'll have to carefully vet our code to remove all accesses to anything refish outside of the refdb and instead call into the refdb layer.
It seems to me that reftables currently don't support a few operations that we expect from any given refdb implementation. Most importantly I couldn't figure out how to delete and rename a reflog and how to implement explicit locking (e.g. what the file backend does when creating .git/refs/heads/master.lock).
I found it quite hard to get into reftables as the interface exposes about quite a lot of the low-level details. This is probably my own fault as I usually tend to dive heads-deep into anything new without looking left and right first. But one resource that would've helped quite a lot is something like high-level examples that represent common git operations. I feel like I've got it right by now, but I'd be happy to get feedback on this to see how wrong I am.
Are there any testvectors around? Like a repository that one may embed into libgit2 itself that uses reftables to verify we behave as expected. I couldn't get a hold of any and as it's currently non-trivial to use reftables anywhere I wasn't able to come up with tests yet. As a result, my current code will probably fail very quickly.
At some places I feel like there's high potential for namespacing issues. E.g.
struct iterator
is highly generic and may trivially clash with any preexisting types. It would be nice to use namespace prefixes for all exported symbols.I think it would help if the C files were layed out in a more "standard" library way. Most importantly, separate out the "reftable.h" header file from the rest into its own folder so that one can add it to the include path without having to care about accidentally overriding "tree.h" that's part of libgit2 itself.
I didn't find any way to override memory allocation functions. I think we've discussed that at the Contributor's summit already.
Anyway, thanks a lot for working on reftables @hanwen, this is really exciting! I'd be happy to get feedback on my current implementation, which basically represents how I think it should work. Chances are high I've got it completely wrong, though ;)