Tests that include `t.Parallel()` causes concurrency issues on Windows test runner

Labels: lifecycle/rotten

Steve Moyer (smoyer64) opened 3 years ago

As discussed in #803, including a test that calls t.Parallel() and uses a temporary directory causes unpredictable failures on the Windows build server. Rerunning a failed check often succeeds but failures report that many different tests randomly fail. This is a pretty clear sign that there's a concurrency issue. Here's the code that originally caused the problem:

func TestGoGitRepo_Indexes(t *testing.T) {
	t.Parallel()

	plainRoot, err := ioutil.TempDir("", "")
	require.NoError(t, err)
	t.Cleanup(func() {
		require.NoError(t, os.RemoveAll(plainRoot))
	})

	repo, err := InitGoGitRepo(plainRoot, namespace)
	require.NoError(t, err)
	t.Cleanup(func() {
		require.NoError(t, repo.Close())
	})

	// Can create indices
	indexA, err := repo.GetBleveIndex("a")
	require.NoError(t, err)
	require.NotZero(t, indexA)
	require.FileExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a", "index_meta.json"))
	require.FileExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a", "store"))

	indexB, err := repo.GetBleveIndex("b")
	require.NoError(t, err)
	require.NotZero(t, indexB)

	// Can get an existing index
	indexA, err = repo.GetBleveIndex("a")
	require.NoError(t, err)
	require.NotZero(t, indexA)

	// Can delete an index
	err = repo.ClearBleveIndex("a")
	require.NoError(t, err)
	require.NoDirExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a"))
}

Linux and MacOS filesystems have a different behavior than Windows - if you delete a file that's in use, it will remove the directory entry but the inodes won't be deleted until all the related file handles have been closed. Changing the code above to the following makes the tests fail less frequently:

func TestGoGitRepo_Indexes(t *testing.T) {
	t.Parallel()

	plainRoot := t.TempDir()

	repo, err := InitGoGitRepo(plainRoot, namespace)
	require.NoError(t, err)
	t.Cleanup(func() {
		require.NoError(t, repo.Close())
	})

	// Can create indices
	indexA, err := repo.GetBleveIndex("a")
	require.NoError(t, err)
	require.NotZero(t, indexA)
	require.FileExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a", "index_meta.json"))
	require.FileExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a", "store"))

	indexB, err := repo.GetBleveIndex("b")
	require.NoError(t, err)
	require.NotZero(t, indexB)

	// Can get an existing index
	indexA, err = repo.GetBleveIndex("a")
	require.NoError(t, err)
	require.NotZero(t, indexA)

	// Can delete an index
	err = repo.ClearBleveIndex("a")
	require.NoError(t, err)
	require.NoDirExists(t, filepath.Join(plainRoot, ".git", namespace, "indexes", "a"))
}

Using t.TempDir() to create each test's temporary directory also registers an alternate cleanup function that has a different behavior from os.RemoveAll(). This function includes the following ToB comment (see https://github.com/golang/go/blob/8a56c7742d96c8ef8e8dcecaf3d1c0e9f022f708/src/testing/testing.go#L1115):

// removeAll is like os.RemoveAll, but retries Windows "Access is denied."
// errors up to an arbitrary timeout.
//
// Those errors have been known to occur spuriously on at least the
// windows-amd64-2012 builder (https://go.dev/issue/50051), and can only occur
// legitimately if the test leaves behind a temp file that either is still open
// or the test otherwise lacks permission to delete. In the case of legitimate
// failures, a failing test may take a bit longer to fail, but once the test is
// fixed the extra latency will go away.

It's possible that universally using t.TempDir() within git-bug will eliminate this issue. It's going to be a bit hard to be sure since the only way to tell is to run the tests repeatedly on Github. There are only three additional occurrences of os.TempDir() in the code as most of the tests rely on CreateGoGitTestRepo for their test repository instance. I've been trying different scenarios in #804 and will make this change next.

Steve Moyer (smoyer64) commented 3 years ago

Here's an instance of a test repo that's created but missing the code to properly close it after the test is complete - this might be contributing to the above described problem:

https://github.com/MichaelMure/git-bug/blob/96327c3371ca762d906209c6114092bbf552c0f4/cache/repo_cache_test.go#L223

Steve Moyer (smoyer64) commented 3 years ago

This test utility function creates three repo without closing them:

https://github.com/MichaelMure/git-bug/blob/96327c3371ca762d906209c6114092bbf552c0f4/entity/dag/common_test.go#L106

Steve Moyer (smoyer64) commented 3 years ago

I've got a very crufty branch I've been using for testing that includes 30-40 tests which include t.Parallel(). With these changes, the Windows build failed almost every time.

I then refactored the entire test suite to exclusively use t.TempDir() (with its built-in OS friendly version of removeAll()) and t.Cleanup(). After these changes, tests on Windows builds seem to be failing about once in 30 runs.

I've created #810, to describe a couple (final?) places we could end up with blocking behavior and will tackle those before returning to this issue.

Michael Muré (MichaelMure) commented 3 years ago

Here's an instance of a test repo that's created but missing the code to properly close it after the test is complete - this might be contributing to the above described problem:

https://github.com/MichaelMure/git-bug/blob/96327c3371ca762d906209c6114092bbf552c0f4/cache/repo_cache_test.go#L223

This looks like a real issue.

This test utility function creates three repo without closing them:

https://github.com/MichaelMure/git-bug/blob/96327c3371ca762d906209c6114092bbf552c0f4/entity/dag/common_test.go#L106

I think this one is OK because each time this function is used, defer repository.CleanupTestRepos(repoA, repoB, remote) is called. Admittedly, that might be better to register the removal with t.Cleanup() directly as it's made exactly for that.

That said, looking again at that code, it seems that there is a bunch of case where cache.NewRepoCache(repo) is called, but the cache is never closed after that. As we have seen, that might cause problems on windows especially.

Steve Moyer (smoyer64) commented 3 years ago

Agreed ... on the branch where I've been testing change, I've refactored the tests so that the repositories are always created in the CreateGoGitTestRepo function and which includes the t.Cleanup(). Take a look at https://github.com/MichaelMure/git-bug/blob/affde987ab14507de7edfccb4b4913a5834d0d9b/repository/gogit_testing.go#L41 but please be advised that this branch is a mess because a) it's based off the branch with the failing test and b) I've tried several dozens of ideas including finding the races. Assuming it's okay with you, my plan was to start from master`, and then apply a similar set of refactors.

github-actions (github-actions) commented 1 year ago

This bot triages untriaged issues and PRs according to the following rules:

After 90 days of inactivity, the lifecycle/stale label is applied
After 30 days of inactivity since lifecycle/stale was applied, the issue is closed

To remove the stale status, you can:

Remove the lifecycle/stale label
Comment on this issue

github-actions (github-actions) added label lifecycle/stale 1 year ago

github-actions (github-actions) commented 11 months ago

This bot triages issues in order to help the maintainers identify what needs attention, according to the following lifecycle rules:

After 90 days of inactivity, lifecycle/stale is applied
After 90 days of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

This bot will not automatically close stale issues.

To remove the stale status, you can:

Remove the stale label from this issue
Comment on this issue
Close this issue
Offer to help out with triaging

To avoid automatic lifecycle management of this issue, add lifecycle/frozen.

github-actions (github-actions) added label lifecycle/rotten 11 months ago

github-actions (github-actions) removed label lifecycle/stale 11 months ago

sudoforge removed label lifecycle/dormant 7 months ago

Timeline