{"url":"https://api.github.com/repos/tus/tusd/issues/1372","repository_url":"https://api.github.com/repos/tus/tusd","labels_url":"https://api.github.com/repos/tus/tusd/issues/1372/labels{/name}","comments_url":"https://api.github.com/repos/tus/tusd/issues/1372/comments","events_url":"https://api.github.com/repos/tus/tusd/issues/1372/events","html_url":"https://github.com/tus/tusd/issues/1372","id":4426630470,"node_id":"I_kwDOAIapMc8AAAABB9kFRg","number":1372,"title":"filelocker: stale .lock and .stop files not cleaned up on ERR_LOCK_TIMEOUT, causing permanent upload failure on network filesystems (SMB/CIFS)","user":{"login":"luciantimar","id":7078152,"node_id":"MDQ6VXNlcjcwNzgxNTI=","avatar_url":"https://avatars.githubusercontent.com/u/7078152?v=4","gravatar_id":"","url":"https://api.github.com/users/luciantimar","html_url":"https://github.com/luciantimar","followers_url":"https://api.github.com/users/luciantimar/followers","following_url":"https://api.github.com/users/luciantimar/following{/other_user}","gists_url":"https://api.github.com/users/luciantimar/gists{/gist_id}","starred_url":"https://api.github.com/users/luciantimar/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/luciantimar/subscriptions","organizations_url":"https://api.github.com/users/luciantimar/orgs","repos_url":"https://api.github.com/users/luciantimar/repos","events_url":"https://api.github.com/users/luciantimar/events{/privacy}","received_events_url":"https://api.github.com/users/luciantimar/received_events","type":"User","user_view_type":"public","site_admin":false},"labels":[{"id":32198123,"node_id":"MDU6TGFiZWwzMjE5ODEyMw==","url":"https://api.github.com/repos/tus/tusd/labels/bug","name":"bug","color":"fc2929","default":true,"description":null}],"state":"open","locked":false,"assignees":[],"milestone":null,"comments":1,"created_at":"2026-05-12T06:07:25Z","updated_at":"2026-05-24T20:54:34Z","closed_at":null,"assignee":null,"author_association":"NONE","issue_field_values":[],"type":null,"active_lock_reason":null,"sub_issues_summary":{"total":0,"completed":0,"percent_completed":0},"issue_dependencies_summary":{"blocked_by":0,"total_blocked_by":0,"blocking":0,"total_blocking":0},"body":"## Environment\n\n- tusd version: (2.9.2)\n- Storage backend: filestore on SMB/CIFS mount (CSI SMB driver on Kubernetes)\n- File server: Windows Server and Samba (linux)\n- Deployment: Single replica, single concurrent upload\n\n## Behaviour\n\nUploads intermittently fail with `ERR_LOCK_TIMEOUT`. Once a failure occurs,\nall subsequent retry attempts for the same upload also fail permanently with\nthe same error, even though no other process is accessing the upload.\n\n## Root Cause\n\nThe failure sequence is as follows:\n\n**Attempt 1:**\n1. tusd creates a new upload directory and `.lock` file on the SMB share\n2. `filelocker.Lock()` calls `flock()` on the `.lock` file\n3. On SMB/CIFS mounts, `flock()` translates to a server-side byte-range lock\n   request. Under certain server-side conditions this returns `ErrBusy` even\n   though no other process holds the lock\n4. `filelocker.Lock()` creates a `.stop` file and waits for a non-existent\n   lock holder to release\n5. `AcquireLockTimeout` (default 20s) expires → `ERR_LOCK_TIMEOUT`\n6. **`Unlock()` is never called** because the lock was never successfully acquired\n7. The `.lock` file and `.stop` file remain on disk\n\n**Attempt 2 (client retry):**\n1. The `.lock` file from attempt 1 still exists on disk\n2. `TryLock()` finds the existing `.lock` file\n3. `flock()` returns `ErrBusy` — this time legitimately, because the stale\n   `.lock` file appears held from the server's perspective\n4. The same 20s timeout occurs → `ERR_LOCK_TIMEOUT`\n5. The upload is now permanently unrecoverable without manual intervention\n\n## Key Observation\n\nThe issue is self-perpetuating: the first failure (which may be transient)\nleaves stale lock files that cause all subsequent attempts to fail\ndeterministically. The upload cannot recover without manually deleting the\n`.lock` and `.stop` files.\n\nThis was confirmed by observing that:\n- The upload directory was brand new with no other processes accessing it\n- `.lock` and `.stop` files remained on disk after every failed attempt\n- Each retry attempt failed with the same error\n- Adding `nobrl` to the CSI SMB mount options resolved the issue by handling\n  `flock()` locally (masking the root cause, not fixing it)\n\n## Expected Behaviour\n\nWhen `Lock()` times out without successfully acquiring the lock, tusd should\nclean up any `.stop` file it created during the attempt. Additionally,\nconsideration should be given to whether a failed lock acquisition should\nattempt to remove a `.lock` file that was found to be stale (e.g. PID no\nlonger alive, which already has detection logic in the lockfile package).\n\n## Relevant Code\n\nIn `pkg/filelocker/filelocker.go`, the `Lock()` function creates a `.stop`\nfile in the `ErrBusy` path but this file is only removed in `Unlock()`, which\nis never called when `Lock()` returns `ErrLockTimeout`:\n\n```go\n// .stop file is created here on ErrBusy\nfile, err := os.Create(lock.requestReleaseFile)\n\n// but only removed here, which is never reached on timeout\nfunc (lock fileUploadLock) Unlock() error {\n    _ = os.Remove(lock.requestReleaseFile)\n}\n```\n\n## Workaround\n\nAdd `nobrl` to SMB mount options to handle `flock()` locally, and add an\ninit container to clean up stale lock files on pod restart:\n\n```bash\nfind /upload-dir -name \"*.lock\" -delete\nfind /upload-dir -name \"*.stop\" -delete\n```\n\n## Suggested Fix\n\nIn `Lock()`, when the context deadline is exceeded, clean up the `.stop` file\nbefore returning `ErrLockTimeout`:\n\n```go\ncase <-ctx.Done():\n    os.Remove(lock.requestReleaseFile) // clean up before returning\n    return handler.ErrLockTimeout\n```","closed_by":null,"reactions":{"url":"https://api.github.com/repos/tus/tusd/issues/1372/reactions","total_count":0,"+1":0,"-1":0,"laugh":0,"hooray":0,"confused":0,"heart":0,"rocket":0,"eyes":0},"timeline_url":"https://api.github.com/repos/tus/tusd/issues/1372/timeline","performed_via_github_app":null,"state_reason":null,"pinned_comment":null}