In April 2019, Francisco Javier López hosted a Deep Dive (GitLab team members only)
on the GitLab Git LFS implementation to share his domain-specific
knowledge with anyone who may work in this part of the codebase in the future.
You can find the recording on YouTube,
and the slides on Google Slides
and in PDF.
Everything covered in this deep dive was accurate as of GitLab 11.10, and while specific
details may have changed since then, it should still serve as a good introduction.
Including LFS blobs in project archives
Introduced in GitLab 13.5.
The following diagram illustrates how GitLab resolves LFS files for project archives:
sequenceDiagram
    autonumber
    Client->>+Workhorse: GET /group/project/-/archive/master.zip
    Workhorse->>+Rails: GET /group/project/-/archive/master.zip
    Rails->>+Workhorse: Gitlab-Workhorse-Send-Data git-archive
    Workhorse->>Gitaly: SendArchiveRequest
    Gitaly->>Git: git archive master
    Git->>Smudge: OID 12345
    Smudge->>+Workhorse: GET /internal/api/v4/lfs?oid=12345&gl_repository=project-1234
    Workhorse->>+Rails: GET /internal/api/v4/lfs?oid=12345&gl_repository=project-1234
    Rails->>+Workhorse: Gitlab-Workhorse-Send-Data send-url
    Workhorse->>Smudge: <LFS data>
    Smudge->>Git: <LFS data>
    Git->>Gitaly: <streamed data>
    Gitaly->>Workhorse: <streamed data>
    Workhorse->>Client: master.zip
1. The user requests the project archive from the UI.
2. Workhorse forwards this request to Rails.
3. If the user is authorized to download the archive, Rails replies with
   an HTTP header of `Gitlab-Workhorse-Send-Data` with a base64-encoded
   JSON payload prefaced with `git-archive`. This payload includes the
   `SendArchiveRequest` binary message, which is encoded again in base64.
4. Workhorse decodes the `Gitlab-Workhorse-Send-Data` payload. If the
   archive already exists in the archive cache, Workhorse sends that
   file. Otherwise, Workhorse sends the `SendArchiveRequest` to the
   appropriate Gitaly server.
5. The Gitaly server calls `git archive <ref>` to begin generating the
   Git archive on-the-fly. If the `include_lfs_blobs` flag is enabled,
   Gitaly enables a custom LFS smudge filter via the
   `-c filter.lfs.smudge=/path/to/gitaly-lfs-smudge` Git option.
6. `git` identifies a possible LFS pointer, calls `gitaly-lfs-smudge`,
   and provides the LFS pointer via the standard input. Gitaly provides
   environment variables such as `GL_INTERNAL_CONFIG` to enable lookup
   of the LFS object.
7. If a valid LFS pointer is decoded, `gitaly-lfs-smudge` makes an
   internal API call to Workhorse to download the LFS object from GitLab.
8. Workhorse forwards this request to Rails. If the LFS object exists
   and is associated with the project, Rails sends `ArchivePath` either
   with a path where the LFS object resides (for local disk) or a
   pre-signed URL (when object storage is enabled) via the
   `Gitlab-Workhorse-Send-Data` HTTP header with a payload prefaced with
   `send-url`.
9. Workhorse retrieves the file and sends it to the `gitaly-lfs-smudge`
   process, which writes the contents to the standard output.
10. `git` reads this output and sends it back to the Gitaly process.
11. Gitaly sends the data back to Rails.
12. The archive data is sent back to the client.
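
The header convention used in steps 3, 4, and 8 can be sketched in a few lines. This is a minimal Python illustration, not Workhorse or Rails code. It assumes that the prefix and the payload are separated by a colon and that the payload uses the URL-safe base64 alphabet. The JSON key `SendArchiveRequest` and the placeholder bytes are hypothetical and chosen only to mirror the description above.

```python
import base64
import json

HEADER = "Gitlab-Workhorse-Send-Data"


def encode_send_data(prefix: str, params: dict) -> str:
    """Build a header value: prefix, separator, base64-encoded JSON."""
    raw = json.dumps(params).encode("utf-8")
    # Assumption: colon separator and URL-safe base64 alphabet.
    return prefix + ":" + base64.urlsafe_b64encode(raw).decode("ascii")


def decode_send_data(value: str) -> tuple:
    """Split the prefix from the payload and decode the JSON."""
    prefix, _, encoded = value.partition(":")
    if not encoded:
        raise ValueError("missing payload")
    return prefix, json.loads(base64.urlsafe_b64decode(encoded))


# The binary message is base64-encoded once before it is embedded in the
# JSON, and the JSON is then base64-encoded again (placeholder bytes here).
inner = base64.b64encode(b"\x0a\x04demo").decode("ascii")
value = encode_send_data("git-archive", {"SendArchiveRequest": inner})
prefix, params = decode_send_data(value)
```

The receiver can dispatch on `prefix` (`git-archive` or `send-url` in the flow above) before it looks at the decoded parameters.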
In step 7, the
`gitaly-lfs-smudge` filter must talk to Workhorse, not to
Rails, or an invalid LFS blob will be saved. To support this, GitLab
13.5 changed the default Omnibus configuration to have Gitaly talk to
Workhorse instead of Rails.
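
As a rough illustration of steps 6 and 7, the sketch below parses an LFS pointer as it would arrive on standard input and builds the internal API URL shown in the diagram. It is written in Python for brevity and is not the `gitaly-lfs-smudge` implementation. The function names, the `WORKHORSE_URL` value, and the pass-through behavior for non-pointer input are assumptions made for the example.

```python
from urllib.parse import urlencode

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_SPEC = "version https://git-lfs.github.com/spec/v1"
# Hypothetical address; the real one comes from Gitaly's configuration.
WORKHORSE_URL = "http://localhost:8181"


def parse_lfs_pointer(text: str):
    """Return (oid, size) for a valid LFS pointer, else None."""
    lines = text.strip().split("\n")
    if lines[0] != LFS_SPEC:
        return None
    # Remaining lines are "key value" pairs such as "oid sha256:..." and "size N".
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    oid = fields.get("oid", "")
    if not oid.startswith("sha256:") or not fields.get("size", "").isdigit():
        return None
    return oid[len("sha256:"):], int(fields["size"])


def lfs_url(oid: str, gl_repository: str) -> str:
    """Build the internal API URL, pointing at Workhorse rather than Rails."""
    query = urlencode({"oid": oid, "gl_repository": gl_repository})
    return f"{WORKHORSE_URL}/internal/api/v4/lfs?{query}"


def smudge(stdin_text: str, gl_repository: str):
    """Decide what the filter should do with its input."""
    pointer = parse_lfs_pointer(stdin_text)
    if pointer is None:
        # Assumption: content that is not an LFS pointer passes through unchanged.
        return ("passthrough", stdin_text)
    return ("fetch", lfs_url(pointer[0], gl_repository))
```

Because the URL targets Workhorse, the `send-url` response from Rails is resolved into the actual object contents before anything reaches the filter.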
One side effect of this change: the correlation ID of the original
request is not preserved for the internal API requests made by Gitaly
(or `gitaly-lfs-smudge`), such as the one made in step 8. The
correlation IDs for those API requests are random values until this
Workhorse issue is