
Commit e3176df

bump s3 docs
1 parent 5fcecfd commit e3176df

File tree: 1 file changed (+1, −11 lines)

docs/misc/13_s3_cache/index.md

Lines changed: 1 addition & 11 deletions
```diff
@@ -1,20 +1,10 @@
 # S3 Distributed Dependency Cache
 
-Workers cache aggressively the dependencies (and each version of them since every script has its own lockfile with a specific version for each dependency) so they are never pulled nor installed twice on the same worker. However, with a bigger cluster, for each script, the likelihood of being seen by a worker for the first time increases (and the cache hit ratio decreases). However, you may have noticed that our multi-tenant [cloud solution](https://app.windmill.dev) runs as if most dependencies were cached all the time, even though we have hundreds of workers on there. The secret sauce is a global cache backed by s3.
-
-There are 3 mechanisms involved in the global dependency cache:
+Workers aggressively cache dependencies (and each version of them, since every script has its own lockfile pinning a specific version of each dependency), so they are never pulled nor installed twice on the same worker. However, with a bigger cluster, the likelihood that a given script is seen by a worker for the first time increases (and the cache hit ratio decreases). You may have noticed, though, that our multi-tenant [cloud solution](https://app.windmill.dev) runs as if most dependencies were cached all the time, even though we have hundreds of workers there. For TypeScript we do nothing special, as npm has sufficient networking and npm packages are just tars that take no compute to extract. Python, however, is a whole other story, and to achieve the same swiftness on cold starts the secret sauce is a global cache backed by s3.
 
 ## Global Python Dependency Cache
 
 The first time a dependency is seen by a worker, if it is not cached locally, the worker searches the bucket for that specific `name==version`:
 
 1. If it is not there, install the dependency from pypi, then take a snapshot of the installed dependency, tar it, and push it to s3 (we call this a "piptar").
 2. If it is, simply pull the piptar and extract it in place of installing from pypi. This is much faster because the s3 bucket is much closer to your workers than pypi, and because there is no installation step: a simple tar extract suffices, which takes almost no compute.
-
-## Local Cache Syncing
-
-In the background, the entire worker cache is synced so that most dependencies get propagated over time to all workers.
-
-## Entire Local Cache snapshotting
-
-Another mechanism of the s3 distributed cache is the snapshotting of the entire cache as a single tar. This tar is created approximately every day by one of the workers (at random). This snapshot is then pulled at the start of any worker. It enables workers to start with all the dependencies installed. It is faster than pulling the list of all dependencies because it is much faster to pull one big object from s3 than many small ones (around 30s). The workers then start processing jobs with a "hot" cache.
```
