How to use cache#
What does this guide solve?
This guide will show you how to locally cache data to speed up reloading from a remote Source.
Caching file#
When working with large data in a non-optional file format, creating a local cache can be advantageous to save time when reloading the data. Caching a Source is done by using cache_dir
followed by a directory. If the directory does not exist, it will be created. The data will be saved as parquet files in the cache directory and the schema will be stored as a JSON file.
Below is an example of caching a 370 MB file that consists of almost 12 million rows.
Warning
The initial load can take a couple of minutes as the file needs to be downloaded and cached first.
sources:
large_source:
type: file
cache_dir: cache
tables:
large_table: https://s3.amazonaws.com/datashader-data/nyc_taxi_wide.parq
kwargs:
engine: fastparquet
layouts:
- title: Table
source: large_source
views:
- type: table
table: large_table
from lumen.pipeline import Pipeline
data_url = "https://s3.amazonaws.com/datashader-data/nyc_taxi_wide.parq"
pipeline = Pipeline.from_spec(
{
"source": {
"type": "file",
"cache_dir": "cache",
"tables": {"large_table": data_url},
"kwargs": {"engine": "fastparquet"},
},
}
)
pipeline.data
Depending on the source type data caching will cache the entire table or individual queries. Using the cache_per_query
option you can toggle this behavior.
Note
Lumen’s cache can be added to all source types.
Precaching#
Sources that support caching per query can be made to pre-cache specific Filter and SQLTransform combinations. To enable pre-caching you must initialize a Pipeline and then either programmatically request to populate the pre-cache OR supply the pre-cache configuration as part of the YAML specification.
A pre-cache definitions can take one of two forms
A dictionary containing ‘filters’ and ‘variables’ dictionaries each containing lists of values to compute a cross-product for, e.g.
{
'filters': {
<filter>': ['a', 'b', 'c', ...],
...
},
'variables':
<variable>: [0, 2, 4, ...],
...
}
}
A list containing dictionaries of explicit values for each filter and variables.
[
{
'filters': {<filter>: 'a'},
'variables': {<variable>: 0}
},
{
'filters': {<filter>: 'a'},
'variables': {<variable>: 1}
},
...
]