How to use cache

What does this guide solve?

This guide will show you how to locally cache data to speed up reloading from a remote Source.

Caching file

When working with large data in a non-optimized file format, creating a local cache can save time when reloading the data. Caching is enabled on a Source by setting cache_dir to a directory. If the directory does not exist, it will be created. The data will be saved as parquet files in the cache directory and the schema will be stored as a JSON file.
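
Once a Source has been loaded with cache_dir set, the contents of the cache directory can be checked with a few lines of Python. The sketch below is not part of Lumen: summarize_cache is a hypothetical helper, and the .parq/.parquet and .json extensions are assumptions based on the description above.

```python
from pathlib import Path


def summarize_cache(cache_dir):
    """Group the files found in a cache directory by kind (hypothetical helper)."""
    root = Path(cache_dir)
    summary = {"data": [], "schema": [], "other": []}
    if not root.is_dir():
        # Nothing has been cached yet
        return summary
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix in (".parq", ".parquet"):
            # Assumed extensions for the cached parquet data
            summary["data"].append(path.name)
        elif path.suffix == ".json":
            # Assumed extension for the stored schema
            summary["schema"].append(path.name)
        else:
            summary["other"].append(path.name)
    return summary


if __name__ == "__main__":
    print(summarize_cache("cache"))
```

An empty summary after a load would suggest that caching is not active or that a different directory is being used.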

Below is an example of caching a 370 MB file that consists of almost 12 million rows.


The initial load can take a couple of minutes as the file needs to be downloaded and cached first.

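In a YAML specification:

sources:
  large_source:
    type: file
    cache_dir: cache
    tables:
      large_table: ""  # URL of the remote data file
    kwargs:
      engine: fastparquet

layouts:
  - title: Table
    source: large_source
    views:
      - type: table
        table: large_table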

The same source as a Pipeline in Python:

from lumen.pipeline import Pipeline

data_url = ""  # URL of the remote data file
pipeline = Pipeline.from_spec(
    {
        "source": {
            "type": "file",
            "cache_dir": "cache",
            "tables": {"large_table": data_url},
            "kwargs": {"engine": "fastparquet"},
        },
    }
)

Depending on the source type, caching will store either the entire table or individual queries. The cache_per_query option toggles this behavior.
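
As a minimal sketch, the file source specification from the example above could switch off per-query caching as follows. Placing cache_per_query next to cache_dir in the source specification is an assumption here, as is the reading that False caches the whole table.

```python
data_url = ""  # URL of the remote data file

# Source specification with per-query caching switched off, so that the
# whole table is cached instead (assumed placement and meaning of the option).
source_spec = {
    "type": "file",
    "cache_dir": "cache",
    "cache_per_query": False,
    "tables": {"large_table": data_url},
    "kwargs": {"engine": "fastparquet"},
}

pipeline_spec = {"source": source_spec}
```

The resulting dictionary would then be passed to Pipeline.from_spec in the same way as in the Python example above.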


Lumen’s cache can be added to all source types.


Sources that support caching per query can be made to pre-cache specific Filter and SQLTransform combinations. To enable pre-caching, you must initialize a Pipeline and then either populate the pre-cache programmatically or supply the pre-cache configuration as part of the YAML specification.

A pre-cache definition can take one of two forms:

  • A dictionary containing ‘filters’ and ‘variables’ dictionaries, each mapping a name to a list of values. The cross-product of all listed values is pre-cached, e.g.

    {
        'filters': {<filter>: ['a', 'b', 'c', ...]},
        'variables': {<variable>: [0, 2, 4, ...]}
    }

  • A list containing dictionaries of explicit values for each filter and variable, e.g.

    [
        {'filters': {<filter>: 'a'}, 'variables': {<variable>: 0}},
        {'filters': {<filter>: 'a'}, 'variables': {<variable>: 1}},
        ...
    ]
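
The relationship between the two forms can be illustrated with a short sketch that expands a cross-product definition into the explicit list form. This is not Lumen's implementation. The expand_precache function and the names category and threshold are hypothetical and only show what "cross-product" means here.

```python
from itertools import product


def expand_precache(definition):
    """Expand a cross-product pre-cache definition into explicit entries."""
    filters = definition.get("filters", {})
    variables = definition.get("variables", {})
    # Tag each name with the section it came from, preserving insertion order
    keys = [("filters", k) for k in filters] + [("variables", k) for k in variables]
    value_lists = [filters[k] for k in filters] + [variables[k] for k in variables]
    entries = []
    for combo in product(*value_lists):
        entry = {"filters": {}, "variables": {}}
        for (section, name), value in zip(keys, combo):
            entry[section][name] = value
        entries.append(entry)
    return entries


definition = {
    "filters": {"category": ["a", "b"]},      # hypothetical filter name
    "variables": {"threshold": [0, 1]},       # hypothetical variable name
}
for entry in expand_precache(definition):
    print(entry)
```

Two filter values and two variable values yield four explicit entries, which is why the dictionary form is more compact when every combination should be pre-cached, while the list form suits a hand-picked subset.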