DuckDBSource type: duckdb

class lumen.sources.duckdb.DuckDBSource(*, ephemeral, filter_in_sql, initializers, mirrors, sql_expr, tables, uri, excluded_tables, load_schema, cache_data, cache_dir, cache_metadata, cache_per_query, cache_schema, cache_with_dask, metadata_func, root, shared, name)

DuckDBSource provides a simple wrapper around the DuckDB SQL connector.

To specify the tables to query, provide a list or dictionary of tables. A SQL expression to fetch the data from each table is then generated using sql_expr. For example, for a local table flights.db the default sql_expr, SELECT * FROM {table}, expands to SELECT * FROM flights.db. To specify a full SQL expression as a table, change sql_expr to '{table}' so that no further templating is applied.
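The expansion described above can be pictured with plain Python string formatting. This is a minimal sketch rather than Lumen's implementation: expand_sql_expr is a hypothetical helper, and it assumes the {table} placeholder is filled in with str.format.

```python
DEFAULT_SQL_EXPR = "SELECT * FROM {table}"


def expand_sql_expr(table: str, sql_expr: str = DEFAULT_SQL_EXPR) -> str:
    """Fill the {table} placeholder of a SQL template (hypothetical helper)."""
    return sql_expr.format(table=table)


# A plain table name is wrapped by the default template.
default_query = expand_sql_expr("flights.db")

# With sql_expr set to "{table}", a full SQL expression passes through unchanged.
raw_query = expand_sql_expr(
    "SELECT carrier, COUNT(*) FROM flights GROUP BY carrier",
    sql_expr="{table}",
)

print(default_query)
print(raw_query)
```

The second call shows why '{table}' is needed for full expressions: with the default template the expression would be nested inside another SELECT * FROM clause.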

Note that certain functionality in DuckDB requires modules to be loaded before making a query. These can be specified using the initializers parameter, which defines DuckDB statements to run when the connection is initialized.

Parameters

ephemeral

type: bool
default: False
Whether the data is ephemeral, i.e. manually inserted into the DuckDB table or derived from real data.

excluded_tables

type: list
default: []
List of table names that should be excluded from the results. Supports:
- Fully qualified name: 'DATABASE.SCHEMA.TABLE'
- Schema qualified name: 'SCHEMA.TABLE'
- Table name only: 'TABLE'
- Wildcards: 'SCHEMA.*'
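One way to read the four pattern forms is sketched below. The matching rules are assumptions made for illustration, not Lumen's actual logic: is_excluded is a hypothetical helper that compares a shorter pattern against the trailing parts of the qualified name, treats '*' as matching any single part, and matches case-sensitively.

```python
from fnmatch import fnmatchcase


def is_excluded(qualified_name: str, excluded_tables: list[str]) -> bool:
    """Check a 'DATABASE.SCHEMA.TABLE' name against exclusion patterns.

    Assumption: a pattern with fewer parts is compared against the
    trailing parts of the name, and '*' matches any single part.
    """
    name_parts = qualified_name.split(".")
    for pattern in excluded_tables:
        pattern_parts = pattern.split(".")
        if len(pattern_parts) > len(name_parts):
            # The pattern is more specific than the name; it cannot match.
            continue
        tail = name_parts[-len(pattern_parts):]
        if all(fnmatchcase(part, pat) for part, pat in zip(tail, pattern_parts)):
            return True
    return False


# Made-up names used only to exercise each pattern form.
print(is_excluded("db.main.flights", ["db.main.flights"]))  # fully qualified
print(is_excluded("db.main.flights", ["main.flights"]))     # schema qualified
print(is_excluded("db.main.flights", ["flights"]))          # table name only
print(is_excluded("db.main.flights", ["main.*"]))           # wildcard
```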

filter_in_sql

type: bool
default: True
Whether to apply filters in SQL or in-memory.

initializers

type: list
default: []
SQL statements to run to initialize the connection.

load_schema

type: bool
default: True
Whether to load the schema.

mirrors

type: dict
default: {}
Mirrors the tables into the DuckDB database. The mirrors should define a mapping from the table names to the source of the mirror, which may be defined as a Pipeline or a tuple of the source and the table.

sql_expr

type: str
default: 'SELECT * FROM {table}'
The SQL expression to execute.

tables

type: list | dict
default: None
List or dictionary of tables.

uri

type: str
default: ''
The URI of the DuckDB database.

Methods

DuckDBSource.clear_cache(*events: Event): Clears any cached data.

DuckDBSource.close()

Close the DuckDB connection, releasing associated resources.

This method should be called when the source is no longer needed to prevent connection leaks and properly clean up server-side resources.

DuckDBSource.create_sql_expr_source(tables: dict[str, str], materialize: bool = True, **kwargs)

Creates a new SQL Source given a set of table names and corresponding SQL expressions.

Parameters:

tables (dict[str, str]) – Mapping from table name to SQL expression.
materialize (bool) – Whether to materialize new tables
kwargs (any) – Additional keyword arguments.

Returns:

source

Return type:

DuckDBSource

DuckDBSource.execute(sql_query: str, *args, **kwargs)

Executes a SQL query and returns the result as a DataFrame.

Parameters:

sql_query (str) – The SQL Query to execute
*args (list) – Positional arguments to pass to the SQL query
**kwargs (dict) – Keyword arguments to pass to the SQL query

Returns:

The result as a pandas DataFrame

Return type:

pd.DataFrame

DuckDBSource.get(table, **query)

Returns a table, optionally filtered by the given query.

Parameters:

table (str) – The name of the table to query
query (dict) – A dictionary containing all the query parameters

Returns:

A DataFrame containing the queried table.

Return type:

DataFrame

DuckDBSource.get_metadata(table: str | list[str] | None) → dict

Returns metadata for one, multiple or all tables provided by the source.

The metadata for a table is structured as:

{
    "description": …,
    "columns": {
        <COLUMN>: {
            "description": …,
            "data_type": …,
        }
    },
    **other_metadata
}

If a list of tables or no table is provided the metadata is nested one additional level:

{
    "table_name": {
        "description": …,
        "columns": {
            <COLUMN>: {
                "description": …,
                "data_type": …,
            }
        },
        **other_metadata
    }
}

Parameters: table (str | list[str] | None) – The name of the table or tables to return the metadata for. If None, returns metadata for all available tables.
Returns: metadata – Dictionary of metadata indexed by table (if a list of tables or no table was provided), or the metadata of an individual table.
Return type: dict
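To show how the nested form might be consumed, the sketch below walks a hand-written dictionary shaped like the multi-table result described above. The table name, column names, descriptions, and data types are invented for illustration, and column_types is a hypothetical helper, not part of the DuckDBSource API.

```python
# Hand-written stand-in for the result of get_metadata with a list of tables.
# All names and values here are made up for illustration.
metadata = {
    "flights": {
        "description": "One row per flight.",
        "columns": {
            "carrier": {"description": "Airline code.", "data_type": "VARCHAR"},
            "dep_delay": {"description": "Departure delay in minutes.", "data_type": "DOUBLE"},
        },
    },
}


def column_types(metadata: dict) -> dict[str, dict[str, str]]:
    """Map each table name to a {column: data_type} dictionary."""
    return {
        table: {name: info["data_type"] for name, info in meta["columns"].items()}
        for table, meta in metadata.items()
    }


types = column_types(metadata)
print(types)
```

For a single table name the outer level is absent, so the same lookup would start directly at the "columns" key.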

DuckDBSource.get_schema(table: str | None = None, limit: int | None = None, shuffle: bool = False) → dict[str, dict[str, Any]] | dict[str, Any]

Returns JSON schema describing the tables returned by the Source.

Parameters:

table (str | None) – The name of the table to return the schema for. If None returns schema for all available tables.
limit (int | None) – Limits the number of rows considered for the schema calculation

Returns:

JSON schema(s) for one or all the tables.

Return type:

dict

DuckDBSource.get_sql_expr(table: str | dict): Returns the SQL expression corresponding to a particular table.

DuckDBSource.get_tables()

Returns the list of tables available on this source.

Returns: The list of available tables on this source.
Return type: list

DuckDBSource.normalize_table(table: str): Hook for implementing table name normalization, allowing fuzzy matching of table names that differ in minor ways such as quoting.
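As an illustration of the kind of normalization meant here, the sketch below removes quoting characters so that differently quoted spellings of a name compare equal. It is an assumption about what a normalizer could do, not the behavior of DuckDBSource.normalize_table, and normalize_table_name is a hypothetical standalone function.

```python
# Translation table that deletes double quotes, single quotes and backticks.
_QUOTE_CHARS = str.maketrans("", "", "\"'`")


def normalize_table_name(table: str) -> str:
    """Remove quoting characters from a table name (hypothetical helper)."""
    return table.translate(_QUOTE_CHARS)


# Differently quoted spellings normalize to the same string.
print(normalize_table_name('"main"."flights"'))
print(normalize_table_name("`main`.`flights`"))
print(normalize_table_name("main.flights"))
```

A real normalizer may need more care, for example when a quoted identifier itself contains a quote character.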

DuckDBSource.to_spec(context: dict[str, Any] | None = None) → dict[str, Any]

Exports the full specification to reconstruct this component.

Return type: dict[str, Any]
