Pre-Aggregations
Pre-aggregations are materialized rollup tables that store pre-computed aggregations for significant query performance improvements. The query engine can automatically route queries to matching pre-aggregations instead of scanning base tables. Pre-aggregations can be configured to refresh automatically on schedules or when source data changes.
Sidemantic can automatically analyze your query history and recommend optimal pre-aggregations. This is unique to Sidemantic—no manual analysis of query patterns required.
Fetch query history from BigQuery, Snowflake, Databricks, or ClickHouse and get instant recommendations with benefit scores:
sidemantic preagg recommend --connection "bigquery://project/dataset"
sidemantic preagg apply models/ --connection "bigquery://project/dataset" --top 5
See the Pre-Aggregation Recommendations guide for complete CLI documentation.
Pre-aggregation routing is disabled by default and must be explicitly enabled. See Enabling Routing for details.
Why Use Pre-Aggregations?
Performance Benefits
Pre-aggregations provide dramatic performance improvements by reducing the amount of data scanned:
Example: E-commerce orders table with 100M rows
# Without pre-aggregation: scan 100M rows
SELECT status, SUM(amount)
FROM orders
GROUP BY status
# With daily pre-aggregation: scan ~36K rows (10 statuses × 365 days × 10 years)
SELECT status, SUM(revenue_raw)
FROM orders_preagg_daily
GROUP BY status
Performance gain: ~2,777x fewer rows scanned
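A quick back-of-envelope check of that figure (the row counts are the example's assumptions, not measurements):
# Worst-case rollup size vs. base table size for the example above
base_rows = 100_000_000         # orders base table
rollup_rows = 10 * 365 * 10     # statuses x days/year x years = 36,500
print(base_rows / rollup_rows)  # ~2,740x (the ~2,777x above rounds the rollup to 36K)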
When to Use Pre-Aggregations
Use pre-aggregations when:
- Base tables are large (millions+ rows)
- Queries aggregate data frequently
- Query patterns are predictable (same dimensions/metrics)
- Real-time data freshness isn't critical (can refresh periodically)
Common use cases:
- Dashboard queries that run frequently
- Time-series analytics (daily/monthly trends)
- High-cardinality dimension reductions
- Cross-model aggregations
Defining Pre-Aggregations
Pre-aggregations are defined in the model’s YAML configuration:
Basic Example
models:
- name: orders
table: public.orders
primary_key: order_id
dimensions:
- name: status
type: categorical
sql: status
- name: region
type: categorical
sql: region
- name: created_at
type: time
sql: created_at
granularity: day
metrics:
- name: count
agg: count
- name: revenue
agg: sum
sql: amount
pre_aggregations:
- name: daily_summary
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
This creates a pre-aggregation that:
- Groups by status and region dimensions
- Pre-computes count and revenue metrics
- Aggregates to daily granularity on created_at
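With routing enabled, any query whose dimensions and metrics fall inside this definition can be served from the rollup. A minimal sketch (it assumes the orders model above is already registered on the layer and the rollup table has been materialized):
from sidemantic import SemanticLayer

# Assumes the orders model defined above has been loaded into this layer.
sl = SemanticLayer(use_preaggregations=True)

# count, revenue, status, and region are all covered by daily_summary,
# so the compiled SQL should read from the rollup instead of orders.
sql = sl.compile(
    metrics=["orders.count", "orders.revenue"],
    dimensions=["orders.status", "orders.region"],
)
print(sql)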
Configuration Options
Required Fields
| Field | Description | Example |
|---|---|---|
| name | Unique pre-aggregation identifier | daily_summary |
| measures | List of metrics to pre-compute | [count, revenue] |
| dimensions | List of dimensions to group by | [status, region] |
Optional Fields
| Field | Description | Example |
|---|---|---|
| time_dimension | Time-based dimension for temporal grouping | created_at |
| granularity | Time aggregation level | day |
| partition_granularity | Partition size for incremental refresh | month |
| refresh | Refresh strategy (see Refresh Configuration) | See examples below |
| indexes | Index definitions for performance | See Indexes |
Advanced Configuration
Partitioning
Split large pre-aggregations into partitions for faster incremental refresh:
pre_aggregations:
- name: monthly_partitioned
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "7 day" # Only refresh last 7 daysBenefits: - Only refresh changed partitions - Faster incremental updates - Reduced rebuild time
Refresh Configuration
Pre-aggregations can be configured to refresh automatically using different strategies. The refresh field controls how and when the materialized table is updated.
Time-Based Refresh
Refresh on a fixed schedule:
pre_aggregations:
- name: hourly_refresh
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
refresh:
every: "1 hour" # or "30 minutes", "1 day", "1 week", etc.Common intervals: - "30 minutes" - High-frequency dashboards - "1 hour" - Standard real-time analytics - "1 day" - Daily batch processing - "1 week" - Historical analysis
SQL-Triggered Refresh
Refresh when source data changes by checking a SQL query result:
pre_aggregations:
- name: smart_refresh
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
refresh:
sql: "SELECT MAX(updated_at) FROM orders"The pre-aggregation refreshes when the SQL query returns a different value. Common patterns: - SELECT MAX(updated_at) FROM table - Detects new/updated rows - SELECT COUNT(*) FROM table - Detects row count changes - SELECT MAX(id) FROM table - Detects new inserts
Incremental Refresh
Only update changed partitions instead of rebuilding the entire table:
pre_aggregations:
- name: incremental_update
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
update_window: "7 day" # Only refresh last 7 days of partitionsConfiguration options: - incremental: true - Enable incremental refresh - update_window - How far back to refresh (e.g., "7 day", "30 day") - Requires partition_granularity to be set
Benefits:
- Much faster refresh times (only rebuilds recent partitions)
- Efficient for append-only data
- Ideal for large historical datasets
When to use:
- Large pre-aggregations with millions+ rows
- Time-series data with limited updates to historical data
- Queries that mostly focus on recent data
Combined Refresh Strategies
Mix time-based and incremental refresh:
pre_aggregations:
- name: combined_strategy
measures: [count, revenue]
dimensions: [status, region]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour" # Check every hour
incremental: true # Only update recent partitions
update_window: "30 day" # Refresh last 30 daysIndexes
Add indexes for query performance:
pre_aggregations:
- name: daily_summary
measures: [count, revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
indexes:
- name: status_idx
columns: [status]
- name: composite_idx
columns: [status, region, customer_id]
Index ordering: Put high-selectivity columns first.
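On engines with standard index support, these definitions would plausibly correspond to plain CREATE INDEX statements against the materialized table. A sketch of the equivalent DDL (table and index names follow the conventions in this guide but are assumptions):
import duckdb

con = duckdb.connect("data.db")
# Hypothetical DDL matching the index definitions above.
con.execute("CREATE INDEX status_idx ON orders_preagg_daily_summary (status)")
con.execute("CREATE INDEX composite_idx ON orders_preagg_daily_summary (status, region, customer_id)")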
Enabling Routing
Pre-aggregation routing is disabled by default. Queries will use base tables unless routing is explicitly enabled.
Global Enable
Enable for all queries in the semantic layer:
from sidemantic import SemanticLayer
# Enable pre-aggregation routing globally
sl = SemanticLayer(use_preaggregations=True)
Per-Query Override
Override the global setting for specific queries:
# Disabled globally, enable for one query
sl = SemanticLayer(use_preaggregations=False)
sql = sl.compile(
metrics=["orders.revenue"],
dimensions=["orders.status"],
use_preaggregations=True # Enable for this query only
)
# Enabled globally, disable for one query
sl = SemanticLayer(use_preaggregations=True)
sql = sl.query(
metrics=["orders.revenue"],
dimensions=["orders.customer_id"],
use_preaggregations=False # Disable for this query
)
Why Disabled by Default?
- Pre-aggregation tables must exist - Routing fails if materialized tables aren’t created in your database
- Safer default - Prevents query errors when pre-aggs aren’t materialized
- Explicit opt-in - Users control when optimization is active
Query Matching
The query engine automatically finds the best matching pre-aggregation based on:
Matching Rules
- Dimension Subset: Query dimensions must be a subset of pre-aggregation dimensions
- Measure Compatibility: All query metrics must be derivable from pre-aggregated measures
- Granularity Compatibility: Query time granularity must be coarser or equal to pre-aggregation granularity
- Best Match Selection: Chooses the smallest/most specific matching pre-aggregation
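The granularity rule reduces to an ordering over time grains: a rollup stored at one grain can serve any query at the same or a coarser grain. A sketch of that check (the grain list is illustrative, not Sidemantic's internal one, and omits non-nesting grains like week):
# Coarser-or-equal granularity check (illustrative)
GRAINS = ["hour", "day", "month", "quarter", "year"]

def grain_compatible(query_grain: str, preagg_grain: str) -> bool:
    """A day-grain rollup can serve day/month/year queries, but not hour."""
    return GRAINS.index(query_grain) >= GRAINS.index(preagg_grain)

assert grain_compatible("month", "day")      # rolls day up to month
assert not grain_compatible("hour", "day")   # can't drill below the stored grain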
Matching Examples
Given this pre-aggregation:
pre_aggregations:
- name: daily_rollup
measures: [count, revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
Queries That Match ✓
# Subset of dimensions (just status)
sl.query(metrics=["orders.revenue"], dimensions=["orders.status"])
# → Uses daily_rollup
# All dimensions present
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status", "orders.region", "orders.customer_id"]
)
# → Uses daily_rollup
# Coarser granularity (month > day) - rolls up
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.created_at__month"]
)
# → Uses daily_rollup, converts day → month
# With filters - applied on top
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"],
filters=["orders.region = 'US'"]
)
# → Uses daily_rollup with WHERE region = 'US'
Queries That Don't Match ✗
# Dimension not in pre-agg
sl.query(metrics=["orders.revenue"], dimensions=["orders.product_id"])
# → Falls back to base table
# Finer granularity (hour < day) - can't drill down
sl.query(metrics=["orders.revenue"], dimensions=["orders.created_at__hour"])
# → Falls back to base table
# Metric not in pre-agg
sl.query(metrics=["orders.profit"], dimensions=["orders.status"])
# → Falls back to base table
Measure Derivability
Different aggregation types have different derivability rules:
| Query Metric | Pre-Agg Requires | Derivable? | Notes |
|---|---|---|---|
| SUM | SUM | ✓ Yes | Direct re-aggregation |
| COUNT | COUNT | ✓ Yes | Sum counts from partitions |
| AVG | SUM + COUNT | ✓ Yes | SUM(sum_raw) / SUM(count_raw) |
| MIN/MAX | MIN/MAX | ⚠️ Limited | Only at same granularity |
| COUNT_DISTINCT | COUNT_DISTINCT | ✗ No | Can't re-aggregate distinct counts |
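The AVG row is worth a concrete example: averaging pre-computed averages would weight every group equally and give wrong answers, but an exact average can be rebuilt from pre-aggregated sums and counts. A sketch, using the _raw column convention described below:
# Deriving an AVG metric from a SUM + COUNT rollup: re-sum both parts, then divide.
# Averaging per-day averages would weight every day equally and be wrong.
avg_sql = """
SELECT
    status,
    SUM(revenue_raw) / SUM(count_raw) AS avg_order_value
FROM orders_preagg_daily_rollup
GROUP BY 1
"""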
Generated Pre-Aggregation Tables
When materialized, pre-aggregations are stored as physical tables:
Table Naming
Format: {model_name}_preagg_{preagg_name}
Example:
orders_preagg_daily_rollup
Table Structure
For this pre-aggregation:
pre_aggregations:
- name: daily_rollup
measures: [count, revenue]
dimensions: [status]
time_dimension: created_at
granularity: day
Generated table schema:
CREATE TABLE orders_preagg_daily_rollup AS
SELECT
DATE(created_at) as created_at_day,
status,
COUNT(*) as count_raw,
SUM(amount) as revenue_raw
FROM orders
GROUP BY 1, 2
Column naming convention:
- Time dimensions: {dimension_name}_{granularity} (e.g., created_at_day)
- Regular dimensions: Same name as source
- Measures: {metric_name}_raw (e.g., revenue_raw)
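This layout is what makes granularity conversion possible: the _raw measure columns can be re-aggregated at any coarser grain. A sketch of a monthly query served from the daily table (illustrative SQL, not Sidemantic's exact output):
# Day-grain rollup answering a month-grain query
monthly_sql = """
SELECT
    DATE_TRUNC('month', created_at_day) AS created_at_month,
    status,
    SUM(revenue_raw) AS revenue
FROM orders_preagg_daily_rollup
GROUP BY 1, 2
"""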
Generated Queries
When routing to pre-aggregation:
# User query
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"],
use_preaggregations=True
)
# Generated SQL
SELECT
status,
SUM(revenue_raw) as revenue
FROM orders_preagg_daily_rollup
GROUP BY 1
Best Practices
1. Start with High-Impact Rollups
Define pre-aggregations for your most frequent query patterns:
# Dashboard query: revenue by region, daily
pre_aggregations:
- name: dashboard_main
measures: [count, revenue, avg_order_value]
dimensions: [region, status]
time_dimension: created_at
granularity: day
2. Layer Pre-Aggregations by Granularity
Create multiple levels for different use cases:
pre_aggregations:
# High-level summary (fast, small)
- name: monthly_summary
measures: [count, revenue]
dimensions: [region]
time_dimension: created_at
granularity: month
# Mid-level detail
- name: daily_by_region
measures: [count, revenue]
dimensions: [region, status]
time_dimension: created_at
granularity: day
# Full detail (large, comprehensive)
- name: daily_full
measures: [count, revenue, avg_order_value]
dimensions: [region, status, customer_segment, product_category]
time_dimension: created_at
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
3. Balance Size vs. Coverage
More dimensions = larger rollup but matches more queries
Strategy:
- Include dimensions used in 80% of queries
- Avoid very high-cardinality dimensions (like customer_id) unless necessary
- Use partitioning for large pre-aggregations
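The trade-off is easy to estimate before materializing anything: the worst-case rollup size is the product of the dimension cardinalities times the number of time buckets. For example (made-up cardinalities):
from math import prod

days = 3 * 365
dims = {"region": 10, "status": 8}
print(prod(dims.values()) * days)   # 87,600 rows: cheap to store and scan

dims["customer_id"] = 1_000_000     # one high-cardinality dimension...
print(prod(dims.values()) * days)   # ~87.6B row upper bound: the rollup stops helping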
4. Use Incremental Refresh for Large Tables
pre_aggregations:
- name: large_rollup
measures: [count, revenue]
dimensions: [status, region, product_id]
time_dimension: created_at
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "30 day" # Only refresh last 30 days5. Add Indexes for Filtered Queries
If queries frequently filter by specific dimensions, add indexes:
pre_aggregations:
- name: daily_rollup
measures: [revenue]
dimensions: [status, region, customer_id]
time_dimension: created_at
granularity: day
indexes:
- name: region_idx
columns: [region] # Fast filtering by region
- name: composite_idx
columns: [region, status] # Fast filtering by both
Limitations
Single-Model Queries Only
Pre-aggregations only work for queries against a single model:
# ✓ Works - single model
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status"]
)
# ✗ Doesn't work - joins multiple models
sl.query(
metrics=["orders.revenue"],
dimensions=["orders.status", "customers.segment"]
)
# Falls back to base tables
No Support for Window Functions
Queries using cumulative or time-comparison metrics don’t use pre-aggregations:
# ✗ Window functions bypass pre-aggregations
sl.query(
metrics=["orders.cumulative_revenue"], # Uses window functions
dimensions=["orders.created_at__day"]
)
COUNT_DISTINCT Not Derivable
Pre-aggregated COUNT_DISTINCT values can’t be re-aggregated:
# This won't work well
pre_aggregations:
- name: rollup
measures: [unique_customers] # COUNT_DISTINCT
dimensions: [region]
Alternative: Use HyperLogLog approximations or store exact values.
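For the HyperLogLog route, most engines expose an approximate distinct aggregate whose sketches, unlike exact distinct counts, can be merged across groups. A DuckDB-flavored example (function names vary by engine):
import duckdb

con = duckdb.connect("data.db")
# approx_count_distinct is HyperLogLog-based in DuckDB; warehouses like
# BigQuery and Snowflake expose similar mergeable HLL aggregates.
con.sql("""
    SELECT region, approx_count_distinct(customer_id) AS unique_customers
    FROM orders
    GROUP BY 1
""").show()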
Materialization and Refresh
Sidemantic provides a CLI command to materialize and refresh pre-aggregation tables.
Basic Refresh
Refresh all pre-aggregations in your models:
# DuckDB
sidemantic preagg refresh models/ --db data.db
# PostgreSQL, BigQuery, Snowflake, etc.
sidemantic preagg refresh models/ --connection "postgres://localhost:5432/db"
sidemantic preagg refresh models/ --connection "bigquery://project/dataset"The refresh command: - Generates materialization SQL for all pre-aggregations - Creates tables if they don’t exist - Refreshes existing tables based on their refresh configuration - Stateless: derives watermarks from existing table data - Supports incremental, full, and merge refresh modes
Refresh Modes
Control how pre-aggregations are refreshed:
# Incremental refresh (default) - only update changed partitions
sidemantic preagg refresh models/ --db data.db --mode incremental
# Full refresh - rebuild entire table
sidemantic preagg refresh models/ --db data.db --mode full
# Merge mode - upsert based on partition keys
sidemantic preagg refresh models/ --db data.db --mode merge
# Engine mode - use database-native materialized views
sidemantic preagg refresh models/ --connection "snowflake://..." --mode engine
sidemantic preagg refresh models/ --connection "bigquery://project/dataset" --mode engineIncremental mode: - Only refreshes partitions within the update_window - Checks existing table to determine last refresh timestamp - Efficient for large tables with time-based partitions - Requires partition_granularity and time_dimension - Supported: DuckDB only
Full mode:
- Drops and recreates the entire table
- Use when schema changes or data has significant updates
- Slower but ensures complete accuracy
- Supported: DuckDB only
Merge mode:
- Updates existing partitions and inserts new ones
- Good for slowly-changing dimensions
- Requires database support for MERGE/UPSERT
- Supported: DuckDB only
Engine mode:
- Uses database-native materialized views
- Snowflake: DYNAMIC TABLES with automatic refresh
- ClickHouse: MATERIALIZED VIEWS
- BigQuery: MATERIALIZED VIEWS with scheduled refresh
- Database manages refresh automatically
- Validates SQL compatibility (no window functions, etc.)
- Supported: Snowflake, ClickHouse, BigQuery
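As an illustration of what engine mode delegates to the warehouse, a Snowflake dynamic table for the daily rollup could look roughly like this (a hand-written sketch, not Sidemantic's actual output; the warehouse name is a placeholder):
# Roughly the kind of DDL engine mode hands to Snowflake (illustrative)
snowflake_ddl = """
CREATE OR REPLACE DYNAMIC TABLE orders_preagg_daily_rollup
    TARGET_LAG = '1 hour'       -- refresh managed by Snowflake
    WAREHOUSE = transform_wh    -- placeholder warehouse
AS
SELECT
    DATE(created_at) AS created_at_day,
    status,
    COUNT(*) AS count_raw,
    SUM(amount) AS revenue_raw
FROM orders
GROUP BY 1, 2
"""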
Selective Refresh
Refresh specific models or pre-aggregations:
# Only refresh pre-aggregations for the 'orders' model
sidemantic preagg refresh models/ --db data.db --model orders
# Only refresh a specific pre-aggregation
sidemantic preagg refresh models/ --db data.db --model orders --preagg daily_rollup
Scheduling Refreshes
The refresh command is stateless and designed to run on a schedule:
Cron example:
# Refresh every hour
0 * * * * sidemantic preagg refresh /path/to/models --db /path/to/data.db
Airflow example:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
dag = DAG(
'refresh_preaggregations',
schedule_interval='@hourly',
start_date=datetime(2024, 1, 1),
)
refresh_task = BashOperator(
task_id='refresh_preaggs',
bash_command='sidemantic preagg refresh models/ --connection "postgres://host/db"',
dag=dag,
)
GitHub Actions example:
name: Refresh Pre-Aggregations
on:
schedule:
- cron: '0 * * * *' # Every hour
jobs:
refresh:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- run: pip install sidemantic
- run: sidemantic preagg refresh models/ --connection "${{ secrets.DB_CONNECTION }}"
How Refresh Works
The refresh command:
- Discovers pre-aggregations - Parses model YAML files and finds all pre-aggregations
- Checks existing tables - Queries database to see which tables exist and their current watermarks
- Determines refresh strategy - Based on the refresh config and the --mode flag
- Generates SQL - Creates materialization/refresh SQL appropriate for your database
- Executes queries - Runs the SQL to create/update tables
For incremental refresh:
- Reads time_dimension column from existing table to find max timestamp
- Only processes partitions after that timestamp (within update_window)
- Inserts new data into partitioned table
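A sketch of those steps in SQL terms for the daily_rollup example with a "7 day" update window (illustrative; delete-then-insert is one common formulation, and the SQL the CLI actually generates may differ):
import duckdb

con = duckdb.connect("data.db")

# 1. Derive the watermark from the existing table (stateless: no metadata store).
watermark = con.execute(
    "SELECT MAX(created_at_day) FROM orders_preagg_daily_rollup"
).fetchone()[0]

# 2. Recompute only the partitions inside the update window.
con.execute(
    "DELETE FROM orders_preagg_daily_rollup WHERE created_at_day >= ? - INTERVAL 7 DAY",
    [watermark],
)
con.execute(
    """
    INSERT INTO orders_preagg_daily_rollup
    SELECT DATE(created_at), status, COUNT(*), SUM(amount)
    FROM orders
    WHERE DATE(created_at) >= ? - INTERVAL 7 DAY
    GROUP BY 1, 2
    """,
    [watermark],
)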
Cube Format Import
Sidemantic can import pre-aggregations from Cube.js files:
# Cube format
cubes:
- name: Orders
sql_table: orders
dimensions:
- name: status
sql: status
type: string
- name: createdAt
sql: created_at
type: time
measures:
- name: count
type: count
- name: revenue
sql: amount
type: sum
pre_aggregations:
- name: dailyRollup
type: rollup
measures:
- CUBE.count
- CUBE.revenue
dimensions:
- CUBE.status
time_dimension: CUBE.createdAt
granularity: day
partition_granularity: month
refresh_key:
every: 1 hour
incremental: true
update_window: 7 day
Import with CubeAdapter:
from sidemantic.adapters.cube import CubeAdapter
adapter = CubeAdapter()
graph = adapter.parse("path/to/cube/files")
# Pre-aggregations are automatically imported
model = graph.get_model("Orders")
print(model.pre_aggregations)
Summary
Key Takeaways:
- Pre-aggregations provide 100-10,000x speedups for large datasets
- Disabled by default - must enable with use_preaggregations=True
- Automatic routing - queries transparently use matching rollups
- Define in YAML - declarative configuration alongside models
- Use CLI for refresh - sidemantic preagg refresh handles materialization
- Best for predictable patterns - dashboard queries, time-series analytics
Next steps:
- Define pre-aggregations for your busiest queries
- Enable routing: SemanticLayer(use_preaggregations=True)
- Materialize and refresh tables: sidemantic preagg refresh models/ --db data.db
- Monitor performance improvements