Pre-Aggregations

Pre-aggregations are materialized rollup tables that store pre-computed aggregations for significant query performance improvements. The query engine can automatically route queries to matching pre-aggregations instead of scanning base tables. Pre-aggregations can be configured to refresh automatically on schedules or when source data changes.

Tip: Automatic Recommendations

Sidemantic can automatically analyze your query history and recommend pre-aggregations. This is unique to Sidemantic: no manual analysis of query patterns is required.

Fetch query history from BigQuery, Snowflake, Databricks, or ClickHouse and get instant recommendations with benefit scores:

sidemantic preagg recommend --connection "bigquery://project/dataset"
sidemantic preagg apply models/ --connection "bigquery://project/dataset" --top 5

See the Pre-Aggregation Recommendations guide for complete CLI documentation.

Important: Routing Disabled by Default

Pre-aggregation routing is disabled by default and must be explicitly enabled. See Enabling Routing for details.

Why Use Pre-Aggregations?

Performance Benefits

Pre-aggregations provide dramatic performance improvements by reducing the amount of data scanned:

Example: E-commerce orders table with 100M rows

-- Without pre-aggregation: scan 100M rows
SELECT status, SUM(amount)
FROM orders
GROUP BY status

-- With daily pre-aggregation: scan ~36.5K rows (10 statuses × 365 days × 10 years)
SELECT status, SUM(revenue_raw)
FROM orders_preagg_daily
GROUP BY status

Performance gain: ~2,740x fewer rows scanned
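
Spelled out as a quick sanity check:

# Back-of-envelope for the numbers above
base_rows = 100_000_000
rollup_rows = 10 * 365 * 10        # statuses × days per year × years = 36,500
speedup = base_rows / rollup_rows  # ≈ 2,740x fewer rows scanned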

When to Use Pre-Aggregations

Use pre-aggregations when:

  • Base tables are large (millions+ rows)
  • Queries aggregate data frequently
  • Query patterns are predictable (same dimensions/metrics)
  • Real-time data freshness isn’t critical (can refresh periodically)

Common use cases:

  • Dashboard queries that run frequently
  • Time-series analytics (daily/monthly trends)
  • High-cardinality dimension reductions

Defining Pre-Aggregations

Pre-aggregations are defined in the model’s YAML configuration:

Basic Example

models:
  - name: orders
    table: public.orders
    primary_key: order_id

    dimensions:
      - name: status
        type: categorical
        sql: status
      - name: region
        type: categorical
        sql: region
      - name: created_at
        type: time
        sql: created_at
        granularity: day

    metrics:
      - name: count
        agg: count
      - name: revenue
        agg: sum
        sql: amount

    pre_aggregations:
      - name: daily_summary
        measures: [count, revenue]
        dimensions: [status, region]
        time_dimension: created_at
        granularity: day

This creates a pre-aggregation that:

  • Groups by status and region dimensions
  • Pre-computes count and revenue metrics
  • Aggregates to daily granularity on created_at

Configuration Options

Required Fields

Field       Description                         Example
name        Unique pre-aggregation identifier   daily_summary
measures    List of metrics to pre-compute      [count, revenue]
dimensions  List of dimensions to group by      [status, region]

Optional Fields

Field                  Description                                    Example
time_dimension         Time-based dimension for temporal grouping    created_at
granularity            Time aggregation level                         day
partition_granularity  Partition size for incremental refresh        month
refresh                Refresh strategy (see Refresh Configuration)  See examples below
indexes                Index definitions for performance             See Indexes

Advanced Configuration

Partitioning

Split large pre-aggregations into partitions for faster incremental refresh:

pre_aggregations:
  - name: monthly_partitioned
    measures: [count, revenue]
    dimensions: [status, region]
    time_dimension: created_at
    granularity: day
    partition_granularity: month  # Partition by month
    refresh:
      every: "1 hour"
      incremental: true
      update_window: "7 day"  # Only refresh last 7 days

Benefits:

  • Only refresh changed partitions
  • Faster incremental updates
  • Reduced rebuild time

Refresh Configuration

Pre-aggregations can be configured to refresh automatically using different strategies. The refresh field controls how and when the materialized table is updated.

Time-Based Refresh

Refresh on a fixed schedule:

pre_aggregations:
  - name: hourly_refresh
    measures: [count, revenue]
    dimensions: [status, region]
    time_dimension: created_at
    granularity: day
    refresh:
      every: "1 hour"  # or "30 minutes", "1 day", "1 week", etc.

Common intervals:

  • "30 minutes" - High-frequency dashboards
  • "1 hour" - Standard real-time analytics
  • "1 day" - Daily batch processing
  • "1 week" - Historical analysis

SQL-Triggered Refresh

Refresh when source data changes by checking a SQL query result:

pre_aggregations:
  - name: smart_refresh
    measures: [count, revenue]
    dimensions: [status, region]
    time_dimension: created_at
    granularity: day
    refresh:
      sql: "SELECT MAX(updated_at) FROM orders"

The pre-aggregation refreshes when the SQL query returns a different value. Common patterns:

  • SELECT MAX(updated_at) FROM table - Detects new/updated rows
  • SELECT COUNT(*) FROM table - Detects row count changes
  • SELECT MAX(id) FROM table - Detects new inserts
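
Conceptually, each scheduled check runs the configured SQL and compares the result to the value seen on the previous run. A minimal sketch of that check, with illustrative names (the real logic is internal to the refresh machinery):

import duckdb

conn = duckdb.connect("data.db")

def refresh_key(conn):
    # The refresh key is whatever the configured SQL returns
    return conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]

last_value = None  # persisted between runs by a real scheduler
current = refresh_key(conn)
if current != last_value:
    # Value changed since the last run -> rebuild the rollup here
    last_value = current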

Incremental Refresh

Only update changed partitions instead of rebuilding the entire table:

pre_aggregations:
  - name: incremental_update
    measures: [count, revenue]
    dimensions: [status, region]
    time_dimension: created_at
    granularity: day
    partition_granularity: month
    refresh:
      every: "1 hour"
      incremental: true
      update_window: "7 day"  # Only refresh last 7 days of partitions

Configuration options:

  • incremental: true - Enable incremental refresh
  • update_window - How far back to refresh (e.g., "7 day", "30 day")
  • Requires partition_granularity to be set

Benefits:

  • Much faster refresh times (only rebuilds recent partitions)
  • Efficient for append-only data
  • Ideal for large historical datasets

When to use:

  • Large pre-aggregations with millions+ rows
  • Time-series data with limited updates to historical data
  • Queries that mostly focus on recent data

Combined Refresh Strategies

Mix time-based and incremental refresh:

pre_aggregations:
  - name: combined_strategy
    measures: [count, revenue]
    dimensions: [status, region]
    time_dimension: created_at
    granularity: day
    partition_granularity: month
    refresh:
      every: "1 hour"           # Check every hour
      incremental: true         # Only update recent partitions
      update_window: "30 day"   # Refresh last 30 days

Indexes

Add indexes for query performance:

pre_aggregations:
  - name: daily_summary
    measures: [count, revenue]
    dimensions: [status, region, customer_id]
    time_dimension: created_at
    granularity: day
    indexes:
      - name: status_idx
        columns: [status]
      - name: composite_idx
        columns: [status, region, customer_id]

Index ordering: Put high-selectivity columns first.

Enabling Routing

Warning

Pre-aggregation routing is disabled by default. Queries will use base tables unless routing is explicitly enabled.

Global Enable

Enable for all queries in the semantic layer:

from sidemantic import SemanticLayer

# Enable pre-aggregation routing globally
sl = SemanticLayer(use_preaggregations=True)

Per-Query Override

Override the global setting for specific queries:

# Disabled globally, enable for one query
sl = SemanticLayer(use_preaggregations=False)

sql = sl.compile(
    metrics=["orders.revenue"],
    dimensions=["orders.status"],
    use_preaggregations=True  # Enable for this query only
)

# Enabled globally, disable for one query
sl = SemanticLayer(use_preaggregations=True)

result = sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.customer_id"],
    use_preaggregations=False  # Disable for this query
)

Why Disabled by Default?

  1. Pre-aggregation tables must exist - Routing fails if materialized tables aren’t created in your database
  2. Safer default - Prevents query errors when pre-aggs aren’t materialized
  3. Explicit opt-in - Users control when optimization is active

Query Matching

The query engine automatically finds the best matching pre-aggregation based on:

Matching Rules

  1. Dimension Subset: Query dimensions must be a subset of pre-aggregation dimensions
  2. Measure Compatibility: All query metrics must be derivable from pre-aggregated measures
  3. Granularity Compatibility: Query time granularity must be coarser or equal to pre-aggregation granularity
  4. Best Match Selection: Chooses the smallest/most specific matching pre-aggregation
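
As a rough sketch, rules 1-3 behave like the predicate below (the dict-shaped inputs and simplified granularity ladder are illustrative; the real matcher is internal to the query engine):

GRANULARITIES = ["hour", "day", "week", "month", "quarter", "year"]

def preagg_matches(query: dict, preagg: dict) -> bool:
    """Simplified version of the matching rules above."""
    # 1. Query dimensions must be a subset of pre-agg dimensions
    if not set(query["dimensions"]) <= set(preagg["dimensions"]):
        return False
    # 2. Every query metric must be covered by pre-agg measures
    if not set(query["metrics"]) <= set(preagg["measures"]):
        return False
    # 3. Query granularity must be equal to or coarser than the pre-agg's
    q, p = query.get("granularity"), preagg.get("granularity")
    if q and p and GRANULARITIES.index(q) < GRANULARITIES.index(p):
        return False
    return True

Rule 4 then picks the smallest matching rollup among the candidates that pass.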

Matching Examples

Given this pre-aggregation:

pre_aggregations:
  - name: daily_rollup
    measures: [count, revenue]
    dimensions: [status, region, customer_id]
    time_dimension: created_at
    granularity: day

Queries That Match ✓

# Subset of dimensions (just status)
sl.query(metrics=["orders.revenue"], dimensions=["orders.status"])
# → Uses daily_rollup

# All dimensions present
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.status", "orders.region", "orders.customer_id"]
)
# → Uses daily_rollup

# Coarser granularity (month > day) - rolls up
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.created_at__month"]
)
# → Uses daily_rollup, converts day → month

# With filters - applied on top
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.status"],
    filters=["orders.region = 'US'"]
)
# → Uses daily_rollup with WHERE region = 'US'

Queries That Don’t Match ✗

# Dimension not in pre-agg
sl.query(metrics=["orders.revenue"], dimensions=["orders.product_id"])
# → Falls back to base table

# Finer granularity (hour < day) - can't drill down
sl.query(metrics=["orders.revenue"], dimensions=["orders.created_at__hour"])
# → Falls back to base table

# Metric not in pre-agg
sl.query(metrics=["orders.profit"], dimensions=["orders.status"])
# → Falls back to base table

Measure Derivability

Different aggregation types have different derivability rules:

Query Metric    Pre-Agg Requires  Derivable?  Notes
SUM             SUM               ✓ Yes       Direct re-aggregation
COUNT           COUNT             ✓ Yes       Sum counts from partitions
AVG             SUM + COUNT       ✓ Yes       SUM(sum_raw) / SUM(count_raw)
MIN/MAX         MIN/MAX           ⚠️ Limited   Only at same granularity
COUNT_DISTINCT  COUNT_DISTINCT    ✗ No        Can’t re-aggregate distinct counts
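
The AVG row is the instructive one: the rollup stores the sum and count separately, and the query divides only after re-aggregating both. Averaging per-partition averages would be wrong, as this small check shows:

# Per-day rollup rows as (sum_raw, count_raw) pairs
partitions = [(1000.0, 10), (2500.0, 20), (400.0, 5)]

# Correct: re-aggregate the sums and counts, then divide
total_sum = sum(s for s, _ in partitions)    # 3900.0
total_count = sum(c for _, c in partitions)  # 35
avg = total_sum / total_count                # ≈ 111.4 (correct, weighted)

# Incorrect: unweighted average of per-partition averages
naive = sum(s / c for s, c in partitions) / len(partitions)  # ≈ 101.7 (wrong)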

Generated Pre-Aggregation Tables

When materialized, pre-aggregations are stored as physical tables:

Table Naming

Format: {model_name}_preagg_{preagg_name}

Example:

orders_preagg_daily_rollup
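
In code, the derivation is plain string formatting:

model_name, preagg_name = "orders", "daily_rollup"
table_name = f"{model_name}_preagg_{preagg_name}"
print(table_name)  # orders_preagg_daily_rollup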

Table Structure

For this pre-aggregation:

pre_aggregations:
  - name: daily_rollup
    measures: [count, revenue]
    dimensions: [status]
    time_dimension: created_at
    granularity: day

Generated table schema:

CREATE TABLE orders_preagg_daily_rollup AS
SELECT
  DATE(created_at) as created_at_day,
  status,
  COUNT(*) as count_raw,
  SUM(amount) as revenue_raw
FROM orders
GROUP BY 1, 2

Column naming convention:

  • Time dimensions: {dimension_name}_{granularity} (e.g., created_at_day)
  • Regular dimensions: Same name as source
  • Measures: {metric_name}_raw (e.g., revenue_raw)

Generated Queries

When routing to pre-aggregation:

# User query
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.status"],
    use_preaggregations=True
)

Generated SQL:

SELECT
  status,
  SUM(revenue_raw) as revenue
FROM orders_preagg_daily_rollup
GROUP BY 1

Best Practices

1. Start with High-Impact Rollups

Define pre-aggregations for your most frequent query patterns:

# Dashboard query: revenue by region, daily
pre_aggregations:
  - name: dashboard_main
    measures: [count, revenue, avg_order_value]
    dimensions: [region, status]
    time_dimension: created_at
    granularity: day

2. Layer Pre-Aggregations by Granularity

Create multiple levels for different use cases:

pre_aggregations:
  # High-level summary (fast, small)
  - name: monthly_summary
    measures: [count, revenue]
    dimensions: [region]
    time_dimension: created_at
    granularity: month

  # Mid-level detail
  - name: daily_by_region
    measures: [count, revenue]
    dimensions: [region, status]
    time_dimension: created_at
    granularity: day

  # Full detail (large, comprehensive)
  - name: daily_full
    measures: [count, revenue, avg_order_value]
    dimensions: [region, status, customer_segment, product_category]
    time_dimension: created_at
    granularity: day
    partition_granularity: month
    refresh:
      every: "1 hour"
      incremental: true

3. Balance Size vs. Coverage

More dimensions = larger rollup but matches more queries

Strategy:

  • Include dimensions used in 80% of queries
  • Avoid very high-cardinality dimensions (like customer_id) unless necessary
  • Use partitioning for large pre-aggregations
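
A back-of-envelope size estimate makes the trade-off concrete (the cardinalities here are hypothetical; actual table size is capped by the combinations that really occur in the data):

# Upper bound on rollup rows = product of dimension cardinalities
statuses, regions, days = 10, 5, 3 * 365
max_rows = statuses * regions * days  # 54,750 rows: a genuine rollup

# Adding a high-cardinality dimension multiplies that bound
customers = 1_000_000
max_rows *= customers  # ~5.5e10 upper bound: no longer a rollup at all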

4. Use Incremental Refresh for Large Tables

pre_aggregations:
  - name: large_rollup
    measures: [count, revenue]
    dimensions: [status, region, product_id]
    time_dimension: created_at
    granularity: day
    partition_granularity: month  # Partition by month
    refresh:
      every: "1 hour"
      incremental: true
      update_window: "30 day"  # Only refresh last 30 days

5. Add Indexes for Filtered Queries

If queries frequently filter by specific dimensions, add indexes:

pre_aggregations:
  - name: daily_rollup
    measures: [revenue]
    dimensions: [status, region, customer_id]
    time_dimension: created_at
    granularity: day
    indexes:
      - name: region_idx
        columns: [region]  # Fast filtering by region
      - name: composite_idx
        columns: [region, status]  # Fast filtering by both

Limitations

Single-Model Queries Only

Pre-aggregations only work for queries against a single model:

# ✓ Works - single model
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.status"]
)

# ✗ Doesn't work - joins multiple models
sl.query(
    metrics=["orders.revenue"],
    dimensions=["orders.status", "customers.segment"]
)
# Falls back to base tables

No Support for Window Functions

Queries using cumulative or time-comparison metrics don’t use pre-aggregations:

# ✗ Window functions bypass pre-aggregations
sl.query(
    metrics=["orders.cumulative_revenue"],  # Uses window functions
    dimensions=["orders.created_at__day"]
)

COUNT_DISTINCT Not Derivable

Pre-aggregated COUNT_DISTINCT values can’t be re-aggregated:

# This won't work well
pre_aggregations:
  - name: rollup
    measures: [unique_customers]  # COUNT_DISTINCT
    dimensions: [region]

Alternative: Use HyperLogLog approximations or store exact values.
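
If approximate counts are acceptable, the usual workaround is to store a mergeable HyperLogLog sketch in the rollup instead of a finished distinct count. Here is a sketch of the pattern using BigQuery's HLL functions (Snowflake's HLL_ACCUMULATE / HLL_COMBINE are analogous); this is hand-written SQL, not something Sidemantic generates:

# Rollup stores a mergeable sketch per group rather than a distinct count
rollup_sql = """
CREATE TABLE orders_preagg_rollup AS
SELECT
  region,
  HLL_COUNT.INIT(customer_id) AS customers_hll  -- mergeable sketch
FROM orders
GROUP BY region
"""

# At query time the sketches merge correctly across groups and partitions
query_sql = """
SELECT HLL_COUNT.MERGE(customers_hll) AS unique_customers
FROM orders_preagg_rollup
"""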

Materialization and Refresh

Sidemantic provides a CLI command to materialize and refresh pre-aggregation tables.

Basic Refresh

Refresh all pre-aggregations in your models:

# DuckDB
sidemantic preagg refresh models/ --db data.db

# PostgreSQL, BigQuery, Snowflake, etc.
sidemantic preagg refresh models/ --connection "postgres://localhost:5432/db"
sidemantic preagg refresh models/ --connection "bigquery://project/dataset"

The refresh command:

  • Generates materialization SQL for all pre-aggregations
  • Creates tables if they don’t exist
  • Refreshes existing tables based on their refresh configuration
  • Stateless: derives watermarks from existing table data
  • Supports incremental, full, and merge refresh modes

Refresh Modes

Control how pre-aggregations are refreshed:

# Incremental refresh (default) - only update changed partitions
sidemantic preagg refresh models/ --db data.db --mode incremental

# Full refresh - rebuild entire table
sidemantic preagg refresh models/ --db data.db --mode full

# Merge mode - upsert based on partition keys
sidemantic preagg refresh models/ --db data.db --mode merge

# Engine mode - use database-native materialized views
sidemantic preagg refresh models/ --connection "snowflake://..." --mode engine
sidemantic preagg refresh models/ --connection "bigquery://project/dataset" --mode engine

Incremental mode:

  • Only refreshes partitions within the update_window
  • Checks existing table to determine last refresh timestamp
  • Efficient for large tables with time-based partitions
  • Requires partition_granularity and time_dimension
  • Supported: DuckDB only

Full mode:

  • Drops and recreates the entire table
  • Use when schema changes or data has significant updates
  • Slower but ensures complete accuracy
  • Supported: DuckDB only

Merge mode:

  • Updates existing partitions and inserts new ones
  • Good for slowly-changing dimensions
  • Requires database support for MERGE/UPSERT
  • Supported: DuckDB only

Engine mode:

  • Uses database-native materialized views
  • Snowflake: DYNAMIC TABLES with automatic refresh
  • ClickHouse: MATERIALIZED VIEWS
  • BigQuery: MATERIALIZED VIEWS with scheduled refresh
  • Database manages refresh automatically
  • Validates SQL compatibility (no window functions, etc.)
  • Supported: Snowflake, ClickHouse, BigQuery

Selective Refresh

Refresh specific models or pre-aggregations:

# Only refresh pre-aggregations for the 'orders' model
sidemantic preagg refresh models/ --db data.db --model orders

# Only refresh a specific pre-aggregation
sidemantic preagg refresh models/ --db data.db --model orders --preagg daily_rollup

Scheduling Refreshes

The refresh command is stateless and designed to run on a schedule:

Cron example:

# Refresh every hour
0 * * * * sidemantic preagg refresh /path/to/models --db /path/to/data.db

Airflow example:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG(
    'refresh_preaggregations',
    schedule_interval='@hourly',
    start_date=datetime(2024, 1, 1),
)

refresh_task = BashOperator(
    task_id='refresh_preaggs',
    bash_command='sidemantic preagg refresh models/ --connection "postgres://host/db"',
    dag=dag,
)

GitHub Actions example:

name: Refresh Pre-Aggregations
on:
  schedule:
    - cron: '0 * * * *'  # Every hour

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install sidemantic
      - run: sidemantic preagg refresh models/ --connection "${{ secrets.DB_CONNECTION }}"

How Refresh Works

The refresh command:

  1. Discovers pre-aggregations - Parses model YAML files and finds all pre-aggregations
  2. Checks existing tables - Queries database to see which tables exist and their current watermarks
  3. Determines refresh strategy - Based on refresh config and --mode flag
  4. Generates SQL - Creates materialization/refresh SQL appropriate for your database
  5. Executes queries - Runs the SQL to create/update tables

For incremental refresh:

  • Reads time_dimension column from existing table to find max timestamp
  • Only processes partitions after that timestamp (within update_window)
  • Inserts new data into partitioned table
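
A minimal sketch of that stateless watermark derivation (illustrative table and column names, DuckDB shown):

import duckdb

conn = duckdb.connect("data.db")

# The watermark comes from the rollup itself, so no state file is needed
row = conn.execute(
    "SELECT MAX(created_at_day) FROM orders_preagg_daily_rollup"
).fetchone()
watermark = row[0]  # last materialized day, or None on a first build

# The refresh then rebuilds only partitions after the watermark,
# bounded by the configured update_window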

Cube Format Import

Sidemantic can import pre-aggregations from Cube.js files:

# Cube format
cubes:
  - name: Orders
    sql_table: orders

    dimensions:
      - name: status
        sql: status
        type: string
      - name: createdAt
        sql: created_at
        type: time

    measures:
      - name: count
        type: count
      - name: revenue
        sql: amount
        type: sum

    pre_aggregations:
      - name: dailyRollup
        type: rollup
        measures:
          - CUBE.count
          - CUBE.revenue
        dimensions:
          - CUBE.status
        time_dimension: CUBE.createdAt
        granularity: day
        partition_granularity: month
        refresh_key:
          every: 1 hour
          incremental: true
          update_window: 7 day

Import with CubeAdapter:

from sidemantic.adapters.cube import CubeAdapter

adapter = CubeAdapter()
graph = adapter.parse("path/to/cube/files")

# Pre-aggregations are automatically imported
model = graph.get_model("Orders")
print(model.pre_aggregations)

Summary

Key Takeaways:

  1. Pre-aggregations provide 100-10,000x speedups for large datasets
  2. Disabled by default - must enable with use_preaggregations=True
  3. Automatic routing - queries transparently use matching rollups
  4. Define in YAML - declarative configuration alongside models
  5. Use CLI for refresh - sidemantic preagg refresh handles materialization
  6. Best for predictable patterns - dashboard queries, time-series analytics

Next steps:

  • Define pre-aggregations for your busiest queries
  • Enable routing: SemanticLayer(use_preaggregations=True)
  • Materialize and refresh tables: sidemantic preagg refresh models/ --db data.db
  • Monitor performance improvements