# Pre-Aggregation Recommendations
Sidemantic automatically analyzes your query history and recommends optimal pre-aggregations. This is unique to Sidemantic—no manual analysis of query patterns required.
## Overview
The pre-aggregation recommendation system:
- Fetches query history from BigQuery, Snowflake, Databricks, ClickHouse, or local files
- Identifies patterns in frequently-run queries (metrics + dimensions + granularities)
- Scores recommendations based on query count, reusability, and consolidation potential
- Generates YAML pre-aggregation definitions and writes them to model files
All queries must include `-- sidemantic:` instrumentation comments (added automatically when queries are compiled through sidemantic).
## Quick Start
```bash
# 1. View recommendations
sidemantic preagg recommend --connection "bigquery://project/dataset"

# 2. Apply top 5 to model files
sidemantic preagg apply models/ --connection "bigquery://project/dataset" --top 5
```

## Commands
### preagg refresh
Materialize and refresh pre-aggregation tables in your database.
```bash
sidemantic preagg refresh models/ --db data.db
sidemantic preagg refresh models/ --connection "postgres://localhost:5432/db"
sidemantic preagg refresh models/ --model orders --mode full
```

Options:

- `--model, -m TEXT`: Only refresh pre-aggregations for this model
- `--preagg, -p TEXT`: Only refresh this specific pre-aggregation
- `--mode TEXT`: Refresh mode: `full`, `incremental`, or `merge` (default: `incremental`)
- `--connection TEXT`: Database connection string
- `--db PATH`: DuckDB file (shorthand for a `duckdb:///` connection)

Refresh modes:

- `incremental` (default): Only update changed partitions (DuckDB only)
- `full`: Drop and recreate the entire table (DuckDB only)
- `merge`: Upsert based on partition keys (DuckDB only)
- `engine`: Use database-native materialized views (Snowflake, ClickHouse, BigQuery)
How it works:
- Discovers pre-aggregations from model YAML files
- Checks existing tables and current watermarks
- Generates materialization/refresh SQL
- Executes queries to create/update tables
For incremental refresh, sidemantic reads the `time_dimension` column to find the max timestamp and only processes partitions after that timestamp (within `update_window`).
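That watermark logic can be sketched roughly as follows. This is a simplified illustration with invented helper names; the real refresh also handles merge keys, partition DDL, and non-daily granularities:

```python
from datetime import date, timedelta

def partitions_to_refresh(existing, incoming, update_window=timedelta(days=7)):
    """Pick which daily partitions an incremental refresh would rebuild.

    existing: partition dates already materialized in the pre-agg table
    incoming: partition dates currently present in the source table
    """
    if not existing:
        return sorted(incoming)  # nothing materialized yet: build everything
    watermark = max(existing)           # max(time_dimension) in the table
    cutoff = watermark - update_window  # reprocess late-arriving data too
    return sorted(d for d in incoming if d >= cutoff)

existing = [date(2024, 1, d) for d in range(1, 11)]   # built through Jan 10
incoming = [date(2024, 1, d) for d in range(1, 13)]   # source has through Jan 12
print(partitions_to_refresh(existing, incoming))
```

With a 7-day `update_window`, everything from Jan 3 onward is rebuilt: the two new partitions plus the window of partitions where late-arriving rows may have landed.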
Engine mode examples:

```bash
# Snowflake DYNAMIC TABLE
sidemantic preagg refresh models/ \
  --connection "snowflake://user:pass@account/db/schema" \
  --mode engine

# BigQuery MATERIALIZED VIEW
sidemantic preagg refresh models/ \
  --connection "bigquery://project/dataset" \
  --mode engine

# ClickHouse MATERIALIZED VIEW
sidemantic preagg refresh models/ \
  --connection "clickhouse://localhost/default" \
  --mode engine
```

Engine mode creates database-native materialized views that refresh automatically according to the refresh configuration in your YAML. The database manages the refresh schedule, so no external orchestration is needed.
See Pre-Aggregations Guide for detailed refresh documentation and scheduling examples.
### preagg recommend
Analyzes query patterns and shows detailed recommendations with benefit scores.
```bash
sidemantic preagg recommend --connection "bigquery://project/dataset"
sidemantic preagg recommend --db data.db --min-count 50 --top 10
sidemantic preagg recommend --queries queries.sql --min-score 0.5
```

Output:
```text
✓ Analyzed 1,245 queries
  Found 23 unique patterns
  8 patterns above threshold

Models:
  orders: 892 queries
  customers: 353 queries

================================================================================
Pre-Aggregation Recommendations (found 8)
================================================================================

1. day_status_revenue
   Model: orders
   Query Count: 234
   Benefit Score: 0.82
   Metrics: orders.revenue
   Dimensions: orders.status
   Granularities: day

2. region_2metrics
   Model: orders
   Query Count: 156
   Benefit Score: 0.74
   Metrics: orders.count, orders.revenue
   Dimensions: orders.region

3. day_segment_count
   Model: customers
   Query Count: 98
   Benefit Score: 0.68
   Metrics: customers.count
   Dimensions: customers.segment
   Granularities: day
```
### preagg apply
Automatically add pre-aggregation definitions to model YAML files.
```bash
sidemantic preagg apply models/ --connection "bigquery://project/dataset"
sidemantic preagg apply models/ --db data.db --top 5
sidemantic preagg apply models/ --queries queries.sql --dry-run
```

Options:

- `--top, -n INT`: Apply only the top N recommendations
- `--dry-run`: Preview changes without writing files
Output:
```text
Found 8 recommendations to apply

+ orders.day_status_revenue (234 queries)
+ orders.region_2metrics (156 queries)
+ customers.day_segment_count (98 queries)

✓ Added 3 pre-aggregations to model files
```
Generated YAML:
```yaml
models:
  - name: orders
    table: public.orders
    dimensions:
      - name: status
        type: categorical
      - name: created_at
        type: time
    metrics:
      - name: revenue
        agg: sum
        sql: amount
    pre_aggregations:
      - name: day_status_revenue
        measures: [revenue]
        dimensions: [status]
        time_dimension: created_at
        granularity: day
```

## Options
Both `recommend` and `apply` support the following options:
### Data Sources
Database connections:

```bash
--connection TEXT   # Database connection string
--db PATH           # DuckDB file (shorthand for duckdb:///path)
```

Query files:

```bash
--queries, -q PATH  # SQL file or folder
```

One of `--connection`, `--db`, or `--queries` is required.
### Query Filtering
```bash
--days, -d INT      # Days of history to fetch (default: 7)
--limit, -l INT     # Max queries to analyze (default: 1000)
```

### Thresholds
```bash
--min-count INT     # Minimum query count (default: 10)
--min-score FLOAT   # Minimum benefit score, 0-1 (default: 0.3)
```

Benefit score factors:
- Query frequency: More queries = higher score (log scale)
- Dimension count: Fewer dimensions = higher score (more reusable)
- Metric count: More metrics = higher score (better consolidation)
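Those three factors could be combined roughly as shown below. The weights and normalization constants here are invented for illustration; the actual scoring formula is internal to sidemantic:

```python
import math

def benefit_score(query_count, n_dimensions, n_metrics,
                  count_scale=1000, max_dims=10, max_metrics=10):
    # Frequency: log-scaled so 10x more queries does not mean 10x the score.
    frequency = min(math.log1p(query_count) / math.log1p(count_scale), 1.0)
    # Reusability: fewer dimensions means more queries can hit the rollup.
    reusability = 1.0 - min(n_dimensions, max_dims) / max_dims
    # Consolidation: more metrics served from one table is better.
    consolidation = min(n_metrics, max_metrics) / max_metrics
    # Invented weights, for illustration only.
    return round(0.6 * frequency + 0.25 * reusability + 0.15 * consolidation, 2)

print(benefit_score(query_count=234, n_dimensions=1, n_metrics=1))
```

Whatever the exact formula, the direction of each factor is what matters: more queries and more metrics push the score up, more dimensions push it down.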
Threshold guidance:

- Production: use defaults (`--min-count 10 --min-score 0.3`)
- Testing: lower thresholds (`--min-count 1 --min-score 0.0`)
- High-traffic: raise thresholds (`--min-count 100 --min-score 0.5`)
## Database Support
### BigQuery
Fetches from `INFORMATION_SCHEMA.JOBS_BY_PROJECT`:

```bash
sidemantic preagg recommend --connection "bigquery://project-id/dataset-id"
```

Requires viewer permissions on the project.
### Snowflake
Fetches from `INFORMATION_SCHEMA.QUERY_HISTORY()`:

```bash
sidemantic preagg recommend \
  --connection "snowflake://user:pass@account/database/schema"
```

Returns the last 7 days only (an `INFORMATION_SCHEMA` limit). For longer history, use `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` (requires account admin privileges).
### Databricks
Fetches from `system.query.history` (Unity Catalog):

```bash
sidemantic preagg recommend \
  --connection "databricks://token@workspace.cloud.databricks.com/sql/1.0/warehouses/abc"
```

Requires Unity Catalog and appropriate permissions.
### ClickHouse
Fetches from `system.query_log`:

```bash
sidemantic preagg recommend \
  --connection "clickhouse://localhost/default"
```

### DuckDB
No built-in query history. Use file-based analysis instead:

```bash
sidemantic preagg recommend --queries query_logs/
```

## Query Instrumentation
Only queries with `-- sidemantic:` comments are analyzed.
Example instrumented query:
```sql
SELECT
    status,
    SUM(revenue_raw) as revenue
FROM orders_cte
GROUP BY 1
-- sidemantic: models=orders metrics=orders.revenue dimensions=orders.status granularities=day
```

Instrumentation is automatic when you compile queries with sidemantic:
```python
from sidemantic import SemanticLayer

sl = SemanticLayer(connection="bigquery://project/dataset")
sql = sl.compile(
    metrics=["orders.revenue"],
    dimensions=["orders.status", "orders.created_at__day"],
)
# Automatically includes: -- sidemantic: models=orders ...
```

Non-instrumented queries are ignored:
- Raw SQL queries not from sidemantic
- Queries without comments
- Multi-model queries (joins between models)
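Putting filtering and pattern detection together, a simplified pass over query history might look like this (a sketch, not sidemantic's actual code):

```python
import re
from collections import Counter

def pattern_counts(queries):
    """Count (models, metrics, dimensions, granularities) patterns,
    skipping queries the recommender would ignore."""
    counts = Counter()
    for sql in queries:
        match = re.search(r"--\s*sidemantic:\s*(.+)", sql)
        if match is None:
            continue  # raw SQL without an instrumentation comment
        fields = dict(pair.partition("=")[::2] for pair in match.group(1).split())
        if "," in fields.get("models", ""):
            continue  # multi-model (join) queries are ignored
        key = (fields.get("models"), fields.get("metrics"),
               fields.get("dimensions"), fields.get("granularities"))
        counts[key] += 1
    return counts

history = [
    "SELECT ... -- sidemantic: models=orders metrics=orders.revenue dimensions=orders.status granularities=day",
    "SELECT ... -- sidemantic: models=orders metrics=orders.revenue dimensions=orders.status granularities=day",
    "SELECT * FROM raw_table",  # not instrumented: skipped
]
print(pattern_counts(history).most_common(1))
```

Each distinct (metrics, dimensions, granularities) combination becomes one candidate pattern, and its count feeds the benefit score.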
## File-Based Analysis
Use `--queries` to analyze queries from files instead of database history.
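Conceptually, file-based loading just splits the file on semicolons and keeps the instrumented statements. A naive sketch, assuming no semicolons inside string literals (a real splitter would have to respect those):

```python
def load_instrumented_queries(text: str) -> list[str]:
    # Split a semicolon-separated SQL file into statements and keep only
    # those carrying a sidemantic instrumentation comment.
    statements = [s.strip() for s in text.split(";") if s.strip()]
    return [s for s in statements if "-- sidemantic:" in s]

sample = """SELECT revenue FROM orders
-- sidemantic: models=orders metrics=orders.revenue dimensions=orders.status;
SELECT 1;
SELECT count FROM customers
-- sidemantic: models=customers metrics=customers.count dimensions=customers.segment;"""
print(len(load_instrumented_queries(sample)))
```

The uninstrumented `SELECT 1` drops out, matching the rule that only instrumented queries are analyzed.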
Single file (semicolon-separated):
```sql
-- queries.sql
SELECT revenue FROM orders
WHERE status = 'completed'
-- sidemantic: models=orders metrics=orders.revenue dimensions=orders.status;

SELECT count FROM customers
WHERE segment = 'enterprise'
-- sidemantic: models=customers metrics=customers.count dimensions=customers.segment;
```

```bash
sidemantic preagg recommend --queries queries.sql
```

Folder (recursive `.sql` search):
```text
query_logs/
  2024-01/
    queries.sql
  2024-02/
    queries.sql
```

```bash
sidemantic preagg recommend --queries query_logs/
```

## Workflow Examples
### Production Deployment
```bash
# 1. Review top 10 recommendations from 30 days of BigQuery history
sidemantic preagg recommend \
  --connection "bigquery://prod-project/analytics" \
  --days 30 \
  --min-count 100 \
  --top 10

# 2. Preview changes (dry run)
sidemantic preagg apply models/ \
  --connection "bigquery://prod-project/analytics" \
  --days 30 \
  --min-count 100 \
  --top 5 \
  --dry-run

# 3. Apply for real
sidemantic preagg apply models/ \
  --connection "bigquery://prod-project/analytics" \
  --days 30 \
  --min-count 100 \
  --top 5

# 4. Refresh the new pre-aggregations
sidemantic preagg refresh models/ --connection "bigquery://prod-project/analytics"
```

### Development Testing
```bash
# Lower thresholds for testing
sidemantic preagg recommend \
  --db dev.db \
  --min-count 1 \
  --min-score 0.0

# Apply just 1 for testing
sidemantic preagg apply models/ \
  --db dev.db \
  --min-count 1 \
  --top 1 \
  --dry-run
```

### File-Based Analysis
```bash
# Export query history from warehouse
bq query --format=prettyjson \
  "SELECT query FROM \`project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT\`
   WHERE query LIKE '%-- sidemantic:%'" > queries.sql

# Generate recommendations
sidemantic preagg recommend --queries queries.sql --min-count 20

# Apply to models
sidemantic preagg apply models/ --queries queries.sql --top 10
```

## Limitations
**Single-model queries only.** Multi-model queries (with joins) are ignored; pre-aggregations only work for single-model queries.

**Instrumentation required.** Only sidemantic-compiled queries are analyzed. Raw SQL queries without `-- sidemantic:` comments are skipped.

**Database support:**

- ✅ BigQuery, Snowflake, Databricks, ClickHouse: built-in query history
- ❌ PostgreSQL, DuckDB: no query history (use `--queries` with files)

**Query patterns only.** Recommendations are based solely on query patterns, not on actual query execution times or data distribution.
## Next Steps
After applying recommendations:
- Review generated YAML: ensure the pre-aggregations make sense for your data
- Materialize tables: use `sidemantic preagg refresh` to create the physical tables
- Enable routing: set `use_preaggregations=True` in your semantic layer
- Monitor performance: compare query times before and after
See Pre-Aggregations Guide for detailed concepts and best practices.