Migrator

Migrates existing SQL queries to semantic layer by analyzing queries and generating model definitions.

Key features:

  1. Generate model definitions from existing queries
  2. Rewrite queries to use semantic layer syntax
  3. Measure what’s missing from your semantic layer

Quick Start

Generate models from your existing queries:

sidemantic migrator analyze queries/ --output-dir models/

This analyzes SQL files in queries/ and generates model YAML files in models/.

Example output models/orders.yml:

model:
  name: orders
  table: orders
dimensions:
  - name: status
    sql: status
  - name: region
    sql: region
metrics:
  - name: sum_amount
    agg: sum
    sql: amount
  - name: count
    agg: count
    sql: "*"

Add --rewrite to also generate rewritten queries:

sidemantic migrator analyze queries/ --output-dir models/ --rewrite

Original query:

SELECT status, region, SUM(amount), COUNT(*)
FROM orders
WHERE status = 'completed'
GROUP BY status, region

Rewritten query:

SELECT orders.status, orders.region, orders.sum_amount, orders.count
FROM orders
WHERE status = 'completed'

Supported Patterns

Basic Aggregations

Extracts dimensions from GROUP BY and metrics from aggregations:

SELECT
    status,
    SUM(amount) as revenue,
    COUNT(DISTINCT customer_id) as customers
FROM orders
GROUP BY status

Complex Expressions

Handles CASE, COALESCE, CAST, math, string functions:

SELECT
    UPPER(status) as status,
    SUM(CASE WHEN priority = 'high' THEN amount ELSE 0 END) as high_priority_revenue,
    SUM(quantity * price) as total_revenue
FROM orders
GROUP BY UPPER(status)

Extracts underlying columns (status, amount, quantity, price).

Time Dimensions

SELECT
    DATE_TRUNC('month', order_date) as month,
    EXTRACT(YEAR FROM order_date) as year,
    SUM(amount) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date), EXTRACT(YEAR FROM order_date)

Extracts order_date as time dimension. Rewrites to orders.order_date__month, orders.order_date__year.

Derived Metrics

SELECT
    status,
    SUM(revenue) / COUNT(*) as avg_revenue_per_order,
    SUM(amount - discount) as net_amount
FROM orders
GROUP BY status

Extracts base metrics and derived metrics separately:

metrics:
  - name: sum_revenue
    agg: sum
    sql: revenue
  - name: count
    agg: count
    sql: "*"
  - name: avg_revenue_per_order
    type: derived
    sql: "SUM(revenue) / COUNT(*)"

Joins and Relationships

SELECT
    c.region,
    COUNT(o.order_id) as order_count
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.region

Extracts relationship: orders.customer_idcustomers.id (many_to_one).

# orders.yml
relationships:
  - name: customers
    type: many_to_one
    foreign_key: customer_id

Subqueries

Resolves subquery aliases to underlying tables:

SELECT
    sub.status,
    COUNT(*) as order_count
FROM (
    SELECT status, amount
    FROM orders
    WHERE amount > 100
) sub
GROUP BY sub.status

Extracts orders table and resolves sub.statusorders.status.

Window Functions and Cumulative Metrics

Detects window functions and generates cumulative metric definitions.

Running totals:

SELECT
    order_date,
    SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders
metrics:
  - name: running_total
    type: cumulative
    sql: "orders.sum_amount"

Rolling windows:

SELECT
    order_date,
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_7day
FROM orders
metrics:
  - name: rolling_7day
    type: cumulative
    sql: "orders.sum_amount"
    window: "6 days"

Period-to-date:

SELECT
    order_date,
    SUM(amount) OVER (
        PARTITION BY DATE_TRUNC('month', order_date)
        ORDER BY order_date
    ) as mtd_revenue
FROM orders
metrics:
  - name: mtd_revenue
    type: cumulative
    sql: "orders.sum_amount"
    grain_to_date: "month"

What’s extracted:

  • Aggregation window functions: SUM/AVG/COUNT/MIN/MAX with OVER()
  • Rolling windows with ROWS BETWEEN
  • Period-to-date with PARTITION BY DATE_TRUNC()

What’s ignored:

  • Window-only functions: ROW_NUMBER(), RANK(), LAG(), LEAD()
  • Complex window calculations like COUNT(*) * 100.0 / SUM(COUNT(*)) OVER()

Base aggregations inside complex calculations are still extracted as regular metrics.

Other Patterns

CTEs: Extracts from underlying tables, preserves structure in rewrites

GROUP BY ordinals: GROUP BY 1, 2 resolved to columns

Implicit joins: FROM a, b WHERE a.id = b.fk extracts relationships

Self-joins: Generates single model

CLI Usage

Generate models from queries:

sidemantic migrator analyze queries/ --output-dir models/

Generate models and rewritten queries:

sidemantic migrator analyze queries/ --output-dir models/ --rewrite

This analyzes all SQL files in queries/ and generates: - Model YAML files in models/ - Rewritten queries in queries_rewritten/ (if --rewrite flag used)

Other commands:

# Analyze specific query
sidemantic migrator analyze-query \
    "SELECT status, COUNT(*) FROM orders GROUP BY status"

# Generate coverage report
sidemantic migrator report queries/ > coverage-report.txt

Options:

  • --pattern "*.sql" - File pattern to match
  • --output-dir - Where to write generated models
  • --rewrite - Also generate rewritten queries
  • --verbose - Show detailed analysis

Migration Workflow

1. Analyze queries and generate models

sidemantic migrator analyze queries/ --output-dir models/generated/ --verbose

Review generated models in models/generated/ directory.

2. Generate rewritten queries

sidemantic migrator analyze queries/ --output-dir models/generated/ --rewrite

This creates: - Model definitions in models/generated/ - Rewritten queries in queries_rewritten/

3. Test rewritten queries

Compare results between original and rewritten queries to ensure correctness.

4. Deploy models

cp models/generated/*.yml models/

Load into your semantic layer and start using.

Troubleshooting

Low Coverage

Cause: Missing models, dimensions, or metrics.

Fix: Generate models from your queries first:

sidemantic migrator analyze queries/ --output-dir models/

Then load those models into your semantic layer before re-analyzing.

Dimensions Not Extracted

Cause: No GROUP BY clause.

-- ❌ No dimensions (no GROUP BY)
SELECT COUNT(*) FROM orders

-- ✅ Extracts status
SELECT status, COUNT(*) FROM orders GROUP BY status

Relationships Not Detected

Causes:

  • Non-equi joins
  • Complex conditions
  • Multiple join keys

Fix: Manually add relationships to generated models or use simple equality joins in queries.

Metric Names Don’t Match

Analyzer generates names from agg_column:

  • SUM(amount)sum_amount
  • COUNT(*)count

Use aliases to control names:

SELECT SUM(amount) as total_revenue FROM orders  -- Generates "total_revenue"

Python API

Get coverage metrics from report:

report = analyzer.analyze_queries(queries)

print(f"Coverage: {report.coverage_percentage:.1f}%")
print(f"Missing models: {report.missing_models}")
print(f"Missing dimensions: {report.missing_dimensions}")

Provide database connection for better inference:

import duckdb

conn = duckdb.connect(":memory:")
analyzer = Migrator(layer, connection=conn)

Benefits: Resolves ambiguous columns, infers relationships from foreign keys, detects primary keys.