Data and Context Engineering have become the key factors for AI/BI. But what does it really take to bring good data and relevant context to AI without breaking the bank? Are the following assumptions realistic?
1. The application/service already has good CI/CD and observability, and that equals good data.
2. The underlying datasets have been clearly modeled as conformed dimensions + facts + aggregates.
3. The transformation pipelines are reliable and well-maintained.
4. Data integrity and quality are taken care of by the Analytics/Data Engineers or Data Analysts.
This post will break down these myths and explain the WHY and HOW of three critical building blocks of AI for Data:
- A context graph based on precise content understanding and lineage (schemas and PRDs/TDDs are far from enough)
- Smart orchestration driven by data dependencies, compute resources, and cost
- Shift-left with a canonical data model and an (early) integration layer via ODS or streaming
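To make the first building block concrete, here is a minimal sketch of what a context graph might look like: datasets, columns, and metrics as nodes carrying semantic annotations (the "content understanding" beyond the raw schema), with upstream edges capturing lineage. All names here (`Node`, `ContextGraph`, the example tables) are illustrative assumptions, not a specific product's API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                # e.g. "orders_fact" or "daily_revenue"
    kind: str                # "table" | "column" | "metric"
    description: str = ""    # semantic annotation beyond what the schema says
    upstream: list = field(default_factory=list)  # lineage edges (parent names)

class ContextGraph:
    def __init__(self):
        self.nodes = {}

    def add(self, node):
        self.nodes[node.name] = node
        return node

    def lineage(self, name):
        """Walk upstream edges to collect every ancestor of a node."""
        seen, stack = [], list(self.nodes[name].upstream)
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.append(n)
                stack.extend(self.nodes[n].upstream)
        return seen

# Hypothetical three-layer lineage: raw event table -> fact table -> metric.
graph = ContextGraph()
graph.add(Node("raw_orders", "table", "append-only order events from the app"))
graph.add(Node("orders_fact", "table", "one row per completed order",
               upstream=["raw_orders"]))
graph.add(Node("daily_revenue", "metric", "sum of order revenue per day",
               upstream=["orders_fact"]))

print(graph.lineage("daily_revenue"))  # ['orders_fact', 'raw_orders']
```

The point of the sketch: an AI agent querying `daily_revenue` gets the full ancestry and the human-written descriptions in one traversal, which is context a schema dump alone cannot provide.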
Table of contents
- Why Transformation Pipelines Are Inevitable yet Undervalued
- Semantic Context Is Much More Than Schema + Document
- Orchestration Must Focus On Data Dependency, Compute Resource and Cost
- Do More With Less - “Shift Left”
Why Transformation Pipelines Are Inevitable yet Undervalued
Semantic Context Is Much More Than Schema + Document
| Knowledge | Trust Level | Decay Rate | Coverage/Accuracy |
|---|---|---|---|
| Certified Query | High | Slow (>0) | Lower-than-expected, can still be tribal |
| Pipeline / DBT Code | Medium | Medium (tribal) | Partial/Tribal |
| BI Report/Dashboard | Medium-Low | Fast (drift) | Siloed (better than ad-hoc only) |
| Document / Wiki | Low | Very Fast (often stale) | Low, Sparse |
| Agent-discovered | Variable | Tracks with Validation Timestamp | Variable (but better than manual processes) |
| Human Correction | Very High | Medium | Low but Quite Accurate |
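The table above suggests a simple way to weight knowledge sources when assembling context: combine a base trust level with an age-dependent decay. The sketch below does this with exponential half-life decay; the source names, trust values, and half-lives are illustrative assumptions loosely mirroring the table, not calibrated numbers.

```python
# Illustrative base trust and decay half-lives (days) per knowledge source.
# All values are assumptions for the sketch, not a calibrated model.
SOURCES = {
    "certified_query":  {"trust": 0.90, "half_life_days": 180},
    "pipeline_code":    {"trust": 0.70, "half_life_days": 90},
    "bi_dashboard":     {"trust": 0.50, "half_life_days": 30},
    "wiki_doc":         {"trust": 0.30, "half_life_days": 14},
    "human_correction": {"trust": 0.95, "half_life_days": 90},
}

def confidence(source: str, age_days: float) -> float:
    """Decay a source's base trust by its half-life: fresh wiki pages
    may beat stale ones, but never outrank a fresh certified query."""
    s = SOURCES[source]
    return s["trust"] * 0.5 ** (age_days / s["half_life_days"])

print(confidence("certified_query", 0))   # 0.9
print(confidence("wiki_doc", 28))         # 0.3 * 0.5^2 = 0.075
```

A ranker like this lets agent-discovered knowledge slot in naturally: its confidence simply tracks the timestamp of its last validation, exactly as the table implies.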
Data Catalog
Lineage and Observability
Semantic Annotation
Orchestration Must Focus On Data Dependency, Compute Resource and Cost
Do More With Less - “Shift Left”
Inefficient Org Structure and SOP
Total Cost of Ownership/Operation
Versatile Engineer
Building AI infrastructure for data-intensive use cases is hard. We’re working on the boring-yet-necessary components that handle these patterns for you. Join our pilot program to learn more.