Data and Context Engineering have become the key factors for AI/BI. But what does it really take to bring good data and relevant context to AI without breaking the bank? Are the following assumptions realistic?

1. The application/service already has good CI/CD and observability, which equals good data.
2. The underlying datasets have been clearly modeled as conformed dimensions + facts + aggregates.
3. The transformation pipelines are reliable and well maintained.
4. Data integrity & quality are taken care of by the Analytics/Data Engineers or Data Analysts.

This post will break down these myths and explain the WHY & HOW behind 3 critical building blocks of AI for Data:

  • a context graph built on precise content understanding and lineage (schemas and PRD/TDD documents are far from enough); see the sketch after this list
  • smart orchestration driven by data dependencies, compute resources, and cost
  • shift-left with a canonical data model and an (early) integration layer built on ODS or streaming
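To make the first building block concrete, here is a minimal sketch of what a context-graph node might carry beyond a raw schema: semantic annotations, lineage edges, and a validation timestamp so trust can decay over time. The class and field names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical context-graph node; names and fields are illustrative only.
@dataclass
class ContextNode:
    name: str                                                  # e.g. table or dbt model name
    schema: dict[str, str]                                     # column name -> type
    semantics: dict[str, str] = field(default_factory=dict)    # column -> business meaning
    upstream: list[str] = field(default_factory=list)          # lineage: nodes this one is derived from
    downstream: list[str] = field(default_factory=list)        # lineage: nodes derived from this one
    last_validated: datetime | None = None                     # when the semantics were last confirmed

# Example node: a fact table whose GMV column carries an explicit business definition.
orders = ContextNode(
    name="fct_orders",
    schema={"order_id": "string", "gmv_usd": "decimal"},
    semantics={"gmv_usd": "gross merchandise value, USD, excludes refunds"},
    upstream=["stg_orders", "dim_currency"],
    last_validated=datetime(2024, 6, 1),
)
```

The point of the sketch is that schema alone says nothing about what `gmv_usd` means or where it came from; the semantics and lineage fields are what an agent actually needs.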

Table of contents

Why Transformation Pipelines Are Inevitable yet Undervalued

Semantic Context Is Much More Than Schema + Document

| Knowledge | Trust Level | Decay Rate | Coverage/Accuracy |
| --- | --- | --- | --- |
| Certified Query | High | Slow (>0) | Lower than expected, can still be tribal |
| Pipeline / DBT Code | Medium | Medium (tribal) | Partial / tribal |
| BI Report / Dashboard | Medium-Low | Fast (drift) | Siloed (better than ad-hoc only) |
| Document / Wiki | Low | Very fast (often stale) | Low, sparse |
| Agent-discovered | Variable | Tracks with validation timestamp | Variable (but better than manual processes) |
| Human Correction | Very High | Medium | Low but quite accurate |
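As a rough illustration of how these columns could be combined, here is a hedged sketch that ranks context sources by a trust weight decayed by time since last validation. The weights and half-lives below are invented placeholders for illustration, not measured values.

```python
import math
from datetime import datetime, timedelta

# Illustrative only: trust weights and decay half-lives are assumptions.
TRUST = {"certified_query": 0.9, "dbt_code": 0.6, "dashboard": 0.5,
         "wiki": 0.3, "agent_discovered": 0.5, "human_correction": 0.95}
HALF_LIFE_DAYS = {"certified_query": 365, "dbt_code": 120, "dashboard": 45,
                  "wiki": 14, "agent_discovered": 90, "human_correction": 180}

def context_score(kind: str, last_validated: datetime, now: datetime) -> float:
    """Trust weight decayed exponentially by days since last validation."""
    age_days = (now - last_validated).days
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[kind])
    return TRUST[kind] * decay

# Rank three hypothetical sources: a stale wiki page loses to a freshly
# validated agent-discovered fact, while a certified query decays slowly.
now = datetime(2024, 7, 1)
sources = [("certified_query", now - timedelta(days=200)),
           ("wiki", now - timedelta(days=200)),
           ("agent_discovered", now - timedelta(days=5))]
for kind, ts in sorted(sources, key=lambda s: -context_score(s[0], s[1], now)):
    print(f"{kind}: {context_score(kind, ts, now):.2f}")
```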

Data Catalog

Lineage and Observability

Semantic Annotation

Orchestration Must Focus On Data Dependency, Compute Resource and Cost

Do More With Less - “Shift Left”

Inefficient Org Structure and SOP

Total Cost of Ownership/Operation

Versatile Engineer


Building AI infrastructure for data-intensive use cases is hard. We’re working on the boring-yet-necessary components that handle these patterns for you. Join our pilot program to learn more.