Executive Summary
Hybrid-by-accident is now hybrid-by-design. Most enterprises operate across on-prem, edge, and multiple clouds while modernizing applications and adding AI to them. That sprawl multiplies the volume of security-relevant telemetry along with the access paths, identities, and compliance obligations that come with it. Centralizing everything into one SIEM or data lake is proving cost-prohibitive and operationally brittle. A federated data platform approach, grounded in open formats, layered pipelines, and strong identity and lineage controls, offers a path to actionable signals without drowning in noise or overspending. This brief is based on a conversation Balázs Scheidler and I had on his podcast Data Strikes Back (below).
Why This Matters Now
- Attack surface expansion: APIs, agents, and AI-driven access patterns introduce new lateral paths and privilege escalation vectors across heterogeneous stacks.
- Data gravity & compliance: Regulated data often lives in multiple systems; copying it multiplies risk (freshness, residency, duty of care) while complicating authorization and audit.
- Cost shock: “Store everything, correlate later” (mass-ingest to SIEM/object storage) is increasingly uneconomic—especially for verbose cloud-provider logs.
- Skills gap: Platform engineering converges roles, but few teams are deep across three+ hyperscalers, mainframe/back-office feeds, Kubernetes, and legacy NAS/DBs.
From Monoliths to Federated Security Data Platforms
Old model: Aggregate all logs to a central lake/SIEM, normalize, correlate, alert.
Emerging model:
- Decouple storage & compute. Use open table formats (e.g., Iceberg, Delta, Hudi) so multiple engines can query the same governed data without lock-in or duplication.
- Federated query + tiered pipelines. Keep raw data at source or in domain lakes; build pipelines that create summarized views for SOC workflows, with drill-down to raw for forensics.
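To make the "summarized views with drill-down to raw" idea concrete, here is a minimal Python sketch. It is illustrative only: the `RawEvent` and `SummaryRow` shapes, the per-minute bucketing, and the in-memory `raw_store` are assumptions made for the example, not part of any particular product or of the architecture discussed in the podcast. The point is that the SOC-facing view carries counts plus pointers to the raw records, not copies of them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawEvent:
    """A raw record that stays in its source system or domain lake (hypothetical shape)."""
    event_id: str
    source: str
    principal: str
    action: str
    minute: int  # event time bucketed to the minute


@dataclass
class SummaryRow:
    """One row of the summarized SOC view: a count plus pointers back to raw."""
    source: str
    principal: str
    action: str
    minute: int
    count: int = 0
    raw_ids: list = field(default_factory=list)


def summarize(events):
    """Roll raw events up by (source, principal, action, minute)."""
    rows = {}
    for ev in events:
        key = (ev.source, ev.principal, ev.action, ev.minute)
        row = rows.get(key)
        if row is None:
            row = rows[key] = SummaryRow(*key)
        row.count += 1
        row.raw_ids.append(ev.event_id)  # keep the pointer, not the payload
    return list(rows.values())


def drill_down(row, raw_store):
    """Fetch the raw evidence behind a summary row, on demand."""
    return [raw_store[event_id] for event_id in row.raw_ids]
```

In practice `raw_store` would be a federated query against the source or domain lake rather than a dictionary; the sketch only shows the contract between the summary tier and the raw tier.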
Layered architecture:
- Source & Collection: OpenTelemetry/Prometheus/agents, cloud-native logging, network, mainframe, and enterprise app feeds.
- Transit & Governance: Policy-aware pipelines; PII minimization; encryption; lineage capture.
- Storage & Formats: Open tables and domain lakes; immutability options (e.g., WORM/tape) for long-term legal holds.
- Processing & Views: Stream filters, feature extraction, and rollups tuned for detection & triage; retain access to raw.
- Access & Control: Identity, authorization, and attribute-based access for humans and AI agents; auditable policies.
- Consumption: SIEM/SOAR, threat hunting, forensics, analyst workbenches, and AI/agentic automations via well-scoped views.
Key Design Principles for CISOs
- Federate first; centralize only where it pays. Not all data merits hot, centralized storage. Use federated search for breadth; centralize signals or high-value domains.
- Keep raw, govern views. Retain original evidence for forensics and legal defensibility; expose least-privilege, purpose-built views for SOC and AI agents.
- Identity is the new perimeter—for humans and agents. Treat agents like users: strong authentication, scoped tokens, JIT/JEA access, and full session auditing.
- Lineage everywhere. Track provenance and transformations; make lineage queryable during incident response.
- Open over proprietary. Favor open collection and data formats to reduce lock-in and enable multi-engine analytics.
- Cost as a control. Price out each pipeline stage (ingest, transform, store, query) and push summarization/feature extraction to the edge where possible.
- Resilience & retention. Combine object storage with immutable tiers (including tape/air-gap) aligned to legal hold policies.
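The "treat agents like users" principle can be sketched as a short-lived, view-scoped grant with an audit record for every decision. This is a minimal illustration under stated assumptions: the `Grant` type, the `issue_grant` and `authorize` helpers, and the view names are all hypothetical, and a real deployment would rely on the organization's identity provider and policy engine rather than in-process checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """A just-in-time grant for a human or agent identity (hypothetical shape)."""
    principal: str
    views: frozenset      # the only views this grant covers (least privilege)
    expires_at: datetime  # short-lived by construction


def issue_grant(principal, views, ttl_minutes, now):
    """Issue a grant that expires ttl_minutes after now."""
    return Grant(principal, frozenset(views), now + timedelta(minutes=ttl_minutes))


def authorize(grant, view, now, audit_log):
    """Allow access only to in-scope views before expiry; audit every decision."""
    allowed = view in grant.views and now < grant.expires_at
    audit_log.append({
        "principal": grant.principal,
        "view": view,
        "at": now.isoformat(),
        "allowed": allowed,
    })
    return allowed
```

Note that denials are logged as well as approvals; out-of-scope requests from an agent are often the more interesting signal.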
What Changes with AI & Agents
- Access pattern explosion: Agents traverse APIs and systems faster than humans, increasing blast radius if credentials or scopes are lax.
- View-driven security analytics: Curated “security views” feed agents with bounded, policy-enforced context—minimizing over-permissioned data exposure.
- Noise suppression at the edge: Use ML/AI pre-filters in pipelines to create signals before data hits expensive tiers; keep links (lineage) to raw for explainability.
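As a rough illustration of edge pre-filtering with lineage, the sketch below emits a signal only for activity that is rare relative to a baseline, and attaches a lineage record pointing back to the raw evidence. A simple frequency threshold stands in for the ML/AI pre-filter mentioned above. The transform name, the `source_uri` value, and the record fields are assumptions made for the example.

```python
import hashlib
import json

TRANSFORM = "rarity_prefilter@v0"  # hypothetical transform name and version


def fingerprint(record):
    """Stable SHA-256 over a canonical JSON encoding of the raw record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prefilter(records, baseline, source_uri, max_seen=0):
    """Emit signals for (principal, action) pairs seen at most max_seen times
    in the baseline; everything else is treated as noise and stays at source."""
    signals = []
    for offset, rec in enumerate(records):
        key = (rec["principal"], rec["action"])
        if baseline.get(key, 0) <= max_seen:
            signals.append({
                "principal": rec["principal"],
                "action": rec["action"],
                "lineage": {
                    "source_uri": source_uri,   # where the raw record lives
                    "offset": offset,           # position within that source
                    "sha256": fingerprint(rec), # lets responders verify the evidence
                    "transform": TRANSFORM,     # which filter produced the signal
                },
            })
    return signals


def verify(signal, raw_record):
    """Check that a raw record is the one the signal was derived from."""
    return fingerprint(raw_record) == signal["lineage"]["sha256"]
```

The lineage block is what keeps the signal explainable: an analyst or agent can follow `source_uri` and `offset` to the original record and confirm with `verify` that it has not changed.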
Operating Model: Process Before Platform
- Adopt a framework (e.g., NIST CSF) and map controls to layers. Select the right tool for each layer rather than a one-size-fits-all product.
- Standardize integrations. Institutionalize patterns for collection, transformation, and quality checks; reduce bespoke scripts that decay and widen the attack surface.
- Measure what matters. Track MTTR, false-positive rate, % of detections tied to federated sources, pipeline cost per alert, lineage coverage, and agent access exceptions.
- Vendor strategy. Prefer vendors that (a) interoperate with open standards, (b) support federated access, and (c) are explicit about what they don’t do.
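A few of the metrics above reduce to simple arithmetic once the inputs are collected. The sketch below shows one possible way to compute pipeline cost per alert, false-positive rate, and lineage coverage; the `PipelineMonth` fields and the stage names are assumptions for illustration, and the definitions (for example, what counts as a dataset "with lineage") would need to be agreed within your own organization.

```python
from dataclasses import dataclass


@dataclass
class PipelineMonth:
    """One month of inputs for a single pipeline (hypothetical shape)."""
    stage_costs: dict          # cost per stage, e.g. ingest/transform/store/query
    alerts: int                # alerts produced by this pipeline
    false_positives: int       # alerts closed as false positives
    datasets_total: int        # datasets the pipeline produces
    datasets_with_lineage: int # datasets with queryable lineage


def metrics(m):
    """Return the derived metrics; None where the denominator is zero."""
    total = sum(m.stage_costs.values())
    return {
        "total_cost": total,
        "cost_per_alert": total / m.alerts if m.alerts else None,
        "false_positive_rate": m.false_positives / m.alerts if m.alerts else None,
        "lineage_coverage": (
            m.datasets_with_lineage / m.datasets_total if m.datasets_total else None
        ),
    }
```

For example, a pipeline with stage costs of 4,000 (ingest), 1,500 (transform), 2,500 (store), and 2,000 (query) that produced 400 alerts costs 10,000 / 400 = 25 per alert.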
Your 90-Day CISO Checklist
- Inventory security-relevant data domains, retention mandates, and current egress costs.
- Classify which streams require hot centralization vs. federated access with on-demand pull.
- Stand up an open table format zone for high-value security datasets; prove two query engines working on the same data.
- Enforce agent IAM: dedicated identities, least-privilege scopes, rotation, and audit.
- Implement lineage capture in pipelines; make it searchable from the SOC.
- Pilot edge summarization for a noisy cloud log source; compare detection quality and cost.
- Plan immutable retention for evidentiary data with WORM/tape where appropriate.
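The classification step in the checklist can start as a simple, explicit rule that the team then argues about and tunes. The sketch below is one such starting point, not a recommendation: the `Stream` fields, the 30-day month, and the `max_hot_gb_per_detection` threshold of 50 are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """A security-relevant data stream from the inventory (hypothetical shape)."""
    name: str
    gb_per_day: float
    detections_per_month: int  # detections that actually used this stream
    residency_restricted: bool # must the data stay where it is?


def placement(stream, max_hot_gb_per_detection=50.0):
    """Return 'hot' for streams worth centralizing, else 'federated'."""
    if stream.residency_restricted:
        return "federated"  # copying would multiply residency risk
    if stream.detections_per_month == 0:
        return "federated"  # no detections depend on it; pull on demand
    monthly_gb = stream.gb_per_day * 30
    if monthly_gb / stream.detections_per_month <= max_hot_gb_per_detection:
        return "hot"
    return "federated"
```

Under these assumptions, a 5 GB/day identity-provider log that feeds 40 detections a month comes out "hot", while a 2,000 GB/day network flow log that feeds 10 detections comes out "federated" and becomes a candidate for edge summarization.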
Our Angle
The winning posture is not to “collect it all in one spot” but to collect what counts from the source systems of record, govern the rest, and make the signal cheap to find. A federated, open, identity-first security data platform turns sprawling telemetry into operational clarity, at a cost the business can live with.