The Next Evolution of Platform Engineering
I have been building platform teams for a while now, long enough to have lived through the DevOps-to-Platform-Engineering transition firsthand. That pattern is repeating itself, and this time the shift is bigger.
Platform Engineering is splitting. A new discipline is emerging that I am calling AI Platform Engineering, and if you work in a regulated industry like healthcare, you cannot afford to ignore it.
# We have seen this movie before
Remember when “DevOps” was the answer to everything? Every ops engineer became a DevOps engineer overnight. The title changed but the work stayed the same for a while. Then slowly, the real shift happened. Teams started building internal developer platforms. Golden paths. Self-service infrastructure. The role genuinely evolved into something new.
That same pressure is building again. AI is not just another workload to deploy. It introduces an entirely different set of problems: model governance, prompt injection risks, data privacy boundaries, cost management for inference, and compliance frameworks that regulators are still writing in real time.
Your traditional platform team is not equipped to handle all of this. They should not have to be.
*Platform, AI Platform, and Security today — overlapping but still largely separate.*
# What AI Platform Engineering actually looks like
An AI Platform team is not a machine learning team. They are not training models. They are building the guardrails, pipelines, and self-service tooling that lets every other engineering team adopt AI safely and quickly.
In practice, this means owning things like:
- LLM Gateways — A centralized proxy layer that handles authentication, rate limiting, cost tracking, and content filtering for every model call in the org. Engineers should not be copy-pasting API keys into random services.
- Prompt management and versioning — Treating prompts like infrastructure. Version controlled, tested, reviewed. Not scattered across application code.
- Guardrails as code — Input and output validation that is defined centrally and enforced automatically. In healthcare, this is not optional. A model hallucinating medical advice is a compliance nightmare.
- Model routing and fallback — Abstracting away which model is being called so teams can swap providers, manage costs, and handle outages without code changes.
- Audit and observability — Every model interaction logged, traceable, and queryable. When a regulator asks what your AI told a patient, you need an answer in minutes, not weeks.
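To make "guardrails as code" concrete, here is a minimal sketch. Everything in it is hypothetical: `GuardrailPolicy` is not a real library, and the two PHI patterns are toys; real PHI detection needs a proper engine, not a short regex list.

```python
import re
from dataclasses import dataclass, field

# Toy PHI patterns, purely illustrative. Real detection needs far
# broader coverage than a couple of regexes.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US SSN-shaped numbers
    re.compile(r"\bMRN[:\s]*\d{6,}\b", re.I),   # medical record numbers
]

@dataclass
class GuardrailPolicy:
    """A centrally defined, version-controlled policy applied to every call."""
    name: str
    block_phi_input: bool = True
    blocked_output_phrases: list[str] = field(default_factory=list)

    def check_input(self, text: str) -> bool:
        """True if the prompt is allowed through to the model."""
        if self.block_phi_input and any(p.search(text) for p in PHI_PATTERNS):
            return False
        return True

    def check_output(self, text: str) -> bool:
        """True if the model response is allowed back to the caller."""
        lowered = text.lower()
        return not any(phrase in lowered for phrase in self.blocked_output_phrases)

# A strict policy for patient-facing traffic, defined once, reviewed like code.
clinical = GuardrailPolicy(
    name="clinical-strict",
    blocked_output_phrases=["you should stop taking"],  # crude dosing-advice filter
)
```

Because the policy is plain code, it gets the same review, tests, and version history as any other piece of infrastructure.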
None of this is theoretical. These are real problems I am seeing teams try to solve with duct tape and hope right now.
# The LLM Gateway is the foundation
If you only build one thing, build the gateway. A centralized LLM gateway solves most of the problems on that list in a single layer. It is the same pattern we used with API gateways a decade ago, applied to model access.
Here is what the flow looks like. An engineer’s application makes a request to the gateway, not directly to OpenAI or Anthropic. The gateway handles everything in between.
The gateway intercepts every request and response. On the way in, it authenticates the caller, checks rate limits, scans the input for PII or PHI, and applies content policies. It decides which model to route to based on the use case, cost, or availability. On the way out, it filters the response, logs everything, and tracks token usage by team.
One gateway config handles what would otherwise be dozens of scattered implementation decisions across every team.
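That request/response flow can be sketched in-process to show the moving parts. All names here (`Gateway`, `call_model`, the token and route tables) are invented for the example; a real gateway is a standalone network proxy, not application code.

```python
import time
import uuid

def call_model(provider: str, prompt: str) -> str:
    # Stand-in for a real provider SDK call.
    if provider == "down-llm":
        raise ConnectionError("provider unavailable")
    return f"[{provider}] ok"

class Gateway:
    def __init__(self, route_table, valid_tokens, rate_limiter, scanner):
        self.route_table = route_table      # team -> ordered list of providers
        self.valid_tokens = valid_tokens    # token -> team
        self.rate_limiter = rate_limiter    # callable(team) -> bool
        self.scanner = scanner              # callable(text) -> bool, True = clean
        self.audit_log = []                 # append-only, in memory for the sketch

    def handle(self, api_token: str, prompt: str) -> str:
        # On the way in: authenticate, rate-limit, scan the input.
        team = self.valid_tokens.get(api_token)
        if team is None:
            raise PermissionError("unknown token")
        if not self.rate_limiter(team):
            raise RuntimeError("rate limit exceeded")
        if not self.scanner(prompt):
            raise ValueError("input rejected: possible PHI")
        # Route: try providers in order, falling back on failure.
        for provider in self.route_table[team]:
            try:
                response = call_model(provider, prompt)
                break
            except ConnectionError:
                continue
        else:
            raise RuntimeError("all providers failed")
        # On the way out: log everything before returning.
        self.audit_log.append({
            "id": str(uuid.uuid4()), "ts": time.time(), "team": team,
            "provider": provider, "prompt": prompt, "response": response,
        })
        return response

gw = Gateway(
    route_table={"clinical": ["down-llm", "backup-llm"]},
    valid_tokens={"tok-123": "clinical"},
    rate_limiter=lambda team: True,
    scanner=lambda text: "123-45-6789" not in text,
)
```

Note how the caller never names a provider: when `down-llm` fails, the request quietly falls over to `backup-llm`, and the audit entry records which model actually answered.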
The good news is that the tooling is maturing fast. There are open source LLM gateways that run natively in Kubernetes, implement the Kubernetes Gateway API, and handle not just agent-to-LLM traffic but also agent-to-tool (via MCP) and agent-to-agent (via A2A) communication. The best ones let you define routes so different teams hit the same gateway but get different policies applied automatically. Your clinical team’s traffic can be routed through a PHI guardrail before it ever reaches the model while your internal tooling team gets a lighter policy. API keys live in Kubernetes secrets, not in application code. Every interaction is traced via OpenTelemetry. If one provider goes down, traffic fails over to another.
One gateway gives you authentication, model routing with fallback, content guardrails, and a full audit trail. Nobody is managing their own API keys. Nobody is building their own retry logic. Nobody is accidentally sending PHI to a third-party model without filtering.
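The per-team usage tracking mentioned earlier reduces to a small ledger behind the gateway. This is a hypothetical sketch; the model names and per-1K-token prices are invented.

```python
from collections import defaultdict

# Invented per-1K-token prices, purely for illustration.
PRICE_PER_1K = {"primary-llm": 0.01, "backup-llm": 0.002}

class CostTracker:
    def __init__(self):
        self.tokens_by_team = defaultdict(int)
        self.spend_by_team = defaultdict(float)

    def record(self, team: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> None:
        """Accumulate token counts and estimated spend per team."""
        total = prompt_tokens + completion_tokens
        self.tokens_by_team[team] += total
        self.spend_by_team[team] += total / 1000 * PRICE_PER_1K[model]

    def report(self) -> dict:
        """Monthly answer to 'what did each team spend on inference?'."""
        return {t: (self.tokens_by_team[t], round(self.spend_by_team[t], 4))
                for t in self.tokens_by_team}

tracker = CostTracker()
tracker.record("clinical", "primary-llm", 800, 200)
tracker.record("tooling", "backup-llm", 5000, 1000)
```

Because every call already passes through the gateway, this accounting comes for free; no team has to instrument anything.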
# Why regulated industries cannot wait
Earlier this year, Chipotle’s customer service AI agent got jailbroken. Someone nudged it off task and had it answering coding questions instead of helping with burrito orders. For Chipotle, it was a minor embarrassment and some wasted API tokens. They patched it and moved on.
Now imagine that same failure in healthcare. A patient-facing AI agent with no guardrails gets manipulated into generating medical advice it was never designed to give. That is not a funny LinkedIn post. That is a HIPAA violation, a liability event, and a potential patient safety issue.
This is exactly why AI Platform Engineering matters most in regulated environments. The guardrails cannot be an afterthought bolted on after the product team ships something. They need to be baked into the platform from day one. Chipotle learned that lesson over a PR blip. In healthcare, the lesson costs infinitely more.
The teams that figure this out early gain a massive advantage. They move faster because their engineers are not scared to use AI. They have clear boundaries. They know what is allowed, what is logged, and what is blocked. That confidence unlocks velocity.
The teams that wait end up with shadow AI. Engineers using personal API keys. Prompts with PHI flowing through third-party services nobody vetted. It is the same shadow IT problem we dealt with a decade ago, just with higher stakes.
# The convergence
*By 2027, these disciplines will be inseparable.*

I believe these circles will have pulled significantly closer together by then.
Security teams will need to understand AI-specific threat models. AI Platform teams will need deep infrastructure and networking knowledge. Traditional platform engineers will need to understand model serving, GPU scheduling, and inference optimization.
The boundaries blur. The best teams will be the ones that recognize this convergence early and start cross-training now.
# What to do about it
If you are leading a platform team today, start by acknowledging that AI infrastructure is a distinct problem space. Do not just hand it to your existing platform engineers and hope for the best.
Start small. Stand up an LLM gateway. Centralize your API keys. Build an audit trail. Get your security team in the room early. These are not massive investments, but they set the foundation.
And if you are in healthcare or another regulated space, treat this as urgent. The compliance landscape for AI is forming right now. The organizations that build the right patterns today will be the ones writing the playbook everyone else follows.
# Where does your team stand?
Run through these. Be honest.
- Do you have a centralized way to manage LLM API keys, or are they scattered across teams and services?
- Can you tell a regulator exactly what your AI said to a user last Tuesday at 2pm?
- Do you have input/output guardrails that are enforced automatically, or is each team rolling their own?
- If your primary model provider goes down, can your applications fail over without a code change?
- Do you know how much you spent on inference last month, broken down by team?
- Has your security team reviewed your AI attack surface — prompt injection, data exfiltration, model abuse?
- Can a new engineer on your team ship an AI-powered feature without copy-pasting API keys or writing their own retry logic?
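The second question is usually the hardest one. With a structured audit log behind the gateway it becomes a simple filter rather than a forensic project; the log schema below is invented for illustration.

```python
from datetime import datetime, timezone

# A couple of invented audit entries, as a gateway might have recorded them.
audit_log = [
    {"ts": datetime(2025, 6, 3, 14, 2, tzinfo=timezone.utc),
     "team": "clinical", "user": "u-1042",
     "prompt": "Can I double my dose?",
     "response": "Please contact your care team; I cannot give dosing advice."},
    {"ts": datetime(2025, 6, 3, 16, 45, tzinfo=timezone.utc),
     "team": "tooling", "user": "u-2001",
     "prompt": "Draft a changelog entry",
     "response": "Here is a draft changelog entry."},
]

def interactions_between(log, start, end):
    """Every recorded model interaction in [start, end)."""
    return [entry for entry in log if start <= entry["ts"] < end]

# "What did our AI tell users last Tuesday between 2pm and 3pm UTC?"
hits = interactions_between(
    audit_log,
    datetime(2025, 6, 3, 14, 0, tzinfo=timezone.utc),
    datetime(2025, 6, 3, 15, 0, tzinfo=timezone.utc),
)
```

In production this would be a query against a log store, not a list comprehension, but the point stands: the answer takes minutes only if the data was captured in the first place.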
If you answered no to more than two of these, you do not have an AI platform. You have a collection of individual experiments. That is fine — most teams are there right now. But the gap between “experimenting” and “governing” is where the risk lives.
The DevOps-to-Platform-Engineering shift took years to fully materialize. This one is going to move faster. AI adoption is accelerating too quickly for it not to.
The question is not whether AI Platform Engineering becomes a real discipline. It is whether your team is ready when it does.