Agent Mag, May 10, 2026
Two practical reads: OpenAI's Codex safety post as a control checklist, and SemiAnalysis' InferenceX as a benchmark to evaluate against your own agent workloads.
1. Codex safety controls are infrastructure controls
OpenAI described how it operates Codex with controls around execution, approvals, networking, and observability.
The useful takeaway is not to copy OpenAI's stack. It is to compare your own coding-agent deployment against the control categories. Coding agents can interact with source code, credentials, CI systems, and production workflows, so the risk model is partly infrastructure security, not only model behavior.
Practical checks:
- Run coding agents in isolated environments.
- Require approval for risky or externally visible actions.
- Restrict network access rather than relying on agent behavior.
- Capture telemetry that security and compliance teams can inspect.
- Treat rollout and permissions as operational controls.
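The approval check above can be sketched as a small policy gate. This is a minimal illustration under assumed names (`RISKY_KINDS`, `requires_approval`, `run_action` are hypothetical, not from OpenAI's post): the point is that the gate lives in infrastructure code, not in the agent's prompt.

```python
# Hypothetical approval gate for agent actions; names and categories are
# illustrative assumptions, not OpenAI's actual implementation.

RISKY_KINDS = {"shell", "network", "git_push", "file_write_outside_repo"}

def requires_approval(action_kind: str, target: str) -> bool:
    """Flag actions that should pause for human review."""
    if action_kind in RISKY_KINDS:
        return True
    # Externally visible targets always need review in this sketch.
    return target.startswith("prod:")

def run_action(action_kind, target, execute, approve):
    """Run an agent action, pausing for approval when the policy says so."""
    if requires_approval(action_kind, target) and not approve(action_kind, target):
        return {"status": "blocked", "reason": "approval denied"}
    return {"status": "ok", "result": execute()}
```

A gate like this is easy to log, which also covers the telemetry check: every blocked or approved action becomes an auditable event.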
Caveat: this is OpenAI describing its own operating model. The post does not provide incident data, red-team results, detailed failure modes, or a clear view of which controls require OpenAI-scale infrastructure. Smaller teams should adapt the categories to their own threat model and constraints.
2. InferenceX is a useful signal, not a buying decision
SemiAnalysis presents InferenceX as an open-source, continuously updated inference benchmark. According to its announcement, it includes initial inference numbers across NVIDIA and AMD GPUs, covering token throughput, performance per dollar, and tokens per megawatt.
This is relevant to agent teams because multi-step workflows can multiply model calls. Latency and inference cost can become product constraints when agents loop over tools, retries, long context, and parallel subtasks.
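The multiplication effect is easy to make concrete with back-of-envelope arithmetic. All numbers below are illustrative assumptions, not measurements from InferenceX:

```python
# Back-of-envelope sketch: how agent loops multiply inference cost and latency.
# Every parameter here is an assumed input, not a benchmark result.

def agent_task_estimate(steps, calls_per_step, retry_rate,
                        latency_s_per_call, cost_usd_per_call):
    """Estimate total calls, sequential latency, and cost for one agent task."""
    calls = steps * calls_per_step * (1 + retry_rate)
    return {
        "calls": calls,
        "latency_s": calls * latency_s_per_call,  # sequential worst case
        "cost_usd": calls * cost_usd_per_call,
    }

# A 10-step workflow with 2 calls per step and a 20% retry rate:
est = agent_task_estimate(steps=10, calls_per_step=2, retry_rate=0.2,
                          latency_s_per_call=3.0, cost_usd_per_call=0.01)
```

Under these assumptions a single task is 24 model calls, over a minute of sequential latency, and 24x the cost of one call, which is why per-call throughput and price matter more for agents than for chat.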
Use the benchmark as a starting point. Before changing hardware or serving plans, check:
- Which models and precisions were tested.
- Which kernels, drivers, and serving stacks were used.
- Whether results are reproducible.
- How often the benchmark updates.
- Whether token throughput maps to your latency budget.
- Whether the workload resembles your traffic pattern.
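The last two checks can be turned into a quick screening calculation. This sketch maps a benchmark's decode throughput onto a per-response latency budget; the function name and all parameters are assumptions you supply, and it deliberately ignores batching and queueing effects that real serving stacks add:

```python
# Sketch: translate decode token throughput into a per-response latency check.
# Parameters are user-supplied assumptions, not InferenceX outputs.

def fits_latency_budget(decode_tokens_per_s: float,
                        output_tokens: int,
                        ttft_s: float,
                        budget_s: float) -> bool:
    """Rough per-response latency: time-to-first-token plus decode time.

    Ignores batching and queueing, which push real latency higher.
    """
    latency_s = ttft_s + output_tokens / decode_tokens_per_s
    return latency_s <= budget_s

# E.g., 80 tok/s decode, 500-token answers, 0.5 s TTFT, 5 s budget:
ok = fits_latency_budget(80.0, 500, 0.5, 5.0)
```

A configuration that tops a throughput leaderboard can still fail this kind of check, which is the practical difference between a benchmark ranking and your latency budget.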
The source material does not answer all of these questions, so treat InferenceX as a prompt for evaluation, not a final verdict.
What to watch
- Whether OpenAI or others publish evidence on control performance in incidents, red-team exercises, or misuse cases.
- How smaller teams implement sandboxing, approvals, network restrictions, and telemetry without large internal platforms.
- Whether InferenceX publishes enough methodology and governance detail for teams to reproduce and trust results.
- How benchmark rankings change as kernels, drivers, and serving systems improve.