
Thursday, March 5, 2026

Show HN: OmoiOS – 190K lines of Python to stop babysitting AI agents (Apache 2.0) https://ift.tt/5tfqPKU

AI coding agents generate decent code. The problem is everything around the code: checking progress, catching drift, deciding whether it's actually done. I spent months trying to make autonomous agents work, and the bottleneck was always me.

- Attempt 1, Claude/GPT directly: works for small stuff, but you re-explain context endlessly.
- Attempt 2, Copilot/Cursor: great autocomplete, but you're still doing 95% of the thinking.
- Attempt 3, continuous agents: they keep working without prompting, but "no errors" doesn't mean "feature works."
- Attempt 4, parallel agents: faster wall-clock time, but now you're manually reviewing even more output.

The common failure: nobody verifies whether the output actually satisfies the goal, so somebody has to, and that somebody was always me. So I automated that job.

OmoiOS is a spec-driven orchestration system. You describe a feature, and it:

1. Runs a multi-phase spec pipeline (Explore > Requirements > Design > Tasks) with LLM evaluators scoring each phase: retry on failure, advance on pass. By the time agents write code, the requirements have machine-checkable acceptance criteria.
2. Spawns isolated cloud sandboxes per task. Your local environment is untouched; agents get ephemeral containers with full git access.
3. Validates continuously: a separate validator agent checks each task against its acceptance criteria, and failures feed back for retry. No human in the loop between steps.
4. Discovers new work: validation can spawn new tasks when agents find missing edge cases, so the task graph grows as agents learn.

What's hard (being honest):

- Spec quality is the bottleneck. A vague spec means agents spinning.
- Validation is domain-specific. API correctness is easy; UI quality is not.
- Discovery branching can grow the task graph unexpectedly.
- Sandbox overhead adds latency per task. Worth it, but a tradeoff.
- Merging parallel branches with real conflicts is the hardest problem.
- Guardian monitoring (per-agent trajectory analysis) still has rough edges.

Stack: Python/FastAPI, PostgreSQL + pgvector, Redis (~190K lines of Python); Next.js 15 + React Flow (~83K lines of TypeScript); Claude Agent SDK + Daytona Cloud. 686 commits since Nov 2025, built solo. Apache 2.0.

I keep coming back to the same problem: structured spec generation that produces genuinely machine-checkable acceptance criteria. Has anyone found an approach that works for non-trivial features, or is this just fundamentally hard?

GitHub: https://ift.tt/d356S9K
Live: https://omoios.dev
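The evaluator-gated spec pipeline (step 1 above) can be sketched roughly like this. All names, thresholds, and retry counts here are illustrative assumptions, not the actual OmoiOS API:

```python
# Sketch of an evaluator-gated phase pipeline: each phase produces an
# artifact, an evaluator scores it, a failing score triggers a retry,
# and a passing score advances to the next phase. Hypothetical names.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    run: Callable[[str], str]          # produce this phase's artifact
    evaluate: Callable[[str], float]   # LLM evaluator, stubbed as a score in [0, 1]
    threshold: float = 0.8             # assumed pass bar
    max_retries: int = 3

def run_pipeline(phases: list[Phase], feature: str) -> str:
    artifact = feature
    for phase in phases:
        for _attempt in range(phase.max_retries):
            candidate = phase.run(artifact)
            if phase.evaluate(candidate) >= phase.threshold:
                artifact = candidate   # pass: advance with the new artifact
                break
        else:
            raise RuntimeError(
                f"phase {phase.name!r} failed after {phase.max_retries} attempts"
            )
    return artifact
```

With deterministic stubs in place of the LLM calls, a chain like Requirements > Design runs each phase until its evaluator passes, then hands the artifact forward.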
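Steps 3 and 4 (continuous validation plus discovery) amount to a worklist loop over the task graph. A minimal sketch under assumed names; `orchestrate`, `execute`, `validate`, and `discover` are hypothetical stand-ins, not the real internals:

```python
# Worklist sketch: each task runs, a separate validator checks it against
# its acceptance criteria, failures requeue for retry, and validation of
# a passing task may surface new edge-case tasks that grow the graph.
from collections import deque
from typing import Callable, Iterable

def orchestrate(
    tasks: Iterable[str],
    execute: Callable[[str], str],            # runs the task in an isolated sandbox
    validate: Callable[[str, str], bool],     # separate validator agent
    discover: Callable[[str, str], list[str]],# new tasks found during validation
    max_rounds: int = 100,                    # assumed cap on loop iterations
) -> list[str]:
    queue = deque(tasks)
    done: list[str] = []
    for _ in range(max_rounds):
        if not queue:
            break
        task = queue.popleft()
        output = execute(task)
        if validate(task, output):
            done.append(task)
            queue.extend(discover(task, output))  # task graph grows here
        else:
            queue.append(task)                    # failure feeds back for retry
    return done
```

The `max_rounds` cap is one blunt way to bound the "discovery branching can grow the task graph unexpectedly" problem mentioned above.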
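On the closing question: one direction is to emit each acceptance criterion as structured data a validator can execute, rather than prose it must interpret. A hypothetical sketch, assuming a tiny criterion schema of my own invention (not the OmoiOS one):

```python
# Criteria as data: each entry names a check kind plus the parameters a
# validator needs to run it mechanically. The schema here is invented
# purely for illustration.
criteria = [
    {"id": "AC-1", "kind": "http", "path": "/health", "expect_status": 200},
    {"id": "AC-2", "kind": "predicate", "expr": "len(result['users']) > 0"},
]

def check(criterion: dict, result: dict) -> bool:
    """Return True if one criterion holds for a recorded task result."""
    if criterion["kind"] == "http":
        resp = result["responses"][criterion["path"]]
        return resp["status"] == criterion["expect_status"]
    if criterion["kind"] == "predicate":
        # eval of a spec-authored expression; fine for a sketch, but a real
        # system would want a restricted expression language instead.
        return bool(eval(criterion["expr"], {}, {"result": result}))
    return False
```

The hard part the post names remains: getting an LLM to generate criteria in this shape that genuinely pin down "done" for non-trivial features, instead of vague prose restated as data.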
