E2E Testing with Ralph Loop
End-to-end testing for the InfraOps pipeline using the autonomous Ralph Loop pattern.
What Is Ralph Loop?
Ralph Loop is a self-correcting E2E evaluation workflow that runs all InfraOps pipeline steps without human gates. It validates the entire agent pipeline autonomously, from requirements through deployment to documentation, with built-in self-correction, challenger reviews, and benchmark scoring.
Key characteristics:
- Autonomous: all gates auto-approve after validation passes
- Self-correcting: validation failures feed findings back to the agent for retry (max 5 per step)
- IaC-agnostic: supports both Bicep and Terraform tracks
- Dry-run only: never deploys real Azure resources (uses bicep what-if or terraform plan)
- Benchmarked: produces an 8-dimension quality score (0–100)
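The self-correcting behaviour in the list above can be pictured as a bounded retry loop. The following is a minimal Python sketch, not the project's implementation: `run_step_with_retries`, `agent`, and `validate` are hypothetical names, and only the limit of 5 attempts per step comes from the text.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 5  # "max 5 per step" from the characteristics list


def run_step_with_retries(
    agent: Callable[[List[str]], str],
    validate: Callable[[str], List[str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int, str]:
    """Run one step, feeding validation findings back to the agent.

    agent(findings) returns an artifact; validate(artifact) returns a list
    of findings, empty when the artifact passes. Both are hypothetical.
    Returns (passed, attempts_used, last_artifact).
    """
    findings: List[str] = []
    artifact = ""
    for attempt in range(1, max_attempts + 1):
        artifact = agent(findings)      # retry sees the previous findings
        findings = validate(artifact)
        if not findings:
            return True, attempt, artifact  # gate auto-approves
    return False, max_attempts, artifact


if __name__ == "__main__":
    # Toy agent that only succeeds after it has seen at least one finding.
    def toy_agent(findings):
        return "fixed" if findings else "draft"

    def toy_validate(artifact):
        return [] if artifact == "fixed" else ["missing H2 section"]

    print(run_step_with_retries(toy_agent, toy_validate))
```

Passing the findings into the next attempt, rather than simply re-running the step, is what distinguishes self-correction from a plain retry.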
Supported IaC Tracks
| Track | Code Directory | Entry File | Validation Commands |
|---|---|---|---|
| Bicep | infra/bicep/{project}/ | main.bicep | bicep build, bicep lint |
| Terraform | infra/terraform/{project}/ | main.tf | terraform validate, terraform fmt -check |
The IaC tool is read from decisions.iac_tool in 00-session-state.json.
To switch tracks, change the IaC tool field in the prompt file’s Project Context section.
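As an illustration of the routing described above, the sketch below looks up the track settings for the tool named in session state. It is a hedged example in Python, not the validator's actual code: `resolve_track` and `TRACKS` are hypothetical names, and it assumes `decisions.iac_tool` is a nested JSON key. The directory, entry file, and command values are copied from the table.

```python
import json

# Values copied from the Supported IaC Tracks table.
TRACKS = {
    "bicep": {
        "code_dir": "infra/bicep/{project}/",
        "entry_file": "main.bicep",
        "validation": ["bicep build", "bicep lint"],
    },
    "terraform": {
        "code_dir": "infra/terraform/{project}/",
        "entry_file": "main.tf",
        "validation": ["terraform validate", "terraform fmt -check"],
    },
}


def resolve_track(session_state_json: str, project: str) -> dict:
    """Return the track settings for the iac_tool named in session state.

    Assumes the layout {"decisions": {"iac_tool": "Bicep" | "Terraform"}}.
    """
    state = json.loads(session_state_json)
    tool = str(state.get("decisions", {}).get("iac_tool", "")).strip().lower()
    if tool not in TRACKS:
        raise ValueError(f"Unsupported or missing iac_tool: {tool!r}")
    track = TRACKS[tool]
    return {
        "iac_tool": tool,
        "code_dir": track["code_dir"].format(project=project),
        "entry_file": track["entry_file"],
        "validation": list(track["validation"]),
    }


if __name__ == "__main__":
    example = '{"decisions": {"iac_tool": "Terraform"}}'
    print(resolve_track(example, "terraform-e2e"))
```

Raising on an unknown or missing value makes a misconfigured session state visible early instead of silently skipping code validation.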
How to Run
Quick Validation (Structural)
Validates all E2E artifacts for structural compliance (fast, no agent invocation):

    # Default project
    npm run e2e:validate

    # Specific project
    node scripts/validate-e2e-step.mjs --project=terraform-e2e all

    # Single step
    node scripts/validate-e2e-step.mjs --project=e2e-ralph-loop 5

Benchmark Scoring
Runs the 8-dimension benchmark and generates a report:

    # Default project (e2e-ralph-loop)
    npm run e2e:benchmark

    # Terraform project
    npm run e2e:benchmark -- terraform-e2e

    # Multi-project comparison
    npm run e2e:benchmark -- --compare

Full Ralph Loop Execution
Open VS Code Chat and use one of the prompt files:
- Simple project (pre-seeded): Open .github/prompts/e2e-ralph-loop.prompt.md
- Complex project (RFP-driven): Open .github/prompts/e2e-contoso-rfp.prompt.md
- Post-loop analysis: Open .github/prompts/e2e-analyze-lessons.prompt.md
The E2E Conductor agent (.github/agents/e2e-conductor.agent.md) orchestrates
the loop with conditional IaC routing based on session state.
Benchmark Scoring Dimensions
The benchmark engine scores each run across 8 dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Artifact completeness | 20% | All required step outputs exist |
| Structural compliance | 15% | Artifact template format, H2 sync, session state |
| Code quality | 20% | Bicep build/lint or Terraform validate/fmt + AVM usage |
| Review thoroughness | 10% | Challenger review passes executed per step |
| WAF coverage | 10% | All 5 Well-Architected pillars in architecture |
| Cost accuracy | 5% | Budget stated + cost estimate exists |
| Session state integrity | 10% | Schema version, project, decisions, decision_log, step completion |
| Timing performance | 10% | Duration within thresholds (3 min normal, 10 min codegen) |
Composite score = weighted average. Grades: A (90–100), B (80–89), C (70–79), D (60–69), F (<60).
Pass threshold: 60/100 (configurable via E2E_PASS_THRESHOLD environment variable).
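To make the arithmetic concrete, here is a small Python sketch of the weighted average, the grade bands, and the threshold check. It is an illustration under assumptions rather than the benchmark engine's code: the dimension key names and function names are hypothetical, each dimension is assumed to be scored 0 to 100, and fractional scores are assumed to fall into the band whose lower bound they meet. The weights, grade bands, and `E2E_PASS_THRESHOLD` come from the text.

```python
import os

# Weights (percent) copied from the dimensions table; they sum to 100.
WEIGHTS = {
    "artifact_completeness": 20,
    "structural_compliance": 15,
    "code_quality": 20,
    "review_thoroughness": 10,
    "waf_coverage": 10,
    "cost_accuracy": 5,
    "session_state_integrity": 10,
    "timing_performance": 10,
}


def composite_score(scores: dict) -> float:
    """Weighted average of per-dimension scores, each assumed to be 0-100."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"Missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


def grade(score: float) -> str:
    """Map a composite score to the letter grades A-F."""
    for letter, floor in (("A", 90), ("B", 80), ("C", 70), ("D", 60)):
        if score >= floor:
            return letter
    return "F"


def passes(score: float) -> bool:
    """Compare against E2E_PASS_THRESHOLD, defaulting to 60."""
    threshold = float(os.environ.get("E2E_PASS_THRESHOLD", "60"))
    return score >= threshold


if __name__ == "__main__":
    run = {name: 100 for name in WEIGHTS}
    run["code_quality"] = 0  # e.g. the IaC tool was not detected
    score = composite_score(run)
    print(score, grade(score), passes(score))
```

In this example a code quality score of 0 with full marks elsewhere gives 80.0, which is grade B and still above the default pass threshold, so a passing run can hide a dimension that scored zero.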
Interpreting Results
Benchmark Report
After running npm run e2e:benchmark, check:
- agent-output/{project}/08-benchmark-report.md: human-readable scorecard
- agent-output/{project}/08-benchmark-scores.json: machine-readable JSON
Lessons Learned
Self-correction events and systemic issues are captured in:
- agent-output/{project}/09-lessons-learned.json: structured findings
- agent-output/{project}/09-lessons-learned.md: narrative summary
Iteration Log
Per-step attempts are tracked in agent-output/{project}/08-iteration-log.json.
Decision Log
Cross-agent decisions are captured in the decision_log array inside 00-session-state.json.
Each entry records what was decided, why, what was rejected, and which agent made the call.
The benchmark scores decision_log presence as part of session state integrity.
See .github/instructions/decision-logging.instructions.md for the entry schema.
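For orientation only, the sketch below models a decision entry with the four pieces of information listed above and a simple presence check. The field names (`decision`, `rationale`, `rejected`, `agent`) and the rule that an empty list counts as absent are assumptions made for this example; the authoritative schema is the one in the instructions file.

```python
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class DecisionEntry:
    # Field names are illustrative, not the project's schema; see
    # .github/instructions/decision-logging.instructions.md for the real one.
    decision: str                                       # what was decided
    rationale: str                                      # why
    rejected: List[str] = field(default_factory=list)   # what was rejected
    agent: str = ""                                     # which agent made the call


def has_decision_log(session_state: dict) -> bool:
    """True when decision_log is a non-empty list (assumed presence rule)."""
    log = session_state.get("decision_log")
    return isinstance(log, list) and len(log) > 0


if __name__ == "__main__":
    entry = DecisionEntry(
        decision="Use Terraform",
        rationale="Hypothetical example rationale",
        rejected=["Bicep"],
        agent="example-agent",
    )
    state = {"decision_log": [asdict(entry)]}
    print(has_decision_log(state))
```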
Test Projects
| Project | IaC Tool | Complexity | Description |
|---|---|---|---|
| e2e-ralph-loop | Bicep | Simple | Nordic Fresh Foods Lite (canonical) |
| terraform-e2e | Terraform | Simple | Small ecommerce storefront |
| contoso-service-hub-run-1 | Bicep | Complex | Contoso Service Hub (RFP-driven) |
| contoso-service-hub-run-2 | Bicep | Complex | Contoso Service Hub (second run) |
Troubleshooting
Benchmark score is 0 for code quality
The IaC tool is detected from decisions.iac_tool in 00-session-state.json. Ensure it is set to
either Bicep or Terraform in the session state.
Terraform validation fails during E2E
Run terraform init -backend=false in the project directory first.
The validator runs this step automatically, but it can fail when network issues prevent provider or module downloads.
Pre-validation fails after agent return
The agent may have written to the wrong output directory. Check that the project
name in session state matches the --project flag.
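The check described above can be scripted. This is a minimal Python sketch, assuming the session state stores the name under a top-level `project` key; `project_mismatch` is a hypothetical helper, not part of the repository.

```python
from typing import Optional


def project_mismatch(session_state: dict, project_flag: str) -> Optional[str]:
    """Return a message when the names differ, else None.

    Assumes the session state keeps the name under a top-level "project" key.
    """
    recorded = session_state.get("project")
    if recorded != project_flag:
        return (
            f"session state project {recorded!r} does not match "
            f"--project={project_flag}"
        )
    return None


if __name__ == "__main__":
    state = {"project": "e2e-ralph-loop"}
    print(project_mismatch(state, "terraform-e2e"))
```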