MimironsGoldOMatic

Tier B / E2E pipeline — maintenance checklist

Workflow: .github/workflows/e2e-test.yml · Handover context: TIER_B_HANDOVER.md · Full guide: E2E_AUTOMATION_PLAN.md — E2E Pipeline Maintenance Guide


Verification of monitoring & alerting (operators)

These checks validate that monitoring and alerting remain functional after workflow edits.

Validation results (fill during execution)

Note: In this workspace, GitHub CLI (gh) is not available, and no authenticated GitHub API token is configured. As a result, the steps below were reviewed for correctness but not executed from this environment.

Check | Result | Evidence
Weekly health report manual dispatch | [!] Blocked (needs Actions UI) | Run URL: TBD
Consecutive failure alert dedupe | [!] Blocked (needs failed runs) | Issue URL: TBD
e2e-test.yml Summary timing table | [!] Blocked (needs Actions UI) | Run URL: TBD

A) Weekly health report (manual dispatch)

Workflow: .github/workflows/e2e-weekly-health-report.yml

  1. GitHub → Actions → workflow “E2E weekly health report”.
  2. Select Run workflow (manual workflow_dispatch).
  3. Open the run → confirm the Summary includes:
    • 30‑day rolling success/failure counts
    • Data sourced from the GitHub Actions REST API (see the script’s call to actions.listWorkflowRuns)
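
As a cross-check on the Summary numbers, the rolling counts can be recomputed by hand from the run list. The sketch below is not the workflow's script; it assumes each run is a dict carrying the `status`, `conclusion`, and `created_at` fields that the GitHub REST API returns for workflow runs, and the function name `rolling_counts` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def rolling_counts(runs, now, window_days=30):
    """Count conclusions of completed runs created within the last `window_days`."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for run in runs:
        # GitHub timestamps are ISO 8601 with a trailing "Z" (UTC).
        created = datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))
        if created >= cutoff and run.get("status") == "completed":
            counts[run.get("conclusion") or "unknown"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data, not real run history.
    sample = [
        {"status": "completed", "conclusion": "success", "created_at": "2024-06-29T10:00:00Z"},
        {"status": "completed", "conclusion": "failure", "created_at": "2024-06-20T10:00:00Z"},
        {"status": "completed", "conclusion": "success", "created_at": "2024-05-01T10:00:00Z"},
    ]
    now = datetime(2024, 6, 30, tzinfo=timezone.utc)
    result = rolling_counts(sample, now)
    print(f"success={result['success']} failure={result['failure']}")
```

Runs that are still in progress are left out here on the assumption that the report counts only completed runs; adjust if the workflow's script does otherwise.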

What “good” looks like

Record

B) Consecutive-failure alert (deduplicated issue)

Workflow: .github/workflows/e2e-consecutive-failure-alert.yml

This workflow should open exactly one GitHub issue when the two most recent completed runs of e2e-test.yml both concluded with failure. Runs that concluded as cancelled or skipped are ignored when picking those two runs.
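
The rule above can be sketched as a small decision function. This is an illustration only: how the real workflow detects an existing alert issue is not stated here, so matching on an open issue's title (`ALERT_TITLE`, a hypothetical value) is an assumption, as are the function names.

```python
# Conclusions that do not count as either a pass or a failure.
IGNORED_CONCLUSIONS = {"cancelled", "skipped"}

# Hypothetical title used to recognise an already-open alert issue.
ALERT_TITLE = "E2E: two consecutive failures on e2e-test.yml"


def last_two_failed(runs):
    """`runs` is ordered newest first; each item has `status` and `conclusion`."""
    considered = [
        r for r in runs
        if r.get("status") == "completed"
        and r.get("conclusion") not in IGNORED_CONCLUSIONS
    ]
    latest = considered[:2]
    return len(latest) == 2 and all(r["conclusion"] == "failure" for r in latest)


def should_open_issue(runs, open_issue_titles, alert_title=ALERT_TITLE):
    """Open a new issue only if the rule fires and no alert issue is already open."""
    return last_two_failed(runs) and alert_title not in open_issue_titles
```

With this shape, a third consecutive failure while the alert issue is still open returns False, which is the deduplication behaviour the check is meant to confirm.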

How to simulate two consecutive failures safely

Record

C) e2e-test.yml job Summary table (timing boundaries)

Workflow: .github/workflows/e2e-test.yml

  1. Open any recent run → job e2e-tier-a-b.
  2. Confirm the job Summary contains a table titled “E2E performance (this run)” with:
    • Total job wall-clock seconds
    • Tier B step boundary seconds (or explicitly n/a if timestamps missing)
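
For reference when reading the table, the following sketch shows one way such rows could be derived from start and end timestamps, printing n/a when a timestamp is missing. It is not taken from e2e-test.yml; the function names, the column headings, and the ISO 8601 timestamp format are assumptions.

```python
from datetime import datetime


def elapsed_seconds(start_iso, end_iso):
    """Return whole seconds between two ISO 8601 timestamps, or None if either is missing."""
    if not start_iso or not end_iso:
        return None
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return int((end - start).total_seconds())


def summary_table(job_start, job_end, tier_b_start, tier_b_end):
    """Build the Summary table as Markdown text (column names are hypothetical)."""
    rows = [
        ("Total job wall-clock seconds", elapsed_seconds(job_start, job_end)),
        ("Tier B step boundary seconds", elapsed_seconds(tier_b_start, tier_b_end)),
    ]
    lines = ["### E2E performance (this run)", "", "| Metric | Seconds |", "| --- | --- |"]
    for label, value in rows:
        # Missing timestamps are reported explicitly rather than as 0.
        lines.append(f"| {label} | {'n/a' if value is None else value} |")
    return "\n".join(lines)
```

Reporting n/a instead of 0 keeps a missing timestamp from being mistaken for an instantaneous step.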

Record

Weekly verification (≈15 min)


Monthly review (≈45 min)


Pipeline update checklist (before merging workflow changes)


Emergency response (failed E2E on main PR)

  1. Reproduce: Re-run failed job; confirm not transient runner flake.
  2. Artifacts: Pull e2e-service-logs; read mgm-backend.log, mgm-tier-b-orchestrator.log, mgm-mock-helix.log.
  3. Bisect: Recent PRs touching Backend HelixChatService, TwitchOptions, mocks, or run_e2e_tier_b.py.
  4. Contain: If main is blocked, consider revert; open issue with log excerpts (redact secrets — workflow uses inline test tokens only today).
  5. Alerting: If e2e-consecutive-failure-alert opened an issue, use it as the war-room thread.
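
For step 2, a quick first pass over the three logs can save time before reading them in full. The sketch below assumes the e2e-service-logs artifact has already been downloaded and extracted into a local directory; the log line format is not documented here, so the case-insensitive search for "error", "exception", or "traceback" is a guess, and `scan_logs` is a hypothetical helper.

```python
import re
from pathlib import Path

# Log file names as listed in the emergency checklist.
LOG_NAMES = ("mgm-backend.log", "mgm-tier-b-orchestrator.log", "mgm-mock-helix.log")

# Assumed markers of trouble; the real log format may call for a different pattern.
PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def scan_logs(artifact_dir, max_hits=20):
    """Map each log name to a list of (line_number, line) matches, or None if the file is absent."""
    results = {}
    for name in LOG_NAMES:
        path = Path(artifact_dir) / name
        if not path.is_file():
            results[name] = None
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        hits = [(i, line) for i, line in enumerate(lines, start=1) if PATTERN.search(line)]
        results[name] = hits[:max_hits]
    return results


if __name__ == "__main__":
    # "e2e-service-logs" is assumed to be the extraction directory.
    for log_name, hits in scan_logs("e2e-service-logs").items():
        if hits is None:
            print(f"{log_name}: missing")
            continue
        print(f"{log_name}: {len(hits)} matching line(s)")
        for number, text in hits:
            print(f"  {number}: {text}")
```

A missing log is reported separately from a log with no matches, since an absent file can itself point to a service that never started. Review excerpts for secrets before pasting them into an issue.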

Workflow | Purpose
e2e-test.yml | PR Tier A + B gate
e2e-weekly-health-report.yml | Scheduled / manual rolling stats
e2e-consecutive-failure-alert.yml | Opens issue when two consecutive e2e-test.yml runs fail