# WDMaker Team Operations Manual

**Purpose**: Guidelines for team operations, responsibilities, escalation, and hand-off procedures
**Audience**: Operations team, system administrators, DevOps engineers, project managers
**Scope**: Batch 001 completion through batch 010 and project closure

---

## Team Structure & Responsibilities

### Role 1: Operations Monitor (1 person)

**Responsibilities**:
- Monitor batch 001 progress every 30 minutes
- Track I-status, O-status, i-status metrics
- Alert if progress stalls (no change for 45+ minutes)
- Document status in shared log

**Time Commitment**: 0.5 hours (active execution phase)

**Key Commands**:
```bash
# Every 30 minutes, run this check
I=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)
echo "[$(date)] I-status: $I / 517"

# If stalled, escalate to Engineer
```
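The 30-minute check and the 45-minute stall rule can be combined into a small polling loop. This is a minimal sketch. It assumes `list-sites.sh` prints one site per line, as the `wc -l` in the check above implies. `is_stalled` and `monitor_batch` are hypothetical helper names, not existing project tools.

```bash
#!/usr/bin/env bash
# Sketch of the Monitor's polling loop (hypothetical helper functions).

STALL_THRESHOLD_MIN=45   # stall rule from this manual
CHECK_INTERVAL_SEC=1800  # 30-minute check cadence
BATCH_TOTAL=517          # sites expected to reach I-status in batch 001

# is_stalled LAST_CHANGE_EPOCH NOW_EPOCH
# Succeeds when I-status has not changed for STALL_THRESHOLD_MIN or more minutes.
is_stalled() {
  local idle_min=$(( ($2 - $1) / 60 ))
  [ "$idle_min" -ge "$STALL_THRESHOLD_MIN" ]
}

# monitor_batch: poll batch 001 until every site is at I-status.
monitor_batch() {
  local prev=-1 count now last_change
  last_change=$(date +%s)
  while :; do
    # tr strips the padding that some wc implementations add
    count=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l | tr -d ' ')
    now=$(date +%s)
    if [ "$count" -ne "$prev" ]; then
      prev=$count
      last_change=$now
    fi
    echo "[$(date '+%H:%M')] I-status: $count / $BATCH_TOTAL"
    if [ "$count" -ge "$BATCH_TOTAL" ]; then
      echo "All $BATCH_TOTAL sites at I-status: alert the Execution Lead"
      return 0
    fi
    if is_stalled "$last_change" "$now"; then
      echo "STALL: no I-status change for ${STALL_THRESHOLD_MIN}+ min: escalate to Engineer"
    fi
    sleep "$CHECK_INTERVAL_SEC"
  done
}
```

Because the loop only samples every 30 minutes, idle time is measured from the check that first observed the last change. The 45-minute rule therefore fires at the second consecutive unchanged check, 60 minutes after that observation. Shorten `CHECK_INTERVAL_SEC` if a tighter detection window matters.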

**Success Criteria**:
- ✅ I-status increases steadily (1-2 sites/minute)
- ✅ No O-status sites stuck for 45+ min
- ✅ Registry updating continuously

### Role 2: System Engineer (1 person)

**Responsibilities**:
- Troubleshoot issues identified by Monitor
- Adjust system resources if needed
- Execute recovery procedures
- Verify system health

**Time Commitment**: 0.5 hours active, 2 hours on-call

**Key Commands**:
```bash
# Diagnose stalled progress
tools/check/status-report.sh
df -h .smbatcher/
free -h

# Execute recovery if needed
tools/implement/mimplement-bg.sh --batch 001 --max-agents 25
```
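The `df -h` check above is easier to act on with an explicit pass/fail threshold. This is a minimal sketch that assumes a hypothetical 90% limit, because this manual does not define one. It uses `df -P` so each filesystem stays on one line. `disk_use_pct` and `check_disk` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: turn the Engineer's disk-space check into a pass/fail test.

DISK_LIMIT_PCT=90   # hypothetical threshold, not defined by this manual

# disk_use_pct PATH: print the percentage used on the filesystem holding PATH.
disk_use_pct() {
  # -P gives POSIX one-line-per-filesystem output; field 5 is "NN%"
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH: succeed if usage is below DISK_LIMIT_PCT, warn otherwise.
check_disk() {
  local used
  used=$(disk_use_pct "$1")
  [ -n "$used" ] || { echo "cannot read usage for $1" >&2; return 1; }
  if [ "$used" -ge "$DISK_LIMIT_PCT" ]; then
    echo "WARN: $1 is ${used}% full (limit ${DISK_LIMIT_PCT}%)" >&2
    return 1
  fi
  echo "OK: $1 is ${used}% full"
}

# Usage: check_disk .smbatcher/ || echo "free space before recovery"
```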

**Success Criteria**:
- ✅ Issues diagnosed within 15 minutes
- ✅ Recovery procedures effective
- ✅ System returned to healthy state

### Role 3: Execution Lead (1 person)

**Responsibilities**:
- Execute finalization commands when ready
- Coordinate batch 010 processing
- Verify batch completions
- Document completion status

**Time Commitment**: 0.25 hours active, 0.5 hours on-call

**Key Commands**:
```bash
# When I-status = 517
tools/implement/finish.sh --batch 001 --root .

# After batch 001 is finalized and batch 010 design and implementation are complete
tools/implement/finish.sh --batch 010 --root .
```
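Running `finish.sh` only when its precondition holds avoids premature finalization. This is a minimal sketch. It assumes `list-sites.sh` also accepts `--status "O"` and `--status "i"`, although only `"I"` appears in this manual. It treats "ready" as 517 sites at I with none at O or i, which matches the 18:30 entry in the Communication Plan example. `preflight_ok`, `count_status`, and `finalize_batch_001` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: gate batch 001 finalization on the site counts.

# preflight_ok N_I N_O N_i TOTAL: succeed only when every site is at I-status.
preflight_ok() {
  [ "$1" -eq "$4" ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ]
}

# count_status BATCH STATUS: number of sites in BATCH with STATUS.
# The O and i status values for --status are an assumption.
count_status() {
  tools/shared/list-sites.sh --batch "$1" --status "$2" | wc -l | tr -d ' '
}

finalize_batch_001() {
  local n_I n_O n_i
  n_I=$(count_status 001 I)
  n_O=$(count_status 001 O)
  n_i=$(count_status 001 i)
  if preflight_ok "$n_I" "$n_O" "$n_i" 517; then
    tools/implement/finish.sh --batch 001 --root .
  else
    echo "Not ready: I=$n_I O=$n_O i=$n_i (need I=517, O=0, i=0)" >&2
    return 1
  fi
}
```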

**Success Criteria**:
- ✅ All finalization checks pass
- ✅ Registry updates successful
- ✅ Status transitions verified

### Role 4: Project Manager (1 person)

**Responsibilities**:
- Track overall progress against timeline
- Communicate status to stakeholders
- Escalate issues to leadership
- Document lessons learned

**Time Commitment**: 0.25 hours active

**Key Metrics**:
- I-status progression (target: 2 sites/min)
- Expected completion time
- Any blockers or delays
- Final project completion
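For the "Expected completion time" metric, dividing the remaining sites by the observed rate gives a rough ETA. This is a minimal sketch that uses `awk` for the fractional rate. `rate_per_min` and `eta_minutes` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: progress rate and rough ETA for the Project Manager's metrics.

# rate_per_min PREV_I CURR_I INTERVAL_MIN: sites per minute, two decimals.
rate_per_min() {
  awk -v p="$1" -v c="$2" -v m="$3" 'BEGIN { printf "%.2f\n", (c - p) / m }'
}

# eta_minutes CURR_I TOTAL RATE: minutes until TOTAL at RATE, rounded up.
eta_minutes() {
  awk -v c="$1" -v t="$2" -v r="$3" 'BEGIN {
    if (r <= 0) { print "unknown"; exit }   # no progress, no estimate
    m = (t - c) / r
    if (m < 0) m = 0                        # already complete
    e = int(m); if (e < m) e++              # ceiling
    print e
  }'
}
```

At the 2 sites/min target, a full 517-site batch needs `eta_minutes 0 517 2` = 259 minutes (about 4.3 hours). Compare that figure with the 3-4 hour active execution window in the timeline.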

---

## Operational Timeline

### Pre-Execution (Today)
| Task | Owner | Duration | Status |
|------|-------|----------|--------|
| Review documentation | All | 1 hour | ✅ Complete |
| Verify system status | Engineer | 30 min | ✅ Complete |
| Stage batch 001 | Engineer | 15 min | ✅ Complete |
| Brief team | Manager | 30 min | ⏳ Pending |

### Active Execution (Next 3-4 Hours)
| Task | Owner | Duration | Status |
|------|-------|----------|--------|
| Monitor progress | Monitor | Continuous | ⏳ Starting |
| Handle issues | Engineer | On-demand | ⏳ Standby |
| Track metrics | Manager | 30 min intervals | ⏳ Standby |

### Finalization (When I-status = 517)
| Task | Owner | Duration | Status |
|------|-------|----------|--------|
| Pre-finalization checks | Lead | 5 min | ⏳ Pending |
| Execute finalization | Lead | 1 min | ⏳ Pending |
| Post-finalization verify | Lead | 5 min | ⏳ Pending |

### Batch 010 Processing (After Finalization)
| Task | Owner | Duration | Status |
|------|-------|----------|--------|
| Create batch 010 | Engineer | 2 min | ⏳ Pending |
| Design phase | Engineer | 10 min | ⏳ Pending |
| Implementation phase | Engineer | 10 min | ⏳ Pending |
| Finalize | Lead | 1 min | ⏳ Pending |

### Project Completion (After Batch 010)
| Task | Owner | Duration | Status |
|------|-------|----------|--------|
| Verify 99.6% completion | Lead | 5 min | ⏳ Pending |
| Document lessons | Manager | 1 hour | ⏳ Pending |
| Archive documentation | Engineer | 30 min | ⏳ Pending |

---

## Escalation Procedures

### Level 1: Minor Issue (Monitor → Engineer)

**Trigger**: Progress slower than expected but still advancing

**Action**:
1. Monitor documents observation
2. Engineer diagnoses issue
3. Engineer implements minor fix (resource adjustment, restart)
4. Monitor confirms resolution

**Resolution Time**: <15 minutes

**Examples**:
- Slow verification checks
- Slightly high memory usage
- Network latency

### Level 2: Significant Issue (Engineer → Lead)

**Trigger**: Progress stalled or degrading

**Action**:
1. Engineer diagnoses issue
2. Lead reviews situation
3. Lead executes recovery procedure
4. Engineer verifies system restored

**Resolution Time**: 15-30 minutes

**Examples**:
- Registry write failures
- Finalization fails
- Batch 010 deploy issues

### Level 3: Critical Issue (Lead → Manager)

**Trigger**: Cannot recover autonomously, project completion at risk

**Action**:
1. Lead documents situation
2. Manager escalates to stakeholders
3. Team determines next steps
4. May require manual intervention or architectural changes

**Resolution Time**: >30 minutes (may not be resolvable)

**Examples**:
- Total system failure
- Irreparable data corruption
- Force majeure event

---

## Communication Plan

### Status Updates

**Frequency**: Every 30 minutes during active execution

**Format**:
```
[HH:MM] Status: I-status = XXX/517, O-status = YYY, i-status = ZZZ
[HH:MM] Progress: +N sites in last 30 min (~X sites/min)
[HH:MM] Next check: HH:MM
```

**Example**:
```
[14:30] Status: I-status = 10/517, O-status = 500, i-status = 7
[14:30] Progress: +10 sites in initial 30 min (init phase)
[14:30] Next check: 15:00

[15:00] Status: I-status = 35/517, O-status = 482, i-status = 0
[15:00] Progress: +25 sites in last 30 min (~0.8 sites/min, still ramping up)
[15:00] Next check: 15:30

[18:30] Status: I-status = 517/517, O-status = 0, i-status = 0
[18:30] Progress: FINALIZATION READY
[18:30] Next: Execute finalization command
```
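Generating the status lines, instead of typing them by hand, keeps the sites/min figure consistent with the site delta. This is a minimal sketch that takes the 30-minute default interval from the update frequency above. `status_update` and `next_check` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: emit the three-line status update in the format above.

# next_check HH:MM MINUTES: clock time MINUTES later (wraps past midnight).
next_check() {
  local hh=${1%:*} mm=${1#*:}
  # 10# forces base 10 so values like "08" are not read as octal
  local t=$(( (10#$hh * 60 + 10#$mm + $2) % 1440 ))
  printf '%02d:%02d\n' $(( t / 60 )) $(( t % 60 ))
}

# status_update HH:MM N_I N_O N_i PREV_I [INTERVAL_MIN]
status_update() {
  local now=$1 n_I=$2 n_O=$3 n_i=$4 prev=$5 interval=${6:-30}
  local delta=$(( n_I - prev ))
  local rate
  rate=$(awk -v d="$delta" -v m="$interval" 'BEGIN { printf "%.1f", d / m }')
  echo "[$now] Status: I-status = $n_I/517, O-status = $n_O, i-status = $n_i"
  echo "[$now] Progress: +$delta sites in last $interval min (~$rate sites/min)"
  echo "[$now] Next check: $(next_check "$now" "$interval")"
}

# Usage: status_update 15:00 35 482 0 10 >> status.log
```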

### Issue Escalation

**Trigger**: Any issue beyond normal operation

**Escalation Path**:
1. Monitor documents issue
2. Engineer investigates
3. Lead reviews and decides
4. Manager communicates outcome

**Communication**:
```
ISSUE: [Issue name]
SEVERITY: [Critical/High/Medium/Low]
STATUS: [Open/Investigating/Resolved]
ACTION TAKEN: [What was done]
IMPACT: [Effect on timeline]
NEXT STEPS: [What happens now]
```
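The template can be filled by a small function that rejects severities and statuses outside the allowed values. This is a minimal sketch. `issue_report` is a hypothetical helper that prints to stdout, so the output can be pasted into whichever channel the team uses.

```bash
#!/usr/bin/env bash
# Sketch: print an issue report in the template above, rejecting unknown values.

# issue_report NAME SEVERITY STATUS ACTION IMPACT NEXT
issue_report() {
  [ "$#" -eq 6 ] || { echo "usage: issue_report NAME SEVERITY STATUS ACTION IMPACT NEXT" >&2; return 2; }
  case "$2" in
    Critical|High|Medium|Low) ;;
    *) echo "invalid severity: $2" >&2; return 1 ;;
  esac
  case "$3" in
    Open|Investigating|Resolved) ;;
    *) echo "invalid status: $3" >&2; return 1 ;;
  esac
  printf 'ISSUE: %s\nSEVERITY: %s\nSTATUS: %s\nACTION TAKEN: %s\nIMPACT: %s\nNEXT STEPS: %s\n' "$@"
}
```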

### Daily Standup (If Project Extends)

**When**: End of each operational day

**Attendees**: All roles

**Duration**: 15 minutes

**Agenda**:
1. Status update (current I-status)
2. Issues encountered
3. Recovery actions taken
4. Next day timeline
5. Escalations needed

---

## Decision Matrix

### Situation: Progress Slower Than Expected

```
Is I-status increasing?
├─ YES → Continue normal operation
├─ SLOWLY (<1 site/min) →
│  ├─ Check system resources
│  ├─ Verify waves deployed
│  └─ Continue monitoring (may be normal)
└─ NO (completely stalled) →
   ├─ Escalate to Level 2
   └─ Execute recovery procedures
```
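The decision tree above can be expressed as a function of the site delta over one check interval. This is a minimal sketch. Comparing the delta with the interval in minutes tests "below 1 site/min" with integer arithmetic only. For simplicity it treats a zero delta in one interval as stalled. In practice, pair that result with the Monitor's 45-minute rule before escalating. `classify_progress` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: the "progress slower than expected" decision tree as a function.

# classify_progress DELTA_SITES INTERVAL_MIN
# DELTA < INTERVAL means fewer than 1 site/min.
classify_progress() {
  local delta=$1 interval=$2
  if [ "$delta" -le 0 ]; then
    echo "stalled: escalate to Level 2 and execute recovery procedures"
  elif [ "$delta" -lt "$interval" ]; then
    echo "slow: check system resources, verify waves deployed, keep monitoring"
  else
    echo "normal: continue normal operation"
  fi
}
```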

### Situation: Finalization Fails

```
Pre-checks all pass?
├─ NO → Fix issues, recheck
└─ YES →
   ├─ Attempt finalization again (idempotent)
   ├─ Check registry writable
   ├─ Check disk space
   ├─ If still fails → Escalate to Level 2
   └─ If succeeds → Verify completion
```
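The retry branch can be scripted as "check prerequisites, then retry a bounded number of times". Retrying is only reasonable because the tree above describes finalization as idempotent. This is a minimal sketch. This manual does not give the registry location, so `REGISTRY_PATH` is a placeholder you must supply. `retry`, `writable`, and `retry_finalization` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: the "finalization fails" branch: check prerequisites, then retry.

# retry ATTEMPTS CMD [ARGS...]: run CMD until it succeeds or ATTEMPTS is used up.
retry() {
  local attempts=$1 n
  shift
  for (( n = 1; n <= attempts; n++ )); do
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
  done
  return 1
}

# writable PATH: succeed if PATH exists and is writable by this user.
writable() { [ -e "$1" ] && [ -w "$1" ]; }

# retry_finalization BATCH REGISTRY_PATH (REGISTRY_PATH is a placeholder)
retry_finalization() {
  writable "$2" || { echo "registry not writable: $2" >&2; return 1; }
  df -h "$2"   # inspect free space on the registry's filesystem
  retry 2 tools/implement/finish.sh --batch "$1" --root . \
    || { echo "still failing: escalate to Level 2" >&2; return 1; }
}
```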

### Situation: Batch 010 Fails

```
Which phase failed?
├─ Design (DESIGN.md not created) →
│  └─ Redeploy design agent
├─ Implementation (files not generated) →
│  └─ Redeploy implementation agent
├─ Finalization (status not updated) →
│  └─ Retry finalization
└─ Unknown →
   └─ Run diagnostic checks, then appropriate action
```
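The phase checks can be automated for the single batch 010 site. This is a minimal sketch under stated assumptions: `SITE_DIR` is the site's working directory, "implementation output" means any regular file other than `DESIGN.md`, and the site's current status code is passed in by the caller. None of these details are specified in this manual. `failed_phase` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: decide which batch 010 phase failed (paths and layout are assumptions).

# failed_phase SITE_DIR STATUS: print design, implementation, finalization,
# none, or unknown, matching the branches of the decision tree above.
failed_phase() {
  local site_dir=$1 status=$2
  if [ ! -d "$site_dir" ]; then
    echo unknown         # run diagnostic checks first
  elif [ ! -f "$site_dir/DESIGN.md" ]; then
    echo design          # redeploy design agent
  elif [ -z "$(find "$site_dir" -type f ! -name DESIGN.md -print -quit)" ]; then
    echo implementation  # redeploy implementation agent
  elif [ "$status" != "Q" ]; then
    echo finalization    # retry finalization
  else
    echo none            # site already at Q
  fi
}
```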

---

## Checklist for Team

### Pre-Execution Checklist

- [ ] All team members understand their roles
- [ ] Monitor has command ready for status checks
- [ ] Engineer has troubleshooting procedures available
- [ ] Lead has finalization command ready
- [ ] Manager has status template ready
- [ ] Communication channel established (Slack/email)
- [ ] Everyone has access to required tools
- [ ] Emergency contact information shared

### Active Execution Checklist

**Every 30 minutes**:
- [ ] Monitor runs status check
- [ ] Status recorded in log
- [ ] If issues: Engineer informed immediately
- [ ] If all well: Proceed to next check

**At I-status = 517**:
- [ ] Monitor alerts Lead immediately
- [ ] Lead verifies pre-checks
- [ ] All team reviews readiness
- [ ] Lead executes finalization
- [ ] All verify post-checks pass
- [ ] Manager confirms completion

**At Batch 010 Completion**:
- [ ] Lead confirms 20241204.com at Q
- [ ] Manager verifies 99.6% project completion
- [ ] All roles document lessons learned
- [ ] Archive session documentation

### Post-Execution Checklist

- [ ] All sites verified at final status
- [ ] Documentation archived
- [ ] Lessons learned captured
- [ ] Team debriefing completed
- [ ] Stakeholder communication sent
- [ ] Project marked complete

---

## Knowledge Sharing

### Documentation Access

All team members should have access to:
- `MASTER_EXECUTION_ROADMAP.md` - Overall execution path
- `QUICK_REFERENCE_COMMANDS.md` - Command reference
- `EMERGENCY_RESPONSE_GUIDE.md` - Emergency procedures
- `OPERATIONAL_PLAYBOOK.md` - Common scenarios
- `COMPREHENSIVE_TROUBLESHOOTING_MATRIX.md` - Detailed troubleshooting
- `PERFORMANCE_OPTIMIZATION_GUIDE.md` - For reference

### Pre-Execution Briefing

**Duration**: 30 minutes

**Content**:
1. System architecture overview (5 min)
2. Each team member's role (5 min)
3. Key commands (5 min)
4. Emergency procedures (5 min)
5. Questions & answers (10 min)

### Expected Questions

**Q: How do I know if things are working?**
A: I-status count increasing at 1-2 sites/minute is normal

**Q: When should I be worried?**
A: After 45+ minutes with no I-status change

**Q: What if something goes wrong?**
A: Follow the section of EMERGENCY_RESPONSE_GUIDE.md that matches your symptom

**Q: Who do I contact?**
A: See escalation procedures above

---

## Success Metrics

### System Health
- ✅ I-status increasing steadily (1-2 sites/min)
- ✅ Registry updating continuously
- ✅ No errors in status transitions
- ✅ All verification checks passing

### Project Progress
- ✅ Batch 001: 572 → 517 at I → 517 at Q
- ✅ Batch 010: 1 site → I → Q
- ✅ Total: 566/568 sites finalized (99.6%)
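The completion percentage is simple arithmetic and can be rechecked at closure from the final counts. This is a minimal sketch using `awk`. `completion_pct` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: check the completion percentage quoted above.

# completion_pct DONE TOTAL: percentage with one decimal place.
completion_pct() {
  awk -v d="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * d / t }'
}

completion_pct 566 568   # prints 99.6
```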

### Team Performance
- ✅ All issues handled within escalation SLA
- ✅ Communication flowing properly
- ✅ Procedures followed correctly
- ✅ No unplanned delays

### Lessons Learned
- ✅ Document all issues and resolutions
- ✅ Identify process improvements
- ✅ Recommend optimizations for future
- ✅ Share knowledge with broader team

---

## Post-Completion Activities

### Lessons Learned Session

**When**: 1 day after project completion

**Attendees**: Full team + stakeholders (optional)

**Duration**: 1 hour

**Agenda**:
1. Project summary (10 min)
2. What went well (15 min)
3. What could improve (15 min)
4. Recommendations for next time (15 min)
5. Team appreciation (5 min)

### Documentation Archive

**Responsibility**: Engineer

**Activities**:
1. Copy all guides listed under Documentation Access to `.claude/session-archive/2026-03-24-wdmaker/`
2. Backup registry and batch files
3. Create summary document
4. Archive logs

**Location**: `.claude/session-archive/2026-03-24-wdmaker/`
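The archive steps can be scripted so that nothing is missed. This is a minimal sketch with two assumptions. First, the guides listed under Knowledge Sharing live in the project root. Second, `.smbatcher/` (the directory checked with `df` earlier) holds the registry and batch files. Adjust both paths before use. This manual does not give a log location, so step 4 is left out. `archive_session` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: copy guides, back up state, and write a summary into the archive.

GUIDES=(
  MASTER_EXECUTION_ROADMAP.md
  QUICK_REFERENCE_COMMANDS.md
  EMERGENCY_RESPONSE_GUIDE.md
  OPERATIONAL_PLAYBOOK.md
  COMPREHENSIVE_TROUBLESHOOTING_MATRIX.md
  PERFORMANCE_OPTIMIZATION_GUIDE.md
)

# archive_session ROOT DEST: archive from project ROOT into DEST.
archive_session() {
  local root=$1 dest=$2 f
  mkdir -p "$dest/guides" || return 1
  for f in "${GUIDES[@]}"; do
    # Copy each guide that exists; report any that are missing.
    if [ -f "$root/$f" ]; then
      cp "$root/$f" "$dest/guides/"
    else
      echo "missing guide: $f" >&2
    fi
  done
  # Assumed registry/batch-file location (see lead-in).
  if [ -d "$root/.smbatcher" ]; then
    tar -czf "$dest/smbatcher-backup.tar.gz" -C "$root" .smbatcher
  fi
  # Minimal summary: when the archive was made and which guides it holds.
  {
    echo "Archived: $(date '+%Y-%m-%d %H:%M')"
    ls -1 "$dest/guides"
  } > "$dest/SUMMARY.txt"
}

# Usage: archive_session . .claude/session-archive/2026-03-24-wdmaker
```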

---

## Team Performance Retrospective

### Metrics to Track

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Issue resolution time | <30 min | ? | ⏳ |
| Communication delays | <5 min | ? | ⏳ |
| Procedure adherence | 100% | ? | ⏳ |
| Total project time | 4-5 hours | ? | ⏳ |

### Improvement Opportunities

**For Next Project**:
1. Increase automation of monitoring (scripts)
2. Pre-deploy more on-call resources
3. Use better communication channels
4. Create dashboards for real-time status
5. Plan for potential bottlenecks in advance

---

## Emergency Contact Plan

**Primary**: Execution Lead
**Secondary**: System Engineer
**Escalation**: Project Manager

**If No Contact Can Be Reached**: Follow the last-resort procedures in EMERGENCY_RESPONSE_GUIDE.md

---

*Team Operations Manual: 2026-03-24*
*Purpose: Guidelines for team execution*
*Scope: Role definitions, escalation, communication*
*Audience: All operational staff*

