# Advanced Troubleshooting Scenarios

**Purpose**: Deep-dive into complex issues and resolution strategies
**Audience**: Advanced operators, system architects, escalation specialists
**Status**: Expert-level troubleshooting reference

---

## Scenario 1: Partial Wave Failure (Some Agents Stop)

**Symptom**: Throughput drops from ~15 sites/min to ~3 sites/min mid-wave

**Probable cause**: Some agents in the current wave (wave 5 in this example) have crashed

**Diagnosis** (15-20 min):

```bash
# Check if agents still running
ps aux | grep -i opus | grep -v grep | wc -l
# Expected: ~25 (one wave) or more
# Abnormal: < 10

# Check wave logs if available
tail -100 logs/wave-5.log | grep -i error

# Check for stuck sites (i = claimed/in-progress)
tools/shared/list-sites.sh --batch 001 --status "i" | wc -l
# If high (> 20): Many sites claimed but not completing

# Snapshot the completed count; compare against a sample taken
# 5 minutes earlier (see the sketch below)
I_NOW=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)
echo "Current I-count: $I_NOW"
```
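
To put a number on the slowdown, here is a minimal sketch that samples the I-count twice, five minutes apart, and derives sites/min (it reuses the `list-sites.sh` flags shown above):

```bash
# Sample the I-count five minutes apart and compute throughput
BEFORE=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)
sleep 300
AFTER=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)
echo "Throughput: $(( (AFTER - BEFORE) / 5 )) sites/min"
```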

**Root cause determination**:

1. **If agent count low** (< 10):
   - Agents crashed or terminated
   - Likely: Resource exhaustion, timeout, error

2. **If many sites are stuck at i** (> 20):
   - Agents not completing work
   - Likely: Very slow agents, or timeout

3. **If everything looks normal**:
   - Slowdown might be normal
   - Check: Are many sites at i (in-progress)?
   - If yes: Normal, agents just slower
   - If no: Agents may be crashing silently

**Resolution** (depends on root cause):

### If agents crashed:
```bash
# Redeploy same wave
tools/implement/mimplement-bg.sh --batch 001 --max-agents 25 --wave 5
```

### If sites stuck at i:
```bash
# Check which sites
tools/shared/list-sites.sh --batch 001 --status "i" | head -10

# These sites are being worked on
# Options:
# 1. Wait 15 more minutes (they may complete)
# 2. Redeploy wave (may create duplicates)
# 3. Accept loss and move on
```

### If everything normal but slow:
1. Monitor for another 30 minutes
2. If slowdown persists: May indicate resource exhaustion
3. Check system: `free -h`, `df -h`, `top -bn1`

**Expected outcome**: Either agents restart and throughput recovers, or you confirm the slowdown is expected

**Time investment**: 15-30 minutes

---

## Scenario 2: Registry Becoming Corrupted (Duplicate Entries)

**Symptom**: `tools/shared/list-sites.sh` shows strange results, or the same site appears twice

**Probable cause**: Concurrent write conflict (multiple agents updating registry simultaneously)

**Diagnosis** (10-15 min):

```bash
# Check for duplicates
awk -F'|' '{print $2}' .smbatcher/REGISTRY.md | sort | uniq -d

# If output shows domain names: Duplicates exist (bad)
# If output empty: No duplicates (good)

# Check registry integrity
wc -l .smbatcher/REGISTRY.md
# Should be 569 (568 sites + 1 header)
# If > 569: Likely duplicates

# Check git status
git status .smbatcher/REGISTRY.md
# Shows if modified or clean
```

**Severity assessment**:

- **1-5 duplicates**: Low severity, recovery straightforward
- **5-20 duplicates**: Medium severity, need targeted fix
- **> 20 duplicates**: High severity, consider restoring from backup
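
To get the duplicate count these tiers refer to, reuse the awk check from the diagnosis above:

```bash
# Count distinct domains that appear more than once in the registry
DUPES=$(awk -F'|' '{print $2}' .smbatcher/REGISTRY.md | sort | uniq -d | wc -l)
echo "Duplicate domains: $DUPES"
```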

**Recovery procedure**:

### Option 1: Remove duplicates manually (if few)
```bash
# Backup first
cp .smbatcher/REGISTRY.md .smbatcher/REGISTRY.md.backup

# Create clean registry: keep the header row in place, dedupe the rest
# (sorting the whole file would shuffle the header into the body)
head -1 .smbatcher/REGISTRY.md > /tmp/registry-clean.txt
tail -n +2 .smbatcher/REGISTRY.md | grep "^|" | sort -u >> /tmp/registry-clean.txt
wc -l /tmp/registry-clean.txt
# Should be 569 (568 unique entries + 1 header)

# Verify looks good
head -5 /tmp/registry-clean.txt

# Replace
mv /tmp/registry-clean.txt .smbatcher/REGISTRY.md
```

### Option 2: Restore from git (if many duplicates)
```bash
# See history
git log --oneline .smbatcher/REGISTRY.md | head -5

# Restore to the last known good revision
# (adjust HEAD~5 to the commit identified in the history above)
git checkout HEAD~5 .smbatcher/REGISTRY.md

# Verify (expect 569: 568 sites + 1 header)
grep -c "^|" .smbatcher/REGISTRY.md
```

### Option 3: Restore from backup
```bash
# If you made a backup earlier
cp .smbatcher/REGISTRY.md.backup .smbatcher/REGISTRY.md
```

**Prevention**:
- This should not happen with atomic operations
- If it does, report as system bug
- Harden the atomic write path (write to a temp file, then rename; see the sketch below)
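
A minimal sketch of that temp-file-then-rename pattern, assuming a hypothetical `| domain | status | ... |` row layout (the real registry columns may differ); `mv` within one filesystem is an atomic rename, and `flock` serializes concurrent writers:

```bash
# Hypothetical hardened status update: edit a copy, then atomically swap it in
update_registry() {  # usage: update_registry <domain> <new-status>
  (
    flock -x 200   # one writer at a time
    sed "s#^| $1 |[^|]*|#| $1 | $2 |#" .smbatcher/REGISTRY.md \
      > .smbatcher/REGISTRY.md.tmp
    mv .smbatcher/REGISTRY.md.tmp .smbatcher/REGISTRY.md  # atomic rename
  ) 200>.smbatcher/registry.lock
}
```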

**Time investment**: 20-40 minutes (depending on corruption extent)

---

## Scenario 3: Disk Space Running Out (Critical)

**Symptom**: `df -h` shows < 500MB free and falling

**Probable cause**: The sites/ directory grows as files are generated

**Diagnosis** (5 min):

```bash
# Check disk usage
df -h .smbatcher/
df -h sites/

# Check largest directories
du -sh sites/ .smbatcher/ | sort -h

# Estimate growth rate (see the sketch below)
# Example: at 50MB/min growth, ~27GB of free space fills in ~9 hours
```
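
A quick way to measure that growth rate: sample `du` twice, one minute apart:

```bash
# Estimate MB/min growth of the sites/ directory
BEFORE=$(du -sm sites/ | cut -f1)
sleep 60
AFTER=$(du -sm sites/ | cut -f1)
echo "Growth: $((AFTER - BEFORE)) MB/min"
```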

**Severity**:

- **Free space > 1GB**: Normal, no action needed
- **Free space 500MB-1GB**: Warning, monitor closely
- **Free space 100-500MB**: Alert, clean up now
- **Free space < 100MB**: Critical, stop operations

**Cleanup procedures** (in order of preference):

### 1. Clear temporary files (safe)
```bash
# Remove agent temp files
du -sh .smbatcher/tmp/
rm -rf .smbatcher/tmp/*
# Gain: 50-200MB typically

# Remove design cache if exists
rm -rf .smbatcher/design-cache/*
```

### 2. Archive completed batches (if applicable)
```bash
# Archive completed site directories
# (narrow the glob: as written it matches ALL *-v1 directories,
# not just batches 002-009)
mkdir -p .project-archive/
tar -czf .project-archive/batches-002-009.tar.gz sites/*-v1
# tar alone frees nothing: verify the archive, then delete the originals
tar -tzf .project-archive/batches-002-009.tar.gz > /dev/null && rm -rf sites/*-v1
# Gain: 100-500MB depending on batch size
```

### 3. Compress site files (risky - don't do unless desperate)
```bash
# WARNING: Only if absolutely necessary
gzip -r sites/*/
# Gain: 30-50% space reduction
# Risk: Need to uncompress before serving
```

### 4. Delete oldest batch (last resort)
```bash
# WARNING: Permanent data loss
rm -f .smbatcher/batches/Batch_002.md
# NOTE: `rm -rf sites/*-v1 | grep 002` would delete ALL site
# directories; grep only filters rm's (empty) output, not what
# gets removed. Delete only batch 002 directories instead
# (assumes list-sites.sh prints one domain per line):
tools/shared/list-sites.sh --batch 002 | while read -r site; do
  rm -rf "sites/${site}-v1"
done
# This is irreversible
```

**Prevention for future**:

- Estimate disk needs: 568 sites × ~50MB average ≈ 28GB needed
- Ensure 50GB free before starting
- Plan for cleanup between batches
- Monitor growth rate continuously
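
One way to do that continuous monitoring, as a sketch (the 1GB threshold matches the severity tiers above; `df -Pm` reports available space in MB):

```bash
# Warn once a minute while free space on the sites/ filesystem is below 1GB
while sleep 60; do
  FREE_MB=$(df -Pm sites/ | awk 'NR==2 {print $4}')
  [ "$FREE_MB" -lt 1024 ] && echo "$(date): LOW DISK: ${FREE_MB}MB free"
done
```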

**Time investment**: 10-20 minutes

---

## Scenario 4: One Site Repeatedly Failing (Stuck)

**Symptom**: Same site (e.g., lotus.dev) keeps failing, never completes

**Probable cause**: Site-specific issue (bad domain name, special characters, design conflict)

**Diagnosis** (10-15 min):

```bash
# Check status of problem site
tools/shared/list-sites.sh | grep lotus.dev

# Check design exists
ls -la .smbatcher/designs/lotus.dev-DESIGN.md

# Check for error log entries
grep -r "lotus.dev" logs/ 2>/dev/null | grep -i error

# Check if files were generated despite failure
ls -la sites/lotus.dev-v1/ 2>/dev/null || echo "No directory"

# Get site status history (if available)
git log -p .smbatcher/REGISTRY.md | grep "lotus.dev" | head -10
```

**Root cause identification**:

### If design missing:
- Redesign phase failed for this site
- Redeploy: `tools/prepare/mdesign.sh --batch 001 --max-agents 1 --sites lotus.dev`

### If files generated but status not updated:
- Agent finished but didn't update registry
- Manually update: Mark as I, then finalize

### If design exists but agent keeps failing:
- Agent hitting timeout or error
- Solution: Increase timeout, redeploy, or manual fix

### If design non-compliant:
- Design spec has issue (bad colors, missing sections)
- Manual fix: Edit DESIGN.md, redeploy agent

**Resolution**:

```bash
# Option 1: Re-run just this site
tools/implement/mimplement-bg.sh --batch 001 --max-agents 1 --sites lotus.dev

# Option 2: Accept and move on (if close to deadline)
# Mark manually as Q, document in closure report

# Option 3: Manual fix
# Create HTML/CSS/JS manually, put in sites/lotus.dev-v1/
# Update registry: Mark as Q
```
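
If you take option 2 or 3, the registry status still has to be flipped by hand. For example, using the hypothetical `update_registry` helper sketched in Scenario 2 (adjust to your actual registry layout):

```bash
# Mark the hand-fixed site as Q and note it for the closure report
update_registry lotus.dev Q
echo "lotus.dev: manual fix, marked Q on $(date)" >> logs/manual-fixes.log  # log name illustrative
```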

**Time investment**: 30-60 minutes (depending on fix complexity)

---

## Scenario 5: Total System Unresponsiveness (Emergency)

**Symptom**: Commands hang, system slow, nothing responding

**Probable cause**: Resource exhaustion (memory, CPU, I/O), system overload

**Emergency response** (< 5 min):

```bash
# Kill non-essential processes (carefully!)
# WARNING: Only kill if system truly unresponsive

# Check what's using resources
top -bn1 | head -20

# Kill a few agent processes (careful!)
# pkill has no "kill N processes" flag; select PIDs explicitly:
pgrep -f "opus" | head -5 | xargs -r kill

# Or restart everything (nuclear option)
pkill -9 -f mimplement-bg.sh
pkill -9 -f python
# This loses all in-progress work
```

**Recovery** (after stabilization):

1. **Wait for system to calm down** (5 minutes)
   - Stop all new operations
   - Let running processes complete

2. **Assess damage** (see the sketch after this list):
   - How many sites were in progress when the system went down?
   - How much progress was lost?
   - Can you recover from a checkpoint?

3. **Decide path forward**:
   - If < 50 sites lost: Restart those sites
   - If 50-100 sites lost: Restart entire wave
   - If > 100 sites lost: May need full restart

4. **Prevent recurrence**:
   - Reduce agents per wave
   - Add monitoring/alerts
   - Upgrade system resources
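
A sketch for step 2, counting the sites that were mid-flight (status i) when the system went down:

```bash
# Sites claimed but unfinished at crash time; these likely need a re-run
LOST=$(tools/shared/list-sites.sh --batch 001 --status "i" | wc -l)
echo "In-progress at crash: $LOST sites"
```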

**Time investment**: 30-60 minutes recovery

---

## Scenario 6: Finalization Fails Catastrophically

**Symptom**: `finish.sh` returns an error; some sites at Q, some still at I

**Probable cause**: Atomic operation failed mid-execution

**Diagnosis** (10 min):

```bash
# Check status distribution (capture counts so they can be summed)
I_COUNT=$(tools/shared/list-sites.sh --batch 001 --status 'I' | wc -l)
Q_COUNT=$(tools/shared/list-sites.sh --batch 001 --status 'Q' | wc -l)
echo "I-status: $I_COUNT"
echo "Q-status: $Q_COUNT"

# Total should still be 517
TOTAL=$((I_COUNT + Q_COUNT))
echo "Total: $TOTAL (expected 517)"

# If total < 517: Some sites lost (bad)
# If total = 517 but split: Partial finalization (not ideal but recoverable)
```

**Recovery**:

### If partial (some at Q, some at I):
```bash
# Get list of remaining I-status sites
tools/shared/list-sites.sh --batch 001 --status "I" > /tmp/remaining.txt

# Count them
wc -l /tmp/remaining.txt
# If small (< 50): Retry finalization
#   tools/implement/finish.sh --batch 001 --root .
# If large (> 50): May indicate deeper issue
```

### If sites disappeared:
```bash
# Restore from backup/git
git checkout .smbatcher/REGISTRY.md

# Restart finalization from beginning
tools/implement/finish.sh --batch 001 --root .
```

### If registry corrupted:
```bash
# Check integrity
grep -c "^|" .smbatcher/REGISTRY.md

# If the count is not 569 (568 sites + header): Restore and retry
git checkout .smbatcher/REGISTRY.md
tools/implement/finish.sh --batch 001 --root .
```

**Time investment**: 20-40 minutes

---

## Scenario 7: Network Failure (If Remote Execution)

**Symptom**: Connection drops, SSH session dies, system appears unreachable

**Probable cause**: Network issue, server down, SSH timeout

**Diagnosis** (immediate):

```bash
# From local machine
ping <server-ip>

# Try SSH
ssh user@server

# Check logs if you can access
ssh user@server "tail -100 batch.log"
```

**If server unreachable**:

1. **Check network**: Is server down? Is network up?
2. **Check SSH**: Try different port or key
3. **Wait**: Server may auto-recover in 5-10 min

**If server is up but SSH not working**:

```bash
# Retry with a short timeout so failures surface quickly
ssh -o ConnectTimeout=5 user@server

# Or use different connection (console, VPN, etc.)
```

**Recovery**:

```bash
# Once connected, check status
tools/shared/list-sites.sh --batch 001 --status "I" | wc -l

# If execution continued: Good, just monitor
# If execution stopped: Restart agents
```

**Prevention**:

- Use `tmux` or `screen` for persistent sessions
- Use `nohup` to run background processes
- Set SSH keep-alive: `ServerAliveInterval 60`
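
For example, a sketch of launching a wave so it survives connection drops (flags borrowed from Scenario 1):

```bash
# Inside a persistent tmux session (detach with Ctrl-b d,
# reattach later with: tmux attach -t batch)
tmux new -s batch
tools/implement/mimplement-bg.sh --batch 001 --max-agents 25 --wave 5

# Or without tmux: detach the process from the terminal entirely
nohup tools/implement/mimplement-bg.sh --batch 001 --max-agents 25 --wave 5 \
  > batch.log 2>&1 &
```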

**Time investment**: 10-20 minutes

---

## Decision Matrix for Advanced Issues

| Issue Type | Likelihood | Severity | Recovery Time | Recommended Action |
|-----------|-----------|----------|----------------|-------------------|
| Partial wave failure | Medium | Low | 15-30 min | Redeploy agents |
| Registry corruption | Low | High | 20-40 min | Restore from backup |
| Disk space critical | Low | High | 10-20 min | Clean up temp files |
| Single site stuck | Medium | Low | 30-60 min | Manual fix or accept |
| System unresponsive | Low | Critical | 30-60 min | Kill processes, restart |
| Finalization fails | Low | High | 20-40 min | Retry or restore |
| Network failure | Low | Medium | 10-20 min | Reconnect, restart |

---

## Escalation Protocol

**When an issue is not solvable by standard procedures**, escalate as follows:

1. **Identify** if issue is beyond standard troubleshooting
2. **Document** everything attempted so far
3. **Gather** logs, screenshots, exact error messages
4. **Contact** technical lead or project architect
5. **Brief** them on:
   - What you tried
   - What happened
   - Current system state
   - Recommended next steps

**Information to provide on escalation**:

```
ISSUE: [Description]
IMPACT: [What's affected]
SEVERITY: [High/Medium/Low]
TIME SINCE OCCURRENCE: [Duration]
ATTEMPTED FIXES: [What you tried]
CURRENT STATE: [Status now]
SYSTEM HEALTH: [Resources, disk, memory]
ATTACHMENT: [Logs, commands output]
```

---

## Conclusion

Advanced troubleshooting requires:
- ✅ Deep system understanding
- ✅ Methodical diagnosis
- ✅ Careful recovery procedures
- ✅ Clear escalation path

**Most issues are recoverable** if handled systematically.

---

*Advanced Troubleshooting Scenarios: 2026-03-24*
*Purpose: Expert-level issue resolution*
*Status: Ready for complex problem-solving*
