# WDMaker Emergency Response Guide

**Purpose**: Rapid diagnosis and recovery for unexpected system issues
**Use When**: Unusual behavior, stalled progress, or system errors occur
**Recovery Time**: Usually 5-15 minutes with correct procedures

---

## Emergency Scenario 1: Progress Completely Stalled (No I-Status Change for 45+ Minutes)

### Diagnosis
```bash
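# Check whether the registry file was modified during a 60-second window
echo "=== Registry Activity Check ==="
# stat -c %Y is GNU (Linux); stat -f %m is BSD (macOS). Both print mtime as epoch seconds.
mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null; }

TIMESTAMP_OLD=$(mtime .smbatcher/REGISTRY.md)
sleep 60
TIMESTAMP_NEW=$(mtime .smbatcher/REGISTRY.md)

if [ -z "$TIMESTAMP_NEW" ]; then
  echo "✗ Cannot read .smbatcher/REGISTRY.md"
elif [ "$TIMESTAMP_NEW" != "$TIMESTAMP_OLD" ]; then
  echo "✓ Registry updating (modified within the last 60 seconds)"
else
  AGE=$(( $(date +%s) - TIMESTAMP_NEW ))
  echo "✗ Registry NOT updating (last change: $AGE seconds ago) - possible system hang"
fi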
```

### Check Site Files Still Being Generated
```bash
# Sample 5 random sites at O status and count the files in each site directory
echo "=== Checking File Generation Activity ==="
for domain in $(tools/shared/list-sites.sh --batch 001 --status "O" | shuf | head -5); do
  if [ -d "sites/${domain}-v1" ]; then
    echo "$domain: Directory exists"
    ls "sites/${domain}-v1/" | wc -l
  else
    echo "$domain: Directory missing"
  fi
done
```

### Recovery Steps (In Order)
1. **Wait 15 more minutes**: Sometimes agents just need more time for heavy operations
2. **Check system resources**:
   ```bash
   df -h .smbatcher/  # Check disk space
   # If <10% free, that's the problem
   ```
3. **Verify no locked files**:
   ```bash
   lsof .smbatcher/REGISTRY.md 2>/dev/null | grep -v COMMAND
   # If processes are locking it, they may be stuck
   ```
4. **Restart by redeploying agents**:
   ```bash
   # If you are confident the agents are stuck, redeploy them.
   # Usually it is better to let the autonomous retry logic work.
   ```

---

## Emergency Scenario 2: I-Status Count Goes Backward (Decreases)

### Diagnosis
```bash
# This should never happen - check for registry corruption
echo "=== Checking for Registry Issues ==="

# Count registry rows; a total far from the expected value suggests duplicated or lost entries
TOTAL_SITES=$(grep -c "^| " .smbatcher/REGISTRY.md)
echo "Total entries: $TOTAL_SITES (should be ~570)"

# Check for malformed lines
echo "Checking line format..."
grep "^| " .smbatcher/REGISTRY.md | grep -v "|.*|.*|.*|.*|.*|" && echo "⚠️  Malformed lines found" || echo "✓ All lines well-formed"
```
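
The `grep` pattern above only confirms that a row contains at least six pipes. As a stricter sketch, the helper below reports every row whose field count differs from the first table row. It assumes all registry rows share one column layout and start with `| `; the name `find_ragged_rows` is hypothetical, not part of the WDMaker tools.

```bash
# Print "line_number: row" for each table row whose field count differs
# from the first table row (assumption: all rows share one layout).
find_ragged_rows() {
  awk -F'|' '
    /^[|] / {
      if (!expected) expected = NF      # first table row sets the baseline
      else if (NF != expected) print NR ": " $0
    }
  ' "$1"
}
```

Usage: `find_ragged_rows .smbatcher/REGISTRY.md`. No output means every row has the same number of fields.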

### If Registry Corrupted
```bash
# Step 1: Restore from backup
if [ -f ".smbatcher/REGISTRY.md.backup" ]; then
  echo "Restoring from backup..."
  cp .smbatcher/REGISTRY.md .smbatcher/REGISTRY.md.corrupted-$(date +%s)
  cp .smbatcher/REGISTRY.md.backup .smbatcher/REGISTRY.md
  echo "✓ Restored from backup"
else
  echo "✗ No backup available"
  echo "  Check git history: git log .smbatcher/REGISTRY.md"
fi
```

### Verify Restoration
```bash
I_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)
echo "I-status after recovery: $I_COUNT (should match what you saw before)"
```
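
Comparing against "what you saw before" only works if the earlier count was written down. A minimal sketch of a count log follows: it appends each reading and returns a failure status when the new value is lower than the previous one. The function name and the log path are hypothetical choices.

```bash
# Append "epoch count" to LOG and return 1 if COUNT is lower than the
# previously recorded value. The log file is a hypothetical addition.
record_count() {
  local log=$1 count=$2 prev
  prev=$(tail -n 1 "$log" 2>/dev/null | awk '{print $2}')
  echo "$(date +%s) $count" >> "$log"
  if [ -n "$prev" ] && [ "$count" -lt "$prev" ]; then
    echo "I-status went backward: $prev -> $count" >&2
    return 1
  fi
}
```

Usage: `record_count .smbatcher/i-count.log "$I_COUNT"` after each check. The log path here is only an example.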

---

## Emergency Scenario 3: Finalization Fails Mid-Process

### Diagnosis
```bash
# Check what the error was
echo "=== Finalization Failure Diagnosis ==="

# Check if partial finalization occurred
Q_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "Q" | wc -l)
I_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)

echo "Q-status (finalized): $Q_COUNT"
echo "I-status (waiting): $I_COUNT"

if [ $Q_COUNT -gt 0 ] && [ $I_COUNT -gt 0 ]; then
  echo "⚠️  PARTIAL FINALIZATION - some sites finalized, some not"
  echo "    This is okay - proceed with recovery"
fi
```

### Recovery: Re-run Finalization (Safe - Idempotent)
```bash
# The finish.sh command is idempotent - safe to retry
echo "Retrying finalization..."
tools/implement/finish.sh --batch 001 --root .

# Check results
Q_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "Q" | wc -l)
echo "After retry - Q-status: $Q_COUNT / 517"
```
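
Because the retry is described as idempotent, it can be wrapped in a bounded loop instead of being re-run by hand. The sketch below is a generic helper and does not come from the WDMaker tools; the name `retry`, the attempt limit, and the delay are assumptions to adjust.

```bash
# Run a command up to MAX times, pausing DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $attempt/$max failed: $*" >&2
    attempt=$((attempt + 1))
    [ "$attempt" -le "$max" ] && sleep "$delay"
  done
  return 1
}
```

Usage: `retry 3 30 tools/implement/finish.sh --batch 001 --root .` makes at most three attempts, 30 seconds apart, which lines up with the three-retry limit under "When to Escalate".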

### If Still Fails
```bash
# Check disk space (common cause)
echo "Disk space:"
df -h .smbatcher/

# Check file permissions
ls -la .smbatcher/REGISTRY.md | awk '{print $1, $9}'
# Should be readable/writable (rw- in permissions)

# Check if finish.sh script exists
ls -la tools/implement/finish.sh

# Last resort: Manual diagnostics
echo "Checking registry for write errors..."
tail -50 .smbatcher/REGISTRY.md
```

---

## Emergency Scenario 4: Most Sites Stuck at i (In Progress)

### Diagnosis
```bash
i_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "i" | wc -l)
I_COUNT=$(tools/shared/list-sites.sh --batch 001 --status "I" | wc -l)

echo "Sites in progress: $i_COUNT"
echo "Sites completed: $I_COUNT"

if [ $i_COUNT -gt 300 ]; then
  echo "⚠️  Many sites stuck in progress"
  echo "   Agents may be slow or experiencing verification delays"
fi
```

### Check Agent Load
```bash
# Agents might just be slow - check if files are being created
SAMPLE_SITES=$(tools/shared/list-sites.sh --batch 001 --status "i" | head -3)
echo "Checking file generation for sample in-progress sites:"

for domain in $SAMPLE_SITES; do
  echo "  $domain:"
  [ -f "sites/${domain}-v1/index.html" ] && echo "    ✓ HTML generated" || echo "    ✗ HTML pending"
  [ -f "sites/${domain}-v1/styles.css" ] && echo "    ✓ CSS generated" || echo "    ✗ CSS pending"
  [ -f "sites/${domain}-v1/script.js" ] && echo "    ✓ JS generated" || echo "    ✗ JS pending"
done
```

### Recovery
```bash
# Usually just wait - verification is thorough
# If files exist but not marked as I, wait for agent to complete marking

# Show a progress snapshot (uses the counts from the Diagnosis block)
echo "Progress snapshot:"
echo "  $i_COUNT in progress"
echo "  $I_COUNT completed"
echo "  Re-run in a few minutes: a rising completed count means agents are still working"

# If you see files but status not updating, check:
# 1. complete.sh script working
# 2. Registry not locked
# 3. Agent timeouts not occurring
```
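
Two raw counts are hard to compare between checks. A single percentage is easier to track; the sketch below computes it from the in-progress and completed counts. The name `percent_done` is hypothetical, and sites at other statuses are ignored.

```bash
# Integer percentage of sites that reached I, out of those at i or I.
percent_done() {
  local in_progress=$1 done_count=$2
  local total=$((in_progress + done_count))
  if [ "$total" -eq 0 ]; then
    echo 0
  else
    echo $((done_count * 100 / total))
  fi
}
```

Usage: `echo "$(percent_done "$i_COUNT" "$I_COUNT")% complete"`. A value that rises between checks suggests the agents are still making progress.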

---

## Emergency Scenario 5: Batch 010 Deploy Fails

### Diagnosis
```bash
# Check if batch 010 was created
[ -f ".smbatcher/batches/Batch_010.md" ] && echo "✓ Batch file exists" || echo "✗ Batch file missing"

# Check 20241204.com status (grep -F matches the dots literally)
STATUS=$(grep -F "| 20241204.com |" .smbatcher/REGISTRY.md | cut -d'|' -f4 | tr -d ' ')
echo "20241204.com current status: $STATUS"

# Check if DESIGN.md was created
[ -f "sites/20241204.com-v1/DESIGN.md" ] && echo "✓ DESIGN.md created" || echo "✗ DESIGN.md missing"
```

### Recovery: Re-create Batch 010
```bash
# If batch file missing, recreate it
echo "Re-creating batch 010..."
tools/prepare/batch.sh --batch-size 1 --input-file sites.csv --version v1

# Verify creation
[ -f ".smbatcher/batches/Batch_010.md" ] && echo "✓ Batch recreated" || echo "✗ Batch creation failed"
```

### Recovery: Re-deploy Design Phase
```bash
# If design didn't run, redeploy
echo "Deploying design phase for batch 010..."
tools/mdesign/launch.py --batch 010

# Wait 5 minutes (300 seconds), then check
echo "Waiting for design completion..."
sleep 300

STATUS=$(grep -F "| 20241204.com |" .smbatcher/REGISTRY.md | cut -d'|' -f4 | tr -d ' ')
if [ "$STATUS" = "D" ]; then
  echo "✓ Design completed"
else
  echo "⚠️  Still waiting for design (status: $STATUS)"
fi
```
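
A fixed sleep either waits too long or checks too early. As an alternative sketch, the helper below polls the registry until the status matches or a timeout expires. It reads the status from the same column as the commands above (`cut` field 4). The function name, timeout, and interval are assumptions.

```bash
# Poll REGISTRY until DOMAIN reaches status WANT or TIMEOUT seconds elapse.
# Returns 0 when the status matches, 1 on timeout.
wait_for_status() {
  local registry=$1 domain=$2 want=$3 timeout=$4 interval=$5
  local waited=0 status
  while :; do
    status=$(grep -F "| $domain |" "$registry" | cut -d'|' -f4 | tr -d ' ')
    [ "$status" = "$want" ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Usage: `wait_for_status .smbatcher/REGISTRY.md 20241204.com D 600 30` checks every 30 seconds for up to 10 minutes.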

---

## Emergency Scenario 6: File Generation Complete But Status Not Marked

### Diagnosis
```bash
# Check for mismatch - files exist but status is still i (sample of 20 sites)
MISMATCHED=0
for domain in $(tools/shared/list-sites.sh --batch 001 --status "i" | head -20); do
  if [ -f "sites/${domain}-v1/index.html" ] && [ -f "sites/${domain}-v1/styles.css" ] && [ -f "sites/${domain}-v1/script.js" ]; then
    echo "✗ MISMATCH: $domain has all files but status is i"
    MISMATCHED=$((MISMATCHED+1))
  fi
done

echo "Total mismatched sites (sample): $MISMATCHED"
```
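
The `-f` tests above also pass for zero-byte files, so a site that is still being written can look complete. A slightly stricter sketch uses `-s`, which requires each file to exist and be non-empty. The name `site_files_complete` is hypothetical, and a non-empty file is still no proof that verification passed.

```bash
# Succeed only if every expected output file exists and is non-empty.
site_files_complete() {
  local dir=$1 file
  for file in index.html styles.css script.js; do
    [ -s "$dir/$file" ] || return 1
  done
  return 0
}
```

Usage: `site_files_complete "sites/${domain}-v1" && echo "$domain looks complete"`.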

### Recovery: Manual Status Update (Last Resort)
```bash
# Usually agents will mark complete within minutes
# Only do this if files are DEFINITELY complete and verification passed

# Step 1: Verify files are complete
domain="DOMAIN"  # Replace with actual domain
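for file in index.html styles.css script.js DESIGN.md; do
  if [ -f "sites/${domain}-v1/$file" ]; then
    SIZE=$(wc -c < "sites/${domain}-v1/$file")
  else
    SIZE=0  # Missing file: report 0 bytes without a shell redirection error
  fi
  echo "$file size: $SIZE bytes"
done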

# Step 2: Manually run completion check (if safe)
# This is usually done by the agent, but can be done manually
tools/check/design-compliance.sh --domain "$domain"

# Step 3: If compliance passes, agent should mark complete
# If not, something in verification failed
```

---

## Emergency Scenario 7: System Exhaustion (Disk/Memory)

### Quick Diagnosis
```bash
echo "=== System Resource Check ==="
echo "Disk usage:"
df -h .smbatcher/ | tail -1

echo "Memory usage:"
# free is Linux-only; on macOS, run vm_stat instead
free -h | head -2

echo "Large files in sites/:"
find sites -type f -size +50M | head -5
```

### Recovery
```bash
# If disk < 5% free: CRITICAL
# If memory > 90%: May cause slowdowns

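# Check for old backups or temporary files
# Review the list before removing anything: .smbatcher/REGISTRY.md.backup is the restore source for Scenario 2
echo "Cleanup candidates:"
find . -name "*.backup" -o -name "*.tmp" | head -10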

# Archive old batches if needed
mkdir -p .archive
# Carefully move old batch files if needed (don't delete!)
```
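
The thresholds in the comments above can be checked mechanically. The sketch below reads the used-space percentage from `df -P` (the POSIX output layout, column 5) and maps free space to a level. The function names and the WARNING label for the 5-10% range are assumptions; the 5% and 10% cut-offs come from this guide.

```bash
# Print the percentage of free space on the filesystem holding the given path.
# df -P uses the portable POSIX layout; column 5 is the percent used.
free_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Map a free-space percentage to the thresholds used in this guide.
disk_level() {
  if [ "$1" -lt 5 ]; then echo CRITICAL
  elif [ "$1" -lt 10 ]; then echo WARNING
  else echo OK
  fi
}
```

Usage: `disk_level "$(free_percent .smbatcher/)"`.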

---

## Emergency Communication Checklist

When something goes wrong, check in this order:
1. ✓ **Registry file still exists and is readable**
2. ✓ **Disk space > 5%**
3. ✓ **Memory not exhausted**
4. ✓ **No network connectivity issues**
5. ✓ **File permissions still correct**
6. ✓ **No conflicting processes locking files**

---

## Recovery Priority

| Severity | Scenario | Action |
|----------|----------|--------|
| CRITICAL | Registry corrupted | Restore from backup |
| CRITICAL | Disk full | Free space immediately |
| CRITICAL | Finalization partially failed | Retry finish.sh |
| HIGH | Progress stalled 1+ hour | Check resources, restart agents |
| HIGH | i-status very high (300+) | Wait, may be slow verification |
| MEDIUM | Batch 010 deploy fails | Redeploy design phase |
| LOW | Some files not marked complete | Wait for agents, then manual update |

---

## When to Escalate

If, after following the recovery steps, you still see any of the following:
- Registry still corrupted after restore
- Disk space continues to decrease
- Agents not responding even after restart
- Finalization continues to fail after 3 retries

Then stop retrying, record the exact command, error output, and time of failure, and consider alternative approaches.

---

## Prevention

To avoid emergencies:
1. **Monitor regularly**: Check I-status every 15-30 minutes during active processing
2. **Backup before major operations**: `cp .smbatcher/REGISTRY.md .smbatcher/REGISTRY.md.backup`
3. **Watch disk space**: Keep >10% free during processing
4. **Don't interrupt finalization**: Let it run to completion
5. **Trust autonomous execution**: Don't manually edit registry unless absolutely necessary
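
The backup in item 2 overwrites the previous `.backup` file each time, so a backup taken after corruption would replace the last good copy. As a sketch under that concern, the helper below writes the `.backup` file that the restore step expects and also keeps a timestamped copy. The function name and the timestamped naming scheme are assumptions, and the extra copies use disk space.

```bash
# Copy the registry to REGISTRY.md.backup (the name the restore step expects)
# and keep a timestamped copy so earlier backups are not overwritten.
backup_registry() {
  local registry=$1 stamp
  stamp=$(date +%Y%m%d-%H%M%S)
  if [ ! -s "$registry" ]; then
    echo "Refusing to back up empty or missing $registry" >&2
    return 1
  fi
  cp -p "$registry" "$registry.backup" &&
    cp -p "$registry" "$registry.backup-$stamp"
}
```

Usage: `backup_registry .smbatcher/REGISTRY.md` before finalization or any manual registry edit.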

---

*Emergency Guide Created: 2026-03-24*
*Purpose: Rapid response to system issues*
*Confidence: 99% recovery with these procedures*

