Debugging Docker Operations
Overview
Systematic troubleshooting workflows for Docker build failures, runtime errors, AWS container services, platform-specific issues, and performance optimization. Designed for multi-platform environments (ARM64/AMD64/WSL2) with pharmaceutical compliance contexts requiring GAMP-5 validation.
When to Use
- Build Failures: Package installation errors, COPY failures, layer caching issues
- Runtime Errors: Container crashes, exit codes, DNS resolution, port conflicts
- AWS Container Services: ECR authentication, ECS pull failures, rate limiting
- Platform Issues: ARM64 vs AMD64 emulation, WSL2 integration, cross-platform builds
- Performance Problems: Slow builds, large images, QEMU emulation overhead
- Volumes/Networking: Permission errors, mount failures, port binding
| Tool | Purpose |
|---|---|
| Bash | Execute Docker commands, view logs |
| Read | Examine Dockerfiles, compose files |
| Grep | Search for patterns in logs |
| Glob | Find Docker-related files |

| Tool | Purpose |
|---|---|
| `mcp__aws-api-mcp__call_aws` | Execute AWS CLI for ECR/ECS operations |
| `mcp__aws-knowledge-mcp__aws___search_documentation` | Search AWS docs for container guidance |
| `mcp__aws-ccapi-mcp__list_resources` | List ECS tasks, ECR repositories |
Core Workflow
Phase 1: Symptom Diagnosis
Objective: Identify problem category and gather diagnostic information
- Identify Problem Type

```bash
# For build failures (the trailing "." is the build context)
docker build . 2>&1 | tee build.log

# For runtime errors
docker logs <container-name>
docker inspect <container-name>

# For AWS issues
aws ecr describe-repositories
aws ecs describe-tasks --cluster <cluster> --tasks <task-arn>
```
- Categorize the Issue
  - Build Failure (exit during `docker build`)
  - Runtime Error (container exits or crashes)
  - AWS ECR/ECS Issue (authentication, pull failures)
  - Platform/Architecture Issue (ARM64/AMD64 mismatch)
  - Performance Problem (slow or inefficient)
  - Networking/Volume Issue
- Collect Context

```bash
docker version && docker info | head -30
uname -m  # Platform: x86_64 or aarch64
```
Quality Gate: Problem type identified, initial logs collected
Phase 2: Root Cause Analysis
Build Failures
Common Patterns:
```bash
# Check build log for specific failures
grep -E "ERROR|failed|error:" build.log

# Common patterns:
# "COPY failed" → File not in build context
# "returned a non-zero code" → Command failed
# "no matching manifest" → Platform mismatch
```
Key Checks:
- Build context: `ls -la <path-to-file>`
- .dockerignore: `cat .dockerignore`
- Layer caching: `docker build --no-cache`
See reference/common-errors.md for complete error matrix.
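
The three log patterns listed above can be checked in one pass. The following is a minimal sketch, not part of the original tooling: `classify_build_log` is a hypothetical helper name, and the match strings are copied from the list above. Newer builders may word these errors differently, so treat a result of "unknown" as a prompt to read the log by hand.

```bash
# classify_build_log LOGFILE
# Hypothetical helper: prints one hint per known pattern found in the log.
classify_build_log() {
  local log="$1"
  local found=0

  if [ ! -r "$log" ]; then
    echo "cannot read log: $log" >&2
    return 2
  fi

  if grep -q "COPY failed" "$log"; then
    echo "context: file not in build context"
    found=1
  fi
  if grep -q "returned a non-zero code" "$log"; then
    echo "command: a build command failed"
    found=1
  fi
  if grep -q "no matching manifest" "$log"; then
    echo "platform: platform mismatch"
    found=1
  fi

  # Nothing matched: the patterns above are not exhaustive.
  if [ "$found" -eq 0 ]; then
    echo "unknown: no known pattern matched"
  fi
}
```

A typical call would be `classify_build_log build.log`, run on the log captured in Phase 1.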
Runtime Errors
Exit Code Quick Reference:
| Code | Meaning | Common Cause |
|---|---|---|
| 0 | Normal exit | Application completed |
| 1 | Application error | Check application logs |
| 126 | Cannot execute | Permission issue (file not executable) |
| 127 | Command not found | Wrong CMD/ENTRYPOINT |
| 137 | SIGKILL (128 + 9) | Usually an OOM kill after the memory limit is exceeded; also `docker kill` |
| 139 | SIGSEGV (128 + 11) | Segfault, application crash |
| 143 | SIGTERM (128 + 15) | Graceful shutdown, e.g. `docker stop` |
Diagnosis:
```bash
docker inspect <container> | jq '.[0].State'
docker logs --tail 100 <container>
```
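
The exit-code table can also be expressed as a small lookup. This is a sketch under one stated assumption: codes from 129 to 255 are read as 128 plus a signal number, which is the usual shell convention but not a guarantee, since an application may return such a code on its own. `explain_exit_code` is a hypothetical helper name.

```bash
# explain_exit_code CODE
# Hypothetical helper: maps a container exit code to a short description.
explain_exit_code() {
  local code="$1"
  case "$code" in
    ''|*[!0-9]*)
      echo "not a numeric exit code: $code" >&2
      return 2 ;;
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    126) echo "cannot execute" ;;
    127) echo "command not found" ;;
    *)
      # By convention, 128 + N means the process ended on signal N.
      if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}
```

It can be fed directly from Docker, for example `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container>)"`. For code 137, also check `{{.State.OOMKilled}}` to tell an OOM kill apart from a manual kill.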
AWS ECR/ECS Issues
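ECR Authentication:
- Error: "no basic auth credentials"
- Cause: Docker is not logged in to the registry, or the token has expired (ECR tokens last 12 hours)

ECS Pull Failures:
- Error: "CannotPullContainerError"
- Checklist:
  - Task execution role has ECR permissions
  - Private subnets have a route to ECR (NAT Gateway or VPC endpoints)
  - Image tag exists in ECR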
Using AWS MCP:
```
# Check ECR images
mcp__aws-api-mcp__call_aws: aws ecr describe-images --repository-name <repo>
# Check ECS task failures
mcp__aws-api-mcp__call_aws: aws ecs describe-tasks --cluster <cluster> --tasks <task-arn>
```
See reference/aws-ecr-ecs.md for complete AWS troubleshooting guide.
Platform Issues
Slow Build Detection:
```bash
# Check if emulation is active (ARM64 host running an AMD64 image)
docker run --rm --platform=linux/amd64 alpine uname -m
# If output shows x86_64 on an ARM64 host → emulation active (slow)
```
Solutions:
- Use the native platform for development
- Build for the target platform only when deploying
- Use multi-stage builds with `${BUILDPLATFORM}` so build stages run natively
See reference/platform-guide.md for complete platform guidance.
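
The emulation check boils down to comparing the host architecture with the target architecture, allowing for the two spellings in use (`uname -m` reports `x86_64`/`aarch64`, Docker platforms use `amd64`/`arm64`). A minimal sketch, with `normalize_arch` and `is_emulated` as hypothetical helper names:

```bash
# normalize_arch ARCH
# Maps uname-style and Docker-style names onto one spelling.
normalize_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *)             echo "$1" ;;   # unknown names pass through unchanged
  esac
}

# is_emulated HOST_ARCH TARGET_ARCH
# Succeeds (exit 0) when the two architectures differ, i.e. when running
# or building the target image on this host would need emulation.
is_emulated() {
  [ "$(normalize_arch "$1")" != "$(normalize_arch "$2")" ]
}
```

For example, `is_emulated "$(uname -m)" amd64 && echo "linux/amd64 will run under emulation"` prints the warning on an ARM64 host and nothing on an AMD64 host.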
Quality Gate: Root cause identified, specific error patterns documented
Phase 3: Solution & Validation
Build Failure Fixes
```dockerfile
# Fix missing file in context
# Verify path exists: ls -la <path>
COPY main.py /app/

# Fix package installation with retry
RUN pip install --no-cache-dir -r requirements.txt || \
    (pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt)

# Fix platform issues
# docker build --platform=linux/amd64 -t image:prod .
```
Runtime Error Fixes
```bash
# Fix permission issues
docker run --user "$(id -u):$(id -g)" image:tag

# Fix OOM (Exit 137)
docker run -m 2g --memory-swap 2g image:tag

# Fix port conflicts
docker run -p 8081:8080 image:tag  # Change host port
```
AWS Fixes
```bash
# Refresh ECR credentials
aws ecr get-login-password --region eu-west-2 | docker login --username AWS --password-stdin <account>.dkr.ecr.eu-west-2.amazonaws.com

# Fix Docker Hub rate limiting - use ECR Public
# FROM public.ecr.aws/docker/library/python:3.12-slim
```
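
The credential refresh above can be wrapped so the region and registry host cannot drift apart. This is a sketch only: `ecr_registry` and `ecr_login` are hypothetical helper names, and the account ID and region are supplied by the caller.

```bash
# ecr_registry ACCOUNT_ID REGION
# Prints the private ECR registry host for the account and region.
ecr_registry() {
  local account="$1"
  local region="$2"
  if [ -z "$account" ] || [ -z "$region" ]; then
    echo "usage: ecr_registry <account-id> <region>" >&2
    return 2
  fi
  echo "${account}.dkr.ecr.${region}.amazonaws.com"
}

# ecr_login ACCOUNT_ID REGION
# Fetches a fresh token and logs Docker in to the matching registry.
# Requires the aws and docker CLIs plus valid AWS credentials.
ecr_login() {
  local registry
  registry="$(ecr_registry "$1" "$2")" || return 2
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$registry"
}
```

Calling `ecr_login <account> eu-west-2` is then equivalent to the one-line command above.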
Validation
```bash
# Rebuild with verbose output
docker build --progress=plain -f Dockerfile -t image:test .

# Test container startup
docker run --rm image:test

# Verify functionality
docker run -it image:test /bin/sh -c "command-to-test"
```
Quality Gate: Fix applied, container builds and runs without errors
Quick Reference Table
| Problem | Quick Check | Common Fix |
|---|---|---|
| ECR "no basic auth" | Token age (12hr) | `aws ecr get-login-password \| docker login` |
| ECS CannotPull | Task execution role, NAT | Check IAM + VPC config |
| Docker Hub rate limit | Error message | Use ECR Public or authenticate |
| Build fails at COPY | `ls -la <file>` | Fix path or add to build context |
| Package install fails | Check network | Update pip, verify package name |
| Exit 137 (OOM) | `OOMKilled` in `docker inspect` | Increase memory limit (`-m` flag) |
| Exit 127 | Command not found | Fix CMD/ENTRYPOINT path |
| Slow ARM64 build | `--platform` flag | Use native ARM64 for dev |
| WSL2 vmmem bloat | Task Manager | `.wslconfig` memory limits |
| Volume mount not working | Check compose mounts | `docker-compose down && docker-compose up -d` |
| Port already in use | `docker ps` | Change host port or stop the conflicting container |
| Permission denied | Check USER in Dockerfile | Add USER directive, fix volume perms |
Best Practices
Build Optimization
- Layer ordering: Dependencies before code (cache efficiency)
- Multi-stage builds: Separate build and runtime environments
- Use .dockerignore: Exclude `node_modules`, `.git`, `__pycache__`
Security
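```dockerfile
# Non-root user (adduser -D is the Alpine/BusyBox form;
# on Debian/Ubuntu images use: useradd --create-home appuser)
RUN adduser -D appuser
USER appuser

# Never COPY secrets into the image; pass them at build time with
# docker build --secret and read them via RUN --mount=type=secret
```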
Platform Builds
```bash
# Development (native platform - fast)
docker build -t image:dev .

# Production (target platform)
docker build --platform=linux/amd64 -t image:prod .

# Multi-platform with cache
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --cache-from type=registry,ref=<repo>:cache \
  --cache-to type=registry,ref=<repo>:cache \
  -t image:multi --push .
```
WSL2
- Store projects in the WSL2 filesystem (`~/`), NOT the Windows mount (`/mnt/c/`)
- Configure `.wslconfig` with memory limits
- Use `wsl --shutdown` to reclaim memory
See reference/wsl2-optimization.md for complete WSL2 guide.
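
Two of the WSL2 points above lend themselves to a quick check. The sketch below uses hypothetical helper names (`is_windows_mount`, `wslconfig_snippet`). The `[wsl2]` section and `memory` key are standard `.wslconfig` settings, while the `8GB` default is only an illustrative assumption to adjust for the machine in question.

```bash
# is_windows_mount PATH
# Succeeds (exit 0) when PATH lives under a Windows drive mount (/mnt/<drive>/).
is_windows_mount() {
  case "$1" in
    /mnt/[a-z]|/mnt/[a-z]/*) return 0 ;;
    *)                       return 1 ;;
  esac
}

# wslconfig_snippet [MEMORY]
# Prints a minimal .wslconfig body with a memory cap.
# The 8GB default is an assumption, not a recommendation.
wslconfig_snippet() {
  local memory="${1:-8GB}"
  printf '[wsl2]\nmemory=%s\n' "$memory"
}
```

For example, `is_windows_mount "$PWD" && echo "move this project into ~/ for faster builds"` flags a project stored on the Windows side. The output of `wslconfig_snippet 4GB` belongs in `.wslconfig` in the Windows user profile directory, followed by `wsl --shutdown` for it to take effect.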
Common Pitfalls
Don't: Use relative paths without verification
```dockerfile
# WRONG - paths outside the build context are rejected
COPY ../app/file.py /app/

# RIGHT - relative to build context root
COPY app/file.py /app/
```
Don't: Leave the target platform implicit
```bash
# WRONG - platform inferred (may differ from prod)
docker build -t api:latest .

# RIGHT - explicit platform
docker build --platform=linux/amd64 -t api:latest .
```
Do: Verify volume mounts work
```bash
# Check if container sees host files
docker exec <container> ls -la /app/

# Recreate containers after mount changes
docker-compose down && docker-compose up -d
```
Detailed References
- Docker Commands: `reference/docker-commands.md`
- Common Error Matrix: `reference/common-errors.md` (20+ error patterns)
- AWS ECR/ECS Guide: `reference/aws-ecr-ecs.md`
- Platform Guide: `reference/platform-guide.md` (ARM64/AMD64/WSL2, buildx caching)
- WSL2 Optimization: `reference/wsl2-optimization.md`
- Build Optimization: `reference/build-optimization.md`
- Security Hardening: `reference/security-hardening.md`
Diagnostic Scripts
- Build Failure Analysis: `scripts/analyze-build-failure.sh <log-file>`
- Container Inspection: `scripts/inspect-container.sh <container-name>`
- Platform Check: `scripts/check-platform.sh`
Remember: Docker issues are systematic and diagnosable. Follow the three-phase workflow (Symptom → Root Cause → Solution), use AWS MCP tools for ECR/ECS issues, and reference the detailed guides for complex scenarios. Always validate fixes before considering the issue resolved.