# Remote Run over SSH
Operate CVlization examples on the remote GPU reachable as `ssh l1`. This playbook keeps the remote copy minimal—just the target example folder (e.g., `examples/perception/multimodal_multitask/recipe_analysis_torch`) and the `cvlization/` library—then relies on the example's own `build.sh` / `train.sh` Docker helpers. The user's long-lived checkout on the remote stays untouched.
## When to Use
- Heavy training or evaluation runs that require CUDA (CIFAR10 speed runs, multimodal pipelines, etc.).
- Performance or regression measurements on the remote GPU after local code changes.
- Producing reproducible logs / artifacts for discussions or CI baselines without pushing a branch first.
## Prerequisites
- Local repo state ready to sync (uncommitted changes are acceptable).
- SSH config already maps the GPU machine to `l1`.
- Remote host provides an NVIDIA GPU (currently an A10) and Docker with the GPU runtime enabled.
- At least ~15 GB free under `/tmp` for the slim workspace, Docker context, and caches.
- Hugging Face tokens or other credentials available locally if the example pulls hub assets.
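The disk-space requirement can be checked up front. A minimal sketch, assuming a POSIX `df`; `free_gb` and `require_free_gb` are hypothetical helper names, not part of the repo (run the same `df` check via `ssh l1` for the remote side):

```shell
# Hypothetical preflight helpers: verify a directory has at least N GB free.
free_gb() {
  # Print whole gigabytes available on the filesystem holding $1.
  # df -Pk gives POSIX-format output in 1 KiB blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1048576 }'
}

require_free_gb() {
  # Usage: require_free_gb DIR GB; succeeds when DIR has >= GB free.
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Local check; for the remote, run: ssh l1 'df -Pk /tmp'
require_free_gb /tmp 15 || echo "warning: less than 15 GB free under /tmp"
```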
## Quick Reference
- Choose the example to run.
- `rsync` only the example folder, `cvlization/`, and any required helper dirs to `/tmp/cvlization_remote` on `l1`.
- On `l1`, run `./build.sh` inside the example folder to build the Docker image.
- Run `./train.sh` (or the example's equivalent) to launch the job with GPU access.
- Collect logs / metrics and record the run in `var/skills/remote-run-ssh/runs/<timestamp>/log.md`.
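The whole flow can be collapsed into one helper for repeat runs. A sketch under stated assumptions: `remote_run` is a hypothetical function name, and it assumes the `l1` SSH alias and the rsync filter rules described in step 2 of the detailed procedure:

```shell
# Hypothetical one-shot wrapper for the quick-reference flow.
remote_run() {
  example="$1"                          # path relative to the repo root
  remote_root=/tmp/cvlization_remote
  # Sync only the library and examples, then build and launch remotely.
  rsync -az --delete \
    --include='cvlization/***' --include='examples/***' --exclude='*' \
    ./ "l1:${remote_root}/"
  ssh l1 "cd ${remote_root}/${example} && ./build.sh && ./train.sh > run.log 2>&1"
}

# Usage:
# remote_run examples/perception/multimodal_multitask/recipe_analysis_torch
```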
## Detailed Procedure
### 1. Identify the example and supporting files
- Note the example path relative to the repo root (e.g., `examples/perception/multimodal_multitask/recipe_analysis_torch`).
- List any extra assets the run needs (custom configs under `examples`, top-level scripts, environment files, etc.).
- Confirm `cvlization/` includes all library modules the example imports.
### 2. Sync minimal workspace to /tmp
```bash
REMOTE_ROOT=/tmp/cvlization_remote
rsync -az --delete \
  --include='cvlization/***' \
  --include='examples/***' \
  --include='scripts/***' \
  --include='pyproject.toml' \
  --include='setup.cfg' \
  --include='README.md' \
  --exclude='*' \
  ./ l1:${REMOTE_ROOT}/
```
Tips:
- Adjust the include list if the example needs additional files (e.g., `docker-compose.yml`, `requirements.txt`). The blanket `--exclude='*'` prevents unrelated directories from syncing.
- Keep the remote path structure (`${REMOTE_ROOT}/examples/...`) aligned with the local repo so relative paths like `../../../..` used inside scripts still resolve to the repo root.
- Avoid syncing `.git`, local datasets, `.venv`, or heavy cache directories.
### 3. Build the example image
```bash
ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./build.sh'
```
- The Docker build context is limited to the example folder, keeping builds quick.
- Edit `build.sh` locally if you need custom base images or dependency tweaks before re-syncing.
### 4. Confirm GPU availability
Ensure no conflicting jobs are consuming the GPU before starting a long run.
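One way to script the check, assuming `nvidia-smi`'s CSV query output; the `gpu_idle` helper and the 1000 MiB idle threshold are arbitrary illustrative choices, not part of the repo:

```shell
# Decide whether the GPU looks idle from its reported memory use.
# On the remote, get the real number with:
#   used=$(ssh l1 'nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits')
gpu_idle() {
  # $1 = used MiB; treat anything under 1000 MiB as idle (arbitrary cutoff).
  [ "$1" -lt 1000 ]
}

used=42   # stand-in value for the nvidia-smi query above
if gpu_idle "$used"; then
  echo "GPU looks idle (${used} MiB in use)"
else
  echo "GPU busy (${used} MiB in use); wait before launching"
fi
```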
### 5. Run the training script (Docker)
```bash
ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./train.sh > run.log 2>&1'
```
- Tail logs while the job runs:
  ```bash
  ssh l1 'tail -f /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log'
  ```
- `train.sh` already mounts the example directory at `/workspace`, mounts the synced repo root read-only at `/cvlization_repo`, and sets `PYTHONPATH=/cvlization_repo`.
- Customize environment variables or extra mounts by editing `train.sh` locally (e.g., injecting dataset paths, WANDB keys), then re-syncing.
- If an example lacks Docker scripts, fall back to running its entrypoint directly (`python train.py`) inside the container or a temporary venv—but document the deviation in the run log.
### 6. Capture metrics
Log files should include per-epoch summaries and final metrics. Record:
- Wall-clock time / throughput.
- Accuracy, loss, or other task metrics.
- Warnings or notable log lines (e.g., retry downloads, CUDA warnings).
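Extracting the numbers can be scripted. A sketch that assumes the training script prints lines containing `loss`/`acc`/`epoch` — the exact log format varies per example, so adapt the patterns:

```shell
# Pull the last few metric lines out of a run log. The grep pattern is an
# assumption about the log format; tune it to the example's actual output.
summarize_log() {
  grep -Ei 'loss|acc|epoch' "$1" | tail -n 3
}

# Demo against a fabricated log (for real runs, point it at run.log):
log=$(mktemp)
printf 'epoch 1 loss 0.90 acc 0.55\nepoch 2 loss 0.40 acc 0.81\n' > "$log"
summarize_log "$log"
```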
### 7. Retrieve artifacts (optional)
```bash
rsync -az l1:/tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log \
  ./remote_runs/$(date +%Y%m%dT%H%M%S)_multimodal.log
```
Copy checkpoints, TensorBoard logs, or generated samples in the same manner if needed.
### 8. Document the run
- Create `var/skills/remote-run-ssh/runs/<timestamp>/log.md` summarizing:
  - Example path, git commit or diff basis.
  - Commands executed (`build.sh`, `train.sh` args).
  - Key metrics / observations.
  - Location of logs or artifacts (local paths or remote references).
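A scaffold with those fields can be generated in one go. A sketch: the field list mirrors this playbook; fill the values in by hand after the run:

```shell
# Create a timestamped run directory with an empty log.md skeleton.
run_dir="var/skills/remote-run-ssh/runs/$(date +%Y%m%dT%H%M%S)"
mkdir -p "$run_dir"
cat > "$run_dir/log.md" <<'EOF'
# Remote run log

- Example path:
- Git commit / diff basis:
- Commands executed (build.sh, train.sh args):
- Key metrics / observations:
- Location of logs or artifacts:
EOF
echo "created $run_dir/log.md"
```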
### 9. Cleanup (optional)
- Delete `/tmp/cvlization_remote` when finished if the disk budget is tight (`ssh l1 'rm -rf /tmp/cvlization_remote'`).
- Clear cached datasets (`rm -rf /root/.cache/...` inside Docker) only if future runs should start fresh.
## Troubleshooting
- **Missing module inside container:** Ensure `train.sh` mounts the repo root and sets `PYTHONPATH`. Re-run `rsync` if files were added after the initial sync.
- **Docker build fails:** Inspect the output of `./build.sh`. Some examples assume base images with CUDA toolkits—update the Dockerfile accordingly.
- **CUDA OOM:** Reduce batch sizes or precision, or run one job at a time on `l1`.
- **No GPU detected:** Confirm `--gpus=all` is present in `train.sh` and check `nvidia-smi` on the host.
- **Long sync times:** Tighten the rsync include rules to the specific example and library folders required.
## Outputs
Every run should leave:
- A remote workspace at `/tmp/cvlization_remote` containing the synced example + library.
- Local or remote logs with the captured metrics.
- A run log under `var/skills/remote-run-ssh/runs/<timestamp>/log.md` documenting the session.