Add Step (p2cs_project)
When to Use
Use this skill when:
- A new pipeline step needs to be added under code/step_{N}_{name}/.
- A step should be inserted between existing steps, requiring renumbering of later steps.
- You need a checklist for creating the step class, substeps, README, explore.ipynb, and wiring dependencies/config.
This skill is project-specific to p2cs_project and assumes the architecture described in the root README.md.
Overview: Step Pattern
Each numbered step follows the same high-level pattern:
- Directory: code/step_{N}_{snake_name}/
  - step.py: main Step{N}{CamelName} class inheriting from base.Step.
  - data_classes.py: data classes inheriting from DataFile subclasses (when the step has new artifacts).
  - __init__.py: re-exports the main step class (and sometimes helper symbols).
  - README.md: step-specific documentation (inputs, outputs, substeps, external tools).
  - explore.ipynb: exploration notebook following the standard notebook structure from the root README.md.
- Numbered substeps: 1_*.py, 2_*.py, etc., each implementing focused logic.
- Outputs in data/step_{N}_{snake_name}/ and figures in figures/step_{N}_{snake_name}/ are routed via base.paths.
- The step is registered in pipeline_config.yaml under steps: step_{N}_{snake_name}: ...
- Tests live in code/tests/test_step_{N}.py.
Good templates to copy from:
- Simple single-substep data step: step_4_prepare_pairs/
- Multi-substep data/tooling step: step_2_organism_distance/
- Modeling/evaluation step: step_6_train_model/, step_7_crosstalk_estimation/
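The directory pattern above can be checked mechanically before a step is wired in. The following is a minimal sketch, not part of the project: the function name `missing_step_files` and the decision to treat `data_classes.py` as optional are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Files named in the step pattern. data_classes.py is only needed when the
# step defines new artifacts, so this sketch does not require it.
REQUIRED_FILES = ("step.py", "__init__.py", "README.md", "explore.ipynb")


def missing_step_files(step_dir: Path) -> list[str]:
    """Return the pattern files that are absent from step_dir (hypothetical helper)."""
    missing = [name for name in REQUIRED_FILES if not (step_dir / name).is_file()]
    # Every step needs at least one numbered substep such as 1_load.py.
    if not list(step_dir.glob("[0-9]*_*.py")):
        missing.append("numbered substep (1_*.py, ...)")
    return missing
```

An empty return value means the directory has the files the pattern calls for; it says nothing about whether their contents are correct.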
Workflow A: Append a New Step at the End
1. Determine the new step index and name
   - Inspect existing numbered steps under code/step_*.
   - Let N_max be the largest index (currently 8 for step_8_generate_paper).
   - Choose:
     - New index: N_new = N_max + 1
     - Snake name: step_{N_new}_{snake_name}
     - Class name: Step{N_new}{CamelName}
2. Create the step directory
   - Create code/step_{N_new}_{snake_name}/ with at least:
     - __init__.py (re-export the main step class).
     - step.py (main step implementation).
     - README.md (step documentation).
     - explore.ipynb (exploration notebook).
     - One or more numbered substeps 1_*.py, 2_*.py, etc.
     - data_classes.py and/or config.json if this step defines new data types or config.
   - Recommended pattern: copy the closest existing step directory (e.g., step_4_prepare_pairs/) and rename/trim to match the new step’s responsibilities.
3. Implement the main step class
   - In step.py:
     - Import Step and path helpers from code/base/step.py and code/base/paths.py.
     - Define a class like: class Step{N_new}{CamelName}(Step):
     - Implement:
       - name and description properties (or class attributes).
       - dependencies property returning a List[str] of upstream steps, using canonical IDs like "step_1_get_p2cs_data".
       - get_input_paths() and get_output_paths() using data classes and get_step_input_path / get_step_output_path.
       - run() orchestrating any substeps via self.run_substeps(...).
4. Define data classes (if needed)
   - In data_classes.py:
     - Inherit from appropriate DataFile subclasses (e.g., PickleDataFile, CSVDataFile, NumpyDataFile).
     - Define schemas, descriptions, and default loaders/savers as in existing steps.
   - Use these data classes in get_input_paths() / get_output_paths() and in substeps.
5. Create numbered substeps
   - Add scripts 1_*.py, 2_*.py, etc. inside the new step directory.
   - Follow existing substep patterns:
     - Each substep is a small class/function using the step’s data classes and paths helpers.
     - The main step’s run() calls self.run_substeps(...) with:
       - Substep objects
       - step_numbers=[1, 2, ...]
       - descriptions=[...]
       - Appropriate on_failure mode ("strict" or "warning").
6. Create the step README
   - In README.md, mirror the structure used in other steps:
     - Short description.
     - Inputs (data classes, upstream steps).
     - Outputs and their data classes.
     - Substeps and what they do.
     - Any external tools / configs required.
7. Create the explore.ipynb notebook
   - Follow the standard structure from the root README.md:
     - # Imports (path setup + step/data class imports).
     - # Load Data
       - ## Load Inputs (using step.get_input_paths() + data classes).
       - ## Load Outputs.
     - # Plot
       - Display saved figures from visualization substeps first.
       - Put any extra exploratory plots after those.
     - # Notes (short list of exploration ideas).
   - Respect the collapsible headings rule (heading-only markdown cells).
8. Wire the step into pipeline_config.yaml
   - Under steps:, add a new entry:
     - Key: step_{N_new}_{snake_name}:
     - Fields: enabled, description, overwrite_outputs, optional fast_plots, and substeps:.
   - Add a substeps: section keyed by the filenames (without .py), matching patterns in other steps.
9. Add tests
   - Create code/tests/test_step_{N_new}.py by copying a nearby test (e.g., test_step_4.py) and adjusting:
     - Imports to the new step and data classes.
     - Test names and assertions to cover the new step’s behavior.
10. Run tests / pipeline checks
    - Run pytest code/tests/test_step_{N_new}.py.
    - Optionally run the step via: cd code && python run_pipeline.py --step step_{N_new}_{snake_name}
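The shape of the main step class from step 3 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `Step` class below is a stand-in for base.Step so the example runs outside the repo, the `run_substeps` signature is inferred from the keyword names listed in step 5, and `Step9ExampleFilter`, its file names, and its substep methods are hypothetical. Copy the real signatures from an existing step such as step_4_prepare_pairs/.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Dict, List


class Step:
    """Stand-in for base.Step (assumed interface; the real class may differ)."""

    def run_substeps(
        self,
        substeps: List[Callable[[], None]],
        step_numbers: List[int],
        descriptions: List[str],
        on_failure: str = "strict",
    ) -> None:
        for number, description, substep in zip(step_numbers, descriptions, substeps):
            try:
                substep()
            except Exception as exc:
                if on_failure == "strict":
                    raise
                print(f"warning: substep {number} ({description}) failed: {exc}")


class Step9ExampleFilter(Step):
    """Hypothetical appended step; the name and paths are placeholders."""

    name = "step_9_example_filter"
    description = "Illustrative step that filters prepared pairs."

    def __init__(self) -> None:
        self.ran: List[int] = []  # records which substeps executed

    @property
    def dependencies(self) -> List[str]:
        # Canonical upstream step IDs, as used by the existing steps.
        return ["step_4_prepare_pairs"]

    def get_input_paths(self) -> Dict[str, Path]:
        # Real steps resolve these via data classes and get_step_input_path.
        return {"pairs": Path("data") / "step_4_prepare_pairs" / "pairs.csv"}

    def get_output_paths(self) -> Dict[str, Path]:
        # Real steps resolve these via data classes and get_step_output_path.
        return {"filtered": Path("data") / self.name / "filtered.csv"}

    def run(self) -> None:
        self.run_substeps(
            [lambda: self.ran.append(1), lambda: self.ran.append(2)],
            step_numbers=[1, 2],
            descriptions=["load pairs", "filter pairs"],
            on_failure="strict",
        )
```

In the real project the substeps would be the objects defined in 1_*.py and 2_*.py rather than inline lambdas; the lambdas only keep the sketch self-contained.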
Workflow B: Insert a Step in the Middle (with Renumbering)
Use this when inserting a new step between existing steps (e.g., between step_3_embed_proteins and step_4_prepare_pairs).
B1. Plan the new ordering
1. Identify current step order
   - List existing code/step_* directories and their indices (including step_0_draw_theoretical).
2. Choose insertion point
   - Let:
     - N_insert_after = index of the step before the new one.
     - N_new = N_insert_after + 1.
   - All steps with index > N_insert_after must be shifted up by 1:
     - Old k → new k + 1 for all k > N_insert_after.
3. Decide the new step’s ID
   - Choose:
     - New directory name: step_{N_new}_{snake_name}.
     - New class name: Step{N_new}{CamelName}.
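The renumbering plan from B1 can be computed before touching any files. The sketch below assumes directory names follow the step_{N}_{snake_name} pattern; `plan_shift` is a hypothetical helper, not something that exists in the project.

```python
from __future__ import annotations

import re

STEP_DIR = re.compile(r"^step_(\d+)_(.+)$")


def plan_shift(step_dirs: list[str], insert_after: int) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs for every step that must move up by 1.

    Pairs are ordered from the highest index down, matching the order in
    which the renames should be performed.
    """
    plan = []
    for name in step_dirs:
        match = STEP_DIR.match(name)
        if match and int(match.group(1)) > insert_after:
            k = int(match.group(1))
            plan.append((k, name, f"step_{k + 1}_{match.group(2)}"))
    plan.sort(reverse=True)  # highest index first
    return [(old, new) for _, old, new in plan]
```

For example, with `insert_after=3` the entry for step_4_prepare_pairs maps to step_5_prepare_pairs and comes last in the plan, after every higher-numbered step.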
B2. Renumber existing steps (highest → lowest)
Perform renaming from highest index down to N_insert_after + 1 to avoid collisions.
For each step index k in descending order where k > N_insert_after:
1. Compute new index
   - k_new = k + 1.
2. Rename step directories
   - Code: code/step_{k}_{name}/ → code/step_{k_new}_{name}/.
   - Data: data/step_{k}_{name}/ → data/step_{k_new}_{name}/ (if exists).
   - Figures: figures/step_{k}_{name}/ → figures/step_{k_new}_{name}/ (if exists).
3. Rename tests
   - code/tests/test_step_{k}.py → code/tests/test_step_{k_new}.py.
4. Update configuration keys
   - In pipeline_config.yaml, change: step_{k}_{name}: → step_{k_new}_{name}:.
5. Update string references and imports
   - Use text search for step_{k}_{name} and test_step_{k} across the repo and update to the new IDs:
     - Imports like from step_{k}_{name}....
     - Dependency lists in dependencies properties (e.g., return ["step_{k}_{name}", ...]).
     - Any key strings that reference step_{k}_{name}.
6. Update doc references
   - In README.md files and notebooks, update any textual references to the old step name or number, if present.
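The directory and test renames for a single index k could be scripted roughly as below. This is a sketch under assumptions: `shift_step` is a hypothetical helper, the repo layout (code/, data/, figures/, code/tests/) is taken from the paths named above, and it only moves files. Config keys, imports, and dependency strings still have to be updated separately.

```python
from __future__ import annotations

from pathlib import Path


def shift_step(repo: Path, k: int, name: str, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Rename one step's directories and test file from index k to k + 1.

    Returns the (source, destination) pairs that exist on disk. With
    dry_run=True nothing is renamed, so the plan can be reviewed first.
    """
    old_id, new_id = f"step_{k}_{name}", f"step_{k + 1}_{name}"
    tests = repo / "code" / "tests"
    candidates = [
        (repo / "code" / old_id, repo / "code" / new_id),
        (repo / "data" / old_id, repo / "data" / new_id),
        (repo / "figures" / old_id, repo / "figures" / new_id),
        (tests / f"test_step_{k}.py", tests / f"test_step_{k + 1}.py"),
    ]
    moves = [(src, dst) for src, dst in candidates if src.exists()]
    if not dry_run:
        for src, dst in moves:
            # A destination that already exists means a higher index was skipped.
            if dst.exists():
                raise FileExistsError(f"{dst} exists; rename higher indices first")
            src.rename(dst)
    return moves
```

Calling it once per index, from the highest k down to N_insert_after + 1, keeps test_step_{k + 1}.py free by the time test_step_{k}.py is moved.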
B3. Add the new step
After all affected steps k > N_insert_after have been shifted to k + 1:
1. Create code/step_{N_new}_{snake_name}/
   - Follow Workflow A, steps 2–7 to:
     - Implement step.py and data_classes.py.
     - Add numbered substeps.
     - Add README.md.
     - Add explore.ipynb.
2. Wire into pipeline_config.yaml
   - Under steps: add step_{N_new}_{snake_name}: with its configuration and substeps.
3. Update dependencies
   - For the new step:
     - Set dependencies to the upstream steps, using the renumbered IDs.
   - For downstream steps:
     - Review their dependencies properties: replace any old IDs that were shifted, and add the new step as a dependency where appropriate.
4. Add test file
   - Create code/tests/test_step_{N_new}.py following neighboring step tests.
5. Sanity check references
   - Run a repo-wide search for any old step IDs (step_{k}_{name} where k was renumbered) and ensure all references are either removed or updated to the new IDs.
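The repo-wide search for leftover IDs can be done with any text search tool; a minimal Python sketch is shown here for cases where a scripted check is handy. `find_stale_references` and the list of file suffixes are assumptions for illustration, not project code.

```python
from __future__ import annotations

from pathlib import Path

# File types likely to mention step IDs; adjust to the repo as needed.
TEXT_SUFFIXES = {".py", ".yaml", ".yml", ".md", ".ipynb", ".json"}


def find_stale_references(repo: Path, old_ids: list[str]) -> list[tuple[Path, int, str]]:
    """Return (file, line_number, old_id) for every remaining mention of an old ID."""
    hits = []
    for path in sorted(repo.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for old_id in old_ids:
                if old_id in line:
                    hits.append((path, lineno, old_id))
    return hits
```

An empty result for all renumbered IDs (for instance `["step_4_prepare_pairs"]` after that step became step_5_prepare_pairs) suggests the references were updated; it does not catch IDs assembled at runtime from separate strings.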
B4. Validate after renumbering
1. Run targeted tests
   - Run: cd code && pytest tests/test_step_{N_new}.py
   - Plus tests for all renumbered steps: test_step_{k_new}.py.
2. Run a dry pipeline
   - Optionally run:
     - cd code && python run_pipeline.py --list-steps to confirm updated IDs and ordering.
     - cd code && python run_pipeline.py --step step_{N_new}_{snake_name} to test the new step in context.
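Alongside the commands above, a quick consistency check between the configuration and the directories can catch a missed rename. The sketch assumes the steps: section of pipeline_config.yaml has already been parsed into a dict keyed by step ID (for example with a YAML library); `check_registration` is a hypothetical helper.

```python
from __future__ import annotations


def check_registration(config_steps: dict, step_dirs: list[str]) -> dict[str, list[str]]:
    """Compare step IDs in the parsed config against step directory names."""
    keys, dirs = set(config_steps), set(step_dirs)
    return {
        # Directories that exist but have no entry under steps:
        "missing_from_config": sorted(dirs - keys),
        # Config entries whose directory was renamed or never created
        "missing_directory": sorted(keys - dirs),
    }
```

Both lists being empty means every step directory has a matching key and vice versa; it does not validate the contents of each entry.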
Notebook Guidelines (Quick Reference)
When creating or editing explore.ipynb for a step:
- Follow the standard sections:
  - # Imports
  - # Load Data
    - ## Load Inputs
    - ## Load Outputs
  - # Plot
  - # Notes
- Ensure each heading is in its own markdown cell to enable collapsible sections.
- Use the data classes for loading inputs/outputs, not raw paths.
- Display visualization substep figures first under # Plot; additional exploratory plots come after.
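The heading-only rule can be checked by reading the notebook as JSON. This sketch relies only on the standard .ipynb layout (a top-level `cells` list whose entries have `cell_type` and `source`); the function name `mixed_heading_cells` is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def mixed_heading_cells(notebook_path: Path) -> list[int]:
    """Return indices of markdown cells that contain a heading plus other lines."""
    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    offenders = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        lines = [line for line in text.splitlines() if line.strip()]
        has_heading = any(line.lstrip().startswith("#") for line in lines)
        if has_heading and len(lines) > 1:
            offenders.append(index)
    return offenders
```

Any index it returns points at a cell whose heading should be split into its own markdown cell.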
Usage Summary
When asked to add a new step:
- Decide whether it is an append (Workflow A) or insert with renumbering (Workflow B).
- Follow the appropriate workflow carefully, especially:
  - Directory and file naming: step_{N}_{snake_name}, test_step_{N}.py.
  - Dependency updates and imports.
  - pipeline_config.yaml step and substep entries.
  - explore.ipynb structure and data class usage.
- Always finish by running the relevant tests and, if feasible, a pipeline run of the new step.