clinops.split
clinops.split.splitters.TemporalSplitter
Split clinical data on a datetime cutoff.
All rows with time_col < cutoff go to train; all rows with time_col >= cutoff go to test. Of the splitters documented here, this is the only one that respects temporal ordering and so prevents information from the future leaking into the training set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cutoff | str \| Timestamp \| None | Datetime string, pd.Timestamp, or None. If None, the cutoff is derived from train_frac. | None |
| train_frac | float | Fraction of the time range to use for training when cutoff is None. | 0.8 |
| time_col | str | Name of the datetime column. Default 'charttime'. | 'charttime' |
Examples:
>>> from clinops.split.splitters import TemporalSplitter
>>> # Explicit cutoff: rows before 2155-01-01 train, the rest test
>>> splitter = TemporalSplitter(cutoff="2155-01-01")
>>> result = splitter.split(df)
>>> print(result.summary())
>>> # Auto-cutoff at 80% of the time range
>>> splitter = TemporalSplitter(train_frac=0.8, time_col="admittime")
>>> result = splitter.split(df)
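To make the cutoff rule concrete, here is a minimal pandas-only sketch of a temporal split. It is not the clinops implementation: the helper name `temporal_split` is hypothetical, and the auto-cutoff rule (the point `train_frac` of the way through the observed time range) is an assumption based on the parameter descriptions above.

```python
import pandas as pd


def temporal_split(df, time_col="charttime", cutoff=None, train_frac=0.8):
    """Hypothetical helper: rows before the cutoff train, the rest test."""
    times = pd.to_datetime(df[time_col])
    if cutoff is None:
        # Assumed auto-cutoff rule: the point train_frac of the way
        # through the observed time range.
        t_min, t_max = times.min(), times.max()
        cutoff = t_min + (t_max - t_min) * train_frac
    cutoff = pd.Timestamp(cutoff)
    train = df[times < cutoff]
    test = df[times >= cutoff]
    return train, test, cutoff


df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3, 3],
        "charttime": pd.to_datetime(
            ["2155-01-01", "2155-01-03", "2155-01-05", "2155-01-10", "2155-01-11"]
        ),
        "heart_rate": [80, 85, 90, 72, 75],
    }
)

train, test, cutoff = temporal_split(df, train_frac=0.8)

# Every training row precedes every test row, so nothing from the
# future leaks into training.
assert train["charttime"].max() < test["charttime"].min()
```

Note that a purely temporal split can place the same patient on both sides of the cutoff; the patient-level splitters address that separately.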
Source code in clinops/split/splitters.py
split
Split df into train and test sets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. Must contain the time_col column. | required |
Returns:
| Type | Description |
|---|---|
| SplitResult | Train and test DataFrames plus split metadata. |
clinops.split.splitters.PatientSplitter
Split clinical data at the patient level.
Ensures that all rows for a given patient fall entirely in train or entirely in test, so no patient appears in both splits. This is required to prevent label leakage in multi-admission datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_col | str | Patient identifier column. Default 'subject_id'. | 'subject_id' |
| test_size | float | Fraction of patients to hold out for testing. Default 0.2. | 0.2 |
| random_state | int | Random seed for reproducibility. Default 42. | 42 |
Examples:
>>> from clinops.split.splitters import PatientSplitter
>>> splitter = PatientSplitter(id_col="subject_id", test_size=0.2)
>>> result = splitter.split(df)
>>> # Verify no patient leakage
>>> assert not set(result.train["subject_id"]) & set(result.test["subject_id"])
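The patient-boundary guarantee can be illustrated with a short sketch that shuffles unique patient IDs and holds out a fraction of them. This is an assumption about the general technique, not the library's code: the helper `patient_split` and the rounding rule for the number of held-out patients are hypothetical.

```python
import numpy as np
import pandas as pd


def patient_split(df, id_col="subject_id", test_size=0.2, random_state=42):
    """Hypothetical helper: assign whole patients, not rows, to a split."""
    patients = df[id_col].unique()
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(patients)
    # Assumed rounding rule; hold out at least one patient so the
    # test set is never empty.
    n_test = max(1, int(round(len(shuffled) * test_size)))
    test_ids = shuffled[:n_test]
    in_test = df[id_col].isin(test_ids)
    return df[~in_test], df[in_test]


# Ten patients with three rows each.
df = pd.DataFrame(
    {"subject_id": np.repeat(np.arange(10), 3), "value": np.arange(30)}
)

train, test = patient_split(df, test_size=0.2)

# No patient appears on both sides of the split.
assert not set(train["subject_id"]) & set(test["subject_id"])
```

Because the fraction applies to patients rather than rows, the row-level test share can differ from test_size when patients contribute unequal numbers of rows.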
split
Split df into train and test sets at the patient level.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. Must contain the id_col column. | required |
Returns:
| Type | Description |
|---|---|
| SplitResult | Train and test DataFrames plus split metadata. |
clinops.split.splitters.StratifiedPatientSplitter
StratifiedPatientSplitter(
id_col="subject_id",
outcome_col="mortality",
test_size=0.2,
patient_outcome_fn=None,
random_state=42,
)
Patient-level split with outcome stratification.
Combines the patient-boundary guarantee of PatientSplitter with stratification on a binary or multi-class outcome column. Ensures the outcome rate in train and test approximately matches the population rate, which is important for imbalanced clinical outcomes (e.g., in-hospital mortality typically 5–15%).
The algorithm:
1. Compute a per-patient outcome (e.g., positive if any admission is positive).
2. Sample positive and negative patients separately, each at test_size.
3. Combine the two samples, so the test set has approximately the same positive rate as the full population.
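The three steps above can be sketched with pandas and NumPy as follows. This is an illustration under stated assumptions rather than the library's implementation: the helper `stratified_patient_split` is hypothetical, the per-patient label uses the "any positive" rule, and the per-class sample size is rounded to the nearest integer.

```python
import numpy as np
import pandas as pd


def stratified_patient_split(df, id_col="subject_id", outcome_col="mortality",
                             test_size=0.2, random_state=42):
    """Hypothetical helper following the three documented steps."""
    # Step 1: one label per patient; positive if any row is positive.
    labels = df.groupby(id_col)[outcome_col].max().astype(int)
    rng = np.random.default_rng(random_state)
    test_ids = []
    # Step 2: sample each outcome class separately at test_size
    # (rounding to the nearest integer is an assumption).
    for value in sorted(labels.unique()):
        ids = labels.index[labels == value].to_numpy()
        n_test = int(round(len(ids) * test_size))
        test_ids.extend(rng.permutation(ids)[:n_test])
    # Step 3: combine the per-class samples into one test set.
    in_test = df[id_col].isin(test_ids)
    return df[~in_test], df[in_test], labels


# 50 patients, two rows each; patients 0-4 are positive (10%).
rows = []
for pid in range(50):
    rows.append({"subject_id": pid, "mortality": 0})
    rows.append({"subject_id": pid, "mortality": int(pid < 5)})
df = pd.DataFrame(rows)

train, test, labels = stratified_patient_split(df, test_size=0.2)

population_rate = labels.mean()
test_rate = labels[labels.index.isin(test["subject_id"])].mean()
print(f"population: {population_rate:.2f}, test: {test_rate:.2f}")
```

With very few positive patients, rounding can leave a class out of the test set entirely, so it is worth checking the per-split outcome rates after splitting.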
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id_col | str | Patient identifier column. Default 'subject_id'. | 'subject_id' |
| outcome_col | str | Binary outcome column (0/1 or bool). Default 'mortality'. | 'mortality' |
| test_size | float | Fraction of patients to hold out. Default 0.2. | 0.2 |
| patient_outcome_fn | Callable[[Series], int] \| None | Function that maps a patient's group of outcome values to a scalar label. Default (None): a patient with any positive observation is labeled positive. | None |
| random_state | int | Random seed. Default 42. | 42 |
Examples:
>>> from clinops.split.splitters import StratifiedPatientSplitter
>>> splitter = StratifiedPatientSplitter(
...     id_col="subject_id",
...     outcome_col="hospital_expire_flag",
...     test_size=0.2,
... )
>>> result = splitter.split(df)
>>> print(result.summary())
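As an illustration of what a custom patient_outcome_fn might look like, the sketch below compares the default "any positive" rule with a function that uses only the last row per patient. The function `outcome_at_last_row`, the `sepsis_flag` column, and the use of `groupby(...).agg` to apply it are assumptions for demonstration; how clinops invokes the callable internally is not shown in this page.

```python
import pandas as pd


def outcome_at_last_row(values: pd.Series) -> int:
    """Hypothetical rule: label a patient by their final recorded outcome."""
    return int(values.iloc[-1])


df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 2, 3],
        "sepsis_flag": [1, 0, 0, 1, 0],
    }
)

# Default rule as documented: any positive observation makes the
# patient positive.
any_positive = df.groupby("subject_id")["sepsis_flag"].max()

# Custom rule: only the last row per patient counts.
last_row = df.groupby("subject_id")["sepsis_flag"].agg(outcome_at_last_row)

print(pd.DataFrame({"any_positive": any_positive, "last_row": last_row}))

# A function with this signature could then be passed to the splitter:
# StratifiedPatientSplitter(outcome_col="sepsis_flag",
#                           patient_outcome_fn=outcome_at_last_row)
```

Patient 1 is positive under the default rule but negative under the custom one, which changes which stratum that patient is sampled from.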
split
Split df with patient-level stratification on outcome.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. Must contain the id_col and outcome_col columns. | required |
Returns:
| Type | Description |
|---|---|
| SplitResult | Train and test DataFrames plus split metadata. |
clinops.split.splitters.SplitResult
dataclass
The result of a train/test split operation.
Attributes:
| Name | Type | Description |
|---|---|---|
| train | DataFrame | Training set DataFrame. |
| test | DataFrame | Test set DataFrame. |
| metadata | dict[str, Any] | Dict with split statistics (sizes, outcome rates, cutoff, etc.) |
summary
Return a human-readable summary of the split.
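For readers who want a feel for the shape of this object, here is a hypothetical stand-in dataclass with the three documented attributes and a simple summary method. The class name `SplitResultSketch` and the summary format are invented for illustration; the actual output of SplitResult.summary() may look different.

```python
from dataclasses import dataclass, field
from typing import Any

import pandas as pd


@dataclass
class SplitResultSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    train: pd.DataFrame
    test: pd.DataFrame
    metadata: dict[str, Any] = field(default_factory=dict)

    def summary(self) -> str:
        # Guard against division by zero for an empty split.
        total = max(1, len(self.train) + len(self.test))
        lines = [
            f"train rows: {len(self.train)} ({len(self.train) / total:.1%})",
            f"test rows: {len(self.test)} ({len(self.test) / total:.1%})",
        ]
        # Append whatever statistics the splitter recorded.
        lines += [f"{key}: {value}" for key, value in self.metadata.items()]
        return "\n".join(lines)


result = SplitResultSketch(
    train=pd.DataFrame({"x": range(8)}),
    test=pd.DataFrame({"x": range(2)}),
    metadata={"cutoff": "2155-01-01"},
)
print(result.summary())
```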