pydpf.datautils.StateSpaceDataset
- class pydpf.datautils.StateSpaceDataset(data_path: Path | str, *, series_id_column: str = 'series_id', state_prefix: str | None = None, observation_prefix: str = 'observation', time_column: str | None = None, control_prefix: str | None = None, device: device = device(type='cpu'), series_metadata_path: Path | str | None = None)
Bases: Dataset
Dataset class for state-observation data.
Latent state of the system stored in the state Tensor.
Dimensions are Discrete Time - Batch - Data
When the dataset is accessed through a data loader, you must use the custom collate function. Data will always be returned in the order ‘state’ - ‘observation’ - ‘time’ - ‘control’ - ‘series_metadata’.
At the moment, the only supported mode is loading the entire dataset into RAM/VRAM. Lazy loading is a planned feature.
- Parameters:
- data_path: Union[Path,str].
The path of the data file or folder.
- series_id_column: str. Default “series_id”.
The heading of the series_id column in the csv files.
- state_prefix: str|None. Default None.
The prefix of the headings of the state columns in the csv files.
- observation_prefix: str. Default “observation”.
The prefix of the headings of the observation columns in the csv files.
- time_column: str|None. Default None.
The heading of the time column in the csv files.
- control_prefix: str|None. Default None.
The prefix of the headings of the control columns in the csv files.
- device: torch.device. Default torch.device(‘cpu’).
The device on which to store the data.
- series_metadata_path: Union[Path,str]|None. Default None.
The path of the separate csv file that holds the series_metadata, indexed by a series_id column.
Notes
We provide methods to load data from files, obeying a certain format, into a map-style torch.utils.data.Dataset object, so that it can be accessed easily from a torch.utils.data.DataLoader. We allow one of two data storage formats: either storing the entire dataset in a single .csv file, or storing each trajectory in separate files {1.csv, 2.csv, …, T.csv} in a dedicated directory.
The .csv files are formed of headed columns. There must be at least one observation column, with state, time, and control columns being optional. As all the data categories, apart from time, are vector valued, there can be multiple columns for each category. For the single-file format there must additionally be a series_id column that will be used to index each trajectory; for the multiple-file format the series_id is encoded in the file name. The data category series_metadata exists to store exogenous variables that the trajectories might depend on but that are constant over a trajectory. These are to be stored in a separate .csv indexed by a series_id column.
Given a file in the required format, loading a dataset is simple: initialise this class with the data’s path, the column labels and the device on which to store the data retrieved by the data loader. When initialising the data loader, it is crucial that the argument collate_fn is set to dataset.collate, where dataset is the dataset passed to the data loader. PyTorch’s default collate function will not return the data in a format that obeys PyDPF conventions. When looping over the data loader, data is returned as a tuple in the ordering state - observation - time - control - series_metadata, with only the fields that exist being returned.
See test_trajectory.csv at John-JoB/pydpf for an example.
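As a rough illustration of the single-file format and the loading steps described above, the sketch below writes a small .csv with the standard library and then defines a loader function. The column names (state_1, observation_1, observation_2), the helper names write_single_file_dataset and load_with_pydpf, and the batch size are assumptions made for illustration; only the prefix convention and the constructor arguments come from this page.

```python
import csv
import tempfile
from pathlib import Path


def write_single_file_dataset(path):
    """Write two short trajectories in the single-file layout."""
    # Column names are illustrative: the prefix convention comes from the
    # documentation, but the exact suffix scheme ("_1", "_2") is an assumption.
    header = ["series_id", "time", "state_1", "observation_1", "observation_2"]
    rows = []
    for series_id in (1, 2):
        for t in range(3):
            state = 0.5 * t + series_id
            rows.append([series_id, t, state, state + 0.1, state - 0.1])
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return header, rows


def load_with_pydpf(path, batch_size=2):
    """Build a data loader over the file (requires pydpf and torch)."""
    import torch
    from torch.utils.data import DataLoader
    from pydpf.datautils import StateSpaceDataset

    dataset = StateSpaceDataset(
        path,
        series_id_column="series_id",
        state_prefix="state",
        observation_prefix="observation",
        time_column="time",
        device=torch.device("cpu"),
    )
    # collate_fn must be the dataset's own collate method.
    return DataLoader(dataset, batch_size=batch_size, collate_fn=dataset.collate)


csv_path = Path(tempfile.mkdtemp()) / "trajectories.csv"
header, rows = write_single_file_dataset(csv_path)
```

With pydpf and torch installed, iterating over load_with_pydpf(csv_path) should yield tuples in the order state, observation, time for this file, since no control or series_metadata columns are present.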
Note
When initialising a torch.utils.data.DataLoader, the argument collate_fn must be set to dataset.collate, where dataset is the instance of this class passed to the data loader.
- __init__(data_path: Path | str, *, series_id_column: str = 'series_id', state_prefix: str | None = None, observation_prefix: str = 'observation', time_column: str | None = None, control_prefix: str | None = None, device: device = device(type='cpu'), series_metadata_path: Path | str | None = None)
Methods
- __init__(data_path, *[, series_id_column, ...])
- apply(f[, modified_series])
Apply a function across all trajectories
- collate(batch)
Pass to the collate_fn parameter of any torch.utils.data.DataLoader object that uses this dataset.
- deterministic_split(ratios)
- normalise_dims([normalised_series, ...])
Normalise the data to have mean zero and standard deviation one.
- random_split(ratios, generator)
- save(path[, series_metadata_path, ...])
Save a dataset to a file or folder
- select(indices)
Attributes
- control
The dataset control as a tensor
- control_dimension
The dimension of the control actions
- observation
The dataset observation as a tensor
- observation_dimension
The dimension of each observation
- series_metadata
The dataset series_metadata as a tensor
- state
The dataset state as a tensor
- state_dimension
The dimension of the state
- time
The dataset time as a tensor
- apply(f, modified_series: str = 'observation')
Apply a function across all trajectories
Takes a function f that accepts a **dictionary of data categories, i.e. the categories passed as keyword arguments, e.g. f = lambda time, state, **kwargs: time * state for a function that returns the state multiplied by the time, and replaces the series given by modified_series with the output of f for every trajectory in the dataset.
- Parameters:
- f: function
The function to be applied across all trajectories
- modified_series: str. Default ‘observation’
The series to replace with the output of f
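The calling convention described for apply can be sketched in plain Python. The helper apply_sketch below is a hypothetical stand-in, not the library's implementation, and it uses lists in place of tensors; it only shows how a function with named categories plus **kwargs receives each trajectory's data and how its output replaces the chosen series.

```python
def apply_sketch(trajectories, f, modified_series="observation"):
    """Hypothetical stand-in for apply: call f on each trajectory's categories."""
    for trajectory in trajectories:
        # Each trajectory is a dict of data categories, unpacked as keyword arguments.
        trajectory[modified_series] = f(**trajectory)
    return trajectories


# Named arguments pick out the categories f needs; **kwargs absorbs the rest.
# With tensors this would simply be: lambda time, state, **kwargs: time * state
f = lambda time, state, **kwargs: [t * s for t, s in zip(time, state)]

trajectories = [
    {"time": [0.0, 1.0, 2.0], "state": [1.0, 2.0, 3.0], "observation": [9.0, 9.0, 9.0]},
    {"time": [0.0, 1.0, 2.0], "state": [4.0, 5.0, 6.0], "observation": [9.0, 9.0, 9.0]},
]
result = apply_sketch(trajectories, f)
```

Accepting **kwargs means f keeps working when the dataset carries extra categories, such as control, that f does not use.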
- collate(batch) → Tuple[Tensor, ...]
Pass to the collate_fn parameter of any torch.utils.data.DataLoader object that uses this dataset.
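To see why a custom collate function is needed, recall that this class uses the dimension order Discrete Time - Batch - Data, whereas PyTorch's default collate function stacks samples along a new leading batch axis. The sketch below is not the library's collate; it is a plain-Python illustration, with nested lists standing in for tensors and the hypothetical helper stack_time_first, of stacking trajectories so that time is the leading axis.

```python
def stack_time_first(batch):
    """Stack per-trajectory sequences [T][D] into a [T][B][D] nested list."""
    # zip(*batch) groups the t-th step of every trajectory together,
    # so time becomes the leading axis and batch the second.
    return [list(step) for step in zip(*batch)]


batch = [
    [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]],        # trajectory 0: T=3, D=2
    [[10.0, 10.5], [20.0, 20.5], [30.0, 30.5]],  # trajectory 1: T=3, D=2
]
stacked = stack_time_first(batch)
```

Here stacked[t][b] is the data vector of trajectory b at time-step t.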
- property control
The dataset control as a tensor
- property control_dimension
The dimension of the control actions
- deterministic_split(ratios)
- normalise_dims(normalised_series: str = 'observation', scale_dims: str = 'all', individual_timesteps: bool = False, dims: Tuple[int] | None = None) → Tuple[Tensor, Tensor]
Normalise the data to have mean zero and standard deviation one.
This function normalises the data in place and returns the offset and scale, such that the original data can be recovered by original_data = normalised_data * scale + offset.
This function can be applied to either the state or the observations; this is controlled by the parameter normalised_series.
- There are various methods to control the scaling, determined by the value of scale_dims:
‘all’: scale each dimension independently, such that every dimension has standard deviation 1.
‘max’: scale each dimension by the same factor, such that the maximum of the standard deviations is 1.
‘min’: scale each dimension by the same factor, such that the minimum of the standard deviations is 1.
‘norm’: scale each dimension by the same factor, such that the standard deviation of the vector norm of the data is 1.
The parameter individual_timesteps controls whether to apply the same normalisation across time-steps, or to calculate a separate mean and standard deviation per time-step.
The normalisation doesn’t have to be across all data dimensions: one can pass a tuple of the dimensions to include to the parameter dims, or set dims=None to use all dimensions.
- Parameters:
- normalised_series: str, default=’observation’
The series to normalise, e.g. ‘state’ or ‘observation’.
- scale_dims: str, default=’all’
The method to scale over dimensions. See above for options and details.
- individual_timesteps: bool, default=False
When True, the scaling and offset are calculated per time-step; when False, the scaling and offset are the same for every time-step (in most cases this should be True).
- dims: Tuple[int] or None, default=None
The dimensions to normalise.
- Returns:
- offset: torch.Tensor
The per-element offset.
- scaling: torch.Tensor
The per-element scaling.
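The relationship between offset, scale and the scale_dims options can be checked on a toy example. The sketch below is not the library's implementation: normalise_sketch is a hypothetical helper working on plain lists, it uses the population standard deviation (an assumption; the library may use a different estimator), it covers only the ‘all’, ‘max’ and ‘min’ options, and it ignores individual_timesteps and dims.

```python
from statistics import mean, pstdev


def normalise_sketch(data, scale_dims="all"):
    """Return (normalised, offset, scale) for data given as rows of D-vectors."""
    columns = list(zip(*data))
    offset = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    if scale_dims == "all":
        scale = stds  # one factor per dimension
    elif scale_dims == "max":
        scale = [max(stds)] * len(stds)  # shared factor: largest std becomes 1
    elif scale_dims == "min":
        scale = [min(stds)] * len(stds)  # shared factor: smallest std becomes 1
    else:
        raise ValueError(f"unsupported scale_dims in this sketch: {scale_dims}")
    normalised = [
        [(x - o) / s for x, o, s in zip(row, offset, scale)] for row in data
    ]
    return normalised, offset, scale


data = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
normalised, offset, scale = normalise_sketch(data, scale_dims="max")

# Undo the normalisation: original_data = normalised_data * scale + offset
recovered = [
    [x * s + o for x, o, s in zip(row, offset, scale)] for row in normalised
]
```

With ‘max’, both dimensions share the scale of the second (more spread-out) column, so the relative spread of the two dimensions is preserved, whereas ‘all’ would bring each to standard deviation 1 separately.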
- property observation
The dataset observation as a tensor
- property observation_dimension
The dimension of each observation
- random_split(ratios, generator)
- save(path: Path | str, series_metadata_path: Path | str | None = None, n_processes: int = -1, bypass_ask=False)
Save a dataset to a file or folder
- Parameters:
- path: Path
The path to save the dataset to.
- series_metadata_path: Path or None
The path to save metadata to.
- n_processes: int
The number of processes to use if saving to a folder rather than a single file.
Notes
If path ends in “.csv” then all trajectories will be saved in a single csv file at that path. If it is a directory then the trajectories will be saved in separate csv files in that directory.
- select(indices)
- property series_metadata
The dataset series_metadata as a tensor
- property state
The dataset state as a tensor
- property state_dimension
The dimension of the state
- property time
The dataset time as a tensor