pydpf.datautils.StateSpaceDataset

class pydpf.datautils.StateSpaceDataset(data_path: Path | str, *, series_id_column: str = 'series_id', state_prefix: str | None = None, observation_prefix: str = 'observation', time_column: str | None = None, control_prefix: str | None = None, device: device = device(type='cpu'), series_metadata_path: Path | str | None = None)

Bases: Dataset

Dataset class for state-observation data.

The latent state of the system is stored in the state tensor.

Dimensions are Discrete Time - Batch - Data

When used with a data loader, you must pass the custom collate function (see collate). Data will always be returned in the order ‘state’ - ‘observation’ - ‘time’ - ‘control’ - ‘series_metadata’.

At the moment only functionality to load entire data set into RAM/VRAM is provided. Lazy loading is a planned feature.

Parameters:

data_path: Union[Path,str]: The path of the data file or folder.
series_id_column: str. Default “series_id”: The heading of the series_id column in the csv files.
state_prefix: str|None. Default None: The heading prefix of the state columns in the csv files.
observation_prefix: str. Default “observation”: The heading prefix of the observation columns in the csv files.
time_column: str|None. Default None: The heading of the time column in the csv files.
control_prefix: str|None. Default None: The heading prefix of the control columns in the csv files.
device: torch.device. Default torch.device(‘cpu’): The device on which to store the data retrieved by the data loader.
series_metadata_path: Union[Path,str]|None. Default None: The path of the separate csv file that stores the series_metadata, indexed by a series_id column.

Notes

We provide methods to load data from files in a fixed format into a map-style torch.utils.data.Dataset object, so that it can be accessed easily from a torch.utils.data.DataLoader. Two data storage formats are allowed: either the entire dataset is stored in a single .csv file, or each trajectory is stored in a separate file {1.csv, 2.csv, …, T.csv} in a dedicated directory.

The .csv files consist of headed columns. There must be at least one observation column; state, time, and control columns are optional. Since every data category apart from time is vector valued, there can be multiple columns for each category. The single-file format additionally requires a series_id column that is used to index each trajectory; in the multiple-file format the series_id is encoded in the file name. The data category series_metadata exists to store exogenous variables that the trajectories might depend on but that are constant over a trajectory. These are to be stored in a separate .csv indexed by a series_id column.

Given a file in the required format, loading a dataset is simple: initialise this class with the data’s path, the column labels, and the device on which to store the data retrieved by the data loader. When initialising the data loader, it is crucial that the argument collate_fn is set to dataset.collate, where dataset is the dataset passed to the data loader. PyTorch’s default collate function will not return the data in a format that obeys PyDPF conventions. When looping over the data loader, data is returned as a tuple in the order state - observation - time - control - series_metadata, with only the fields that exist being returned.

See test_trajectory.csv at John-JoB/pydpf for an example.
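As a rough illustration of the two storage formats described in the notes, the sketch below writes the same toy dataset once as a single file and once as one file per trajectory, using only the standard library. The column headings `state_1`, `observation_1` and `observation_2`, the file name `trajectories.csv` and the choice to drop the series_id column in the per-trajectory files are assumptions made for this example; the notes only state that columns are identified by their heading prefix and that the series_id is encoded in the file name.

```python
import csv
import tempfile
from pathlib import Path

# Two short trajectories with a 1-D state and a 2-D observation.
# ASSUMPTION: "<prefix>_<index>" headings; any heading scheme that starts
# with the configured prefix may be what the loader actually expects.
header = ["series_id", "time", "state_1", "observation_1", "observation_2"]
rows = [
    [1, 0.0, 0.10, 0.12, -0.30],
    [1, 1.0, 0.25, 0.21, -0.11],
    [2, 0.0, -0.40, -0.38, 0.90],
    [2, 1.0, -0.35, -0.30, 0.75],
]

root = Path(tempfile.mkdtemp())

# Format 1: the whole dataset in one file, trajectories indexed by series_id.
single_file = root / "trajectories.csv"
with single_file.open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(header)
    writer.writerows(rows)

# Format 2: one file per trajectory in a dedicated directory.
# The series_id lives in the file name, so the column is dropped here
# (ASSUMPTION: the loader does not need it in this format).
folder = root / "trajectories"
folder.mkdir()
for series_id in sorted({row[0] for row in rows}):
    with (folder / f"{series_id}.csv").open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header[1:])
        writer.writerows(row[1:] for row in rows if row[0] == series_id)

written_files = sorted(p.name for p in folder.iterdir())
```

With files like these, the state columns would presumably be picked up by passing state_prefix='state' and time_column='time' to the constructor, since both default to None.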

Note

When initialising a torch.utils.data.DataLoader the argument collate_fn must be set to dataset.collate where dataset is the instance of this class passed to the data loader.
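The collate requirement can be sketched as follows. This is a minimal example that uses only the constructor arguments listed in the signature above; the helper name `build_loader`, the batch size and the choice of `state_prefix="state"` are assumptions for illustration, not part of PyDPF.

```python
def build_loader(data_path, batch_size=32):
    """Hypothetical helper: load a dataset and wrap it in a DataLoader."""
    # Imports live inside the function so the sketch can be read and
    # imported on a machine without PyTorch or PyDPF installed.
    import torch
    from torch.utils.data import DataLoader
    from pydpf.datautils import StateSpaceDataset

    dataset = StateSpaceDataset(
        data_path,
        series_id_column="series_id",
        state_prefix="state",  # ASSUMPTION: the file has state columns
        observation_prefix="observation",
        device=torch.device("cpu"),
    )
    # collate_fn must be the dataset's own collate method; PyTorch's default
    # collate function does not follow the PyDPF conventions.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=dataset.collate,
    )


# Usage (only state and observation exist here, so each batch is a 2-tuple):
# for state, observation in build_loader("trajectories.csv"):
#     ...
```

Because only the fields that exist are returned, the number of items to unpack in the loop depends on which of state, time, control and series_metadata were loaded.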

__init__(data_path: Path | str, *, series_id_column: str = 'series_id', state_prefix: str | None = None, observation_prefix: str = 'observation', time_column: str | None = None, control_prefix: str | None = None, device: device = device(type='cpu'), series_metadata_path: Path | str | None = None)

Methods

`__init__`(data_path, *[, series_id_column, ...])
`apply`(f[, modified_series])	Apply a function across all trajectories
`collate`(batch)	Pass to the `collate_fn` parameter of any `torch.utils.data.DataLoader` object that uses this dataset.
`deterministic_split`(ratios)
`normalise_dims`([normalised_series, ...])	Normalise the data to have mean zero and standard deviation one.
`random_split`(ratios, generator)
`save`(path[, series_metadata_path, ...])	Save a dataset to a file or folder
`select`(indices)

Attributes

`control`	The dataset control as a tensor
`control_dimension`	The dimension of the control actions
`observation`	The dataset observation as a tensor
`observation_dimension`	The dimension of each observation
`series_metadata`	The dataset series_metadata as a tensor
`state`	The dataset state as a tensor
`state_dimension`	The dimension of the state
`time`	The dataset time as a tensor

apply(f, modified_series: str = 'observation')

Apply a function across all trajectories

Takes a function f that accepts the data categories as keyword arguments (an unpacked ** dictionary), e.g. f = lambda time, state, **kwargs: time * state for a function that returns the state multiplied by the time. The series given by modified_series is replaced with the output of f for every trajectory in the dataset.

Parameters:

f: function: function to be applied across all trajectories
modified_series: str. Default ‘observation’: The series to replace with the output of f
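The calling convention for f can be sketched without PyDPF. In the example below plain floats stand in for tensors, and the dictionary keys are assumed to match the data category names (time, state, observation); the function name `time_weighted_state` is hypothetical.

```python
def time_weighted_state(time, state, **kwargs):
    """Return the state multiplied by the time.

    Categories the function does not use (observation, control, ...)
    are absorbed by **kwargs, so f keeps working if more are passed.
    """
    return time * state


# ASSUMPTION: each trajectory is handed to f as an unpacked dictionary
# keyed by data category name. Floats stand in for tensors here.
trajectory = {"time": 2.0, "state": 1.5, "observation": 0.7}
result = time_weighted_state(**trajectory)
```

Under these assumptions, dataset.apply(time_weighted_state, modified_series='observation') would overwrite the observation series with the product for every trajectory.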

collate(batch) → Tuple[Tensor, ...]: Pass to the collate_fn parameter of any torch.utils.data.DataLoader object that uses this dataset.

property control: The dataset control as a tensor

property control_dimension: The dimension of the control actions

deterministic_split(ratios)

normalise_dims(normalised_series: str = 'observation', scale_dims: str = 'all', individual_timesteps: bool = False, dims: Tuple[int] | None = None) → Tuple[Tensor, Tensor]

Normalise the data to have mean zero and standard deviation one.

This function normalises the data in place and returns the offset and scale, such that the original data can be recovered by original_data = normalised_data * scale + offset.

This function can be applied to either the state or the observations; this is controlled by the parameter normalised_series.

There are various methods to control the scaling, determined by the value of scale_dims:

‘all’: scale each dimension independently, such that every dimension has standard deviation 1.
‘max’: scale each dimension by the same factor, such that the maximum of the standard deviations is 1.
‘min’: scale each dimension by the same factor, such that the minimum of the standard deviations is 1.
‘norm’: scale each dimension by the same factor, such that the standard deviation of the vector norm of the data is 1.

The parameter individual_timesteps controls whether to apply the same normalisation across time-steps, or to calculate a separate mean and standard deviation per time-step.

The normalisation doesn’t have to be across all data dimensions, one can specify a tuple of dimensions to include to the parameter dims. Or set dims=None to use all dimensions.

Parameters:

normalised_series: str, default=‘observation’: The series to normalise, e.g. ‘state’ or ‘observation’.
scale_dims: str: The method to scale over dimensions. See above for options and details.
individual_timesteps: bool, default=False: When True, the scaling and offset are calculated per time-step; when False, the scaling and offset are the same for each time-step (in most cases this should be True).
dims: Tuple[int] or None, default=None: The dimensions to normalise.

Returns:

offset: torch.Tensor: The per-element offset.
scaling: torch.Tensor: The per-element scaling.
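The relationship between the returned offset and scale and the ‘all’, ‘max’ and ‘min’ options can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper `normalise` is hypothetical, it works on a list of vectors rather than tensors, it uses the population standard deviation (PyDPF may use a different estimator), and it omits the ‘norm’ option, individual_timesteps and dims.

```python
from statistics import mean, pstdev


def normalise(rows, scale_dims="all"):
    """Return (normalised_rows, offset, scale) for equal-length vectors."""
    columns = list(zip(*rows))  # one tuple per data dimension
    offset = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    if scale_dims == "all":
        scale = stds  # every dimension ends up with std 1
    elif scale_dims == "max":
        scale = [max(stds)] * len(stds)  # largest std becomes 1
    elif scale_dims == "min":
        scale = [min(stds)] * len(stds)  # smallest std becomes 1
    else:
        raise ValueError(f"unsupported scale_dims in this sketch: {scale_dims}")
    normalised = [
        [(x - o) / s for x, o, s in zip(row, offset, scale)] for row in rows
    ]
    return normalised, offset, scale


rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
normalised, offset, scale = normalise(rows, "all")

# The original data is recovered by normalised * scale + offset.
recovered = [
    [x * s + o for x, o, s in zip(row, offset, scale)] for row in normalised
]
```

With ‘all’, each dimension gets its own scale; with ‘max’ or ‘min’ a single factor is shared, so the relative spread between dimensions is preserved.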

property observation: The dataset observation as a tensor

property observation_dimension: The dimension of each observation

random_split(ratios, generator)

save(path: Path | str, series_metadata_path: Path | str | None = None, n_processes: int = -1, bypass_ask=False)

Save a dataset to a file or folder

Parameters:

path: Path: The path to save the dataset to.
series_metadata_path: Path or None: The path to save metadata to.
n_processes: int: The number of processes to use if saving to a folder rather than a single file.

Notes

If path ends in “.csv” then all trajectories will be saved in a single csv file at that path. If it is a directory then the trajectories will be saved in separate csv files in that directory.

select(indices)

property series_metadata: The dataset series_metadata as a tensor

property state: The dataset state as a tensor

property state_dimension: The dimension of the state

property time: The dataset time as a tensor