secml.data.splitter

CDataSplitter

class secml.data.splitter.c_datasplitter.CDataSplitter(num_folds=3, random_state=None)

Bases: secml.core.c_creator.CCreator

Abstract class that defines basic methods for dataset splitting.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.

Attributes

class_type: Defines class type.
logger: Logger for current object.
tr_idx: List of training idx obtained with the split of the data.
ts_idx: List of test idx obtained with the split of the data.
verbose: Verbosity level of logger output.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

abstract compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

split(self, dataset)

Returns a list of split datasets.

Parameters

dataset (CDataset): Dataset to split.

Returns

split_ds (list of tuple): List of tuples (training set, test set), one for each fold.
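
As a rough illustration of how the list returned by split relates to the tr_idx and ts_idx lists, the sketch below pairs each fold's indices with the samples they select. It uses plain Python lists in place of CDataset and CArray, and `split_by_indices` is a hypothetical helper written for this example, not a secml function.

```python
def split_by_indices(samples, tr_idx, ts_idx):
    """Return one (training set, test set) tuple per fold."""
    folds = []
    for tr, ts in zip(tr_idx, ts_idx):
        train = [samples[i] for i in tr]  # samples selected for training
        test = [samples[i] for i in ts]   # samples selected for testing
        folds.append((train, test))
    return folds


samples = [[1, 2], [3, 4], [5, 6]]
tr_idx = [[0, 1], [0, 2], [1, 2]]
ts_idx = [[2], [1], [0]]
for train, test in split_by_indices(samples, tr_idx, ts_idx):
    print(train, test)
```

The number of tuples equals the number of folds, which is why num_folds corresponds to the length of both index lists.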

property tr_idx: List of training idx obtained with the split of the data.

property ts_idx: List of test idx obtained with the split of the data.

CDataSplitterKFold

class secml.data.splitter.c_datasplitter_kfold.CDataSplitterKFold(num_folds=3, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

K-Folds dataset splitting.

Provides train/test indices to split data in train and test sets. Split dataset into ‘num_folds’ consecutive folds (with shuffling).

Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterKFold

>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> kfold = CDataSplitterKFold(num_folds=3, random_state=0).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [0 1]), CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 2])]
>>> print(kfold.ts_idx)
[CArray(1,)(dense: [2]), CArray(1,)(dense: [1]), CArray(1,)(dense: [0])]
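
The fold construction described above can be sketched in plain Python. This is a minimal illustration of the k-fold idea under stated assumptions, not the secml implementation: `kfold_indices` is a hypothetical helper, it uses the standard `random` module rather than a NumPy RandomState, and the exact indices it produces will not match the output shown above.

```python
import random


def kfold_indices(num_samples, num_folds=3, seed=None):
    """Return (tr_idx, ts_idx), each holding one index list per fold."""
    if not 2 <= num_folds <= num_samples:
        raise ValueError("num_folds must be between 2 and num_samples")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, then cut into folds
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(num_samples, num_folds)
    tr_idx, ts_idx = [], []
    start = 0
    for fold in range(num_folds):
        stop = start + base + (1 if fold < extra else 0)
        ts_idx.append(sorted(indices[start:stop]))
        tr_idx.append(sorted(indices[:start] + indices[stop:]))
        start = stop
    return tr_idx, ts_idx


tr_idx, ts_idx = kfold_indices(num_samples=3, num_folds=3, seed=0)
print(tr_idx)
print(ts_idx)
```

Every sample lands in exactly one test fold, and each training list is the complement of the matching test list.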

Attributes

class_type (‘kfold’): Defines class type.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

CDataSplitterLabelKFold

class secml.data.splitter.c_datasplitter_labelkfold.CDataSplitterLabelKFold(num_folds=3)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

K-Folds dataset splitting with non-overlapping labels.

The same label will not appear in two different folds (the number of distinct labels has to be at least equal to the number of folds).

The folds are approximately balanced in the sense that the number of distinct labels is approximately the same in each fold.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterLabelKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]], [1,0,1,2])
>>> kfold = CDataSplitterLabelKFold(num_folds=3).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [1 3]), CArray(3,)(dense: [0 1 2]), CArray(3,)(dense: [0 2 3])]
>>> print(kfold.ts_idx)
[CArray(2,)(dense: [0 2]), CArray(1,)(dense: [3]), CArray(1,)(dense: [1])]
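
To make the non-overlapping label constraint concrete, here is a small sketch that assigns every distinct label to exactly one test fold. It is only an illustration: `label_kfold_indices` is a hypothetical helper, and the round-robin assignment of labels to folds is an assumption, so the fold order can differ from the secml output above.

```python
def label_kfold_indices(labels, num_folds=3):
    """Return (tr_idx, ts_idx) where each label is tested in one fold only."""
    distinct = sorted(set(labels))
    if len(distinct) < num_folds:
        raise ValueError("need at least num_folds distinct labels")
    # Round-robin assignment keeps the number of distinct labels per fold
    # approximately equal.
    fold_of_label = {label: i % num_folds for i, label in enumerate(distinct)}
    tr_idx, ts_idx = [], []
    for fold in range(num_folds):
        ts_idx.append([i for i, y in enumerate(labels) if fold_of_label[y] == fold])
        tr_idx.append([i for i, y in enumerate(labels) if fold_of_label[y] != fold])
    return tr_idx, ts_idx


tr_idx, ts_idx = label_kfold_indices([1, 0, 1, 2], num_folds=3)
print(tr_idx)
print(ts_idx)
```

Because both samples with label 1 share a test fold, no label appears in the test sets of two different folds.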

Attributes

class_type (‘label-kfold’): Defines class type.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

CDataSplitterOpenWorldKFold

class secml.data.splitter.c_datasplitter_openworld.CDataSplitterOpenWorldKFold(num_folds=3, n_train_samples=5, n_train_classes=None, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Open World K-Folds dataset splitting.

Provides train/test indices to split data in train and test sets.

In an Open World setting, half (or custom number) of the dataset classes are used for training, while all dataset classes are tested.

Split dataset into ‘num_folds’ consecutive folds (with shuffling).

Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
n_train_samples (int, optional): Number of training samples per client (training class). Default 5.
n_train_classes (int or None, optional): Number of dataset classes to use for training. If not specified, half of the dataset classes are used (floored).
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterOpenWorldKFold

>>> ds = CDataset([[1,2],[3,4],[5,6],[10,20],[30,40],[50,60],
...                [100,200],[300,400]],[1,0,1,2,0,1,0,2])
>>> kfold = CDataSplitterOpenWorldKFold(
...     num_folds=3, n_train_samples=2, random_state=0).compute_indices(ds)
>>> kfold.num_folds
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [2 5]), CArray(2,)(dense: [1 4]), CArray(2,)(dense: [0 2])]
>>> print(kfold.ts_idx)
[CArray(6,)(dense: [0 1 3 4 6 7]), CArray(6,)(dense: [0 2 3 5 6 7]), CArray(6,)(dense: [1 3 4 5 6 7])]
>>> print(kfold.tr_classes)  # Class 2 is skipped as there are not enough samples (at least 3)
[CArray(1,)(dense: [1]), CArray(1,)(dense: [0]), CArray(1,)(dense: [1])]
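
The Open World setting in the example can be approximated with the following sketch of a single fold. It is not the secml algorithm: `open_world_fold` is a hypothetical helper, and the eligibility rule (a class needs more than n_train_samples samples so that at least one is left for testing) is an assumption inferred from the comment in the example above.

```python
import random


def open_world_fold(labels, n_train_samples=5, n_train_classes=None, seed=None):
    """Return (tr_idx, ts_idx, tr_classes) for one open-world fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    if n_train_classes is None:
        n_train_classes = len(by_class) // 2  # half of the classes, floored
    # Assumption: skip classes that would have no sample left for testing.
    eligible = sorted(c for c, idx in by_class.items()
                      if len(idx) > n_train_samples)
    if len(eligible) < n_train_classes:
        raise ValueError("not enough classes with enough samples")
    tr_classes = sorted(rng.sample(eligible, n_train_classes))
    tr_idx = []
    for c in tr_classes:
        tr_idx.extend(rng.sample(by_class[c], n_train_samples))
    chosen = set(tr_idx)
    # Every sample not used for training is tested, whatever its class.
    ts_idx = [i for i in range(len(labels)) if i not in chosen]
    return sorted(tr_idx), ts_idx, tr_classes


labels = [1, 0, 1, 2, 0, 1, 0, 2]
print(open_world_fold(labels, n_train_samples=2, seed=0))
```

With three classes, one class is used for training; class 2 is never selected because it has only two samples.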

Attributes

class_type (‘open-world-kfold’): Defines class type.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

property tr_classes: List of training classes obtained with the split of the data.

CDataSplitterShuffle

class secml.data.splitter.c_datasplitter_shuffle.CDataSplitterShuffle(num_folds=3, train_size=None, test_size=None, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Random permutation dataset splitting.

Yields indices to split data into training and test sets.

Note: contrary to other dataset splitting strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
train_size (float, int, or None, optional): If None (default), the value is automatically set to the complement of the test size. If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
test_size (float, int, or None, optional): If None (default), the value is automatically set to the complement of the train size. If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.

Notes

train_size and test_size cannot both be None. If one is set to None, the other should be a float, representing a proportion of the dataset, or an integer.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterShuffle

>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.5, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(1,)(dense: [0]), CArray(1,)(dense: [1]), CArray(1,)(dense: [1])]
>>> shuffle.ts_idx
[CArray(2,)(dense: [2 1]), CArray(2,)(dense: [2 0]), CArray(2,)(dense: [0 2])]

>>> # Setting the train_size or the test_size to an arbitrary percentage
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.2, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(0,)(dense: []), CArray(0,)(dense: []), CArray(0,)(dense: [])]
>>> shuffle.ts_idx
[CArray(3,)(dense: [2 1 0]), CArray(3,)(dense: [2 0 1]), CArray(3,)(dense: [0 2 1])]
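
The way float, int, and None sizes turn into sample counts can be sketched as follows. This is an illustration under assumptions, not the secml code: `resolve_sizes` and `shuffle_split_indices` are hypothetical helpers, and the rounding (train proportion floored, test proportion rounded up) is chosen only because it reproduces the set sizes in the examples above.

```python
import math
import random


def resolve_sizes(num_samples, train_size=None, test_size=None):
    """Turn float/int/None sizes into absolute (n_train, n_test) counts."""
    if train_size is None and test_size is None:
        raise ValueError("train_size and test_size cannot both be None")

    def to_count(size, rounding):
        # Floats are proportions of the dataset, ints are absolute counts.
        if isinstance(size, float):
            return int(rounding(size * num_samples))
        return int(size)

    if train_size is None:
        n_test = to_count(test_size, math.ceil)
        n_train = num_samples - n_test
    elif test_size is None:
        n_train = to_count(train_size, math.floor)
        n_test = num_samples - n_train
    else:
        n_train = to_count(train_size, math.floor)
        n_test = to_count(test_size, math.ceil)
    if n_train < 0 or n_test < 0 or n_train + n_test > num_samples:
        raise ValueError("requested sizes do not fit the dataset")
    return n_train, n_test


def shuffle_split_indices(num_samples, num_folds=3, train_size=None,
                          test_size=None, seed=None):
    """Return (tr_idx, ts_idx) with one independent random split per fold."""
    n_train, n_test = resolve_sizes(num_samples, train_size, test_size)
    rng = random.Random(seed)
    tr_idx, ts_idx = [], []
    for _ in range(num_folds):
        perm = list(range(num_samples))
        rng.shuffle(perm)  # a fresh permutation for every fold
        ts_idx.append(perm[:n_test])
        tr_idx.append(perm[n_test:n_test + n_train])
    return tr_idx, ts_idx


print(resolve_sizes(3, train_size=0.5))
print(resolve_sizes(3, train_size=0.2))
```

Since every fold draws its own permutation, two folds can coincide, which is the caveat stated in the class description.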

Attributes

class_type (‘shuffle’): Defines class type.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

CDataSplitterStratifiedKFold

class secml.data.splitter.c_datasplitter_stratkfold.CDataSplitterStratifiedKFold(num_folds=3, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Stratified K-Folds dataset splitting.

Provides train/test indices to split data into train and test sets.

This dataset splitting object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Parameters

num_folds (int, optional): Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists. For stratified K-Fold, this cannot be higher than the minimum number of samples per class in the dataset.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterStratifiedKFold

>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,0,1])
>>> stratkfold = CDataSplitterStratifiedKFold(num_folds=2, random_state=0).compute_indices(ds)
>>> stratkfold.num_folds  # Cannot be higher than the number of samples per class
2
>>> stratkfold.tr_idx
[CArray(2,)(dense: [1 3]), CArray(2,)(dense: [0 2])]
>>> stratkfold.ts_idx
[CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 3])]
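
Stratification can be sketched by dealing the samples of each class across the folds in turn, so every test fold keeps roughly the class proportions of the whole dataset. This is a simplified illustration, not the secml implementation; `stratified_kfold_indices` is a hypothetical helper and its index order will not match the output above.

```python
import random


def stratified_kfold_indices(labels, num_folds=3, seed=None):
    """Return (tr_idx, ts_idx) with class proportions preserved per fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    smallest = min(len(idx) for idx in by_class.values())
    if num_folds > smallest:
        raise ValueError("num_folds cannot exceed the smallest class size")
    test_folds = [[] for _ in range(num_folds)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal the samples of this class to the folds one at a time.
        for pos, i in enumerate(idx):
            test_folds[pos % num_folds].append(i)
    everything = set(range(len(labels)))
    ts_idx = [sorted(fold) for fold in test_folds]
    tr_idx = [sorted(everything - set(fold)) for fold in test_folds]
    return tr_idx, ts_idx


tr_idx, ts_idx = stratified_kfold_indices([1, 0, 0, 1], num_folds=2, seed=0)
print(tr_idx)
print(ts_idx)
```

The size check mirrors the constraint on num_folds: with fewer samples in a class than folds, some fold would miss that class entirely.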

Attributes

class_type (‘strat-kfold’): Defines class type.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Returns a list of split datasets.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

CDataSplitter: Instance of the dataset splitter with tr/ts indices.

CTrainTestSplit

class secml.data.splitter.c_train_test_split.CTrainTestSplit(train_size=None, test_size=None, random_state=None, shuffle=True)

Bases: secml.core.c_creator.CCreator

Train and Test Sets splitter.

Split dataset into random train and test subsets.

Quick utility that wraps CDataSplitterShuffle().compute_indices(ds) for splitting (and optionally subsampling) data in a one-liner.

Parameters

train_size (float, int, or None, optional): If None (default), the value is automatically set to the complement of the test size. If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
test_size (float, int, or None, optional): If None (default), the value is automatically set to the complement of the train size. If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.
shuffle (bool, optional): Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None. Default True.

Notes

train_size and test_size cannot both be None. If one is set to None, the other should be a float, representing a proportion of the dataset, or an integer.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CTrainTestSplit

>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,1,1])
>>> tr, ts = CTrainTestSplit(train_size=0.5, random_state=0).split(ds)
>>> tr.num_samples
2
>>> ts.num_samples
2

>>> # Get splitting indices without shuffle
>>> tr_idx, ts_idx = CTrainTestSplit(train_size=0.25,
...     random_state=0, shuffle=False).compute_indices(ds)
>>> tr_idx
CArray(1,)(dense: [0])
>>> ts_idx
CArray(3,)(dense: [1 2 3])

>>> # At least one sample is needed for each set
>>> tr, ts = CTrainTestSplit(train_size=0.2, random_state=0).split(ds)
Traceback (most recent call last):
    ...
ValueError: train_size should be at least 1 or 0.25
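
The behaviour shown in these examples (an ordered split when shuffle=False, and an error when the training set would be empty) can be mimicked with a short sketch. It is an assumption-laden illustration, not the secml code: `train_test_indices` is a hypothetical helper, it only handles train_size, and it truncates the proportion to an integer count.

```python
import random


def train_test_indices(num_samples, train_size, shuffle=True, seed=None):
    """Return (tr_idx, ts_idx) for a single train/test split."""
    if isinstance(train_size, float):
        n_train = int(train_size * num_samples)  # proportion of the dataset
    else:
        n_train = int(train_size)                # absolute number of samples
    if n_train < 1:
        raise ValueError(
            "train_size should be at least 1 or {}".format(1 / num_samples))
    if n_train >= num_samples:
        raise ValueError("train_size leaves no samples for the test set")
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


print(train_test_indices(4, train_size=0.25, shuffle=False))
```

With four samples, a proportion of 0.2 corresponds to less than one sample, so the call is rejected instead of returning an empty training set.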

Attributes

class_type: Defines class type.
logger: Logger for current object.
tr_idx: Training set indices obtained with the split of the data.
ts_idx: Test set indices obtained with the split of the data.
verbose: Verbosity level of logger output.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices for each fold.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Split dataset into training set and test set.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters

dataset (CDataset): Dataset to split.

Returns

tr_idx, ts_idx (CArray): Flat arrays with the tr/ts indices.

split(self, dataset)

Split dataset into training set and test set.

Parameters

dataset (CDataset): Dataset to split.

Returns

ds_train, ds_test (CDataset): Train and Test datasets.

property tr_idx: Training set indices obtained with the split of the data.

property ts_idx: Test set indices obtained with the split of the data.

CChronologicalSplitter

class secml.data.splitter.c_chronological_splitter.CChronologicalSplitter(th_timestamp, train_size=1.0, test_size=1.0, random_state=None, shuffle=True)

Bases: secml.core.c_creator.CCreator

Dataset splitter based on timestamps.

Split dataset into train and test subsets, using a timestamp as the split point.

A dataset containing timestamp and timestamp_fmt header attributes is required.

Parameters

th_timestamp (str): The split point in time between training and test set. Samples having timestamp <= th_timestamp will be put in the training set, while samples with timestamp > th_timestamp will be used for the test set. The timestamp must follow the ISO 8601 format. Incomplete timestamps are parsed as well.
train_size (float or int, optional): If float, should be between 0.0 and 1.0 and represents the proportion of the samples having timestamp <= th_timestamp to include in the train split. Default 1.0. If int, represents the absolute number of train samples.
test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represents the proportion of the samples having timestamp > th_timestamp to include in the test split. Default 1.0. If int, represents the absolute number of test samples.
random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Default None.
shuffle (bool, optional): Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None. Default True.
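
The threshold rule for th_timestamp can be illustrated with the standard library alone. The sketch below is not the secml implementation: `chronological_indices` is a hypothetical helper, the `fmt` argument stands in for the dataset's timestamp_fmt header attribute, and the subsampling controlled by train_size and test_size is left out.

```python
from datetime import datetime


def chronological_indices(timestamps, th_timestamp, fmt="%Y-%m-%d"):
    """Return (tr_idx, ts_idx) split at the ISO 8601 threshold timestamp."""
    threshold = datetime.fromisoformat(th_timestamp)
    tr_idx, ts_idx = [], []
    for i, stamp in enumerate(timestamps):
        when = datetime.strptime(stamp, fmt)
        # timestamp <= threshold goes to training, later samples to test.
        if when <= threshold:
            tr_idx.append(i)
        else:
            ts_idx.append(i)
    return tr_idx, ts_idx


stamps = ["2020-01-15", "2020-06-01", "2020-09-30", "2019-12-31"]
print(chronological_indices(stamps, "2020-06-01"))
```

A sample whose timestamp equals the threshold falls in the training set, matching the <= comparison in the parameter description.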

Attributes

class_type: Defines class type.
logger: Logger for current object.
tr_idx: Training set indices obtained with the split of the data.
ts_idx: Test set indices obtained with the split of the data.
verbose: Verbosity level of logger output.

Methods

`compute_indices`(self, dataset)	Compute training set and test set indices.
`copy`(self)	Returns a shallow copy of current class.
`create`([class_item])	This method creates an instance of a class with given type.
`deepcopy`(self)	Returns a deep copy of current class.
`get_class_from_type`(class_type)	Return the class associated with input type.
`get_params`(self)	Returns the dictionary of class parameters.
`get_subclasses`()	Get all the subclasses of the calling class.
`list_class_types`()	This method lists all types of available subclasses of calling one.
`load`(path)	Loads class from pickle object.
`save`(self, path)	Save class object using pickle.
`set`(self, param_name, param_value[, copy])	Set a parameter that has a specific name to a specific value.
`set_params`(self, params_dict[, copy])	Set all parameters passed as a dictionary {key: value}.
`split`(self, dataset)	Split dataset into training set and test set.
`timed`([msg])	Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices.

Parameters

dataset (CDataset): Dataset to split.

Returns

tr_idx, ts_idx (CArray): Flat arrays with the tr/ts indices.

split(self, dataset)

Split dataset into training set and test set.

Parameters

dataset (CDataset): Dataset to split.

Returns

ds_train, ds_test (CDataset): Train and Test datasets.

property tr_idx: Training set indices obtained with the split of the data.

property ts_idx: Test set indices obtained with the split of the data.