secml.data.splitter

CDataSplitter

class secml.data.splitter.c_datasplitter.CDataSplitter(num_folds=3, random_state=None)

Bases: secml.core.c_creator.CCreator

Abstract class that defines basic methods for dataset splitting.

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Attributes
class_type

Defines class type.

logger

Logger for current object.

tr_idx

List of training idx obtained with the split of the data.

ts_idx

List of test idx obtained with the split of the data.

verbose

Verbosity level of logger output.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

abstract compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

split(self, dataset)

Returns a list of split datasets.

Parameters
dataset : CDataset

Dataset to split.

Returns
split_ds : list of tuple

List of tuples (training set, test set), one for each fold.

property tr_idx

List of training idx obtained with the split of the data.

property ts_idx

List of test idx obtained with the split of the data.
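
The contract described above (compute_indices fills tr_idx and ts_idx with one entry per fold, and split returns one (training set, test set) tuple per fold) can be illustrated with a minimal plain-Python sketch. The class names SplitterSketch and LeaveOneOutSketch are hypothetical, lists stand in for CDataset and CArray, and having split recompute the indices first is an assumption about the control flow, not secml's actual code.

```python
from abc import ABC, abstractmethod


class SplitterSketch(ABC):
    """Hypothetical stand-in for an abstract splitter (not secml code)."""

    def __init__(self, num_folds=3):
        self.num_folds = num_folds
        self.tr_idx = []  # one list of training indices per fold
        self.ts_idx = []  # one list of test indices per fold

    @abstractmethod
    def compute_indices(self, dataset):
        """Fill tr_idx and ts_idx, then return self."""

    def split(self, dataset):
        # Assumption: indices are recomputed before building the pairs.
        self.compute_indices(dataset)
        return [([dataset[i] for i in tr], [dataset[i] for i in ts])
                for tr, ts in zip(self.tr_idx, self.ts_idx)]


class LeaveOneOutSketch(SplitterSketch):
    """Toy subclass: fold i holds out sample i (needs num_folds <= len(dataset))."""

    def compute_indices(self, dataset):
        n = len(dataset)
        self.ts_idx = [[i] for i in range(self.num_folds)]
        self.tr_idx = [[j for j in range(n) if j != i]
                       for i in range(self.num_folds)]
        return self


# One (training set, test set) tuple per fold.
folds = LeaveOneOutSketch(num_folds=3).split(["a", "b", "c"])
```

Returning self from compute_indices is what allows the chained calls used in the examples of the concrete splitters, such as CDataSplitterKFold(...).compute_indices(ds).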

CDataSplitterKFold

class secml.data.splitter.c_datasplitter_kfold.CDataSplitterKFold(num_folds=3, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

K-Folds dataset splitting.

Provides train/test indices to split data into train and test sets. Splits the dataset into ‘num_folds’ consecutive folds (with shuffling).

Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set.
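
As a rough illustration of the scheme just described, the following plain-Python sketch shuffles the sample indices once, cuts them into consecutive folds, and uses each fold once as the held-out set. The function name kfold_indices and the way the remainder is spread over the first folds are assumptions made for this sketch. It does not reproduce the exact indices or random stream of CDataSplitterKFold.

```python
import random


def kfold_indices(n_samples, num_folds=3, seed=None):
    """Return (tr_idx, ts_idx): two lists holding one index list per fold."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # shuffle once before cutting folds
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, num_folds)
    tr_idx, ts_idx, start = [], [], 0
    for k in range(num_folds):
        stop = start + base + (1 if k < extra else 0)
        # Fold k is held out; the other k - 1 folds form the training set.
        ts_idx.append(sorted(order[start:stop]))
        tr_idx.append(sorted(order[:start] + order[stop:]))
        start = stop
    return tr_idx, ts_idx


tr_idx, ts_idx = kfold_indices(7, num_folds=3, seed=0)
```

Every sample lands in exactly one held-out fold, so the union of the test index lists covers the whole dataset.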

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterKFold
>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> kfold = CDataSplitterKFold(num_folds=3, random_state=0).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [0 1]), CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 2])]
>>> print(kfold.ts_idx)
[CArray(1,)(dense: [2]), CArray(1,)(dense: [1]), CArray(1,)(dense: [0])]
Attributes
class_type : ‘kfold’

Defines class type.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

CDataSplitterLabelKFold

class secml.data.splitter.c_datasplitter_labelkfold.CDataSplitterLabelKFold(num_folds=3)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

K-Folds dataset splitting with non-overlapping labels.

The same label will not appear in two different folds (the number of distinct labels has to be at least equal to the number of folds).

The folds are approximately balanced in the sense that the number of distinct labels is approximately the same in each fold.
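
The two properties above (a label never appears in two different folds, and folds hold roughly the same number of distinct labels) can be sketched in plain Python. The function name label_kfold_indices and the round-robin assignment of sorted labels to folds are assumptions chosen for brevity. secml may order or balance the folds differently, so the fold order need not match the example output shown in this section.

```python
def label_kfold_indices(labels, num_folds=3):
    """Return (tr_idx, ts_idx) where each distinct label is tested in one fold only."""
    distinct = sorted(set(labels))
    if len(distinct) < num_folds:
        raise ValueError("need at least as many distinct labels as folds")
    # Round-robin assignment: distinct-label counts per fold differ by at most one.
    fold_of = {lab: pos % num_folds for pos, lab in enumerate(distinct)}
    tr_idx, ts_idx = [], []
    for k in range(num_folds):
        ts_idx.append([i for i, lab in enumerate(labels) if fold_of[lab] == k])
        tr_idx.append([i for i, lab in enumerate(labels) if fold_of[lab] != k])
    return tr_idx, ts_idx


tr_idx, ts_idx = label_kfold_indices([1, 0, 1, 2], num_folds=3)
```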

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterLabelKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]], [1,0,1,2])
>>> kfold = CDataSplitterLabelKFold(num_folds=3).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [1 3]), CArray(3,)(dense: [0 1 2]), CArray(3,)(dense: [0 2 3])]
>>> print(kfold.ts_idx)
[CArray(2,)(dense: [0 2]), CArray(1,)(dense: [3]), CArray(1,)(dense: [1])]
Attributes
class_type : ‘label-kfold’

Defines class type.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

CDataSplitterOpenWorldKFold

class secml.data.splitter.c_datasplitter_openworld.CDataSplitterOpenWorldKFold(num_folds=3, n_train_samples=5, n_train_classes=None, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Open World K-Folds dataset splitting.

Provides train/test indices to split data into train and test sets.

In an Open World setting, half (or custom number) of the dataset classes are used for training, while all dataset classes are tested.

Split dataset into ‘num_folds’ consecutive folds (with shuffling).

Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set.
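
To make the Open World idea concrete, here is a minimal plain-Python sketch of a single fold: a subset of the classes contributes a fixed number of training samples, and every remaining sample, from every class, goes to the test set. The function name open_world_fold is hypothetical, and the eligibility rule (a class needs more than n_train_samples samples, so that at least one is left for testing) is an assumption inferred from the example comment in this section. It is not secml's implementation.

```python
import random


def open_world_fold(labels, n_train_samples=5, n_train_classes=None, seed=None):
    """Return (tr_idx, ts_idx, tr_classes) for one open-world fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    if n_train_classes is None:
        n_train_classes = len(by_class) // 2  # half of the classes, floored
    # Assumption: skip classes that would have no sample left for testing.
    eligible = sorted(c for c, idx in by_class.items()
                      if len(idx) > n_train_samples)
    tr_classes = sorted(rng.sample(eligible, n_train_classes))
    tr_idx = sorted(i for c in tr_classes
                    for i in rng.sample(by_class[c], n_train_samples))
    tr_set = set(tr_idx)
    # Everything not used for training is tested, whatever its class.
    ts_idx = [i for i in range(len(labels)) if i not in tr_set]
    return tr_idx, ts_idx, tr_classes


labels = [1, 0, 1, 2, 0, 1, 0, 2]
tr, ts, tr_classes = open_world_fold(labels, n_train_samples=2, seed=0)
```

With three classes and the default n_train_classes, one class is used for training, and the test set still contains samples of all three classes.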

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists.

n_train_samples : int, optional

Number of training samples per client. Default 5.

n_train_classes : int or None

Number of dataset classes to use as training. If not specified, half of the dataset classes are used (rounded down).

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterOpenWorldKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[10,20],[30,40],[50,60],
...                [100,200],[300,400]],[1,0,1,2,0,1,0,2])
>>> kfold = CDataSplitterOpenWorldKFold(
...     num_folds=3, n_train_samples=2, random_state=0).compute_indices(ds)
>>> kfold.num_folds
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [2 5]), CArray(2,)(dense: [1 4]), CArray(2,)(dense: [0 2])]
>>> print(kfold.ts_idx)
[CArray(6,)(dense: [0 1 3 4 6 7]), CArray(6,)(dense: [0 2 3 5 6 7]), CArray(6,)(dense: [1 3 4 5 6 7])]
>>> print(kfold.tr_classes)  # Class 2 is skipped as there are not enough samples (at least 3)
[CArray(1,)(dense: [1]), CArray(1,)(dense: [0]), CArray(1,)(dense: [1])]
Attributes
class_type : ‘open-world-kfold’

Defines class type.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

property tr_classes

List of training classes obtained with the split of the data.

CDataSplitterShuffle

class secml.data.splitter.c_datasplitter_shuffle.CDataSplitterShuffle(num_folds=3, train_size=None, test_size=None, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Random permutation dataset splitting.

Yields indices to split data into training and test sets.

Note: contrary to other dataset splitting strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists.

train_size : float, int, or None, optional

If None (default), the value is automatically set to the complement of the test size. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.

test_size : float, int, or None, optional

If None (default), the value is automatically set to the complement of the train size. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Notes

train_size and test_size cannot both be None. If one is None, the other must be a float (a proportion of the dataset) or an integer (an absolute number of samples).
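
The size rules above can be sketched as a small plain-Python helper that turns train_size and test_size into sample counts. The name resolve_sizes is hypothetical. Rounding a float train size down and a float test size up is an assumption, chosen because it agrees with the train_size=0.5 and train_size=0.2 examples in this section. It is not taken from secml's source.

```python
import math


def resolve_sizes(n_samples, train_size=None, test_size=None):
    """Return (n_train, n_test) as absolute sample counts."""
    if train_size is None and test_size is None:
        raise ValueError("train_size and test_size cannot both be None")

    def to_count(size, rounding):
        if size is None:
            return None
        if isinstance(size, float):
            # A float is a proportion of the dataset.
            return int(rounding(size * n_samples))
        return int(size)  # an int is an absolute number of samples

    n_train = to_count(train_size, math.floor)  # assumption: round down
    n_test = to_count(test_size, math.ceil)     # assumption: round up
    # A missing size is the complement of the other one.
    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train
    if n_train + n_test > n_samples:
        raise ValueError("train and test sizes exceed the dataset size")
    return n_train, n_test


half = resolve_sizes(3, train_size=0.5)   # 1 train sample, 2 test samples
small = resolve_sizes(3, train_size=0.2)  # 0 train samples, 3 test samples
```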

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterShuffle
>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.5, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(1,)(dense: [0]), CArray(1,)(dense: [1]), CArray(1,)(dense: [1])]
>>> shuffle.ts_idx
[CArray(2,)(dense: [2 1]), CArray(2,)(dense: [2 0]), CArray(2,)(dense: [0 2])]
>>> # Setting the train_size or the test_size to an arbitrary percentage
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.2, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(0,)(dense: []), CArray(0,)(dense: []), CArray(0,)(dense: [])]
>>> shuffle.ts_idx
[CArray(3,)(dense: [2 1 0]), CArray(3,)(dense: [2 0 1]), CArray(3,)(dense: [0 2 1])]
Attributes
class_type : ‘shuffle’

Defines class type.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

CDataSplitterStratifiedKFold

class secml.data.splitter.c_datasplitter_stratkfold.CDataSplitterStratifiedKFold(num_folds=3, random_state=None)

Bases: secml.data.splitter.c_datasplitter.CDataSplitter

Stratified K-Folds dataset splitting.

Provides train/test indices to split data into train and test sets.

This dataset splitting object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.
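
A minimal plain-Python sketch of stratification: the samples of each class are dealt round-robin across the folds, so every fold keeps roughly the class proportions of the whole dataset. The function name stratified_kfold_indices is hypothetical, and shuffling is left out to keep the result deterministic, so the indices will generally differ from those of CDataSplitterStratifiedKFold.

```python
def stratified_kfold_indices(labels, num_folds=3):
    """Return (tr_idx, ts_idx) with class proportions preserved in each fold."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    smallest = min(len(idx) for idx in by_class.values())
    if num_folds > smallest:
        raise ValueError("num_folds exceeds the smallest class size")
    # Deal the samples of each class across the folds, one at a time.
    fold_of = {}
    for idx in by_class.values():
        for pos, i in enumerate(idx):
            fold_of[i] = pos % num_folds
    n = len(labels)
    ts_idx = [[i for i in range(n) if fold_of[i] == k] for k in range(num_folds)]
    tr_idx = [[i for i in range(n) if fold_of[i] != k] for k in range(num_folds)]
    return tr_idx, ts_idx


tr_idx, ts_idx = stratified_kfold_indices([1, 0, 0, 1], num_folds=2)
```

The size check reflects the constraint that the number of folds cannot exceed the number of samples in the smallest class.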

Parameters
num_folds : int, optional

Number of folds to create. Default 3. This corresponds to the length of the tr_idx and ts_idx lists. For stratified K-Fold, this cannot be higher than the minimum number of samples per class in the dataset.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterStratifiedKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,0,1])
>>> stratkfold = CDataSplitterStratifiedKFold(num_folds=2, random_state=0).compute_indices(ds)
>>> stratkfold.num_folds  # Cannot be higher than the number of samples per class
2
>>> stratkfold.tr_idx
[CArray(2,)(dense: [1 3]), CArray(2,)(dense: [0 2])]
>>> stratkfold.ts_idx
[CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 3])]
Attributes
class_type : ‘strat-kfold’

Defines class type.

Methods

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Returns a list of split datasets.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices for each fold.

Parameters
dataset : CDataset

Dataset to split.

Returns
CDataSplitter

Instance of the dataset splitter with tr/ts indices.

CTrainTestSplit

class secml.data.splitter.c_train_test_split.CTrainTestSplit(train_size=None, test_size=None, random_state=None, shuffle=True)

Bases: secml.core.c_creator.CCreator

Train and Test Sets splitter.

Split dataset into random train and test subsets.

Quick utility that wraps CDataSplitterShuffle().compute_indices(ds) for splitting (and optionally subsampling) data in a one-liner.

Parameters
train_size : float, int, or None, optional

If None (default), the value is automatically set to the complement of the test size. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.

test_size : float, int, or None, optional

If None (default), the value is automatically set to the complement of the train size. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle : bool, optional

Whether or not to shuffle the data before splitting. Default True.

Notes

train_size and test_size cannot both be None. If one is None, the other must be a float (a proportion of the dataset) or an integer (an absolute number of samples).
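
The effect of the shuffle flag can be sketched in plain Python, assuming the sizes have already been converted to sample counts. The function name train_test_indices is hypothetical. Taking the first n_train positions for training when shuffle=False is an assumption consistent with the shuffle=False example in this section, not a statement about secml's internals.

```python
import random


def train_test_indices(n_samples, n_train, n_test, shuffle=True, seed=None):
    """Return (tr_idx, ts_idx) as flat index lists."""
    order = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(order)  # permute before cutting
    # The first n_train positions are the train split, the next n_test the test split.
    return order[:n_train], order[n_train:n_train + n_test]


tr_idx, ts_idx = train_test_indices(4, n_train=1, n_test=3, shuffle=False)
```

If n_train + n_test is smaller than the dataset size, the remaining samples are left out of both splits, which corresponds to the optional subsampling mentioned above.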

Examples

>>> from secml.data import CDataset
>>> from secml.data.splitter import CTrainTestSplit
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,1,1])
>>> tr, ts = CTrainTestSplit(train_size=0.5, random_state=0).split(ds)
>>> tr.num_samples
2
>>> ts.num_samples
2
>>> # Get splitting indices without shuffle
>>> tr_idx, ts_idx = CTrainTestSplit(train_size=0.25,
...     random_state=0, shuffle=False).compute_indices(ds)
>>> tr_idx
CArray(1,)(dense: [0])
>>> ts_idx
CArray(3,)(dense: [1 2 3])
>>> # At least one sample is needed for each set
>>> tr, ts = CTrainTestSplit(train_size=0.2, random_state=0).split(ds)
Traceback (most recent call last):
    ...
ValueError: train_size should be at least 1 or 0.25
Attributes
class_type

Defines class type.

logger

Logger for current object.

tr_idx

Training set indices obtained with the split of the data.

ts_idx

Test set indices obtained with the split of the data.

verbose

Verbosity level of logger output.

Methods

compute_indices(self, dataset)

Compute training set and test set indices.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Split dataset into training set and test set.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices.

Parameters
dataset : CDataset

Dataset to split.

Returns
tr_idx, ts_idx : CArray

Flat arrays with the tr/ts indices.

split(self, dataset)

Split dataset into training set and test set.

Parameters
dataset : CDataset

Dataset to split.

Returns
ds_train, ds_test : CDataset

Train and Test datasets.

property tr_idx

Training set indices obtained with the split of the data.

property ts_idx

Test set indices obtained with the split of the data.

CChronologicalSplitter

class secml.data.splitter.c_chronological_splitter.CChronologicalSplitter(th_timestamp, train_size=1.0, test_size=1.0, random_state=None, shuffle=True)

Bases: secml.core.c_creator.CCreator

Dataset splitter based on timestamps.

Split dataset into train and test subsets, using a timestamp as split point.

A dataset containing timestamp and timestamp_fmt header attributes is required.

Parameters
th_timestamp : str

The split point in time between training and test set. Samples having timestamp <= th_timestamp will be put in the training set, while samples with timestamp > th_timestamp will be used for the test set. The timestamp must follow the ISO 8601 format. Any incomplete timestamp will be parsed too.

train_size : float or int, optional

If float, should be between 0.0 and 1.0 and represent the proportion of the samples having timestamp <= th_timestamp to include in the train split. Default 1.0. If int, represents the absolute number of train samples.

test_size : float or int, optional

If float, should be between 0.0 and 1.0 and represent the proportion of the samples having timestamp > th_timestamp to include in the test split. Default 1.0. If int, represents the absolute number of test samples.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle : bool, optional

Whether or not to shuffle the data before splitting. Default True.
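
The threshold rule described for th_timestamp (samples at or before the split point go to training, later samples go to test) can be sketched with the standard library. The function name chronological_indices is hypothetical, timestamps are passed as a plain list rather than read from the dataset header, and only complete ISO 8601 dates are used because datetime.fromisoformat accepts a narrower set of formats than the parser described above. Subsampling through train_size and test_size is not covered.

```python
from datetime import datetime


def chronological_indices(timestamps, th_timestamp):
    """Return (tr_idx, ts_idx) split around an ISO 8601 threshold."""
    threshold = datetime.fromisoformat(th_timestamp)
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    # timestamp <= threshold: training set; timestamp > threshold: test set.
    tr_idx = [i for i, t in enumerate(parsed) if t <= threshold]
    ts_idx = [i for i, t in enumerate(parsed) if t > threshold]
    return tr_idx, ts_idx


stamps = ["2019-03-01", "2020-06-15", "2019-12-31", "2021-01-10"]
tr_idx, ts_idx = chronological_indices(stamps, "2019-12-31")
```

The sample stamped exactly at the threshold ends up in the training set, as the <= comparison requires.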

Attributes
class_type

Defines class type.

logger

Logger for current object.

tr_idx

Training set indices obtained with the split of the data.

ts_idx

Test set indices obtained with the split of the data.

verbose

Verbosity level of logger output.

Methods

compute_indices(self, dataset)

Compute training set and test set indices.

copy(self)

Returns a shallow copy of current class.

create([class_item])

This method creates an instance of a class with given type.

deepcopy(self)

Returns a deep copy of current class.

get_class_from_type(class_type)

Return the class associated with input type.

get_params(self)

Returns the dictionary of class parameters.

get_subclasses()

Get all the subclasses of the calling class.

list_class_types()

This method lists all types of available subclasses of calling one.

load(path)

Loads class from pickle object.

save(self, path)

Save class object using pickle.

set(self, param_name, param_value[, copy])

Set a parameter that has a specific name to a specific value.

set_params(self, params_dict[, copy])

Set all parameters passed as a dictionary {key: value}.

split(self, dataset)

Split dataset into training set and test set.

timed([msg])

Timer decorator.

compute_indices(self, dataset)

Compute training set and test set indices.

Parameters
dataset : CDataset

Dataset to split.

Returns
tr_idx, ts_idx : CArray

Flat arrays with the tr/ts indices.

split(self, dataset)

Split dataset into training set and test set.

Parameters
dataset : CDataset

Dataset to split.

Returns
ds_train, ds_test : CDataset

Train and Test datasets.

property tr_idx

Training set indices obtained with the split of the data.

property ts_idx

Test set indices obtained with the split of the data.