secml.data.splitter
CDataSplitter
class secml.data.splitter.c_datasplitter.CDataSplitter(num_folds=3, random_state=None)
Bases: secml.core.c_creator.CCreator
Abstract class that defines basic methods for dataset splitting.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- Attributes
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
abstract compute_indices(self, dataset)
Compute training set and test set indices for each fold.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- CDataSplitter
Instance of the dataset splitter with tr/ts indices.
split(self, dataset)
Returns a list of split datasets.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- split_ds : list of tuple
List of tuples (training set, test set), one for each fold.
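As a rough illustration of how per-fold index lists become (training set, test set) pairs, the plain-Python sketch below applies index lists to a list of samples and labels. The helper name `split_by_indices` and the tuple layout are hypothetical and are not taken from secml, which works on CDataset and CArray objects.

```python
def split_by_indices(samples, labels, tr_idx, ts_idx):
    """Build one (training set, test set) pair per fold from index lists."""
    folds = []
    for tr, ts in zip(tr_idx, ts_idx):
        train = ([samples[i] for i in tr], [labels[i] for i in tr])
        test = ([samples[i] for i in ts], [labels[i] for i in ts])
        folds.append((train, test))
    return folds


samples = [[1, 2], [3, 4], [5, 6]]
labels = [1, 0, 1]
# Illustrative indices for three folds, one held-out sample per fold.
tr_idx = [[0, 1], [0, 2], [1, 2]]
ts_idx = [[2], [1], [0]]

folds = split_by_indices(samples, labels, tr_idx, ts_idx)
for (x_tr, y_tr), (x_ts, y_ts) in folds:
    print("train:", x_tr, y_tr, "test:", x_ts, y_ts)
```

The list has one entry per fold, so its length matches the length of the two index lists.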
property tr_idx
List of training indices obtained with the split of the data.
property ts_idx
List of test indices obtained with the split of the data.
CDataSplitterKFold
class secml.data.splitter.c_datasplitter_kfold.CDataSplitterKFold(num_folds=3, random_state=None)
Bases: secml.data.splitter.c_datasplitter.CDataSplitter
K-Folds dataset splitting.
Provides train/test indices to split data in train and test sets. Split dataset into ‘num_folds’ consecutive folds (with shuffling).
Each fold is then used once as a validation set, while the remaining k - 1 folds form the training set.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterKFold
>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> kfold = CDataSplitterKFold(num_folds=3, random_state=0).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [0 1]), CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 2])]
>>> print(kfold.ts_idx)
[CArray(1,)(dense: [2]), CArray(1,)(dense: [1]), CArray(1,)(dense: [0])]
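To make the K-Fold idea concrete, here is a minimal standard-library sketch of shuffled K-Fold index generation. It is not secml's implementation: the function name `kfold_indices`, the use of Python's `random` module, and the way leftover samples are spread over the first folds are all assumptions made for illustration, so the resulting indices will not match the output above.

```python
import random


def kfold_indices(n_samples, num_folds=3, seed=None):
    """Return (tr_idx, ts_idx), each holding one index list per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffle once, then cut into folds
    # Fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, num_folds)
    tr_idx, ts_idx = [], []
    start = 0
    for k in range(num_folds):
        stop = start + base + (1 if k < extra else 0)
        # The k-th slice is held out; everything else is used for training.
        ts_idx.append(sorted(idx[start:stop]))
        tr_idx.append(sorted(idx[:start] + idx[stop:]))
        start = stop
    return tr_idx, ts_idx


tr_idx, ts_idx = kfold_indices(n_samples=6, num_folds=3, seed=0)
for tr, ts in zip(tr_idx, ts_idx):
    print("train:", tr, "test:", ts)
```

Every sample lands in exactly one test fold, and within a fold the training and test indices never overlap.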
- Attributes
class_type: 'kfold'. Defines class type.
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
CDataSplitterLabelKFold
class secml.data.splitter.c_datasplitter_labelkfold.CDataSplitterLabelKFold(num_folds=3)
Bases: secml.data.splitter.c_datasplitter.CDataSplitter
K-Folds dataset splitting with non-overlapping labels.
The same label will not appear in two different folds (the number of distinct labels has to be at least equal to the number of folds).
The folds are approximately balanced in the sense that the number of distinct labels is approximately the same in each fold.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterLabelKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]], [1,0,1,2])
>>> kfold = CDataSplitterLabelKFold(num_folds=3).compute_indices(ds)
>>> print(kfold.num_folds)
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [1 3]), CArray(3,)(dense: [0 1 2]), CArray(3,)(dense: [0 2 3])]
>>> print(kfold.ts_idx)
[CArray(2,)(dense: [0 2]), CArray(1,)(dense: [3]), CArray(1,)(dense: [1])]
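The non-overlapping-labels constraint can be sketched in a few lines of plain Python. The version below deals the distinct labels to folds round-robin, which keeps the number of distinct labels per fold roughly equal. The function name `label_kfold_indices` and the round-robin assignment are assumptions for illustration; secml may assign labels to folds differently, so the fold order can differ from the output above.

```python
def label_kfold_indices(labels, num_folds=3):
    """Return (tr_idx, ts_idx) such that no label is shared between folds."""
    distinct = sorted(set(labels))
    if len(distinct) < num_folds:
        raise ValueError("need at least as many distinct labels as folds")
    # Deal the distinct labels to folds in round-robin order.
    fold_of_label = {lab: pos % num_folds for pos, lab in enumerate(distinct)}
    tr_idx, ts_idx = [], []
    for k in range(num_folds):
        ts_idx.append([i for i, lab in enumerate(labels)
                       if fold_of_label[lab] == k])
        tr_idx.append([i for i, lab in enumerate(labels)
                       if fold_of_label[lab] != k])
    return tr_idx, ts_idx


tr_idx, ts_idx = label_kfold_indices([1, 0, 1, 2], num_folds=3)
print(tr_idx)
print(ts_idx)
```

With the labels `[1, 0, 1, 2]`, both samples carrying label 1 (indices 0 and 2) always end up in the same test fold.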
- Attributes
class_type: 'label-kfold'. Defines class type.
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
CDataSplitterOpenWorldKFold
class secml.data.splitter.c_datasplitter_openworld.CDataSplitterOpenWorldKFold(num_folds=3, n_train_samples=5, n_train_classes=None, random_state=None)
Bases: secml.data.splitter.c_datasplitter.CDataSplitter
Open World K-Folds dataset splitting.
Provides train/test indices to split data in train and test sets.
In an Open World setting, half (or custom number) of the dataset classes are used for training, while all dataset classes are tested.
Split dataset into ‘num_folds’ consecutive folds (with shuffling).
Each fold is then used once as a validation set, while the remaining k - 1 folds form the training set.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
- n_train_samples : int, optional
Number of training samples per class (client). Default 5.
- n_train_classes : int or None
Number of dataset classes to use for training. If not specified, half of the dataset classes (rounded down) are used.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterOpenWorldKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[10,20],[30,40],[50,60],
...                [100,200],[300,400]],[1,0,1,2,0,1,0,2])
>>> kfold = CDataSplitterOpenWorldKFold(
...     num_folds=3, n_train_samples=2, random_state=0).compute_indices(ds)
>>> kfold.num_folds
3
>>> print(kfold.tr_idx)
[CArray(2,)(dense: [2 5]), CArray(2,)(dense: [1 4]), CArray(2,)(dense: [0 2])]
>>> print(kfold.ts_idx)
[CArray(6,)(dense: [0 1 3 4 6 7]), CArray(6,)(dense: [0 2 3 5 6 7]), CArray(6,)(dense: [1 3 4 5 6 7])]
>>> print(kfold.tr_classes)  # Class 2 is skipped as there are not enough samples (at least 3)
[CArray(1,)(dense: [1]), CArray(1,)(dense: [0]), CArray(1,)(dense: [1])]
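One way to picture the open-world split is the standard-library sketch below: for each fold, a subset of classes is drawn for training, a fixed number of samples is taken from each of them, and every remaining sample (from all classes) goes to the test set. The function name `open_world_indices` is hypothetical, and the eligibility rule (a class needs more than `n_train_samples` samples) is an assumption inferred from the comment in the example above; the indices produced will not match secml's output.

```python
import random
from collections import defaultdict


def open_world_indices(labels, num_folds=3, n_train_samples=5,
                       n_train_classes=None, seed=None):
    """Return (tr_idx, ts_idx, tr_classes), one entry per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    if n_train_classes is None:
        n_train_classes = len(by_class) // 2  # half of the classes, floored
    # Assumption: a class is usable for training only if at least one of its
    # samples is left over for testing.
    eligible = sorted(c for c, idx in by_class.items()
                      if len(idx) > n_train_samples)
    tr_idx, ts_idx, tr_classes = [], [], []
    for _ in range(num_folds):
        classes = sorted(rng.sample(eligible, n_train_classes))
        train = set()
        for c in classes:
            train.update(rng.sample(by_class[c], n_train_samples))
        tr_idx.append(sorted(train))
        # All samples not used for training are tested, whatever their class.
        ts_idx.append([i for i in range(len(labels)) if i not in train])
        tr_classes.append(classes)
    return tr_idx, ts_idx, tr_classes


labels = [1, 0, 1, 2, 0, 1, 0, 2]
tr_idx, ts_idx, tr_classes = open_world_indices(
    labels, num_folds=3, n_train_samples=2, seed=0)
for tr, ts, cls in zip(tr_idx, ts_idx, tr_classes):
    print("train:", tr, "test:", ts, "training classes:", cls)
```

With three classes and the default `n_train_classes`, one class is trained on per fold, and class 2 is never picked because it has only two samples.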
- Attributes
class_type: 'open-world-kfold'. Defines class type.
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
compute_indices(self, dataset)
Compute training set and test set indices for each fold.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- CDataSplitter
Instance of the dataset splitter with tr/ts indices.
property tr_classes
List of training classes obtained with the split of the data.
CDataSplitterShuffle
class secml.data.splitter.c_datasplitter_shuffle.CDataSplitterShuffle(num_folds=3, train_size=None, test_size=None, random_state=None)
Bases: secml.data.splitter.c_datasplitter.CDataSplitter
Random permutation dataset splitting.
Yields indices to split data into training and test sets.
Note: contrary to other dataset splitting strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists.
- train_size : float, int, or None, optional
If None (default), the value is automatically set to the complement of the test size. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, it represents the absolute number of train samples.
- test_size : float, int, or None, optional
If None (default), the value is automatically set to the complement of the train size. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Notes
train_size and test_size cannot both be None. If one is set to None, the other should be a float, representing a percentage, or an integer.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterShuffle
>>> ds = CDataset([[1,2],[3,4],[5,6]],[1,0,1])
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.5, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(1,)(dense: [0]), CArray(1,)(dense: [1]), CArray(1,)(dense: [1])]
>>> shuffle.ts_idx
[CArray(2,)(dense: [2 1]), CArray(2,)(dense: [2 0]), CArray(2,)(dense: [0 2])]
>>> # Setting the train_size or the test_size to an arbitrary percentage
>>> shuffle = CDataSplitterShuffle(num_folds=3, train_size=0.2, random_state=0).compute_indices(ds)
>>> shuffle.num_folds
3
>>> shuffle.tr_idx
[CArray(0,)(dense: []), CArray(0,)(dense: []), CArray(0,)(dense: [])]
>>> shuffle.ts_idx
[CArray(3,)(dense: [2 1 0]), CArray(3,)(dense: [2 0 1]), CArray(3,)(dense: [0 2 1])]
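The size arithmetic behind these results can be sketched with the standard library alone. In the sketch below, a float size is turned into a sample count and each fold draws its own random permutation. The function name `shuffle_split_indices` is hypothetical, and the rounding (floor for the training share, ceil for the test share) is an assumption chosen to agree with the examples above, where 0.5 of 3 samples gives 1 training sample and 0.2 of 3 samples gives none.

```python
import math
import random


def shuffle_split_indices(n_samples, num_folds=3, train_size=None,
                          test_size=None, seed=None):
    """Return (tr_idx, ts_idx) built from one random permutation per fold."""
    if train_size is None and test_size is None:
        raise ValueError("train_size and test_size cannot both be None")

    def to_count(size, rounding):
        # Floats are proportions of the dataset, ints are absolute counts.
        return rounding(size * n_samples) if isinstance(size, float) else size

    # Assumed rounding: floor for the training share, ceil for the test share.
    if train_size is None:
        n_test = to_count(test_size, math.ceil)
        n_train = n_samples - n_test
    else:
        n_train = to_count(train_size, math.floor)
        n_test = (n_samples - n_train if test_size is None
                  else to_count(test_size, math.ceil))
    if n_train < 0 or n_test < 0 or n_train + n_test > n_samples:
        raise ValueError("train and test sizes do not fit the dataset")

    rng = random.Random(seed)
    tr_idx, ts_idx = [], []
    for _ in range(num_folds):
        perm = list(range(n_samples))
        rng.shuffle(perm)  # folds are independent, so they may repeat
        ts_idx.append(perm[:n_test])
        tr_idx.append(perm[n_test:n_test + n_train])
    return tr_idx, ts_idx


tr_idx, ts_idx = shuffle_split_indices(3, num_folds=3, train_size=0.5, seed=0)
print([len(tr) for tr in tr_idx], [len(ts) for ts in ts_idx])
```

Because every fold is an independent permutation, nothing prevents two folds from being identical on a tiny dataset, which is the caveat stated in the class description.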
- Attributes
class_type: 'shuffle'. Defines class type.
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
CDataSplitterStratifiedKFold
class secml.data.splitter.c_datasplitter_stratkfold.CDataSplitterStratifiedKFold(num_folds=3, random_state=None)
Bases: secml.data.splitter.c_datasplitter.CDataSplitter
Stratified K-Folds dataset splitting.
Provides train/test indices to split data into train and test sets.
This dataset splitting object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.
- Parameters
- num_folds : int, optional
Number of folds to create. Default 3. This corresponds to the size of the tr_idx and ts_idx lists. For stratified K-Fold, this cannot be higher than the minimum number of samples per class in the dataset.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CDataSplitterStratifiedKFold
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,0,1])
>>> stratkfold = CDataSplitterStratifiedKFold(num_folds=2, random_state=0).compute_indices(ds)
>>> stratkfold.num_folds  # Cannot be higher than the number of samples per class
2
>>> stratkfold.tr_idx
[CArray(2,)(dense: [1 3]), CArray(2,)(dense: [0 2])]
>>> stratkfold.ts_idx
[CArray(2,)(dense: [0 2]), CArray(2,)(dense: [1 3])]
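Stratification can be approximated by dealing the samples of each class to the folds in turn, so that every fold receives roughly the same share of each class. The standard-library sketch below does exactly that; the function name `stratified_kfold_indices` and the round-robin dealing are assumptions for illustration rather than secml's actual procedure.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, num_folds=3, seed=None):
    """Return (tr_idx, ts_idx) with class proportions preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    smallest = min(len(idx) for idx in by_class.values())
    if num_folds > smallest:
        raise ValueError("num_folds cannot exceed the smallest class size")
    test_folds = [[] for _ in range(num_folds)]
    for lab in sorted(by_class):
        idx = by_class[lab][:]
        rng.shuffle(idx)
        # Deal the samples of this class to the folds in round-robin order.
        for pos, i in enumerate(idx):
            test_folds[pos % num_folds].append(i)
    tr_idx, ts_idx = [], []
    for k in range(num_folds):
        held_out = set(test_folds[k])
        ts_idx.append(sorted(held_out))
        tr_idx.append([i for i in range(len(labels)) if i not in held_out])
    return tr_idx, ts_idx


labels = [1, 0, 0, 1]
tr_idx, ts_idx = stratified_kfold_indices(labels, num_folds=2, seed=0)
for tr, ts in zip(tr_idx, ts_idx):
    print("train:", tr, "test:", ts)
```

With two samples per class and two folds, each test fold holds exactly one sample of class 0 and one of class 1.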
- Attributes
class_type: 'strat-kfold'. Defines class type.
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Returns a list of split datasets.
timed([msg]): Timer decorator.
CTrainTestSplit
class secml.data.splitter.c_train_test_split.CTrainTestSplit(train_size=None, test_size=None, random_state=None, shuffle=True)
Bases: secml.core.c_creator.CCreator
Train and Test Sets splitter.
Split dataset into random train and test subsets.
Quick utility that wraps CDataSplitterShuffle().compute_indices(ds) for splitting (and optionally subsampling) data in a one-liner.
- Parameters
- train_size : float, int, or None, optional
If None (default), the value is automatically set to the complement of the test size. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, it represents the absolute number of train samples.
- test_size : float, int, or None, optional
If None (default), the value is automatically set to the complement of the train size. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- shuffle : bool, optional
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None. Default True.
Notes
train_size and test_size cannot both be None. If one is set to None, the other should be a float, representing a percentage, or an integer.
Examples
>>> from secml.data import CDataset
>>> from secml.data.splitter import CTrainTestSplit
>>> ds = CDataset([[1,2],[3,4],[5,6],[7,8]],[1,0,1,1])
>>> tr, ts = CTrainTestSplit(train_size=0.5, random_state=0).split(ds)
>>> tr.num_samples
2
>>> ts.num_samples
2
>>> # Get splitting indices without shuffle
>>> tr_idx, ts_idx = CTrainTestSplit(train_size=0.25,
...     random_state=0, shuffle=False).compute_indices(ds)
>>> tr_idx
CArray(1,)(dense: [0])
>>> ts_idx
CArray(3,)(dense: [1 2 3])
>>> # At least one sample is needed for each set
>>> tr, ts = CTrainTestSplit(train_size=0.2, random_state=0).split(ds)
Traceback (most recent call last):
  ...
ValueError: train_size should be at least 1 or 0.25
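The behaviour shown in these examples, including the unshuffled split and the "at least one sample per set" check, can be mimicked with a short standard-library sketch. The function name `train_test_indices`, the floor rounding of a float `train_size`, and the error message are assumptions; only `train_size` is handled, to keep the sketch short.

```python
import math
import random


def train_test_indices(n_samples, train_size, seed=None, shuffle=True):
    """Return (tr_idx, ts_idx) for a single train/test split."""
    if isinstance(train_size, float):
        n_train = math.floor(train_size * n_samples)  # assumed rounding
    else:
        n_train = train_size
    # Both sets must receive at least one sample.
    if n_train < 1 or n_train >= n_samples:
        raise ValueError("each of the two sets needs at least one sample")
    idx = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(idx)
    # Without shuffling, the first n_train samples form the training set.
    return idx[:n_train], idx[n_train:]


tr_idx, ts_idx = train_test_indices(4, train_size=0.25, shuffle=False)
print(tr_idx, ts_idx)  # [0] [1, 2, 3]
```

With four samples, `train_size=0.2` would round down to zero training samples, so the sketch raises `ValueError` just as the last example above does.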
- Attributes
Methods
compute_indices(self, dataset): Compute training set and test set indices for each fold.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Split dataset into training set and test set.
timed([msg]): Timer decorator.
compute_indices(self, dataset)
Compute training set and test set indices for each fold.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- tr_idx, ts_idx : CArray
Flat arrays with the tr/ts indices.
split(self, dataset)
Split dataset into training set and test set.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- ds_train, ds_test : CDataset
Train and Test datasets.
property tr_idx
Training set indices obtained with the split of the data.
property ts_idx
Test set indices obtained with the split of the data.
CChronologicalSplitter
class secml.data.splitter.c_chronological_splitter.CChronologicalSplitter(th_timestamp, train_size=1.0, test_size=1.0, random_state=None, shuffle=True)
Bases: secml.core.c_creator.CCreator
Dataset splitter based on timestamps.
Split dataset into train and test subsets, using a timestamp as split point.
A dataset containing timestamp and timestamp_fmt header attributes is required.
- Parameters
- th_timestamp : str
The split point in time between training and test set. Samples having timestamp <= th_timestamp will be put in the training set, while samples with timestamp > th_timestamp will be used for the test set. The timestamp must follow the ISO 8601 format. Any incomplete timestamp will be parsed too.
- train_size : float or int, optional
If float, it should be between 0.0 and 1.0 and represents the proportion of the samples having timestamp <= th_timestamp to include in the train split. Default 1.0. If int, it represents the absolute number of train samples.
- test_size : float or int, optional
If float, it should be between 0.0 and 1.0 and represents the proportion of the samples having timestamp > th_timestamp to include in the test split. Default 1.0. If int, it represents the absolute number of test samples.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- shuffle : bool, optional
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None. Default True.
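The threshold rule described for th_timestamp can be sketched with Python's `datetime` module. The function name `chronological_indices` is hypothetical, and the sketch covers only the thresholding step (no subsampling through train_size or test_size, no shuffling). It also relies on `datetime.fromisoformat`, which accepts a narrower range of ISO 8601 strings than the parser described above may handle.

```python
from datetime import datetime


def chronological_indices(timestamps, th_timestamp):
    """Split sample indices around a timestamp threshold."""
    threshold = datetime.fromisoformat(th_timestamp)
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    # Samples at or before the threshold train; later samples test.
    tr_idx = [i for i, t in enumerate(parsed) if t <= threshold]
    ts_idx = [i for i, t in enumerate(parsed) if t > threshold]
    return tr_idx, ts_idx


timestamps = [
    "2019-12-31T23:59:59",
    "2020-01-15T00:00:00",
    "2020-03-01T12:00:00",
    "2020-01-01T08:30:00",
]
tr_idx, ts_idx = chronological_indices(timestamps, "2020-01-15")
print(tr_idx, ts_idx)  # [0, 1, 3] [2]
```

The date-only threshold `"2020-01-15"` is parsed as midnight of that day, so the sample stamped exactly at midnight still falls in the training set.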
- Attributes
Methods
compute_indices(self, dataset): Compute training set and test set indices.
copy(self): Returns a shallow copy of current class.
create([class_item]): This method creates an instance of a class with given type.
deepcopy(self): Returns a deep copy of current class.
get_class_from_type(class_type): Return the class associated with input type.
get_params(self): Returns the dictionary of class hyperparameters.
get_state(self): Returns the object state dictionary.
get_subclasses(): Get all the subclasses of the calling class.
list_class_types(): This method lists all types of available subclasses of calling one.
load(path): Loads object from file.
load_state(self, path): Sets the object state from file.
save(self, path): Save class object to file.
save_state(self, path): Store the object state to file.
set(self, param_name, param_value[, copy]): Set a parameter of the class.
set_params(self, params_dict[, copy]): Set all parameters passed as a dictionary {key: value}.
set_state(self, state_dict[, copy]): Sets the object state using input dictionary.
split(self, dataset): Split dataset into training set and test set.
timed([msg]): Timer decorator.
compute_indices(self, dataset)
Compute training set and test set indices.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- tr_idx, ts_idx : CArray
Flat arrays with the tr/ts indices.
split(self, dataset)
Split dataset into training set and test set.
- Parameters
- dataset : CDataset
Dataset to split.
- Returns
- ds_train, ds_test : CDataset
Train and Test datasets.
property tr_idx
Training set indices obtained with the split of the data.
property ts_idx
Test set indices obtained with the split of the data.