ertk.dataset.Dataset
- class ertk.dataset.Dataset(corpus_info: PathLike | str, features: PathLike | str | None = None, subset: str = 'default', label: str = 'label')
Bases: object
Class representing a generic dataset, consisting of a set of features and optional partitions and annotations. Has various preprocessing methods.
An annotation assigns a scalar value to each instance in the dataset.
A partition is a partition of instances into disjoint groups (e.g. speakers). A partition should be complete in that each instance is in exactly one group in the partition. Each partition has a corresponding annotation with the same name, using the group names as categorical annotations.
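The completeness requirement can be illustrated without the library. The following is a minimal sketch in plain Python, not ertk code: the helper `partition_to_annotation` and the instance and speaker names are hypothetical. It checks that every instance is in exactly one group and derives the matching categorical annotation.

```python
from collections import Counter


def partition_to_annotation(groups, names):
    """Turn {group: [instance, ...]} into {instance: group}.

    Hypothetical helper: raises ValueError unless every instance in
    `names` belongs to exactly one group.
    """
    membership = Counter(inst for members in groups.values() for inst in members)
    missing = [n for n in names if membership[n] == 0]
    duplicated = [n for n in names if membership[n] > 1]
    if missing or duplicated:
        raise ValueError(
            f"incomplete partition: missing={missing}, duplicated={duplicated}"
        )
    # The annotation uses the group names as categorical values.
    return {inst: group for group, members in groups.items() for inst in members}


names = ["utt1", "utt2", "utt3"]
speaker = {"spk_a": ["utt1", "utt3"], "spk_b": ["utt2"]}
annotation = partition_to_annotation(speaker, names)
```

Here `annotation` maps each instance name to its speaker, which is the "corresponding annotation with the same name" described above.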
- Parameters:
- corpus_info: PathLike or str
Path to corpus info in YAML format.
- features: PathLike or str, optional
Path to features file, or unique name of features in corpus features directory.
- subset: str, optional
The subset of instances to use.
- __init__(corpus_info: PathLike | str, features: PathLike | str | None = None, subset: str = 'default', label: str = 'label')
Methods
__init__(corpus_info[, features, subset, label])
annotation_type(annot_name)
clip_arrays(length): Clips each array to the specified maximum length.
clone()
copy()
frame_arrays(frame_size, frame_shift[, ...]): Create a sequence of frames from the raw signal.
get_annotations(annot_name): Get a list of annotations, one for each instance currently in the dataset.
get_audio_paths()
get_group_counts(annot_name): Get group counts for a partition.
get_group_indices(annot_name): Gets the group indices (i.e. indices into the groups array) for a given partition.
get_group_names(annot_name): Get the names of groups in a partition.
get_idx_for_names(names): Gets indices of instances corresponding to names.
get_idx_for_split(split[, return_complement]): Gets indices of instances corresponding to the selection given by split.
get_ratings(name[, rating_set]): Get per-annotator ratings of a specified column, for this dataset.
init_corpus_info(path): Initialise corpus metadata from YAML.
map_and_select(map, select, remove): Convenience function for mapping one or more partitions and then selecting one or more groups.
map_classes(mapping): Modifies classes based on the given mapping.
map_groups(part_name, mapping): Map group names in a partition.
normalise(partition[, normaliser]): Transforms the data matrix of this dataset in-place using some (offline) normalisation method.
pad_arrays([pad]): Pads each array to the nearest multiple of pad greater than the array size.
remove_annotation(annot_name): Removes a set of annotations from the dataset.
remove_classes(*[, drop, keep]): Remove instances with labels not in keep.
remove_groups(part_name, *[, drop, keep]): Remove instances corresponding to groups from the given partition.
remove_instances(*[, drop, keep]): Remove instances from dataset.
remove_ratings(rating_set): Delete a set of ratings for this dataset.
rename_annotation(old_name, new_name): Renames an annotation.
transpose_time(): Transpose the time and feature axis of each instance.
update_annotation(annot_name, annotations[, ...]): Add or update an annotation.
update_features(features, **read_kwargs): Update the features matrix and feature names for this dataset.
update_labels(labels)
update_ratings(rating_set, ratings): Update a set of ratings.
use_subset([subset]): Use a different subset of the instances.
Attributes
annotations: Full annotation matrix for dataset.
Number of instances for each class.
List of unique class labels.
The corpus this dataset represents.
The descriptive name of this corpus.
List of feature names.
The annotation used as target label.
List of labels for instances.
Number of unique classes.
Number of features.
Number of instances in this dataset.
n_speakers
names: List of instance names.
Partitions in this dataset.
Full ratings for dataset.
Number of instances for each speaker.
speaker_indices: Indices into speakers array of corresponding speaker for each instance.
List of unique speakers in this dataset.
speakers
Name of clip subset used.
Dict from subset name to set of clip names.
The data matrix.
The class label array; one label per instance.
- property annotations: DataFrame
Full annotation matrix for dataset.
- frame_arrays(frame_size: int, frame_shift: int, max_frames: int | None = None)
Create a sequence of frames from the raw signal.
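As an illustration of what framing means here, the following is a minimal sketch in plain Python, not the library's implementation. The helper `frame_signal` is hypothetical. It assumes that trailing samples which do not fill a whole frame are dropped and that `max_frames` simply truncates the result; the actual method may pad or behave differently.

```python
def frame_signal(signal, frame_size, frame_shift, max_frames=None):
    """Split a 1-D sequence into overlapping frames of `frame_size` samples.

    Hypothetical sketch: consecutive frames start `frame_shift` samples
    apart; an incomplete final frame is dropped (assumption).
    """
    starts = range(0, len(signal) - frame_size + 1, frame_shift)
    frames = [signal[s:s + frame_size] for s in starts]
    return frames if max_frames is None else frames[:max_frames]


signal = list(range(10))
frames = frame_signal(signal, frame_size=4, frame_shift=2)
# frames == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```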
- get_annotations(annot_name: str) → ndarray
Get a list of annotations, one for each instance currently in the dataset.
- Parameters:
- annot_name: str
Annotation name.
- Returns:
- A pd.Series of values, one for each instance in the dataset, in the same order they appear in names and x.
- get_group_counts(annot_name: str) → ndarray
Get group counts for a partition.
- Parameters:
- annot_name: str
The partition name.
- Returns:
- A NumPy array of counts, one for each group in this partition.
- get_group_indices(annot_name: str) → ndarray
Gets the group indices (i.e. indices into the groups array) for a given partition.
- Parameters:
- annot_name: str
The partition name.
- Returns:
- A NumPy array of group indices for each instance in the dataset.
- get_group_names(annot_name: str) → List[str]
Get the names of groups in a partition.
- Parameters:
- annot_name: str
Annotation name.
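The relationship between group names, group indices and group counts can be sketched in plain Python. This is an illustration rather than ertk code: the speaker names are made up, and the sorted ordering of group names is an assumption.

```python
from collections import Counter

# One group name per instance, e.g. a hypothetical "speaker" partition.
annotation = ["spk_b", "spk_a", "spk_b", "spk_c"]

# Unique group names (sorted ordering is an assumption).
group_names = sorted(set(annotation))

# For each instance, the index of its group in `group_names`.
index_of = {name: i for i, name in enumerate(group_names)}
group_indices = [index_of[g] for g in annotation]

# Number of instances in each group, aligned with `group_names`.
counts = Counter(annotation)
group_counts = [counts[name] for name in group_names]
```

With this data, `group_indices` is `[1, 0, 1, 2]` and `group_counts` is `[1, 2, 1]`, so `group_names[group_indices[i]]` recovers the group of instance `i`.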
- get_idx_for_names(names: Collection[str]) → ndarray
Gets indices of instances corresponding to names.
- Parameters:
- names: collection of str
The names to get indices for.
- Returns:
- idx: np.ndarray
The indices corresponding to names, in order.
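The phrase "in order" means the result follows the order of the requested names, not the order of the dataset. A minimal sketch of the lookup in plain Python, where `idx_for_names` is a hypothetical helper and raising `KeyError` for unknown names is an assumption:

```python
def idx_for_names(all_names, wanted):
    """Return the position in `all_names` of each name in `wanted`, in order.

    Hypothetical sketch: unknown names raise KeyError (assumption).
    """
    position = {name: i for i, name in enumerate(all_names)}
    return [position[name] for name in wanted]


all_names = ["utt1", "utt2", "utt3", "utt4"]
idx = idx_for_names(all_names, ["utt3", "utt1"])
# idx == [2, 0]
```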
- get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[False] = False) → ndarray
- get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[True]) → Tuple[ndarray, ndarray]
Gets indices of instances corresponding to the selection given by split.
- Parameters:
- split: str or dict
Either a string containing the subset to select, groups to select, a path to such a config file, or a mapping object containing the groups to select.
- return_complement: bool
If True, return a tuple of idx, comp_idx where comp_idx contains the complement indices not in the split.
- Returns:
- idx: np.ndarray
The corresponding indices.
- comp_idx: np.ndarray, optional
If return_complement is True, this is also returned and contains the complement indices.
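The mapping form of split and the complement can be sketched as follows. This is a plain-Python illustration, not the library's code: `idx_for_split` here is a hypothetical stand-in, the partition and group names are made up, and combining several partitions with a logical AND is an assumption.

```python
def idx_for_split(annotations, split, return_complement=False):
    """Select instances whose group is listed for every partition in `split`.

    annotations: {partition: [group of each instance]}
    split: {partition: collection of groups to select}
    Hypothetical sketch; conditions are combined with AND (assumption).
    """
    n = len(next(iter(annotations.values())))
    idx = [
        i
        for i in range(n)
        if all(annotations[part][i] in groups for part, groups in split.items())
    ]
    if not return_complement:
        return idx
    selected = set(idx)
    comp_idx = [i for i in range(n) if i not in selected]
    return idx, comp_idx


annotations = {
    "speaker": ["a", "b", "a", "c"],
    "language": ["en", "en", "de", "en"],
}
split = {"speaker": {"a", "b"}, "language": {"en"}}
idx, comp_idx = idx_for_split(annotations, split, return_complement=True)
# idx == [0, 1], comp_idx == [2, 3]
```

Together, `idx` and `comp_idx` cover every instance exactly once, which makes the pair convenient for train/test style splits.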
- get_ratings(name: str, rating_set: str = 'ratings') → Series
Get per-annotator ratings of a specified column, for this dataset.
- Parameters:
- name: str
The name of the rating column to get.
- rating_set: str
The name of the rating set to get.
- Returns:
- pd.Series
Pandas Series with (name, rater) multiindex.
- init_corpus_info(path: PathLike | str) → None
Initialise corpus metadata from YAML.
- Parameters:
- path: os.PathLike or str
The path to a YAML file containing corpus metadata.
- map_and_select(map: Mapping[str, Mapping[str, str]], select: Mapping[str, str | Collection[str]], remove: Mapping[str, str | Collection[str]]) → None
Convenience function for mapping one or more partitions and then selecting one or more groups.
- Parameters:
- map: mapping
The groups mapping. May have one or more partitions.
- select: mapping
Mapping from partitions to groups to select.
- map_classes(mapping: Mapping[str, str])
Modifies classes based on the given mapping. Keys not corresponding to classes are ignored. The new classes will be sorted lexicographically.
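The described behaviour can be sketched in plain Python. This is an illustration only: `map_labels` is a hypothetical helper and the emotion labels are made up. Labels without an entry in the mapping are kept unchanged, keys that match no label have no effect, and the resulting class list is sorted.

```python
def map_labels(labels, mapping):
    """Apply `mapping` to each label and return (new_labels, sorted_classes).

    Hypothetical sketch of the behaviour described above.
    """
    new_labels = [mapping.get(label, label) for label in labels]
    classes = sorted(set(new_labels))  # lexicographic order
    return new_labels, classes


labels = ["happy", "excited", "sad", "neutral"]
mapping = {"excited": "happy", "bored": "neutral"}  # "bored" is not a class
new_labels, classes = map_labels(labels, mapping)
# new_labels == ["happy", "happy", "sad", "neutral"]
# classes == ["happy", "neutral", "sad"]
```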
- map_groups(part_name: str, mapping: Mapping[str, str]) → None
Map group names in a partition.
- Parameters:
- part_name: str
Name of partition.
- mapping: dict
Group name mapping.
- property names: Index
List of instance names.
- normalise(partition: str, normaliser: TransformerMixin = StandardScaler())
Transforms the data matrix of this dataset in-place using some (offline) normalisation method.
- Parameters:
- normaliser:
The transform to apply. Must implement fit_transform().
- partition: str
The partition to apply in a per-group fashion. If “all”, then perform global normalisation on all the data.
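To show what per-group normalisation means, here is a minimal sketch using only the standard library. It is not the library's implementation: `standardise_per_group` is a hypothetical helper, it handles a single feature rather than a full data matrix, and it returns a new list instead of modifying the data in place. It mirrors the default StandardScaler behaviour of removing the mean and dividing by the population standard deviation, computed separately for each group.

```python
from statistics import fmean, pstdev


def standardise_per_group(values, groups):
    """Standardise `values` to zero mean and unit variance within each group.

    Hypothetical sketch: one feature only; a group with zero spread is
    only centred (its scale is treated as 1).
    """
    out = list(values)
    for group in set(groups):
        idx = [i for i, name in enumerate(groups) if name == group]
        group_values = [values[i] for i in idx]
        mean = fmean(group_values)
        std = pstdev(group_values) or 1.0
        for i in idx:
            out[i] = (values[i] - mean) / std
    return out


values = [1.0, 3.0, 10.0, 30.0]
speakers = ["a", "a", "b", "b"]
normalised = standardise_per_group(values, speakers)
```

Although speaker "b" has values ten times larger than speaker "a", both groups end up on the same scale, which is the purpose of normalising per group rather than globally.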
- pad_arrays(pad: int = 32)
Pads each array to the nearest multiple of pad greater than the array size. Assumes axis 0 of x is time.
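The target length can be worked out with a ceiling division. The sketch below is plain Python, not the library's code: `pad_to_multiple` is a hypothetical helper, zero filling is an assumption, and so is leaving an array unchanged when its length is already an exact multiple of pad.

```python
import math


def pad_to_multiple(frames, pad=32, fill=0.0):
    """Pad a (time, features) list of rows along axis 0 to a multiple of `pad`.

    Hypothetical sketch: appends rows of `fill`; a length that is already
    a multiple of `pad` is left unchanged (assumption).
    """
    target = math.ceil(len(frames) / pad) * pad
    width = len(frames[0]) if frames else 0
    return frames + [[fill] * width for _ in range(target - len(frames))]


frames = [[1.0, 2.0] for _ in range(5)]  # 5 time steps, 2 features
padded = pad_to_multiple(frames, pad=4)
# len(padded) == 8
```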
- remove_annotation(annot_name: str) → None
Removes a set of annotations from the dataset. This does not remove any instances.
- Parameters:
- annot_name: str
Annotation name.
- remove_classes(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None)
Remove instances with labels not in keep.
- remove_groups(part_name: str, *, drop: Collection[str] | None = None, keep: Collection[str] | None = None) → None
Remove instances corresponding to groups from the given partition.
- Parameters:
- part_name: str
The partition name.
- drop: collection of str
The groups to remove in given partition.
- keep: collection of str
The groups to keep in given partition.
- remove_instances(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None)
Remove instances from dataset. Recalculate annotations, partitions, etc.
- Parameters:
- drop: collection of str
Instances to drop.
- keep: collection of str
Instances to keep. Exactly one of drop and keep should be given.
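The "exactly one of drop and keep" rule can be sketched as a small validation step. This is a plain-Python illustration with a hypothetical helper `select_instances`; raising `ValueError` when the rule is violated is an assumption about how such a check might be reported.

```python
def select_instances(names, *, drop=None, keep=None):
    """Return the names that remain after applying `drop` or `keep`.

    Hypothetical sketch: exactly one of `drop` and `keep` must be given.
    """
    if (drop is None) == (keep is None):
        raise ValueError("Exactly one of drop and keep should be given.")
    if keep is not None:
        keep = set(keep)
        return [n for n in names if n in keep]
    drop = set(drop)
    return [n for n in names if n not in drop]


names = ["utt1", "utt2", "utt3"]
remaining = select_instances(names, drop=["utt2"])
# remaining == ["utt1", "utt3"]
```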
- remove_ratings(rating_set: str) → None
Delete a set of ratings for this dataset.
- Parameters:
- rating_set: str
The name of a set of ratings.
- rename_annotation(old_name: str, new_name: str) → None
Renames an annotation. This is useful for example if you want to select a different labelling to use. If an annotation already exists with the given new_name, then it is replaced destructively.
- Parameters:
- old_name: str
The name of the annotation to rename.
- new_name: str
The new name of the annotation.
- property speaker_indices: ndarray
Indices into speakers array of corresponding speaker for each instance.
- transpose_time()
Transpose the time and feature axis of each instance.
- update_annotation(annot_name: str, annotations: PathLike | str | Mapping[str, Any] | Sequence[Any] | Series, dtype: Type | Literal['category'] | None = None) → None
Add or update an annotation.
- Parameters:
- annot_name: str
The name of the annotation.
- annotations: PathLike, str, mapping, DataFrame, Series or sequence
Annotations to add, similar to update_partition(). If PathLike or str, annotations are read from a CSV. If a dict, should be of the form {instance: annotation}. If a list, should have an annotation for each instance.
- dtype: type, optional
The type of annotations for reading from CSV file. If the literal “category” is given the annotations are converted to the Pandas categorical dtype.
- update_features(features: PathLike | str, **read_kwargs) → None
Update the features matrix and feature names for this dataset.
- Parameters:
- features: os.PathLike or str
Path to a set of features or unique name of features in corpus features dir.
- **read_kwargs:
Other arguments to pass to read_features().
- update_ratings(rating_set: str, ratings: PathLike | str | Mapping[str, Mapping[str, Any]] | Series) → None
Update a set of ratings.
- Parameters:
- rating_set: str
The name for this set of ratings.
- ratings: PathLike, str, mapping, DataFrame, Series
The ratings to add. Must have a joint index where