Classification

The pyts.classification module includes classification algorithms for time series.

class pyts.classification.KNNClassifier(n_neighbors=1, weights=u'uniform', algorithm=u'auto', leaf_size=30, p=2, metric=u'minkowski', metric_params=None, n_jobs=1, **kwargs)[source]

k-nearest neighbors classifier.

Parameters:
n_neighbors : int, optional (default = 1)

Number of neighbors to use.

weights : str or callable, optional (default = ‘uniform’)

Weight function used in prediction. Possible values:

  • ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
  • ‘distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
  • [callable] : a user-defined function which accepts an array of distances and returns an array of the same shape containing the weights.
algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

metric : string or DistanceMetric object (default = ‘minkowski’)

The distance metric to use for the tree. The default metric is minkowski, which with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics. ‘dtw’ and ‘fast_dtw’ are also available.

p : integer, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric_params : dict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run for the neighbors search. If n_jobs=-1, the number of jobs is set to the number of CPU cores. Does not affect the fit() method.

Methods

fit(X, y) Fit the model according to the given training data.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict the class labels for the provided data.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.
fit(X, y)[source]

Fit the model according to the given training data.

Parameters:
X : array-like, shape = [n_samples, n_features]

Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Class labels for each data sample.

Returns:
self : object

Returns self.

predict(X)[source]

Predict the class labels for the provided data.

Parameters:
X : array-like, shape = [n_samples, n_features]
Returns:
y : array-like, shape [n_samples]

Class labels for each data sample.
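Since ‘dtw’ and ‘fast_dtw’ are available metrics, it may help to see what a DTW-based nearest-neighbor prediction computes. The following is a minimal pure-NumPy sketch, not the pyts implementation; the names dtw_distance and knn_predict are illustrative:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two 1-D series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # best alignment ending at (i, j): match, insertion, or deletion
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def knn_predict(X_train, y_train, X_test, n_neighbors=1):
    """Majority vote over the n_neighbors training series closest in DTW."""
    y_pred = []
    for x in X_test:
        dists = np.array([dtw_distance(x, ref) for ref in X_train])
        nearest = np.argsort(dists)[:n_neighbors]
        labels, counts = np.unique(np.asarray(y_train)[nearest],
                                   return_counts=True)
        y_pred.append(labels[np.argmax(counts)])
    return np.array(y_pred)
```

With weights=‘distance’, the vote would instead be weighted by the inverse of each neighbor’s DTW distance.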

class pyts.classification.SAXVSMClassifier(n_bins=4, quantiles=u'empirical', window_size=4, numerosity_reduction=True, use_idf=True, smooth_idf=True, sublinear_tf=False)[source]

Classifier based on SAX-VSM representation and tf-idf statistics.

Parameters:
n_bins : int (default = 4)

Number of bins (also known as the size of the alphabet).

quantiles : {‘gaussian’, ‘empirical’} (default = ‘empirical’)

The way to compute quantiles. If ‘gaussian’, quantiles from a gaussian distribution N(0,1) are used. If ‘empirical’, empirical quantiles are used.

window_size : int (default = 4)

Size of the window (i.e. the size of each word).

numerosity_reduction : bool (default = True)

If True, delete all but one occurrence of back-to-back identical words.

use_idf : bool (default = True)

Enable inverse-document-frequency reweighting.

smooth_idf : bool (default = True)

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

sublinear_tf : bool (default = False)

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

Attributes:
vocabulary_ : dict

A mapping of feature indices to terms.

tfidf_ : sparse matrix, shape = [n_classes, n_words]

Term-document matrix.

idf_ : array, shape = [n_features], or None

The learned idf vector (global term weights) when use_idf=True, None otherwise.

stop_words_ : set

Terms that were ignored because they either:
  • occurred in too many documents (max_df)
  • occurred in too few documents (min_df)
  • were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

fit(X, y) Fit the model according to the given training data.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict the class labels for the provided data.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.
fit(X, y)[source]

Fit the model according to the given training data.

Parameters:
X : array-like, shape = [n_samples]

Training vector, where n_samples is the number of samples.

y : array-like, shape = [n_samples]

Class labels for each data sample.

Returns:
self : object

Returns self.

predict(X)[source]

Predict the class labels for the provided data.

Parameters:
X : array-like, shape = [n_samples, n_features]
Returns:
y : array-like, shape [n_samples]

Class labels for each data sample.
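To make the pipeline behind this classifier concrete, here is a heavily simplified NumPy sketch of the SAX-VSM idea: discretize each series into SAX words, build one tf-idf vector per class (classes play the role of documents), and classify a new series by cosine similarity. The helper names (sax_words, fit_class_tfidf, predict_class) are illustrative, and numerosity reduction is omitted for brevity:

```python
import numpy as np
from collections import Counter

def sax_words(ts, n_bins=4, window_size=4):
    """Discretize a series with empirical quantiles, then slide a window
    over the symbol sequence to produce words."""
    quantiles = np.quantile(ts, np.linspace(0, 1, n_bins + 1)[1:-1])
    symbols = np.searchsorted(quantiles, ts)          # one symbol per point
    letters = ''.join(chr(ord('a') + int(s)) for s in symbols)
    return [letters[i:i + window_size]
            for i in range(len(letters) - window_size + 1)]

def fit_class_tfidf(X, y, **kw):
    """One tf-idf vector per class: classes play the role of documents."""
    classes = np.unique(y)
    bags = {c: Counter() for c in classes}
    for ts, label in zip(X, y):
        bags[label].update(sax_words(ts, **kw))
    vocab = sorted(set().union(*bags.values()))
    tf = np.array([[bags[c][w] for w in vocab] for c in classes], dtype=float)
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(classes)) / (1 + df)) + 1   # smooth_idf-style weights
    return classes, vocab, tf * idf

def predict_class(X, classes, vocab, tfidf, **kw):
    """Assign each series to the class with highest cosine similarity."""
    preds = []
    for ts in X:
        counts = Counter(sax_words(ts, **kw))
        v = np.array([counts[w] for w in vocab], dtype=float)
        sims = tfidf @ v / ((np.linalg.norm(tfidf, axis=1) + 1e-12)
                            * (np.linalg.norm(v) + 1e-12))
        preds.append(classes[np.argmax(sims)])
    return np.array(preds)
```

Because the quantiles are computed per series, the symbolic representation is invariant to shifting and scaling each series, which is what makes the word counts comparable across samples.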

class pyts.classification.BOSSVSClassifier(n_coefs, window_size, norm_mean=True, norm_std=True, n_bins=4, quantiles=u'empirical', variance_selection=False, variance_threshold=0.0, numerosity_reduction=True, smooth_idf=True, sublinear_tf=True)[source]

Bag-of-SFA Symbols in Vector Space.

Parameters:
n_coefs : None or int (default = None)

The number of Fourier coefficients to keep. If n_coefs=None, all Fourier coefficients are returned. If n_coefs is an integer, the n_coefs most significant Fourier coefficients are returned if anova=True; otherwise the first n_coefs Fourier coefficients are returned. An even number is required (for real and imaginary values) if anova=False.

window_size : int

Window length used to extract sub time series.

norm_mean : bool (default = True)

If True, center the data before scaling. If norm_mean=True and anova=False, the first Fourier coefficient will be dropped.

norm_std : bool (default = True)

If True, scale the data to unit variance.

n_bins : int (default = 4)

The number of bins.

quantiles : {‘gaussian’, ‘empirical’} (default = ‘empirical’)

The way to compute quantiles. If ‘gaussian’, quantiles from a gaussian distribution N(0,1) are used. If ‘empirical’, empirical quantiles are used.

variance_selection : bool (default = False)

If True, the Fourier coefficients with low variance are removed.

variance_threshold : float (default = 0.)

Fourier coefficients with a training-set variance lower than this threshold will be removed. Ignored if variance_selection=False.

numerosity_reduction : bool (default = True)

Whether or not numerosity reduction is applied. When the same word occurs several times in a row, only one instance of this word is kept if numerosity_reduction=True; otherwise all instances are kept.

smooth_idf : bool (default = True)

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

sublinear_tf : bool (default = True)

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

Attributes:
vocabulary_ : dict

A mapping of feature indices to terms.

Methods

fit(X, y[, overlapping]) Fit the model according to the given training data.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict the class labels for the provided data.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.
fit(X, y, overlapping=True)[source]

Fit the model according to the given training data.

Parameters:
X : array-like, shape = [n_samples, n_features]

Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Class labels for each data sample.

overlapping : bool (default = True)

If True, overlapping windows are used for the training phase.

Returns:
self : object
predict(X)[source]

Predict the class labels for the provided data.

Parameters:
X : array-like, shape = [n_samples, n_features]
Returns:
y : array-like, shape [n_samples]

Class labels for each data sample.
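For intuition, the core of BOSS VS is turning each sliding window into an “SFA word”: take the first Fourier coefficients of the window and discretize each coefficient with quantile bins, then weight per-class bags of those words with tf-idf. Below is a rough NumPy sketch of the word-extraction step only; sfa_words is an illustrative name, the binning is simplified (empirical quantiles computed over the windows of a single series), and variance selection and numerosity reduction are omitted:

```python
import numpy as np

def sfa_words(ts, window_size=16, n_coefs=4, n_bins=4):
    """Extract one symbolic (SFA-style) word per sliding window."""
    windows = np.array([ts[i:i + window_size]
                        for i in range(len(ts) - window_size + 1)])
    fourier = np.fft.rfft(windows, axis=1)
    # interleave real and imaginary parts, keep the first n_coefs columns
    coefs = np.empty((windows.shape[0], 2 * fourier.shape[1]))
    coefs[:, 0::2] = fourier.real
    coefs[:, 1::2] = fourier.imag
    coefs = coefs[:, :n_coefs]
    # bin each coefficient column with its own empirical quantiles
    edges = [np.quantile(coefs[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(n_coefs)]
    symbols = np.column_stack([np.searchsorted(edges[j], coefs[:, j])
                               for j in range(n_coefs)])
    return [''.join(chr(ord('a') + int(s)) for s in row) for row in symbols]
```

Each word then plays the same role as a SAX word: the classifier builds one tf-idf vector per class from these bags of words and assigns a new series to the class whose vector is most similar.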