# Performs clustering according to the bisecting k-means algorithm

Source: `R/sklearn-cluster.R`

`BisectingKMeans.Rd`

This is a wrapper around the Python class `sklearn.cluster.BisectingKMeans`.

## Super classes

`rgudhi::PythonClass` -> `rgudhi::SKLearnClass` -> `rgudhi::BaseClustering` -> `BisectingKMeans`

## Methods

## Inherited methods

### Method `new()`

The BisectingKMeans class constructor.

#### Usage

```
BisectingKMeans$new(
  n_clusters = 2L,
  init = c("k-means++", "random"),
  n_init = 10L,
  max_iter = 300L,
  tol = 1e-04,
  verbose = 0L,
  random_state = NULL,
  copy_x = TRUE,
  algorithm = c("lloyd", "elkan"),
  bisecting_strategy = c("biggest_inertia", "largest_cluster")
)
```
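As a brief illustration of the signature above, the following sketch builds a clusterer with a few non-default settings. It assumes that rgudhi is installed together with a Python backend that provides scikit-learn; the argument values are arbitrary, and the availability guard is an addition for illustration only.

```r
# Arbitrary illustrative settings; every name matches the usage signature.
args <- list(
  n_clusters = 3L,
  init = "k-means++",
  n_init = 10L,
  random_state = 1234L,
  bisecting_strategy = "largest_cluster"
)

# Only build the object when rgudhi and scikit-learn are both available
# (assumption: the Python backend is managed through reticulate).
if (requireNamespace("rgudhi", quietly = TRUE) &&
    reticulate::py_module_available("sklearn.cluster")) {
  cl <- do.call(rgudhi::BisectingKMeans$new, args)
}
```

Arguments left out of the list keep the defaults documented below.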

#### Arguments

`n_clusters`

An integer value specifying the number of clusters to form as well as the number of centroids to generate. Defaults to `2L`.

`init`

Either a string or a numeric matrix of shape \(\mathrm{n\_clusters} \times \mathrm{n\_features}\) specifying the method for initialization. If a string, choices are:

- `"k-means++"`: selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence, and is theoretically proven to be \(\mathcal{O}(\log(k))\)-optimal. See the description of `n_init` for more details;
- `"random"`: chooses `n_clusters` observations (rows) at random from the data for the initial centroids.

Defaults to `"k-means++"`.
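To make the `"k-means++"` option more concrete, here is a minimal base-R sketch of the textbook seeding rule: each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. This is an illustration only, not the scikit-learn implementation (which may differ in details such as trying several candidates per step); the helper name `kmeanspp_seeds` is hypothetical.

```r
# Textbook k-means++ seeding (illustrative; not the scikit-learn code).
kmeanspp_seeds <- function(x, k) {
  n <- nrow(x)
  chosen <- integer(k)
  # First centroid: one observation picked uniformly at random.
  chosen[1] <- sample.int(n, 1L)
  # Squared distance from each point to its nearest chosen centroid.
  d2 <- rowSums(sweep(x, 2, x[chosen[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next centroid: sampled with probability proportional to d2, so
    # points far from the current centroids are favoured.
    chosen[j] <- sample.int(n, 1L, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[chosen[j], ])^2))
  }
  x[chosen, , drop = FALSE]
}

# Two synthetic groups of 50 points each in two dimensions.
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
seeds <- kmeanspp_seeds(x, k = 2L)
```

By contrast, `"random"` would correspond to drawing all `k` rows uniformly, as in `x[sample.int(nrow(x), k), ]`.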

`n_init`

An integer value specifying the number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of `n_init` consecutive runs in terms of inertia. Defaults to `10L`.

`max_iter`

An integer value specifying the maximum number of iterations of the k-means algorithm for a single run. Defaults to `300L`.

`tol`

A numeric value specifying the relative tolerance with respect to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. Defaults to `1e-4`.

`verbose`

An integer value specifying the level of verbosity. Defaults to `0L`, which disables verbose output.

`random_state`

An integer value specifying the initial seed of the random number generator. Defaults to `NULL`, which uses the current timestamp.

`copy_x`

A boolean value specifying whether the original data should be copied rather than modified in place. When pre-computing distances, it is more numerically accurate to center the data first. If `copy_x` is `TRUE`, then the original data is not modified. If `copy_x` is `FALSE`, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if `copy_x` is `FALSE`. If the original data is sparse but not in CSR format, a copy will be made even if `copy_x` is `FALSE`. Defaults to `TRUE`.

`algorithm`

A string specifying the k-means algorithm to use. The classical EM-style algorithm is `"lloyd"`. The `"elkan"` variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However, it is more memory-intensive due to the allocation of an extra array of shape \(\mathrm{n\_samples} \times \mathrm{n\_clusters}\). Defaults to `"lloyd"`.

`bisecting_strategy`

A string specifying how bisection should be performed. Choices are:

- `"biggest_inertia"`: always checks all computed clusters for the one with the biggest SSE (sum of squared errors) and bisects it. This approach concentrates on precision but may be costly in terms of execution time, especially for large numbers of data points;
- `"largest_cluster"`: always splits the cluster with the largest number of points assigned to it among all clusters previously computed. This should be faster than picking by SSE and may produce similar results in most cases.

Defaults to `"biggest_inertia"`.
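The difference between the two bisecting strategies can be sketched in base R with `stats::kmeans()` as the two-way splitter. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: the helper name `bisecting_kmeans` is hypothetical, and the choice of `nstart = 10L` and `iter.max = 100L` is arbitrary.

```r
# Simplified bisecting k-means (illustrative; not the scikit-learn code).
bisecting_kmeans <- function(x, n_clusters,
                             strategy = c("biggest_inertia", "largest_cluster")) {
  strategy <- match.arg(strategy)
  stopifnot(n_clusters >= 1L, n_clusters <= nrow(x))
  labels <- rep(1L, nrow(x))  # start with a single cluster
  while (max(labels) < n_clusters) {
    # Score every current cluster according to the chosen strategy.
    score <- vapply(seq_len(max(labels)), function(id) {
      pts <- x[labels == id, , drop = FALSE]
      if (nrow(pts) < 2L) return(-Inf)  # a singleton cannot be split
      if (strategy == "largest_cluster") {
        as.numeric(nrow(pts))               # number of points
      } else {
        sum(sweep(pts, 2, colMeans(pts))^2) # SSE around the cluster mean
      }
    }, numeric(1))
    # Bisect the highest-scoring cluster with a plain 2-means run.
    idx <- which(labels == which.max(score))
    split <- stats::kmeans(x[idx, , drop = FALSE], centers = 2L,
                           nstart = 10L, iter.max = 100L,
                           algorithm = "Lloyd")
    labels[idx[split$cluster == 2L]] <- max(labels) + 1L
  }
  labels
}

# Three synthetic groups of 30 points each in two dimensions.
set.seed(1)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)
lab_sse  <- bisecting_kmeans(x, 3L, "biggest_inertia")
lab_size <- bisecting_kmeans(x, 3L, "largest_cluster")
```

The `"largest_cluster"` branch only needs a point count per cluster, whereas `"biggest_inertia"` has to compute a sum of squared deviations for each cluster at every step, which is where its extra cost comes from.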