Unsectioned

OPTICS reachability distance

Front

Active users

1

All-time users

1

Favorites

0

Last updated

3 years ago

Date created

Oct 4, 2021

Unsectioned

(31 cards)

OPTICS reachability distance

Reachability Distance of object \(p\) from core object \(q\) is the min radius value that makes \(p\) directly density-reachable from \(q\). Reachability distance is undefined if \(q\) is not a core object. Otherwise,

\(\text{Reachability-distance}_{\epsilon, \text{MinPts}}(p, q) = \max(\text{Core-distance}(q), \text{distance}(q, p))\)

Gini split

$$\text{gini}_\text{split}(T) = \frac{N_1}{N}\text{gini}(T_1) + \frac{N_2}{N}\text{gini}(T_2)$$

where \(T\) is split into two subsets \(T_1\) and \(T_2\) by an attribute. The attribute that provides the smallest \(\text{gini}_\text{split}(T)\) is chosen to split the node.

Entropy of attribute \(A\) with values \(\{a_1, a_2, \ldots, a_v\}\)

$$E(A) = \sum_{j=1}^v \frac{s_{1j} + \ldots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})$$

where \(s_{ij}\) is training examples in class \(C_i\) and attribute value \(A = a_j\)

Density-reachable

Directly density-reachable: \(p_1 \rightarrow p_2, p_2 \rightarrow p_3, \ldots, p_{n-1} \rightarrow p_n\)

Then \(p_n\) is **density-reachable** from \(p_1\)

Apriori Algorithm

\(C_k\): Candidate itemset of size \(k\)

\(L_k\): Frequent itemset of size \(k\)

\(L_1\) = {frequent items}

for (\(k = 1\); \(L_k \ne \emptyset\); \(k\)++) do

\(C_{k+1}\) = candidates generated from \(L_k\)

for each transaction \(t\) in database do

incr count of all candidates in \(C_{k+1}\) contained in \(t\)

\(L_{k+1}\) = candidates in \(C_{k+1}\) with `min_support`

return \(\bigcup_kL_k\)

Directly density reachable core object

Core object \(p\): \(|\text{Neps}(p)| \geqslant \text{MinPts}\)

where

- Eps: max radius of neighborhood
- MinPts: Min points in Eps-neighborhood of point
- Neps\((p)\): \(\{q | \text{dist}(p, q) \leqslant \text{Eps}\}\)

Point \(q\) directly density-reachable from \(p\) iff \(q \in \text{Neps}(p)\) and \(p\) is a core object

CBA frequent ruleitems

A ruleitem is *frequent* if its support is above `minsup`

CBA possible rule

For all ruleitems that have the same `condset`

, the ruleitem with the highest confidence is the possible rule of this set of ruleitems

Gini index

$$\text{gini}(T) = 1 - \sum_{j=1}^n p_j^2$$

where \(T\) contains examples from \(n\) classes and \(p_j\) is the relative frequency of class \(j\) in \(T\)

CF-tree parameters

**Branching factor**: max # of children in each node**Threshold**: max diameter of subclusters at leaf nodes

Succinctness

Given \(A_1\), the set of items satisfying a succinctness constraint \(C\), any itemset \(S\) satisfying \(C\) must be based on \(A_1\), i.e., \(S\) contains a subset belonging to \(A_1\)

DBSCAN algorithm

- Arbitrarily select a point \(p\)
- Retrieve all points directly density-reachable from \(p\) wrt Eps and MinPts
- If \(p\) is a core point, a cluster is formed
- If \(p\) is a border point, no points are density-reachable from \(p\) and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed

PrefixSpan algorithm

PrefixSpan\((\alpha, i, \text{SDB}|_\alpha)\)

- Scan \(\text{SDB}|_\alpha\) for set of frequent items \(b\) s.t.
- \(b\) can be assembled to last element of \(\alpha\) or
- \(<b>\) can be appended to \(\alpha\) to form a seq pattern

- For each \(b\), append to \(\alpha\) to form seq pattern \(\alpha'\), output
- For each \(\alpha'\), construct \(\text{SDB}|_{\alpha'}\), call PrefixSpan\((\alpha', i+1, \text{SDB}|_{\alpha'})\)

CBA (classification based association) steps

- Discretize continuous attributes
- Generate all CARs (class association rules) that satisfy user-specified minimum support (minsup) and minimum confidence (minconf) constraints
- Build a classifier from the CARs

Information gained by branching on attribute \(A\)

$$\text{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$

CBA support and confidence

$$\begin{aligned}\text{support} &= \frac{\text{rulesupCount}}{|D|} \cdot 100\%\\\text{confidence} &= \frac{\text{rulesupCount}}{\text{condsupCount}} \cdot 100\%\end{aligned}$$

Information of tuple \(\langle s_1, s_2, \ldots, s_m \rangle\) where \(i = \{1, \ldots, m\}\) are the classes

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^m \frac{s_i}{s} \log_2 \frac{s_i}{s}$$

CBA ruleitem and rule

Ruleitem: `<condset, y>`

where `condset`

is a set of items, \(y\) is a class label. Each ruleitem represents a rule: `condset→y`

CBA CARs

All *possible rules* that are both frequent and accurate

Density-connected

Points \(p\) and \(q\) are density-reachable from \(o\), then \(p\) and \(q\) are **density-connected**

CBA accurate rule

A rule is *accurate* if its confidence is above `minconf`

Apriori algorithm candidate generation

\(L_{k-1}\) frequent itemsets are in order

- Self-join \(L_{k-1}\)

INSERT INTO \(C_k\)

SELECT \(p.\text{item}_1, p.\text{item}_2, \ldots, p.\text{item}_{k-1}, q.\text{item}_{k-1}\)

FROM \(L_{k-1} p, L_{k-1} q\)

WHERE \(p.\text{item}_1=q.\text{item}_1, \ldots, p.\text{item}_{k-2}=q.\text{item}_{k-2}, p.\text{item}_{k-1} < q.\text{item}_{k-1}\) - Pruning

For each itemset \(c\) in \(C_k\) do

For each (\(k-1\))-subsets \(s\) of \(c\) do if (\(s\) is not in \(L_{k-1}\)) then delete \(c\) from \(C_k\)

Clustering Feature Vector

$$CF = (N, \overrightarrow{LS}, \overrightarrow{SS})$$

\(N\): Number of data points

\(LS: \sum_{i=1}^N \overrightarrow{X}_i\)

\(SS: \sum_{i=1}^N \overrightarrow{X}_i^2\)

BIRCH clustering algorithm

- Scan DB to build initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Use arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Maximal Proper (MP) Submatrix

Submatrix obtained by removing the last non-zero entry.

- A CAM's MP submatrix is CAM
- A connected CAM's MP submatrix is connected

CBA `condsupCount`

The number of cases in \(D\) that contain `condset`

Anti-monotone

Constraint \(C\) is *anti-monotone *if when a pattern satisfies \(C\) then all of its sub-patterns do so too.

In other words, *anti-monotonicity: *If an itemset \(S\) **violates **the constraint, so does any of its supersets.

Max patterns vs Frequent Closed Patterns

**Max Patterns**: when supersets are infrequent

**Frequent Closed Patterns**: when supersets are less frequent

Data anti-monotone

Constraint \(c\) is *data anti-monotone* if, for a pattern \(p\), it cannot be satisfied by all items of transaction \(t\) in the \(p\)-projected database, including \(p\), then it cannot be satisfied by a subset of \(t\)'s projection on \(p\) either.

CBA `rulesupCount`

The number of cases in \(D\) that contain `condset`

and are labeled with class \(y\)

OPTICS core distance

Core distance of an object \(p\): the smallest value ϵ such that the ϵ-neighborhood of \(p\) has at least MinPts objects

$$\begin{aligned}N_\epsilon(p) &= \epsilon\text{-neighborhood of }p\end{aligned}$$

If \(\text{card}(N_\epsilon(p)) < \text{MinPts}\)

$$\text{Core-distance}_{\epsilon, \text{MinPts}}(p) = \text{Undefined}$$

Else

$$\begin{aligned}\text{Core-distance}_{\epsilon, \text{MinPts}}(p) = \text{MinPts-distance}(p)\end{aligned}$$