# Big Data Analytics


## Cards (31)

OPTICS reachability distance

Front

Reachability Distance of object $$p$$ from core object $$q$$ is the min radius value that makes $$p$$ directly density-reachable from $$q$$. Reachability distance is undefined if $$q$$ is not a core object. Otherwise,

$$\text{Reachability-distance}_{\epsilon, \text{MinPts}}(p, q) = \max(\text{Core-distance}(q), \text{distance}(q, p))$$

Back
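
A minimal sketch of this formula in Python, assuming points are coordinate tuples and that $$q$$'s core distance has already been computed (see the core-distance card at the end of the deck); `reachability_distance` is an illustrative name, not a library API:

```python
import math

def reachability_distance(p, q, core_dist_q):
    """Reachability distance of p from core object q.

    core_dist_q: q's core distance, or None if q is not a core object.
    Returns None ("undefined") when q is not a core object.
    """
    if core_dist_q is None:  # q is not a core object
        return None
    return max(core_dist_q, math.dist(q, p))
```

For points inside $$q$$'s core radius the core distance dominates the max; farther out, the actual distance does.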

Gini split

Front

$$\text{gini}_\text{split}(T) = \frac{N_1}{N}\text{gini}(T_1) + \frac{N_2}{N}\text{gini}(T_2)$$

where $$T$$ is split into two subsets $$T_1$$ and $$T_2$$ by an attribute. The attribute that provides the smallest $$\text{gini}_\text{split}(T)$$ is chosen to split the node.

Back
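
A worked sketch of both Gini cards in Python (see the Gini index card below for $$\text{gini}(T)$$ itself); `gini` and `gini_split` are illustrative names:

```python
def gini(labels):
    """gini(T) = 1 - sum over classes of the squared relative frequency."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(t1, t2):
    """Weighted Gini of splitting T into subsets T1 and T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# A perfectly separating split scores 0, the best possible value,
# so the attribute producing it would be chosen to split the node:
print(gini_split(["yes", "yes"], ["no", "no", "no"]))  # 0.0
```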

Entropy of attribute $$A$$ with values $$\{a_1, a_2, \ldots, a_v\}$$

Front

$$E(A) = \sum_{j=1}^v \frac{s_{1j} + \ldots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})$$

where $$s_{ij}$$ is the number of training examples in class $$C_i$$ with attribute value $$A = a_j$$, and $$s$$ is the total number of examples

Back

Density-reachable

Front

If there is a chain $$p_1 \rightarrow p_2 \rightarrow \ldots \rightarrow p_n$$ in which each $$p_{i+1}$$ is directly density-reachable from $$p_i$$ (wrt Eps and MinPts),

then $$p_n$$ is density-reachable from $$p_1$$

Back

Apriori Algorithm

Front

$$C_k$$: Candidate itemset of size $$k$$

$$L_k$$: Frequent itemset of size $$k$$

$$L_1$$ = {frequent items}

for ($$k = 1$$; $$L_k \ne \emptyset$$; $$k$$++) do

$$C_{k+1}$$ = candidates generated from $$L_k$$

for each transaction $$t$$ in database do

increment the count of all candidates in $$C_{k+1}$$ that are contained in $$t$$

$$L_{k+1}$$ = candidates in $$C_{k+1}$$ with min_support

return $$\bigcup_kL_k$$

Back
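
A compact runnable sketch of this loop in Python, assuming transactions are iterables of hashable items; `apriori` is an illustrative name, and the subset-pruning refinement from the candidate-generation card is omitted here for brevity:

```python
def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its count."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1 = {frequent items}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent = {}
    while Lk:
        for s in Lk:
            frequent[s] = sum(s <= t for t in transactions)
        k = len(next(iter(Lk))) + 1
        # C_{k+1}: candidates generated from L_k by self-join
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # L_{k+1}: candidates reaching min_support
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_support}
    return frequent

baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs", "milk"]]
print(apriori(baskets, 2))  # includes frozenset({'bread', 'milk'}): 2
```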

Directly density reachable core object

Front

Core object $$p$$: $$|N_{\text{Eps}}(p)| \geqslant \text{MinPts}$$

where

• Eps: max radius of the neighborhood
• MinPts: min number of points required in an Eps-neighborhood of a point
• $$N_{\text{Eps}}(p) = \{q \mid \text{dist}(p, q) \leqslant \text{Eps}\}$$

Point $$q$$ is directly density-reachable from $$p$$ iff $$q \in N_{\text{Eps}}(p)$$ and $$p$$ is a core object

Back
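
The same definitions as a small Python sketch (points as coordinate tuples; `n_eps`, `is_core`, and `directly_density_reachable` are illustrative names):

```python
import math

def n_eps(p, points, eps):
    """N_Eps(p) = {q | dist(p, q) <= Eps}."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(p, points, eps, min_pts):
    """p is a core object iff |N_Eps(p)| >= MinPts."""
    return len(n_eps(p, points, eps)) >= min_pts

def directly_density_reachable(q, p, points, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p)
    and p is a core object."""
    return math.dist(p, q) <= eps and is_core(p, points, eps, min_pts)
```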

CBA frequent ruleitems

Front

A ruleitem is frequent if its support is at least minsup

Back

CBA possible rule

Front

For all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems

Back

Gini index

Front

$$\text{gini}(T) = 1 - \sum_{j=1}^n p_j^2$$

where $$T$$ contains examples from $$n$$ classes and $$p_j$$ is the relative frequency of class $$j$$ in $$T$$

Back

CF-tree parameters

Front

• Branching factor: max # of children in each node
• Threshold: max diameter of subclusters at leaf nodes

Back

Succinctness

Front

Given $$A_1$$, the set of items satisfying a succinctness constraint $$C$$, any itemset $$S$$ satisfying $$C$$ must be based on $$A_1$$, i.e., $$S$$ contains a subset belonging to $$A_1$$

Back

DBSCAN algorithm

Front

1. Arbitrarily select a point $$p$$
2. Retrieve all points density-reachable from $$p$$ wrt Eps and MinPts
3. If $$p$$ is a core point, a cluster is formed
4. If $$p$$ is a border point, no points are density-reachable from $$p$$, and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been processed

Back
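
A minimal sketch of those five steps in Python (points as a flat list of coordinate tuples; `dbscan` and `region_query` are illustrative names, not a library API):

```python
import math

def region_query(points, i, eps):
    """Indices of the Eps-neighborhood of points[i] (includes i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise/border-only points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # step 5: already processed
        neighbors = region_query(points, i, eps)   # step 2
        if len(neighbors) < min_pts:
            labels[i] = -1                # step 4: border point or noise
            continue
        cluster += 1                      # step 3: p is a core point
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                      # expand via density-reachability
            j = seeds.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster
                j_neighbors = region_query(points, j, eps)
                if len(j_neighbors) >= min_pts:   # j is a core point too
                    seeds.extend(j_neighbors)
    return labels
```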

PrefixSpan algorithm

Front

PrefixSpan$$(\alpha, i, \text{SDB}|_\alpha)$$

1. Scan $$\text{SDB}|_\alpha$$ once, find the set of frequent items $$b$$ such that
• $$b$$ can be assembled to the last element of $$\alpha$$ to form a sequential pattern, or
• $$\langle b \rangle$$ can be appended to $$\alpha$$ to form a sequential pattern
2. For each frequent item $$b$$, append it to $$\alpha$$ to form a sequential pattern $$\alpha'$$ and output $$\alpha'$$
3. For each $$\alpha'$$, construct $$\text{SDB}|_{\alpha'}$$ and call PrefixSpan$$(\alpha', i+1, \text{SDB}|_{\alpha'})$$
Back
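
A simplified runnable sketch in Python, restricted to sequences whose elements are single items (so only the "append $$\langle b \rangle$$" case of step 1 applies); `prefixspan` is an illustrative name:

```python
def prefixspan(sdb, min_support, prefix=()):
    """Yield (pattern, support) pairs; sdb is the alpha-projected database,
    each entry a tuple of items (the suffix after the prefix alpha)."""
    # Step 1: scan once, count items that can extend the current prefix
    counts = {}
    for seq in sdb:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        # Step 2: append b to alpha and output the pattern
        pattern = prefix + (item,)
        yield pattern, support
        # Step 3: project on the new prefix and recurse
        projected = [s[s.index(item) + 1:] for s in sdb if item in s]
        yield from prefixspan(projected, min_support, pattern)

db = [("a", "b", "c"), ("a", "c"), ("a", "b")]
print(dict(prefixspan(db, 2)))
# {('a',): 3, ('a', 'b'): 2, ('a', 'c'): 2, ('b',): 2, ('c',): 2}
```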

CBA (classification based association) steps

Front

1. Discretize continuous attributes
2. Generate all CARs (class association rules) that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints
3. Build a classifier from the CARs

Back

Information gained by branching on attribute $$A$$

Front

$$\text{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$

Back
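
A worked sketch in Python combining this card with the $$I(\cdot)$$ and $$E(A)$$ cards; `info` and `gain` are illustrative names:

```python
from math import log2

def info(counts):
    """I(s1, ..., sm) = -sum (si/s) log2(si/s)."""
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

def gain(partitions):
    """Gain(A) = I(s1, ..., sm) - E(A).

    partitions[j] holds the per-class counts (s_1j, ..., s_mj) of the
    subset with attribute value a_j."""
    totals = [sum(col) for col in zip(*partitions)]  # class counts of whole set
    s = sum(totals)
    e_a = sum(sum(p) / s * info(p) for p in partitions)  # E(A)
    return info(totals) - e_a

# Attribute with two values; per-value class counts (C1, C2):
print(round(gain([(2, 3), (4, 0)]), 3))  # 0.379: the split isolates a pure subset
```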

CBA support and confidence

Front

$$\begin{aligned}\text{support} &= \frac{\text{rulesupCount}}{|D|} \cdot 100\%\\\text{confidence} &= \frac{\text{rulesupCount}}{\text{condsupCount}} \cdot 100\%\end{aligned}$$

Back
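
A minimal sketch of the two counts in Python, assuming the dataset $$D$$ is a list of (items, label) cases; `support_confidence` is an illustrative name:

```python
def support_confidence(D, condset, y):
    """CBA support and confidence (in percent) of the rule condset -> y."""
    condset = set(condset)
    condsup_count = sum(condset <= items for items, _ in D)    # condsupCount
    rulesup_count = sum(condset <= items and label == y
                        for items, label in D)                 # rulesupCount
    return rulesup_count / len(D) * 100, rulesup_count / condsup_count * 100

D = [({"a", "b"}, "pos"), ({"a"}, "neg"), ({"a", "b"}, "pos")]
print(support_confidence(D, {"a", "b"}, "pos"))  # (66.66..., 100.0)
```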

Information of tuple $$\langle s_1, s_2, \ldots, s_m \rangle$$, where $$s_i$$ is the number of examples in class $$i$$, $$i \in \{1, \ldots, m\}$$

Front

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^m \frac{s_i}{s} \log_2 \frac{s_i}{s}$$

where $$s = s_1 + \ldots + s_m$$

Back

CBA ruleitem and rule

Front

Ruleitem: $$\langle \text{condset}, y \rangle$$

where condset is a set of items and $$y$$ is a class label. Each ruleitem represents a rule: $$\text{condset} \rightarrow y$$

Back

CBA CARs

Front

All possible rules that are both frequent and accurate

Back

Density-connected

Front

If there is a point $$o$$ such that both $$p$$ and $$q$$ are density-reachable from $$o$$ (wrt Eps and MinPts), then $$p$$ and $$q$$ are density-connected

Back

CBA accurate rule

Front

A rule is accurate if its confidence is at least minconf

Back

Apriori algorithm candidate generation

Front

Assume the items within each itemset of $$L_{k-1}$$ are listed in order.

1. Self-join $$L_{k-1}$$:

INSERT INTO $$C_k$$
SELECT $$p.\text{item}_1, p.\text{item}_2, \ldots, p.\text{item}_{k-1}, q.\text{item}_{k-1}$$
FROM $$L_{k-1}\ p$$, $$L_{k-1}\ q$$
WHERE $$p.\text{item}_1 = q.\text{item}_1, \ldots, p.\text{item}_{k-2} = q.\text{item}_{k-2}, p.\text{item}_{k-1} < q.\text{item}_{k-1}$$

2. Pruning:

for each itemset $$c$$ in $$C_k$$ do
for each ($$k-1$$)-subset $$s$$ of $$c$$ do
if $$s \notin L_{k-1}$$ then delete $$c$$ from $$C_k$$

Back
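
Both steps as a runnable Python sketch, with itemsets kept as sorted tuples; `gen_candidates` is an illustrative name:

```python
from itertools import combinations

def gen_candidates(L_prev):
    """Generate C_k from the frequent (k-1)-itemsets L_prev (sorted tuples)."""
    L_prev = set(L_prev)
    k_minus_1 = len(next(iter(L_prev)))
    # Step 1, self-join: merge pairs that agree on their first k-2 items
    joined = {p + (q[-1],)
              for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2, pruning: every (k-1)-subset of a candidate must be frequent
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k_minus_1))}

L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(gen_candidates(L2))  # {('a', 'b', 'c')} -- ('b', 'c', 'd') is pruned
```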

Clustering Feature Vector

Front

$$CF = (N, \overrightarrow{LS}, \overrightarrow{SS})$$

$$N$$: Number of data points

$$\overrightarrow{LS} = \sum_{i=1}^N \overrightarrow{X_i}$$: linear sum of the $$N$$ data points

$$\overrightarrow{SS} = \sum_{i=1}^N \overrightarrow{X_i}^2$$: square sum of the $$N$$ data points

Back
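
A 1-D Python sketch of CF vectors showing the additivity BIRCH relies on when merging subclusters, and how the centroid and variance fall out of $$(N, LS, SS)$$ alone; `cf` and `merge` are illustrative names:

```python
def cf(points):
    """Clustering feature of a set of 1-D points: (N, LS, SS)."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge(cf1, cf2):
    """CF vectors are additive: CF(A ∪ B) = CF(A) + CF(B), entrywise."""
    return tuple(a + b for a, b in zip(cf1, cf2))

n, ls, ss = merge(cf([1.0, 2.0]), cf([4.0]))
print(ls / n)                   # centroid = LS / N
print(ss / n - (ls / n) ** 2)   # variance, from N, LS, SS alone
```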

BIRCH clustering algorithm

Front

1. Scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
2. Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

Back

Maximal Proper (MP) Submatrix

Front

Submatrix obtained from a CAM (canonical adjacency matrix) by removing its last non-zero entry.

• A CAM's MP submatrix is a CAM
• A connected CAM's MP submatrix is connected

Back

CBA condsupCount

Front

The number of cases in $$D$$ that contain condset

Back

Anti-monotone

Front

Constraint $$C$$ is anti-monotone if, whenever a pattern satisfies $$C$$, all of its sub-patterns do too.

Equivalently: if an itemset $$S$$ violates the constraint, so does every superset of $$S$$.

Back
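
A tiny Python illustration of the pruning this licenses, using the classic anti-monotone constraint $$\text{sum}(S.\text{price}) \leqslant v$$ with non-negative prices (the prices and $$v$$ here are made up):

```python
prices = {"a": 3, "b": 5, "c": 2}
v = 7  # constraint C: sum of prices <= v (anti-monotone when prices >= 0)

def satisfies(itemset):
    return sum(prices[i] for i in itemset) <= v

# {"a", "b"} violates C (3 + 5 > 7), so every superset of it can be
# pruned from the search without ever being counted:
assert not satisfies({"a", "b"})
assert not satisfies({"a", "b", "c"})  # guaranteed by anti-monotonicity
```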

Max patterns vs Frequent Closed Patterns

Front

Max patterns: frequent patterns none of whose supersets is frequent

Frequent closed patterns: frequent patterns none of whose supersets has the same support (every superset is strictly less frequent)

Back

Data anti-monotone

Front

Constraint $$c$$ is data anti-monotone if, for a pattern $$p$$ and a transaction $$t$$ in the $$p$$-projected database, whenever $$c$$ cannot be satisfied even by extending $$p$$ with all the items of $$t$$'s projection, it cannot be satisfied by extending $$p$$ with any subset of that projection either, so $$t$$ can be pruned from the projected database

Back

CBA rulesupCount

Front

The number of cases in $$D$$ that contain condset and are labeled with class $$y$$

Back

OPTICS core distance

Front

Core distance of an object $$p$$: the smallest value $$\epsilon'$$ such that the $$\epsilon'$$-neighborhood of $$p$$ contains at least MinPts objects.

With $$N_\epsilon(p)$$ denoting the $$\epsilon$$-neighborhood of $$p$$:

$$\text{Core-distance}_{\epsilon, \text{MinPts}}(p) = \begin{cases}\text{Undefined} & \text{if } \text{card}(N_\epsilon(p)) < \text{MinPts}\\\text{MinPts-distance}(p) & \text{otherwise}\end{cases}$$

Back
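
A minimal sketch of this definition in Python (points as coordinate tuples); `core_distance` is an illustrative name, and it pairs with the reachability-distance sketch near the top of the deck:

```python
import math

def core_distance(p, points, eps, min_pts):
    """Smallest radius that puts at least MinPts objects in p's neighborhood.

    Returns None ("undefined") if even the eps-neighborhood holds fewer
    than MinPts objects; otherwise returns MinPts-distance(p)."""
    dists = sorted(math.dist(p, q) for q in points)  # includes p itself (0.0)
    if sum(d <= eps for d in dists) < min_pts:       # card(N_eps(p)) < MinPts
        return None
    return dists[min_pts - 1]                        # MinPts-distance(p)
```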