Last updated

6 years ago

Date created

Mar 1, 2020

Cards (213)

Section 1

(50 cards)

dir('dir')

Front

Lists the files in the directory specified

Back

All items in the array except the third.

Front

A[-3]

Back

seq(0, 100, by=1)

Front

Creates a vector of numerics from 0 to 100 that increment by 1.

Back

Simple Variable Assignment

Front

a = 1

Back

Element wise arithmetic

Front

Applying a function to every element of an array

Back

outer(y[1,], y[1,])

Front

The outer product of two vectors (results in a matrix)

Back

y$name

Front

the name variable in the list y

Back

typeof(a)

Front

Returns the type of variable a

Back

x[-1,]

Front

all rows but the first

Back

Factor variables

Front

Represent values from an ordered or unordered finite set

Back

seq(0, 1, length.out = 11)

Front

11 evenly spaced numbers between 0 and 1

Back

Multiple Variable Assignment on a Single Line

Front

a = 1; b = 2; c = 3

Back

Vector

Front

A one-dimensional ordered collection of values of the same type.

Back

Show example of a function X

Front

example(X)

Back

which(w<0.5)

Front

returns the indices of w whose values are less than 0.5; if none match, returns an empty integer vector, integer(0)

Back

all(w < 0.5)

Front

returns TRUE if all values of w are less than 0.5, otherwise FALSE

Back

y = x[c(1,2),c(1,2)]

Front

extracts the top left 2x2 of an array

Back

as.integer(b)

Front

Casts variable b to an integer

Back

Start the web based help page

Front

help.start()

Back

any(w < 0.5)

Front

returns TRUE if any values of w are less than 0.5, otherwise FALSE

Back
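
The which, all, and any cards can be compared side by side on one vector. This is a minimal sketch; the vector w below is made-up sample data, not part of the deck.

```r
# Hypothetical sample vector
w <- c(0.2, 0.7, 0.4, 0.9)

below     <- which(w < 0.5)  # positions of the matching elements: 1 3
none      <- which(w > 1)    # no matches gives integer(0), not 0
all_below <- all(w < 0.5)    # FALSE, because 0.7 and 0.9 fail the test
any_below <- any(w < 0.5)    # TRUE, because 0.2 and 0.4 pass the test
```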

x[2,]

Front

accesses the entire 2nd row

Back

Remove all variables from memory

Front

rm(list=ls())

Back

x = array(data = z, dim = c(4, 5))

Front

creates a two dimensional array with 4 rows and 5 columns

Back

Matrix transpose

Front

A reflection of the matrix across its main diagonal, so rows become columns; in R, t(x).

Back

Search the help pages for a specific term X

Front

help.search('X')

Back

Concatenation Function

Front

c()

Back

cbind(x[1,], x[1,])

Front

Horizontal concatenation of vectors.

Back

rep(3.2, times=10) function

Front

Repeats the value 3.2 10 times.

Back

ls.str()

Front

Shows variables in memory and their types

Back

y %*% y

Front

Matrix or Inner Product of y and y

Back

dir()

Front

Lists the files in the current directory

Back

List Signature

Front

An ordered list of variable types in the list.

Back

system('ls')

Front

Runs the ls command on the system

Back

x[2,3]

Front

access the 2nd row, 3rd column of a 2d array

Back

w[w>50] = 0

Front

set all values of w that are greater than 50 to 0

Back

length(A)

Front

Returns the length of array A

Back

Array

Front

A multi-dimensional generalization of vectors.

Back

is.array(w)

Front

returns TRUE if w is an array, otherwise FALSE

Back

Length of array A

Front

length(A)

Back

Set working directory to '/'

Front

setwd('/')

Back

List

Front

An ordered collection of variables of possibly different types.

Back

x <- matrix(c(1,2,3,4), nrow=2, ncol=2)

Front

creates a matrix (or array) that is 2x2

Back

rbind(x[1,], x[1,])

Front

Vertical concatenation of vectors.

Back
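
The array indexing and binding cards in this section can be tried together. The sketch below assumes z is the integer sequence 1:20 (the deck never defines z), so the 4x5 array is filled column by column.

```r
# Assumption: z is 1:20; the deck does not say what z holds
z <- 1:20
x <- array(data = z, dim = c(4, 5))

cell          <- x[2, 3]                 # row 2, column 3
second_row    <- x[2, ]                  # the entire 2nd row
without_first <- x[-1, ]                 # all rows but the first
top_left      <- x[c(1, 2), c(1, 2)]     # top-left 2x2 block

side_by_side <- cbind(x[1, ], x[1, ])    # vectors become columns: 5 x 2
stacked      <- rbind(x[1, ], x[1, ])    # vectors become rows:    2 x 5
```

Because R stores arrays in column-major order, x[2, 3] is the 10th element of z here.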

Variable assignment with arrow (<-) syntax

Front

a <- 1

Back

Remove a specific variable from memory

Front

rm(a)

Back

y = list(name="Mike", title="badass")

Front

creates a list with two named elements

Back

Dataframe

Front

An ordered collection of lists having the same list signature.

Back

List all variables in memory

Front

ls()

Back

Vector function example

Front

y = vector(mode="logical", length=4)

Back

Declaring a Vector

Front

A = c(1, 2, 3)

Back

Section 2

(50 cards)

while (b > a) { }

Front

Repeats the code block while b > a, i.e. until b <= a

Back

Variable passing

Front

R passes variables to functions by value.

Back
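
Pass-by-value can be seen by modifying an argument inside a function and checking the caller's copy afterwards. This is a small sketch; add_one and the vector names are hypothetical.

```r
# Hypothetical helper: changes its own copy of v, not the caller's variable
add_one <- function(v) {
  v[1] <- v[1] + 1
  v
}

original <- c(1, 2, 3)
modified <- add_one(original)

original  # still 1 2 3
modified  # 2 2 3
```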

L[[1]]

Front

Returns the first variable from the List L. If the first variable were a character, the returned type would be a character.

Back

log(10)

Front

natural log of 10

Back

for (num in seq(1, 100, by=1)) { print("hello world") }

Front

A for loop that prints "hello world" 100 times.

Back

n = c('Mike', 'Mark')
a = c(30, 40)
s = c(1000, 2000)
R = data.frame(name=n, age=a, salary=s)
names(R) = c('Name', 'Age', 'Salary')
R

Front

Creates a data frame, essentially a table:
  Name Age Salary
1 Mike  30   1000
2 Mark  40   2000

Back

Vectorized code

Front

Code, like element wise operations in arrays, that avoids computations in loops in the interpreter. Vectorized code is executed natively, like in C.

Back
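
As a rough illustration of the difference, the sketch below squares a vector twice: once with an explicit loop run by the interpreter and once with a single vectorized expression. Both give the same values; the vectorized form is usually the faster one, though no timings are claimed here.

```r
x <- seq(0, 100, by = 1)

# Loop version: the interpreter handles one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one call, with the loop done in compiled code
squares_vec <- x^2
```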

unname(L)

Front

Strips out the names of attributes in the list or dataframe.

Back

iris$Sepal.Length[1:10]

Front

The first 10 Sepal Length values. A vector.

Back

save.image(file='fname')

Front

Saves an image of the current working memory (i.e. all variables) to a file named fname.

Back

L$name

Front

Prints the name variable in the list L.

Back

a = 1:10

Front

Creates a vector of 10 integers, 1 through 10.

Back

L['name']

Front

Returns a list of length 1 containing the 'name' element of the list L. Use L[['name']] or L$name for the element itself.

Back

repeat { }

Front

Repeats code block until break is called.

Back
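
The while and repeat cards can be contrasted with a small counter example. The variable names and the limit of 3 are arbitrary choices for this sketch.

```r
# while: the condition is checked before every pass
a <- 0
b <- 3
while (b > a) {
  a <- a + 1
}

# repeat: no condition at all, so the body must call break itself
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
```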

exp(0)

Front

e raised to the power of 0, i.e. 1

Back

a = c(1,2); b = c(1,2,3,4); c = a+b; c

Front

2 4 4 6

Back
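
The result 2 4 4 6 comes from R's recycling rule: the shorter vector is repeated until it matches the longer one. The sketch below makes the recycled vector explicit; the name recycled is only for illustration.

```r
a <- c(1, 2)
b <- c(1, 2, 3, 4)

# What R effectively does to the shorter vector before adding
recycled <- rep(a, length.out = length(b))  # 1 2 1 2

total <- a + b  # 2 4 4 6
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.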

Iris = read.table('someFile.txt', header=TRUE)

Front

Reads someFile.txt into Iris as a dataframe and expects a header row to be used as the column names.

Back

mypower = function(bas = 10, pow = 2) { return(bas ^ pow) }

Front

Defines a function, mypower, that takes 2 arguments, bas and pow and returns bas ^ pow. bas has a default value of 10, pow has a default value of 2.

Back
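
A few calls show how the defaults and named arguments of mypower interact. The definition is the one from the card; the variable names holding the results are illustrative.

```r
mypower = function(bas = 10, pow = 2) { return(bas ^ pow) }

both_defaults <- mypower()                  # 10 ^ 2 = 100
positional    <- mypower(2, 3)              # 2 ^ 3 = 8
named_swapped <- mypower(pow = 3, bas = 2)  # order is irrelevant when named: 8
one_default   <- mypower(pow = 3)           # bas falls back to 10: 1000
```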

mean(Sepal.Length)

Front

The average of the Sepal Length values. Assumes iris has been attached; otherwise use mean(iris$Sepal.Length).

Back

iris = edit(iris)

Front

Edits the iris dataset in a spreadsheet and sets the resulting dataset to the iris variable.

Back

dim(subset(iris, Sepal.Length < 5 & Species != 'setosa'))[1]

Front

Returns the number of rows of the resulting subset. 2

Back

colMeans(iris[,1:4])

Front

The means of the first 4 columns of iris.

Back

pi

Front

A global variable representing pi.

Back

b = c(1,2,3); b[5] = 5; b

Front

1 2 3 NA 5

Back

cat()

Front

Prints arguments one after the other. e.g. cat("Hello", "World")

Back

edit(iris)

Front

Opens the dataset in a spreadsheet editor.

Back

load('fname')

Front

Loads working data into memory from filename fname.

Back

iris$Sepal.Length

Front

All Sepal Length values in the iris dataset. A vector.

Back

subset(iris, Sepal.Length < 5 & Species != 'setosa')

Front

Returns a subset of the iris dataframe which includes rows where the Sepal Length is less than 5 and the Species is not setosa.
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
58           4.9         2.4          3.3         1.0 versicolor
107          4.9         2.5          4.5         1.7  virginica

Back
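
The same filter can be written with logical indexing, which may help when checking what subset is doing. This sketch uses the built-in iris dataset; the result names are illustrative.

```r
# subset() evaluates the condition inside the data frame
small_non_setosa <- subset(iris, Sepal.Length < 5 & Species != "setosa")

# Equivalent logical indexing with explicit column references
same_rows <- iris[iris$Sepal.Length < 5 & iris$Species != "setosa", ]

n_rows <- nrow(small_non_setosa)  # same as dim(small_non_setosa)[1]
```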

names(L)=c('a','b','c','d')

Front

Overrides the variable names in the list L with a, b, c, d.

Back

sink('outFile', split=TRUE)

Front

Sends console output to the 'outFile' file AND to the console.

Back

is.vector(a)

Front

Returns TRUE if a is a vector, FALSE otherwise.

Back

L = list(name='mike', age=100, no.children=2, children.ages=c(60,50))

Front

Creates a new list with a number of properties.

Back

iris dataset

Front

A core package dataset that includes flower measurements of different flower species.

Back

foo(1,2,3)

Front

Calls function foo, passing in the arguments 1, 2, 3. If any arguments are omitted, their default values are used.

Back

foo(name='mike', age=36)

Front

Calls function foo, passing in arguments name and age. Note: order does not matter when arguments are named.

Back

exp(1)

Front

e raised to the power of 1, i.e. 2.718282

Back

sink('outFile')

Front

Sends console output to the 'outFile' file, not to the console.

Back

log10(100)

Front

log base 10 of 100, i.e. 2

Back

L[1]

Front

Returns a List with the first variable from the List L in it. The returned type is a List.

Back
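
The difference between L[1] and L[[1]] is easiest to see by checking the class of each result. The list below is the one defined on the earlier card in this section.

```r
L = list(name = 'mike', age = 100, no.children = 2, children.ages = c(60, 50))

first_element <- L[[1]]   # the element itself: a character value
first_sublist <- L[1]     # a list of length 1 that still carries the name

class(first_element)      # "character"
class(first_sublist)      # "list"

by_dollar  <- L$name      # same value as L[["name"]]
by_bracket <- L["name"]   # same list as L[1]
keys       <- names(L)
```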

summary(iris)

Front

A useful statistical summary of the dataset iris.

Back

dim(iris)

Front

Returns the number of rows and columns of the iris dataset. 150 5

Back

tail(iris, 4)

Front

the last 4 rows of the iris dataset

Back

history(10)

Front

Displays the 10 most recently executed commands.

Back

attach(iris, warn.conflicts=FALSE)

Front

Attaches the iris dataset to the search path, so its columns can be referenced directly, e.g. Sepal.Length instead of iris$Sepal.Length.

Back

names(L)

Front

Returns the names (keys) of the elements in the list L.

Back

system.time(function)

Front

Displays the time it took to execute function.

Back

head(iris, 4)

Front

the first 4 rows of the iris dataset

Back

R CMD SHLIB foo.c

Front

Compiles foo.c into a shared object file, foo.so, whose routines can be called from R with the .C or .Call functions. e.g. dyn.load('foo.so'); .C('foo',...)

Back

sapply(data, function)

Front

Applies the supplied function to each element of the data provided and simplifies the result to a vector or matrix where possible.

Back
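
A short sketch of sapply on a vector and on the columns of a data frame. It uses the built-in iris dataset and, for the columns, should agree with the colMeans card.

```r
# On a vector: the function is applied to each element
squares <- sapply(1:3, function(v) v^2)   # 1 4 9

# On a data frame: the function is applied to each column
col_means <- sapply(iris[, 1:4], mean)
```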

Section 3

(50 cards)

lines

Front

A low-level function in the graphics package that adds a line plot to an existing graph.

Back

low-level functions

Front

In the graphics package, functions that edit graphs.

Back

S = sort.int(mpg$cty, index.return=T)

Front

Returns a list with two components: x, the sorted values of mpg$cty, and ix, the position of each sorted value in the original vector.

Back
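
The x and ix components can be inspected on a small vector. The sketch below uses made-up city mileage values in place of mpg$cty so that it runs without ggplot2.

```r
# Hypothetical stand-in for mpg$cty
cty <- c(18, 21, 20, 24, 16)

S <- sort.int(cty, index.return = TRUE)

S$x    # sorted values: 16 18 20 21 24
S$ix   # where each sorted value came from: 5 1 3 2 4

# Indexing the original vector by ix reproduces the sorted values,
# which is why mpg$hwy[S$ix] reorders hwy to match the sorted cty
reordered <- cty[S$ix]
```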

title(main="Some title")

Front

Adds a title to the current plot.

Back

hist(data_frame$x, xlab="x label", ylab="y label", main="title", breaks=20)

Front

Uses the graphics package to generate a histogram of the x dimension of data_frame. breaks=20 suggests about 20 bins; R rounds the break points to "pretty" values, so the actual number of bins may differ.

Back

xlab

Front

Defines the x label attribute in the plot function.

Back

ylab

Front

Defines the y label attribute in the plot function.

Back

ggplot(data_frame, aes(x=x, y=..density..)) + geom_histogram(binwidth=4)

Front

Creates a plot of the x attribute of data_frame and shows the probability distribution of the bin values on the y axis and displays as a histogram of binwidth 4.

Back

curve(sinc, -7, 7)

Front

Creates a line plot using the graphics package that applies values from -7 to 7 to the function sinc.

Back

main

Front

Defines the title string in the plot function.

Back

title

Front

A low-level function in the graphics package that modifies the title of a graph.

Back

microbenchmark(function)

Front

Displays the execution time of function in microseconds.

Back

graphics package

Front

The default visualization package in R. Harder to use than ggplot2 but may run faster.

Back

legend

Front

A low-level function in the graphics package that connects symbols, colors, and line types to descriptive strings.

Back

breaks

Front

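In a histogram (hist) plot, the argument that controls the bins. A single number is a suggested number of bins; a vector gives the exact break points between bins. The number of bins equals the number of break points minus 1.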

Back
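
The relationship between break points and bins can be checked without drawing anything by passing plot = FALSE. This sketch uses the built-in faithful dataset; the break points in the second call are an arbitrary choice that covers the data.

```r
# A single number is only a suggestion; R picks "pretty" break points near it
h_suggested <- hist(faithful$waiting, breaks = 20, plot = FALSE)

# A vector fixes the break points exactly: 7 break points give 6 bins
h_exact <- hist(faithful$waiting, breaks = seq(40, 100, by = 10), plot = FALSE)

n_breaks <- length(h_exact$breaks)   # 7
n_bins   <- length(h_exact$counts)   # 6
```

In both cases the number of bins is one fewer than the number of break points.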

datasets package

Front

A package of datasets that come installed by default in R.

Back

ggplot(data_frame, aes(x=x, y=y)) + geom_point()

Front

Creates a scatter plot with ggplot and adds point geometry to it.

Back

.C("fooC", A, B, C, D)

Front

Calls the fooC c program passing input A, B, C, and result D.

Back

stopifnot(boolean)

Front

Similar to an assert statement. Stops the program if the boolean is not TRUE.

Back

.Rprofile

Front

Placed in the user's home directory, can be used to define .First and .Last functions.

Back

lines(mpg$hwy[S$ix], lty=1)

Front

Adds a line representing highway mileage to an existing plot using line type 1.

Back

bin width

Front

Good bin width balances information loss with good aggregation.

Back

qplot

Front

A function from the ggplot2 package that produces a scatter plot by default.

Back

library(ggplot2)

Front

Loads the ggplot2 library into memory.

Back

mtcars

Front

A dataset of car model data from the Motor Trend Magazine in 1974. Part of the datasets package.

Back

options(expressions=500000)

Front

Sets the maximum number of nested recursive calls to 500,000.

Back

#pragma omp parallel for

Front

A directive placed before a for loop in C that multi-threads the for loop. Part of the OpenMP extension.

Back

ggplot

Front

A function from the ggplot2 package that returns a graphics object that may be modified by adding layers to it. Provides automatic axes labeling and legends.

Back

high-level functions

Front

In the graphics package, functions that produce graphs, e.g. plot, hist, or curve.

Back

ggplot2

Front

A visualization package that may be simpler to use than graphics. It may also run slower, however. It is based on the Grammar of Graphics by Wilkinson (2005).

Back

diamonds

Front

A dataset from the ggplot2 package that lists the details of roughly 54,000 round cut diamonds.

Back

install.packages('ggplot2'); library(ggplot2)

Front

Install the ggplot2 package and bring into scope.

Back

Histograms vs Strip Plots

Front

Histograms discard the ordering of the data.

Back

print

Front

A function that will print a ggplot graph.

Back

source('h1.R')

Front

Loads the R code in h1.R into the interpreter and runs it.

Back

ggplot(data_frame, aes(x=x)) + geom_histogram(binwidth=1)

Front

Creates a plot of the x attribute of data_frame and displays it as a histogram.

Back

qplot(x = x, data=data_frame, binwidth=3, main="title")

Front

Uses the ggplot2 package to generate a histogram from dimension x of data_frame.

Back

Strip Plot

Front

A plot that maps the ordered index of a data frame against a single attribute. The strip plot can expose strips or lines of similar data.

Back

mpg

Front

A dataset of car model data from fueleconomy.gov. Part of the ggplot2 package.

Back

plot(S$x, type="l", lty=2, xlab="x label", ylab="y label")

Front

Creates a line plot of the S$x dimension with line type of 2.

Back

qplot(x=x, y=y, data=data_frame, main="title feature", geom="point")

Front

Creates a basic scatter plot with ggplot's qplot function.

Back

legend("topleft", c("highway mpg", "city mpg"), lty=c(1,1))

Front

Creates a legend in the top left corner of an existing plot with the labels and line types given.

Back

.First

Front

Function in .Rprofile that gets executed when R starts.

Back

dyn.load("fooC.so")

Front

Loads the fooC.so shared object file into memory.

Back

.Last

Front

Function in .Rprofile that executes when R stops.

Back

plot(x=data_frame$x, y=data_frame$y)

Front

Uses the high-level plot function from the graphics package to produce a simple scatter plot.

Back

aes

Front

A function that accepts data variables as arguments and is passed to ggplot.

Back

Histogram

Front

A one dimensional plot that groups data into bins and shows counts of those bins.

Back

grid

Front

A low-level function in the graphics package that adds grid lines to a graph.

Back

faithful

Front

A dataset of eruption times of Old Faithful in Yellowstone National Park, Wyoming, USA. Part of the datasets package.

Back

Section 4

(50 cards)

numeric variables

Front

Variables that are real valued. The difference between two numeric values is expected to be their Euclidean distance, abs(b-a).

Back

Outlier

Front

A data item is an outlier if it is below the alpha percentile or above the (100 - alpha) percentile.

Back

na.omit(dataframe)

Front

Returns a new dataframe that omits all rows with missing data.

Back

IQR

Front

Inter quartile range

Back

Indicator Variables

Front

Breaking out categories into separate binary variables. For example, if we had 6 height categories (A-F), we would create 5 height category binary indicators (all 0 would mean category was A). Category variable would be set to 1 for appropriate category.

Back
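
One way to build such indicators in R is model.matrix, which by default uses the first factor level as the all-zero baseline. This is a sketch under that assumption; the height values are made up.

```r
# Hypothetical data: 6 height categories, A through F
height <- factor(c("A", "C", "F", "B"), levels = LETTERS[1:6])

# Drop the intercept column, keeping the 5 indicator columns (B to F)
indicators <- model.matrix(~ height)[, -1]

indicators[1, ]   # category A: all zeros
indicators[3, ]   # category F: only heightF is 1
```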

Outliers

Front

Two sources:
1.) Errors in entering data (e.g. a human typing data in)
2.) Data that is not corrupt but is highly unlikely

Back

Replacing missing data for MAR

Front

Methods may introduce systematic bias into the data.

Back

qplot(x, y, geom=c("line"))

Front

Creates a line chart from the variables x and y.

Back

geom_point()

Front

Adds points to a ggplot plot.

Back

MCAR

Front

Missing Completely At Random. Missingness does not depend on observed or unobserved variables.

Back

ggsave("filename.pdf")

Front

Saves the current graph as a pdf.

Back

geom_line()

Front

Adds a line to a ggplot plot.

Back

complete.cases(dataframe)

Front

Returns a vector whose components are FALSE for all data rows missing data and TRUE for all data rows with no missing data.

Back

ordinal variables

Front

variables representing measurements in a certain range R with a well defined order relation. e.g. the seasons.

Back

kernel function

Front

smooths a distribution

Back

Dealing with Outliers

Front

1.) Truncate - drop them
2.) Winsorization - reset outliers to the most extreme value remaining among the non-outliers
3.) Robustness - analyze data with a robust procedure

Back
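
Truncation and winsorization can be sketched with percentile bounds. The function names, the 5% cutoff, and the sample data are all assumptions for illustration, and this variant clamps outliers to the percentile bounds themselves, which is one common form of winsorization rather than the only one.

```r
# Hypothetical helpers using the alpha and (1 - alpha) quantiles as bounds
truncate_outliers <- function(x, alpha = 0.05) {
  bounds <- quantile(x, probs = c(alpha, 1 - alpha))
  x[x >= bounds[1] & x <= bounds[2]]            # drop values outside the bounds
}

winsorize <- function(x, alpha = 0.05) {
  bounds <- quantile(x, probs = c(alpha, 1 - alpha))
  pmin(pmax(x, bounds[1]), bounds[2])           # clamp values to the bounds
}

x <- c(1:19, 1000)            # one extreme value
x_trunc <- truncate_outliers(x)
x_wins  <- winsorize(x)
```

Truncation shortens the vector, while winsorization keeps its length and only pulls the extreme values inward.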

Problems with Data sets

Front

1.) Missing Data
2.) Outliers
3.) Highly skewed data

Back

log-log Relationship

Front

log log plot can reveal linear relationships between variables that are otherwise hard to see. e.g. qplot(brain, body, log = "xy", data = Animals)

Back

Sampling

Front

Selecting rows at random.

Back

sample function

Front

sample(data, number to sample, replace=TRUE/FALSE)

Back

Raster graphics

Front

Stored as a grid of pixels (e.g. PNG), so the resolution is fixed. Lower resolution but often smaller file sizes.

Back

data shuffling

Front

randomly mixing up data frame rows

Back

Sampling without Replacement

Front

Rows can only be sampled once.

Back

Vector graphics

Front

Stored as shapes (e.g. PDF, SVG), so they scale to any size. Higher resolution but often larger file sizes.

Back

ggplot(data_frame, aes(x=x, y=y)) + geom_line() + geom_point()

Front

Creates a line plot with points from the x and y dimensions of data_frame.

Back

pch, col, cex

Front

Shape, Color, Size of points in a scatter plot

Back

Replacing missing data for MCAR

Front

Any of the 3 missing data techniques are OK.

Back

na.rm

Front

An option passed to functions, e.g. mean(x, na.rm=TRUE), that causes the function to ignore missing (NA) values.

Back

Sampling with Replacement

Front

When you sample, you can sample the same rows multiple times.

Back

data partitioning

Front

splitting data into two sets, like 75% in one and 25% in another

Back
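
Shuffling, sampling with and without replacement, and a 75/25 partition can all be done with sample on row indices. This sketch uses the built-in iris dataset; the seed and variable names are arbitrary.

```r
set.seed(1)        # arbitrary seed so the sketch is repeatable
n <- nrow(iris)    # 150 rows

# Shuffling: a random permutation of all row indices
shuffled <- iris[sample(n), ]

# Sampling with replacement: the same row may appear more than once
boot_idx <- sample(n, size = n, replace = TRUE)

# Partitioning: 75% of rows without replacement, the rest in the other set
train_idx <- sample(n, size = floor(0.75 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
```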

qplot(waiting, eruptions, data=faithful)

Front

Scatterplot of faithful waiting vs eruptions

Back

Standard Deviation with Outliers

Front

First remove the most extreme values, then calculate the standard deviation.

Back

Faceting

Front

Displaying multiple panels in the same graph.

Back

qplot(x, y, geom=c("line", "point"))

Front

Creates a line plot of x and y with points.

Back

categorical variables

Front

variables that do not satisfy the ordinal or numeric assumption. e.g. items on a restaurant menu.

Back

Robustness to Outliers

Front

Means a model is not sensitive to outliers. Mean is not robust. Median is robust.

Back
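
The claim about mean and median can be checked by replacing one value with an extreme one. The numbers below are made up for the sketch.

```r
clean        <- c(1, 2, 3, 4, 5)
with_outlier <- c(1, 2, 3, 4, 500)   # last value replaced by an outlier

mean(clean)            # 3
mean(with_outlier)     # 102: the mean is dragged toward the outlier

median(clean)          # 3
median(with_outlier)   # 3: the median does not move
```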

Binarization

Front

Replacing a value with 0 or 1 based on threshold.

Back

Box plot whiskers

Front

Extend no more than 1.5 times the IQR away from the edges of the box

Back

is.na(dataframe)

Front

Returns TRUE where dataframe value is NA and FALSE otherwise.

Back
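
The missing-data helpers from this section can be compared on one small data frame. The data frame df below is hypothetical sample data.

```r
# Hypothetical data frame with one NA in each column
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

na_flags  <- is.na(df)            # logical matrix, TRUE where a value is NA
row_ok    <- complete.cases(df)   # TRUE FALSE FALSE
cleaned   <- na.omit(df)          # keeps only the first row
mean_of_x <- mean(df$x, na.rm = TRUE)   # ignores the NA: (1 + 3) / 2 = 2
```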

Scatter Plot

Front

Plots two variables against each other as points.

Back

Smoothed Histogram

Front

$f_h : \mathbb{R} \to \mathbb{R}_+$, $f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K(x - x_i)$

Back
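
The formula can be turned into a few lines of R. The sketch below assumes a Gaussian kernel whose standard deviation plays the role of the bandwidth h; the card does not say which kernel is meant, and the function name and data are hypothetical.

```r
# Assumption: K is a Gaussian density with standard deviation h
smoothed_hist <- function(x, data, h = 0.5) {
  # For each evaluation point, average the kernel over all data points
  sapply(x, function(x0) mean(dnorm(x0 - data, sd = h)))
}

data <- c(1, 2, 2.5, 4)              # hypothetical observations
grid <- seq(-5, 10, by = 0.01)       # evaluation points
f    <- smoothed_hist(grid, data)

# The estimate is non-negative and its area is approximately 1
area <- sum(f) * 0.01
```

R's built-in density function offers a fuller implementation of the same idea.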

qqplot

Front

quantile-quantile plots

Back

Lambda in Power Transformation

Front

Lambda < 1 removes right skewness. Lambda > 1 removes left skewness. Smaller values of lambda are more aggressive in removing skewness.

Back
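
The effect of lambda on skewness can be sketched numerically. The card does not give the exact form of the transformation, so the version below is an assumption: x^lambda for nonzero lambda and log(x) for lambda = 0. The skewness helper and the sample data are also illustrative.

```r
# Assumed form of the power transformation (not stated on the card)
power_transform <- function(x, lambda) {
  if (lambda == 0) log(x) else x^lambda
}

# Simple moment-based sample skewness
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

x <- exp(seq(0, 5, by = 0.5))   # right-skewed sample data

skew_raw  <- skewness(x)                        # strongly right skewed
skew_half <- skewness(power_transform(x, 0.5))  # lambda < 1 reduces the skew
skew_log  <- skewness(power_transform(x, 0))    # smaller lambda reduces it further
```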

qplot(x, y, geom="line")

Front

Creates a line plot of x and y.

Back

MAR

Front

Missing At Random. Missingness depends on observed variables only.

Back

Ways to Deal with Missing Data

Front

1.) Remove data with missing values.
2.) Replace missing values with a substitute (e.g. mean).
3.) Estimate a probability model and replace with values from that model.

Back

Binning (or discretization)

Front

Taking a numeric variable (real number), dividing its range into several bins, and replacing it with a number representing the corresponding bin.

Back
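
Binning and binarization can both be written in one line each. The values, bin edges, and threshold below are made up for the sketch.

```r
x <- c(1.2, 3.5, 7.8, 9.1)   # hypothetical numeric variable

# Binning: the bins are (0, 5] and (5, 10]; labels = FALSE returns bin numbers
bins <- cut(x, breaks = c(0, 5, 10), labels = FALSE)   # 1 1 2 2

# Binarization: 1 if the value exceeds the threshold, 0 otherwise
binary <- as.integer(x > 5)                            # 0 0 1 1
```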

Power Transformation

Front

A data transformation for dealing with skewed data.

Back

dataframe concatenation

Front

taking two dataframes with identical columns and combining them together

Back

plot(faithful$waiting, faithful$eruptions, xlab="waiting time (min)", ylab="eruption time (min)")

Front

Creates a scatter plot with the graphics package which puts waiting time on the x axis and eruption time on the y axis.

Back

Section 5

(13 cards)

Split-apply-combine

Front

Split a dataframe into segments, apply some operation to each segment, recombine segments into one array or dataframe.

Back

reshape2

Front

A package for converting data between tall and wide formats (melt, acast, dcast).

Back

nrow()

Front

the number of rows in a dataframe

Back

unlist()

Front

turns a list into a vector

Back

acast / dcast

Front

converts from tall to wide e.g. dcast(tipsm, sex+time~variable, fun.aggregate = mean, margins = TRUE)

Back

dataframe joining

Front

Combining two dataframes whose columns are not identical by matching their rows on one or more shared columns.

Back

strsplit("a,b,c", ",")

Front

Splits the string on commas. Returns a list containing one character vector: "a" "b" "c".

Back

melt

Front

converts from wide to tall

Back

tall data

Front

Many rows, fewer columns. One or more columns act as an id, one other column acts as a value. e.g.
date  | product | sales
1/1/1 | apples  | 100
1/1/1 | oranges | 200
1/2/1 | apples  | 50
1/2/1 | oranges | 150

Back

wide data

Front

More columns. Basically a table. Categories are columns, indexes are the first column of a row, cell values represent corresponding values. e.g.
date   | apples | oranges
1/1/11 | 100    | 200
1/2/11 | 50     | 150

Back
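
The tall-to-wide conversion can be sketched without installing reshape2 by using base R's reshape function instead (a swapped-in technique, not the dcast call from the deck). The data frame mirrors the fruit sales example; the column names produced by reshape differ slightly from the card's table.

```r
# Tall form: one row per (date, product) pair
tall <- data.frame(
  date    = c("1/1/11", "1/1/11", "1/2/11", "1/2/11"),
  product = c("apples", "oranges", "apples", "oranges"),
  sales   = c(100, 200, 50, 150)
)

# Wide form: one row per date, one sales column per product
wide <- reshape(tall, idvar = "date", timevar = "product", direction = "wide")

names(wide)   # "date" "sales.apples" "sales.oranges"
```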

gsub(" ", "", x)

Front

replaces all spaces with nothing in string x

Back
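
The string helpers in this section fit together in a short sketch. The input strings are made up; unlist is used to turn the list returned by strsplit into a plain character vector.

```r
parts_list <- strsplit("a,b,c", ",")   # a list holding one character vector
parts      <- unlist(parts_list)       # "a" "b" "c"

x <- "a b c"
no_spaces <- gsub(" ", "", x)          # "abc"
```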

merge(df1, df2, by=0)

Front

Joins the dataframes on their row names (by=0 is equivalent to by="row.names").

Back
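
Joining and concatenation can be contrasted on two small data frames. The data frames and their columns below are hypothetical.

```r
df1 <- data.frame(id = c(1, 2, 3), name  = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))

# Join on a shared column: only ids present in both frames are kept
joined <- merge(df1, df2, by = "id")          # ids 2 and 3

# Join on row names: rows "1", "2", "3" of each frame are paired up
by_rownames <- merge(df1, df2, by = 0)        # adds a Row.names column

# Concatenation: identical columns, rows stacked on top of each other
stacked <- rbind(df1, df1)
```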

melt function

Front

Converts wide to tall: melt(data, id = id columns). e.g. melt(smiths, id=c(1,2,3))

Back