Defines a function, mypower, that takes two arguments, bas and pow, and returns bas ^ pow. bas has a default value of 10; pow has a default value of 2.
Back
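A minimal sketch of what such a definition could look like, reconstructed from the description above. The body is an assumption; only the name, the argument names, and the defaults come from the card.

```r
# Hypothetical reconstruction: two arguments, each with a default value
mypower <- function(bas = 10, pow = 2) {
  bas ^ pow
}

both_defaults <- mypower()         # uses bas = 10 and pow = 2
positional    <- mypower(2, 3)     # bas = 2, pow = 3
named_only    <- mypower(pow = 3)  # bas falls back to its default of 10
```

Calling with named arguments lets any subset of the defaults be overridden.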
mean(Sepal.Length)
Front
The average of Sepal Lengths.
Back
iris = edit(iris)
Front
Edits the iris dataset in a spreadsheet and sets the resulting dataset to the iris variable.
Back
dim(subset(iris, Sepal.Length < 5 & Species != 'setosa'))[1]
Front
Returns the number of rows of the resulting subset.
2
Back
colMeans(iris[,1:4])
Front
The means of the first 4 columns of iris.
Back
pi
Front
A built-in constant holding the value of pi.
Back
b = c(1,2,3)
b[5] = 5
b
Front
1 2 3 NA 5
Back
cat()
Front
Prints arguments one after the other. e.g. cat("Hello", "World")
Back
edit(iris)
Front
Opens the dataset in a spreadsheet editor.
Back
load('fname')
Front
Loads R objects previously saved with save() from the file fname into the workspace.
Back
iris$Sepal.Length
Front
All Sepal Length values in the iris dataset. A vector.
Back
subset(iris, Sepal.Length < 5 & Species != 'setosa')
Front
Returns a subset of the iris dataframe which includes rows where the Sepal Length is less than 5 and the Species is not setosa.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
58 4.9 2.4 3.3 1.0 versicolor
107 4.9 2.5 4.5 1.7 virginica
Back
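The subset and row-count cards can be tried together. This sketch uses only the built-in iris dataset; the variable names are illustrative.

```r
# iris ships with R in the datasets package
small <- subset(iris, Sepal.Length < 5 & Species != "setosa")

# dim() returns c(rows, columns); the first entry is the row count
n_rows <- dim(small)[1]

# The same filter written with bracket indexing instead of subset()
same <- iris[iris$Sepal.Length < 5 & iris$Species != "setosa", ]
```

subset() lets the condition refer to column names directly, while bracket indexing needs the iris$ prefix.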
names(L)=c('a','b','c','d')
Front
Overrides the variable names in the list L with a, b, c, d.
Back
sink('outFile', split=TRUE)
Front
Sends console output to the 'outFile' file AND to the console.
Back
is.vector(a)
Front
Returns TRUE if a is a vector, FALSE otherwise.
Back
L = list(name='mike', age=100, no.children=2, children.ages=c(60,50))
Front
Creates a new list with a number of properties.
Back
iris dataset
Front
A core package dataset that includes flower measurements of different flower species.
Back
foo(1,2,3)
Front
Calls function foo, passing in the arguments 1, 2, and 3.
If any arguments are omitted, their default values are used.
Back
foo(name='mike', age=36)
Front
Calls function foo, passing in the arguments name and age. Note: order does not matter when arguments are named.
Back
exp(1)
Front
e raised to the power of 1, i.e. 2.718282
Back
sink('outFile')
Front
Sends console output to the 'outFile' file, not to the console.
Back
log10(100)
Front
log base 10 of 100, i.e. 2
Back
L[1]
Front
Returns a List with the first variable from the List L in it. The returned type is a List.
Back
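A short sketch contrasting the ways of reading from the list L defined in the earlier card. Single brackets keep the list wrapper; double brackets and $ return the element itself.

```r
L <- list(name = "mike", age = 100, no.children = 2, children.ages = c(60, 50))

first_as_list <- L[1]    # a list of length 1 holding the name element
first_value   <- L[[1]]  # the element itself, a character string
by_name       <- L$age   # access by key
keys          <- names(L)
```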
summary(iris)
Front
A useful statistical summary of the dataset iris.
Back
dim(iris)
Front
Returns the number of rows and columns of the iris dataset.
150 5
Back
tail(iris, 4)
Front
the last 4 rows of the iris dataset
Back
history(10)
Front
Displays the 10 most recently executed commands.
Back
attach(iris, warn.conflicts=FALSE)
Front
Attaches the iris dataset to R's search path. This allows us to refer to its columns directly, i.e. we can just say Sepal.Length instead of iris$Sepal.Length.
Back
names(L)
Front
Prints the keys (element names) of the list L.
Back
system.time(function)
Front
Displays the time it took to execute function.
Back
head(iris, 4)
Front
the first 4 rows of the iris dataset
Back
R CMD SHLIB foo.c
Front
Compiles foo.c into a foo.so shared object file that can be called from R with the .C or .Call functions. e.g. dyn.load('foo.so'); .C('foo',...)
Back
sapply(data, function)
Front
Applies the supplied function to each element of the data provided and simplifies the result to a vector or matrix where possible.
Back
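A small sketch of sapply on a dataframe and on a plain vector. A dataframe is a list of columns, so the function is applied once per column.

```r
# One mean per column; the result is a named numeric vector
col_means <- sapply(iris[, 1:4], mean)

# An anonymous function applied to each element of a vector
squares <- sapply(1:3, function(v) v^2)
```

For column means specifically, colMeans(iris[, 1:4]) gives the same numbers.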
Section 3
(50 cards)
lines
Front
A low-level function in the graphics package that adds a line plot to an existing graph.
Back
low-level functions
Front
In the graphics package, functions that add to or modify an existing graph.
Back
S = sort.int(mpg$cty, index.return=T)
Front
Returns a list with two components for the mpg$cty column. The first, x, holds the sorted values. The second, ix, holds the position of each sorted value in the original data.
Creates a plot of the x attribute of data_frame and shows the probability distribution of the bin values on the y axis and displays as a histogram of binwidth 4.
Back
curve(sinc, -7, 7)
Front
Creates a line plot using the graphics package that applies values from -7 to 7 to the function sinc.
Back
main
Front
Defines the title string in the plot function.
Back
title
Front
A low-level function in the graphics package that modifies the title of a graph.
Back
microbenchmark(function)
Front
Measures the execution time of function over repeated runs. From the microbenchmark package.
Back
graphics package
Front
The default visualization package in R. Harder to use than ggplot2 but may run faster.
Back
legend
Front
A low-level function in the graphics package that connects symbols, colors, and line types to descriptive strings.
Back
breaks
Front
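In a histogram (hist) plot, the argument that controls binning. It can be a vector of breakpoints between bins, in which case the number of bins is one less than the number of breakpoints, or a single number suggesting how many bins to use. R treats a single number as a suggestion only.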
Back
datasets package
Front
A package of datasets that come installed by default in R.
Back
ggplot(data_frame, aes(x=x, y=y)) + geom_point()
Front
Creates a scatter plot with ggplot and adds point geometry to it.
Back
.C("fooC", A, B, C, D)
Front
Calls the compiled C function fooC, passing the inputs A, B, and C and the result argument D.
Back
stopifnot(boolean)
Front
Similar to an assert statement. Stops the program if the boolean is not TRUE.
Back
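A minimal sketch of stopifnot guarding a function's inputs. The function safe_sqrt is hypothetical and exists only for illustration.

```r
# Hypothetical helper: refuse non-numeric or negative input
safe_sqrt <- function(v) {
  stopifnot(is.numeric(v), all(v >= 0))
  sqrt(v)
}

result <- safe_sqrt(c(4, 9))

# try() captures the error so the script can continue
failed <- inherits(try(safe_sqrt(-1), silent = TRUE), "try-error")
```

stopifnot accepts several conditions at once and stops at the first one that is not all TRUE.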
.Rprofile
Front
Placed in the user's home directory, can be used to define .First and .Last functions.
Back
lines(mpg$hwy[S$ix], lty=1)
Front
Adds a line representing highway mileage to an existing plot using line type 1.
Back
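The S$ix index used in the card above comes from sort.int with index.return. The sketch below shows how it reorders data, using made-up values in place of mpg$cty so that it runs without ggplot2.

```r
# Made-up values standing in for a column such as mpg$cty
x <- c(21, 18, 29, 15)

S <- sort.int(x, index.return = TRUE)

sorted_values <- S$x   # the values in increasing order
positions     <- S$ix  # where each sorted value sat in the original vector

# Indexing with S$ix reorders any parallel vector the same way
reordered <- x[S$ix]
```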
bin width
Front
Good bin width balances information loss with good aggregation.
Back
qplot
Front
A function from the ggplot2 package that produces a scatter plot by default.
Back
library(ggplot2)
Front
Loads the ggplot2 library into memory.
Back
mtcars
Front
A dataset of car model data from the Motor Trend Magazine in 1974. Part of the datasets package.
Back
options(expressions=500000)
Front
Sets the maximum number of nested recursive calls to 500,000.
Back
#pragma omp parallel for
Front
A directive placed before a for loop in C that multi-threads the loop. Part of the OpenMP extension.
Back
ggplot
Front
A function from the ggplot2 package that returns a graphics object that may be modified by adding layers to it. Provides automatic axes labeling and legends.
Back
high-level functions
Front
In the graphics package, functions that produce graphs, e.g. plot, hist, or curve.
Back
ggplot2
Front
A visualization package that may be simpler to use than graphics. It may also run slower, however. It is based on the Grammar of Graphics by Wilkinson (2005).
Back
diamonds
Front
A dataset from the ggplot2 package that lists the details of about 54,000 round cut diamonds.
Back
install.packages('ggplot2')
library(ggplot2)
Front
Installs the ggplot2 package and loads it so its functions are in scope.
Back
Histograms vs Strip Plots
Front
Histograms discard the ordering of the data.
Back
print
Front
A function that will print a ggplot graph.
Back
source('h1.R')
Front
Loads the R code in h1.R into the interpreter and runs it.
Creates a legend in the top left corner of an existing plot with the labels and line types given.
Back
.First
Front
Function in .Rprofile that gets executed when R starts.
Back
dyn.load("fooC.so")
Front
Loads the fooC.so shared object file into memory.
Back
.Last
Front
Function in .Rprofile that executes when R stops.
Back
plot(x=data_frame$x, y=data_frame$y)
Front
Uses the high-level plot function from the graphics package to produce a simple scatter plot.
Back
aes
Front
A function that accepts data variables as arguments and is passed to ggplot.
Back
Histogram
Front
A one dimensional plot that groups data into bins and shows counts of those bins.
Back
grid
Front
A low-level function in the graphics package that adds grid lines to a graph.
Back
faithful
Front
A dataset of eruption times of Old Faithful in Yellowstone National Park, Wyoming, USA. Part of the datasets package.
Back
Section 4
(50 cards)
numeric variables
Front
Variables that are real valued. The dissimilarity between two numeric values a and b is expected to be the Euclidean distance, abs(b-a).
Back
Outlier
Front
A data item is an outlier if it is below the alpha percentile or above the 100-alpha percentile.
Back
na.omit(dataframe)
Front
Returns a new dataframe that omits all rows with missing data.
Back
IQR
Front
Interquartile range: the difference between the 75th and 25th percentiles.
Back
Indicator Variables
Front
Breaking out categories into separate binary variables. For example, if we had 6 height categories (A-F), we would create 5 binary indicators, one each for B-F (all 0 would mean the category was A). The indicator for the matching category is set to 1.
Back
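One way to build such indicators in R is model.matrix, sketched below. The height values are made up, and treating A as the baseline follows R's default treatment contrasts rather than anything stated on the card.

```r
# Made-up data: four observations drawn from six possible categories A-F
height <- factor(c("A", "C", "F", "B"), levels = LETTERS[1:6])

# model.matrix creates an intercept plus one 0/1 column per non-baseline level;
# dropping the first column (the intercept) leaves the 5 indicators
indicators <- model.matrix(~ height)[, -1]
```

The row for category A is all zeros, and every other row has a single 1 in the column for its category.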
Outliers
Front
Two sources:
1.) errors in entering data (like with a human typing data in)
2.) non corrupt data but highly unlikely
Back
Replacing missing data for MAR
Front
Methods may introduce systematic bias into the data.
Back
qplot(x, y, geom=c("line"))
Front
Creates a line chart of the variables x and y.
Back
geom_point()
Front
Adds points to a ggplot plot.
Back
MCAR
Front
Missing Completely At Random
Missingness does not depend on observed or unobserved variables.
Back
ggsave("filename.pdf")
Front
Saves the current graph as a pdf.
Back
geom_line()
Front
Adds a line to a ggplot plot.
Back
complete.cases(dataframe)
Front
Returns a vector whose components are FALSE for all data rows missing data and TRUE for all data rows with no missing data.
Back
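A small sketch showing complete.cases and na.omit side by side on a made-up dataframe.

```r
# Made-up data with one missing value in each of two rows
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

keep  <- complete.cases(df)  # one logical per row
by_ix <- df[keep, ]          # filter rows with the logical vector
clean <- na.omit(df)         # same rows, dropped in one step
```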
ordinal variables
Front
variables representing measurements in a certain range R with a well defined order relation. e.g. the seasons.
Back
kernel function
Front
smooths a distribution
Back
Dealing with Outliers
Front
1.) Truncate - drop them
2.) Winsorization - reset each outlier to the most extreme remaining value that is not an outlier
3.) Robustness - analyze data with a robust procedure
Back
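A rough sketch of truncation and winsorization. The function names, the alpha default, and the data are assumptions, and this variant clips to the percentile bounds themselves, which is one common way to winsorize rather than the only one.

```r
# Hypothetical helpers; alpha is the tail fraction treated as outlying
truncate_outliers <- function(v, alpha = 0.05) {
  bounds <- quantile(v, c(alpha, 1 - alpha), names = FALSE)
  v[v >= bounds[1] & v <= bounds[2]]    # drop values outside the bounds
}

winsorize <- function(v, alpha = 0.05) {
  bounds <- quantile(v, c(alpha, 1 - alpha), names = FALSE)
  pmin(pmax(v, bounds[1]), bounds[2])   # clip values to the bounds
}

x <- c(1:19, 1000)   # made-up data with one extreme value
t <- truncate_outliers(x)
w <- winsorize(x)
```

Truncation shortens the data, while winsorization keeps every observation and only pulls in the extremes.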
Problems with Data sets
Front
1.) Missing Data
2.) Outliers
3.) Highly skewed data
Back
log-log Relationship
Front
log log plot can reveal linear relationships between variables that are otherwise hard to see.
e.g. qplot(brain, body, log = "xy", data = Animals)
Back
Sampling
Front
Selecting a random subset of rows.
Back
sample function
Front
sample(data, number to sample, replace=TRUE/FALSE)
Lambda < 1 removes right skewness.
Lambda > 1 removes left skewness.
Smaller values of lambda are more aggressive in removing skewness.
Back
qplot(x, y, geom="line")
Front
Creates a line plot of x and y.
Back
MAR
Front
Missing At Random
Missingness depends on observed variables only.
Back
Ways to Deal with Missing Data
Front
1.) Remove data with missing values.
2.) Replace missing values with a substitute (e.g. mean).
3.) Estimate a probability model and replace with values from that model.
Back
Binning (or discretization)
Front
Taking a numeric variable (real number), dividing its range into several bins, and replacing it with a number representing the corresponding bin.
Back
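Binning can be sketched with the base function cut. The ages and breakpoints below are made up for illustration.

```r
ages <- c(3, 17, 25, 64, 80)

# Bins are (0, 18], (18, 65], (65, Inf]; labels = FALSE returns bin numbers
bins <- cut(ages, breaks = c(0, 18, 65, Inf), labels = FALSE)
```

Leaving labels at its default returns a factor with interval labels instead of bin numbers.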
Power Transformation
Front
A data transformation for dealing with skewed data.
Back
dataframe concatenation
Front
taking two dataframes with identical columns and combining them together
Back
plot(faithful$waiting, faithful$eruptions, xlab="waiting time (min)", ylab="eruption time (min)")
Front
Creates a scatter plot with the graphics package that puts waiting time on the x axis and eruption time on the y axis.
Back
Section 5
(13 cards)
Split-apply-combine
Front
Split a dataframe into segments, apply some operation to each segment, recombine segments into one array or dataframe.
Back
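A minimal base R sketch of the three steps, using the built-in iris dataset.

```r
# Split: one vector of sepal lengths per species
pieces <- split(iris$Sepal.Length, iris$Species)

# Apply and combine: one mean per piece, collected into a named vector
means <- sapply(pieces, mean)

# tapply performs all three steps in a single call
same <- tapply(iris$Sepal.Length, iris$Species, mean)
```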
reshape2
Front
A package for reshaping data between tall and wide formats.
Back
nrow()
Front
the number of rows in a dataframe
Back
unlist()
Front
turns a list into a vector
Back
acast / dcast
Front
converts from tall to wide
e.g.
dcast(tipsm, sex+time~variable, fun.aggregate = mean, margins = TRUE)
Back
dataframe joining
Front
When two dataframes do not have identical columns, you can join them on the columns they share.
Back
strsplit("a,b,c", ",")
Front
Splits the string at each comma. Returns a list containing the character vector "a" "b" "c".
Back
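A short sketch showing why strsplit is often paired with unlist: the result is a list with one entry per input string, even when there is only one input.

```r
parts <- strsplit("a,b,c", ",")  # a list of length 1
pieces <- parts[[1]]             # the character vector inside it
flat <- unlist(parts)            # same vector, via unlist
```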
melt
Front
converts from wide to tall
Back
tall data
Front
many rows, fewer columns. one or more columns act as an id, one other column acts as a value. e.g.
date | product | sales
1/1/11 | apples | 100
1/1/11 | oranges | 200
1/2/11 | apples | 50
1/2/11 | oranges | 150
Back
wide data
Front
more columns. basically a table. categories are columns, indexes are the first column of a row, cell values represent corresponding values.
e.g.
date | apples | oranges
1/1/11 | 100 | 200
1/2/11 | 50 | 150
Back
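The tall and wide examples can be connected with a short sketch. It uses base R's tapply as a stand-in so that it runs without installing anything; it is not the reshape2 approach the other cards describe.

```r
# The tall table, one row per date/product pair
tall <- data.frame(
  date    = c("1/1/11", "1/1/11", "1/2/11", "1/2/11"),
  product = c("apples", "oranges", "apples", "oranges"),
  sales   = c(100, 200, 50, 150)
)

# Wide form: dates become rows, products become columns
wide <- tapply(tall$sales, list(tall$date, tall$product), sum)
```

The result is a matrix with one row per date and one column per product.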
gsub(" ", "", x)
Front
replaces all spaces with nothing in string x
Back
merge(df1, df2, by=0)
Front
Joins the dataframes on their row names.
Back
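A minimal sketch of joining on row names with made-up dataframes. By default merge keeps only rows present in both inputs.

```r
df1 <- data.frame(x = c(1, 2, 3), row.names = c("a", "b", "c"))
df2 <- data.frame(y = c(10, 30), row.names = c("a", "c"))

# by = 0 matches on row names; they come back in a Row.names column
joined <- merge(df1, df2, by = 0)
```

Passing all = TRUE would also keep row b, with NA in the y column.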
melt function
Front
converts wide to tall
melt(data, id = id columns)
e.g. melt(smiths, id=c(1,2,3))