Scala: To load data from a text file into an RDD.
assume SparkContext is available as sc, SQLContext is available as sqlContext
Front
val sfpd = sc.textFile("path_to_file/file.txt")
(4)
Back
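A minimal sketch of the loaded RDD in action (the path is a placeholder; sc is the shell's SparkContext):
val sfpd = sc.textFile("path_to_file/file.txt")
sfpd.count()   // action: returns the number of lines to the driver
sfpd.first()   // action: returns the first line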
What was Spark's original purpose?
Front
It started as an example workload to demonstrate Mesos
Back
Advantage of Spark over Hadoop and Storm
Front
Tons of built-in tools (SQL, streaming, machine learning, graph processing) are already available in Spark
Back
Spark has about __ built-in operators for big data analysis
Front
80
Back
Matei Zaharia also created
Front
Mesos (co-creator)
Fair Scheduler (Hadoop)
Back
You can submit your Spark app in 4 modes
Front
Local, Mesos, standalone, YARN
(5)
Back
Use ______________ to create the RDDs
Front
SparkContext
Back
Python: create data to hold numbers 1 through 999
Front
data = xrange(1, 1000)
(xrange's upper bound is exclusive; in Python 3 use range)
Back
Scala: name of the SparkContext variable
Front
sc
Back
Worker programs can run on _____ and _____
Front
Cluster nodes and local threads
Back
The main program of your Spark app is called?
Front
Driver
Just like MapReduce
Back
4 types of workload (not limited to) that Spark can handle:
Front
1. Batch
2. Iterative
3. Query
4. Streaming
Back
Client talks to server nodes via
Front
SSH
Back
In Python, how to spread the collection data into an RDD with 8 partitions
Front
xrangeRDD = sc.parallelize(data, 8)
Back
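For comparison, a sketch of the Scala equivalent (8 is the number of partitions, not nodes):
val data = 1 to 999
val xrangeRDD = sc.parallelize(data, 8)   // distribute the local range into 8 partitions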
Python: how to change the name of xrangeRDD to newRDD
Front
xrangeRDD.setName("newRDD")
Back
When and where was Spark born?
Front
In 2009 in UC Berkeley's AMPLab by Matei Zaharia
Back
A Spark program first creates a _______________ object
Front
SparkContext object
Back
Spark extends the MapReduce model to include iterative programming and _________
Front
Stream processing
Back
Spark was created at?
Front
UC Berkeley
Back
We know Spark supports Java, Python, and Scala. Does it support R?
Front
Yup, recently added (SparkR)!
Back
What is pySpark?
Front
Python programming interface to Spark
Back
Spark was built on top of Hadoop _____
Front
MapReduce
Back
Your Spark app is coordinated by the _____ object
Front
SparkContext
Back
RDD: more partitions equals
Front
more parallelism
Back
Why is the management overhead for Spark low?
Front
Because it can handle many types of workloads, you don't need a separate tool for managing each.
Back
RDDs are distributed across _______________
Front
workers
Back
spark-shell command to get the Spark version
Front
sc.version
Back
to build a standalone app, to create a Spark context,
what to include in the main method (2 statements)
Front
val conf = new SparkConf().setAppName("MyAPP")
val sc = new SparkContext(conf)
FIRST: SparkConf OBJECT
THEN: SparkContext
(3)
Back
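Putting the two statements together, a minimal standalone-app sketch (object and app names are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
object MyApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MyAPP")  // FIRST: the SparkConf object
    val sc = new SparkContext(conf)                 // THEN: the SparkContext
    // ... create and use RDDs here ...
    sc.stop()
  }
}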
RDD Question: Where does RDD reside?
A. On HDFS
B. In the memory of the worker nodes
C. In the memory of the master node
D. On local disk
Front
B
Back
Where was Mesos invented?
Front
Also, UC Berkeley
Back
Spark SQL: part of the Spark core distribution?
Front
Yes
Back
Key difference between Hadoop and Spark in terms of the reason for speed?
Front
Spark runs in memory
Back
What is lineage for an RDD?
Front
Each RDD remembers how it came about (the transformations that produced it)
Back
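A quick way to see lineage in the shell, as a sketch (file and variable names are illustrative):
val lines = sc.textFile("README.md")
val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
println(words.toDebugString)   // prints the RDD and the chain of transformations it came from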
What is an RDD in layman's terms?
Front
This is the holy grail of what an RDD is. RDDs are an 'immutable resilient distributed collection of records' which can be stored in volatile memory or in persistent storage (HDFS, HBase, etc.) and can be converted into another RDD through transformations.
Source: http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
Back
Where was Spark invented?
Front
UC Berkeley
Back
Which shared variable helps transfer large data sets very efficiently?
Front
broadcast
Back
3 languages (at least) that can be used to code on Spark
Front
Java
Python
Scala
Back
Spark provides built-in APIs in ______________ (3) languages
Front
Java, Scala, and Python
Back
Created by
Front
Matei Zaharia
Back
Spark programs are two programs:
Front
Driver and Workers
Back
Action on RDD returns what?
Front
a value to the driver code
(mapr 1)
Back
RDDs are immutable. True/False
Front
True
(2)
Back
T/F: Spark has its own cluster management.
Front
True
Back
How do you start pyspark on Windows?
Front
just a command (pyspark), as long as you have the path added
Back
What is the name of the python interface to Spark?
Front
pyspark
Back
The main feature of Spark is _______________
Front
its in-memory cluster computing
Back
Section 2
(50 cards)
once you get the scala prompt after kicking off spark-shell, 2 commands you can run for a sanity check
Front
sc.version
sc.appName (case sensitive)
Back
4 components of Spark that sit on top of core Spark
Front
1. Spark SQL
2. MLlib
3. GraphX
4. Spark Streaming
Back
how to start up pyspark? (Windows)
Front
pyspark (command)
Back
will :history remember commands from the previous sessions?
Front
YES!
Back
every SparkContext launches a web UI on port?
Front
4040
Back
Which has richer set of libraries: Python or Scala
Front
Python
Back
When code is sent to executors, what is sent?
Front
JAR or Python code passed to SparkContext
Back
calling cache() does what?
Front
marks the RDD to be saved in the cache (it is actually stored the first time it is computed)
Back
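A small sketch of that behavior (caching itself is lazy):
val lines = sc.textFile("README.md")
lines.cache()    // only marks the RDD for caching
lines.count()    // first action: computes the RDD and stores it
lines.count()    // second action: served from the cache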
where will you find the start-master script? (even on Windows)
Front
sbin (a peer to bin, where spark-shell and pyspark are)
Back
in client mode, where does the driver start?
Front
in the same process as the client app
Back
It is efficient and faster to use _____ serialization
Front
Kryo
Back
How much faster is spark (memory) than hadoop
Front
100 times
Back
To run an application on the Spark cluster, simply pass _________________________________
Front
To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.
Back
____________ variable can be used to cache read-only values in each node's RAM
Front
broadcast
Back
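A minimal broadcast sketch (the lookup table is made up for illustration):
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))       // shipped once to each node
val codes = sc.parallelize(Seq(1, 2, 1))
codes.map(k => lookup.value.getOrElse(k, "?")).collect() // tasks read lookup.value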
Once connected to the cluster, Spark acquires ____ on nodes to do its work
Front
Executors
Back
Which has higher learning curve: Python or Scala
Front
Scala
Back
what does webUI on port 4040 show?
Front
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application
Back
Spark was written in ______
Front
Scala
Back
2 deploy modes for independent batch jobs:
Front
1. client mode
2. cluster mode
Back
What is REPL?
Front
The spark-repl is referred to as the interactive Spark shell and can
be run from your Spark installation directory:
./bin/spark-shell
The spark-repl (read-evaluate-print loop) is a modified version of
the interactive Scala REPL that can be used with Spark
Back
You can start a standalone master server by executing:
Front
You can start a standalone master server by executing: ./sbin/start-master.sh
Back
spark-shell (scala) : to get history
Front
:history
Back
Which has much greater community support: Python or Scala
Front
Python
Back
If you really wanted to understand Spark, which pays more: Python or Scala?
Front
Scala
Back
in scala prompt (spark-shell) way to get help
Front
:help
Back
Name 3 cluster managers SparkContext can connect to
Front
YARN
Mesos
Spark's standalone cluster
Back
RDD stands for:
Front
Resilient Distributed Dataset
Back
example of 4 things that port 4040 will show
Front
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
Back
RDDs have transformations, which return ?
Front
transformations return pointers to new RDDs.
Back
Spark does ________ analysis on large-scale data
Front
complex
Back
command to start up spark-shell (Windows)
Front
spark-shell
Back
once an accumulator has been created with its initial value, tasks can add to the variable using the _____ command
Front
add
Back
spark-shell (scala) command to read in a file from an external source
Front
:load _PATH_
Back
command to fire up the Spark master on Windows
Front
bin\spark-class.cmd org.apache.spark.deploy.master.Master
and this ran successfully and I got a web page at http://localhost:8080
this says
Spark Master at spark://192.168.1.4:7077
Back
Accumulator variables are added through _______________ operation
Front
associative
Back
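A minimal accumulator sketch (Spark 1.x API, assuming the shell's sc):
val acc = sc.accumulator(0)                       // created with an initial value
sc.parallelize(1 to 100).foreach(x => acc += x)   // tasks add to it in parallel
println(acc.value)                                // only the driver reads the result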
MLlib is __ times faster than disk-based Mahout
Front
9
Back
default URL for the master web GUI
Front
localhost:8080
Back
7 options for the master URL
Front
local
local[K]
local[*]
spark://host:port
mesos://host:port
yarn-client
yarn-cluster
Back
in scala shell, how to read a file from the spark source directory?
Front
val textFile = sc.textFile("README.md")
Back
How much faster is Spark (disk) than Hadoop on disk?
Front
10 times
Back
RDDs have actions, which return ?
Front
RDDs have actions, which return values
Back
Component of Spark that has machine learning
Front
MLlib
Back
if multiple SparkContexts are running on the same host, they will bind to which ports?
Front
If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.)
Back
Section 3
(50 cards)
accumulators are great for running in ________________
Front
parallel
Back
T/F: unlike MapReduce, executors run on the worker node even when the Spark application is not running a task
Front
True
Back
through which entity does driver code communicate with the Cluster Manager?
Front
SparkContext
Back
Use _______________ to create RDDs
Front
SparkContext
Back
what do you set "master" to, to point to a Mesos cluster?
Front
mesos://host:port
Back
if you specify local as master, what does that mean?
Front
1 thread on local box, no parallelism
Back
Lambda in programming is an _________________ function
Front
anonymous
Back
what is the benefit of executors running when a task is not running?
Front
sharing information from memory
Back
what is the default port for standalone spark cluster?
Front
7077
Back
RDD transformations are lazy. What does that mean?
Front
they are not computed immediately
Back
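A sketch of that laziness (nothing is computed until the action):
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)   // transformation: just returns a new RDD
doubled.collect()               // action: the map actually runs now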
Name 2 examples of actions
Front
collect, count
Back
Which parameter in SparkContext dictates which cluster and how many threads to use?
Front
master
Back
3 most common transformations on Scala collections
Front
map
filter
flatMap
Back
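A quick sketch of the three on a plain Scala collection:
val xs = List("hello world", "spark")
xs.map(_.length)                 // List(11, 5)
xs.filter(_.contains("spark"))   // List(spark)
xs.flatMap(_.split(" "))         // List(hello, world, spark)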
default port for mesos cluster
Front
5050
Back
To install Spark Standalone mode, you simply place a compiled version of Spark on ________________
Front
To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster.
Back
A lambda is a type of function, defined _____________.
Front
A lambda is a type of function, defined inline.
Back
scala REPL = spark ______
Front
scala REPL IS the spark shell
Back
what percentage of spark programs are written in scala?
Front
95%
YouTube video posted by new century trainer, 10/2015
he mentions support for R being added, so the video can't be that old.
Back
where do you keep a list of hostnames that run worker nodes
Front
conf/slaves
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line
Back
where is spark-submit script?
Front
bin
Back
Databricks's product?
Front
Notebooks
Back
sbt is used to build spark applications in spark ______
Front
shell
Back
what are the "code shortening" advantages of lambda functions?
Front
You don't need to bother with:
Declaring the function somewhere else (and having to look for it when you revisit the code later).
Naming something that you're only using once.
Once you're used to function values, having to do without them seems as silly as being required to name every expression, such as:
NORMAL WAY:
int tempVar = 2 * a + b
...
println(tempVar)
LAMBDA WAY:
println(2 * a + b)
(Stackoverflow)
Back
Lambda comes from ____________ calculus
Front
Lambda
Back
pyspark: how to display the contents of collection a
Front
a
(just a, then Enter)
Back
when Spark runs on Hadoop, the data node is typically the same as __________________
Front
worker node
Back
Cluster Manager and _______________ are typically the same device
Front
Spark Master
Back
Collect action causes ________, filter, and map transformations to be executed
Front
parallelize
Back
spark uses ______________ to parallelize and create pipelines for the sub-processes needed to execute a task.
Front
DAG
Back
if the master dies, what happens to currently running apps and to new apps that have not been submitted? (assume no ZooKeeper)
Front
currently running: no impact
new: no longer possible
Back
Lambda allows you to write quick throw away functions without _____________
Front
allows you to write quick throw away functions without naming them
Back
when you add to an accumulator's initial value using the add command, who has access to it?
Front
only the driver program
Back
An anonymous function allows you to create _______ from functions
Front
An anonymous function allows you to create functions from functions
Back
accumulators are great for _________ and ___________ (examples)
Front
sums and counters
Back
Simply put, a lambda is a function without a __________
Front
Simply put, a lambda is a function without a name
Back
when is a "transformation" actually done?
Front
when an action is performed on it
Back
what you set "master" to point to standalone spark cluster ?
Front
spark://host:port
Back
3 ways to construct an RDD
Front
1. transform another RDD
2. from files (HDFS or otherwise)
3. parallelize Python collections (lists)
Back
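A sketch of all three in Scala (the path is a placeholder):
val fromFile  = sc.textFile("hdfs:///path/to/file.txt")  // 2. from a file
val fromLocal = sc.parallelize(List(1, 2, 3, 4, 5))      // 3. parallelize a local collection
val fromRDD   = fromLocal.map(_ * 2)                     // 1. transform another RDD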
example of defining a collection (list) in python
Front
>>> a = [1, 2, 3, 4, 5]
(>>> is the pyspark prompt)
Back
3 components of worker node
Front
executor
cache
task
Back
where are the start/stop scripts stored?
Front
sbin
Back
IPython and programs must use the ________________ to create a new SparkContext
Front
Constructor
Back
Which company employs most of Spark committers?
Front
Databricks
Back
True/False: pySpark shell and Databricks cloud automatically create sc variable
Front
True!
Back
sbt is built from ____
Front
scala
Back
which has more data science tools? Python or Scala
Front
Python!
Back
True/False: RDD enables operations on collection of elements in parallel.
Front
True
Back
if you have 2 cores in your local box, what should you set your "master" to?
Front
local[2]
Back
Name 2 examples of transformations
Front
map, filter
Back
Scala example of a lambda function:
Front
args.foreach(arg => println(arg))
where the argument to the foreach method is an expression for an anonymous function.
(Stackoverflow)
Back
Section 4
(50 cards)
Lambda function that returns arg - 1
Front
minusone = lambda foo: foo - 1
(Python; the def form would be: def minusone(foo): return foo - 1)
Back
if your scala file is test.scala and object inside is HelloWorld, which files will be created when you compile it?
Front
HelloWorld.class
HelloWorld$.class
Back
once driver gets slots from the worker nodes, how much communication happens between driver and worker AND between driver and master
Front
NONE
Back
Scala is a lot like Java, so if you know Java, the learning curve for Scala would be less. T/F
Front
True
Back
define a function fsq that will return square of the argument passed to it
Front
def fsq(x: Double) = x*x
(just like in the REPL; the D in Double has to be capitalized)
Back
which method to call to get the number of partitions for an RDD?
Front
rdd1.getNumPartitions()
(PySpark; in the Scala API you can also use rdd1.partitions.size)
Back
Which language uses lambdas heavily?
Front
LISP!
Back
Spark SQL provides a new abstraction called _____________
Front
SchemaRDD
Back
If you see:
val sc = new SparkContext(conf)
what does that mean for conf, in terms of dependency?
Front
conf has to be defined first, with certain parameters which will in turn be used in creating SparkContext sc
Back
how to print the string hello?
Front
println("hello")
quotes needed; otherwise it tries to find a variable named hello
Back
Scala lacks the same amount of Data Science libraries and ______ as Python
Front
tools
Back
Close cousin of lambdas in C?
Front
function pointer
Back
Transformation actually does nothing (like a file pointer)
Front
nothing really happens until an action is called
Back
You can cache to many types of locations
Front
Memory, disk and more
Back
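A sketch of choosing a location with persist (Spark 1.x API):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("README.md")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk when it doesn't fit in memory
// rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)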
Python API vs Scala API (performance)
Front
negligible difference, because it's almost all Scala under the hood
Back
How much faster is MLlib compared to Mahout (disk-based), before Mahout got a Spark interface?
Front
9 times
Back
if your file is named test.scala , but object inside is called HelloWorld, will that work?
Front
Yes!
Back
If you do 5 transformations on a 5 GB RDD and do 1 simple action, data is finally pulled and network traffic ensues. Then you do one more small action; what happens?
Front
data is pulled over again!! (unless you cached the RDD)
Back
how to add 5 and 6 and save the result into variable x
Front
val x = 5 + 6
Back
Scala is both object oriented and functional. T/F
Front
True
Back
How do you compile helloworld.scala?
Front
scalac helloworld.scala
Back
you do: (scala REPL)
2+2
is the answer saved anywhere?
Front
yes, it is stored in the variable res0
Back
if the function fsq is defined, how do you pass variable a to it and save the result in variable b?
Front
val b = fsq(a)
Back
Once a user application is bundled, it can be launched using the _____________ script
Front
Once a user application is bundled, it can be launched using the bin/spark-submit script
Back
if you write a class in scala, what extension should you give it?
Front
.scala
(e.g. HelloWorld.scala)
Back
If you install Scala and Cygwin on Windows, can you run scala from the Cygwin shell?
Front
yes
just add Scala's bin to the Windows PATH
Back
One difference between lambdas and anonymous inner class in Java
Front
Lambdas usually define closures, which means they can refer to variables visible at the time they were defined.
Back
Close cousin of lambdas in Java?
Front
anonymous inner class
http://martinfowler.com/bliki/Lambda.html
Back
Can we easily call R directly from python?
Front
People have ported most of the core parts of R to Python, and there are easy ways to call R directly from Python
(source: quora)
Back
rdd1.toDebugString
Front
public String toDebugString()
A description of this RDD and its recursive dependencies for debugging.
Back
what was the appeal of Spark besides speed?
Front
interactive algorithms and queries were not served well enough by the batch frameworks (i.e., Hadoop-oriented programmatic tools)
Back
Map
Front
function applied to each element of the RDD
Back
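A one-line sketch in the Scala shell:
sc.parallelize(1 to 5).map(_ * 10).collect()   // Array(10, 20, 30, 40, 50)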
4 Items on the TOP layer of Spark Stack
Front
1. Spark SQL
2. Spark Streaming
3. MLlib
4. GraphX
Back
Is the name "worker" misleading?
Front
yes, because the worker manages all the resources (slots)
Back
3 Items on the bottom layer of spark stack
Front
1. Spark Standalone
2. Mesos
3. YARN
Back
Example of defining parameters using conf
Front
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("CountingSheep")
(Apache Spark web site)
Back
How do you execute a compiled class from the command prompt?
Front
scala HelloWorld.class OR
scala HelloWorld
both work!
Back
If you do 5 transformations on 5gb RDD and do 1 simple action, when is data pulled
Front
only when the first action is called
Back
Executors
Front
JVM that actually runs tasks (on worker nodes)
Back
3 (examples) of what scala lacks
Front
1. visualization
2. local tools
3. good local transformations
(source: quora)
Back
How do you see what version of Java you have installed in Cygwin?
Front
java -version
(from cygwin shell)
Back
Members of packages, classes, or objects can have which kinds of access?
Front
public
private
protected
(just like java!!)
Back
how to get the next representable value after 2.0 towards 3
Front
val z = Math.nextAfter(2, 3)
THIS worked.
Back
scala: difference between private and protected?
Front
private: only that class, not even its subclasses
protected: that class and its subclasses
Back
you want to calculate e^a (exponential) and assign it to b
Front
val b = exp(a)
(assumes import scala.math._)
Back
what does the * operator do to a string?
Front
repeats the string N times
val foo = "abc"
val bar = foo * 2
so, bar will be abcabc
Back
Clojure is a dialect of ______________
Front
LISP
Back
can you specify an absolute path when using sc.textFile to read a file into an RDD?
Front
yes
this worked for me
Back
how to split a string using a space as the delimiter
Front
scala> "hello world".split(" ")
res0: Array[java.lang.String] = Array(hello, world)
(This is useful when we read in a text file)
Back
reverse a string named foo
Front
foo.reverse
(works!)
Back
Clojure
Front
Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming
Back
how to import the scala math package?
Front
import scala.math._
Back
Default caching mechanism
Front
Memory
Back
if you run sc.textFile on a non-existent file, will you get an error?
Front
NO!
only when you call an action, will it give an error
This actually happened to me!
Back
get absolute value of a variable foo
Front
foo.abs
Back
example of simple procedural scala program
Front
def cubed(x: Double) = x * x * x
val a = 10
val b = cubed(a)
println (b)
Back
How do you see which version of Scala you have installed in Cygwin?
Front
scala -version
(from cygwin shell)
Back
how to convert a string to an Int?
Front
val foo = "123" (this is a string)
val bar = foo.toInt
(bar is an Int)
Back
you want to see the value of the variable r3
Front
r3 OR
println(r3)
Back
Define a list
Front
val foo = List(1, 2, 4)
Back
how to split a string using space as a delimiter and then print each word out
Front
scala> "hello world".split(" ").foreach(println)
hello
world
Back
if you write a simple procedural scala program (e.g. a bunch of statements, no class), how do you run it?
assume filename test2.scala
Front
from OS command line, run:
scala test2.scala
Back
what happens if you do: (in REPL)
2 + 5.5
Front
you get 7.5, which is a Double
Back
if you want to read a text file using sc.textFile as soon as you enter the scala REPL via spark-shell, where can you put the file?
Front
put it in the last directory you were in before launching spark-shell
This worked for me!
Back
why does
val z = math.Nextafter(2, 3)
give an error in Unix?
Front
case sensitivity! (it should be math.nextAfter)
Back
does scala have the concept of "this" like java?
Front
YES!
Back
To add up all the list elements in one line of code
Front
list1.foreach(sum += _)
(assumes sum was declared first, e.g. var sum = 0)
Back
Do a foreach over a list
Front
foo.foreach(println)
Back
val abc = sc.textFile(_______)
Front
path to file name to read from
Back
how to get more info on spark-shell command?
Front
spark-shell --help
Back
how to create a variable z with range of 1 through 10 (z will have values 1 through 10)
Front
val z = 1 to 10
(wow)
Back
T/F: scala has logical operators like || && and !
Front
True
Back
T/F: null is special value compatible with all types
Front
True
scala> val a = 1
a: Int = 1
scala> val a = "hello"
a: String = hello
scala> val a = null
a: Null = null
scala>
Back
what happens if you assign too big a number (Double) to a variable? e.g. val b = exp(10000000)
Front
b gets the value Infinity
lol, I actually did this :-)
Back
You have read a file into an RDD called myfile. How do you count the lines in that file?
Front
myfile.count
Back
create a range r2 from 1 to 9999, but skip by 10s
Front
val r2 = 1 to 9999 by 10
(it's like speaking English :-)
Back
scala: what is "default" access level (if not specified)
Front
public
Back
When you run spark-shell, you get the following:
what does that mean?
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/
FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSpa
rkProperties$1.apply(SparkSubmitArguments.scala:111)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSpa
rkProperties$1.apply(SparkSubmitArguments.scala:111)
Front
it can't find hadoop.
Once I installed hadoop 2.6, it went away.
Back
Example of Collaborative filtering
Front
N users, M products.
Based on previous data, trying to make more matches between the two sides.
Back
There is a concept of driver memory
Front
Yes
Back
convert r3 (a range) to a list?
Front
r3.toList
(watch the capital L)
Back
multi-line comment in scala
Front
/* comment
still comment
still no code
*/
Back
How do you see what version of the Java compiler you have installed in Cygwin?
Front
javac -version
Back
you have an RDD called myfile. You want to keep only the lines that have the word "Sadia" in them.
Which code?
Front
val Sadia_only = myfile.filter(line => line.contains("Sadia"))
Back