Apache Spark and Scala (by mamun)

Last updated

1 year ago

Date created

Mar 1, 2020

Cards (246)

Section 1

(50 cards)

5. T/F: Spark is a modified version of Hadoop

Front

False!

Back

Which command submits your Spark app?

Front

/spark-home/bin/spark-submit .......options... (5)

Back

6. T/F: Hadoop is a dependency for Spark

Front

False

Back

True or false: you need Hadoop to install Spark.

Front

False!

Back

Scala: load data from a text file into an RDD (assume SparkContext is available as sc, SQLContext as sqlContext)

Front

val sfpd = sc.textFile("path_to_file/file.txt") (4)

Back

What was Spark's original purpose?

Front

Something to do with Mesos

Back

Advantage of Spark over Hadoop and Storm

Front

Tons of built-in tools already available in Spark

Back

Spark has about __ built-in operators for big data analysis

Front

80

Back

Matei Zaharia also created

Front

Mesos (co-creator) and the Fair Scheduler (Hadoop)

Back

You can submit your Spark app in 4 modes

Front

Local, Mesos, standalone, YARN (5)

Back

Use ______________ to create the RDDs

Front

SparkContext

Back

Python: create data to hold numbers 1 through 999

Front

data = xrange(1, 1000) (xrange excludes the stop value; Python 2, use range in Python 3)

Back

Scala context

Front

sc

Back

Worker programs can run on _____ and _____

Front

Cluster and local threads

Back

Main program of your spark app is called?

Front

Driver (just like MapReduce)

Back

9. 4 types of workload (not limited to) Spark can handle:

Front

1. Batch 2. Iterative 3. Query 4. Streaming

Back

Client talks to server nodes via

Front

SSH

Back

In Python, how do you spread the collection data across 8 partitions as an RDD?

Front

xrangeRDD = sc.parallelize(data, 8)

Back

Python: how to change the name of xrangeRDD to newRDD

Front

xrangeRDD.setName("newRDD")

Back

11. When and where was Spark born?

Front

In 2009 in UC Berkeley's AMPLab by Matei Zaharia

Back

A Spark program first creates a _______________ object

Front

SparkContext object

Back

Spark extends Mapreduce model to include iterative programming and _________

Front

Stream processing

Back

Spark was created at?

Front

UC Berkeley

Back

We know, Spark supports Java, Python, Scala. Does it support R?

Front

Yup, recently added!

Back

What is pySpark?

Front

Python programming interface to Spark

Back

Spark was built on top of Hadoop _____

Front

MapReduce

Back

Your spark app is coordinated by _____ object

Front

SparkContext

Back

RDD:more partitions equals

Front

More parallelization

Back
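A quick sketch of the partitions-to-parallelism link (assumes a live SparkContext sc, as in spark-shell; the numbers are illustrative):

val rdd4 = sc.parallelize(1 to 1000, 4)  // 4 partitions
val rdd8 = sc.parallelize(1 to 1000, 8)  // 8 partitions: up to 8 tasks can run at once
println(rdd8.getNumPartitions)           // 8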

10. Why is the management overhead for Spark low?

Front

Because it can handle many types of workloads, you don't need various tools for managing each.

Back

RDDs are distributed across _______________

Front

workers

Back

spark-shell command to get the Spark version

Front

sc.version

Back

To build a standalone app and create a SparkContext, what do you include in the main method? (2 statements)

Front

val conf = new SparkConf().setAppName("MyApp") then val sc = new SparkContext(conf). FIRST: SparkConf OBJECT, THEN: SparkContext (3)

Back
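A minimal standalone-app skeleton, sketched out (the object and app names are placeholders, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp")  // FIRST: the SparkConf object
    val sc   = new SparkContext(conf)               // THEN: the SparkContext
    // ... create and operate on RDDs via sc ...
    sc.stop()                                       // shut down cleanly
  }
}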

3. RDD Question: Where does an RDD reside? A. On HDFS B. In the memory of the worker nodes C. In the memory of the master node D. On local disk

Front

B

Back

Where was Mesos invented?

Front

Also, UC Berkeley

Back

Sparksql: part of spark core distribution?

Front

Yes

Back

4. Key difference between Hadoop and Spark, in terms of the reason for speed?

Front

Spark runs in memory

Back

What is lineage for an RDD?

Front

Each RDD remembers how it came about (transformations)

Back
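A small sketch of lineage in action (assumes a live SparkContext sc; the file path is a placeholder):

val lines    = sc.textFile("path_to_file/file.txt")
val nonEmpty = lines.filter(_.nonEmpty)
val lengths  = nonEmpty.map(_.length)
println(lengths.toDebugString)  // prints the recorded chain: map <- filter <- textFile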

What is an RDD in layman's terms?

Front

This is the holy grail of what an RDD is. RDDs are an 'immutable resilient distributed collection of records' which can be stored in volatile memory or in persistent storage (HDFS, HBase etc.) and can be converted into another RDD through transformations. Source: http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

Back

Where was Spark invented?

Front

UC Berkeley

Back

Which type of variable helps transfer large data sets very efficiently?

Front

broadcast

Back
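A minimal broadcast sketch (assumes a live SparkContext sc; the lookup table is made up for illustration):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))      // shipped once per node, read-only
val keys   = sc.parallelize(Seq("a", "b", "a"))
val values = keys.map(k => lookup.value.getOrElse(k, 0))
println(values.collect().toList)                        // List(1, 2, 1)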

3 languages (at least) that can be used to code on Spark

Front

Java Python Scala

Back

12. Spark provides built-in APIs in ______________ (3) languages

Front

Java, Scala, and Python

Back

Created by

Front

Matei Zaharia

Back

Spark programs are two programs:

Front

Driver and Workers

Back

Action on RDD returns what?

Front

a value to the driver code (mapr 1)

Back

RDDs are immutable. True/False

Front

True (2)

Back

7. T/F: Spark has its own cluster management.

Front

True

Back

How do you start pyspark on Windows?

Front

just the pyspark command, as long as you have its bin directory on your PATH

Back

What is the name of the python interface to Spark?

Front

pyspark

Back

8. The main feature of Spark is _______________

Front

its in-memory cluster computing

Back

Section 2

(50 cards)

once you get the scala prompt after kicking off spark-shell, 2 commands you can run for sanity check

Front

sc.version and sc.appName (case sensitive)

Back

4 components of Spark that sit on top of core Spark

Front

1. Spark SQL 2. MLlib 3. GraphX 4. Spark Streaming

Back

how to start up pyspark? (windows)

Front

pyspark (command)

Back

will :history remember commands from the previous sessions?

Front

YES!

Back

every SparkContext launches a web UI on which port?

Front

4040

Back

Which has richer set of libraries: Python or Scala

Front

Python

Back

When code is sent to executors, what is sent?

Front

JAR or Python files, passed to the SparkContext

Back

calling cache() does what?

Front

marks the RDD to be kept in memory once it is first computed

Back

where will you find the start-master script? (even on Windows)

Front

sbin (a peer of bin, which is where spark-shell and pyspark live)

Back

in client mode, where does the driver start?

Front

in the same process as the client application

Back

It is efficient and faster to use _____ serialization

Front

Kryo

Back

How much faster is Spark (in memory) than Hadoop?

Front

100 times

Back

To run an application on the Spark cluster, simply pass _________________________________

Front

To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.

Back

____________ variables can be used to cache read-only values in each node's RAM

Front

broadcast

Back

Once connected to the cluster, Spark acquires ____ on nodes to do its work

Front

Executors

Back

Which has higher learning curve: Python or Scala

Front

Scala

Back

what does webUI on port 4040 show?

Front

Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application

Back

Spark was written in ______

Front

Scala

Back

2 modes for independent batch jobs:

Front

1. client mode 2. cluster mode

Back

What is REPL?

Front

The spark-repl is the interactive Spark shell and can be run from your Spark installation directory: ./bin/spark-shell. The spark-repl (read-evaluate-print loop) is a modified version of the interactive Scala REPL that can be used with Spark.

Back

You can start a standalone master server by executing:

Front

You can start a standalone master server by executing: ./sbin/start-master.sh

Back

spark-shell (scala) : to get history

Front

:history

Back

Which has much greater community support: Python or Scala

Front

Python

Back

If you really want to understand Spark, which pays off more: Python or Scala?

Front

Scala

Back

in scala prompt (spark-shell) way to get help

Front

:help

Back

Name 3 cluster managers SparkContext can connect to

Front

YARN, Mesos, Spark's standalone cluster

Back

RDD stands for:

Front

Resilient Distributed Dataset

Back

example of 4 things that port 4040 will show

Front

A list of scheduler stages and tasks; a summary of RDD sizes and memory usage; environmental information; information about the running executors

Back

RDDs have transformations, which return ?

Front

transformations return pointers to new RDDs.

Back

Spark does ________ analysis of large-scale data

Front

complex

Back

command to start up spark-shell (windows)

Front

spark-shell

Back

once an accumulator has been created with an initial value, new values can be added to it using the _____ command

Front

add

Back
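A minimal accumulator sketch (Spark 2.x API; assumes a live SparkContext sc. Spark 1.x used sc.accumulator instead):

val acc = sc.longAccumulator("sum")                 // created with initial value 0
sc.parallelize(1 to 100).foreach(x => acc.add(x))   // tasks add to it in parallel
println(acc.value)                                  // 5050; readable only in the driver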

most basic abstraction in spark:

Front

RDD

Back

how to get out of pyspark?

Front

quit()

Back

Spark acquires executors and copies code; what's next?

Front

Sends tasks to executors

Back

spark-submit is a script. T/F

Front

True

Back

how to search history from scala prompt?

Front

:h? (literally) e.g. :h? foo

Back

REPL stands for?

Front

read evaluate print loop

Back

in cluster mode, where does the driver start?

Front

in one of the worker processes

Back

spark-shell (scala) command to read a file from external source

Front

:load _PATH_

Back

command to fire up the Spark master on Windows

Front

bin\spark-class.cmd org.apache.spark.deploy.master.Master (this ran successfully, and the web page at http://localhost:8080 says "Spark Master at spark://192.168.1.4:7077")

Back

Accumulator variables are added through _______________ operation

Front

associative

Back

MLlib is __ times faster than disk-based Mahout

Front

9

Back

default URL for the master web GUI

Front

localhost:8080

Back

7 options for the master URL

Front

local, local[K], local[*], spark://host:port, mesos://host:port, yarn-client, yarn-cluster

Back

in scala shell, how to read a file from the spark source directory?

Front

val textFile = sc.textFile("README.md")

Back

How much faster is Spark (on disk) than Hadoop on disk?

Front

10

Back

RDDs have actions, which return ?

Front

RDDs have actions, which return values

Back

Component of Spark that has machine learning

Front

MLlib

Back

if multiple SparkContexts are running on the same host, they will bind to which ports?

Front

If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).

Back

Section 3

(50 cards)

accumulators are great for running in ________________

Front

parallel

Back

T/F: unlike MapReduce, executors run on the worker node even when the Spark application is not running a task

Front

True

Back

through which entity driver code communicates to Cluster Manager

Front

SparkContext

Back

Use _______________ to create RDDs

Front

SparkContext

Back

what do you set "master" to, to point to a Mesos cluster?

Front

mesos://host:port

Back

if you specify local as master, what does that mean?

Front

1 thread on local box, no parallelism

Back

Lambda in programming means _________________ functions

Front

anonymous

Back

what is the benefit of executors running when a task is not running?

Front

sharing information from memory

Back

what is the default port for standalone spark cluster?

Front

7077

Back

RDD transformations are lazy. What does that mean?

Front

they are not computed immediately

Back
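A short sketch of laziness (assumes a live SparkContext sc; the path is a placeholder):

val lines  = sc.textFile("path_to_file/file.txt")  // lazy: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))     // lazy: only lineage is recorded
val n      = errors.count()                        // action: the file is finally read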

Name 2 examples of actions

Front

collect count

Back

Which parameter in SparkContext dictates which cluster and how many threads to use?

Front

master

Back

3 most common transformations on scala collections

Front

map, filter, flatMap

Back
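The same three transformations sketched on a plain Scala collection (no Spark needed; the values are made up):

val words = List("spark and scala", "rdd")
words.map(_.length)            // List(15, 3)
words.filter(_.contains("dd")) // List("rdd")
words.flatMap(_.split(" "))    // List("spark", "and", "scala", "rdd")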

default port for mesos cluster

Front

5050

Back

To install Spark Standalone mode, you simply place a compiled version of Spark on ________________

Front

To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster.

Back

A lambda is a type of function, defined _____________.

Front

A lambda is a type of function, defined inline.

Back

scala REPL = spark ______

Front

scala REPL IS the spark shell

Back

what percentage of spark programs are written in scala?

Front

95% (YouTube video posted by New Century trainer, 10/2015; he mentions that support for R was added, so the video can't be that old)

Back

where do you keep a list of hostnames that run worker nodes

Front

conf/slaves. To launch a Spark standalone cluster with the launch scripts, create a file called conf/slaves in your Spark directory containing the hostnames of all machines where you intend to start Spark workers, one per line.

Back

where is spark-submit script?

Front

bin

Back

Databricks's product?

Front

Notebooks

Back

sbt is used to build spark applications in spark ______

Front

shell

Back

what are the "code-shortening" advantages of lambda functions?

Front

You don't need to bother with declaring the function somewhere else (and having to look for it when you revisit the code later), or naming something that you're only using once. Once you're used to function values, having to do without them seems as silly as being required to name every expression. NORMAL WAY: int tempVar = 2 * a + b; ... println(tempVar). LAMBDA WAY: println(2 * a + b). (Stack Overflow)

Back

Lambda comes from ____________ calculus

Front

Lambda

Back

pyspark: how to display the content of collection a

Front

a (just type a and press Enter)

Back

when spark runs on hadoop, data node is typically same as __________________

Front

worker node

Back

Cluster Manager and _______________ are typically the same device

Front

Spark Master

Back

Collect action causes ________, filter, and map transformations to be executed

Front

parallelize

Back

spark uses ______________ to parallelize and create pipelines for the sub-processes needed to execute a task.

Front

DAG

Back

if the master dies, what happens to currently running apps and to new apps that have not been submitted? (assume no ZooKeeper)

Front

currently running: no impact; new: no longer possible

Back

Lambda allows you to write quick throwaway functions without _____________

Front

allows you to write quick throwaway functions without naming them

Back

when you add to an accumulator's initial value using the add command, who has access to its value?

Front

only the driver program

Back

An anonymous function allows you to create_______ from functions

Front

An anonymous function allows you to create functions from functions

Back

accumulators are great for _________ and ___________ (examples)

Front

sums and counters

Back

Simply put, a lambda is a function without a __________

Front

Simply put, a lambda is a function without a name

Back

when is a "transformation" actually done?

Front

when an action is performed on it

Back

what do you set "master" to, to point to a standalone Spark cluster?

Front

spark://host:port

Back

3 ways to construct a RDD

Front

1. transform another RDD 2. from files (HDFS or otherwise) 3. parallelize python collections (lists)

Back
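The three construction routes sketched in Scala (assumes a live SparkContext sc; paths and values are placeholders):

val fromColl  = sc.parallelize(List(1, 2, 3, 4, 5))   // 3. parallelize a collection
val fromFile  = sc.textFile("path_to_file/file.txt")  // 2. from a file (HDFS or otherwise)
val fromOther = fromColl.map(_ * 2)                   // 1. transform another RDD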

example of defining a collection (list) in python

Front

>>> a = [1, 2, 3, 4, 5] (>>> is the pyspark prompt)

Back

3 components of worker node

Front

executor cache task

Back

where are the start/stop scripts stored?

Front

sbin

Back

iPython and standalone programs must use the ________________ to create a new SparkContext

Front

Constructor

Back

Which company employs most of Spark committers?

Front

Databricks

Back

True/False: pySpark shell and Databricks cloud automatically create sc variable

Front

True!

Back

sbt is built from ____

Front

scala

Back

which has more data science tools? Python or Scala

Front

Python!

Back

True/False: RDD enables operations on collection of elements in parallel.

Front

True

Back

if you have 2 cores in your local box, what should you set your "master" to?

Front

local[2]

Back

Name 2 examples of transformations

Front

map filter

Back

Scala example of a lambda function:

Front

args.foreach(arg => println(arg)), where the argument to the foreach method is an expression for an anonymous function. (Stack Overflow)

Back

Section 4

(50 cards)

Lambda function that returns arg - 1

Front

lambda foo: foo - 1 (Python; a named def minusone(foo): return foo - 1 would not be a lambda)

Back

if your scala file is test.scala and object inside is HelloWorld, which files will be created when you compile it?

Front

HelloWorld.class HelloWorld$.class

Back

once driver gets slots from the worker nodes, how much communication happens between driver and worker AND between driver and master

Front

NONE

Back

Scala is a lot like Java, so if you know Java, the learning curve for Scala is lower. T/F

Front

True

Back

define a function fsq that will return the square of the argument passed to it

Front

def fsq(x: Double) = x * x (just like on the REPL; the D in Double has to be capitalized)

Back

which method do you call to get the number of partitions of an RDD?

Front

rdd1.getNumPartitions (case sensitive)

Back

Which language uses lambdas heavily?

Front

LISP!

Back

Spark SQL provides new abstraction called _____________

Front

schemaRDD

Back

If you see val sc = new SparkContext(conf), what does that mean for conf, in terms of dependency?

Front

conf has to be defined first, with certain parameters, which will in turn be used in creating the SparkContext sc

Back

how to print the string hello?

Front

println("hello") quotes needed. Otherwise it tries to find variable hello

Back

Scala lacks the same amount of Data Science libraries and ______ as Python

Front

tools

Back

Close cousin of lambdas in C?

Front

function pointer

Back

Transformation actually does nothing (like a file pointer)

Front

nothing really happens until an action is called

Back

You can cache to many types of locations

Front

Memory, disk and more

Back

python API vs scala API (performance)

Front

negligible difference, because it's almost all Scala under the hood

Back

How much faster is MLlib compared to Mahout (disk-based, before Mahout got a Spark interface)?

Front

9 times

Back

if your file is named test.scala , but object inside is called HelloWorld, will that work?

Front

Yes!

Back

If you do 5 transformations on a 5 GB RDD and do 1 simple action, data is finally pulled and network traffic ensues. Then you do one more small action; what happens?

Front

data is pulled over again!!

Back

how to add 5 and 6 and save the result into variable x

Front

val x = 5 + 6

Back

Scala is both object oriented and functional. T/F

Front

True

Back

How do you compile helloworld.scala?

Front

scalac helloworld.scala

Back

you do (in the scala REPL): 2+2. Is the answer saved anywhere?

Front

yes, it is stored in variable res0

Back

if function fsq is defined, how do you pass variable a to it and save the result in variable b?

Front

val b = fsq(a)

Back

Once a user application is bundled, it can be launched using the _____________ script

Front

Once a user application is bundled, it can be launched using the bin/spark-submit script

Back

if you write a class in scala, what extension should you give it?

Front

.scala (e.g. HelloWorld.scala)

Back

If you install Scala and Cygwin on Windows, can you run scala from the Cygwin shell?

Front

yes, just add Scala's bin to the Windows PATH

Back

One difference between lambdas and anonymous inner class in Java

Front

The first is a formal difference: lambdas usually define closures, which means they can refer to variables visible at the time they were defined.

Back

Close cousin of lambdas in Java?

Front

anonymous inner class http://martinfowler.com/bliki/Lambda.html

Back

Can we easily call R directly from python?

Front

People have ported most of the core parts of R to Python and there's easy ways to call R directly from Python (source: quora)

Back

rdd1.toDebugString

Front

public String toDebugString() A description of this RDD and its recursive dependencies for debugging.

Back

what was the appeal of Spark besides speed?

Front

interactive algorithms and queries were not served enough by the batch frameworks (i.e. hadoop oriented programmatic tools)

Back

Map

Front

function applied to each element of the RDD

Back

4 Items on the TOP layer of Spark Stack

Front

1. Spark SQL 2. Spark Streaming 3. MLlib 4. GraphX

Back

Is the name "worker" misleading?

Front

yes. because worker manages all the resources (slots)

Back

3 Items on the bottom layer of spark stack

Front

1. Spark Stand Alone 2. Mesos 3. YARN

Back

Example of defining parameters using conf

Front

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep") (Apache Spark web site)

Back

How do you execute a compiled class from the command prompt?

Front

scala HelloWorld.class OR scala HelloWorld (both work!)

Back

If you do 5 transformations on a 5 GB RDD and do 1 simple action, when is the data pulled?

Front

only when the first action is called

Back

Executors

Front

JVM that actually runs tasks (on worker nodes)

Back

3 (examples) of what scala lacks

Front

1. visualization 2. local tools 3. good local transformations (source: quora)

Back

in scala: how to read a file into RDD

Front

val foofile = sc.textFile("foofile.txt")

Back

2 ways to exit scala shell?

Front

:q and sys.exit

Back

Way to make sure data is not thrown away

Front

Caching

Back
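A small caching sketch (assumes a live SparkContext sc; the path is a placeholder). Without cache(), the second action would pull and recompute the data all over again, as the earlier 5-transformations card notes:

val big = sc.textFile("path_to_file/big.txt").filter(_.nonEmpty).cache()
big.count()  // first action: data is read, result kept in memory
big.count()  // second action: served from the cache, no re-pull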

3 layers of Spark Stack:

Front

TOP: Spark SQL +3 others MIDDLE: Spark CORE BOTTOM: Mesos and 2 others

Back

what does spark-submit script take care of:

Front

This script takes care of: setting up the classpath with Spark and its dependencies

Back

Filter

Front

keep only certain elements of an RDD (those passing a test) and make a new one

Back

Does Ruby have lambdas?

Front

Yes

Back

how to print variable res0?

Front

println(res0)

Back

T/F: C/C++/Java/C# also have lambdas

Front

Not originally (see http://martinfowler.com/bliki/Lambda.html), but C++11, Java 8, and C# have since added lambdas; plain C still lacks them.

Back

example of spark-submit script

Front

./bin/spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> ... # other options <application-jar> [application-arguments]

Back

Section 5

(46 cards)

How do you see what version of java you have installed in cygwin?

Front

java -version (from cygwin shell)

Back

Members of packages, classes, or objects can have which kinds of access?

Front

public private protected (just like java!!)

Back

how to get the next representable value after 2.0 towards 3

Front

val z = Math.nextAfter(2, 3) (THIS worked.)

Back

scala: difference between private and protected?

Front

private: only that class, not even its subclasses protected: that class and its subclasses

Back

you want to calculate e^a (exponential) and assign to b

Front

val b = exp(a) (after import scala.math._)

Back

what does * operator do to a string?

Front

repeats the string N times (i.e. appends N-1 copies): val foo = "abc"; val bar = foo * 2; bar will be "abcabc"

Back

Clojure is a dialect of ______________

Front

LISP

Back

can you specify an absolute path when using sc.textFile to read a file into an RDD?

Front

yes this worked for me

Back

how to split a string based on empty space as a delimiter

Front

scala> "hello world".split(" ") res0: Array[java.lang.String] = Array(hello, world) (This is useful when we read in a text file)

Back

reverse a string named foo

Front

foo.reverse (works!)

Back

Clojure

Front

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming

Back

how do you import the scala math package?

Front

import scala.math._

Back

Default caching mechanism

Front

Memory

Back

if you run sc.textFile on a non-existent file, will you get an error?

Front

NO! Only when you call an action will it give an error. This actually happened to me!

Back

get absolute value of a variable foo

Front

foo.abs

Back

example of simple procedural scala program

Front

def cubed(x: Double) = x * x * x; val a = 10; val b = cubed(a); println(b)

Back

How do you check which version of Scala you have installed in Cygwin?

Front

scala -version (from cygwin shell)

Back

how to convert a string to Int?

Front

val foo = "123" (this is a string) val bar =foo.Int (bar is an integer)

Back

you want to see the value of the variable r3

Front

r3 OR println(r3)

Back

Define a list

Front

val foo = List(1, 2, 4)

Back

how to split a string using space as a delimiter and then print each word out

Front

scala> "hello world".split(" ").foreach(println) hello world

Back

if you write simple procedural scala program (e.g. a bunch statements, no class), how to do you run it? assume filename test2.scala

Front

from OS command line, run: scala test2.scala

Back

what happens if you do: (in REPL) 2 + 5.5

Front

you get 7.5, which is a Double

Back

if you want to read a text file using sc.textFile as soon as you enter the scala REPL via spark-shell, where can you put the file?

Front

put it in the last directory you were in before launching spark-shell. This worked for me!

Back

why does val z = math.Nextafter(2, 3) give an error in Unix?

Front

case sensitivity!

Back

does scala have the concept of "this" like java?

Front

YES!

Back

To add up all the list elements in one line of code

Front

list1.foreach(sum += _) (assuming a var sum = 0 was defined first)

Back

Do a foreach over a list

Front

foo.foreach(println)

Back

val abc = sc.textFile(_______)

Front

path to file name to read from

Back

how to get more info on spark-shell command?

Front

spark-shell --help

Back

how to create a variable z with range of 1 through 10 (z will have values 1 through 10)

Front

val z = 1 to 10 (wow)

Back

T/F: scala has logical operators like || && and !

Front

True

Back

T/F: null is special value compatible with all types

Front

True
scala> val a = 1
a: Int = 1
scala> val a = "hello"
a: String = hello
scala> val a = null
a: Null = null

Back

what happens if you assign too big a number (Double) to a variable? e.g. val b = exp(10000000)

Front

b gets value "infinity" lol, I actually did this :-)

Back

You have read a file into an RDD called myfile. How do you count the lines in that file?

Front

myfile.count

Back

create a range r2 from 1 to 9999 , but skip by 10s

Front

val r2 = 1 to 9999 by 10 (its like speaking English :-)

Back

single line comment in scala

Front

// this is comment

Back

Does spark support clojure?

Front

Yes http://www.infoq.com/articles/apache-spark-introduction

Back

scala: what is "default" access level (if not specified)

Front

public

Back

When you run spark-shell, you get the following; what does that mean? > Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111) at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)

Front

it can't find hadoop. Once I installed hadoop 2.6, it went away.

Back

Example of Collaborative filtering

Front

N users, M products. Based on previous data, trying to make more matches between the two sides.

Back

There is a concept of driver memory

Front

Yes

Back

convert r3 (a range) to a list?

Front

r3.toList (watch the capital L)

Back

multi-line comment in scala

Front

/* comment still comment still no code */

Back

How do you see what version of java complier you have installed in cygwin?

Front

javac -version

Back

you have RDD called myfile. You want to filter out only the lines that has the word "Sadia" in it. Which code?

Front

val Sadia_only = myfile.filter(line => line.contains("Sadia"))

Back