Scala and Spark Quiz

memorize.ai

Last updated

6 years ago

Date created

Mar 1, 2020

Cards (149)

Section 1

(50 cards)

Which of the following is not a component of Spark Ecosystem?

Front

Sqoop

Back

What are the benefits of lazy evaluation?

Front

Increases the manageability of the program. Saves computation overhead and increases the speed of the system. Reduces time and space complexity. Provides optimization by reducing the number of queries.

Back

What are the parameters defined to specify a window operation?

Front

Window length, sliding interval

Back

What are paired RDDs?

Front

They are RDDs containing key-value pairs. A key-value pair (KVP) consists of two linked data items: a key and its value.

Back
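
The shape of paired-RDD operations such as reduceByKey can be sketched with plain Scala collections. This is a hypothetical, Spark-free analogy (the real API would go through SparkContext and an RDD, not a Seq):

```scala
// Hypothetical analogy using plain Scala collections, not the real Spark API.
// A "paired RDD" is conceptually a distributed collection of (key, value) tuples.
val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

// reduceByKey analog: group by key, then reduce the values for each key.
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```

The per-key reduction is the essential idea: values that share a key are combined, regardless of where they sit in the collection.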

How is fault tolerance achieved in Apache Spark?

Front

If any partition of an RDD is lost due to a worker node failure, then that partition can be re-computed from the original fault-tolerant dataset using the lineage of operations.

Back

What are the cases where Apache Spark surpasses Hadoop?

Front

Data processing speed increases in Apache Spark because Spark runs computations in memory. System performance improves by 10x-1000x. Apache Spark supports various languages for distributed application development. We can use streaming, SQL, graph processing, and machine learning (MLlib).

Back

cache()

Front

We can cache an RDD using the cache() or persist() method. With cache(), all the RDD data is kept in memory. The difference between cache() and persist() is the default storage level: for cache() it is MEMORY_ONLY, while persist() offers various storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY.

Back

Persistence

Front

Is an optimization technique that saves the result of RDD evaluation. Using it we save an intermediate result for further use, which reduces computation overhead. We can persist an RDD through the cache() and persist() methods.

Back
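
The effect of persistence (compute once, reuse thereafter) can be imitated without Spark. This is a plain-Scala analogy using a lazy val, not the actual cache() implementation, but it shows the same compute-once behavior as MEMORY_ONLY:

```scala
// Plain-Scala analogy for persistence: compute an expensive result once,
// then serve every later access from the stored copy (like MEMORY_ONLY).
var computations = 0

lazy val expensive: Seq[Int] = {
  computations += 1              // counts how many times the work actually runs
  (1 to 5).map(_ * 2)
}

val first  = expensive.sum       // triggers the computation
val second = expensive.sum       // reuses the stored result; no recomputation
```

Without persistence, a Spark lineage would be re-evaluated on every action; with it, later actions reuse the stored partitions.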

What is Apache Spark?

Front

It is an open-source framework and wide-range data processing engine. It allows data workers to execute streaming, machine learning, or SQL workloads. Spark does not have its own storage system.

Back

Actions

Front

Return the final result of RDD computations. An action triggers execution, using the lineage graph to load the data into the original RDD, applies all the intermediate transformations, and gives the final result to the driver program or writes it out to the file system. Applying an action on an RDD generates a non-RDD value.

Back

What is lazy evaluation in Spark?

Front

It means that Spark evaluates transformations lazily: it does not execute the transformations on RDDs until an action is executed on the resulting RDD.

Back
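
The same describe-now, run-later behavior can be imitated with a Scala collection view. This is only an analogy (Spark's laziness works through its DAG of transformations, not views): the "transformations" record work, and nothing runs until a terminal operation, the "action", forces it:

```scala
// Plain-Scala analogy for Spark's lazy evaluation using a collection view.
var evaluated = 0

// "Transformations": nothing is computed yet, just like RDD.map/filter.
val pipeline = (1 to 10).view
  .map { x => evaluated += 1; x * x }
  .filter(_ % 2 == 0)

val before = evaluated        // still 0 — no element has been touched

// "Action": forcing the view runs the whole pipeline.
val result = pipeline.toList  // List(4, 16, 36, 64, 100)
```

Deferring execution like this is what lets Spark look at the whole chain of transformations and optimize it before touching any data.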

DataSet

Front

Is an extension of the DataFrame API which provides a type-safe, object-oriented programming interface.

Back

Which of the following is not the feature of Spark?

Front

It is cost efficient

Back

Which of the following algorithm is not present in MLlib?

Front

Tanimoto distance

Back

The main abstraction of Spark is its RDDs

Front

Also, we can cache RDD using the cache() or persist() method

Back

Spark Limitations

Front

Does not have its own file management system. In-memory capability can become a bottleneck. Memory consumption is very high, and the issues around it are not handled in a user-friendly manner. MLlib lacks some algorithms, for example, Tanimoto distance.

Back

There exist two types of data that should be recovered in the event of failure:

Front

Data received and replicated; data received but buffered for replication.

Back

Narrow transformation

Front

Is the result of map, filter, and similar operations, such that the data comes from a single partition only.

Back

Can we add or set up new streaming computations after SparkContext starts?

Front

no

Back

The default storage level of cache() is?

Front

MEMORY_ONLY

Back

Examples of transformations

Front

map, filter, reduceByKey, etc.

Back

Wide transformation

Front

are the result of groupByKey and reduceByKey

Back
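
The narrow/wide distinction can be sketched with plain collections. In this hypothetical analogy a Vector of inner Seqs stands in for a partitioned RDD: a narrow transformation works inside each partition independently, while a wide one must combine records across partitions, which is what forces a shuffle in real Spark:

```scala
// Hypothetical analogy: each inner Seq plays the role of one RDD partition.
val partitions: Vector[Seq[(String, Int)]] =
  Vector(Seq(("a", 1), ("b", 2)), Seq(("a", 3), ("c", 4)))

// Narrow (map-like): each partition is transformed on its own.
val doubled = partitions.map(_.map { case (k, v) => (k, v * 2) })

// Wide (groupByKey-like): records with the same key must be brought
// together across partitions — the "shuffle".
val grouped: Map[String, Seq[Int]] =
  partitions.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
```

Note that key "a" appears in both partitions, so grouping it requires data movement across partition boundaries, whereas doubling never does.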

Dataset was introduced in which Spark release?

Front

Spark 1.6

Back

In which Spark version was Spark SQL implemented?

Front

Version 1.1

Back

The basic abstraction of Spark Streaming is

Front

DStream

Back

How is data represented in Spark?

Front

RDD, DataFrame, Dataset.

Back

What is Spark Core

Front

It provides parallel and distributed processing for large datasets. Spark Core provides speed through in-memory computation. RDD is the basic data structure of Spark Core. RDDs are immutable, partitioned collections of records that can be operated on in parallel.

Back

How many types of Transformation are there?

Front

Two: narrow transformations and wide transformations.

Back

Transformations

Front

are lazy operations on an RDD that create one or many new RDDs

Back

RDD

Front

Resilient Distributed Datasets. RDD is the fundamental data structure of Spark. It is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations on: data in stable storage; an already existing collection parallelized in the driver program; other RDDs.

Back

Which cluster managers does Spark support?

Front

Standalone Cluster Manager, Mesos, YARN

Back

Which is the abstraction of Apache Spark?

Front

Shared variables and RDDs

Back

Apache Spark was made open-source in which year?

Front

2010

Back

Which of the following is not an output operation on DStream?

Front

reduceByKeyAndWindow

Back

Difference between MapReduce and Spark

Front

MapReduce works on disk, while Spark RDDs work in memory.

Back

Which languages does Apache Spark support?

Front

Spark provides APIs in various languages: Python, R, Scala, and Java.

Back

Explain the operations of Apache Spark RDD

Front

Apache Spark RDD supports two types of operations: transformations and actions.

Back

What are the features of Spark?

Front

The processing speed of Apache Spark is very high. Apache Spark is dynamic in nature. We can reuse code, for example to join a stream against historical data. Recovery is possible in RDDs. Spark supports many languages. It can run independently and also on other cluster managers like Hadoop YARN.

Back

In how many ways can RDDs be created?

Front

Parallelized collections; external datasets (referencing a dataset); creating an RDD from an existing RDD.

Back

Where can we run Spark?

Front

We can run Spark by itself or on various existing cluster managers, for example: standalone deploy mode, Apache Mesos, Hadoop YARN.

Back

DataFrame

Front

Data organized into named columns, like a table in a relational database. It is an immutable distributed collection of data that allows developers to impose a structure onto a distributed collection of data.

Back
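
The idea of named columns over a distributed collection can be sketched with a Scala case class standing in for a row schema. This is a plain-Scala analogy, not the Spark DataFrame API (a typed case-class collection is actually closer in spirit to a Dataset), with a hypothetical Person schema:

```scala
// Plain-Scala analogy: a case class gives each field a name, the way a
// DataFrame names its columns; the Seq stands in for the distributed data.
case class Person(name: String, age: Int)

val people = Seq(Person("Ana", 34), Person("Luis", 28), Person("Mei", 41))

// Column-style query: select "name" where "age" > 30.
val over30: Seq[String] = people.filter(_.age > 30).map(_.name)
```

Imposing the structure is what lets queries refer to fields by name instead of by position in an opaque record.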

Which is not a component on the top of Spark Core?

Front

Spark RDD

Back

In Spark Streaming, what sources can the data come from?

Front

Kafka, Flume, Kinesis

Back

Spark is developed in which language

Front

Scala

Back

What are the components of the Spark ecosystem?

Front

Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Spark GraphX, SparkR

Back

In addition to stream processing jobs, what other functionality does Spark provide?

Front

Machine learning, graph processing, batch processing

Back

Internally, a DStream is

Front

A continuous stream of RDDs

Back

Hadoop and spark

Front

Cost efficiency: during replication, a large number of servers, a huge amount of storage, and a large data center are required; Apache Spark is a cost-effective solution for a big data environment. Performance: the basic idea behind Spark was to improve the performance of data processing, and Spark did this by 10x-100x; the credit for faster processing goes to in-memory processing. In Hadoop, data processing takes place on disk, while in Spark it takes place in memory, moving to disk only when needed. Ease of development: the core of Spark is the distributed execution engine; Hadoop also supports some of these workloads, but Spark eases development by combining them all into the same application.

Back

Is Spark included in every major distribution of Hadoop?

Front

Yes

Back

Apache Spark has APIs in

Front

Java, Python, Scala

Back

Section 2

(50 cards)

Which of the following is the entry point of a Spark application?

Front

SparkContext

Back

Which of the following is true about DataFrame?

Front

DataFrames provide a more user-friendly API than RDDs.

Back

For Multiclass classification problem which algorithm is not the solution?

Front

Decision Trees

Back

You can connect R program to a Spark cluster from -

Front

RStudio, R shell, Rscript

Back

In how many ways can an RDD be created?

Front

Three

Back

Fault Tolerance in RDD is achieved using

Front

DAG(Directed Acyclic Graph)

Back

In Structured Streaming, do we need to specify the schema for file-based sources?

Front

Yes

Back

Which of the following is true about wide transformations?

Front

The data required to compute resides on multiple partitions.

Back

For Regression problem which algorithm is not the solution?

Front

logistic Regression

Back

How many tasks does Spark run on each partition?

Front

one

Back

Which of the following is the entry point of Spark SQL?

Front

SparkSession

Back

Which of the following is true for RDD?

Front

We can operate Spark RDDs in parallel with a low-level API

Back

Which of the following is true for Spark core?

Front

It is the kernel of Spark

Back

Can you combine the libraries of Apache Spark into the same application, for example, MLlib, GraphX, SQL and DataFrames, etc.?

Front

Yes

Back

Which of the following is false for Apache Spark?

Front

Spark is an open source framework which is written in Java

Back

When we want to work with the actual dataset, at that point do we use a transformation?

Front

False

Back

Which of the following is true for RDD?

Front

RDD in Apache Spark is an immutable collection of objects

Back

Which of the following is true for Spark R?

Front

It allows data scientists to analyze large datasets and interactively run jobs

Back

Which of the following is true for Spark MLlib?

Front

It is the scalable machine learning library which delivers efficiencies

Back

Which of the following is not true for Hadoop and Spark?

Front

Both have their own file system

Back

The shortcomings of Hadoop MapReduce were overcome by Spark RDD through

Front

Lazy evaluation, DAG, in-memory processing

Back

Which of the following is not a function of Spark Context in Apache Spark?

Front

Entry point to Spark SQL

Back

Can we edit the data of RDD, for example, the case conversion?

Front

No

Back

What is action in Spark RDD?

Front

The way to send results from executors to the driver

Back

Which of the following is true about narrow transformations?

Front

The data required to compute resides on the single partition.

Back

How many Spark Context can be active per JVM?

Front

Only one

Back

Which of the following uses Spark Core's fast scheduling capability to perform streaming analytics?

Front

Spark Streaming

Back

Which of the following is not an action?

Front

map

Back

Caching is an optimization technique

Front

true

Back

Which of the following is true for Spark Shell?

Front

It helps Spark applications to easily run on the command line of the system. It runs/tests application code interactively. It allows reading from many types of data sources.

Back

Which of the following is true for Spark SQL?

Front

It enables users to run SQL / HQL queries on the top of Spark.

Back

How much faster can Apache Spark potentially run batch-processing programs when processed in memory than MapReduce can?

Front

100 times faster

Back

Which of the following is the reason for Spark being speedier than MapReduce?

Front

DAG execution engine and in-memory computation

Back

What does Spark Engine do?

Front

Scheduling, distributing data across a cluster, monitoring data across a cluster

Back

For binary classification problem which algorithm is not the solution?

Front

Naive Bayes

Back

Which of the following is a tool of Machine Learning Library?

Front

Persistence; utilities like linear algebra and statistics; pipelines

Back

In which mode can you better launch the task?

Front

Both Standalone and coarse-grained mode.

Back

What is a transformation in Spark RDD?

Front

Takes an RDD as input and produces one or more RDDs as output.

Back

The write operation on RDD is

Front

Coarse-grained

Back

RDDs are fault-tolerant and immutable

Front

True

Back

Which of the following is not a transformation?

Front

reduce

Back

When does Apache Spark evaluate RDD?

Front

Upon action

Back

Can you recover Accumulator and Broadcast variable from checkpointing in Spark Streaming?

Front

YES

Back

What are the features of Spark RDD?

Front

In-memory computation, lazy evaluation, fault tolerance

Back

Does Spark R make use of MLlib in any aspect?

Front

Yes

Back

The read operation on RDD is

Front

Either fine-grained or coarse-grained

Back

Is MLlib deprecated?

Front

No

Back

Is it possible to mitigate stragglers in RDD?

Front

Yes

Back

In which of the following cases do we keep the data in-memory?

Front

Iterative algorithms; interactive data mining tools.

Back

SparkContext guides how to access the Spark cluster.

Front

TRUE

Back

Section 3

(49 cards)

Which of the following is good for low-level transformations and actions?

Front

RDD

Back

Which of the following technologies is good for stream processing?

Front

Apache Flink

Back

Which of the following is open-source?

Front

Apache Spark, Apache Hadoop, Apache Flink

Back

Which of the following is true for Spark SQL?

Front

Hive transactions are not supported by Spark SQL. There is no support for timestamps in Avro tables. Even if the inserted value exceeds the size limit, no error will occur.

Back

Which of the following is true for the rule in Catalyst optimizer?

Front

We can manipulate trees using rules. We can define a rule as a function from one tree to another tree. Using rules we get patterns that match each pattern to a result.

Back

In an aggregate function, can we get a data type different from the input data type?

Front

Yes

Back

When SQL is run from another programming language, the result will be

Front

Either DataFrame or Dataset

Back

Apache Spark supports -

Front

Batch processing, stream processing, graph processing

Back

Which of the following is not true for Catalyst Optimizer?

Front

There are no specific libraries to process relational queries.

Back

Dataset API is not supported by Python. But because of the dynamic nature of Python, many benefits of Dataset API are available.

Front

True

Back

Which of the following organizes data into named columns?

Front

DataFrame, Dataset

Back

DataFrame API has provision for compile-time type safety.

Front

false

Back

Which of the following are uses of Apache Spark SQL?

Front

It executes SQL queries. We can read data from an existing Hive installation using Spark SQL. When we run SQL within another programming language, we will get the result as a DataFrame or Dataset.

Back

Which of the following is true for Catalyst optimizer?

Front

The optimizer helps us run queries much faster than their RDD counterparts.

Back

Which of the following is action?

Front

countByValue()

Back

In the physical planning phase of query optimization we can use both cost-based and rule-based optimization.

Front

True

Back

Which of the following is not true for Apache Spark Execution?

Front

Using SQL, we can query data only from inside a Spark program and not from external tools.

Back

SparkSQL translates commands into codes. These codes are processed by

Front

Executor Nodes

Back

RDD allows Java serialization

Front

true

Back

Which of the following is a transformation?

Front

mapPartitionsWithIndex()

Back

Which of the following are the common feature of RDD and DataFrame?

Front

Immutability, in-memory, resilient

Back

The primary Machine Learning API for Spark is now the _____ based API

Front

DataFrame

Back

In the analysis phase, which is the correct order of execution after forming the unresolved logical plan?

Front

acbd

Back

After transforming into a DataFrame, one cannot regenerate a domain object

Front

true

Back

In which of the following actions is the result not returned to the driver?

Front

foreach()

Back

Which of the following is not true for map() Operation?

Front

Map allows returning 0, 1 or more elements from map function.

Back

With the help of Spark SQL, we can query structured data as a distributed dataset (RDD).

Front

True

Back

FlatMap transforms an RDD of length N into another RDD of length M. Which of the following is true for N and M?

Front

N > M; N < M (both are possible)

Back
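
flatMap's ability to change the length (0, 1, or more outputs per input, so M can differ from N) behaves the same on plain Scala collections as on RDDs. A minimal Spark-free sketch:

```scala
// flatMap may emit 0, 1, or more output elements per input element,
// so an input of length N can become an output of any length M.
val words = Seq("spark core engine", "", "rdd")

// Splitting: 3 inputs become 4 outputs (and the empty string yields none).
val tokens = words.flatMap(_.split("\\s+").filter(_.nonEmpty))

// By contrast, map always produces exactly one output per input (M == N).
val lengths = words.map(_.length)
```

Here N = 3 and M = 4, showing M > N; an input of only empty strings would give M < N.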

We can create DataFrame using:

Front

a. Tables in Hive b. Structured data files c. External databases

Back

Spark SQL can connect through JDBC or ODBC.

Front

TRUE

Back

Which of the following is true for stateless transformation?

Front

The processing of each batch has no dependency on the data of previous batches.

Back

Does the Dataset API support Python and R?

Front

no

Back

Which of the following makes use of an encoder for serialization?

Front

Dataset

Back

In Spark SQL optimization which of the following is not present in the logical plan -

Front

Abstract syntax tree

Back

Which of the following is not a Spark SQL query execution phases?

Front

Execution

Back

Which of the following is the fundamental data structure of Spark?

Front

RDD

Back

Using Spark SQL, we can create or read a table containing union fields.

Front

false

Back

Which of the following is true for stateful transformation?

Front

Uses data or intermediate results from previous batches and computes the result of the current batch.

Back
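
The stateless/stateful contrast can be sketched with plain Scala folds over a sequence of "batches". This is a hypothetical analogy (real Spark Streaming would use operations such as updateStateByKey): a stateless transformation looks at each batch alone, while a stateful one carries results forward:

```scala
// Hypothetical analogy: each inner Seq is one micro-batch of events.
val batches = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// Stateless: each batch is processed independently of the others.
val perBatchSums = batches.map(_.sum)

// Stateful: a running total is carried from batch to batch.
val runningTotals = batches.scanLeft(0)((acc, batch) => acc + batch.sum).tail
```

The stateless sums depend only on their own batch; each running total also depends on every batch before it.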

DataFrame in Apache Spark prevails over RDD and does not contain any feature of RDD.

Front

false

Back

In a DataFrame in Spark, once the domain object is converted into a DataFrame, regeneration of the domain object is not possible.

Front

true

Back

Catalyst Optimizer supports either rule-based or cost-based optimization.

Front

false

Back

Spark SQL plays the main role in the optimization of queries.

Front

True

Back

Apache Spark is presently included in all major distributions of Hadoop

Front

True

Back

The Dataset API is accessible in

Front

Java and Scala

Back

Which of the following is a module for Structured data processing?

Front

Spark SQL

Back

Which of the following provides an object-oriented programming interface?

Front

Dataset

Back

Which of the following is not true for DataFrame?

Front

DataFrame in Apache Spark is behind RDD

Back

Which of the following is true for the tree in Catalyst optimizer?

Front

A tree is the main data type in Catalyst. New nodes are defined as subclasses of the TreeNode class. A tree contains node objects.

Back

This optimizer is based on functional programming constructs in

Front

Scala

Back