
Cards (56)

Section 1

(50 cards)

Describe the abstract IR architecture. Bonus question: where is the representation of the data created?

Front

Representation of the data is created in the offline portion

Back

Which of the following are considered when counting words? Case folding, syntax, tokenization, stopword removal, semantics, stemming, word knowledge

Front

- Case folding
- Tokenization
- Stopword removal
- Stemming

Back

What is the primary way MapReduce achieves fault tolerance?

Front

Restarting tasks

Back

Pros and cons of Boolean retrieval method

Front

Pros:
- Precise, if you know the right strategies
- Precise, if you have an idea of what you're looking for
- Fast/efficient implementation

Cons:
- Users must learn Boolean logic
- Boolean logic is insufficient to capture the richness of language
- No control over the size of the result set: can have too many hits or none
- When do you stop reading? All docs in the results are equally good
- What about partial matches? Those docs could still be useful

Back

Describe the inverted index used in Boolean retrieval

Front

Each word is mapped to a linked list of the doc numbers in which the word appears
EX:
blue -> 2
cat -> 3 -> 2

Back
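A rough Python sketch of this structure, with a plain dict and lists standing in for the linked lists (doc numbers reused from the example above, kept sorted here):

    # Inverted index: each term maps to a postings list of doc numbers.
    index = {
        "blue": [2],
        "cat": [2, 3],
    }

    def boolean_and(term1, term2, index):
        # Docs containing both terms: intersect the two postings lists.
        return sorted(set(index.get(term1, [])) & set(index.get(term2, [])))

    print(boolean_and("blue", "cat", index))  # [2]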

What is a key feature of functional programming?

Front

Higher order functions: functions that accept other functions as arguments

Back
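A small Python illustration of the idea, using only the standard library:

    # apply_twice is a higher-order function: it accepts another function f.
    def apply_twice(f, x):
        return f(f(x))

    print(apply_twice(lambda n: n + 3, 10))          # 16

    # map is also a higher-order function: it applies f element-wise to a list.
    print(list(map(str.upper, ["map", "reduce"])))   # ['MAP', 'REDUCE']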

What is a TaskTracker and where does it run? What is a JobTracker and where does it run?

Front

- TaskTracker: a process that monitors tasks and communicates results to the JobTracker; runs on each DataNode
- JobTracker: handles scheduling, progress tracking, fault tolerance, and resource management; runs on the NameNode

Back

Describe the K-V Pair data structure for MapReduce

Front

(K, V) where K = key, V = value
- Mapper takes data with multiple keys as input
- Outputs data as meaningful K-V pairs
- Reducer takes the data with only a single key and compacts/aggregates the values of that key

Back

Which strategy does MapReduce use: pull scheduling or push scheduling? Also define what those terms mean

Front

MapReduce = pull scheduling; TTs pull tasks by making requests
- Pull = worker asks for more work once it is done
- Push = master keeps sending work to workers

Back

Define heartbeat message

Front

Every TT periodically sends an update signal to the JT that includes a request for a map or reduce task to run

Back

What is the purpose of the partition function?

Front

- Want to control how keys are partitioned across reducers
- System uses a default partition function: hash(key) mod R
- Sometimes useful to override the hash function, e.g. to ensure URLs from the same host end up in the same output file

Back
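A minimal Python sketch of the idea, with made-up keys and R; the real Hadoop Partitioner API looks different:

    from urllib.parse import urlparse

    R = 4  # number of reduce tasks

    def default_partition(key):
        # Default style: hash(key) mod R picks the reducer for this key.
        return hash(key) % R

    def host_partition(url):
        # Override: hash only the hostname so every URL from the same host
        # goes to the same reducer (and therefore the same output file).
        return hash(urlparse(url).netloc) % R

    print(host_partition("http://example.com/a") == host_partition("http://example.com/b"))  # True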

Where are the following stored?
- Input
- Intermediate results
- Final output

Front

- Input: on a distributed file system (FS)
- Intermediate results: stored on the LOCAL FS of the map and reduce workers; NOT written to HDFS
- Final output: stored on the distributed FS, often the input to another MapReduce task; written to HDFS

Back

What are partitions and why are they important?

Front

Partitions = subsets of the intermediate results that have the same key
- All values with the same key are presented to a single Reducer together in a partition

Back

What does the MapReduce environment take care of?

Front

- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the groupByKey step
- Handling machine failures
- Managing required inter-machine communication

Back

What are challenges with using commodity hardware with large-scale computing?

Front

- How do you distribute computation?
- How can we make it easy to write distributed programs?
- What do you do when machines fail?

Back

How does Hadoop locate stragglers?

Front

- Hadoop monitors each task's progress using a progress score between 0 and 1
- If a task's progress score is less than 0.2 (the default in Hadoop) and the task has run for at least 1 min, it is marked as a straggler

Back

Who takes care of coordination and how?

Front

Master node
- Task status tracking
- Scheduling idle tasks as workers become available
- Pushes map task result locations and sizes to reducers
- Pings workers periodically to detect failures

Back

What is the general flow of the Map operation?

Front

1) Define a function
2) Apply it on a list
3) Get another list

Back

Describe ranked retrieval

Front

- Order docs by how likely they are to be relevant
- User model
- Can estimate relevance

Back

Describe the solution for slow workers

Front

Near end of phase, spawn backup copies of tasks and first to finish "wins"

Back

Do Map and Reduce tasks run in parallel or in isolation? Why?

Front

In isolation: reduces start only once the map tasks finish; this saves bandwidth since nodes are not left waiting on other nodes

Back

What assumption is made in the vector space model?

Front

Docs that are "close together" in vector space "talk about" the same things, so retrieve docs based on how close the doc is to the query

Back

Problem: copying data over a network takes time and can slow down distributed computation
Solution: ?

Front

- Bring computation close to the data -> chunk servers also serve as compute servers
- Store files multiple times for reliability
- MapReduce encompasses these solutions

Back

What is the Map function and what does it do to lists of input data?

Front

- A higher-order function that applies a function element-wise to a list of elements
- Transforms lists of elements into new lists of output data

Back

Describe the method for TF-IDF term weighting

Front

Back
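The back of this card is blank; for reference, a minimal Python sketch assuming the standard tf x log(N/df) weighting (the course's exact variant may differ):

    import math

    def tf_idf(tf, df, N):
        # tf = occurrences of the term in the doc
        # df = number of docs containing the term
        # N  = total number of docs in the collection
        # Weight grows with tf and shrinks for terms found in many docs.
        return tf * math.log(N / df)

    print(tf_idf(tf=3, df=10, N=1000))  # 3 * ln(100) ~= 13.8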

Describe the Cosine Similarity method Bonus question: If two vectors result in dot product of 0, are they the same doc or completely different?

Front

Completely different: a dot product of 0 means there is no similarity between the docs

Back
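A minimal Python sketch of cosine similarity over term-weight vectors (hypothetical weights; a dot product of 0 yields similarity 0, matching the bonus answer):

    import math

    def cosine_similarity(d1, d2):
        # d1, d2: term-weight vectors over the same vocabulary order.
        dot = sum(a * b for a, b in zip(d1, d2))
        norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
        return dot / norm if norm else 0.0

    print(cosine_similarity([1, 2, 0], [0, 0, 3]))  # 0.0 -> no shared terms
    print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # ~1.0 -> same direction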

How does MapReduce help HDFS?

Front

- Acts as the processing engine of HDFS
- Helps with the concept of "moving computation" instead of "moving data" => locality of computation
- Cluster consists of nodes that have both storage and processing power
- Multiple nodes perform computation in parallel

Back

How does task granularity help pipelining?

Front

Fine-granularity tasks: # of map tasks >> # of machines
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing

Back

What is the general approach to MapReduce procedure?

Front

1) Identify the key
2) Identify the mapper function
3) Identify the reducer function
4) The system does the rest!

Back

Describe the overall process for fault tolerance

Front

- JT detects the failure via heartbeat
- JT re-executes completed and in-progress map tasks (their output is on the failed node's local disk and is inaccessible)
- JT re-executes in-progress reduce tasks
- Task completion is committed through the master

Back

How is a Boolean query executed in Boolean retrieval?

Front

NOTE: Traversal of postings is linear, assuming the postings are sorted; start with the shortest postings list first

Back
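The main answer appears to rely on a diagram; a minimal Python sketch of the linear merge the NOTE describes, assuming sorted postings and starting with the shortest list:

    def intersect(p1, p2):
        # Linear merge of two sorted postings lists (AND of two terms).
        i = j = 0
        result = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                result.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return result

    def and_query(postings_lists):
        # Start with the shortest list to keep intermediate results small.
        postings_lists = sorted(postings_lists, key=len)
        result = postings_lists[0]
        for plist in postings_lists[1:]:
            result = intersect(result, plist)
        return result

    print(and_query([[1, 2, 4, 11], [2, 3, 4], [2, 4, 6, 8]]))  # [2, 4]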

How many Map and Reduce jobs should there be?

Front

Given M = number of map tasks and R = number of reduce tasks:
- Make M much larger than the number of nodes in the cluster
- One DFS chunk per map task is common
- This improves dynamic load balancing and speeds up recovery from worker failures
- Usually R < M because the output is spread across R files

Back

What architecture is used for task scheduling in MapReduce? Describe what is the JT and TT

Front

Uses a master-slave architecture
- JT = JobTracker; the master node
- TT = TaskTracker; a slave node

Back

Give map and reducer functions for applying MapReduce to the Word Count problem

Front

Back
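The back of this card is blank; a minimal Python sketch of the usual word-count mapper and reducer, with a tiny driver standing in for the framework's shuffle/sort:

    from collections import defaultdict

    def mapper(doc_id, text):
        # Emit <word, 1> for every word in the document.
        for word in text.lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # Sum all the 1s emitted for this word.
        yield (word, sum(counts))

    groups = defaultdict(list)              # shuffle/sort: group values by key
    for k, v in mapper("d1", "the cat saw the dog"):
        groups[k].append(v)
    for word, counts in groups.items():
        print(next(reducer(word, counts)))  # ('the', 2), ('cat', 1), ...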

Describe the workflow of MapReduce from input to results

Front

Back

Define and describe the schedulers that come with MapReduce in Hadoop

Front

FIFO Scheduler: the default; schedules jobs in order of submission
Fair Scheduler: a multi-user scheduler that aims to give every user a fair share of the cluster capacity over time

Back

If a TT fails to communicate with the JT for a period of time (1 min by default), what does the JT do for map jobs and for reduce jobs?

Front

JT assumes the TT has crashed
- For a job in the map phase, the JT asks another TT to re-execute ALL Mappers that previously ran on the failed TT
- For a job in the reduce phase, the JT asks another TT to re-execute ALL Reducers that were in progress on the failed TT

Back

Describe the IR Cycle

Front

Back

What is the job of the Hadoop framework in MapReduce?

Front

Sort and shuffle the data from mappers to reducers so that all the values for a given key reach a single reducer
- A hidden phase between mappers and reducers: groups matching keys from all mappers, sorts them, and passes them to the appropriate reducer

Back

Describe the input of Mappers, Shuffling/ Sorting, and Reducers

Front

Mappers
- Input: <key, value> pairs
- Output: <key, value> pairs that can be grouped by key

Shuffling/Sorting
- Input: <key, value> pairs that can be grouped by key
- Output: <key, <list of values>>

Reducers
- Input: <key, <list of values>>
- Output: <key, value>, where the list in the K-V pair is compacted to a scalar

Back

How are applications represented in MapReduce? What does it encompass?

Front

- App is represented as a job
- Job encompasses multiple map and reduce tasks

Back

Define Map Task Scheduling and Reduce Task Scheduling

Front

Map Task Scheduling: the JT satisfies requests for map tasks by attempting to schedule mappers in the vicinity of their input splits
- So: considers locality

Reduce Task Scheduling: the JT simply assigns the next yet-to-run reduce task to a requesting TT, regardless of the TT's network location and its implied effect on the reducer's shuffle time
- So: does NOT consider locality

Back

What is the Reduce function and what does it do to lists of input data?

Front

- Also known as fold: a higher-order function that processes a list of elements by applying a function pairwise and finally returning a scalar
- Transforms/compacts a list into a scalar

Back

Define stragglers, speculative tasks, and speculative execution

Front

Stragglers = slow tasks
Speculative tasks = redundant tasks
Speculative execution = MapReduce locates stragglers and runs a speculative task for each straggler, which will hopefully finish before the straggler
> The first to commit becomes the definitive copy, and the other task is killed

Back

What is the general flow of the Reduce operation?

Front

1) Define an operator, like +
2) Give an initial value, like 0
3) Apply on a list
4) Get a scalar

Back
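The same flow in Python, using functools.reduce with + and an initial value of 0:

    from functools import reduce
    from operator import add

    # operator: add, initial value: 0, applied on a list -> a scalar
    print(reduce(add, [1, 2, 3, 4], 0))  # 10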

How is text represented? Name and describe the process along with assumptions made

Front

"Bag of Words" - Treat all words in a doc as index terms - Assign a "weight" to each term based on importance (simplest: absence/ presence of word) - Disregard everything else Assumptions: - Term occurrence is indep - Doc relevance is indep - "Words" are well-defined

Back

How are the following failures dealt with?
- Map worker failure
- Reduce worker failure
- Master failure

Front

Map worker failure
- Map tasks completed or in progress at the worker are reset to idle
- Reduce workers are notified when the task is rescheduled on another worker

Reduce worker failure
- Only in-progress tasks are reset to idle
- The reduce task is restarted

Master failure
- The MapReduce task is aborted and the client is notified

Back

Define the following in the context of information retrieval:
- Information retrieval
- What do we search?
- What do we find?

Front

- Information retrieval
  > Focus on textual info, but can be images, video, etc.
- What do we search?
  > Search collections
- What do we find?
  > Find documents

Back

Describe Boolean retrieval

Front

- Users express queries as Boolean expressions using AND, OR, NOT
- Retrieval is based on sets

Back

Name design considerations of MapReduce

Front

- Process vast amounts of data
- Parallel processing
- Large clusters of commodity hardware
- Fault tolerant
- Should be able to increase processing power by adding more nodes -> "scale out," not up
- Sharing data or processing between nodes is bad -> ideally want a "shared-nothing" architecture
- Want batch processing -> process the entire dataset, not random seeks

Back

Section 2

(6 cards)

How can Positional Indexes be used in the Inverted Index for TF-IDF?

Front

Positional indexes can be added to each tuple to record that the word is the nth word in the doc
EX:
blue -> 2 -> (2, 1, [3]) -> (3, 2, [2,4])
cat -> 1 -> (4, 3, [1,5,7])

Back

Describe the generic retrieval process

Front

- Look up the postings lists corresponding to the query terms
- Traverse the postings for each query term
- Store partial query-doc scores in accumulators
- Select the top k results to return

Back
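A minimal Python sketch of this loop, assuming an inverted index that maps each term to (docno, weight) postings (hypothetical weights):

    from collections import defaultdict
    import heapq

    def retrieve(query_terms, index, k=10):
        accumulators = defaultdict(float)              # partial query-doc scores
        for term in query_terms:                       # look up postings lists
            for docno, weight in index.get(term, []):  # traverse postings
                accumulators[docno] += weight
        # Select the top k results by accumulated score.
        return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

    index = {"blue": [(2, 1.2)], "cat": [(2, 0.4), (3, 2.1)]}
    print(retrieve(["blue", "cat"], index, k=2))  # [(3, 2.1), (2, 1.6...)]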

How is MapReduce used in index construction?

Front

- Map over all docs
  > Emit the term as the key and (docno, tf) as the value
  > Emit other info as needed
- Sort/shuffle: group postings by term
- Reduce
  > Gather and sort the postings (by docno or tf)
  > Write the postings to disk

Back

When should you use TF-IDF over Cosine Similarity?

Front

TF-IDF: 1 term and n docs
Cosine sim: k terms and n docs

Back

Describe the inverted index used in TF-IDF

Front

Each word is mapped to a linked list of tuples (docNum, numOccurrences), with the head = number of docs the word appears in
EX:
blue -> 2 -> (2,1) -> (3,2)
cat -> 1 -> (4,3)

Back
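The same structure written as a small Python dict, with a (doc frequency, postings) tuple standing in for the linked list and its head:

    # term -> (doc_frequency, [(docNum, numOccurrences), ...])
    inverted_index = {
        "blue": (2, [(2, 1), (3, 2)]),
        "cat":  (1, [(4, 3)]),
    }

    df, postings = inverted_index["blue"]
    print(df)        # 2 docs contain "blue"
    print(postings)  # [(2, 1), (3, 2)]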

How is MapReduce helpful with info retrieval? How is it not so good?

Front

Helpful for indexing:
- Requires scalability
- Fast
- Batch operation
- Incremental updates may not be important
- Crawling the web

Not so helpful for the retrieval problem:
- Must have sub-second response time
- Only need relatively few results

Back