Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?
A. itemsDf.persist(StorageLevel.MEMORY_ONLY)
B. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
C. itemsDf.store()
D. itemsDf.cache()
E. itemsDf.write.option('destination', 'memory').save()
The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as strings formatted like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))
A. 1. withColumn 2. "transactionDateForm" 3. "MMM d (EEEE)" 4. "transactionDate"
B. 1. select 2. "transactionDate" 3. "transactionDateForm" 4. "MMM d (EEEE)"
C. 1. withColumn 2. "transactionDateForm" 3. "transactionDate" 4. "MMM d (EEEE)"
D. 1. withColumn 2. "transactionDateForm" 3. "transactionDate" 4. "MM d (EEE)"
E. 1. withColumnRenamed 2. "transactionDate" 3. "transactionDateForm" 4. "MM d (EEE)"
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?
A. itemsDf.cache().count()
B. itemsDf.cache(eager=True)
C. cache(itemsDf)
D. itemsDf.cache().filter()
E. itemsDf.rdd.storeCopy()
Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?
A. spark.mode("parquet").read("/FileStore/imports.parquet")
B. spark.read.path("/FileStore/imports.parquet", source="parquet")
C. spark.read().parquet("/FileStore/imports.parquet")
D. spark.read.parquet("/FileStore/imports.parquet")
E. spark.read().format('parquet').open("/FileStore/imports.parquet")
Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?
A. transactionsDf.dropna("any")
B. transactionsDf.dropna(thresh=4)
C. transactionsDf.drop.na("",2)
D. transactionsDf.dropna(thresh=2)
E. transactionsDf.dropna("",4)
The code block shown below should return the number of columns in the CSV file stored at location filePath. From the CSV file, only lines should be read that do not start with a # character. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
A. 1. size 2. spark 3. read() 4. escape='#' 5. columns
B. 1. DataFrame 2. spark 3. read() 4. escape='#' 5. shape[0]
C. 1. len 2. pyspark 3. DataFrameReader 4. comment='#' 5. columns
D. 1. size 2. pyspark 3. DataFrameReader 4. comment='#' 5. columns
E. 1. len 2. spark 3. read 4. comment='#' 5. columns
Which of the following describes the conversion of a computational query into an execution plan in Spark?
A. Spark uses the catalog to resolve the optimized logical plan.
B. The catalog assigns specific resources to the optimized memory plan.
C. The executed physical plan depends on a cost optimization from a previous stage.
D. Depending on whether DataFrame API or SQL API are used, the physical plan may differ.
E. The catalog assigns specific resources to the physical plan.
Which of the following describes Spark's standalone deployment mode?
A. Standalone mode uses a single JVM to run Spark driver and executor processes.
B. Standalone mode means that the cluster does not contain the driver.
C. Standalone mode is how Spark runs on YARN and Mesos clusters.
D. Standalone mode uses only a single executor per worker per application.
E. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
A. itemsDf.withColumn(["supplier", "manufacturer"])
B. itemsDf.withColumn("supplier").alias("manufacturer")
C. itemsDf.withColumnRenamed("supplier", "manufacturer")
D. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
E. itemsDf.withColumnsRenamed("supplier", "manufacturer")
Which of the following describes Spark's way of managing memory?
A. Spark uses a subset of the reserved system memory.
B. Storage memory is used for caching partitions derived from DataFrames.
C. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
D. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
E. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?
A. transactionsDf.distinct("productId")
B. transactionsDf.dropDuplicates(subset=["productId"])
C. transactionsDf.drop_duplicates(subset="productId")
D. transactionsDf.unique("productId")
E. transactionsDf.dropDuplicates(subset="productId")
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
A.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", StringType()),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
B.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType),
    StructField("attributes", ArrayType(StringType)),
    StructField("supplier", StringType)])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
C.
itemsDf = spark.read.schema('itemId integer, attributes
D.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
E.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType([StringType()])),
    StructField("supplier", StringType())])

itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
A. transactionsDf.select("storeId").dropDuplicates().count()
B. transactionsDf.select(count("storeId")).dropDuplicates()
C. transactionsDf.select(distinct("storeId")).count()
D. transactionsDf.dropDuplicates().agg(count("storeId"))
E. transactionsDf.distinct().select("storeId").count()
The code block shown below should return a column that indicates, through boolean values, whether rows in DataFrame transactionsDf have a value greater than or equal to 20 and less than or equal to 30 in column storeId and the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
A. 1. select 2. col("storeId") 3. between(20, 30) 4. and 5. col("productId")==2
B. 1. where 2. col("storeId") 3. geq(20).leq(30) 4. and 5. col("productId")==2
C. 1. select 2. "storeId" 3. between(20, 30) 4. andand 5. col("productId")==2
D. 1. select 2. col("storeId") 3. between(20, 30) 4. andand 5. col("productId")=2
E. 1. select 2. col("storeId") 3. between(20, 30) 4. and 5. col("productId")==2
The code block displayed below contains an error. The code block should use the Python method find_most_freq_letter to find the letter that occurs most frequently in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))
A. Spark is not using the UDF method correctly.
B. The UDF method is not registered correctly, since the return type is missing.
C. The "itemName" expression should be wrapped in col().
D. UDFs do not exist in PySpark.
E. Spark is not adding a column.