Given an input vector of features, a Random Forests model performs a classification task and ends in a tie. How does the model handle this outcome?
A. The model will be rebuilt
B. A winner is chosen at random
C. The tree that caused the tie is discarded
D. One more tree is added to the forest
A data engineer is asked to process several large datasets using MapReduce. Upon initial inspection the engineer realizes that there are complex interdependencies between the datasets.
Why is this a problem?
A. MapReduce works best on unstructured data
B. There is no problem; MapReduce accommodates all the data
C. MapReduce can only parse one file at a time.
D. MapReduce is not ideal when the processing of one dataset depends on another.
What is a characteristic of stop words?
A. Used in term frequency analysis
B. Include words such as "a", "an", and "the"
C. Meaningful words requiring a parser to stop and examine them
D. Don't occur often in text
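As a rough illustration of how stop words are usually handled before term-frequency analysis, the sketch below drops them from a tokenized sentence. The stop-word list and the function name remove_stop_words are assumptions made up for this example; real toolkits ship much longer lists.

```python
# Hypothetical, deliberately tiny stop-word list (real lists are far longer).
STOP_WORDS = {"a", "an", "the", "of", "and"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = remove_stop_words("The engineer read an article about the data")
# tokens == ["engineer", "read", "article", "about", "data"]
```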
What is the most likely reason for an HBase table to contain millions of columns?
A. Data is imported from a relational database table
B. Data is stored in the column qualifier
C. There are thousands of column families
D. The column names are randomly generated
Which metric would be most helpful in identifying a node that may cause network disruption if the node were removed?
A. Degree
B. Closeness
C. Betweenness
D. PageRank
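To make the centrality metrics above more concrete, here is a minimal sketch of betweenness for an unweighted, undirected graph, following the general shape of Brandes' algorithm. The toy graph and the function name betweenness are assumptions for illustration only; a graph library would normally be used instead.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected adjacency dict."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

# Hypothetical toy graph: two triangles joined by the single edge c-d.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
scores = betweenness(graph)
```

In this toy graph the two bridge nodes c and d score highest, because every shortest path between the triangles passes through them; removing either one would disconnect the network.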
A hotel chain runs a simulation on room pricing. They want to estimate revenue, per hotel, within +/- $10 with 95% confidence (Zα/2 = 1.96). The estimated revenue standard deviation is $5000 based on previous booking data.
What is the optimal number of simulation trials to run?
A. A 32-bit operating system was used
B. The same number of trials was used
C. A linear congruential generator (LCG) was used for pseudo-random number generation
D. Different seeds for the random number generator were used.
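For the hotel revenue question (margin of +/- $10, 95% confidence, standard deviation $5000), the number of trials can be estimated with the usual sample-size formula n = (z * sigma / E)^2. The sketch below assumes that formula and a normal approximation; the function name required_trials is made up for this example.

```python
import math

def required_trials(z, sigma, margin):
    """Smallest whole number of trials n with z * sigma / sqrt(n) <= margin."""
    return math.ceil((z * sigma / margin) ** 2)

# Values taken from the question: z = 1.96, sigma = $5000, margin = $10.
n = required_trials(1.96, 5000, 10)
# n == 960400
```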
What is NOT a category of a NoSQL data store?
A. Columnar
B. Document
C. Key/Value
D. Flat File
What is a typical use of a UDF in Pig?
A. Creating functionality outside of what is provided by the built-in functions
B. Providing functional access to user-defined data in HDFS
C. Providing advanced analytics to Hadoop
D. Providing an interface from Pig to Microsoft Excel for easier data manipulation
You develop a Python script "logistic.py" to evaluate the logistic function, denoted f(y), for a given value y. The script is used by the following Pig code:
Register 'logistic.py' using jython as udf;
z = FOREACH y GENERATE $0, udf.logistic ($0);
DUMP z;
What is the expected output when the Pig code is executed?
A. 0
B. Jython is not a supported language
C. Value of f(y) for all y
D. Tuples (y, f(y))
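The question does not show the contents of logistic.py, so the following is only a guess at what such a script might look like. It assumes the standard logistic function f(y) = 1 / (1 + e^(-y)) and relies on the outputSchema decorator that Pig makes available to scripts registered with jython; the fallback definition is an addition here so the file can also be run outside Pig.

```python
import math

try:
    outputSchema  # provided by Pig when the script is registered "using jython"
except NameError:
    # Assumption: a no-op stand-in so the script can be tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema("f:double")
def logistic(y):
    """Standard logistic function f(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))
```

With a script like this, the FOREACH statement emits the first field alongside udf.logistic($0) for each row, so each output record pairs an input value with its logistic value.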
You conduct a TFIDF analysis on 3 documents containing raw text and derive TFIDF ("data", document y) = 1.908. You know that the term "data" only appears in document 2.
What is the TF of "data" in document 2?
A. 2 based on the following reasoning: TFIDF = TF*IDF = 1.908. You then know that IDF will equal LOG(3^2) = 0.954. Therefore, TFIDF = TF*0.954 = 1.908. TF will then round to 2.
B. 4 based on the following reasoning: TFIDF = TF*IDF = 1.908. You then know that IDF will equal LOG(3/1) = 0.477. Therefore, TFIDF = TF*0.477 = 1.908. TF will then round to 4.
C. 6 based on the following reasoning: TFIDF = TF*IDF = 1.908. You then know that IDF will equal 3/1 = 3. Therefore, TFIDF = TF/3 = 1.908. TF will then round to 6.
D. 11 based on the following reasoning: TFIDF = TF*IDF = 1.908. You then know that IDF will equal LOG(3/2) = 0.176. Therefore, TFIDF = TF*0.176 = 1.908. TF will then round to 11.
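The arithmetic behind the options can be checked with a short script. This sketch assumes the convention the options themselves use, namely TFIDF = TF * IDF with IDF = log10(N / df), where N is the number of documents and df is the number of documents containing the term; the variable names are chosen here for illustration.

```python
import math

n_docs = 3           # documents in the corpus
docs_with_term = 1   # "data" appears only in document 2
tfidf = 1.908        # value given in the question

# IDF under the assumed base-10 convention, then solve TFIDF = TF * IDF for TF.
idf = math.log10(n_docs / docs_with_term)
tf = tfidf / idf

print(round(idf, 3), round(tf))
```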
Which problem type is best suited for simulation?
A. One with a few, non-random input variables
B. One that has a closed-form solution
C. One with numerous, non-random input variables
D. One that compares "what-if" scenarios
In multinomial logistic regression, what is used to calculate the probability of an outcome occurring?
A. Logistic function applied to a linear combination of the input and outcome variables
B. Linear regression applied to a combination of input variables
C. Linear regression applied to a combination of input and outcome variables
D. Logistic function applied to a linear combination of the input variables
What are the major components of the YARN architecture?
A. ResourceManager and NodeManager
B. Task Tracker and NameNode
C. HDFS, Tez, and Spark
D. Avro, ZooKeeper, and HDFS
Refer to the exhibit.
If two of the communities are re-designated to be one community, how does that change the network characteristics?
A. Neighborhood overlap would increase
B. Network diameter would decrease
C. Modularity would increase
D. Modularity would decrease
What best describes the meaning behind the phrase "Six Degrees of Separation"?
A. Ability to use about six hops to reach any other node in an extremely large social network
B. Erdős number of all scholars having written papers with Paul Erdős
C. Maximum number of edges between nodes in a graph with a diameter of six
D. Typical distance between nodes that are connected by triadic closure