Overview of the Spark APIs
Introduction
The platform supports the following standard Apache Spark APIs, as well as custom extensions and APIs, for working with data over the Spark engine:
- Spark Datasets — you can consume and update data in the platform by using Apache Spark SQL Datasets/DataFrames. You can also extend the standard Spark DataFrames functionality by using the platform's custom NoSQL Spark DataFrame data source. See Spark Datasets API Reference and the DataFrame sketch after this list.
- Spark Streaming API — you can use the platform's Spark-Streaming Integration Scala API to map platform streams to Spark input streams, and then use the Apache Spark Streaming API to consume data and metadata from these streams. See Spark-Streaming Integration API Reference.
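For example, the following minimal PySpark sketch reads a file from a platform data container as a Spark DataFrame and writes the result back. The file paths and application name are illustrative assumptions; to use the platform's custom NoSQL Spark DataFrame data source instead of a file format, select it with the DataFrame reader/writer format method as described in Spark Datasets API Reference.

from pyspark.sql import SparkSession

# Create a Spark session (the application name is illustrative).
spark = SparkSession.builder.appName("datasets-example").getOrCreate()

# Read a CSV file from a platform data container (hypothetical path).
df = spark.read.option("header", "true").csv("/v3io/projects/examples/input.csv")

# Run a standard Spark DataFrame operation.
df.show()

# Write the data back to the container as Parquet (hypothetical path).
df.write.mode("overwrite").parquet("/v3io/projects/examples/output.parquet")

# Release the session's resources.
spark.stop()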
Note that the platform's NoSQL Web API extends the functionality provided by the Spark APIs and related platform extensions. This API supports various item update modes, conditional-update logic and the use of update expressions, and the ability to define counter attributes. For more information, see NoSQL Web API Reference.
You can run Spark jobs in the platform using standard industry tools.
For example, you can run Spark jobs with the spark-submit script or from a web notebook, as described in the following sections. The platform environments define a SPARK_HOME environment variable that maps to the Spark installation directory, and the Spark-installation binaries directory is included in the execution path ($PATH) to simplify execution from any directory.
The installation also includes the required library files for using the platform's Spark APIs and the built-in Spark examples.
Before running a Spark job, you first need to create a Spark session (for example, assigned to a spark variable) and stop the session at the end of the flow to release its resources (for example, by calling spark.stop()).
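The following minimal PySpark sketch shows this session lifecycle; the application name is an illustrative assumption.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session and assign it to a spark variable.
spark = SparkSession.builder.appName("my-spark-job").getOrCreate()

# ... your Spark code goes here ...

# Stop the session at the end of the flow to release its resources.
spark.stop()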
Running Spark Jobs with spark-submit
You can run Spark jobs by executing the spark-submit script from a platform web-based shell or a Jupyter notebook or terminal.
The master URL of the Spark cluster is preconfigured in the environments of the platform web-based shell and Jupyter Notebook services.
Do not use the --master option of spark-submit to override this configuration.
The library files for the built-in Spark examples are found in the $SPARK_HOME/examples/jars directory. For example, the following command runs the SparkPi example:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples*.jar 10
When the command succeeds, the output should contain the following line:
Pi is roughly 3.1432911432911435
To refer to an application or data file that resides in the platform's data containers, set the file path to the relevant v3io cluster data mount — /v3io/<container name>/<path to file>. For example, the following command runs a PySpark application file (myapp.py) from a pyspark_apps directory in the projects container:
spark-submit /v3io/projects/pyspark_apps/myapp.py
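As an illustration, myapp.py could be a self-contained PySpark application such as the following sketch (the application logic is a hypothetical placeholder):

from pyspark.sql import SparkSession

# Hypothetical contents of /v3io/projects/pyspark_apps/myapp.py.
spark = SparkSession.builder.appName("myapp").getOrCreate()

# Placeholder workload: create a small DataFrame and print its row count.
df = spark.range(1000)
print("Row count:", df.count())

spark.stop()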
Deploy Modes
Client Deployment
By default, spark-submit runs jobs in the client deploy mode, in which the Spark driver runs locally in the environment from which the job is submitted (such as the platform's web-based shell or Jupyter Notebook service).
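For example, the following command is equivalent to the earlier SparkPi example, with the default client deploy mode made explicit:

spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples*.jar 10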
Cluster Deployment
You can optionally submit Spark jobs using the cluster deploy mode, by setting the --deploy-mode option of spark-submit to cluster, in which case the Spark driver runs on the cluster rather than in the local submission environment.
Cluster deployment provides several advantages, such as the ability to automate job execution and to run Spark jobs remotely on the cluster, which is useful, for example, for ongoing Spark jobs such as streaming jobs.
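For example, the following command submits the SparkPi example in cluster deploy mode (an illustrative sketch; the exact options that your environment requires may differ):

spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples*.jar 10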
Running Spark Jobs from a Web Notebook
One way to run Spark jobs is from a web notebook for interactive analytics. The platform comes preinstalled with an open-source web-notebook application — Jupyter Notebook. (See Support and Certification Matrix and The Platform's Application Services). For more information about these tools and how to use them to run Spark jobs, see the respective third-party product documentation.