Overview of Spark Datasets
A Spark Dataset is an abstraction of a distributed data collection that provides a common way to access a variety of data sources. A DataFrame is a Dataset that is organized into named columns ("attributes" in the platform's unified data model). See the Spark SQL, DataFrames and Datasets Guide. You can use the Spark SQL Datasets/DataFrames API to access data that is stored in the platform.
In addition, the platform's Iguazio Spark connector defines a custom data source that enables reading and writing data in the platform's NoSQL store using Spark DataFrames, including support for table partitioning, data pruning and filtering (predicate pushdown), "replace"-mode and conditional updates, definition and update of counter table attributes (columns), and optimized range scans. For more information, see The NoSQL Spark DataFrame.
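For orientation, here is a minimal Scala sketch of reading a NoSQL table through the connector. The container name ("mycontainer"), table path ("employees"), and filtered attribute ("age") are hypothetical placeholders; the data-source format identifier (io.iguaz.v3io.spark.sql.kv) is the one used by the NoSQL Spark DataFrame, as described in its reference.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nosql-read-example")
  .getOrCreate()

// Read the "employees" table from the "mycontainer" container with the
// platform's NoSQL Spark DataFrame data source (placeholder names).
val df = spark.read
  .format("io.iguaz.v3io.spark.sql.kv")
  .load("v3io://mycontainer/employees")

// DataFrame filters can be pushed down to the platform (predicate pushdown),
// so only matching items (rows) are read from the table.
df.filter("age > 30").show()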
See Spark DataFrame Data Types for the data types that are currently supported in the platform.
Spark DataFrames and Tables
A DataFrame consumes and updates data in a table, which is a collection of data objects — items (rows) — and their attributes (columns). The attribute name is the column name, and its value is the data stored in the relevant item (row). See also Working with NoSQL Data. As with all data in the platform, the tables are stored within data containers.
When writing (ingesting) data to a table with a Spark DataFrame, you need to set the write options that the target data source requires. In particular, when writing to a NoSQL table with the platform's NoSQL Spark DataFrame, you need to set the key option to the name of the attribute that serves as the table's item (primary) key. For more information, see The NoSQL Spark DataFrame.
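For example, the following sketch writes a small DataFrame to a NoSQL table and uses the key write option to name the primary-key attribute; the column names ("username", "age"), container ("mycontainer"), and table path ("users") are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nosql-write-example")
  .getOrCreate()

import spark.implicits._

// A small DataFrame whose "username" column will serve as the item key.
val usersDF = Seq(("iris", 34), ("dylan", 29)).toDF("username", "age")

// Write the DataFrame to the "users" table in the "mycontainer" container.
// The "key" option identifies the attribute whose value becomes each item's
// name (primary key); "append" mode adds or updates items in the table.
usersDF.write
  .format("io.iguaz.v3io.spark.sql.kv")
  .option("key", "username")
  .mode("append")
  .save("v3io://mycontainer/users")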
Data Paths
When using Spark DataFrames to access data in the platform's data containers, provide the path to the data as a fully qualified v3io path of the following format, where <container name> is the name of the parent data container and <data path> is the relative path to the data within the specified container:
v3io://<container name>/<data path>
You pass the path as a string parameter to the relevant Spark method for the operation that you're performing, such as save("v3io://mycontainer/mytable") or csv("v3io://mycontainer/mycoldata.csv").