Working with NoSQL (Key-Value) Data

Note
The V3IO NoSQL functionality will be deprecated in the next major release.

Overview

The platform provides a NoSQL database service, which supports storage and consumption of data in a tabular format. A table is a collection of data objects known as items (rows), and their attributes (columns). For example, items can represent people with attribute names such as Name, Age, and PhoneNumber.

You can manage and access NoSQL data in the platform by using the NoSQL Frames, Spark DataFrame, or web APIs.

Terminology Comparison

The following table compares the Iguazio AI Platform's NoSQL DB terminology with that of similar third-party tools and the platform's file interface:

NoSQL terminology-comparison table

Creating Tables

NoSQL tables in the platform don't need to be created prior to ingestion. When writing data to a NoSQL table, if the table doesn't exit, it's automatically created in the specified path as part of the write operation.

Deleting Tables

Currently, most platform APIs don't have a dedicated method for deleting a table. An exception to this is the delete method of the V3IO Frames client's NoSQL backend. However, you can use the file-system interface to delete a table directory from the relevant data container:

rm -r <path to table>

Copy

 Copied to clipboard

The following examples delete a "mytable" table from a "mycontainer" container:

  • Local file-system command —

    rm -r /v3io/mycontainer/mytable

    Copy

     Copied to clipboard

  • Hadoop FS command —

    hadoop fs -rm -r v3io://mycontainer/mytable

    Copy

     Copied to clipboard

Partitioned Tables

Table partitioning is a common technique for optimizing physical data layout and related queries. In a partitioned table, some item attributes (columns) are used to create partition directories within the root table directory using the format <table path>/<attribute>=<value>[/<attribute>=<value>/...], and each item is then saved to the respective partition directory based on its attribute values. For example, for a "mytable" table with year and month attribute partitions, an item with attributes year = 2018 and month = 1 will be saved to a mytable/year=2018/month=1/ directory. This allows for more efficient data queries that search for the data only in the relevant partition directories instead of scanning the entire table. This technique is used, for example, by Hive, and is supported for all the built-in Spark Dataset file-based data sources (such as Parquet, CSV, and JSON). The platform includes specific support for partitioning NoSQL tables with Spark DataFrames [Tech Preview], and for querying partitioned tables with Frames, Spark DataFrames [Tech Preview], or Trino. For more information, see the Frames, Spark, and Trino references.

Partitioning Best Practices

When defining table partitions, follow these guidelines:

  • Partition the table according to the queries that you expect to be run most often. For example, if you expect most queries to be by day, define a single day partition. However, if you expect most queries to be based on other attributes, partition your table accordingly; for example, if most queries will be by country, state, and city, create country/state/city partition directories.
  • Bear in mind that the number of partitions and their sizes affect the performance of the queries: to optimize performance, avoid partitions that are too small (less than 10Ks of items) by consolidating partitions, especially if you have many partitions. For example, partitioning a table by seconds or minutes will result in a huge amount of very small partitions and therefore querying the table is likely to be inefficient, especially if you expect to have occasional scans of a larger period of time (such as a year).

Read Optimization

By default, read requests (queries) on a NoSQL table are processed by scanning all items in the table, which affects performance. However, the platform's NoSQL APIs support two types of read optimizations that can be used to improve performance:

Both optimizations involve queries on the table's primary key or its components and rely on the way that data objects (such as table items) are stored in the platform: the name of the object is the value of its primary-key attribute; the object is mapped to a specific data slice according to the value of its sharding key (which is also the primary key for simple object names); and objects with a compound primary key that have the same sharding-key value are sorted on the data slice according to the value of their sorting key. See Object Names and Primary Keys.

It's recommended that you consider these optimizations when you select your table's primary key and plan your queries. For more information and best-practice guidelines, see Best Practices for Defining Primary Keys and Distributing Data Workloads.

Faster Item-Specific Queries

The fastest table queries when using the platform's NoSQL Web API or the Iguazio Trino connector are those that uniquely identify a specific item by its primary-key value. Such queries are processed more efficiently because the platform searches the names of the object files only on the relevant data slice and stops the search when the requested item is located. See the web-API GetItem operation and the Trino CLI references.

Range Scans

A range scan is a query for specific sharding-key values and optionally also for a range of sorting-key values, which is processed by searching the sorted object names on the data slice that is associated with the specified sharding-key value(s). Such processing is more efficient than the standard full table scan, especially for large tables.

When using Frames, NoSQL Spark DataFrames, or Trino, the platform executes range scans automatically for compatible queries. When using the NoSQL Web API, you can select whether to perform a range scan, a parallel scan, or a standard full table scan.

Note that to support and use range scans effectively, you need to have a table with a compound <sharding key>.<sorting key> primary key that is optimized for your data set and expected data-access patterns. When using a NoSQL Spark DataFrame or Trino, the table must also contain sharding- and sorting-key user attributes. Note that range scans are more efficient when using a string sorting-key attribute. For more information and best-practice guidelines, see the Best Practices for Defining Primary Keys and Distributing Data Workloads guide, and especially the Using a Compound Primary Key for Faster NoSQL Table Queries and Using a String Sorting Key for Faster Range-Scan Queries guidelines.

See Also