ArrowDataset

Background & Context

    • Efficient multi-file, column-oriented data format.
    • Developed by the Apache Software Foundation.

Import & Export

  • Import["dir","ArrowDataset"] imports an ArrowDataset directory as a Tabular object.
  • Import["dir",{"ArrowDataset",elem,}] imports the specified elements.
  • Import["dir",{"ArrowDataset",elem,subelem1,}] imports subelements subelemi, useful for partial data import.
  • Export["dir",expr,"ArrowDataset"] creates an ArrowDataset directory from expr.
  • Supported expressions expr include:
  • {v1,v2,…}	a single column of data
    {{v11,v12,…},{v21,v22,…},…}	lists of rows of data
    array	an array such as SparseArray, QuantityArray, etc.
    dataset	a Dataset or a Tabular object
  • See the following reference pages for full general information:
  • Import, Export	import from or export to a file
    CloudImport, CloudExport	import from or export to a cloud object
    ImportString, ExportString	import from or export to a string
    ImportByteArray, ExportByteArray	import from or export to a byte array

Import Elements

  • General Import elements:
  • "Elements" list of elements and options available in this file
    "Summary"summary of the file
    "Rules"list of rules for all available elements
  • Data representation elements:
  • "Data"two-dimensional array
    "Dataset"table data as a Dataset
    "Tabular"a Tabular object
  • Additional elements can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed element descriptions.
  • Import by default uses the "Tabular" element.
  • Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:
  • n	nth row or column
    -n	counts from the end
    n;;m	from n through m
    n;;m;;s	from n through m with steps of s
    {n1,n2,…}	specific rows or columns ni
  • Data descriptor elements:
  • "ColumnLabels"names of columns
    "ColumnTypes"association with data type for each column
    "Schema"TabularSchema object

Options

  • General Import options:
  • "Format"Automaticunderlying format to use
    "Partitioning"Nonepartitioning scheme
  • General Export options:
  • "Format""Parquet"underlying format to use
    "MaxPartitions"4096maximal number of partitions
    "MaxRowsPerFile"Infinitymaximal number of rows per file
    "NameTemplate""part{i}"file name template
    "Partitioning""Hive"partitioning scheme
    "SplitColumns"Automaticcolumns used for partitioning
  • Import supports the following settings for "Partitioning":
  • None	no partitioning
    "Hive"	Hive partitioning
    {col1,col2,…}	directory partitioning with partition keys
    {"Directory",{col1,col2,…}}	directory partitioning with partition keys
  • Export supports the following settings for "Partitioning":
  • "Directory"directory partitioning
    "Hive"Hive partitioning
  • Additional options can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed options descriptions.

Examples


Basic Examples  (2)

Export Arrow dataset:
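A minimal sketch of what such an export could look like; the directory path, the sample data and its column names ("id", "group") are placeholders introduced here, and "SplitColumns" is included because the Possible Issues section notes that Export requires it:

    dir = FileNameJoin[{$TemporaryDirectory, "arrow-example"}];
    data = Dataset[{<|"id" -> 1, "group" -> "a"|>, <|"id" -> 2, "group" -> "b"|>, <|"id" -> 3, "group" -> "a"|>}];
    Export[dir, data, "ArrowDataset", "SplitColumns" -> {"group"}]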

Import Arrow dataset:
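Assuming "dir" is the placeholder directory created in the export sketch above:

    Import[dir, "ArrowDataset"]  (* returns a Tabular object by default *)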

Scope  (3)

Import  (3)

Show all elements available in the file:

By default, a Tabular object is returned:

Import column types:
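Possible calls for the three steps above, reusing the placeholder directory "dir" from the Basic Examples sketch:

    (* list the available elements *)
    Import[dir, {"ArrowDataset", "Elements"}]

    (* default import gives a Tabular object *)
    Import[dir, "ArrowDataset"]

    (* association of column types *)
    Import[dir, {"ArrowDataset", "ColumnTypes"}]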

Import Elements  (14)

"ColumnCount"  (1)

Get the number of columns:
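For example, with the placeholder directory "dir" used throughout these sketches:

    Import[dir, {"ArrowDataset", "ColumnCount"}]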

"ColumnLabels"  (1)

Read column names:
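A possible call:

    Import[dir, {"ArrowDataset", "ColumnLabels"}]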

"ColumnTypes"  (1)

Import column types:
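For example:

    Import[dir, {"ArrowDataset", "ColumnTypes"}]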

"Data"  (2)

Get the data from a file:

Import only selected rows:

Import only selected columns:
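Sketches for the three prompts above; the rows/columns subelements follow the {elem,rows,cols} form described under Import Elements:

    (* whole table as a two-dimensional array *)
    Import[dir, {"ArrowDataset", "Data"}]

    (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Data", 1 ;; 2}]

    (* every row, first column only *)
    Import[dir, {"ArrowDataset", "Data", 1 ;; -1, {1}}]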

"Dataset"  (2)

Get the data as a Dataset:

Import only selected rows:

Import only selected columns:
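The same pattern, sketched for the "Dataset" element:

    Import[dir, {"ArrowDataset", "Dataset"}]

    (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Dataset", 1 ;; 2}]

    (* every row, first column only *)
    Import[dir, {"ArrowDataset", "Dataset", 1 ;; -1, {1}}]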

"Dimensions"  (1)

Import data dimensions:
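For instance:

    Import[dir, {"ArrowDataset", "Dimensions"}]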

"MetaInformation"  (1)

Import metadata:
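A possible call:

    Import[dir, {"ArrowDataset", "MetaInformation"}]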

"RowCount"  (1)

Get the number of rows:
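For example:

    Import[dir, {"ArrowDataset", "RowCount"}]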

"Schema"  (1)

Get the TabularSchema object:
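A possible call:

    Import[dir, {"ArrowDataset", "Schema"}]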

"Summary"  (1)

Get the file summary:
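For example:

    Import[dir, {"ArrowDataset", "Summary"}]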

"Tabular"  (2)

Get the data from a file as a Tabular object:

Import only selected rows:

Import only selected columns:
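Sketches for the three prompts above, using the same rows/columns subelement form as for "Data" and "Dataset":

    Import[dir, {"ArrowDataset", "Tabular"}]

    (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Tabular", 1 ;; 2}]

    (* every row, first column only *)
    Import[dir, {"ArrowDataset", "Tabular", 1 ;; -1, {1}}]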

Import Options  (2)

"Format"  (1)

By default, the format of ArrowDataset is inferred from files stored in the input directory:

Use "Format" option to specify underlying format to use:

"Partitioning"  (1)

By default, "Partitioning"None is used. Notice that the column used for partitioning is not imported:

Use "Partitioning" option with correct setting to get all columns:

Export Options  (6)

"Format"  (1)

By default, Export uses the "Parquet" format:

Use "ArrowIPC" format:

"MaxPartitions"  (1)

When the number of unique elements in the split column exceeds the default value of the "MaxPartitions" option, Export fails:

Increase the allowed number of partitions:
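A sketch; the 5000-row sample and the limit of 10000 are arbitrary illustrative values:

    manyDir = FileNameJoin[{$TemporaryDirectory, "arrow-many"}];
    many = Dataset[Table[<|"id" -> i, "value" -> RandomReal[]|>, {i, 5000}]];

    (* more than 4096 distinct split values: fails with the default "MaxPartitions" *)
    Export[manyDir, many, "ArrowDataset", "SplitColumns" -> {"id"}]

    (* raise the limit *)
    Export[manyDir, many, "ArrowDataset", "SplitColumns" -> {"id"}, "MaxPartitions" -> 10000]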

"MaxRowsPerFile"  (1)

By default, the number of rows per file is unlimited:

Limit the number of rows per file:
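A sketch with another placeholder directory, reusing the earlier sample data:

    rowsDir = FileNameJoin[{$TemporaryDirectory, "arrow-rows"}];

    (* default: no limit on rows per file *)
    Export[rowsDir, data, "ArrowDataset", "SplitColumns" -> {"group"}]

    (* write at most one row per file *)
    Export[rowsDir, data, "ArrowDataset", "SplitColumns" -> {"group"}, "MaxRowsPerFile" -> 1]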

"NameTemplate"  (1)

By default, "part{i}" is used as the name template for ArrowDataset files:

Use a different name template:
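A sketch; "chunk{i}" is an arbitrary alternative template chosen for illustration:

    nameDir = FileNameJoin[{$TemporaryDirectory, "arrow-names"}];
    Export[nameDir, data, "ArrowDataset", "SplitColumns" -> {"group"}, "NameTemplate" -> "chunk{i}"];
    FileNames["*", nameDir, Infinity]  (* inspect the generated file names *)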

"Partitioning"  (1)

By default, Export uses "Hive" partitioning:

Use "Directory" partitioning:

"SplitColumns"  (1)

Export requires "SplitColumns" option:

Only column keys of the Tabular object can be used as values of the "SplitColumns" option:
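Sketches for both prompts; "region" is a deliberately nonexistent column name:

    splitDir = FileNameJoin[{$TemporaryDirectory, "arrow-split"}];

    (* omitting "SplitColumns" does not succeed *)
    Export[splitDir, data, "ArrowDataset"]

    (* "region" is not a column key of the data, so this does not succeed either *)
    Export[splitDir, data, "ArrowDataset", "SplitColumns" -> {"region"}]

    (* an existing column key works *)
    Export[splitDir, data, "ArrowDataset", "SplitColumns" -> {"group"}]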

Possible Issues  (1)

Export requires "SplitColumns" option: