ArrowDataset

Background

    • Efficient multi-file, column-oriented data format.
    • Developed by the Apache Software Foundation.

Import & Export

  • Import["dir","ArrowDataset"] imports an ArrowDataset directory as a Tabular object.
  • Import["dir",{"ArrowDataset",elem,}] imports the specified elements.
  • Import["dir",{"ArrowDataset",elem,subelem1,}] imports subelements subelemi, useful for partial data import.
  • Export["dir",expr,"ArrowDataset"] creates an ArrowDataset directory from expr.
  • Supported expressions expr include:
  • {v1,v2,}a single column of data
    {{v11,v12,},{v21,v22,},}lists of rows of data
    arrayan array such as SparseArray, QuantityArray, etc.
    dataseta Dataset or a Tabular object
  • See the following reference pages for full general information:
  • Import, Exportimport from or export to a file
    CloudImport, CloudExportimport from or export to a cloud object
    ImportString, ExportStringimport from or export to a string
    ImportByteArray, ExportByteArrayimport from or export to a byte array

Import Elements

  • General Import elements:
      "Elements"    list of elements and options available in this file
      "Summary"    summary of the file
      "Rules"    list of rules for all available elements
  • Data representation elements:
      "Data"    two-dimensional array
      "Dataset"    table data as a Dataset
      "Tabular"    a Tabular object
  • Additional elements can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed element descriptions.
  • Import by default uses the "Tabular" element.
  • Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:
      n    nth row or column
      -n    counts from the end
      n;;m    from n through m
      n;;m;;s    from n through m with steps of s
      {n1,n2,…}    specific rows or columns ni
  • Data descriptor elements:
      "ColumnLabels"    names of columns
      "ColumnTypes"    association with data type for each column
      "Schema"    TabularSchema object

Options

  • General Import options:
      "Format"    Automatic    underlying format to use
      "Partitioning"    None    partitioning scheme
  • General Export options:
      "Format"    "Parquet"    underlying format to use
      "MaxPartitions"    4096    maximal number of partitions
      "MaxRowsPerFile"    Infinity    maximal number of rows per file
      "NameTemplate"    "part{i}"    file name template
      "Partitioning"    "Hive"    partitioning scheme
      "SplitColumns"    Automatic    columns used for partitioning
  • Import supports the following settings for "Partitioning":
      None    no partitioning
      "Hive"    Hive partitioning
      {col1,col2,…}    directory partitioning with partition keys
      {"Directory",{col1,col2,…}}    directory partitioning with partition keys
  • Export supports the following settings for "Partitioning":
      "Directory"    directory partitioning
      "Hive"    Hive partitioning
  • Additional options can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed options descriptions.

Examples


Basic Examples  (2)

Export Arrow dataset:
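A minimal sketch of such an export cell; the Tabular data, column names, and target directory below are illustrative, not taken from the original page:
    tab = Tabular[{
        <|"city" -> "Boston", "year" -> 2023, "value" -> 1.2|>,
        <|"city" -> "Boston", "year" -> 2024, "value" -> 1.5|>,
        <|"city" -> "Dallas", "year" -> 2023, "value" -> 0.9|>,
        <|"city" -> "Dallas", "year" -> 2024, "value" -> 1.1|>}];
    dir = FileNameJoin[{$TemporaryDirectory, "exampleArrowDataset"}];
    Export[dir, tab, "ArrowDataset", "SplitColumns" -> {"city"}]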

Import Arrow dataset:
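A matching import sketch, reusing dir from the export above; with the default "Partitioning" setting the split column may be omitted (see the Options section):
    Import[dir, "ArrowDataset"]
    Import[dir, "ArrowDataset", "Partitioning" -> "Hive"]  (* include the partition column "city" *)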

Scope  (3)

Import  (3)

Show all elements available in the file:
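A possible input, assuming the directory dir created in the Basic Examples sketch:
    Import[dir, {"ArrowDataset", "Elements"}]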

By default, a Tabular object is returned:
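For instance (dir as in the Basic Examples sketch):
    out = Import[dir, "ArrowDataset"];
    Head[out]  (* Tabular *)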

Import column types:

Import Elements  (14)

"ColumnCount"  (1)

Get the number of columns:
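A sketch, assuming the directory dir from the Basic Examples:
    Import[dir, {"ArrowDataset", "ColumnCount"}]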

"ColumnLabels"  (1)

Read column names:
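For example:
    Import[dir, {"ArrowDataset", "ColumnLabels"}]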

"ColumnTypes"  (1)

Import column types:
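For example:
    Import[dir, {"ArrowDataset", "ColumnTypes"}]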

"Data"  (2)

Get the data from a file:

Import only selected rows:

Import only selected columns:
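One sketch covering the three cells above; the row and column specifications follow the subelement forms listed under Import Elements, and All for "all rows" is an assumption:
    Import[dir, {"ArrowDataset", "Data"}]             (* full two-dimensional array *)
    Import[dir, {"ArrowDataset", "Data", 1 ;; 2}]     (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Data", All, {1}}]   (* first column only *)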

"Dataset"  (2)

Get the data as a Dataset:

Import only selected rows:

Import only selected columns:
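One sketch covering the three cells above, with the same assumptions as the "Data" sketch:
    Import[dir, {"ArrowDataset", "Dataset"}]             (* data as a Dataset *)
    Import[dir, {"ArrowDataset", "Dataset", 1 ;; 2}]     (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Dataset", All, {1}}]   (* first column only *)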

"Dimensions"  (1)

Import data dimensions:
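For example:
    Import[dir, {"ArrowDataset", "Dimensions"}]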

"MetaInformation"  (1)

Import metadata:
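For example:
    Import[dir, {"ArrowDataset", "MetaInformation"}]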

"RowCount"  (1)

Get the number of rows:
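For example:
    Import[dir, {"ArrowDataset", "RowCount"}]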

"Schema"  (1)

Get the TabularSchema object:
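For example:
    Import[dir, {"ArrowDataset", "Schema"}]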

"Summary"  (1)

Get the file summary:
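For example:
    Import[dir, {"ArrowDataset", "Summary"}]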

"Tabular"  (2)

Get the data from a file as a Tabular object:

Import only selected rows:

Import only selected columns:
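One sketch covering the three cells above, using the {"Tabular",rows,cols} subelement form; All for "all rows" is an assumption:
    Import[dir, {"ArrowDataset", "Tabular"}]
    Import[dir, {"ArrowDataset", "Tabular", 1 ;; 2}]        (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Tabular", All, {1, 2}}]   (* first two columns *)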

Import Options  (2)

"Format"  (1)

By default, the underlying format of an ArrowDataset is inferred from the files stored in the input directory:

Use "Format" option to specify underlying format to use:

"Partitioning"  (1)

By default, "Partitioning"None is used. Notice that the column used for partitioning is not imported:

Use "Partitioning" option with correct setting to get all columns:

Export Options  (6)

"Format"  (1)

By default, Export uses "Parquet" format:

Use "ArrowIPC" format:

"MaxPartitions"  (1)

When the number of unique values in the split column exceeds the default value of the "MaxPartitions" option, Export fails:

Increase the allowed number of partitions:
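A sketch for this pair of cells; the 5000-value split column below is illustrative and exceeds the default of 4096 partitions:
    bigTab = Tabular[Table[<|"id" -> IntegerString[i], "x" -> RandomReal[]|>, {i, 5000}]];
    dirBig = FileNameJoin[{$TemporaryDirectory, "exampleManyPartitions"}];
    Export[dirBig, bigTab, "ArrowDataset", "SplitColumns" -> {"id"}]                            (* fails *)
    Export[dirBig, bigTab, "ArrowDataset", "SplitColumns" -> {"id"}, "MaxPartitions" -> 10000]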

"MaxRowsPerFile"  (1)

By default, the number of rows per file is unlimited:

Limit the number of rows per file:
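A sketch reusing tab from the Basic Examples; the directory name is illustrative:
    dirRows = FileNameJoin[{$TemporaryDirectory, "exampleMaxRows"}];
    Export[dirRows, tab, "ArrowDataset", "SplitColumns" -> {"city"}, "MaxRowsPerFile" -> 1];
    FileNames["*", dirRows, Infinity]   (* more, smaller files than the default export *)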

"NameTemplate"  (1)

By default, "part{i}" is used as the name template for ArrowDataset files:

Use a different name template:
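A sketch reusing tab from the Basic Examples; "chunk{i}" is an illustrative template:
    dirNames = FileNameJoin[{$TemporaryDirectory, "exampleNameTemplate"}];
    Export[dirNames, tab, "ArrowDataset", "SplitColumns" -> {"city"}, "NameTemplate" -> "chunk{i}"];
    FileNames["*", dirNames, Infinity]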

"Partitioning"  (1)

By default, Export uses "Hive" partitioning:

Use "Directory" partitioning:

"SplitColumns"  (1)

Export requires the "SplitColumns" option:

Only column keys present in the Tabular object can be used as values of the "SplitColumns" option:
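A sketch for this pair of cells, reusing tab from the Basic Examples; "population" stands in for any key that is not a column of tab:
    dirSplit = FileNameJoin[{$TemporaryDirectory, "exampleSplitColumns"}];
    Export[dirSplit, tab, "ArrowDataset"]                                      (* fails: no "SplitColumns" given *)
    Export[dirSplit, tab, "ArrowDataset", "SplitColumns" -> {"population"}]    (* fails: not a column of tab *)
    Export[dirSplit, tab, "ArrowDataset", "SplitColumns" -> {"city"}]          (* succeeds *)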

Possible Issues  (1)

Export requires the "SplitColumns" option: