ArrowDataset

Background

    • Efficient multi-file, column-oriented data format.
    • Developed by the Apache Software Foundation.

Import & Export

  • Import["dir","ArrowDataset"] imports an ArrowDataset directory as a Tabular object.
  • Import["dir",{"ArrowDataset",elem,}] imports the specified elements.
  • Import["dir",{"ArrowDataset",elem,subelem1,}] imports subelements subelemi, useful for partial data import.
  • Export["dir",expr,"ArrowDataset"] creates an ArrowDataset directory from expr.
  • Supported expressions expr include:
  • {v1,v2,}a single column of data
    {{v11,v12,},{v21,v22,},}lists of rows of data
    arrayan array such as SparseArray, QuantityArray, etc.
    dataseta Dataset or a Tabular object
  • See the following reference pages for full general information:
  • Import, Exportimport from or export to a file
    CloudImport, CloudExportimport from or export to a cloud object
    ImportString, ExportStringimport from or export to a string
    ImportByteArray, ExportByteArrayimport from or export to a byte array

Import Elements

  • General Import elements:
      "Elements"    list of elements and options available in this file
      "Summary"    summary of the file
      "Rules"    list of rules for all available elements
  • Data representation elements:
      "Data"    two-dimensional array
      "Dataset"    table data as a Dataset
      "Tabular"    a Tabular object
  • Additional elements can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed element descriptions.
  • Import by default uses the "Tabular" element.
  • Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:
      n    nth row or column
      -n    counts from the end
      n;;m    from n through m
      n;;m;;s    from n through m with steps of s
      {n1,n2,…}    specific rows or columns ni
  • Data descriptor elements:
      "ColumnLabels"    names of columns
      "ColumnTypes"    association with data type for each column
      "Schema"    TabularSchema object

Options

  • General Import options:
      "Format"    Automatic    underlying format to use
      "Partitioning"    None    partitioning scheme
  • General Export options:
      "Format"    "Parquet"    underlying format to use
      "MaxPartitions"    4096    maximal number of partitions
      "MaxRowsPerFile"    Infinity    maximal number of rows per file
      "NameTemplate"    "part{i}"    file name template
      "Partitioning"    "Hive"    partitioning scheme
      "SplitColumns"    Automatic    columns used for partitioning
  • Import supports the following settings for "Partitioning":
      None    no partitioning
      "Hive"    Hive partitioning
      {col1,col2,…}    directory partitioning with partition keys
      {"Directory",{col1,col2,…}}    directory partitioning with partition keys
  • Export supports the following settings for "Partitioning":
      "Directory"    directory partitioning
      "Hive"    Hive partitioning
  • Additional options can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed options descriptions.

Examples


Basic Examples  (2)

Export Arrow dataset:
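A minimal sketch of such an export cell; the Tabular data, column names, and target directory below are illustrative, not taken from the original page:
    tab = Tabular[{
        <|"city" -> "Boston", "year" -> 2023, "value" -> 1.2|>,
        <|"city" -> "Boston", "year" -> 2024, "value" -> 1.5|>,
        <|"city" -> "Dallas", "year" -> 2023, "value" -> 0.9|>,
        <|"city" -> "Dallas", "year" -> 2024, "value" -> 1.1|>}];
    dir = FileNameJoin[{$TemporaryDirectory, "exampleArrowDataset"}];
    Export[dir, tab, "ArrowDataset", "SplitColumns" -> {"city"}]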

Import Arrow dataset:
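A matching import sketch, reusing dir from the export above; with the default "Partitioning" setting the split column may be omitted (see the Options section):
    Import[dir, "ArrowDataset"]
    Import[dir, "ArrowDataset", "Partitioning" -> "Hive"]  (* include the partition column "city" *)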

Scope  (3)

Import  (3)

Show all elements available in the file:
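A possible input, assuming the directory dir created in the Basic Examples sketch:
    Import[dir, {"ArrowDataset", "Elements"}]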

By default, a Tabular object is returned:
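For instance (dir as in the Basic Examples sketch):
    out = Import[dir, "ArrowDataset"];
    Head[out]  (* Tabular *)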

Import column types:

Import Elements  (14)

"ColumnCount"  (1)

Get the number of columns:
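A sketch, assuming the directory dir from the Basic Examples:
    Import[dir, {"ArrowDataset", "ColumnCount"}]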

"ColumnLabels"  (1)

Read column names:
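For example:
    Import[dir, {"ArrowDataset", "ColumnLabels"}]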

"ColumnTypes"  (1)

Import column types:
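For example:
    Import[dir, {"ArrowDataset", "ColumnTypes"}]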

"Data"  (2)

Get the data from a file:

Import only selected rows:

Import only selected columns:
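One sketch covering the three cells above; the row and column specifications follow the subelement forms listed under Import Elements, and All for "all rows" is an assumption:
    Import[dir, {"ArrowDataset", "Data"}]             (* full two-dimensional array *)
    Import[dir, {"ArrowDataset", "Data", 1 ;; 2}]     (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Data", All, {1}}]   (* first column only *)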

"Dataset"  (2)

Get the data as a Dataset:

Import only selected rows:

Import only selected columns:
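One sketch covering the three cells above, with the same assumptions as the "Data" sketch:
    Import[dir, {"ArrowDataset", "Dataset"}]             (* data as a Dataset *)
    Import[dir, {"ArrowDataset", "Dataset", 1 ;; 2}]     (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Dataset", All, {1}}]   (* first column only *)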

"Dimensions"  (1)

Import data dimensions:
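For example:
    Import[dir, {"ArrowDataset", "Dimensions"}]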

"MetaInformation"  (1)

Import metadata:
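For example:
    Import[dir, {"ArrowDataset", "MetaInformation"}]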

"RowCount"  (1)

Get the number of rows:
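For example:
    Import[dir, {"ArrowDataset", "RowCount"}]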

"Schema"  (1)

Get the TabularSchema object:
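For example:
    Import[dir, {"ArrowDataset", "Schema"}]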

"Summary"  (1)

Get the file summary:
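For example:
    Import[dir, {"ArrowDataset", "Summary"}]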

"Tabular"  (2)

Get the data from a file as a Tabular object:

Import only selected rows:

Import only selected columns:
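One sketch covering the three cells above, using the {"Tabular",rows,cols} subelement form; All for "all rows" is an assumption:
    Import[dir, {"ArrowDataset", "Tabular"}]
    Import[dir, {"ArrowDataset", "Tabular", 1 ;; 2}]        (* rows 1 through 2 *)
    Import[dir, {"ArrowDataset", "Tabular", All, {1, 2}}]   (* first two columns *)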

Import Options  (2)

"Format"  (1)

By default, the underlying format of an ArrowDataset is inferred from the files stored in the input directory:

Use "Format" option to specify underlying format to use:

"Partitioning"  (1)

By default, "Partitioning"None is used. Notice that the column used for partitioning is not imported:

Use "Partitioning" option with correct setting to get all columns:

Export Options  (6)

"Format"  (1)

By default, Export uses "Parquet" format:

Use "ArrowIPC" format:

"MaxPartitions"  (1)

When the number of unique values in the split column exceeds the default value of the "MaxPartitions" option, Export fails:

Increase the allowed number of partitions:
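A sketch for this pair of cells; the 5000-value split column below is illustrative and exceeds the default of 4096 partitions:
    bigTab = Tabular[Table[<|"id" -> IntegerString[i], "x" -> RandomReal[]|>, {i, 5000}]];
    dirBig = FileNameJoin[{$TemporaryDirectory, "exampleManyPartitions"}];
    Export[dirBig, bigTab, "ArrowDataset", "SplitColumns" -> {"id"}]                            (* fails *)
    Export[dirBig, bigTab, "ArrowDataset", "SplitColumns" -> {"id"}, "MaxPartitions" -> 10000]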

"MaxRowsPerFile"  (1)

By default, the number of rows per file is unlimited:

Limit the number of rows per file:
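A sketch reusing tab from the Basic Examples; the directory name is illustrative:
    dirRows = FileNameJoin[{$TemporaryDirectory, "exampleMaxRows"}];
    Export[dirRows, tab, "ArrowDataset", "SplitColumns" -> {"city"}, "MaxRowsPerFile" -> 1];
    FileNames["*", dirRows, Infinity]   (* more, smaller files than the default export *)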

"NameTemplate"  (1)

By default, "part{i}" is used as the name template for ArrowDataset files:

Use a different name template:
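A sketch reusing tab from the Basic Examples; "chunk{i}" is an illustrative template:
    dirNames = FileNameJoin[{$TemporaryDirectory, "exampleNameTemplate"}];
    Export[dirNames, tab, "ArrowDataset", "SplitColumns" -> {"city"}, "NameTemplate" -> "chunk{i}"];
    FileNames["*", dirNames, Infinity]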

"Partitioning"  (1)

By default, Export uses "Hive" partitioning:

Use "Directory" partitioning:

"SplitColumns"  (1)

Export requires the "SplitColumns" option:

Only column keys present in the Tabular object can be used as values of the "SplitColumns" option:
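A sketch for this pair of cells, reusing tab from the Basic Examples; "population" stands in for any key that is not a column of tab:
    dirSplit = FileNameJoin[{$TemporaryDirectory, "exampleSplitColumns"}];
    Export[dirSplit, tab, "ArrowDataset"]                                      (* fails: no "SplitColumns" given *)
    Export[dirSplit, tab, "ArrowDataset", "SplitColumns" -> {"population"}]    (* fails: not a column of tab *)
    Export[dirSplit, tab, "ArrowDataset", "SplitColumns" -> {"city"}]          (* succeeds *)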

Possible Issues  (1)

Export requires the "SplitColumns" option: