ArrowIPC (.arrow, .arrows, .feather, .ftr)
New in 14.2
Background
- Registered MIME types: application/vnd.apache.arrow.file, application/vnd.apache.arrow.stream
- Arrow IPC columnar data format.
- Used for efficient serialization of large columnar datasets.
- The primitive unit of serialized data in the columnar format is called a record batch.
- The Arrow IPC file format is used for serializing a fixed number of record batches and supports random access.
- The Arrow IPC streaming format is used for sending an arbitrary-length sequence of record batches.
- Feather version 2 is the Arrow IPC file format as stored on disk.
- Feather version 1 is a legacy file format distinct from Arrow IPC files.
- Developed by the Apache Software Foundation.
- Binary file format.
- Supports multiple compression methods.
Import & Export
- Import["file.arrow"] imports an ArrowIPC file as a Tabular object.
- Import["file.arrow",elem] imports the specified elements.
- Import["file.arrow",{elem,subelem1,…}] imports the subelements subelem1,…, which is useful for partial data import.
- The import format can be specified with Import["file","ArrowIPC"] or Import["file",{"ArrowIPC",elem,…}].
- Export["file.arrow",expr] exports a Tabular object to ArrowIPC file format.
- Supported expressions expr include:
    {v1,v2,…}                       a single column of data
    {{v11,v12,…},{v21,v22,…},…}     lists of rows of data
    array                           an array such as SparseArray, QuantityArray, etc.
    dataset                         a Dataset or a Tabular object
- See the following reference pages for full general information:
    Import, Export                      import from or export to a file
    CloudImport, CloudExport            import from or export to a cloud object
    ImportString, ExportString          import from or export to a string
    ImportByteArray, ExportByteArray    import from or export to a byte array
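As a minimal round-trip sketch of the forms above, the following exports a small list of rows and imports it back. The file name example.arrow, its location in $TemporaryDirectory, and the sample values are assumptions made only for illustration.

```wolfram
(* Hypothetical path and sample data, chosen only for illustration *)
file = FileNameJoin[{$TemporaryDirectory, "example.arrow"}];
rows = {{1, 2.5}, {2, 3.5}, {3, 4.5}};

(* Export lists of rows, naming the format explicitly *)
Export[file, rows, "ArrowIPC"];

(* Import with an explicit format; the default element is "Tabular" *)
tab = Import[file, "ArrowIPC"];
Head[tab]
```

Naming the format explicitly is optional here, since the .arrow extension is already associated with ArrowIPC.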
Import Elements
- General Import elements:
    "Elements"    list of elements and options available in this file
    "Summary"     summary of the file
    "Rules"       list of rules for all available elements
- Data representation elements:
    "Data"       two-dimensional array
    "Dataset"    table data as a Dataset
    "Tabular"    a Tabular object
- Import by default uses the "Tabular" element.
- Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:
    n            nth row or column
    -n           counts from the end
    n;;m         from n through m
    n;;m;;s      from n through m with steps of s
    {n1,n2,…}    specific rows or columns ni
- Data descriptor elements:
    "ColumnLabels"    names of columns
    "ColumnTypes"     association with data type for each column
    "Schema"          TabularSchema object
- Metadata elements:
    "ColumnCount"        number of columns stored in file
    "Dimensions"         data dimensions
    "RowCount"           number of rows stored in file
    "MetaInformation"    metadata
Options
- General Import options:
    IncludeMetaInformation    All     metadata types to import
    "UseMemoryMappedFile"     True    whether to use memory-mapped reader
- General Export options:
    "Compression"       None         compression method
    CompressionLevel    Automatic    compression level
    "Schema"            Automatic    schema used to construct Tabular object
    "Streamable"        False        if true, then Arrow IPC streaming format is used
- The following settings for "Compression" are supported:
    None          no compression
    "LZ4Frame"    LZ4 Frame compression
    "ZSTD"        ZSTD compression
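The "Compression" setting can be sketched by exporting the same data twice and comparing the results. The file names plain.arrow and zstd.arrow and the generated data are hypothetical, and the size difference you observe depends entirely on the data.

```wolfram
(* Hypothetical sample data: 10000 rows with a repetitive second column *)
data = Table[{i, Mod[i, 7]}, {i, 10000}];

(* Export returns the name of the file it wrote *)
plain = Export[FileNameJoin[{$TemporaryDirectory, "plain.arrow"}],
   data, "ArrowIPC"];
zstd = Export[FileNameJoin[{$TemporaryDirectory, "zstd.arrow"}],
   data, "ArrowIPC", "Compression" -> "ZSTD"];

(* Compare the sizes on disk, in bytes *)
sizes = FileByteCount /@ {plain, zstd}

(* Both files should report the same data dimensions *)
dims = Import[#, {"ArrowIPC", "Dimensions"}] & /@ {plain, zstd}
```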
Examples
Basic Examples (3)
Summary of the most common use cases
Scope (3)
Import (3)
Show all elements available in the file:
By default, a Tabular object is returned:
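A minimal sketch of the two steps above, assuming a small file written first so that the example is self-contained. The file name basic.arrow and its contents are hypothetical.

```wolfram
(* Create a hypothetical sample file with 3 rows and 2 columns *)
file = FileNameJoin[{$TemporaryDirectory, "basic.arrow"}];
Export[file, {{1, 10.5}, {2, 20.5}, {3, 30.5}}, "ArrowIPC"];

(* List the elements available in the file *)
elements = Import[file, "Elements"]

(* With no element given, a Tabular object is returned *)
tab = Import[file];
Head[tab]

(* Metadata elements describe the stored table *)
rowCount = Import[file, "RowCount"];
colCount = Import[file, "ColumnCount"];
```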
Import Elements (14)
"Dataset" (2)
"Schema" (1)
Get the TabularSchema object:
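A possible form of this call, assuming a hypothetical file schema.arrow created just for the example:

```wolfram
(* Create a hypothetical sample file *)
file = FileNameJoin[{$TemporaryDirectory, "schema.arrow"}];
Export[file, {{1, 2.5}, {2, 3.5}}, "ArrowIPC"];

(* The "Schema" element gives the TabularSchema for the stored columns *)
schema = Import[file, "Schema"];
Head[schema]

(* Column names can be read separately with "ColumnLabels" *)
labels = Import[file, "ColumnLabels"]
```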
"Tabular" (2)
Get the data from a file as a Tabular object:
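The following sketch shows the full import alongside a partial import using the {"Tabular",rows,cols} form. The file partial.arrow and its 10 rows by 3 columns of generated values are assumptions for illustration only.

```wolfram
(* Create a hypothetical sample file with 10 rows and 3 columns *)
file = FileNameJoin[{$TemporaryDirectory, "partial.arrow"}];
Export[file, Table[{i, i^2, N[Sqrt[i]]}, {i, 10}], "ArrowIPC"];

(* Whole file as a Tabular object *)
full = Import[file, "Tabular"];

(* Rows 2 through 4 of columns 1 and 3 *)
part = Import[file, {"Tabular", 2 ;; 4, {1, 3}}];

(* Every second row, restricted to the last column *)
lastCol = Import[file, {"Tabular", 1 ;; 10 ;; 2, -1}];
```

Restricting rows and columns at import time avoids loading parts of a large file that are not needed.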
Import Options (3)
IncludeMetaInformation (1)
By default, all metadata stored in a file is imported and embedded in the Tabular object:
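A hedged sketch of switching the metadata import off. It assumes that None is an accepted setting for IncludeMetaInformation with this format, as it is for the option in general; the file meta.arrow is hypothetical.

```wolfram
(* Create a hypothetical sample file *)
file = FileNameJoin[{$TemporaryDirectory, "meta.arrow"}];
Export[file, {{1, 2.5}, {2, 3.5}}, "ArrowIPC"];

(* Default behavior: IncludeMetaInformation -> All *)
withMeta = Import[file];

(* Assumption: None skips importing the stored metadata *)
noMeta = Import[file, IncludeMetaInformation -> None];

(* The stored metadata can also be requested directly as an element *)
meta = Import[file, "MetaInformation"]
```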
"Schema" (1)
Export Options (6)
CompressionLevel (2)
By default, CompressionLevel is set to Automatic, which corresponds to a different default level for each compression method.
"Streamable" (2)
By default, Export uses the Arrow IPC file format:
Use the "Streamable" option to generate the Arrow IPC streaming format:
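Both variants can be sketched side by side. The file names batches.arrow and batches.arrows and the sample rows are hypothetical; the .arrows extension is used for the stream only by convention.

```wolfram
(* Hypothetical sample data *)
data = {{1, 2.5}, {2, 3.5}, {3, 4.5}};

(* Default: Arrow IPC file format *)
asFile = Export[FileNameJoin[{$TemporaryDirectory, "batches.arrow"}],
   data, "ArrowIPC"];

(* "Streamable" -> True writes the Arrow IPC streaming format instead *)
asStream = Export[FileNameJoin[{$TemporaryDirectory, "batches.arrows"}],
   data, "ArrowIPC", "Streamable" -> True];

(* Read the stream back with an explicit format *)
fromStream = Import[asStream, "ArrowIPC"];
Head[fromStream]
```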