
Types and schemas

Minarrow separates broad type categories from the complete Arrow logical type.

Every Array exposes:

  • .dtype — the Minarrow type family
  • .dtype.group — the broader TypeClass
  • .bit_width — the physical width of the stored values or indices
  • .arrow_type — the complete Arrow logical type

import minarrow as ma

array = ma.Array([1, 2, 3])

array.dtype        # DType.Integer
array.dtype.group  # TypeClass.Numeric
array.bit_width    # 64
array.arrow_type   # Int64

Type taxonomy

| .dtype | Values |
| --- | --- |
| Integer | Signed and unsigned integers |
| Float | 32-bit and 64-bit floating-point values |
| String | UTF-8 strings with 32-bit or 64-bit offsets |
| Categorical | Dictionary-encoded strings |
| Datetime | Dates, times, timestamps and durations |
| Boolean | Bit-packed Boolean values |
| Null | All-null arrays |

The corresponding TypeClass groups are:

| TypeClass | Included dtypes |
| --- | --- |
| Numeric | Integer, Float |
| Text | String, Categorical |
| Temporal | Datetime |
| Boolean | Boolean |
| Null | Null |

TypeClass is useful when code needs to handle related types together:

if array.dtype.group is ma.TypeClass.Numeric:
    process_numeric(array)

Use .arrow_type when the exact logical type matters, such as distinguishing Int32 from Int64 or inspecting timestamp units and time zones.
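The relationship between the two taxonomy tables can be sketched with plain Python enums. This is an illustration only: the `DType` and `TypeClass` classes below are stand-ins written for this example, not the Minarrow classes, and `summarize` is a hypothetical helper.

```python
from enum import Enum


class TypeClass(Enum):
    """Stand-in for ma.TypeClass (illustration only)."""
    Numeric = "Numeric"
    Text = "Text"
    Temporal = "Temporal"
    Boolean = "Boolean"
    Null = "Null"


class DType(Enum):
    """Stand-in for ma.DType (illustration only)."""
    Integer = "Integer"
    Float = "Float"
    String = "String"
    Categorical = "Categorical"
    Datetime = "Datetime"
    Boolean = "Boolean"
    Null = "Null"

    @property
    def group(self) -> TypeClass:
        # Look up the broader class from the mapping defined below.
        return _GROUPS[self]


# Mirrors the "TypeClass / Included dtypes" table.
_GROUPS = {
    DType.Integer: TypeClass.Numeric,
    DType.Float: TypeClass.Numeric,
    DType.String: TypeClass.Text,
    DType.Categorical: TypeClass.Text,
    DType.Datetime: TypeClass.Temporal,
    DType.Boolean: TypeClass.Boolean,
    DType.Null: TypeClass.Null,
}


def summarize(dtype: DType, values: list):
    """Hypothetical helper: one code path per TypeClass, not per dtype."""
    if dtype.group is TypeClass.Numeric:
        return sum(values)
    if dtype.group is TypeClass.Text:
        return len(set(values))
    return None


total = summarize(DType.Float, [1.5, 2.5])
distinct = summarize(DType.Categorical, ["a", "b", "a"])
```

Branching on the group keeps the numeric path shared between Integer and Float, and the text path shared between String and Categorical.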

Supported types

Minarrow exposes one Python Array class. Its .arrow_type identifies the logical Arrow type and determines the physical representation.

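Some types are available only when the matching Rust feature flag was enabled at build time. Superscripts in the table refer to the feature notes that follow it.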

| .dtype | .arrow_type | Physical representation | PyArrow equivalent |
| --- | --- | --- | --- |
| Integer | Int8() ¹ | i8, 8-bit | pa.int8() |
| Integer | Int16() ¹ | i16, 16-bit | pa.int16() |
| Integer | Int32() | i32, 32-bit | pa.int32() |
| Integer | Int64() | i64, 64-bit | pa.int64() |
| Integer | UInt8() ¹ | u8, 8-bit | pa.uint8() |
| Integer | UInt16() ¹ | u16, 16-bit | pa.uint16() |
| Integer | UInt32() | u32, 32-bit | pa.uint32() |
| Integer | UInt64() | u64, 64-bit | pa.uint64() |
| Float | Float32() | f32, 32-bit | pa.float32() |
| Float | Float64() | f64, 64-bit | pa.float64() |
| Boolean | Boolean() | Bit-packed | pa.bool_() |
| String | String() | 32-bit offsets | pa.string() |
| String | LargeString() ² | 64-bit offsets | pa.large_string() |
| Categorical | Dictionary(index=CategoricalIndexType.UInt8) ³ | u8 indices | pa.dictionary(pa.uint8(), pa.utf8()) |
| Categorical | Dictionary(index=CategoricalIndexType.UInt16) | u16 indices | pa.dictionary(pa.uint16(), pa.utf8()) |
| Categorical | Dictionary(index=CategoricalIndexType.UInt32) | u32 indices | pa.dictionary(pa.uint32(), pa.utf8()) |
| Categorical | Dictionary(index=CategoricalIndexType.UInt64) | u64 indices | pa.dictionary(pa.uint64(), pa.utf8()) |
| Datetime | Date32() | i32 days | pa.date32() |
| Datetime | Date64() | i64 milliseconds | pa.date64() |
| Datetime | Time32(unit) | i32 | pa.time32(unit) |
| Datetime | Time64(unit) | i64 | pa.time64(unit) |
| Datetime | Timestamp(unit, tz) | i64 | pa.timestamp(unit, tz) |
| Datetime | Duration64(unit) | i64 | pa.duration(unit) |
| Null | Null() | No value buffer | pa.null() |

If you are compiling from source, the following Rust feature flags are required:

  • ¹ Requires extended_numeric_types.
  • ² Requires large_string, enabled by default.
  • ³ Requires default_categorical_8.
  • ⁴ Requires extended_categorical.
  • ⁵ Requires datetime, enabled by default.

The Python package published on PyPI is built with all of these features enabled.

The Arrow Interval type is represented in the type system but does not have a dedicated Minarrow array implementation.
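The "Physical representation" column can be made concrete with a small standard-library sketch. It follows the general Arrow columnar layout (offset-indexed strings, least-significant-bit-first bit packing, dictionary indices) and is not Minarrow's implementation. The function names are hypothetical, and null handling is left out.

```python
import struct


def string_buffers(values, offset_format="i"):
    """Build offsets and data buffers for UTF-8 strings.

    offset_format "i" gives 32-bit offsets (String),
    "q" gives 64-bit offsets (LargeString).
    """
    offsets = [0]
    data = bytearray()
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of this string, start of the next
    packed = struct.pack(f"<{len(offsets)}{offset_format}", *offsets)
    return packed, bytes(data)


def pack_bits(flags):
    """Pack booleans eight per byte, least-significant bit first."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def dictionary_encode(values):
    """Store each distinct string once and refer to it by integer index."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


offsets32, data = string_buffers(["a", "bc", ""])
offsets64, _ = string_buffers(["a", "bc", ""], offset_format="q")
bits = pack_bits([True, False, True, True])
indices, dictionary = dictionary_encode(["x", "y", "x"])
```

Three strings need four offsets, so the 32-bit offsets buffer here occupies 16 bytes and the 64-bit one 32 bytes. The index width of a Categorical array plays the same role for the `indices` list: narrower indices cost less per row but limit the number of distinct values.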

Container types

Minarrow provides four columnar containers:

| Minarrow type | Arrow equivalent | Description |
| --- | --- | --- |
| Array | pa.Array | One contiguous logical column |
| Table | pa.RecordBatch | One batch of equal-length columns |
| ChunkedArray | pa.ChunkedArray | One logical column stored as multiple arrays |
| ChunkedTable | pa.Table | One logical table stored as multiple batches |

A Table represents one record batch rather than an arbitrarily chunked dataset. Use ChunkedTable when data arrives or is stored as multiple batches.

Why use chunked containers

Chunked containers preserve existing allocation boundaries.

This avoids combining multiple arrays or record batches into one contiguous allocation when consolidation is unnecessary or expensive. The trade-off is additional bookkeeping and potentially less contiguous memory access.

Chunking is useful when:

  • Data arrives incrementally from a stream or transport
  • Batches correspond to processing windows
  • Files are read in row groups or record batches
  • A dataset is larger than available memory
  • Another library already uses a chunked internal representation

For example, a dataframe library may expose one logical column while retaining several Arrow arrays underneath it.

Chunking does not itself provide out-of-core execution. It allows consumers that support incremental or out-of-core processing to operate on one portion of the dataset at a time.
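The trade-off described above can be sketched without Minarrow. The `ChunkedColumn` class below is a hypothetical stand-in, not `ma.ChunkedArray`: it keeps references to the original chunks, pays a small bookkeeping cost for logical indexing, and lets a consumer process one chunk at a time.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkedColumn:
    """Hypothetical chunked column that never copies its chunks."""

    def __init__(self, chunks):
        # Keep the caller's chunk objects as-is; no consolidation happens.
        self.chunks = list(chunks)
        # Bookkeeping: cumulative end position of each chunk.
        self._ends = list(accumulate(len(chunk) for chunk in self.chunks))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first chunk whose end position lies beyond the index.
        chunk_no = bisect_right(self._ends, index)
        start = self._ends[chunk_no - 1] if chunk_no else 0
        return self.chunks[chunk_no][index - start]


def chunk_sum(column):
    """Process one chunk at a time instead of one contiguous buffer."""
    total = 0
    for chunk in column.chunks:
        total += sum(chunk)
    return total


first, second = [1, 2, 3], [4, 5]
column = ChunkedColumn([first, second])
fourth_value = column[3]
total = chunk_sum(column)
```

Logical index 3 resolves to position 0 of the second chunk, and the column still holds the very same list objects it was given.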

ArrowType

ArrowType mirrors the minarrow::ArrowType enum from the Rust core.

Because many variants carry values such as a time unit or an index type, variants are constructed by calling them, including those that take no arguments:

ma.ArrowType.Int64()
ma.ArrowType.Float64()
ma.ArrowType.String()

ma.ArrowType.Timestamp(
    ma.TimeUnit.Microseconds,
    "UTC",
)

ma.ArrowType.Time32(
    ma.TimeUnit.Milliseconds,
)

ma.ArrowType.Dictionary(
    index=ma.CategoricalIndexType.UInt32,
)

The available variants follow the features enabled when the Python extension was built.
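To see why value-carrying variants are naturally expressed as calls, here is a sketch that models two variants with frozen dataclasses. These classes are stand-ins written for this example, not the Minarrow bindings; in particular, the default of `None` for `tz` is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TimeUnit(Enum):
    """Stand-in with only the two units used on this page."""
    Milliseconds = "ms"
    Microseconds = "us"


@dataclass(frozen=True)
class Int64:
    """Variant without a payload; still created with a call."""


@dataclass(frozen=True)
class Timestamp:
    """Variant whose payload is part of the logical type."""
    unit: TimeUnit
    tz: Optional[str] = None  # assumption: time zone may be omitted


utc_micros = Timestamp(TimeUnit.Microseconds, "UTC")
same_type = utc_micros == Timestamp(TimeUnit.Microseconds, "UTC")
other_unit = utc_micros == Timestamp(TimeUnit.Milliseconds, "UTC")
plain_equal = Int64() == Int64()
```

Two timestamps are the same logical type only when both the unit and the time zone match, which is the distinction `.arrow_type` exposes and `.dtype` does not.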

Fields

A Field describes one column. It contains:

  • A name
  • An Arrow logical type
  • A nullability flag
  • Optional metadata

amount = ma.Field(
    "amount",
    ma.ArrowType.Float64(),
    nullable=False,
    metadata={"unit": "USD"},
)

amount.name        # "amount"
amount.arrow_type  # Float64
amount.dtype       # DType.Float
amount.nullable    # False
amount.metadata    # {"unit": "USD"}

Field metadata is stored as string key-value pairs and is preserved through compatible Arrow interchange paths.

Schemas

A Schema is an ordered collection of fields with optional schema-level metadata.

schema = ma.Schema(
    [
        ma.Field(
            "id",
            ma.ArrowType.Int64(),
            nullable=False,
        ),
        amount,
        ma.Field(
            "symbol",
            ma.ArrowType.String(),
        ),
    ],
    metadata={"dataset": "prices"},
)

schema.names        # ["id", "amount", "symbol"]
len(schema)         # 3
schema.metadata     # {"dataset": "prices"}

schema["amount"]    # Lookup by name
schema[0]           # Lookup by position

Every Table exposes its schema:

table = ma.Table(
    {
        "id": [1, 2],
        "price": [1.5, 2.5],
    }
)

table.schema
# Schema([id: Int64, price: Float64])

For a ChunkedTable, the schema applies to every batch. A user-provided schema may also carry metadata that is not present on the individual batches.
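The rule that one schema applies to every batch can be expressed as a check. The sketch below uses plain tuples rather than Minarrow objects; `Column`, `batch_schema` and `validate_batches` are hypothetical names, and type names are compared as strings for simplicity.

```python
from typing import NamedTuple, Sequence


class Column(NamedTuple):
    """Hypothetical column record: name, logical type name, values."""
    name: str
    type_name: str
    values: Sequence


def batch_schema(batch):
    """Ordered (name, type) pairs describing one batch."""
    return [(column.name, column.type_name) for column in batch]


def validate_batches(schema, batches):
    """Raise ValueError if a batch deviates from the shared schema."""
    for number, batch in enumerate(batches):
        if batch_schema(batch) != list(schema):
            raise ValueError(f"batch {number} does not match the schema")
        # Columns within one batch must all have the same length.
        if len({len(column.values) for column in batch}) > 1:
            raise ValueError(f"batch {number} has unequal column lengths")


schema = [("id", "Int64"), ("price", "Float64")]
first = [Column("id", "Int64", [1, 2]), Column("price", "Float64", [1.5, 2.5])]
second = [Column("id", "Int64", [3]), Column("price", "Float64", [3.5])]

validate_batches(schema, [first, second])  # passes silently
```

Batches may differ in row count, as `first` and `second` do, but not in column names, order or types.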