🧭 PyArrow Guide

What is PyArrow?

PyArrow is the Python interface to Apache Arrow, a language-independent columnar memory format. It allows fast data interchange between systems and efficient in-memory data processing.

Installation

pip install pyarrow

Core Modules

Create Arrays

import pyarrow as pa

arr = pa.array([1, 2, 3, None, 5])
print(arr)
# prints the values 1, 2, 3, null, 5 (one per line)

Arrow arrays support missing values (nulls) and strong typing.
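
Arrow infers a type from the input, but you can also request one explicitly. A small sketch, reusing the arr defined above:

typed = pa.array([1, 2, 3], type=pa.int32())  # ask for int32 instead of the inferred int64
print(typed.type)      # int32
print(arr.null_count)  # 1 (the None above is stored as a null)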

Tables

table = pa.table({
    "name": ["Alice", "Bob", "Charlie"],
    "age": pa.array([25, 30, 35])
})

print(table.schema)
print(table.to_pandas())
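
Tables are collections of named columns. A quick sketch of basic access, using the table defined above:

print(table.num_rows)          # 3
print(table.column("age"))     # a single column as a ChunkedArray
print(table.select(["name"]))  # a new table containing only the "name" column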

Write Parquet Files

import pyarrow.parquet as pq

pq.write_table(table, "people.parquet")
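
write_table also accepts a compression codec; "snappy" is the default. A minimal sketch (the file name is just an example):

pq.write_table(table, "people_zstd.parquet", compression="zstd")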

Read Parquet Files

table2 = pq.read_table("people.parquet")
print(table2.to_pandas())
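
Because Parquet is columnar, you can load only the columns you need:

names_only = pq.read_table("people.parquet", columns=["name"])
print(names_only.column_names)  # ['name']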

CSV I/O

Arrow ships a multithreaded CSV reader that is typically much faster than the default pandas parser:

import pyarrow.csv as pv

csv_table = pv.read_csv("people.csv")
print(csv_table)
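
read_csv also takes ParseOptions and ConvertOptions for files that deviate from the defaults. A sketch, assuming a hypothetical semicolon-delimited file:

opts = pv.ParseOptions(delimiter=";")
csv_table2 = pv.read_csv("people_semicolon.csv", parse_options=opts)  # hypothetical file name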

Conversion to Pandas

df = table.to_pandas()
table_again = pa.Table.from_pandas(df)
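
By default from_pandas keeps the DataFrame index (as a column or as schema metadata); pass preserve_index=False to drop it:

table_no_index = pa.Table.from_pandas(df, preserve_index=False)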

Compute Module (Fast Operations)

import pyarrow.compute as pc

result = pc.add(pa.array([1, 2]), pa.array([3, 4]))
print(result)  # prints the element-wise sums 4 and 6

This is fast and vectorized at the Arrow level.
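
Compute kernels combine well with tables, for example to build a boolean mask and filter rows. A sketch using the table defined earlier:

mask = pc.greater(table["age"], 28)  # boolean mask: [false, true, true]
adults = table.filter(mask)          # keeps rows where age > 28 (Bob and Charlie)
print(adults.to_pandas())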

Filesystem Access

Interact with files on the local filesystem, S3, or HDFS:

from pyarrow import fs

local = fs.LocalFileSystem()
info = local.get_file_info("people.parquet")
print(info)
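
Remote filesystems use the same interface. A sketch for S3; the region and bucket below are placeholders:

s3 = fs.S3FileSystem(region="us-east-1")             # placeholder region
info = s3.get_file_info("my-bucket/people.parquet")  # placeholder bucket/path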

Use with Pandas and Dask

PyArrow serves as the Parquet engine for pandas and Dask, letting them read and write Parquet files efficiently.

import pandas as pd

df = pd.read_parquet("people.parquet", engine="pyarrow")
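
With pandas 2.0 and later you can also keep the data in Arrow-backed dtypes after reading:

df_arrow = pd.read_parquet("people.parquet", engine="pyarrow", dtype_backend="pyarrow")
print(df_arrow.dtypes)  # Arrow-backed dtypes such as string[pyarrow]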
