PyArrow is the Python interface to Apache Arrow, a language-independent columnar memory format. It allows fast data interchange between systems and efficient in-memory data processing.
pip install pyarrow
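To confirm the installation, a quick sanity check is to print the installed version (any reasonably recent release works for the examples below):

import pyarrow as pa
print(pa.__version__)  # e.g. "15.0.0"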
- pyarrow.array: Efficient array representations
- pyarrow.table: Columnar data tables
- pyarrow.parquet: Parquet read/write interface
- pyarrow.csv: CSV reader/writer
- pyarrow.fs: Filesystem abstraction (S3, HDFS, local)
- pyarrow.compute: Fast Arrow-native vectorized functions

import pyarrow as pa
arr = pa.array([1, 2, 3, None, 5])
print(arr)
# Output: [1, 2, 3, null, 5]
Arrow arrays support missing values (nulls) and strong typing.
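For example, you can pin an explicit type when building an array and then inspect its nulls; a small sketch using the standard array APIs:

import pyarrow as pa

# Explicitly typed array; values are validated against the declared type
prices = pa.array([9.99, None, 4.50], type=pa.float64())
print(prices.type)        # double
print(prices.null_count)  # 1
print(prices.is_valid())  # boolean array: [true, false, true]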
table = pa.table({
    "name": ["Alice", "Bob", "Charlie"],
    "age": pa.array([25, 30, 35])
})
print(table.schema)
print(table.to_pandas())
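You can also declare the schema up front and inspect individual columns; a brief sketch assuming the same name/age data:

import pyarrow as pa

schema = pa.schema([("name", pa.string()), ("age", pa.int64())])
table = pa.table({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}, schema=schema)
print(table.num_rows)       # 3
print(table.column("age"))  # ChunkedArray of int64 values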
import pyarrow.parquet as pq
pq.write_table(table, "people.parquet")
table2 = pq.read_table("people.parquet")
print(table2.to_pandas())
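Parquet writes can be compressed, and reads can pull only the columns you need; a short sketch (the compression codec here is an illustrative choice):

import pyarrow.parquet as pq

# Write with an explicit compression codec (snappy is the default)
pq.write_table(table, "people.parquet", compression="snappy")

# Read back only the "name" column instead of the whole file
names_only = pq.read_table("people.parquet", columns=["name"])
print(names_only.to_pandas())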
Using Arrow to read CSVs more efficiently:
import pyarrow.csv as pv
csv_table = pv.read_csv("people.csv")
print(csv_table)
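The CSV reader also accepts options for parsing and type conversion; a minimal sketch assuming people.csv has name and age columns:

import pyarrow as pa
import pyarrow.csv as pv

# Force "age" to a specific type instead of relying on type inference
convert_opts = pv.ConvertOptions(column_types={"age": pa.int32()})
csv_table = pv.read_csv("people.csv", convert_options=convert_opts)

# Arrow can write CSV files as well
pv.write_csv(csv_table, "people_copy.csv")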
Convert between Arrow tables and pandas DataFrames in either direction:

df = table.to_pandas()                  # Arrow -> pandas
table_again = pa.Table.from_pandas(df)  # pandas -> Arrow
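If the DataFrame index is not meaningful, it can be dropped during conversion; a small sketch reusing df from above:

import pyarrow as pa

# Skip the pandas index so it does not become an extra column
table_again = pa.Table.from_pandas(df, preserve_index=False)
print(table_again.schema)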
import pyarrow.compute as pc
result = pc.add(pa.array([1, 2]), pa.array([3, 4]))
print(result) # Output: [4, 6]
This is fast and vectorized at the Arrow level.
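The compute module also covers aggregations and comparisons, which combine nicely with table filtering; a short sketch using the people table from above:

import pyarrow as pa
import pyarrow.compute as pc

print(pc.sum(pa.array([1, 2, 3])))  # 6
print(pc.mean(table["age"]))        # 30.0

# Build a boolean mask and filter rows without leaving Arrow
mask = pc.greater(table["age"], 25)
print(table.filter(mask).to_pandas())  # rows for Bob and Charlie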
Interact with files on local, S3, or HDFS:
from pyarrow import fs
local = fs.LocalFileSystem()
info = local.get_file_info("people.parquet")
print(info)
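The same interface works against remote storage; a hedged sketch for S3 (the bucket name and region are placeholders, and credentials are assumed to come from the usual AWS sources):

from pyarrow import fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem(region="us-east-1")
remote = pq.read_table("my-bucket/people.parquet", filesystem=s3)
print(remote.num_rows)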
PyArrow can be used as the backend for pandas or dask to read/write Parquet efficiently.
import pandas as pd
df = pd.read_parquet("people.parquet", engine="pyarrow")
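With pandas 2.0 or newer (an assumption about your pandas version), the data can also stay Arrow-backed inside pandas:

import pandas as pd

# Arrow-backed dtypes avoid converting columns to NumPy on load (pandas >= 2.0)
df = pd.read_parquet("people.parquet", engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)  # e.g. string[pyarrow], int64[pyarrow]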
- Use pyarrow for large, structured data pipelines.
- Use pyarrow.compute for batch processing instead of Python loops, as illustrated below.
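As a quick illustration of the second point, here is the same transformation written as a vectorized compute call versus a Python loop (the array size is illustrative):

import pyarrow as pa
import pyarrow.compute as pc

values = pa.array(range(1_000_000))

# Vectorized: runs in native code directly over the Arrow buffers
doubled = pc.multiply(values, 2)

# Python loop: converts every element to a Python object first (much slower)
doubled_slow = pa.array([v * 2 for v in values.to_pylist()])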