Parquet is a columnar storage file format. It is highly efficient for storing large datasets because it compresses well, preserves native data types, and lets you read only the columns you need. In Python, you can work with Parquet files using pandas and pyarrow. Compared to CSV or JSON, Parquet files are smaller and significantly faster to read and write.
```bash
pip install pandas pyarrow
```
You can also use fastparquet as an alternative to pyarrow.
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Write the DataFrame to a Parquet file
df.to_parquet("people.parquet", engine="pyarrow")

# Read it back
df = pd.read_parquet("people.parquet", engine="pyarrow")
print(df)
```
Suppose you only want to read the `name` column:
```python
names = pd.read_parquet("people.parquet", columns=["name"], engine="pyarrow")
```
This is faster and uses less memory than reading the full file.
| Feature | CSV | Parquet |
|---|---|---|
| Compression | Low | High |
| Read/Write Speed | Slow for large files | Fast, especially column-wise |
| Data Types | Stored as text; types must be re-inferred on read | Preserves native types |
| Partial Loading | No | Yes (columns and row filters) |
Parquet is ideal for large, complex datasets where performance and space efficiency are key. It's widely used in big data tools like Apache Spark, Dask, and Hadoop, and integrates seamlessly with pandas in Python.