pip install "pandas[pyarrow]"  # pandas >= 2.2 plus the Arrow backend in one install (quotes keep zsh happy)
The optional [pyarrow] extra pulls in Apache Arrow to unlock the Arrow extension array engine.
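A quick check that both pieces landed (a shell one-liner):
python -c "import pandas, pyarrow; print(pandas.__version__)"  # expect >= 2.2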
import pandas as pd
import numpy as np
Most examples in this guide assume the conventional pd and np aliases.
pd.Series([1, 2, 3], index=["a", "b", "c"], name="scores")
df = pd.DataFrame(
    {"city": ["Joburg", "Cape Town", "Durban"],
     "pop": [5.6, 4.9, 3.2]})
The DataFrame constructor accepts dict‑of‑arrays, dict‑of‑Series, list‑of‑dicts, and more.
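For instance, the same table from a list of dicts (an equivalent sketch):
pd.DataFrame([
    {"city": "Joburg", "pop": 5.6},
    {"city": "Cape Town", "pop": 4.9},
    {"city": "Durban", "pop": 3.2},
])  # missing keys would become NaN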
df.set_index("city", inplace=True)
MultiIndex lets you represent higher‑dimensional data within 2‑D tabular form.
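A minimal MultiIndex sketch (the province labels are illustrative):
idx = pd.MultiIndex.from_tuples(
    [("Gauteng", "Joburg"), ("KZN", "Durban")],
    names=["province", "city"])
pd.Series([5.6, 3.2], index=idx).loc["Gauteng"]  # slice by the outer level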
df = pd.read_csv("sales.csv", parse_dates=["date"], dtype_backend="pyarrow")
read_csv exposes dozens of parameters: compression, chunked iteration, Arrow dtypes, remote URLs, and more.
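Chunked iteration, for example, streams a large file in manageable pieces (a sketch; the chunk size is illustrative):
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    print(len(chunk))  # process each slice without loading the whole file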
df.to_parquet("sales.parquet", engine="pyarrow", compression="zstd")
Parquet preserves schema and nullability; with the Arrow engine, round-trips are typically an order of magnitude faster than CSV.
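Reading it back is symmetric:
df2 = pd.read_parquet("sales.parquet", engine="pyarrow")  # schema and dtypes survive the round trip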
df.loc["Joburg", "pop"] # label
df.iloc[0, 0]  # position ("pop" is the only column once "city" is the index)
df[df["pop"] > 4]
df.query("pop > 4 & city == 'Joburg'")
df.isna().sum()
df.fillna({"pop": df["pop"].median()})
The isna / fillna duo forms pandas’ core missing‑data toolkit.
df["rating"] = df["rating"].astype("Int64") # preserves NA
df.convert_dtypes(dtype_backend="pyarrow")
Arrow‑backed extension arrays deliver first‑class nullability and memory savings.
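For example, an Arrow-backed string column treats NA as a first-class value (a small sketch):
s = pd.Series(["alpha", None, "beta"], dtype="string[pyarrow]")
s.isna()  # False, True, False; no object-dtype fallback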
g = df.groupby("city")
g["pop"].mean()
GroupBy orchestrates splitting, applying, and combining in a single fluent chain.
g.agg(pop_mean=("pop","mean"), pop_sd=("pop","std"))
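transform returns results aligned to the original rows, handy for group-wise normalization (a sketch):
df["pop_share"] = g["pop"].transform(lambda s: s / s.sum())  # each value as a share of its group total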
ts = (pd.read_csv("prices.csv", parse_dates=["date"])
        .set_index("date")
        .asfreq("D")   # expand to a daily calendar, inserting NaN gaps
        .ffill())      # then forward-fill the gaps
pandas wraps numpy.datetime64 with resampling (resample), shifting (shift), and rolling windows (rolling).
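For example (assuming a numeric price column, which is illustrative):
monthly = ts["price"].resample("MS").mean()     # month-start averages
ts["prev"] = ts["price"].shift(1)               # lag by one row
ts["ma30"] = ts["price"].rolling("30D").mean()  # 30-day moving average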
pd.merge(df_a, df_b, on="id", how="inner")
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
concat replaces DataFrame.append (deprecated in 1.4, removed in 2.0) and is far faster than appending in a loop.
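The idiomatic pattern is to collect frames in a list and concatenate once (a sketch):
parts = [pd.DataFrame({"x": [i]}) for i in range(3)]
out = pd.concat(parts, ignore_index=True)  # one allocation instead of n incremental copies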
wide = df.pivot(index="date", columns="city", values="pop")
tidy = wide.reset_index().melt(id_vars="date", var_name="city", value_name="pop")
The reshape API covers pivot, pivot_table, stack / unstack, and melt for Tidy‑data transformations.
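pivot_table additionally aggregates duplicate index/column pairs, and stack folds columns back into the row index (a sketch):
pd.pivot_table(df, index="date", columns="city", values="pop", aggfunc="mean")
wide.stack()  # inverse of unstack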
df["pop"].plot(kind="bar") # thin wrapper around matplotlib
For interactive plots try plotly.express or hvplot; both accept DataFrames seamlessly.
pandas delegates to Matplotlib; specify kind and other keyword args for quick EDA.
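A plotly.express sketch (assumes plotly is installed):
import plotly.express as px
px.bar(df.reset_index(), x="city", y="pop").show()  # interactive bar chart from the same frame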
df["ratio"] = df["sales"] / df["cost"] # no Python loop
Numba can JIT-compile numeric UDFs that run on the underlying NumPy array:
import numba

@numba.jit(nopython=True)
def fast_z(x):
    return (x - x.mean()) / x.std()

df["z"] = fast_z(df["pop"].to_numpy())
df = df.convert_dtypes(dtype_backend="pyarrow")  # the backend is chosen per call; pandas has no stable global switch
The Arrow backend trims memory and can noticeably speed up arithmetic and string operations on wide tables.
import duckdb, pandas as pd
duckdb.query("SELECT city, SUM(pop) FROM df GROUP BY 1").to_df()
DuckDB executes SQL directly on in‑memory DataFrames without conversion overhead.
df.info(memory_usage="deep")
Downcast numeric columns (pd.to_numeric(..., downcast="integer")) to trim RAM; shrinking int64 to int8/int16 where values allow can cut those columns' footprint by half or more.
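For example (a sketch on the toy frame above):
df["pop"] = pd.to_numeric(df["pop"], downcast="float")  # float64 -> float32 where values allow
df.info(memory_usage="deep")                            # compare before and after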
pd.Series([1, 2, None], dtype="Int64")
Nullable integer / boolean types move NA handling into native semantics.
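The boolean counterpart propagates NA under three-valued (Kleene) logic (a sketch):
s = pd.Series([True, None, False], dtype="boolean")
s & True  # True, <NA>, False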
df["7d_mean"] = df["pop"].rolling(window="7D").mean()
df["city"] = df["city"].astype("category")
Categoricals accelerate groupby/group‑wise stats on low‑cardinality columns.
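A quick memory comparison on a low-cardinality column (illustrative sizes):
s = pd.Series(["Joburg", "Durban"] * 500_000)
s.memory_usage(deep=True)                      # object dtype: tens of MB
s.astype("category").memory_usage(deep=True)   # int8 codes + 2 labels: roughly 1 MB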
df.to_sql("sales", "sqlite:///db.sqlite", if_exists="replace")  # string URIs require SQLAlchemy
duckdb.connect("db.duckdb").execute("CREATE TABLE sales AS SELECT * FROM df")  # pandas has no to_duckdb
You can move between pandas and Excel, SQL, or Parquet via the read_* / to_* helpers; Spark, Polars, and RAPIDS cuDF ship their own from_pandas / to_pandas converters.
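Polars, for instance, ships its own converters (a sketch; requires polars installed):
import polars as pl
pl_df = pl.from_pandas(df.reset_index())  # pandas -> Polars
df_back = pl_df.to_pandas()               # and back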