Pandas (Python ≥ 3.9) is the de facto tabular-data toolkit for analytical pipelines. It offers two primary labelled containers, Series and DataFrame, designed around NumPy vectorisation for speed.
Note: current stable is pandas 2.2 (as of 2025-03).
Install/upgrade:
pip install --upgrade pandas                 # classic
python -m pip install pandas numpy pyarrow   # adds Parquet/Feather support
# conda (recommended for scientific stack)
conda install -c conda-forge pandas
If you want Polars-style back-end speed for string and nested dtypes, request the experimental PyArrow backend per call (dtype_backend="pyarrow" in readers, or convert_dtypes, both shown below); pandas 2.x has no stable global switch for it. Separately, opting in to copy-on-write removes most SettingWithCopyWarning cases:
pd.set_option("mode.copy_on_write", True)
One‑dimensional labelled array of homogeneous dtype.
import pandas as pd
s = pd.Series([4, 7, -3], index=["a","b","c"], name="score")
Two‑dimensional tabular data structure with labelled axes.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cara"],
    "age": [32, 34, 29],
    "join": pd.to_datetime(["2023-01-10", "2022-11-23", "2024-07-05"])
}, index=[1001, 1002, 1003])
df.set_index("city", inplace=True)
MultiIndex lets you represent higher‑dimensional data within 2‑D tabular form.
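A minimal sketch (the level and column names here are illustrative):
idx = pd.MultiIndex.from_product([[2023, 2024], ["Q1", "Q2"]], names=["year", "quarter"])
sales = pd.Series([10, 12, 9, 14], index=idx, name="units")
sales.loc[2024]             # slice the outer level
sales.unstack("quarter")    # pivot the inner level into columns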
| Format | Reader read_* | Writer to_* |
|---|---|---|
| CSV / TSV | read_csv | to_csv |
| Parquet | read_parquet | to_parquet |
| SQL | read_sql | to_sql |
| Excel | read_excel | to_excel |
| JSON | read_json | to_json |
df = pd.read_csv("sales_2024.csv", parse_dates=["timestamp"])
df.to_parquet("sales_2024.parquet", index=False)
df = pd.read_csv("sales.csv", parse_dates=["date"], dtype_backend="pyarrow")
read_csv supports dozens of parameters: compression, chunked iteration, Arrow dtypes, remote URLs, and more. For example:
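A hedged sample of common options (the file name and column names are illustrative):
df = pd.read_csv(
    "sales_2024.csv.gz",               # compression inferred from the extension
    usecols=["timestamp", "city", "amount"],
    parse_dates=["timestamp"],
    dtype={"city": "category"},
    nrows=1_000,                       # sample the first rows while prototyping
)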
Tip — Parquet is columnar and compressed; expect 5-10× smaller files and faster loads.
df.to_parquet("sales.parquet", engine="pyarrow", compression="zstd")
Parquet preserves schema and nullability; the Arrow engine makes read/write round-trips roughly 30× faster than CSV.
Use label-based selection (df.loc["row_label", "col"]) and position-based selection (df.iloc[3, 0]).
4.2.1 Single cell
df.at[1002, "age"] = 35
4.2.2 Conditional assignment
df.loc[df["age"] < 30, "segment"] = "young"
df[df["pop"] > 4]
df.query("pop > 4 & city == 'Joburg'")
Avoid SettingWithCopyWarning by filtering and assigning in a single .loc call rather than through chained selections, as sketched below.
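A sketch of the difference, reusing the age/segment columns from above:
# chained selection: may write to a temporary copy and warn
df[df["age"] < 30]["segment"] = "young"

# single .loc call: filters and assigns on the original frame
df.loc[df["age"] < 30, "segment"] = "young"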
df.isna().sum()
df.fillna({"pop": df["pop"].median()})
The isna / fillna duo forms pandas’ core missing‑data toolkit.
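A tiny illustration on a throwaway Series:
s = pd.Series([1.0, None, 3.0, None])
s.isna()        # boolean mask of missing values
s.fillna(0)     # replace NA with a constant
s.ffill()       # propagate the last valid value forward
s.dropna()      # or drop missing entries entirely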
df["rating"] = df["rating"].astype("Int64") # preserves NA
df.convert_dtypes(dtype_backend="pyarrow")
Arrow‑backed extension arrays deliver first‑class nullability and memory savings.
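A quick way to see the effect; the column contents here are made up:
obj = pd.Series(["alpha", "beta", None] * 100_000)       # object dtype
arrow = obj.convert_dtypes(dtype_backend="pyarrow")      # NA-aware Arrow-backed string dtype
obj.memory_usage(deep=True), arrow.memory_usage(deep=True)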
Core quick-stats (count, mean, std, min, quartiles, max) for numeric columns:
df.describe(numeric_only=True)
GroupBy orchestrates splitting, applying, and combining in a single fluent chain.
# average age by segment
df.groupby("segment", observed=True)["age"].mean()
# multi‑agg with named columns
g = df.groupby("segment", observed=True).agg(
    avg_age=("age", "mean"),
    n=("age", "size"),
)
With observed=True, unused categories of a categorical grouper are omitted from the result, as in the sketch below.
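For example, with a grouper that declares a category no row actually uses (values illustrative):
seg = pd.Categorical(["young", "young", "adult"], categories=["young", "adult", "senior"])
tmp = pd.DataFrame({"segment": seg, "age": [25, 28, 40]})
tmp.groupby("segment", observed=True)["age"].mean()    # result covers young and adult only
tmp.groupby("segment", observed=False)["age"].mean()   # also lists senior, with NaN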
pd.concat([q1_df, q2_df], axis=0, ignore_index=True)
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    indicator=True,   # adds a _merge column showing each row's join result
)
pd.merge(df_a, df_b, on="id", how="inner")
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
Use concat for stacking objects; DataFrame.append was deprecated and has been removed in pandas 2.0.
# wide → long
tidy = pd.melt(df,
               id_vars=["name"],
               value_vars=["jan", "feb", "mar"],
               var_name="month",
               value_name="sales")
# long → wide
wide = tidy.pivot(index="name", columns="month", values="sales")
# DatetimeIndex
ts = df.set_index("join").sort_index()
# resample to monthly start
monthly = ts.resample("MS")["age"].mean()
# window rolling mean
ts["age"].rolling(window="30D", min_periods=5).mean()
Time zone localisation: ts.tz_localize("Africa/Johannesburg")
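tz_localize attaches a zone to a naive index, while tz_convert translates between zones; a short sketch:
ts = ts.tz_localize("Africa/Johannesburg")   # naive timestamps → zone-aware
ts.tz_convert("UTC")                         # same instants, expressed in UTC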
# forward-fill missing values (fillna(method="ffill") is deprecated; use ffill)
df["close"] = df["close"].ffill()
emails = pd.Series(["a@x.co","b@y.org ",None])
emails.str.strip().str.split("@", expand=True)
Common string methods: contains(), replace(), extract(), len(), upper()
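All of these live under the .str accessor; a sketch on the emails Series above (the pattern is illustrative):
emails.str.contains("@x", na=False)           # boolean mask; None maps to False
emails.str.replace(".org", ".net", regex=False)
emails.str.extract(r"@(?P<domain>[\w.]+)")    # capture group becomes a column
emails.str.len()
emails.str.upper()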
pd.set_option("compute.use_bottleneck", True)
- pd.read_csv(..., chunksize=100_000)        # stream large files in chunks
- df.info(memory_usage="deep")               # inspect the true memory footprint
- df["ratio"] = df["sales"] / df["cost"]     # vectorised arithmetic, no Python loop
Numba UDFs:
import numba

@numba.jit(nopython=True)
def fast_z(x):
    return (x - x.mean()) / x.std()

df["z"] = fast_z(df["pop"].to_numpy())
df = df.convert_dtypes(dtype_backend="pyarrow")   # opt in to Arrow-backed dtypes (no global option in pandas 2.x)
The Arrow backend reduces memory use and speeds arithmetic by roughly 30 % on medium-wide tables.
import duckdb, pandas as pd
duckdb.query("SELECT city, SUM(pop) FROM df GROUP BY 1").to_df()
DuckDB executes SQL directly on in‑memory DataFrames without conversion overhead.
df.info(memory_usage="deep")
Downcast numeric columns (pd.to_numeric(..., downcast="integer")) to trim RAM usage; community benchmarks commonly report savings above 50 %.
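A minimal downcasting sketch, assuming numeric columns named pop and ratio:
df["pop"] = pd.to_numeric(df["pop"], downcast="integer")    # int64 → smallest fitting integer
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")  # float64 → float32 (smallest float dtype)
df.info(memory_usage="deep")                                # verify the saving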
NumPy — vector operations are backed by ndarray.
Matplotlib/Seaborn — plotting (df.plot()).
Polars — from_pandas for lightning-fast queries.
DuckDB — SQL over pandas DataFrames with duckdb.query().
scikit-learn — directly consumes DataFrames (fit/transform).
df.to_sql("table", "sqlite:///db.sqlite", if_exists="replace")
duckdb.connect("db.duckdb").execute("CREATE TABLE tbl AS SELECT * FROM df")   # pandas has no to_duckdb writer; DuckDB picks up df by name
You can move between pandas and Spark, Polars, RAPIDS, or Excel via the respective to_* / from_* / read_* helpers; for example:
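A Polars round-trip might look like this (assuming polars is installed; the filter is illustrative):
import polars as pl

pl_df = pl.from_pandas(df)                    # pandas → Polars
result = pl_df.filter(pl.col("age") > 30)     # query in Polars
back = result.to_pandas()                     # and back to pandas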
Prefer the categorical dtype for low-cardinality strings; a short sketch follows.
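A sketch of the conversion, assuming a string column named city:
df["city"] = df["city"].astype("category")
df["city"].memory_usage(deep=True)    # typically much smaller than the object version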
# create
pd.DataFrame(data) # constructor
pd.read_csv("file.csv") # IO
# inspect
df.head(); df.tail()
df.info(); df.describe()
# select
df["col"]; df.loc[r, c]; df.iloc[i, j]
# filter
df[df["age"] > 30]
# mutate
df.assign(bonus = df["salary"]*0.1)
# group / aggregate
df.groupby("team")["pts"].sum()
# join
df1.merge(df2, on="id", how="inner")
# reshape
pd.melt(...); df.pivot(...)
# time‑series
ts.resample("W").mean()
# output
df.to_parquet("out.parquet")