pip install "pandas[pyarrow]"  # pandas >= 2.2 plus the Arrow backend in one install (quotes keep zsh happy)
The optional [pyarrow] extra pulls in Apache Arrow to unlock the Arrow extension array engine.
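A quick check that both pieces landed (a shell one-liner):
python -c "import pandas, pyarrow; print(pandas.__version__)"  # expect >= 2.2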
import pandas as pd
import numpy as np
Most examples in this guide assume the conventional pd and np aliases.
pd.Series([1, 2, 3], index=["a", "b", "c"], name="scores")
df = pd.DataFrame(
    {"city": ["Joburg", "Cape Town", "Durban"],
     "pop": [5.6, 4.9, 3.2]})
The DataFrame constructor accepts dict‑of‑arrays, dict‑of‑Series, list‑of‑dicts, and more.
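For instance, the same table from a list of dicts (an equivalent sketch):
pd.DataFrame([
    {"city": "Joburg", "pop": 5.6},
    {"city": "Cape Town", "pop": 4.9},
    {"city": "Durban", "pop": 3.2},
])  # missing keys would become NaN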
df.set_index("city", inplace=True)
MultiIndex lets you represent higher‑dimensional data within 2‑D tabular form.
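A minimal MultiIndex sketch (the province labels are illustrative):
idx = pd.MultiIndex.from_tuples(
    [("Gauteng", "Joburg"), ("KZN", "Durban")],
    names=["province", "city"])
pd.Series([5.6, 3.2], index=idx).loc["Gauteng"]  # slice by the outer level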
df = pd.read_csv("sales.csv", parse_dates=["date"], dtype_backend="pyarrow")
read_csv exposes dozens of parameters: compression, chunked iteration, Arrow dtypes, remote URLs, and more.
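Chunked iteration, for example, streams a large file in manageable pieces (a sketch; the chunk size is illustrative):
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    print(len(chunk))  # process each slice without loading the whole file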
df.to_parquet("sales.parquet", engine="pyarrow", compression="zstd")
Parquet preserves schema and nullability; with the Arrow engine, round-trips are typically an order of magnitude faster than CSV.
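Reading it back is symmetric:
df2 = pd.read_parquet("sales.parquet", engine="pyarrow")  # schema and dtypes survive the round trip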
df.loc["Joburg", "pop"] # label
df.iloc[0, 0]  # position ("pop" is the only column once "city" is the index)
df[df["pop"] > 4]
df.query("pop > 4 & city == 'Joburg'")
df.isna().sum()
df.fillna({"pop": df["pop"].median()})
The isna / fillna duo forms pandas’ core missing‑data toolkit.
df["rating"] = df["rating"].astype("Int64") # preserves NA
df.convert_dtypes(dtype_backend="pyarrow")
Arrow‑backed extension arrays deliver first‑class nullability and memory savings.
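For example, an Arrow-backed string column treats NA as a first-class value (a small sketch):
s = pd.Series(["alpha", None, "beta"], dtype="string[pyarrow]")
s.isna()  # False, True, False; no object-dtype fallback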
g = df.groupby("city")
g["pop"].mean()
GroupBy orchestrates splitting, applying, and combining in a single fluent chain.
g.agg(pop_mean=("pop","mean"), pop_sd=("pop","std"))
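transform returns results aligned to the original rows, handy for group-wise normalization (a sketch):
df["pop_share"] = g["pop"].transform(lambda s: s / s.sum())  # each value as a share of its group total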
ts = (pd.read_csv("prices.csv", parse_dates=["date"])
        .set_index("date")
        .asfreq("D")   # expand to a daily calendar, inserting NaN gaps
        .ffill())      # then forward-fill the gaps
pandas wraps numpy.datetime64 with resampling (resample), shifting (shift), and rolling windows (rolling).
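For example (assuming a numeric price column, which is illustrative):
monthly = ts["price"].resample("MS").mean()     # month-start averages
ts["prev"] = ts["price"].shift(1)               # lag by one row
ts["ma30"] = ts["price"].rolling("30D").mean()  # 30-day moving average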
pd.merge(df_a, df_b, on="id", how="inner")
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
concat replaces DataFrame.append (deprecated in 1.4, removed in 2.0) and is far faster than appending in a loop.
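The idiomatic pattern is to collect frames in a list and concatenate once (a sketch):
parts = [pd.DataFrame({"x": [i]}) for i in range(3)]
out = pd.concat(parts, ignore_index=True)  # one allocation instead of n incremental copies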
wide = df.pivot(index="date", columns="city", values="pop")
tidy = wide.reset_index().melt(id_vars="date", var_name="city", value_name="pop")
The reshape API covers pivot, pivot_table, stack / unstack, and melt for Tidy‑data transformations.
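pivot_table additionally aggregates duplicate index/column pairs, and stack folds columns back into the row index (a sketch):
pd.pivot_table(df, index="date", columns="city", values="pop", aggfunc="mean")
wide.stack()  # inverse of unstack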
df["pop"].plot(kind="bar") # thin wrapper around matplotlib
For interactive plots try plotly.express or hvplot; both accept DataFrames seamlessly.
pandas delegates to Matplotlib; specify kind and other keyword args for quick EDA.
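A plotly.express sketch (assumes plotly is installed):
import plotly.express as px
px.bar(df.reset_index(), x="city", y="pop").show()  # interactive bar chart from the same frame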
df["ratio"] = df["sales"] / df["cost"] # no Python loop
Numba can JIT-compile numeric UDFs that run on the underlying NumPy array:
import numba

@numba.jit(nopython=True)
def fast_z(x):
    return (x - x.mean()) / x.std()

df["z"] = fast_z(df["pop"].to_numpy())
df = df.convert_dtypes(dtype_backend="pyarrow")  # the backend is chosen per call; pandas has no stable global switch
The Arrow backend trims memory and can noticeably speed up arithmetic and string operations on wide tables.
import duckdb, pandas as pd
duckdb.query("SELECT city, SUM(pop) FROM df GROUP BY 1").to_df()
DuckDB executes SQL directly on in‑memory DataFrames without conversion overhead.
df.info(memory_usage="deep")
Downcast numeric columns (pd.to_numeric(..., downcast="integer")) to trim RAM; shrinking int64 to int8/int16 where values allow can cut those columns' footprint by half or more.
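For example (a sketch on the toy frame above):
df["pop"] = pd.to_numeric(df["pop"], downcast="float")  # float64 -> float32 where values allow
df.info(memory_usage="deep")                            # compare before and after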
pd.Series([1, 2, None], dtype="Int64")
Nullable integer / boolean types move NA handling into native semantics.
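The boolean counterpart propagates NA under three-valued (Kleene) logic (a sketch):
s = pd.Series([True, None, False], dtype="boolean")
s & True  # True, <NA>, False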
df["7d_mean"] = df["pop"].rolling(window="7D").mean()
df["city"] = df["city"].astype("category")
Categoricals accelerate groupby/group‑wise stats on low‑cardinality columns.
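A quick memory comparison on a low-cardinality column (illustrative sizes):
s = pd.Series(["Joburg", "Durban"] * 500_000)
s.memory_usage(deep=True)                      # object dtype: tens of MB
s.astype("category").memory_usage(deep=True)   # int8 codes + 2 labels: roughly 1 MB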
df.to_sql("sales", "sqlite:///db.sqlite", if_exists="replace")  # string URIs require SQLAlchemy
duckdb.connect("db.duckdb").execute("CREATE TABLE sales AS SELECT * FROM df")  # pandas has no to_duckdb
You can move between pandas and Excel, SQL, or Parquet via the read_* / to_* helpers; Spark, Polars, and RAPIDS cuDF ship their own from_pandas / to_pandas converters.
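Polars, for instance, ships its own converters (a sketch; requires polars installed):
import polars as pl
pl_df = pl.from_pandas(df.reset_index())  # pandas -> Polars
df_back = pl_df.to_pandas()               # and back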