Large-Scale Data Migration with Apache Spark

Sooner or later every data platform needs a big move — a legacy database to the cloud, one warehouse to another, a schema overhaul. Done with hand-written scripts, these migrations are slow, fragile, and terrifying to restart after a failure halfway through. Apache Spark turns the job into a distributed, partitioned pipeline that processes massive datasets in parallel — and, done right, is idempotent and restartable.

Figure: A Spark batch migration pipeline: read from the source database in partitions, transform and clean, validate row counts, then bulk-write to the target store.

Why Spark for migration

Spark's core advantage is parallelism. It splits the source into partitions and processes them across a cluster, so throughput scales with hardware rather than being bottlenecked on a single thread. It reads and writes almost anything — JDBC databases, Parquet, CSV, cloud object storage — and its DataFrame API expresses complex transformations clearly. For a multi-terabyte move, that's the difference between hours and days.

1. Read the source in partitions

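The first rule of a big JDBC read: never pull a giant table down a single connection. Tell Spark to partition the read on a numeric, date, or timestamp column, give it lower and upper bounds and a partition count, and it opens parallel connections, each fetching a slice. The bounds only set the width of each slice; they do not filter rows, so values outside them still land in the first or last partition.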

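import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumes an existing SparkSession `spark` and a JDBC URL in `sourceUrl`.
Dataset<Row> src = spark.read()
  .format("jdbc")
  .option("url", sourceUrl)
  .option("dbtable", "orders")
  .option("partitionColumn", "id")   // numeric, date, or timestamp column
  .option("lowerBound", "1")         // bounds set the partition stride only;
  .option("upperBound", "50000000")  // they do not filter rows
  .option("numPartitions", "64")     // also caps concurrent JDBC connections
  .load();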
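
To make the slicing concrete, here is a minimal sketch of how a numeric key range could be cut into per-partition WHERE clauses. It approximates the stride idea behind partitionColumn, lowerBound, upperBound and numPartitions; it is not Spark's own code, and the class name PartitionPredicates is hypothetical. An array of such clauses can be passed to the DataFrameReader.jdbc(url, table, predicates, connectionProperties) overload when you want explicit control over the slices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts a key range into contiguous WHERE-clause slices. */
final class PartitionPredicates {

    static List<String> forRange(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 1 || upper - lower < numPartitions) {
            throw new IllegalArgumentException(
                "need numPartitions >= 1 and a range at least that wide");
        }
        if (numPartitions == 1) {
            return List.of("1 = 1"); // a single slice reads the whole table
        }
        long stride = (upper - lower) / numPartitions;
        List<String> predicates = new ArrayList<>();
        long start = lower;
        for (int i = 0; i < numPartitions; i++) {
            long end = start + stride;
            if (i == 0) {
                // First slice is open below, so keys under the lower bound
                // (and NULL keys) are still read.
                predicates.add(column + " < " + end + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // Last slice is open above, so keys past the upper bound are still read.
                predicates.add(column + " >= " + start);
            } else {
                predicates.add(column + " >= " + start + " AND " + column + " < " + end);
            }
            start = end;
        }
        return predicates;
    }
}
```

For the orders read above, forRange("id", 1, 50000000, 64) yields 64 slices, the first being `id < 781250 OR id IS NULL`. Because the first and last slices are open-ended, badly chosen bounds cost balance rather than rows.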

2. Transform and clean

Migrations are rarely a straight copy — schemas differ, data needs cleaning, types need mapping. The DataFrame API makes these transformations explicit and testable: rename and recast columns, deduplicate, fix encodings, and enrich. Keep transformations as pure functions on DataFrames so you can unit-test them on small samples before running the full job.
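
One way to keep the logic testable is to push each cleaning rule into a small pure function and test it without a cluster. The sketch below works on single values rather than on a DataFrame, and everything in it is hypothetical: the OrderCleaner class, the one-letter status codes, and the decimal-comma amounts. In a real job the same rules would typically be expressed as DataFrame column expressions or applied through a typed map.

```java
import java.math.BigDecimal;
import java.util.Locale;

/** Hypothetical cleaning rules: pure functions, no I/O, no shared state. */
final class OrderCleaner {

    /** Maps assumed legacy one-letter codes to target values; other values are upper-cased. */
    static String normaliseStatus(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return "UNKNOWN";
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        switch (s) {
            case "O": return "OPEN";
            case "C": return "CLOSED";
            default:  return s;
        }
    }

    /**
     * Parses an amount stored as text. Assumption for this sketch: the source
     * may use a decimal comma and never uses thousands separators.
     */
    static BigDecimal parseAmount(String raw) {
        return new BigDecimal(raw.trim().replace(',', '.'));
    }
}
```

Because the functions take and return plain values, a handful of sample inputs (blank status, unknown code, comma-formatted amount) covers the edge cases before the full job ever runs.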

3. Validate — don't trust, verify

The scariest migration is one that "succeeded" but silently dropped or corrupted rows. Build validation into the pipeline, not as an afterthought:

  • Row counts per partition, source vs target, must reconcile.
  • Checksums or aggregates (sums of key numeric columns) compared end to end.
  • Null and constraint checks on critical fields.
  • Sample-level spot checks on a random subset.

A migration isn't done when it finishes — it's done when validation proves it's correct.
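
The reconciliation step can be sketched as a comparison of two per-partition metric maps. This is a minimal illustration, assuming the counts (or checksums, such as the sum of a key numeric column) have already been collected for source and target under the same partition ids; the Reconciler class and that bucketing scheme are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciliation check over per-partition metrics. */
final class Reconciler {

    /**
     * Returns the ids of partitions whose source and target values differ,
     * including partitions that appear on one side only.
     */
    static Set<Integer> mismatchedPartitions(Map<Integer, Long> source,
                                             Map<Integer, Long> target) {
        Set<Integer> allIds = new TreeSet<>(source.keySet());
        allIds.addAll(target.keySet());

        Set<Integer> mismatched = new TreeSet<>();
        for (Integer id : allIds) {
            Long s = source.get(id);
            Long t = target.get(id);
            // A missing side counts as a mismatch, not as zero rows.
            if (s == null || !s.equals(t)) {
                mismatched.add(id);
            }
        }
        return mismatched;
    }
}
```

An empty result is the signal to proceed; a non-empty one names exactly which partitions to re-examine or re-run, which is far cheaper than re-checking the whole table.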

4. Make it idempotent and restartable

At terabyte scale, something will fail mid-run — a network blip, a node dying. If a restart re-inserts already-migrated rows, you've made things worse. Design for restart from the start: write in idempotent batches keyed by a natural ID (upserts, or write-to-staging-then-swap), and track which partitions have completed so a rerun skips them. This is exactly the systematic, custom-conversion approach we use to cut fragile manual migrations into repeatable, observable pipelines.
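
Tracking completed partitions can be as simple as an append-only log that a rerun reads back before doing any work. The sketch below assumes a local file with one partition id per line; the PartitionCheckpoint class is hypothetical, and a real pipeline might keep the same information in a control table in the target database instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical checkpoint log: one completed partition id per line. */
final class PartitionCheckpoint {
    private final Path log;
    private final Set<Integer> done = new HashSet<>();

    PartitionCheckpoint(Path log) {
        this.log = log;
        if (!Files.exists(log)) {
            return; // first run: nothing has completed yet
        }
        try {
            for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    done.add(Integer.parseInt(trimmed));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isDone(int partitionId) {
        return done.contains(partitionId);
    }

    /** Call only after the partition's write to the target has committed. */
    void markDone(int partitionId) {
        if (!done.add(partitionId)) {
            return; // already recorded
        }
        try {
            byte[] entry = (partitionId + System.lineSeparator())
                .getBytes(StandardCharsets.UTF_8);
            Files.write(log, entry, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The checkpoint only saves work; it does not replace idempotent writes. A job can still die after a partition commits but before markDone runs, so that partition will be written again on restart and the write itself must tolerate it.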

5. Tune for throughput

  • Right-size partitions: too few leave the cluster underused, too many add scheduling and connection overhead.
  • Bulk-write to the target (batched inserts or native bulk loaders), never row-by-row.
  • Watch for data skew: one giant partition stalls the whole job.
  • Cache reused DataFrames; avoid recomputing expensive stages.
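
Two of these checks reduce to simple arithmetic. The sketch below estimates a partition count from a target partition size and flags skew by comparing the largest partition with the median. The PartitionTuning class is hypothetical, and any target size you pass in (128 MiB in the example that follows) is an assumption to tune for your cluster, not a recommendation from Spark.

```java
import java.util.Arrays;

/** Hypothetical sizing and skew helpers. */
final class PartitionTuning {

    /** Ceiling division: how many partitions give roughly targetBytes each. */
    static int partitionsFor(long totalBytes, long targetBytesPerPartition) {
        if (totalBytes < 0 || targetBytesPerPartition <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long n = (totalBytes + targetBytesPerPartition - 1) / targetBytesPerPartition;
        return (int) Math.max(1, Math.min(n, Integer.MAX_VALUE));
    }

    /** Largest partition divided by the median; values well above 1 suggest skew. */
    static double skewRatio(long[] rowsPerPartition) {
        if (rowsPerPartition.length == 0) {
            throw new IllegalArgumentException("no partitions");
        }
        long[] sorted = rowsPerPartition.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2];
        long max = sorted[sorted.length - 1];
        if (max == 0) {
            return 1.0; // every partition is empty: nothing to balance
        }
        return median == 0 ? Double.POSITIVE_INFINITY : (double) max / median;
    }
}
```

With a 128 MiB target, partitionsFor(1_000_000_000_000L, 128L * 1024 * 1024) gives 7451 partitions for a terabyte of data, and skewRatio(new long[] {100, 100, 100, 1000}) returns 10.0, a sign that one partition holds ten times the typical load.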

Key takeaways

  • Spark parallelises big migrations so throughput scales with the cluster.
  • Partition the source read, express transformations as testable DataFrame functions, and bulk-write the target.
  • Build in validation and idempotent, restartable batches — a migration is done only when it's proven correct.

We engineer large-scale data migrations that don't lose rows. See our case studies or talk to an architect.
