Large-Scale Data Migration with Apache Spark

Sooner or later every data platform needs a big move — a legacy database to the cloud, one warehouse to another, a schema overhaul. Done with hand-written scripts, these migrations are slow, fragile, and terrifying to restart after a failure halfway through. Apache Spark turns the job into a distributed, partitioned pipeline that processes massive datasets in parallel — and, done right, is idempotent and restartable.
Why Spark for migration
Spark's core advantage is parallelism. It splits the source into partitions and processes them across a cluster, so throughput scales with hardware rather than being bottlenecked on a single thread. It reads and writes almost anything — JDBC databases, Parquet, CSV, cloud object storage — and its DataFrame API expresses complex transformations clearly. For a multi-terabyte move, that's the difference between hours and days.
1. Read the source in partitions
The first rule of a big JDBC read: never pull a giant table down a single connection. Tell Spark to partition the read on a numeric, date, or timestamp column, ideally an indexed one with evenly spread values, and it opens parallel connections that each fetch one slice.
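import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumes an existing SparkSession `spark` and a JDBC URL in `sourceUrl`.
Dataset<Row> src = spark.read()
    .format("jdbc")
    .option("url", sourceUrl)
    .option("dbtable", "orders")
    .option("partitionColumn", "id")   // numeric, date, or timestamp column
    .option("lowerBound", "1")         // bounds set the slice stride; they do not filter rows
    .option("upperBound", "50000000")
    .option("numPartitions", "64")     // also caps the number of concurrent JDBC connections
    .load();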
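To see what the partitioning options do, it helps to picture the WHERE clauses Spark generates, one per partition. The sketch below is plain Java and only approximates the idea: `PartitionSlices` and `slices` are hypothetical names, and Spark's own stride arithmetic differs in rounding details. What it illustrates is that the bounds shape the slices but never filter rows: the first slice also picks up NULLs and anything below the lower bound, and the last slice is open-ended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: approximates how a bounded range becomes per-partition predicates.
final class PartitionSlices {
    static List<String> slices(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 2 || upper - lower < numPartitions) {
            throw new IllegalArgumentException(
                "need at least 2 partitions and a range wider than numPartitions");
        }
        long stride = (upper - lower) / numPartitions;
        List<String> predicates = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            long from = lower + i * stride;
            long to = from + stride;
            if (i == 0) {
                // First slice is open at the bottom and also collects NULL keys.
                predicates.add(column + " < " + to + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // Last slice is open at the top, so rows above upperBound land here.
                predicates.add(column + " >= " + from);
            } else {
                predicates.add(column + " >= " + from + " AND " + column + " < " + to);
            }
        }
        return predicates;
    }
}
```

One consequence worth noting: if the real maximum `id` is far above `upperBound`, the last slice ends up carrying the surplus, so it pays to derive the bounds from the actual minimum and maximum of the column.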
2. Transform and clean
Migrations are rarely a straight copy — schemas differ, data needs cleaning, types need mapping. The DataFrame API makes these transformations explicit and testable: rename and recast columns, deduplicate, fix encodings, and enrich. Keep transformations as pure functions on DataFrames so you can unit-test them on small samples before running the full job.
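As a minimal sketch of the "pure function" idea, the row-level rules can live in plain Java that is testable without a cluster. The column names (`ORD_NO`, `AMT`, `EMAIL`) and the `OrderTransform` class are hypothetical, not taken from any real schema. In a Spark job the same logic could be applied through a `Dataset.map` call or a UDF, or rewritten as column expressions.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical cleaning rules for one legacy row; no I/O, no shared state.
final class OrderTransform {
    /** Pure function: the same input row always yields the same output row. */
    static Map<String, Object> clean(Map<String, String> legacy) {
        Map<String, Object> out = new LinkedHashMap<>();
        // Rename and recast: text key -> long, text amount -> exact decimal.
        // Malformed or missing required values throw, so bad rows fail loudly.
        out.put("order_id", Long.parseLong(legacy.get("ORD_NO").trim()));
        out.put("amount", new BigDecimal(legacy.get("AMT").trim()));
        // Normalise an optional field: blank becomes null, otherwise trimmed lower case.
        String email = legacy.get("EMAIL");
        out.put("email", email == null || email.isBlank()
            ? null
            : email.trim().toLowerCase(Locale.ROOT));
        return out;
    }
}
```

Because the function touches nothing outside its argument, a handful of hand-written rows is enough to cover the edge cases before the full job runs.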
3. Validate — don't trust, verify
The scariest migration is one that "succeeded" but silently dropped or corrupted rows. Build validation into the pipeline, not as an afterthought:
- Row counts per partition, source vs target, must reconcile.
- Checksums or aggregates (sums of key numeric columns) compared end to end.
- Null and constraint checks on critical fields.
- Sample-level spot checks on a random subset.
A migration isn't done when it finishes — it's done when validation proves it's correct.
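The row-count check from the list above can be sketched as a small reconciliation function. This is an illustration under assumptions: the per-partition counts are taken as already collected into maps (in Spark they might come from a grouped count on each side), and `Reconciler` is a hypothetical name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical reconciliation step: compares row counts per partition, source vs target.
final class Reconciler {
    /** Returns one entry per partition whose counts disagree; empty means reconciled. */
    static Map<Integer, String> mismatches(Map<Integer, Long> source, Map<Integer, Long> target) {
        Set<Integer> allPartitions = new TreeSet<>(source.keySet());
        allPartitions.addAll(target.keySet());
        Map<Integer, String> diff = new TreeMap<>();
        for (Integer partition : allPartitions) {
            // A partition missing on one side counts as zero rows there.
            long s = source.getOrDefault(partition, 0L);
            long t = target.getOrDefault(partition, 0L);
            if (s != t) {
                diff.put(partition, "source=" + s + " target=" + t);
            }
        }
        return diff;
    }
}
```

The same shape works for checksums: swap the counts for a sum of a key numeric column per partition and compare in the same way.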
4. Make it idempotent and restartable
At terabyte scale, something will fail mid-run: a network blip, a dying node. If a restart re-inserts already-migrated rows, you've made things worse. Design for restart from the start: write in idempotent batches keyed by a natural ID (upserts, or write-to-staging-then-swap), and track which partitions have completed so a rerun skips them. This is the same systematic approach we use to turn fragile manual migrations into repeatable, observable pipelines.
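The restart logic is easier to reason about in miniature. In the sketch below everything is an in-memory stand-in and every name is hypothetical: the `completed` set plays the role of a durable ledger table, and the `target` map plays the role of a target table keyed by natural ID, where `putAll` behaves like an upsert. The `failAt` parameter exists only to simulate a crash.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical miniature of a restartable, idempotent migration loop.
final class RestartableMigration {
    private final Set<Integer> completed = new HashSet<>();    // stand-in for a durable ledger
    private final Map<Long, String> target = new HashMap<>();  // stand-in for the target table

    /** Processes pending partitions in order; returns how many completed in this run. */
    int run(Map<Integer, Map<Long, String>> partitions, int failAt) {
        int processed = 0;
        for (Map.Entry<Integer, Map<Long, String>> e : new TreeMap<>(partitions).entrySet()) {
            int partitionId = e.getKey();
            if (completed.contains(partitionId)) {
                continue; // finished in an earlier run, skip it
            }
            target.putAll(e.getValue()); // upsert by natural ID: rewriting a row is harmless
            if (partitionId == failAt) {
                throw new IllegalStateException("simulated failure in partition " + partitionId);
            }
            completed.add(partitionId); // mark done only after the write succeeded
            processed++;
        }
        return processed;
    }

    int targetSize() {
        return target.size();
    }
}
```

The ordering is the design choice that matters: a partition is marked complete only after its write lands, so a crash in between causes a repeat write, which the upsert absorbs without creating duplicates.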
5. Tune for throughput
- Right-size partitions: too few underuse the cluster, too many add scheduling overhead.
- Bulk-write to the target (batched inserts or native bulk loaders), never row-by-row.
- Watch for data skew — one giant partition stalls the whole job.
- Cache reused DataFrames; avoid recomputing expensive stages.
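Two of the tips above reduce to simple arithmetic, sketched here in plain Java. The names are hypothetical, and the 128 MiB target used in the usage note is a commonly quoted starting point rather than a figure from this article; the right value depends on the cluster and the data.

```java
// Hypothetical tuning helpers: partition sizing and a basic skew measure.
final class PartitionTuning {
    /** Partitions needed so each holds about targetBytes, never fewer than minPartitions. */
    static int suggestPartitions(long totalBytes, long targetBytes, int minPartitions) {
        if (totalBytes < 0 || targetBytes <= 0 || minPartitions < 1) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long bySize = (totalBytes + targetBytes - 1) / targetBytes; // ceiling division
        return Math.toIntExact(Math.max(minPartitions, bySize));
    }

    /** Largest partition divided by the mean; values well above 1 suggest skew. */
    static double skewRatio(long[] rowCounts) {
        if (rowCounts.length == 0) {
            throw new IllegalArgumentException("no partitions");
        }
        long max = 0;
        long sum = 0;
        for (long c : rowCounts) {
            max = Math.max(max, c);
            sum += c;
        }
        if (sum == 0) {
            return 1.0; // all partitions empty, nothing is skewed
        }
        double mean = (double) sum / rowCounts.length;
        return max / mean;
    }
}
```

For example, 1 TiB at a 128 MiB target gives `suggestPartitions(1L << 40, 128L << 20, 64)`, which works out to 8192 partitions, and four partitions holding 100, 100, 100, and 700 rows give a skew ratio of 2.8.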
Key takeaways
- Spark parallelises big migrations so throughput scales with the cluster.
- Partition the source read, express transformations as testable DataFrame functions, and bulk-write the target.
- Build in validation and idempotent, restartable batches — a migration is done only when it's proven correct.