Beyond Pandas: How to Stream TBs of Data Between MySQL and SQL Server
The most common mistake I see in Python data engineering is the addiction to fetchall().
It works fine for 10,000 rows. But when you need to migrate 50 million records from MySQL to SQL Server, loading everything into a list—or even a Pandas DataFrame—is a guaranteed way to crash your server with an Out of Memory (OOM) error.
The solution isn't a bigger server. It's Python Generators.
The "RAM Explosion" Anti-Pattern
Here is the code that kills servers:
# DON'T DO THIS
cursor.execute("SELECT * FROM massive_table")
data = cursor.fetchall() # <--- CRASH!
# Python tries to create a list containing 50M tuples in RAM.
The Solution: The `yield` Keyword
Instead of holding the data, we create a function that streams it. The `yield` keyword turns a standard function into a generator, which pauses execution and "yields" a chunk of data only when the consumer asks for it.
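To see the mechanics in isolation before we touch a database, here is a minimal sketch of a batching generator (the name `batched` and the inputs are illustrative):

```python
def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, producing each batch lazily."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch  # pause here until the caller asks for more
            batch = []
    if batch:
        yield batch  # final partial batch

# Only one batch exists in memory at a time, even for a huge input:
stream = batched(range(10_000_000), 1000)
first = next(stream)  # nothing beyond the first batch is materialized
print(len(first))     # 1000
```

Calling `batched(...)` runs no loop at all; work only happens when the consumer pulls the next batch. That lazy hand-off is exactly what we exploit below.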
The "Server-Side Cursor" Pattern
To make this work, you must tell the database not to send everything at once. In pymysql this is done with an unbuffered server-side cursor (SSCursor); psycopg2 offers the same concept via named cursors.
import pymysql

def stream_from_mysql(query, batch_size=1000):
    conn = pymysql.connect(...)
    try:
        # SSCursor keeps the result set on the server, not in Python's RAM
        with conn.cursor(pymysql.cursors.SSCursor) as cursor:
            cursor.execute(query)
            while True:
                # Fetch only batch_size rows at a time
                chunk = cursor.fetchmany(batch_size)
                if not chunk:
                    break
                # Hand this chunk to the caller, then PAUSE here
                yield chunk
    finally:
        # Runs even if the caller abandons the generator early
        conn.close()
Connecting the Pipes
Now we can loop over this generator. The beauty is that whether the table has 1,000 rows or 10 billion rows, your Python script will never use more than a few MBs of RAM (the size of one batch).
# The Generator Pipeline
data_stream = stream_from_mysql("SELECT * FROM orders", batch_size=5000)

for batch in data_stream:
    # Process 5,000 rows
    transformed_batch = transform_currency(batch)
    # Write to SQL Server
    sql_server_cursor.executemany(insert_query, transformed_batch)
    sql_server_conn.commit()
    print("Migrated batch...")
    # The previous batch is now garbage collected from RAM automatically
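`transform_currency` above stands in for whatever per-batch work your pipeline needs. A minimal sketch, assuming rows are tuples with the amount in column 3 and a static exchange rate (both assumptions for illustration):

```python
EUR_PER_USD = 0.92  # assumed static rate, purely for illustration

def transform_currency(batch, amount_index=3):
    """Convert the USD amount column of each row tuple to EUR."""
    transformed = []
    for row in batch:
        row = list(row)  # tuples are immutable, so copy to a list
        row[amount_index] = round(row[amount_index] * EUR_PER_USD, 2)
        transformed.append(tuple(row))
    return transformed

batch = [(1, "A-100", "2024-01-05", 100.00)]
print(transform_currency(batch))  # amount column is now 92.0
```

Note the function takes and returns a whole batch, so it slots straight into the loop without changing its memory profile. On the writer side, if you connect to SQL Server via pyodbc, setting `cursor.fast_executemany = True` before calling `executemany` usually speeds up bulk inserts considerably.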
Why This Matters
This pattern decouples Data Volume from System Resources.
- Cost: You can migrate terabytes of data from a tiny $5/month VPS.
- Stability: No random crashes when data grows unexpectedly.
- Speed: Network I/O overlaps with processing (especially if you use threading for the writer).
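That overlap can be sketched with a bounded queue: a reader thread fetches batches while the main thread writes. Here `write_batch` just records batches so the sketch is self-contained; in practice it would be your `executemany` + `commit` call:

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream

def reader(source_batches, q):
    """Producer: fetch batches and hand them to the writer via the queue."""
    for batch in source_batches:
        q.put(batch)  # blocks if the writer falls behind (bounded queue)
    q.put(SENTINEL)   # signal end of stream

written = []

def write_batch(batch):
    written.append(batch)  # stand-in for executemany + commit

# A bounded queue caps RAM at maxsize batches, no matter the table size
q = queue.Queue(maxsize=4)
source = ([i] * 3 for i in range(5))  # fake source: five 3-row batches
t = threading.Thread(target=reader, args=(source, q))
t.start()

while (batch := q.get()) is not SENTINEL:
    write_batch(batch)  # the write overlaps with the reader's next fetch
t.join()

print(len(written))  # 5
```

The `maxsize` on the queue is the key detail: it gives you the overlap without reintroducing the unbounded buffering this whole article is arguing against.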