Converting a MySQL dump to CSV files

I needed to migrate a set of tables from MySQL/MariaDB to a different SQL database recently. My source was not a live connection to the database, but a SQL dump created by the mysqldump backup utility.

Converting a MySQL dump to DDL and CSV files

The trouble with such SQL dumps is that they are intended solely to be consumed by MySQL during a restore operation. As a consequence, these files contain a lot of MySQL-specific syntax to make the restore as efficient as possible; compatibility with other SQL systems is not a priority. There was no way to feed the dump file into my target database directly.

The file was 30GB gzipped, which translated to roughly 300GB of uncompressed text.

In this situation I had two choices:

  1. set up a local MySQL instance, restore from dump, then migrate tables over using a script
  2. convert the SQL text into a commonly importable text format like CSV

I was working on a remote Linux server. My account had no root/sudo privileges, so setting up a local MySQL instance would have meant jumping through some hoops. I decided to try option 2 first.

A MySQL dump file contains a header with preliminaries like a CREATE DATABASE statement, followed by all tables of the database. Each table is represented by the following sequence of commands:

  • One DROP TABLE IF EXISTS statement
  • One CREATE TABLE statement
  • Zero or more extended INSERT statements

Some additional non-data statements, such as table locking, bracket the data inserts.

Have a look at the MySQL dump of the mondial database to get the idea. It looks something like this:

-- MySQL dump 10.13  Distrib 5.5.25a, for Win64 (x86)
--
-- Host: localhost Database: mondial
-- ------------------------------------------------------
-- Server version 5.5.25a

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
... more hints in comments here ...

drop database if exists mondial;
create database mondial;
use mondial;

--
-- Table structure for table `borders`
--
DROP TABLE IF EXISTS `borders`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `borders` (
`Country1` varchar(4) NOT NULL DEFAULT '',
`Country2` varchar(4) NOT NULL DEFAULT '',
`Length` int(11) DEFAULT NULL,
... more field properties here ...
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `borders`
--

LOCK TABLES `borders` WRITE;
/*!40000 ALTER TABLE `borders` DISABLE KEYS */;
INSERT INTO `borders` VALUES ('A','D',784),...,('ZW','Z',797);
/*!40000 ALTER TABLE `borders` ENABLE KEYS */;
UNLOCK TABLES;
... more tables follow ...

I decided to try and slice this file into individual pieces: one global header file, plus a pair of files for each table, a SQL file containing the table definition statements and a CSV file containing the data.

You can count on awk being available as a text processing tool in virtually any Linux environment, so I decided to give an awk-based approach a shot. I was hoping a few regular expressions might do the trick.

The awk file contains three major sections. The BEGIN section is executed first, then for each line of the input file, the body section is executed. Awk makes the current line available as $0. Finally, the END section is executed.

The strategy of the script is to save the first couple of lines into header.sql, then look for DROP TABLE IF EXISTS statements to pick up the name of the upcoming table and save its DDL into table_name.sql. Finally, each extended INSERT statement that follows is split into individual CSV lines and written to table_name.csv.
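Here is a minimal sketch of what such a split.awk can look like, following the strategy just described. It assumes gawk and a dump formatted like the mondial example above; the exact patterns are illustrative rather than a definitive implementation.

# split.awk -- sketch of the splitter described above (assumes gawk).

BEGIN {
    table = ""                    # name of the table currently being read
    header_file = "header.sql"    # global preliminaries go here
}

# A DROP TABLE line marks the start of a new table section;
# the table name sits between the backticks.
/^DROP TABLE IF EXISTS/ {
    match($0, /`([^`]+)`/, m)     # three-argument match is a gawk extension
    table    = m[1]
    ddl_file = table ".sql"
    csv_file = table ".csv"
    print $0 > ddl_file
    next
}

# Extended INSERT statements become CSV rows: strip the statement
# prefix and the trailing ");", then split the value tuples on "),(".
table != "" && /^INSERT INTO/ {
    line = $0
    sub(/^INSERT INTO `[^`]+` VALUES \(/, "", line)
    sub(/\);$/, "", line)
    n = split(line, rows, /\),\(/)
    for (i = 1; i <= n; i++)
        print rows[i] > csv_file
    next
}

# Everything else: table DDL once a table section has started,
# otherwise the global header (CREATE DATABASE, charset hints, ...).
{
    if (table != "")
        print $0 > ddl_file
    else
        print $0 > header_file
}

END {
    # output files are closed automatically when gawk exits
    print "split complete" > "/dev/stderr"
}

Note that the CSV produced this way keeps MySQL's single-quoted strings as-is; depending on the target database you may need to post-process the quoting.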

If you’d like to run the splitter on a sample file yourself:

  • get the mondial dump
  • make sure you have gawk installed (brew install gawk if you’re on macOS; the awk that ships with macOS is not compatible)
  • install the dos2unix utility: it converts the Windows-style \r\n line endings in the mondial dump to Unix-style \n line endings

Assuming you then have split.awk and mysql-mondial.dmp in your current directory:

# create a directory for the result files and change into it
$ mkdir mondial
$ cd mondial
# pipe the dump through dos2unix and into gawk
$ cat ../mysql-mondial.dmp | dos2unix | gawk -f ../split.awk
# inspect the output files
$ ls
borders.csv is_member.csv
borders.sql is_member.sql
city.csv island.csv
city.sql island.sql
continent.csv lake.csv
continent.sql lake.sql
country.csv language.csv
country.sql language.sql
desert.csv located.csv
desert.sql located.sql
economy.csv merges_with.csv
economy.sql merges_with.sql
encompasses.csv mountain.csv
encompasses.sql mountain.sql
ethnic_group.csv organization.csv
ethnic_group.sql organization.sql
geo_desert.csv politics.csv
geo_desert.sql politics.sql
geo_island.csv population.csv
geo_island.sql population.sql
geo_lake.csv province.csv
geo_lake.sql province.sql
geo_mountain.csv religion.csv
geo_mountain.sql religion.sql
geo_river.csv river.csv
geo_river.sql river.sql
geo_sea.csv sea.csv
geo_sea.sql sea.sql
header.sql

Recklessly splitting SQL files like this is not without caveats. If the table data contains text fields that happen to contain the character sequence ),( then awk will split that field across multiple CSV lines. You may be tempted to come up with a better regular expression for the split pattern, one that takes into account the possibility that ),( might appear inside a quoted string. But then you’d face a whole new set of issues: to reliably tell whether you’re within string boundaries, you’d have to deal with the way quotes are escaped within strings. Had I encountered that issue, I’d probably have tried a different approach. Luckily my dataset did not contain such strings, and I could get away with it.
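To make the failure mode concrete, here is a made-up row (the notes table is purely illustrative) that would have tripped up the naive split:

INSERT INTO `notes` VALUES (1,'see (a),(b) for details'),(2,'fine');

Splitting on ),( turns these two rows into three broken fragments:

1,'see (a
b) for details'
2,'fine'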

I converted the DDL to be compatible with the target database and took some final steps to automate the whole process, so that I could migrate a SQL dump with a single command. That effort paid off, because multiple versions of the MySQL dump had to be migrated over the course of the project.

If you’re reading this, you’re probably trying something similar, and I hope the awk script above is of some help.

If you’re looking for professional help on a data-migration project, get in touch!