
Upgrades

26 June 2013

Today I thought I would start to catch everyone up on what’s going on now, rather than staying in the past, and write a bit about how I am doing the upgrades of hundreds of MySQL 5.0 boxes. Well, let me tell you, it’s not all sunshine and rainbow farts.

Let’s start with what didn’t work… binary upgrades. Let me start by saying that this may be a feasible option for you if you only have a handful of servers and databases and are running a stock build of MySQL, but for me it would not work. The 5.0 binary I was working with was in fact a custom build with a subset of Percona patches pulled in, while our 5.1 binary is a full-blown Percona 5.1 build. With the hundreds of servers we had to upgrade, it was not reasonable to accept an unknown number of failures that would require manual intervention deviating from a standard process. In fact, the DBA before me had tried this, and on the first production box we attempted it on, we hit a bug where temp tables were not cleaned up on shutdown, which offset every InnoDB tablespace ID and caused mass corruption. That was a three-day recovery…

So, armed with this, and coming into this “new role,” I searched and searched for a reliable way to make binary upgrades happen safely: using intermediate versions… bringing the new box up in InnoDB recovery… and other things that are escaping me at the moment. (Oh, and did I mention that the existing backup solution was unreliable at best, the slaves had no verification of data drift, and there was no mechanism for file system snapshots in place?) What I learned was that this is just flat-out unsafe; in fact, the docs explicitly recommend against it! So it was decided that the upgrade process would be a dump and import, but how can we do that with hundreds of servers, each containing hundreds of databases, while ensuring up-time and consistency?

Fortunately for me, around the same time, we had also decided to convert our slaves to VMs and consolidate them three to a box. Doing this freed up roughly a third of the hardware, which came in very handy when it was decided that our process would be to stand up a new 5.1 box, dump all of the 5.0 data into it, and promote it to be the new master. I had all the hardware I needed to play with.

Let me give you a bit of an idea of the process that was used:

  • kick a new HW box with MySQL 5.1 (with LVM)
  • set up the new 5.1 box as a slave of the 5.0 master, ignoring all DBs
  • lock a user out of their DB on the 5.0 master
  • stop the slave thread on 5.1
  • dump the given DB
  • unlock the user
  • import the DB
  • remove the ignore-db line
  • restart MySQL 5.1
  • rinse and repeat through all the databases on the server (this was all scripted; a rough sketch follows below)
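
To give an idea of what that scripted loop could look like, here is a minimal bash sketch. The hostnames, dump options, and the user-lockout step are placeholders for illustration, not our actual production script.

    #!/bin/bash
    # Rough sketch of the per-database dump/import loop; hostnames and
    # credentials are placeholders, and the user lockout was application-specific.
    OLD_MASTER=old-50-master.example.com   # existing 5.0 master (placeholder)
    NEW_MASTER=new-51-master.example.com   # new 5.1 box slaving off the 5.0 master

    for DB in $(mysql -h "$OLD_MASTER" -N -e "SHOW DATABASES" \
                | grep -vE '^(mysql|information_schema)$'); do
        # 1. lock the application user out of their db on the 5.0 master
        #    (application-specific, left as a comment here)

        # 2. stop the slave thread on 5.1 so the import is not racing replication
        mysql -h "$NEW_MASTER" -e "STOP SLAVE"

        # 3. dump the given db from 5.0
        mysqldump -h "$OLD_MASTER" --routines "$DB" > "/tmp/${DB}.sql"

        # 4. unlock the user on the 5.0 master (application-specific)

        # 5. import the db into 5.1
        mysql -h "$NEW_MASTER" -e "CREATE DATABASE IF NOT EXISTS \`${DB}\`"
        mysql -h "$NEW_MASTER" "$DB" < "/tmp/${DB}.sql"

        # 6. remove this db from replicate-ignore-db in my.cnf on the 5.1 box
        #    and restart MySQL 5.1 so it starts replicating the imported schema
    done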

Once all the data was copied, pt-table-checksum was used to verify the consistency of the new master. Reconciling any issues was a different story, though. That part was very open ended and ranged from manual corrections, to re-dumping and importing the database, to pt-table-sync (which, by the way, I dislike because by default the tool overwrites your *known good* master’s data in order to replicate the fix through to the slave, so you can potentially destroy your good data; if you use it, use --print and throw the output to a file to run directly on the slave). Then we just wait until maintenance time…
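
To make the verification step concrete, here is a hedged sketch of how pt-table-checksum and pt-table-sync with --print might be invoked; the hostnames are placeholders and the exact options will depend on your environment.

    # Checksum from the 5.0 master (the 5.1 box is replicating from it, so the
    # tool can compare the two).
    pt-table-checksum h=old-50-master.example.com

    # Generate the fix-up statements rather than letting pt-table-sync execute
    # them through the master:
    pt-table-sync --print --sync-to-master h=new-51-master.example.com \
        > /tmp/sync-fixes.sql

    # Review /tmp/sync-fixes.sql, then run it directly on the box that is wrong.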

Before the maintenance work, let me explain one more thing about how we run MySQL. A MySQL pair is set up with three IPs: the master primary, the slave primary, and a floating “virtual” IP (VIP) as a secondary. All of the traffic is routed to the VIP, and the primary IPs are used only for replication and management. ONWARD!

  • flip the VIP to the 5.0 slave in read-only
  • verify the slave thread on 5.1 is caught up
  • reset slave on 5.1
  • run mysql_upgrade (needed to fix permissions and UDFs since the mysql DB was imported from the 5.0 schema)
  • reset master
  • shut down MySQL 5.1
  • take an LVM snapshot
  • start MySQL
  • flip the VIP to 5.1 (a sketch of these scripted steps follows the list)
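
Here is a rough sketch of what the cutover steps on the 5.1 box could look like; the volume group, LV names, and snapshot size are made up for illustration, and the VIP flips themselves are environment-specific so they are left out.

    # make sure 5.1 has applied everything from the (now read-only) 5.0 master
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Exec_Master_Log_Pos'

    # drop the replication config and fix up the mysql schema / UDFs
    mysql -e "RESET SLAVE"
    mysql_upgrade

    # start the new master's binlogs from scratch for the slave we build later
    mysql -e "RESET MASTER"

    # cold LVM snapshot of the datadir, used later to seed the new 5.1 slave
    /etc/init.d/mysql stop
    lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/mysql-data
    /etc/init.d/mysql start

    # ...then flip the VIP over to the 5.1 box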

At this point, the impacting portion of the maintenance was over; all that was left was to set up a new 5.1 slave!

  • re-kick the old 5.0 slave as 5.1 (with LVM)
  • rsync the data from the 5.1 snapshot created during the maintenance
  • change master to …
  • start slave
  • go back and delete the LVM snapshot from the 5.1 master (see the sketch after this list)
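
For completeness, here is a hedged sketch of seeding the re-kicked 5.1 slave from that snapshot; hostnames, volume names, and the replication credentials are placeholders.

    # on the 5.1 master: mount the snapshot taken during the maintenance
    mount /dev/vg0/mysql-snap /mnt/mysql-snap

    # on the new slave: pull the datadir across and bring MySQL up
    rsync -av root@new-51-master.example.com:/mnt/mysql-snap/ /var/lib/mysql/
    chown -R mysql:mysql /var/lib/mysql
    /etc/init.d/mysql start

    # point it at the new master; RESET MASTER ran just before the snapshot,
    # so replication starts at the beginning of the first binlog
    mysql -e "CHANGE MASTER TO
                MASTER_HOST='new-51-master.example.com',
                MASTER_USER='repl',
                MASTER_PASSWORD='...',
                MASTER_LOG_FILE='mysql-bin.000001',
                MASTER_LOG_POS=4;
              START SLAVE;"

    # back on the master: drop the snapshot once the slave has caught up
    umount /mnt/mysql-snap
    lvremove -f /dev/vg0/mysql-snap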

DONE!

Now, this may seem like a fair amount of work, and it is, but it’s much less work than trying to go through a recovery with a bunch of unknowns, and it leaves our old 5.0 master completely untouched should the need for a rollback or any sort of recovery arise. This also gave us the chance to resolve some technical debt, such as servers that were set up without innodb_file_per_table, where the ibdata1 file was unnecessarily taking up large amounts of disk space.
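
As a quick illustration (not part of the upgrade script itself), this is the kind of spot check you could do on an upgraded box to confirm that debt is actually gone:

    # file-per-table should now be on, and ibdata1 should no longer be eating the disk
    mysql -e "SHOW VARIABLES LIKE 'innodb_file_per_table'"
    ls -lh /var/lib/mysql/ibdata1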

I would also like to note that, with a slight modification to the process, this could be a nearly zero-downtime maintenance, losing only current sessions. The differences are that you would run mysql_upgrade and take your LVM snapshot prior to the maintenance and would not reset the slave. Then, when MySQL comes back up, you would need to re-import all MEMORY tables and flip the VIP straight to 5.1. It is completely reasonable to do a no-downtime upgrade, though we decided that the extra effort of finding all of the MEMORY tables and ensuring their consistency at the last minute wasn’t worth the work; five minutes of read-only time was acceptable.
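
If you did want to go the near-zero-downtime route, finding the MEMORY tables ahead of time is straightforward; a sketch (the schema filter is just an example):

    # list the MEMORY tables that would need to be re-imported after the restart
    mysql -N -e "SELECT table_schema, table_name
                 FROM information_schema.tables
                 WHERE engine = 'MEMORY'
                   AND table_schema NOT IN ('information_schema', 'mysql')"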

So, now that you know the process, and you know the volume of servers to upgrade, how many do you think we did a week? 1? 3? 5? 10? 20?… All wrong! At the end of our first DC, we had done 42 servers in one day… with two of us. Now, in all fairness, the lesson we learned is that 42 a day is unsustainable; that many irons in the fire is too many for two people to track. What is sustainable is closer to 5-10 per person. And that brings us to where we are today: a third of the infrastructure is upgraded with two-thirds to go, and as soon as the slaves have been consolidated for the remaining boxes, we will be back to 2-3 of us doing 5-10 boxes per person, per day, Monday through Thursday.

So there you have it: how we do upgrades on a large MySQL environment.

Let me know if you have any thoughts, questions, ideas!

Scott
