Opteron HOWTO

From Hashmysql
Jump to: navigation, search

Contents

Overview

Opterons are currently by far the most popular 64 bit plaforms for MySQL, with a large number of people using them successfully.

Currently some people running MySQL on the Opteron have problems, while others have none. Discovering known issues, selecting the best Operating system, tuning parameters, they all become issues.

This is a central HOWTO for building an optimal MySQL install on Opteron and EM64T platforms.

Currently this HOWTO is designed for Linux but hopefully it can be updated for other platforms (*BSD?) in the future.

If you add experience reports, please uniquely identify them so they don't give the impression that everyone is experiencing the same problems, or even any problems. Reports of systems running without problem, likely the vast majority, are welcome, along with any tips about operating system versions you had problems with. If you have experienced a problem on a system with an Opteron CPU but haven't also tried an effectively identical system with an alternative CPU type, to show that there's at least some chance that it's CPU-related rather than OS or general hardware, please say so, to help people identify proved AMD64 issues rather than general issues which might not be due to CPU type.

Why 64bit?

32 bit processors have a maximum 2G memory limit. While you can physically install more memory into a 32bit box due to PAE extensions only one process can see 2G. This means that even if you have an 8G box that your mysql process can only see 2G.

While there are config options/patches to recent 2.6 kernels to support 3G per process this is really just postponing a painful migration.

Running a 64bit configuration means you're on hardware that allows you to grow by simply adding more physical memory.

EM64T Overview

Opteron Overview

Hardware Vendors

Operation System Choices

Debian

Debian's 64bit support seems to be a moving along nicely. The apt line is outside of the official sources but currently 95% or so of packages build fine.

One downside is that CVSup doesn't appear to work on Opteron just yet so if you're using this for cluster version control you may be out of luck.

Dan Udey, myVirtualCard

Here's our story. After a database crash due to faulty hardware, we invested in a dual-Opteron machine, an HP DL385. The machine has been incredibly stable, and my only complaint is that we have a lot of wasted CPU - the machine is more powerful than we need. Our report generation cronjobs which used to take an hour on a dual-Xeon 3.06 now take less than five minutes (likely due in large part to a faster disk array as well).

The server setup was the same as any other Debian install, with the exception of my favourite repository (the one run by our colocation provider) not having AMD64. Debian-AMD64 currently has glibc-2.3.5, and appears to be working great so far. I would like very much to upgrade us to MySQL 5.0.x from Debian's Testing distribution, but we would like to do more tests before we perform such a major change. With the incredible performance from this system, however, restoring from backups, which takes over an hour on our development machine, would be a trivial task on our new machine.

If you are a Debian user and are thinking about purchasing an Opteron system, don't let the 'development' status of the port hold you back. For MySQL use, Opteron machines are absolutely the way to go, and I wouldn't even consider going back to 32-bit. It just doesn't make sense.

Just don't do like I did and accidentally grab the x86 Netinstall CD from the binder instead of the x86-64 CD. Day of work down the tubes.

Suse

{{{ Dathan Pattishall on the MySQL mailing list wrote:

Suse 8.1 2.4.21-215-smp #1 SMP Tue Apr 27 16:05:19 UTC 2004

Suse 9.x is messed up (paraphrased)

We run RedHat with a Suse Kernel and pure Suse.

}}}

Gentoo

Gentoo's amd64 / x86_64 port has been very solid for me (despite the fact I am a gentoo n00b). Releases 2004.3 and earlier were a bit of a chore, but by the 2005.0 release (available for some months now), Everything Just Works. We've got several dual Opterons, most with 4GB of RAM, 2.6.11.N kernels, glibc 2.3.5, gcc 3.4.3, MySQL 4.0.24.

Fedora Core 1

Greg/meetup

Running latest kernel (2.4.22-1.2199.nptlsmp). This release performed well, but to avoid immediate deadlock, we had to run w/ NPTL disabled (added LD_ASSUME_KERNEL=2.4.19 to init script). We still see large IO operations (deletes in large tables and table alters) running 2 to 3 times slower than our slower/older Xeons.

WARNING!: Under this config, we experienced data loss when running REPAIR table after a large DELETE operation. See [Bug 10437]

  • UPDATE:

[We] are now running under FC1 but using a custom compiled kernel (2.4.30 w/ no patches). We are also using our own compiled version of mysql and our performance is blazing. However, we still see the above [bug ] if we use REPAIR TABLE tablename; from mysql client (myisamchk is safe).

Our compile flags for mysql (4.1.11):

 CFLAGS="-O2" CXX=gcc ./configure --prefix=/usr/local/mysql --with-extra-charsets=complex \
 --enable-thread-safe-client --enable-local-infile --disable-shared --enable-assembler \
 --with-client-ldflags=-all-static --with-mysqld-ldflags=-all-static
 compiler: gcc (GCC) 3.3.2 20031022 (Red Hat Linux 3.3.2-1)
 libs: glibc 2.3.2

Jamesday/Wikimedia

Note that some builds of FC1 experience random disk corruption under high load, while others don't, due to operating system bugs. Unless you like chance, try a later OS version... FC2 and FC3 mostly worked well for us.:)

Fedora Core 2

Greg/meetup

This install did not perform well at all. Replication typically lagged behind our much slower/older Xeon replicants, large IO operations (large deletes and alters) were typically between 3-4 times slower than our much slower/older Xeon replicants (w/ all traffic disabled, so no potential contention), and we experienced frequent index corruptions after alters or large updates (requiring us to shut down and repair w/ myisamchk). In addition, we experienced 100% reproducable index corruption when running myisamchk on tables w/ FullText indexes. See [Bug 7437]

Jamesday/Wikimedia

We operate roughly four dual Opterons on FC2 in a replication setup without major issues. Currently using MySQL 4.0.22. Issues we've seen include:

  • Late FC2 builds have a bug in kswapd which causes it to eat all CPU in system time if the system gets close to swapping when swap is turned off... workaround was to turn on swap and decrease memory use. Will eventually switch to an FC3 build to avoid the issue. Earlier FC2 builds don't have the issue - it spans only late FC2 and early FC3.
  • One system shows corrupton in the relay log due to an apparent operating system bug. Probably present in only some builds and anotherwise identical system with different disk controller doesn't have the problem (with is SATA, without is SCSI). Easy to work around by having the slave get the logs from the maser again and we have a tool which does it automatically. We'll switch the box to FC3 at some point.

Fedora Core 3

Fix for incorrect perl package

Richard Dale / Norgate Investor Services

If you install FC3 x86-64 and install the Perl packages, you need to do the following to counteract a big in the distribution:

 yum remove perl.i386

if you do not, you cannot use up2date/yum to upgrade perl to the latest release (it will show conflicts).

More info on this mistake at: http://fedoranews.org/blog/?p=631

FC3 with MySQL 3.23

FC3 has MySQL 3.23 bundled with it. All interfaces for PHP v4.3 are also provided.


FC3 with MySQL 4.1

If you want to install MySQL 4.1.12, download the 4 RPMs from mysql.com (client, devel, server, shared) then:

 rpm --nodeps -e mysql.i386
 rpm --nodeps -e mysql
 rpm -i MySQL*

After you have done this, go to /usr/share/mysql and copy one of he my-*.cnf files to /etc/my.cnf and edit/tweak according to how you want your server to perform.

If you use PHP v4.3 (as installed with FC3) then it requires the MySQL v3.23 interface libraries (aka libmysqlclient.so.10). By removing the pre-installed MySQL it will remove these libraries. As of 31 May 2005 For some reason these were not compiled into the MySQL 4.1.12 shared library RPMs as downloaded from the MySQL site. However, they are available from any Fedora mirror in the FC4 testing area. You will need to install a package called mysqlclient10-3.23.58-6.x86_64.rpm - I found such a package at my local mirror here: http://mirror.pacific.net.au/linux/redhat/fedora/test/3.92/x86_64/os/Fedora/RPMS/mysqlclient10-3.23.58-6.x86_64.rpm Once you have downloaded it then you need to install it:

 rpm -i mysqlclient10-3.23.58-6.x86_64.rpm

Jamesday/Wikimedia

A box with an early FC3 build experienced the bug in kswapd descriped under FC2. Later builds seem fine. We operate about 3 servers in a replication setup with FC3. Currently using MySQL 4.0.22.

Fedora Core 4

Richard Dale reports he is using FC4 X86-64 successfully, using the Fedora MySQL packages and all works fine. FC4 bundles MySQL v4.1 in there. At 16 Nov 2005, FC4 did not have any MySQL v5 updates in testing.

Redhat Advanced Server

{{{ Jeremiah Gowdy on the MySQL mailing list wrote:

I use Redhat Advanced Server v4 (2.6 kernel) on my four dual opteron systems. I've had no real performance issues with the I/O scheduler, but that's because I run 8GB of ram with a huge key cache.

}}}

{{{ Dathan Pattishall on the MySQL mailing list wrote:

RedHat AS is messed up

}}}

Ubuntu

Joe Kislo has reported some out-of-memory problems with his Ubuntua Hoary on AMD64 using the 64-bit GCC 4.0.24-standard MySQL binary, running 2.6.10-5-amd64-k8-smp. He has run into libc problems (and worked around them) - see pthread_rwlock_wrlock problem, which affected all Debian-based distros for longer than most.

Solaris

One of our (Wikimedia) database servers runs Solaris 10 on a Sun V40z. We've had no Opteron-specific problems with this using 4.1 and 5.0.17 (although a couple of apparently unrelated MySQL bugs in previous version were causing issues). Take note of forcedirectio in the Solaris notes (although I haven't done any testing with/without this option).

Compiler

GCC 3.4 or 3.3?

GCC 3.4 is more experimental but has better support for opteron and 64bit processors.

File system type

reiserfs - any comments?

ext3 - You can't use O_DIRECT on ext3 and a linux kernel 2.4 there is a bug in EXT3 when used under heavy load the volume will lock.


Linux Kernel 2.6 or 2.4

If you're running on Linux the kernel choice is a difficult one. The 2.6 kernel is more modern and supports more features but some people have reported better success with 2.4.

{{{ Dathan Pattishall on the MySQL mailing list wrote:

You can't use O_DIRECT on ext3 and linux kernel 2.4 there is a bug in EXT3 when used under heavy load the volume will lock.

}}}

Glibc Issues

There are some significant GLIBC issues which may be encountered on 64 bit platforms.

pthread_rwlock_wrlock hang with nptl

Some distributions have a bug in pthread_rwlock_wrlock which causes server hangs when nptl threading is used. There are two MySQL bug reports about this, 8555 and 7254. Reported on both Opteron and Xeon EM64T.

In Feb 2004 a patch for pthread_rwlock_wrlock was submitted to Redhat and incorporated almost immediately.

Other distributions, notably Debian, have not yet incorporated this patch. If you know the first version of each distribution / glibc which includes this bug fix please note it here to help people to get only working OS versions.

  • Debian: wasn't included in Sarge, so avoid the Sarge/NPTL/64 bit combination. Bug reported present in at least libc6 2.3.2.ds1-22 with Xeon EM64T w/ 32G ram. Fixed starting from Etch.
  • Redhat: Given the patch was sent in Feb 2004, and libc v2.3.4 was released on Jan 2005, it's safe to assume any update after Jan 2005 would have incorporated this version.
  • Ubuntu Hoary Hedgehog: apparently fix work in progress [1], bug present in glibc2.3.2.ds1-20ubuntu13 , presumably still bugged because it's Debian-based.
  • SuSE: apparently OK
  • Fedora Core 4: As at Sep 2005, FC4 is using glibc v2.3.5 (glibc-2.3.5-10.3) so is OK.


Many people have reported issues with NPTL which might be related in bugs with < 2.3.4 versions of glibc:


{{{

Another point I wanted to note.

What version of glibc were you running. We were running Debian with glibc 2.3.2 (libc6-i686-2.3.2) and were running into deadlocks with another piece of code.

2.3.2 has a number of known issues and we had to migrate to an experimental 2.3.4 build. I've been considering moving our databases to 2.3.4 but they weren't having any problems.

It might be that opteron is raising these issue more than Xeon.

}}}


pthread_rwlock_wrlock debian-fix

Added by Malte Hübner on 2006/01/27:

The description in the next section (setting LD_ASSUME_KERNEL=2.4.19) did not work for us. But I have another solution for this problem. When using sarge stable you should do the following:

Do an

apt-get update

Then edit your /etc/apt/sources.list and change stabe to testing. IE comment our

#deb http://ftp.de.debian.org/debian-amd64/debian/ stable main

and add

deb http://ftp.de.debian.org/debian-amd64/debian/ testing main

Then save your sources.list. Next do an

apt-get update
apt-get install libc6

which will only install glibc2.3.5 (where the mysql-locking bug is fixed) and 2 other components

Then remove "testing" again from your sources.list-file so that your system is based on stable like it was before. Next do an

apt-get update

Now you are finished and the locking-bug should be history! :-)

Threading Model

Linux supports both NPTL and LinuxThreads threading models. Some people have reported problems with NPTL even though MySQL recommends using NPTL:


{{{

We have seen the exact same thing here. We used the deadline scheduler and saw an immediate improvement. However, we still saw much worse performance on our Opteron's (compared to our older Xeon boxes). We ended up rolling back to Fedora Core 1 2.4.22-1.2199.nptlsmp kernel and shut down NPTL and now our Opteron's are much much faster than our Xeons.

The thing I find strange about this is that our experience (@ Meetup) seems to match that of Friendsters (I know of a few other high traffic sites that have mentioned similar issues), in that Mysql on Opteron and Linux 2.6 is not a good solution. Yet, Mysql recommends exactly this config and in fact, does not seem to even support (via support contract) a 2.4 solution for Opteron + Mysql.

}}}

You can disable NTPL by adding LD_ASSUME_KERNEL=2.4.19 to your init scripts.

Disk IO

RAID Config

{{{

>I know of a site that encountered a similar performance issue: >> The OS was reading in a lot more data from the disk than the >> database really needed. >> >> The culprit turned out to be the stripe size on a 4-disk RAID. >> By reducing the stripe size from 768K to 32K, they obtained a >> 200% increase in mysql throughput.


Hi,

This is actually interesting point, as we typically recommend large stripes with MySQL (RAID 10 best)

This may sounds like contradiction but it is not. You need to have large stripe set (256-1024K+) but small RAID controller cache line (16K for Innodb tables)

The thing is by default many RAID controllers would put cacheline size = stripe size, some may not even allow to change it.

If it is the case MySQL will have to read a lot of unnecessary data which will kill performance.

}}}

A somewhat related story from a true database crash scenario on LiveJournal.

Disk cache issues: We have battery-backed RAID cards with write-back caches. That means the RAID card immediately acknowledges writes and tells the OS (and thus DB) that they're done immediately, before they're on disk. This speeds up the DB. But if you lose power, those writes would normally be lost, which is why we have battery-backups on all the cards, and we even monitor the battery health w/ our automated checks. But unknown to us, the raid cards didn't disable the write caching on the drives themselves.... which is frickin' useless! If the controller is already lying to the OS (but doing it safely!) why should the disks behind the controller also lie, but unsafely, for minimal benefit? Our bad there. We should've had that right. So a couple machines were just gibberish afterwards and had to be restored from backup and had their binlogs replayed to catch the backups up to present.

So, disable disk caching behind all our RAIDs. (bleh... wanna kick ourselves for this, but also the raid vendors for even defaulting and/or allowing it in a BBU write-back setup) but also testing all existing and new hardware to make sure data makes it to disk and pulling power in the middle of write-heavy operations, then verifying the resulting disk image later with the expected result.

IO Scheduler

The Anticipatory IO scheduler is what Linux uses by default the the Deadline IO scheduler seems to have better overall performance on MySQL (about 20% or so).

You can change your scheduler by adding:

elevator=deadline

To your Linux kernel init commandline (via grub or lilo) and then rebooting.

{{{ Dathan Pattishall on the MySQL mailing list wrote:

2.6 the IO sceduler is still messed up.

}}}

However, this opinion differs with Jeremiah Gowdy:

{{{ Jeremiah Gowdy on the MySQL mailing list wrote:

I use Redhat Advanced Server v4 (2.6 kernel) on my four dual opteron systems. I've had no real performance issues with the I/O scheduler, but that's because I run 8GB of ram with a huge key cache.

}}}

Random Bugs

Random bugs to watch out for... (which hopefully will be fixed)

  • Under 2.6, if we took the db offline and ran myisamchk on a table w/ fulltext indexes, and then started back up again, the table would nearly instantly crash (upon first writes to it). Running repair table would seg fault. Shutting down to run myisamchk would only cause the table to crash again upon 1st write. Only solution ... alter table tablename engine=myisam; Then the table would run fine. We have since dropped all fulltext indexes and moved to Apache Lucene (much more flexible and waaaaay faster anyhow).
  • Under 2.4 (just happened to me tonight and this is a scary one), we routinely archive and cleanup large tables w/ seldom used old data. After doing a "DELETE FROM table WHERE ctime < '2005-05-01'", we would see a select count(*) show around 160k rows remaining (from 1st of the month). I would call repair table on the table, and the remaining rows would be deleted. Repair would make mention of dropping row count from 165k to 0. Yikes! This happened on both Opterons and did not happen on the Xeons (thank god ... was able to save the data).

Significant Opteron implementations

The following sites have successfully implemented Opteron systems with MySQL:

Norgate Investor Services has 3 backend database servers running MySQL with dual opteron 250s, 4GB RAM, 8 x 15K RPM SCSI disks in RAID10, Intel SRCU42X PCI-X RAID controller (this is equivalent to the LSI Megaraid 320-2X controller), battery-backed SCSI controller cache, Fedora Core 4 X86-64 with kernel 2.6.14. This is used to service their financial database requirements (approximately 90GB of stock, futures and forex market prices). Happy to answer any questions about it via email - richard - premiumdata.net or on the MySQL mailing lists.

Friendster Inc. has around 200 servers, at least 4GB of memory, running in 64 bit mode. Around 30 databases running, dual opterons with 8GB of RAM, connected to Hitatchi 9980 SAN through McData Switch. Around 70K qps at peak, around 1 billion queries per day on 30 servers. Dathan Pattishall is the Senior Database Engineer there.

Wikimedia (they run Wikipedia) has 7 database servers, all of them run dual Opterons with 4-16GB RAM and with either SCSI RAID (10 and 0) or SATA RAID (10 and 0). 4 of them run Fedora Core 2 (64 bit) and 3 run Fedora Core 3 (64 bit) - current OS versions. More info at: Wikipedia server info and an overview at Wikimedia Hardware info. Additional info on their LSI MegaRaid SCSI RAID controller configuration is at Wikimedia Ariel Database server configuration. We see occasional crashes and the known issues described under some OS builds with known bugs but are happy and still buying Opterons. Simply, they get the job done without us seeing major issues related to the CPU - the beta OS issues have been more painful. Peak loads in June were around 11,000qps for a few minutes, more typically around 5,000. Daily average for the master server is around 2500qps.

Benchmarks

Benchmarks of running Opteron systems (hopefully vs other platforms)

Personal tools
Development Guides
Documentation
Installing MySQL
Other Information