Getting the Most out of your Database with Ruby on Rails

alt

Ruby on Rails makes it easy to get your app up and running. However, there are several ways to make it it even better. This article outlines tips, tricks, and previously hidden gems that help you create a secure, future-proof, and high performance Rails-based application.

Configuring your database connection

Managing your database connection goes beyond credentials and a hostname. We can use additional configuration settings to tune how our application integrates with our database:

Default encoding : Most application that I've built need to use UTF-8 to ensure we can handle any character set used during data entry. A simple configuration setting will ensure data stored and retrieved is properly encoded.

Connection pool size : The pool size defines the maximum number of database connections managed by the Active Record database connection pool. If all connections are used, the incoming request must wait until a connection is placed back into the pool before proceeding. Therefore, it's important to tune this value to the maximum number of threads you'll use.

I use Puma to improve the performance of my Rails and Rack-based apps. Puma uses a default of 16 threads, so I use this as my connection pool size so every thread will be able to process an incoming request, without being forced to wait on an available connection.

Secure connections : This setting ensures our client connects using a secure connection, if available on the server. Use sslmode: prefer to first try SSL and then fallback to plain text; use sslmode: require to fail if SSL isn't supported by the server.

Example PostgreSQL database.yml:

development:
  adapter: postgresql
  encoding: utf8
  database: sample_postgres_development
  pool: 16
  sslmode: prefer
  username: sample_postgres
  password: mypassword

Example MySQL database.yml:

development:
  adapter: mysql2
  encoding: utf8
  database: sample_postgres_development
  pool: 16
  sslmode: prefer
  username: sample_mysql
  password: mypassword

The Configuring Rails Guide provides insights for all available options. More details for configuring a secure connection can be found for both PostgreSQL and MySQL.

Configuring Active Record

Once we have our connection configured, we then want to control Active Record's settings. I tend to focus on the following settings:

Default Timezone : Unless there's a specific reason to change this, I prefer to default to UTC and then convert to the local timezone of the user or the system.

Concurrency Locking : Database clients must select a strategy on how they want to retrieve and lock data. The default for Rails is optimistic locking, meaning database connections are allowed to read a model at the same time, but only one connection is allowed to write a change using the lock! method. Pessimistic locking restricts other connections from being able to read the model after locking it, requiring them to retry the read until the lock has been released and reading is allowed. I tend to default to optimistic locking, however, knowing that this option is available to provide flexibility in how you build your Rails app. This guide from Chien Tran provides more insight into when to use optimistic vs. pessimistic locking.

Here's an example snippet from my config/application.rb, which stores all times in UTC, defaults to the Eastern timezone when displaying the values, and applies optimistic locking:

config.active_record.default_timezone = :utc
config.time_zone = 'Eastern Time (US & Canada)'
config.active_record.lock_optimistically = true

Additional settings are detailed in the Configuring Rails Guide.

Improve data quality using database migrations

The defaults provided by Rails generators don't have any context in how they'll be used. Instead, they provide a simple skeleton for how your table may look.

Below is an example of a migration that creates a new table called 'contacts':

create_table "contacts" do |t|
  t.integer "user_id"
  t.string "name"
  t.string "phone"
  t.string "email"
end

The model may look something like the following, including some validators to enforce our intended logic:

class Contact < ApplicationRecord
  validates :user, presence: true
  validates :name, presence: true
end

This model requires an associated user and a name to be provided when creating a new contact. However, this can be avoided by simply calling save with validation set to false:

contact.save(:validate => false)

While a handy way to fix certain situations, it has a downside. Our database could have data stored that's considered invalid by the application, but is valid according to the database schema. Instead, we should apply the proper options to our migrations to enforce our data requirements on the database.

Let's revisit our previous migration, adding explicit options to enforce our desired schema:

create_table "contacts" do |t|
  t.integer "user_id", :null=>false
  t.string "name", :null=>false
  t.string "phone", :null=>true
  t.string "email", :null=>true
end

If we use the same call to save the model while skipping validations, our database will enforce our data requirements. All database clients will be forced to abide by these rules, ensuring proper data integrity no matter what new applications talk to our database in the future.

Refer to the Rails Active Record Migrations Guide for more details on how to use Rails migrations.

Collapse database migrations for faster environment setup

From the time a new Rails app is started, database migrations become important. Migrations help developers setup the database tables needed for their application. Once an application goes live into production however, migrations are often used to make smaller changes to tables. Migrations are less about installing a new database as patching an existing database with new columns, tables, indexes, and other alterations. Over time, your migrations may grow from a small number to tens or hundreds of migrations. Not only does it take time to apply these migrations onto a fresh development or test database, it can add management overhead for developers.

Rails provides an easy way to apply the latest schema all at once using rake db:schema:load. The schema is generated after you run your migrations, or anytime you run rake db:schema:dump. The schema dump is stored in db/schema.rb and provides a complete representation of your database schema based on the latest migrations that have been applied. Using rake db:schema:load, you can restore any database to this latest schema. Running this command is much faster than applying differential migrations.

Note : Running this rake command will DROP AND RE-CREATE ALL TABLES. DO NOT RUN THIS IN PRODUCTION ONCE YOU HAVE LIVE DATA!

Also, be aware that any logic applied during your migrations, such as loading seed data for lookup tables, or performing data fixes/cleanup will not be applied. This leads us to the next tip...

Separate seed data and data fixes from migrations

Confession time: I've used migrations for other tasks, such as loading seed data alongside creating a lookup table and for "fixing-up" data that wasn't right due to a bug or wrong assumption. However, Rails migrations offer a DSL for creating, dropping, and modifying your database tables. That's really the intended scope. We see this as a result of the schema dump rake target that stores our database schema into db/schema.rb. Rails migrations are not for installing seed data or fixing data.

I'll repeat this again: Rails migrations are not for installing seed data or fixing data.

So, how do we install seed data? Use db/seed.rb, which offers a single way to install any kind of data you may require. Any code inside the file has complete access to your models, database connections, and other code. To call it, use rake db:seed.

Tip: I like to make my seed.rb script idempotent, meaning I can run it multiple times but still have the same result in the end. To do this, simply wrap your data loading logic with an if block that checks if the data already exists before installing it. This will prevent installing the same data multiple times.

Here's an example:

# db/seed.rb

admin_email = 'admin@mycompany.com'
admin = User.where(email: admin_email).first
if admin.nil?
  User.new(email: admin_email)
  admin.password = '12345' # that's the password for my luggage!
  admin.save!
end

For running one-off scripts that fix-up data, I use Rake itself. Developers can run the task as necessary to test out the logic, then run it once in staging and/or production to apply the change. Below is an example of a rake task:

# lib/tasks/db_fixes.rake

namespace :db do
  namespace :fix do
    desc "Fixes user accounts after xyz change"
    task :user_accounts_after_xyz_change => :environment do
	  # SQL-based and/or Active Record-based logic goes here
    end
  end
end

You can then run this rake task using rake db:fix:user_accounts_after_xyz_change.

Optimizing model-based queries with Active Record

Three of the most common performance problems with Rails applications are:

Lack of proper indexing : Any column used by queries that are executed often needs to be indexed by the database to provide the most performance. By default, primary keys are indexed. However, foreign keys and other columns are not. Be sure to index columns used in the WHERE clause of queries that are executed often or query tables with a large number of rows. Refer to the Rails Active Record Migrations Guide for more details on defining column indexes using migrations.

N+1 query problems : When rendering views of model associations, the default approach is to load the parent first, then each child individually. This is known as the N+1 query problem: for a parent with 200 children, you'll make 201 queries to the database. While a single query doesn't take long, the time to execute and process multiple queries adds up quickly. Eager loading, sometimes called "preloading", ensures that only 1 query is executed to retrieve parents and children. I strive to be thoughtful about the kinds of calls I make to my Active Record models, including the anticipated row count for each table and the number of children involved. Where appropriate, I use eager loading to optimize my queries.

Failure to batch load large result sets : If you're working with queries that could return hundreds, thousands, or even more results, it's important to remember that Rails has to allocate memory for a model to match each returned row. This could result in hundreds or thousands of objects created and data mapped for each row, causing your query to take seconds or minutes to process. Instead, consider using batch loading that comes with Active Record.

Use SQL with Active Record to tune hotspots

There may be times when you need to retrieve only a limited amount of data without the need for a model, creating custom reports using complex JOINs, or fine-tuning queries to improve performance. Active Record supports executing arbitrary SQL to meet your needs. Some of the methods I use often include:

#find_by_sql to execute a custom SQL statement and return instantiated models
#select_all to execute a custom SQL statement and obtain an array of hashes of results, rather than instantiating a new model instance for each row for faster performance or custom report generation
#pluck to fetch only specific fields from a database table, without the need to craft custom SQL

Security Note : When crafting custom SQL, you may open up opportunities for SQL injection attacks. To prevent this, use the #sanitize_sql method when generating your SQL statements that combine your SQL with request parameters or values stored in your database that may contain SQL injection attacks.

Refer to the Finding By SQL section of the Active Record Querying guide for more details.

Optimize bulk data loading processes

Inserting a few dozen rows into a database table doesn't take long. But what happens if you have hundreds, thousands, or more rows to load at once? Taking the standard Active Record approach of creating a model with your data and saving it seems simple. However, bulk loading data using this approach is far from optimal, as each save call is wrapped in its own transaction. Over time, the transactional overhead adds up, causing data to take minutes or even hours to finish loading.

To optimize bulk loading of data, consider the following approaches:

Right-sized transactional boundaries : The smaller the transaction (e.g. 1 row per transaction or 10s of rows per transaction), the more overhead we generate on the database server. The larger the transactional boundary (e.g. 1000s of rows per transaction), the larger our transaction log gets and the longer it takes the database to cleanup afterward. We need to find the right size for our transactional boundaries when we bulk load data. I've found that somewhere between 500 and 2000 rows per transaction is a healthy number, particularly for cloud environments where filesystem and block-based storage I/O is unpredictable.

Below is an example from Chris Heald on wrapping Active Record calls with transactions:

ActiveRecord::Base.transaction do
  1000.times { Model.create(options) }
end

Bulk INSERTs : The more often we have to go back-and-forth between our Rails code and our database, the longer the bulk import process will take. Using bulk insert support from within SQL can help reduce the number of round trips we take, speeding up our import process.

Vendor and third-party utilities : If you're reading this, you probably enjoy Ruby. However, for everything that I love about it, I know that it isn't the fastest execution speed. Coming from a C and Java background, sometimes we find out that the thing holding us back the most is the speed of our programming language or framework. Rather than taking a Ruby-centric view of bulk loading, consider looking for vendor-provided bulk loading tools to accelerate one-time or batch-based bulk imports. They often use native libraries that are highly optimized and take advantage of vendor-specific optimizations not immediately available from our database libraries and frameworks.

Putting it all together

While Rails offers us many conveniences and default settings, we need to be aware of areas that we should consider to fix our specific needs:

Ensure that you have properly configured your database connection to use the proper encoding, secure connections, and connection pool size
Store all timestamps in UTC unless there is a specific reason not to do so - Active Record will handle time zone conversions for you automatically
Use Rails migrations to enforce data quality inside and outside of your codebase
Optimize developer workflow by using schema loading rather than running 10s to 100s of migrations individually
Use seed data to install lookup table data and rake tasks to perform one-time data fixes
Use SQL to optimize important areas of your application, sanitizing your SQL statements to avoid SQL injection attacks
Use bulk INSERTs and vendor-provided CLI tools to accelerate bulk imports