Sitemaps are a must-have tool to get your sites properly indexed by search engines and to improve their ranking. In this post we cover how to create sitemaps for your Rails apps and host them on S3 if needed.
What is a sitemap?
A sitemap is an XML file that lists all the URLs on your site that you consider relevant for indexing. It also includes information like the last time a given URL was modified or how often it changes. Normally it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
    http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  ...
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
  <url>
    <loc>http://www.cookieshq.co.uk</loc>
    <lastmod>2015-06-12T12:08:23+02:00</lastmod>
    <changefreq>always</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- More URL definitions -->
</urlset>
At the top we have several sitemap schema declarations (shortened here), and after that come all the URLs to be mapped and indexed.
Generating this by hand would be cumbersome, and keeping last modification dates current or tweaking entries every time something is added to the site would be unsustainable. We need to automate this process.
Enter Karl Varga’s Sitemap Generator gem.
Using the gem
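Installing it is the usual Gemfile addition (no version pinned here; add a constraint if your project needs one):

gem 'sitemap_generator'

Then run bundle install.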
The gem documentation is pretty straightforward about features, installation and configuration. Once the gem is installed, run rake sitemap:install to generate a default config/sitemap.rb file you can edit. Below is a file we have already done some work on:
# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://www.cookieshq.co.uk"

# Pick a safe place to write the files
SitemapGenerator::Sitemap.public_path = 'tmp/sitemaps/'

SitemapGenerator::Sitemap.create do
  add clients_path
  add team_path
  add about_path
  add testimonials_path
  add contact_path
  add posts_path, changefreq: 'weekly'
  add login_path, priority: 0.0

  Post.find_each do |post|
    add post_path(post.slug), lastmod: post.updated_at
  end

  CaseStudy.find_each do |case_study|
    add case_study_path(case_study.slug), lastmod: case_study.updated_at, changefreq: 'never'
  end
end
The default host
As you can see, the first thing we do is set the host URL for our site, which will be the root of all the URLs contained in the resulting XML file.
The path where the sitemap is stored
After that, we set the path where the compressed XML file will be generated. By default it's the public folder, but we can set it to any other folder in the project, as long as we have write permission on it (more on this later). If you use the public folder, remember to add the name of the generated file to your .gitignore.
Adding links (finally!)
Then there's a series of static pages we want indexed: the FAQ, the login page, the Terms & Conditions, etc. These could also have been added as plain string paths, e.g. add '/faqs'. Note that you don't need to add the root_path, as the gem does it automatically for you.
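Both forms produce the same entry in the XML. A quick sketch (the faqs_path helper is an assumption; use whatever your routes define):

# Inside the SitemapGenerator::Sitemap.create block
add faqs_path # via a Rails route helper
add '/faqs'   # via a plain string path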
The post index path has its changefreq set to weekly, as we want to give crawlers and indexers a hint about how often that index is likely to change. If we published a new post every day, we could set it to daily.
On our login_path we've used the priority parameter and set it to zero: we still want it indexed, but treated as the least important page, so that more relevant pages appear first in search results. If we didn't want it indexed at all, we'd simply remove it from the generation code.
The last two additions are more interesting, as they relate to indexing dynamic content. On both our Post and CaseStudy models we have set up a string field named slug that is used in the URL, so instead of http://www.cookieshq.co.uk/posts/1234 we have http://www.cookieshq.co.uk/posts/sitemap-generation-hosting. To get the posts and case studies indexed correctly, we add the URL for each record by its slug. We also pass the lastmod parameter, so indexers can skip a URL that was indexed before and hasn't changed since.
Additionally, we've set the changefreq to never on case studies, as once a case study is published, it's unlikely to change.
Those entries generate XML like this:
<!-- ... -->
<url>
  <loc>http://www.cookieshq.co.uk/case_studies/gap-medics</loc>
  <lastmod>2015-06-01T15:59:52+00:00</lastmod>
  <changefreq>never</changefreq>
  <priority>0.5</priority>
</url>
<!-- ... -->
Generating the sitemap
The gem offers a series of tasks to create your sitemap:
- rake sitemap:create and rake sitemap:refresh:no_ping do the same: run sitemap.rb and generate the compressed XML file under the folder specified in the public_path attribute.
- rake sitemap:refresh does the same as the previous ones, but it will also ping the Google and Bing search engines so they know to fetch your newly created sitemap and update their indexed information about the site. You can ping other search engines as well, as stated in the docs.
Finally, you should set a cron job on your server to call rake sitemap:refresh as often as needed.
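For example, a crontab entry along these lines would refresh the sitemap nightly (the application path and schedule are illustrative; adapt them to your server):

# Refresh the sitemap every day at 3:00 am
0 3 * * * cd /var/www/myapp && RAILS_ENV=production bundle exec rake sitemap:refresh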
Serving the sitemap
Normally, using the default configuration on a VPS, search engines will have no trouble fetching your sitemap from your public folder; following our example, the file would be reachable at http://www.cookieshq.co.uk/sitemap.xml.gz.
However, if our application is hosted on Heroku, we face two problems due to its ephemeral filesystem:
- We can't rely on writing to the public folder. That's why we used the tmp folder in the sitemap configuration file above.
- We can't guarantee how long whatever we save in the tmp folder will stay there.
To get around this, we need to host our generated sitemap somewhere else and let the search engines access it there. The Sitemap Generator gem offers ways to save the generated file to S3 using fog or carrierwave, so if you already use either of those in your application, have a look at this wiki page. However, installing Fog or Carrierwave just for this can be a bit overkill, so here's a way to do it that depends only on the aws-sdk gem.
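One caveat: the rake task below uses the version 1 API of the official SDK (the AWS::S3 class), and version 2 moved everything to the Aws namespace, so pin a pre-2.0 release in your Gemfile:

gem 'aws-sdk', '< 2.0'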
Once we have the aws-sdk gem installed, we also need an Amazon S3 bucket and the proper credentials set in the corresponding Heroku configuration panel, and/or in your local environment for tests:
- An S3 access key ID: ENV['S3_ACCESS_KEY_ID']
- An S3 secret access key: ENV['S3_SECRET_ACCESS_KEY']
- The name of the bucket to use: ENV['S3_BUCKET']
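On Heroku you can set these from the app's settings panel or with the toolbelt (the values below are placeholders):

heroku config:set S3_ACCESS_KEY_ID=your-key-id S3_SECRET_ACCESS_KEY=your-secret-key S3_BUCKET=your-bucket-name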
Once this is set, we will need a rake task like the following:
# sitemap.rake
require 'aws-sdk'

namespace :sitemap do
  desc 'Upload the sitemap files to S3'
  task upload_to_s3: :environment do
    puts "Starting sitemap upload to S3..."

    s3 = AWS::S3.new(access_key_id: ENV['S3_ACCESS_KEY_ID'],
                     secret_access_key: ENV['S3_SECRET_ACCESS_KEY'])
    bucket = s3.buckets[ENV['S3_BUCKET']]

    Dir.entries(File.join(Rails.root, 'tmp', 'sitemaps')).each do |file_name|
      # Skip the directory entries and OS X's metadata files
      next if ['.', '..', '.DS_Store'].include?(file_name)

      path = "sitemaps/#{file_name}"
      file = File.join(Rails.root, 'tmp', 'sitemaps', file_name)

      # Upload the file to the bucket under the sitemaps/ prefix
      bucket.objects[path].write(file: file)

      puts "Saved #{file_name} to S3"
    end
  end
end
First we set up our AWS client with the credentials, and after that we iterate over the files in the public_path we configured for the Sitemap Generator, in this case tmp/sitemaps. We have to ignore the folder itself and its parent (. and ..) and, if you are testing on OS X, the usual .DS_Store files.
Afterwards, we write each file to our remote bucket under a sitemaps folder; make sure your credentials have write access to the bucket on your AWS panel.
Finally, we need a rake task that we can schedule to take care of everything: create the sitemap, upload it to S3 and ping the search engines:
# sitemap.rake
namespace :sitemap do
  # ...

  desc 'Create the sitemap, then upload it to S3 and ping the search engines'
  task create_upload_and_ping: :environment do
    Rake::Task['sitemap:create'].invoke
    Rake::Task['sitemap:upload_to_s3'].invoke
    SitemapGenerator::Sitemap.ping_search_engines('http://www.cookieshq.co.uk/sitemap.xml.gz')
  end
end
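Heroku has no cron to schedule this on, but the Heroku Scheduler add-on fills the same role; as a sketch (plan names may vary):

heroku addons:create scheduler:standard
# Then add `rake sitemap:create_upload_and_ping` as a job in the Scheduler dashboard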
Note that when we call ping_search_engines, we send the search engines the URL where they can find our sitemap. But the file is not on our server, so we need to make a small amendment to our routes.rb:
# routes.rb file
get '/sitemap.xml.gz', to: redirect("https://#{ENV['S3_BUCKET']}.s3.amazonaws.com/sitemaps/sitemap.xml.gz"), as: :sitemap
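You can also advertise the sitemap in your robots.txt, which crawlers check by convention; the URL below follows our example domain:

# public/robots.txt
Sitemap: http://www.cookieshq.co.uk/sitemap.xml.gz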
Going a bit further: testing your sitemap generation script
Recently I found this post by Mike Coutermash, in which he devises a simple RSpec test that checks whether your sitemap.rb will run properly. Note that it works out of the box only with the latest (5.1.0) version of SitemapGenerator. As Coutermash states, it lets you "check if it runs and you don't forget to update your sitemap.rb if you change your routes". Here's an example based on his base specs:
# spec/lib/sitemap_generator/interpreter_spec.rb
require 'spec_helper'

describe SitemapGenerator::Interpreter do
  describe '.run' do
    it 'does not raise an error' do
      allow(SitemapGenerator::Sitemap).to receive(:ping_search_engines).and_return(true)
      allow(SitemapGenerator::Sitemap).to receive(:create).and_yield

      FactoryGirl.create_list(:faq, 5)
      FactoryGirl.create_list(:team_member, 3)
      FactoryGirl.create(:post, slug: 'test-slug')
      FactoryGirl.create(:case_study, slug: 'successful-app-story')

      expect { described_class.run }.not_to raise_error
    end
  end
end
Conclusion
I hope this post is helpful to you. On a final note, I'd like to credit as a source this post from status203.me by w1zeman1p, which I found via the gem's wiki and which was really helpful.
Picture by kaveman743 adapted from the original on flickr, used under CC BY-NC 2.0 license.