We recently covered finding journeys within a shape using PostGIS, and this time I’m going to talk about how we generated data to test and fine-tune our matching algorithm.
Although the SQL query we covered last time is fairly straightforward and easy to cover with a unit test, getting good results meant adjusting various parameters against realistic data, and we found that this kind of tuning had to be done by hand.
So we needed some way to get a set of testing data. Since the product was a new one, we had no data from users, so we had to generate data ourselves. Simply randomly generating coordinates wasn’t going to work (given that Great Britain is an island, you’re fairly likely to generate a coordinate in the sea somewhere), and we wanted precise coordinates (individual houses/buildings), so using a list of coordinates of major cities/towns was out as well.
In the end, the solution we came up with was to use Google’s geocoding service (which we were already using in the app) to search for something we could be sure would return a result given a general locality, such as a town or city.
Here’s the general process we used:
- To start with, we needed a list of towns and cities. We ended up getting our list from Wikipedia. A quick screen-scrape later, we had an array of city names.
- Next, we needed something to search Google with. We did a few rounds of testing with our list of cities, and got the best results with “station”, “restaurant”, “post office”, and – this being the UK – “Tesco”.
- We then wrote a simple service object that would randomly generate a list of journeys (using FactoryGirl) and associate them with a geocoded origin and destination, sourced from our list of city names.
The service object
Here’s a cut-down version of the service we wrote:
class GeoSeeder
  CITIES = %w[London Edinburgh Cardiff ...]
  PLACES = %w[station restaurant post\ office tesco]

  def initialize(options = {})
    @num_of_users = options.fetch(:users).to_i
    @num_of_journeys = options.fetch(:journeys).to_i
  end

  def run
    generate_users!
    generate_journeys!
  end

  private

  def generate_users!
    @users ||= FactoryGirl.create_list(:user, @num_of_users)
  end

  def generate_journeys!
    cities_collection.each do |(origin, destination)|
      origin_search = "#{random_place} near #{origin}"
      destination_search = "#{random_place} near #{destination}"
      FactoryGirl.create(:journey, user: random_user, origin: origin_search, destination: destination_search)
      sleep(5)
    end
  end

  def random_place
    PLACES.sample
  end

  def cities_collection
    CITIES.permutation(2).to_a.sample(@num_of_journeys)
  end

  def random_user
    @users.sample
  end
end
To ensure that we don’t try to generate a journey with the same origin and destination, we use Array#permutation to generate a list of all the possible pairs of cities, from which we then grab a subset to use to create journeys (we have to call #to_a first, since #permutation returns an Enumerator).
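To see why permutations rule out same-city journeys, here is a minimal sketch using a short stand-in list (the three names below are placeholders for illustration, not the full list scraped from Wikipedia):

```ruby
# Illustrative only: a tiny stand-in for the real CITIES constant.
cities = %w[London Edinburgh Cardiff]

# Array#permutation(2) yields every ordered pair of distinct elements,
# so a pair such as ["London", "London"] can never appear.
pairs = cities.permutation(2).to_a

pairs.each { |origin, destination| puts "#{origin} -> #{destination}" }

# For n cities there are n * (n - 1) ordered pairs; both directions of
# each route are included as separate pairs.
puts pairs.length
```

Because the pairs are ordered, London to Cardiff and Cardiff to London count as two different journeys. The full array holds n * (n - 1) entries, which is fine for a list of towns and cities but worth keeping in mind for much larger inputs.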
We also make liberal use of Array#sample, which returns either a single element or n random elements from an array (so you can think of it as being roughly equivalent to [...].shuffle.take(n)).
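A short sketch of both forms of Array#sample, using a copy of the place names from the service (the variable names are just for illustration):

```ruby
places = ["station", "restaurant", "post office", "tesco"]

one  = places.sample      # a single random element
two  = places.sample(2)   # an array of two elements taken from distinct indices
many = places.sample(10)  # capped at the size of the source array

puts one
puts two.inspect
puts many.length
```

The cap matters for the seeder: asking for more journeys than there are city pairs simply returns every pair once, rather than raising an error or repeating pairs.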
Geocoding is handled within the model by the geocoder gem, so we just need to construct the search string that’s passed on to Google’s geocoding API. Since the service is only a one-off task, run in development, we also include a sleep() call to avoid triggering the API’s rate-limiting.
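The fixed sleep is the simplest form of throttling. As a rough sketch of the same idea pulled out into a reusable helper (each_with_delay and its sleeper parameter are hypothetical names, not part of the service above or of the geocoder gem), the pause can be injected so that it is easy to shorten or stub out:

```ruby
# Hypothetical helper: yields each item, pausing for `delay` seconds
# between consecutive items. Unlike the service above, it does not
# pause after the final item.
def each_with_delay(items, delay:, sleeper: Kernel.method(:sleep))
  items.each_with_index do |item, index|
    sleeper.call(delay) if index.positive?
    yield item
  end
end

# Stand-in for the geocoded searches; no request is actually made here.
searches = ["station near London", "tesco near Cardiff"]
each_with_delay(searches, delay: 0.1) { |query| puts "geocoding: #{query}" }
```

For a one-off development task the plain sleep(5) is arguably enough; a helper like this only starts to pay off if the delay needs tuning or the loop needs testing without waiting.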
We then created a rake task that calls the service object:
# lib/tasks/geoseed.rake
task :geoseed => :environment do
  GeoSeeder.new(users: ENV["USERS_COUNT"], journeys: ENV["JOURNEYS_COUNT"]).run
end
and call it like so:
$ USERS_COUNT=20 JOURNEYS_COUNT=50 bin/rake geoseed
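One thing to watch: if either variable is left unset, ENV returns nil, and nil.to_i is 0, so the task as written would appear to run without creating anything. A minimal sketch of a stricter parser follows; count_from_env and the default values are assumptions for illustration, not part of the original task:

```ruby
# Hypothetical helper: reads a positive integer from ENV (or any object
# that responds to fetch(key, default), such as a Hash).
def count_from_env(env, key, default)
  # Integer() raises ArgumentError on non-numeric input, unlike String#to_i,
  # which would quietly return 0.
  count = Integer(env.fetch(key, default).to_s, 10)
  raise ArgumentError, "#{key} must be positive, got #{count}" unless count.positive?

  count
end

puts count_from_env({ "USERS_COUNT" => "20" }, "USERS_COUNT", 10)
puts count_from_env({}, "JOURNEYS_COUNT", 50)
```

Inside the rake task this could be called as count_from_env(ENV, "USERS_COUNT", 10), so that a typo in the variable name falls back to a default rather than seeding nothing.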