High-performance technology for complex mobility networks in emerging markets.
From Africa to India, Latin America to Southeast Asia, we’re working to make formal and informal public transport reliable, predictable, safe, inclusive and accessible, so that everyone in the emerging world can get where they want to go.
We’re doing this by mapping all the public transport networks in emerging markets and making the data available through our integrated mobility data platform. Governments and service providers use our data and technology to develop sustainable urban mobility plans, and billions of people riding public transport use it to make smarter transport choices.
We currently have data from 33 cities in our platform, and we’re expecting 100+ by the end of the year. That kind of expansion requires an architecture robust enough not only to hold the largest source of the world’s most complex public transport data, but also to provide services using that data. It’s a whale of a project.
Integrated Mobility Data Platform
In a previous post on Kubernetes & Containerisation, our Co-Founder and Software Architect, Dave, talked about supporting our ever-increasing quantities of data as we scale. This time Ivan Sams, our Lead Engineer & Product Liaison, will be examining our platform’s performance and how we make use of caching and immutability to ensure we have a resilient high-speed platform as we collect, validate, clean, and process the world’s most complex public transport networks in emerging markets.
From the outset, we designed the WhereIsMyTransport platform to support thousands of client applications. App developers can make their loading spinners as fancy as they like, but no one likes to wait for technology. We set ourselves an ambitious goal: no matter how huge, complex and diverse our data set was to become, our median server response time should always be under one second — actually well under one second, to allow for internet latency and client-side loading. UX researchers have long known that an app user’s thought process is interrupted beyond that one-second limit, so anything above one second makes life harder for the very commuter whose life we’re trying to improve.
Finding optimal journeys through public transport networks was one of our first challenges at WhereIsMyTransport. We model public transport as a network of atomic connections. Each connection is made by a vehicle travelling between exactly two stops at a particular time. As we go about mapping the world, we have added hundreds of millions of these connections (and counting) to our database. Finding the best path through this many connections proved to be as slow as it sounds: without caching, our journey planner would take anywhere from minutes to hours to process some requests. Before we could launch our platform, we needed to optimise our systems and our journey planning algorithm.
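To illustrate what searching over atomic connections looks like, here is a minimal earliest-arrival scan in the style of the Connection Scan Algorithm. This is a toy sketch in Python, not the production journey planner described above, and all names and data are hypothetical:

```python
from dataclasses import dataclass

# A connection: one vehicle travelling between exactly two stops at a set time.
@dataclass(frozen=True)
class Connection:
    dep_stop: str
    arr_stop: str
    dep_time: int  # minutes after midnight
    arr_time: int

def earliest_arrival(connections, origin, start_time):
    """One pass over connections sorted by departure time, returning the
    earliest known arrival time at every reachable stop."""
    best = {origin: start_time}
    for c in sorted(connections, key=lambda c: c.dep_time):
        # Take this connection only if we can reach its departure stop in
        # time and it improves on the best arrival we know for its arrival stop.
        if best.get(c.dep_stop, float("inf")) <= c.dep_time and \
           c.arr_time < best.get(c.arr_stop, float("inf")):
            best[c.arr_stop] = c.arr_time
    return best

connections = [
    Connection("A", "B", 480, 490),
    Connection("B", "C", 495, 510),
    Connection("A", "C", 485, 530),  # slower direct service
]
print(earliest_arrival(connections, "A", 480))
# → {'A': 480, 'B': 490, 'C': 510} — changing at B beats the direct service
```

The single sorted pass is what makes this style of search attractive at scale: its cost grows linearly with the number of connections scanned, which is also why caching the connection data in fast memory matters so much.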
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors. — Phil Karlton & Leon Bambrick
Our data collection team uses a suite of custom-designed tools to map new cities and load the data into our integrated mobility data platform. Not all data storage is created equal. Which storage you use, how often you read and write to it, and the way it is stored can add up to a huge difference in how quickly it is accessed.
We store our platform data in modern NoSQL databases, namely MongoDB, Azure Table Storage and Azure Blob Storage. NoSQL databases store records without requiring a rigid, predefined schema and are designed to operate at the scale of millions — or billions — of entries. Accessing data from these databases is pretty fast, but to achieve the blistering speeds we expect from our systems we cache the data to the memory of the machines that run our software. Caching involves saving a copy of the data to extremely fast memory on the computer that processes the requests.
There are a number of ways to build up a cache. One is to save results to the cache after they are first accessed, known as lazy loading. The cache takes no time to build up front and unnecessary expensive operations are avoided, but the first query for any given result can be slow. An alternative is to load the data into the cache before it is needed, known as eager loading. This takes time initially and occupies a large amount of memory whether or not the data gets used, but results are fast from the very first query.
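The trade-off between the two strategies can be sketched in a few lines of Python. The `fetch` function below stands in for a slow database read; everything here is illustrative rather than our platform code:

```python
class LazyCache:
    """Populate entries on first access: cheap to start, slow first hit."""
    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._fetch(key)  # expensive, but only once
        return self._store[key]

class EagerCache:
    """Populate everything up front: slow to start, fast from the first query."""
    def __init__(self, fetch, keys):
        self._store = {k: fetch(k) for k in keys}

    def get(self, key):
        return self._store[key]

calls = []
def fetch(key):
    calls.append(key)       # record each "database read"
    return key.upper()

lazy = LazyCache(fetch)
lazy.get("a")
lazy.get("a")
print(calls)  # → ['a'] — fetched once, served from memory afterwards
```

An `EagerCache(fetch, keys)` pays every fetch in its constructor instead, which is the "takes time initially" cost described above.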
We use a Redis database as an additional cache layer. Redis synchronises between the multiple machines that run our platform, so the data can be updated regularly yet always remain consistent. It also allows us to release new versions of the platform and restart it without having to re-access our database to build the cache. Finally, our Redis layer helps us to scale by storing all this data in one place instead of on every machine we are using, saving memory.
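The resulting read path has three tiers: this machine's memory first, then the shared Redis layer, and the primary database only as a last resort. A sketch of that lookup order, with a plain dictionary standing in for Redis so the example is self-contained (names are illustrative, not our platform's code):

```python
class TwoTierCache:
    """Local in-process memory backed by a shared cache layer (Redis in
    production; a dict stands in for it here), with the primary database
    consulted only when both tiers miss."""
    def __init__(self, shared, load_from_db):
        self.local = {}               # fastest: this machine's own memory
        self.shared = shared          # shared across all machines
        self.load_from_db = load_from_db

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:        # e.g. redis.get(key) in production
            value = self.shared[key]
        else:
            value = self.load_from_db(key)  # slow primary database
            self.shared[key] = value        # now every machine can reuse it
        self.local[key] = value
        return value

db_reads = []
def load(key):
    db_reads.append(key)
    return f"data:{key}"

shared = {}
machine_a = TwoTierCache(shared, load)
machine_b = TwoTierCache(shared, load)  # e.g. a freshly restarted instance
machine_a.get("route:42")
machine_b.get("route:42")   # served from the shared layer, no database read
print(db_reads)  # → ['route:42'] — the database was hit exactly once
```

This is why a restart is cheap: a new machine starts with an empty local tier but warms itself from the shared layer instead of the primary database.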
The upshot of caching is that when a user makes a request, no slow operations need to be made to simply get the data. We can begin processing immediately and spend our valuable single second of processing time on finding the best possible results for the user, instead of fetching data. We never need to access our relatively slow primary database except when we first start up our services and when we add new data to the platform, thanks to our use of immutability.
In computer science, objects and data that may not be changed after they are created are called “immutable”. Everything we schedule to our system is immutable — so it can only be superseded by a later version, never removed or updated. Thorough use of immutability confers a number of advantages to a software and database system, particularly one with thousands of concurrent users like ours. Immutability makes a system much simpler to understand and maintain, as every object in the system has one and only one state during its lifetime. It also means that when multiple users are accessing the same object, they all get a consistent result back, eliminating the risk of it being changed halfway through the operation.
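The supersede-never-mutate rule is easy to show in miniature. A hypothetical `Timetable` record, made immutable with a frozen dataclass in Python:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: attributes cannot be changed after creation
class Timetable:
    agency: str
    version: int
    trips: tuple  # tuples, unlike lists, are immutable too

v1 = Timetable("MetroBus", 1, ("trip-a", "trip-b"))
# v1.trips = (...) would raise FrozenInstanceError: no in-place updates.
# To change anything, we publish a new version that supersedes the old one:
v2 = Timetable("MetroBus", 2, ("trip-a", "trip-b", "trip-c"))

# Readers holding v1 keep seeing a consistent object; nothing changed under them.
assert v1.trips == ("trip-a", "trip-b")
```

Every object having exactly one state for its whole lifetime is what eliminates the "changed halfway through" class of concurrency bugs.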
When we publish data to our platform, it stays published forever. This simplifies our cache considerably since we neither need to update it nor tactically expire it, which is notoriously difficult to do well. Instead, we simply replace it once new data is available. Immutability enables our platform’s unique feature of allowing queries at any moment in the past or future for which we have scheduled data. Of course, this has huge practical applications for analytics, allowing comparison of a network over time. It also means that any query repeated is guaranteed to give the same results the second time, unless the data has been updated.
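Because published data is never removed, a query "as of" any moment reduces to picking the version that was current at that moment. A hypothetical append-only store sketching the idea (not our platform's actual API):

```python
import bisect

class PublishedStore:
    """Append-only publishes: each publish records when it becomes valid.
    Nothing is ever updated or deleted, so historical queries stay exact."""
    def __init__(self):
        self._times = []   # valid-from timestamps, kept sorted
        self._data = []    # the network published at each timestamp

    def publish(self, valid_from, data):
        i = bisect.bisect(self._times, valid_from)
        self._times.insert(i, valid_from)
        self._data.insert(i, data)

    def as_of(self, moment):
        """Return the data that was (or will be) current at `moment`."""
        i = bisect.bisect_right(self._times, moment)
        return self._data[i - 1] if i else None

store = PublishedStore()
store.publish("2019-01-01", "network-v1")
store.publish("2019-06-01", "network-v2")

print(store.as_of("2019-03-15"))  # → network-v1: the past stays queryable
print(store.as_of("2019-12-31"))  # → network-v2: superseded, never replaced
```

Repeating either query will always return the same answer, which is exactly the guarantee that makes analytics comparisons over time trustworthy.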
Immutability is not a silver bullet though, and when designing a software system we have to think carefully about when to break that rule. In general, scheduled data added to the platform is immutable. But certain things, such as our mechanism for users to choose which agencies to include, can be modified. This is because it’s conceptually mutable, and it simplifies the software design significantly.
Since we launched our platform two and a half years ago, we have made a large number of additional optimisations to our algorithms and to our infrastructure. Extensive and effective caching of immutable objects remains key to maintaining the high performance our clients have come to expect and rely upon. The current live version of our platform has more data than ever before yet is our fastest iteration to date. Check it out by signing up at developer.whereismytransport.com.
If you’d like to know more about our work, drop us a line at firstname.lastname@example.org.