How to Scale Your Data Infrastructure When the Time Has Finally Come to Scale
People always say, “Don’t plan too far ahead for massive scale before you really know you need it.” So what happens when the time has finally come and you really do need to scale?
Moreover, what do you do when it’s time to scale, but you already have HUGE amounts of real data running in production, serving many real customers?
Your organization and codebase are only growing, so the amount of data is expected to grow accordingly, and significantly. At this crossroads you have no choice: you can’t wait any longer, and NOW is the time to upgrade your infrastructure and database to support it.
This is the crazy challenge our Platform Team has had to deal with over the past few months.
As the Head of the Platform Team at Torii, I would love to share our journey, the challenges my team went through, and the overall solution we built to support them and our customers.
As part of the process, we learned in depth about the world of data engineering, its common tools and concepts, and I’m happy to say we all had the privilege of a fascinating lesson in this field. Most importantly, after introducing our changes, our data infrastructure became much more flexible, robust, and adaptable at significant scale.
What started this whole thing, and why was our previous solution insufficient?
At Torii, we help our customers analyze the way their team is utilizing their stack of SaaS products, such as Salesforce, Zoom, Asana, and 150+ other applications. This means analyzing an ever-growing number of events describing how users interact with the different applications. We provide our customers with the ability to measure their employees’ application usage frequency, which we call “app-usage”.
Until recently, we used one SQL database that serves Torii’s main application. This database also collected and processed the app-usage data, and for our original feature set it was more than sufficient.
As time passed, our customer base grew rapidly and our features became much more ambitious. This meant storing and analyzing much larger amounts of data than originally estimated. As more and more customers onboarded into Torii, we found ourselves struggling to load more data and experienced significant performance and quality issues while executing queries on the tables.
Moreover, the app-usage analysis was consuming most of our database resources, which affected the performance and operation of other important features. And lastly, since our system was optimized for its original use case, we lacked the flexibility to adapt easily to new requirements and new features. Because of that improper modeling, we couldn’t use the data to gain further insights, which made extending the system’s capabilities hard and, in some cases, even impossible.
With that, we knew it was time to make a change: not small incremental improvements, but a rethink of the overall system for a real long-term solution, since we knew small changes would only ever serve as temporary patches. More specifically, we needed to plan and implement a solution that would tackle the problems listed above without disrupting our current infrastructure, which already serves many users. It was also important for us to find a managed solution that would spare us maintenance and devops work, since we make extensive use of serverless technologies and deliberately structured our team around developers rather than a large group of devops engineers.
Finally, we came up with the following goals for our platform:
- Collect any amount of data
- Scale as we grow
- Stay flexible when introducing new features
- Keep supporting today’s features
- Require minimal devops/maintenance
- Be easily adoptable and adaptable
As a team of backend developers, we had to learn new technologies in the data-engineering realm for the first time. This was a completely new field for us, and we had to learn new terminologies and tools from scratch. A whole new world of technology was revealed to us, from data-processing methods to big-data warehouse technologies.
With the help of AWS experts, we composed a solution, built on top of AWS, that aligns with our preference for simple solutions and requires minimal devops work, yet is robust enough to support our current and future scale.
Here are some of the main components we chose to use:

AWS Lambda:
An AWS compute engine that runs your code without the need to maintain infrastructure or servers. We use it to process and transform our app-usage events.

Amazon S3:
An AWS object (file) store. We use it to store the app-usage events in files prior to processing and ingesting them into Redshift.

Amazon Kinesis Data Firehose:
An AWS delivery stream used to stream data from one data source to another with high throughput. We use it to move the app-usage events to S3 and from there to Redshift.

Amazon Redshift:
AWS’s managed data warehouse, derived from PostgreSQL. We use it to store the processed app-usage events and to run our heavy queries and aggregations. Like PostgreSQL, Redshift supports views: a view is a virtual table that represents the results of a SELECT query. It does not store any data itself, but rather displays data stored in one or more tables in the database.

Apache Airflow (MWAA):
An open-source platform to author, schedule, and monitor workflows. We use it as our ETL engine (ETL stands for “Extract, Transform, Load”: collect data from one source, process it, and then load it to another destination).
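To make the view concept concrete, here is a minimal, self-contained sketch using Python’s built-in sqlite3 module; the mechanism is the same in PostgreSQL and Redshift, and the table and column names here are illustrative, not our actual schema:

```python
import sqlite3

# An in-memory database stands in for the warehouse; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_usage_events (user_id TEXT, app TEXT, signed_in_at TEXT)"
)
conn.executemany(
    "INSERT INTO app_usage_events VALUES (?, ?, ?)",
    [
        ("alice", "Zoom", "2023-01-01"),
        ("alice", "Zoom", "2023-01-02"),
        ("bob", "Asana", "2023-01-01"),
    ],
)

# A view stores no data of its own: it is just a named SELECT that always
# reflects the current contents of the underlying table.
conn.execute("""
    CREATE VIEW app_usage_frequency AS
    SELECT user_id, app, COUNT(*) AS sign_ins
    FROM app_usage_events
    GROUP BY user_id, app
""")

rows = conn.execute(
    "SELECT * FROM app_usage_frequency ORDER BY user_id"
).fetchall()
print(rows)  # [('alice', 'Zoom', 2), ('bob', 'Asana', 1)]
```

Querying the view after inserting more events would immediately reflect the new rows, which is exactly why views suit per-feature aggregations over constantly arriving data.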
A high-level design of our solution:
So, let’s walk through a real “app-usage” scenario and see what happens when an employee signs into a third-party application, such as Zoom:
- Once every 24 hours, we fetch sign-in events using the Zoom API
- The events are sent through Kinesis-Firehose to S3 (1)
- A Lambda function is triggered for each file and processes the events (2)
- The processed events are sent through Firehose again and stored back in S3
- At this point, our goal is to store the processed events in Redshift, so we use Redshift’s built-in “COPY from S3” command
- Once the data resides in Redshift, we are ready to start the actual “heavy” work, such as processing and aggregation (3)
- Every few minutes, we run a task in Airflow that aggregates the data and builds the views based on the features' requirements.
- The view results are then moved back to our Prod-DB, where data can be consumed to power the features of our product. (4)
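To illustrate the processing step (2), here is a sketch of the kind of pure transformation such a Lambda function might apply to each raw sign-in event before sending it onward. The event shape and field names are invented for the example, and the actual S3/Firehose wiring is omitted:

```python
import json
from datetime import datetime, timezone

def transform_sign_in_events(raw_records: list[str]) -> list[dict]:
    """Normalize raw sign-in events (one JSON document per record) into
    flat rows ready to be copied into the warehouse.

    The input shape ({"user": {"email": ...}, "app": ..., "timestamp": ...})
    is hypothetical, not a real vendor API contract.
    """
    rows = []
    for record in raw_records:
        event = json.loads(record)
        rows.append({
            # Normalize identifiers so aggregation keys match across vendors.
            "user_email": event["user"]["email"].lower(),
            "app": event["app"],
            # Store timestamps in one canonical form: UTC, ISO 8601.
            "signed_in_at": datetime.fromtimestamp(
                event["timestamp"], tz=timezone.utc
            ).isoformat(),
        })
    return rows

# Example: one raw event as it might arrive from a vendor API.
raw = [json.dumps({
    "user": {"email": "Alice@Example.com"},
    "app": "Zoom",
    "timestamp": 1672531200,
})]
print(transform_sign_in_events(raw))
```

In the real pipeline the function is invoked once per S3 file and its output is written back to Firehose, but keeping the transformation itself pure like this makes it trivial to unit-test without any AWS dependency.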
Startups usually have to move fast and deliver new features quickly; as such, we always want to build the simplest possible solution that meets the requirements.
When the time comes and you need to scale, you know you must expand your infrastructure, but choosing the optimal solution for your team and the technologies you already use is challenging.
If you have a team of versatile and talented people, they can quickly become familiar with another domain of expertise like data-engineering. Give them the freedom to explore and discover new tools and technologies that will benefit the entire team and your codebase.
Throughout the process, it is best to set milestones and goals to ensure that you’re moving in the right direction, and you too will be able to find the right solution for your organization’s scaling needs.