How we Reduced Google BigQuery Cost by 50%
At Acko, we run our big data platform on Google Cloud Platform (GCP). BigQuery is our Data warehouse for analytics. Apart from the Data Platform, all our transactional systems are running on Amazon Web Services (AWS). The beauty of BigQuery lies in the statement from BigQuery product page:
These above three benefits marked in bold are what we typically look for in any technical solution – scalability, low operational overhead and addressing what the end-user expects. The BigQuery offering covers concerns of all types of users associated with any DataWarehouse. E.g. The analyst who typically runs SQL on the data warehouse, the DevOps engineer who needs to set-up and manage the infrastructure and finally the tech owner whose typical concerns are future scalability with business growth.
The end-users of Data Warehouse (DW) at Acko are mainly the Analyst and the Product team. They love BigQuery as it allows them to run whatever queries/SQLs they have and BigQuery after scanning terabytes of data responds in seconds. Yes! A single query usually scans TBs of data in our DW and they get results in seconds. As all good things come with a cost, it shot up for us as well.
Initially, we thought that our billing was increasing due to growth in business, data volumes and number of users. However, within a short span of three months, our bills shot up to double. We started diving deep into finding the real cause of the cost shooting up and optimizing the same.
If you can’t measure it, you can’t improve it. – Peter Drucker
So we started measuring with quick analysis on the pattern of increase in data ingestion to BigQuery using Stackdriver Monitoring for BigQuery. The quick analysis concluded the ingestion size has a significant spike. However, we also realized that the storage size in BigQuery does not have much impact on billing and loading of data into BigQuery does not cost.
Now we have turned our focus on SQL queries being fired on BigQuery. We tried to get a quick look at the queries in the audit log offered by Google Cloud. It offers logs of the queries fired on BigQuery. We needed more insights on those queries. We needed to slice and dice to look for patterns. To do that, we created a pipeline of these SQL queries from audit logs to another BigQuery table using Log Router. The BigQuery audit log has some very useful information like SQL queries, the user who executed it and most importantly the total bytes it scanned. From these logs, we started identifying expensive queries. We quickly published Tableau dashboards to analyze expensive queries and users. After analysis we concluded that there are few tables, sizes ranging from hundreds of GBs to TBs, which are used very frequently in SQL queries for ad-hoc analysis and few reporting are also based on those tables and these queries are taking a bigger share of our billing.
After the analysis phase, we were clear that The solution lies in the problem itself – The expensive SQL queries. We started analyzing the pattern of these queries and also discussed with the end-users of these SQLs to understand the exact requirement of their ad-hoc analysis. It seemed in most cases there was no need to scan the entire table or all the columns of those big tables. However, these SQLs were scanning the entire table. Also, people were using SELECT statement with * and LIMIT. Since BigQuery uses columnar storage so putting LIMIT was of no use on data it scans.
We needed a solution for below two problems to reduce our billing:
- Ad-hoc and reporting SQL queries on large tables and scanning data in TBs
- People firing SQLs which were unoptimized for columnar storage DB
Part 1: Doing the obvious in a technical way
Partitioned table: Since in most cases those ad-hoc queries on large tables did not require to scan the entire table, we partitioned those tables based on the field identified after analyzing those SQL queries. And it turned out in most of the cases this field was the creation timestamp of the record. Like queries based on insurance policy issue date. So we implemented BigQuery’s date/timestamp partitioning. As partitioning ensured that BigQuery will scan data only in the partitions which the user has passed in the SQL as partition filter, not the whole table. That means only a part of the table will be scanned, not the entire table. And this portion was usually just 3-5 percent of the table size. Which reduced the query size from TBs to few GB. Still, we could see in the query log that queries are being fired without a partition filter, so we enforced partition filters.
Clustering: BigQuery supports clustering on partitioned tables. We implemented clustering in some of those large size tables. There was also a pattern of SQL queries where people were filtering/aggregating data based on particular columns like insurance type. So we implemented clustering based on such columns for some tables. This further improved performance of such SQL queries.
Part 2: Scoreboard for Awareness
For the second problem, we didn’t have a tech solution. We came up with a mailer for usage awareness to the end-users.
BigQuery Usage Daily Mailer: Every morning we started sending an automated awareness mail to the BigQuery users whose last day’s queries have scanned large amounts of data. We already had SQL logs in our BigQuery table. We included two types of users in this mailer:
- Users whose combined queries for the day have scanned a large amount of data.
- Users whose single query has scanned huge amount of data.
Part 3: Best Practices training
After going through these optimizations and creating awareness among the end-users we see our BigQuery monthly billing going down and it has gone down by almost half.
Going forward as our BigQuery usage increases organically we will be looking at Flat rate Pricing. Even with the flat-rate pricing model, our optimizations will be relevant as bad queries can exhaust fixed slots and can hamper the performance of optimized queries.