Optimizing BigQuery for Performance and Cost Efficiency

Peter Vaňko, Staff Engineer at Kiwi

At Kiwi.com we run an advanced BigQuery environment with many use cases and approximately 4 petabytes of data. Part of my daily work is taking care of that beast, so I have decided to share some lessons I’ve learned along the way!

Maintaining BigQuery (BQ hereafter) involves two key aspects: tuning performance and managing costs. Solving performance issues with a credit card is easy in the cloud, but then you can expect difficult conversations about a blown budget. Let’s talk about smart strategies that improve the user experience at minimal cost. 

Storage performance — tune your tables 

An important part of BQ performance is how your tables are structured; partitioning and clustering are what ultimately make the difference. 

Here are the most important tips for storing and querying data: 

  • To efficiently read data, ensure your WHERE condition follows table partitioning; failing to do so results in full table scans. The ultimate boost is scanning only the partitions you really need. 

  • A weak point of massively parallel processing is joining large tables. Clustering tables on the join columns speeds up SELECT queries; the same applies to MERGE operations, where clustering helps locate the related rows. 

  • To avoid expensive join operations in reporting, you can combine tables using nested fields. The only pitfall is that it is not possible to define clustering on a nested field! 

  • Another use case for clustering is plain filtering for reporting. A table has only one clustering specification and no additional indexes. You can cluster on multiple columns (up to four), but filtering only benefits when the WHERE conditions cover the leftmost clustering columns, so the order of columns matters. A minimal sketch follows below. 
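
To make the partitioning and clustering tips concrete, here is a minimal sketch; the project, dataset, and column names are invented for illustration:

    -- Daily-partitioned table, clustered on the columns most often
    -- used for filtering and joining.
    CREATE TABLE `my_project.reporting.bookings`
    (
      booking_id   STRING,
      booking_date DATE,
      market       STRING,
      carrier      STRING,
      price_eur    NUMERIC
    )
    PARTITION BY booking_date
    CLUSTER BY market, carrier;

    -- The WHERE clause follows the partitioning, so only 7 daily
    -- partitions are scanned; the filter on market (the leftmost
    -- clustering column) lets BigQuery skip unrelated blocks.
    SELECT carrier, SUM(price_eur) AS revenue
    FROM `my_project.reporting.bookings`
    WHERE booking_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
      AND market = 'UK'
    GROUP BY carrier;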

Storage Costs: Long-term Storage Strategies and Data Cleanup 

It is super simple: the more data you store, the higher the invoice you receive. 

If a partition or table has not been modified for 90 days, it automatically moves to long-term storage and you pay half the price. Keep this in mind when you consider updating historical data: rewriting old partitions resets them to the active rate, so such an operation can be very expensive. 
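
To check which partitions have already reached long-term storage (and would jump back to the active rate if rewritten), you can query the per-dataset INFORMATION_SCHEMA.PARTITIONS view; a sketch, with placeholder project and dataset names:

    -- storage_tier shows 'LONG_TERM' after 90 days without writes.
    SELECT
      table_name,
      partition_id,
      storage_tier,
      ROUND(total_logical_bytes / POW(1024, 3), 1) AS logical_gib,
      last_modified_time
    FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
    ORDER BY total_logical_bytes DESC;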

Another way to save money is to set up rules and governance that push users to limit the data they store. A few rules help us keep the costs within reasonable bounds: 

  1. Default data expiration 
    Big tables should have a retention period in place. Challenge this period with data consumers and agree on a reasonable setup to find the sweet spot between costs and analytic needs. 
    For example, our biggest table keeps only 35 days of history. Sandbox and playground tables are created with a default retention of 30/90 days, which prevents “production” reporting from being built on top of them; a DDL sketch follows. 
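
    Both retention rules can be set with plain DDL; a minimal sketch with hypothetical names:

      -- New tables in the sandbox dataset expire after 30 days by default.
      ALTER SCHEMA `my_project.sandbox`
      SET OPTIONS (default_table_expiration_days = 30);

      -- Keep only 35 days of history in a partitioned production table.
      ALTER TABLE `my_project.reporting.bookings`
      SET OPTIONS (partition_expiration_days = 35);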

  2. Identify unused tables 

    a. By scanning INFORMATION_SCHEMA views, you can find tables with no usage and drop them (I do not recommend automatic deletes); see the query sketch below. Every quarter I find around 200 TB that can be dropped immediately. 

    b. Once you clean up the mess in one project, somebody will start doing the same elsewhere; experience has proved this. Your cleaning activity must happen on a regular basis (in my case quarterly). I created a report that sends me the top 20 tables where I should focus my attention, which makes it easy to identify the troublemakers. If you cannot find the owner, just rename the table and, if it is important, they will find you instead. In the worst-case scenario, you still have 7 days to recover the table from Time Travel. 
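
    One way to find candidates is to compare everything that exists (the TABLE_STORAGE view) with everything the job history actually read (JOBS_BY_ORGANIZATION); a sketch assuming a single EU region, organization-level access, and a 90-day lookback:

      -- Top 20 tables with no reads in the last 90 days, largest first.
      WITH touched AS (
        SELECT DISTINCT ref.project_id, ref.dataset_id, ref.table_id
        FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION,
          UNNEST(referenced_tables) AS ref
        WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
      )
      SELECT
        s.project_id, s.table_schema, s.table_name,
        ROUND(s.total_logical_bytes / POW(1024, 4), 2) AS logical_tib
      FROM `region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
      LEFT JOIN touched AS t
        ON  t.project_id = s.project_id
        AND t.dataset_id = s.table_schema
        AND t.table_id   = s.table_name
      WHERE t.table_id IS NULL
      ORDER BY s.total_logical_bytes DESC
      LIMIT 20;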

  3. Manuals, tutorials, and documentation 
    The idea of teaching users best practices sounds nice, but it does not always work: new people keep joining the company, there are too many rules, and nobody really reads the documentation. You still need everything documented, but what works best for us are reusable code snippets. 

Our BQ storage setup went through several optimizations over the last 4 years. The split by projects shows how we started with one Datalake project and moved to a data mesh architecture, where each team operates in a separate GCP project. 

Compute performance — tailor resources to meet users’ demand 

BigQuery comes with two pricing models: On-demand (pay per processed TB) and Editions. Since On-demand is super simple ($6.25 per processed TB), I will focus on Editions, which allow you to be more creative and tailor performance to your workload and needs. 

Google’s BigQuery uses slots as its compute unit (virtual CPU + RAM). The more data you process and the more complex your queries, the more slots you need. 

  • Save 20-40% of costs with a long-term commitment 
    Based on your current consumption and forecast, consider long-term commitments. They allow you to purchase compute capacity at a fair discount (1 year: 20%, 3 years: 40%). 

  • At the same time, you can use autoscaling for peaks and ad-hoc workloads. Such slots are more expensive, but they are available immediately when required. 

You arrange slots into Reservations and use different setups based on your needs. We reevaluate ours on a regular basis and react to different workloads in the following ways (a DDL sketch follows the list): 

  1. Default (universal usage): 

    a. Base 600 slots

    b. +200 autoscale slots during the day (Monday +400) to cover the higher demand coming from end users doing analytical work 

    i. For most of the use cases, such a setup works perfectly. 

    ii. Special cases causing issues or not performing well were isolated into separate reservations. 

  2. Reservation 1 (our largest data domain) 

    a. Base 1000 slots 

    b. +600 autoscale slots when reporting is on fire (monitoring job queues)

    i. The domain works with a big dataset that cannot be aggregated and every query requires significant computation. 

    ii. Before, the queries were consuming the majority of Default slots and simple jobs were waiting for minutes. Now they block only queries in the same reservation. 

  3. Reservation 2 (near-realtime jobs on a 5-minute schedule) 

    a. Base 200 slots 

    b. No Autoscaling, slots sharing disabled 

    i. Before, the simple queries were stuck in the Default queue with other requests and processing time often exceeded 5 minutes. 

    ii. It was like a motorcycle stuck in a traffic jam; the dedicated reservation cleared the road. 
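
Such reservations live in the admin project and can be managed with BigQuery’s reservation DDL; the sketch below mirrors the near-realtime setup, with illustrative project names and region:

    -- Dedicated reservation for the 5-minute jobs: fixed baseline,
    -- no autoscale_max_slots set (so no autoscaling), and idle slots
    -- from other reservations ignored to keep behaviour predictable.
    CREATE RESERVATION `admin_project.region-eu.near-realtime`
    OPTIONS (
      edition = 'ENTERPRISE',
      slot_capacity = 200,
      ignore_idle_slots = TRUE);

    -- Route one GCP project's queries into that reservation.
    CREATE ASSIGNMENT `admin_project.region-eu.near-realtime.streaming_assignment`
    OPTIONS (
      assignee = 'projects/streaming_project',
      job_type = 'QUERY');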

BigQuery resource consumption by the top 10 users (2-week window), illustrating intraday periodicity with peaks from ad-hoc analytical requests. 

Compute costs

In our case, the biggest tables are streamed (or loaded hourly), which means the ELT processing is spread more or less evenly across the day. This lets us utilize the available resources without significant peaks and buy stable compute capacity at a discount. Queries coming from dashboards and end users behave differently: most of them start at 9 AM when people arrive at work. For such peaks, we use autoscaling. 

At the end of the day, BigQuery compute costs vary by weekday and processing demand. The blue part is stable and represents committed capacity (90% of costs); the red part is dynamic and mostly driven by autoscaling (10%). 

Google recommends having one admin project for your compute capacity, but in such a setup you lose visibility of spending per project. To see costs per project, user, and query, you need to play a little with the available data. If you divide the final monthly cost by the total slot-milliseconds consumed, you get a price estimate you can work with; in our case we end up with the formula cost ≈ consumed_ms / 1.4e8. As consumed_ms is available at the job level, you can estimate costs per user and project, or, by utilizing job labels, see the costs of each dashboard and ETL process. 
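
A sketch of that attribution using the region-qualified JOBS_BY_ORGANIZATION view and the 1.4e8 divisor from above (both the region and the divisor are specific to our setup):

    -- Approximate month-to-date compute cost per project and user,
    -- using total_slot_ms and the empirical cost-per-slot-ms divisor.
    SELECT
      project_id,
      user_email,
      ROUND(SUM(total_slot_ms) / 1.4e8, 2) AS approx_cost_usd
    FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
    WHERE creation_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
    GROUP BY project_id, user_email
    ORDER BY approx_cost_usd DESC;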

Summary 

Business Intelligence enables people to get value out of data, which is the main purpose of any data processing. A data warehouse program always carries significant costs: people demand more data and faster processing. A well-monitored and well-governed platform survives much longer without desperate users or the need for complicated migrations. Whether you use BigQuery or another technology, the core principles remain pretty much the same. 
