A BigQuery case study on efficient data segregation
By Ajay Lakshminarayanarao and Allen Fung
Google BigQuery is a powerful tool. A software engineer with only a basic knowledge of SQL can use it with ease, sifting through large amounts of data for insights in seconds. At ShareThis we look into terabytes of actionable social data every day, and BigQuery has become an integral part of our analysis, validation, and processing work.
One of the recent problems we faced was segregating the rows of a BigQuery (BQ) table by country. BQ uses columnar storage: instead of storing a whole record on a single volume, BQ splits each row by column and stores each column's values on its own storage volume. To take advantage of this scheme, we designed an efficient, low-cost query solution to retrieve the values for each country, as described below.
Problem: To segregate a BQ table by country
The Straightforward Way
To segregate a BQ table by country, we need to query the table as many times as there are countries.
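Each per-country query looks roughly like the following sketch (the table name and the column names A through E are placeholders, not our actual schema):

```sql
-- One query per country: every run scans all five columns in full.
SELECT A, B, C, D, E
FROM mydataset.mytable    -- hypothetical source table
WHERE geo.ISO = 'US';     -- repeated once per country code, 240 times in total
```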
Let us assume that we need to query columns A, B, C, D, and E, and that each column is 20 GB. The billable bytes for one full scan of our table will then be 100 GB.
If we execute our query once per country, i.e. 240 times, the billable bytes add up to 24,000 GB, or almost 24 TB. At a charge of $5 per TB, the total cost comes to $120 for a single BQ table.
The Optimized way
The trick here is to group the required fields into a single column. This alone does not reduce the number of bytes billed; the second trick is to pivot on the segregation column, which in our case is country.
After the above steps, our result table will contain as many columns as there are countries. Each column represents one country and holds the previously grouped data for that country only.
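Because BQ bills only the columns a query touches, retrieving one country from this grouped table scans just that country's column. A minimal sketch, assuming the grouped table is named `mydataset.grouped` (a hypothetical name):

```sql
-- Scans only the COUNTRYUS column, not the whole table.
SELECT COUNTRYUS
FROM mydataset.grouped        -- hypothetical grouped/pivoted table
WHERE COUNTRYUS IS NOT NULL;  -- skip rows belonging to other countries
```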
For our 100 GB BQ table, a plausible outcome is that the column representing the US is 20 GB, the next 10 major countries average around 3 GB each, and the remaining 230 countries average about 0.5 GB each.
Now our billable bytes for each query are just the size of that country's data. Querying the US costs about $0.10, each major country about $0.015, and the rest of the countries much less than that.
The total billable bytes come to 100 GB for generating the grouped table plus 165 GB for the queries of all countries combined. The total cost is around $1.50!
Generating the grouped table
We can group the required columns into a single column using CONCAT, formatting each row either as a line of a CSV file or as a valid JSON object.
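For example, a CSV-style grouping might look like the following sketch (column names are placeholders, and CONCAT needs explicit separators to produce a valid CSV line; non-string columns would additionally need a CAST):

```sql
-- Pack columns A–E into a single CSV-formatted string column,
-- assuming A–E are string columns.
SELECT CONCAT(A, ',', B, ',', C, ',', D, ',', E) AS grouped
FROM mydataset.mytable;  -- hypothetical source table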
For the second step, the query consists of 240 CASE statements, one per country, each pivoting that country's rows into its own column. The query will be big, but BQ supports query sizes of up to 256 KB. We can combine both steps into a single query by placing the CONCAT inside each CASE statement.
( CASE WHEN geo.ISO = 'US' THEN CONCAT( A, B, C, D, E ) END ) AS COUNTRYUS,
( CASE WHEN geo.ISO = 'CA' THEN CONCAT( A, B, C, D, E ) END ) AS COUNTRYCA …
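Putting it all together, the generating query is a sketch along these lines (the table name and CSV separators are illustrative, and the remaining ~238 CASE branches are elided):

```sql
-- Builds the grouped, pivoted table: one column per country, each holding
-- the CSV-packed fields for that country's rows.
SELECT
  ( CASE WHEN geo.ISO = 'US'
         THEN CONCAT(A, ',', B, ',', C, ',', D, ',', E) END ) AS COUNTRYUS,
  ( CASE WHEN geo.ISO = 'CA'
         THEN CONCAT(A, ',', B, ',', C, ',', D, ',', E) END ) AS COUNTRYCA
  -- … one CASE branch per remaining country …
FROM mydataset.mytable;  -- hypothetical source table
```

Each row produces a value only in the column for its own country; every other column is NULL for that row, which is what keeps the per-country scans small.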