A longtime leader in data analytics, Google continues to earn that position by steadily improving its offerings. With Google Cloud Platform (GCP), you can capture, process, store, and analyze your data in one place, shifting your focus from infrastructure to the analytics that inform business decisions. You can also combine GCP Big Data tools with other cloud-native and open-source solutions to meet your needs. Below is an overview of the GCP Big Data tools and how you might use them to improve your analytics.
BigQuery is perhaps the most robust of GCP's Big Data tools, and companies of all sizes can use it. It is a managed, serverless, enterprise data warehouse that you query with SQL and that can analyze data in real time, and it also lets you bring in data from spreadsheets and object storage. Companies like Spotify have chosen BigQuery and other GCP products because Google's offerings are more advanced than other cloud providers' tools, according to Nicholas Harteau, Spotify's vice president of engineering.1
BigQuery frees your employees from database management so they can focus instead on insights that improve your bottom line, using its built-in, secure tools for reporting and extracting data. For example, after switching to BigQuery and App Engine, Motorola expanded its data collection capabilities, giving support engineers more information with which to help customers troubleshoot products.2
Speaking of the bottom line, BigQuery is also affordable. Because you can scale up and down as your needs fluctuate, you never pay for resources you don't need. Google also gives you 1 TB of analyzed data and 10 GB of storage free each month to further reduce cost. Your data stays secure: it is encrypted at rest and in transit, and further protected with granular access controls. BigQuery also works with Stackdriver for monitoring and logging.3
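To see how the free tier affects a monthly bill, here is a back-of-the-envelope sketch. The 1 TB/month of free analysis comes from the figures above; the $5-per-TB on-demand rate is an assumed illustrative figure, so check current pricing before relying on it.

```python
# Rough BigQuery on-demand query cost, after the 1 TB/month free analysis
# tier mentioned above. The $5-per-TB rate is an illustrative assumption.

FREE_TIER_TB = 1.0
ASSUMED_PRICE_PER_TB = 5.00  # USD, assumed for illustration only

def estimated_query_cost(tb_scanned_this_month: float) -> float:
    """Estimate the month's on-demand query charges in USD."""
    billable_tb = max(0.0, tb_scanned_this_month - FREE_TIER_TB)
    return billable_tb * ASSUMED_PRICE_PER_TB

# A month that scans 0.8 TB stays inside the free tier:
print(estimated_query_cost(0.8))  # 0.0
# A month that scans 3 TB pays for only the 2 TB over the free tier:
print(estimated_query_cost(3.0))  # 10.0
```

The point is simply that light analytical workloads can run at no query cost at all, while heavier ones pay only for what exceeds the free allowance.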
On top of that, Google makes it easy to move your data into BigQuery with the BigQuery Data Transfer Service. This managed service helps you move data from sources such as AdWords, YouTube, DoubleClick, and other SaaS applications.
If you are currently using Hadoop or Spark, you may prefer Cloud DataProc. Cloud DataProc gives you a push-button way to stand up a configured Apache Hadoop cluster with Spark, Hive, and Pig in under two minutes. It is a fully managed, highly automated service that removes the complexity of initial setup so you can get right to work, while still allowing manual control if you choose. It is easily scalable, highly available, and offers multiple ways to manage your cluster, including a web user interface and a REST API. It also keeps costs low by letting you use pre-emptible virtual machine instances, which may be as much as 80% less expensive than regular virtual machines, and it offers per-second pricing, so you pay only for the resources you need when you need them.
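Those two pricing levers, pre-emptible workers and per-second billing, compound. A minimal sketch, assuming a placeholder hourly machine rate (the only figure taken from the text above is the up-to-80% pre-emptible discount):

```python
# Rough cost of a Cloud DataProc cluster run mixing regular and pre-emptible
# workers. The hourly rate is an assumed placeholder; the 80% discount is the
# "up to 80% less expensive" figure discussed above.

ASSUMED_HOURLY_RATE = 0.10   # USD per worker-hour, illustrative only
PREEMPTIBLE_DISCOUNT = 0.80  # pre-emptible VMs may be up to 80% cheaper

def cluster_cost(regular_workers: int, preemptible_workers: int,
                 seconds: int) -> float:
    """Per-second billing: total cost for a cluster run of this duration."""
    hours = seconds / 3600.0
    regular = regular_workers * ASSUMED_HOURLY_RATE * hours
    preempt = (preemptible_workers * ASSUMED_HOURLY_RATE
               * (1 - PREEMPTIBLE_DISCOUNT) * hours)
    return round(regular + preempt, 4)

# 2 regular + 8 pre-emptible workers for a 30-minute job:
print(cluster_cost(2, 8, 1800))  # 0.18
```

Under these assumptions the eight pre-emptible workers cost less than the two regular ones, which is why pre-emptible instances are popular for fault-tolerant batch work.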
Moving data between formats or services can be a serious challenge. This work is often handled by Extract-Transform-Load (ETL) services, which extract data from one source, parse and transform it as required, and finally load it into another data structure. Cloud DataFlow is a fully managed, serverless ETL-style service that works on data both in real time and in large batches. Because Cloud DataFlow is serverless, resource management and allocation are handled for you.
Cloud DataFlow operates by executing a chain of user-supplied processing steps, referred to as transforms. You can write transforms to perform numerous data operations, such as splitting or combining data streams. For instance, suppose you have a fleet of construction vehicles, each with a set of sensors installed to monitor engine health and to track location via GPS. An onboard computer in each vehicle collects the sensor readings and transmits them as a single combined stream to the cloud over a 3G or 4G wireless radio. Instead of sending all of the data straight into a database such as BigQuery or Bigtable, you could send it to Cloud DataFlow, parse the stream, and separate it into two tables: one for engine data and one for GPS information. There are many such cases where data needs to be manipulated in transit and at rest. Cloud DataFlow can just as easily ingest, parse, and transform large data sets at rest, for example by combining data from multiple files or database tables into a new structure that better matches your business requirements. And the best part is that Cloud DataFlow is designed to work with other GCP products such as BigQuery, Machine Learning Engine, and Cloud Bigtable, to name a few.
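The stream-splitting step above can be sketched in plain Python. In Cloud DataFlow you would express this as pipeline transforms; the record layout and field names here (`"type"`, `"vehicle_id"`, and so on) are invented for illustration.

```python
# A plain-Python sketch of the transform described above: splitting a
# combined vehicle sensor stream into separate engine and GPS record sets.
# Field names are made up for illustration, not from any real schema.

def split_sensor_stream(records):
    """Route each combined-stream record to an engine or a GPS table."""
    engine_rows, gps_rows = [], []
    for rec in records:
        if rec["type"] == "engine":
            engine_rows.append({"vehicle_id": rec["vehicle_id"],
                                "rpm": rec["rpm"], "temp_c": rec["temp_c"]})
        elif rec["type"] == "gps":
            gps_rows.append({"vehicle_id": rec["vehicle_id"],
                             "lat": rec["lat"], "lon": rec["lon"]})
    return engine_rows, gps_rows

stream = [
    {"type": "engine", "vehicle_id": "truck-7", "rpm": 2100, "temp_c": 88},
    {"type": "gps", "vehicle_id": "truck-7", "lat": 35.22, "lon": -80.84},
]
engine_rows, gps_rows = split_sensor_stream(stream)
print(len(engine_rows), len(gps_rows))  # 1 1
```

Each output list would then be loaded into its own destination table, which is exactly the "one table for engine data, one for GPS" split described above.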
Cloud Pub/Sub fills another gap in your data processing pipeline. It is a publisher/subscriber service that lets you stream event data into topics, which work a bit like a staging area where other services such as Cloud DataFlow can retrieve the data for additional processing. In the Cloud DataFlow example above, you could use Cloud Pub/Sub as the internet-facing target for all of the construction vehicles' sensor data. The service is serverless and automatically scales to handle the load it receives. It also helps distribute that load by allowing multiple subscribers to pull from the same topic. Another use case is exchanging information between services in multi-cloud and hybrid environments, since the data is easy to reach through open APIs, client libraries for several languages, and a Kafka connector.
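The fan-out behavior is the key idea. Here is a toy, in-memory model of it; real Cloud Pub/Sub is a managed, serverless service with durable topics and subscriptions, and this sketch only illustrates the pattern in which every subscription receives each published message.

```python
# A toy in-memory publish/subscribe model, for illustration only.
# It mimics the fan-out behavior described above: each subscription
# to a topic receives its own copy of every published message.

class Topic:
    def __init__(self, name):
        self.name = name
        self.subscriptions = {}

    def subscribe(self, sub_name):
        # Each subscription gets its own pending-message queue.
        self.subscriptions[sub_name] = []

    def publish(self, message):
        # Fan out: every subscription receives a copy of the message.
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, sub_name):
        # Drain and return the subscription's pending messages.
        pending, self.subscriptions[sub_name] = self.subscriptions[sub_name], []
        return pending

topic = Topic("vehicle-telemetry")
topic.subscribe("dataflow-pipeline")
topic.subscribe("archival-job")
topic.publish(b"engine rpm=2100")
print(topic.pull("dataflow-pipeline"))  # [b'engine rpm=2100']
print(topic.pull("archival-job"))       # [b'engine rpm=2100']
```

In the vehicle example, a streaming pipeline and an archival job could each hold their own subscription to the same telemetry topic and process the same events independently.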
Of course, you may also need further data analysis, so Google has created Cloud DataLab. This collaboration tool helps you share reports, queries, and datasets with colleagues so that you can further explore, analyze, or transform them, or even build machine learning models. Cloud DataLab is built on Jupyter notebooks and integrates with BigQuery, ML Engine, Compute Engine, and Cloud Storage. It is also free to use, though you may incur storage or other cloud service costs.
Cloud DataPrep is another tool you can use to prepare data for analysis. Operated by Trifacta, a Google partner, it is a managed, scalable service that detects schemas, datatypes, anomalies, and outliers, making it easier to clean up your data before analysis and saving hours of employee time. And best of all, it is free to use.
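Cloud DataPrep automates this kind of cleanup for you, but to make the idea concrete, here is a plain-Python illustration of one such check: flagging outliers with the interquartile range (IQR). The 1.5 × IQR threshold is the usual statistical rule of thumb, not anything DataPrep specifically documents.

```python
# A simple outlier check, illustrating the kind of data cleanup that
# Cloud DataPrep automates. Uses the common 1.5 * IQR rule of thumb.

import statistics

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the middle 50%."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# The 95 clearly does not belong with the rest of the readings:
print(iqr_outliers([10, 11, 12, 11, 10, 12, 11, 95]))  # [95]
```

A human analyst would still decide whether a flagged value is bad data or a genuinely interesting event; the tooling just surfaces it quickly.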
Managing all of these services may seem overwhelming, but Google has an answer for that as well. Cloud Composer is an easy-to-use managed service that helps you author, schedule, and monitor pipelines spanning your cloud and on-site data centers. It is built on Apache Airflow and configured with Python, integrates with most Google Cloud products, and also lets you create a hybrid or multi-cloud environment. You can also easily monitor and troubleshoot your workflows through simple charts.
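Under the hood, an Airflow pipeline is a Python file describing tasks and the dependencies between them, and Airflow derives a valid run order from those edges. A minimal dependency-only sketch of that idea (no Airflow import; the task names are invented for illustration):

```python
# A sketch of the ordering an orchestrator like Airflow derives from a
# pipeline's task dependencies. Task names are invented for illustration.

def run_order(dependencies):
    """Return a valid execution order for task -> upstream-tasks edges."""
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for upstream in dependencies.get(task, []):
            visit(upstream)  # run everything a task depends on first
        done.add(task)
        order.append(task)

    for task in dependencies:
        visit(task)
    return order

pipeline = {
    "extract": [],
    "transform": ["extract"],
    "load_bigquery": ["transform"],
}
print(run_order(pipeline))  # ['extract', 'transform', 'load_bigquery']
```

In Cloud Composer you would declare the same three steps as Airflow operators and let the service handle scheduling, retries, and the monitoring charts mentioned above.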
Whatever your specific needs, Google Cloud Platform's Data Analytics tools are worth considering. They can provide solutions for nearly every type and size of business. Global Knowledge can assist you in building your GCP services solutions and implementing GCP data analytics in your unique environment.
Eve Eiler contributed to this post.
1. Cade Metz, “Spotify Moves Itself onto Google’s Cloud –Lucky for Google,” Wired, February 23, 2016, https://www.wired.com/2016/02/spotify-moves-itself-onto-googles-cloud-lucky-for-google/.
2. Alex Barrett, editor, “Supersize it: how Motorola transformed its data warehousing and analytics with Google Cloud Platform,” Google Cloud Platform Blog, July 11, 2016, https://cloud.google.com/blog/big-data/2016/07/supersize-it-how-motorola-transformed-its-data-warehousing-and-analytics-with-google-cloud-platform.
3. Google BigQuery, https://cloud.google.com/bigquery/.