Abstract
Amazon Redshift opens up enterprise data warehouse (EDW) capabilities to even the smallest of businesses, yet its costs, security, and flexibility also make it appealing to the largest of enterprises. It allows companies to easily and conveniently scale their EDW needs both up and down, and as a managed service, it allows your team to offload all of the "undifferentiated heavy lifting" of building and maintaining an EDW. Its raw storage costs are about one-fifth to one-tenth of traditional in-house EDW, and AWS has taken great care to ensure its performance is still competitive with those in-house solutions. Before deciding to use Amazon Redshift, however, it's important to understand what it is and is not.
Sample
Like any "big data" initiative, deploying and operating a data warehouse of any size used to be limited to only large enterprises with deep budgets for proprietary hardware and multi-year software licenses. Pay-as-you-go cloud products like Google's BigQuery and AWS's Amazon Redshift change all of that, putting a fully blown, fully managed data warehouse within reach of even the smallest business.
But do commodity costs equal cut-rate performance? Can even large enterprises rely on these for workloads that currently run out of commercial solutions like Oracle Exadata and Teradata? How is security and latency affected in a hybrid environment? In this article, we focus on Amazon Redshift, with an introduction into what it is (and is not), as well as how it compares with the costs, performance, support for third-party visualization tools, scale, and reliability.
What is a Data Warehouse?
Data storage and analysis is typically bifurcated into two types of systems, known as Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP), both of which are terrible acronyms that need a bit of explanation. OLTP is about serving real-time (low latency), high concurrent connections-traditionally served by Relational DataBase Management Systems (RDBMSes) like MySQL and Oracle. OLAP systems, on the other hand, are characterized by longer-running, more complicated queries against probably much larger volumes of data. An OLTP system could provide the data layer for a customer-facing web or mobile app so that the customer could complete a transaction; whereas an OLAP system serves internal users who are typically doing data mining in order to extract business intelligence analytics. Some examples of OLAP queries are "tally and sort all sales by salesperson by region by product for all of 2013," or "find the lowest margin products by category by region for Q3 2014"-data which could be used to either calculate salesperson bonuses or decide which product lines to terminate. For a reasonably busy retailer, either of those queries could involve sorting through terabytes of data (which probably exist in a couple of disparate, siloed, and perhaps even proprietary systems), and could take hours or maybe even days to run.
The hardware and software underneath these OLAP systems is what we refer to as a data warehouse (DW), which is sometimes also referred to as an enterprise data warehouse (EDW). Technically, OLAP refers to the operation being performed, while an EDW is the static hardware and software to support OLAP operations, although many refer to EDW systems as OLAP systems as well. These systems have very different considerations than OLTP systems-they need to be able to store and process dozens, hundreds, or maybe even thousands of TB of data, as cost-effectively as possible, and they need to be able to handle very long-running, complicated queries across those large data sets. EDW administrators also need to be able to easily and cost-effectively grow the storage and processing capabilities as their business grows.
Data generated from the OLTP systems is periodically (hourly, daily) copied into the EDW system via Extract, Transform and Load (ETL) operations meaning that the amount of data in an EDW grows very quickly (see figure 1). For this reason, the "cost per TB" of storage in an EDW has always been a big consideration.
In the pre-cloud days, any business hoping to do BI queries had to fork out millions to purchase the EDW hardware and software from a limited number of companies. The makers of these EDW systems could (and did) pretty much charge the maximum they could, and because of their high cost, businesses had to be very selective about what they chose to put into their EDW.