Ever hear one of your developers retort with “TIM TOW DEE” when you suggest an alternate approach and then wonder “who is Tim, why does he want to tow Dee, and what does this have to do with anything we were talking about?” We have the open source community (and probably Larry Wall, more than anyone) to thank for the useful acronym TMTOWTDI, which is shorthand for “There’s More Than One Way To Do It.” When it comes to “doing” big data, you’ll find yourself using this phrase on a daily basis.
It’s been about 10 years since public cloud offerings like AWS opened up the world of big data analytics to allow mom-and-pop shops to do what only the big enterprises could do prior—extract business value by mining piles of data like web logs, customer purchase records, etc.—by offering low-cost commodity clusters on a pay-per-use basis. And in that decade, the offerings have blossomed to cover everything from real-time (sub-second latency) streaming analytics to enterprise data warehouses used to analyze decades worth of data in batch mode jobs that could take days or weeks to complete.
Let’s examine the top five most useful architectures used for big data stacks and learn the sweet spots of each so you’ll better understand the tradeoffs. We’ll also break down the costs (on a scale of $-$$$$$), when to use or not use, popular offerings and some tips and tricks for each architecture.
The Big Five
In no particular order, the top five big data architectures that you’ll likely come across in your AWS journey are:
• Streaming – Allows ingestion (and possibly analytics) of mission-critical, real-time data that can come at you in manic spurts.
• General (or specific) purpose ‘batch’ cluster – Provides generalized storage and compute capabilities in an extensible, cost-effective cluster which may perform any and all of the functions of the other four architectures.
• NoSQL engines – Gives architects the ability to handle the “Three V’s” -- high velocity, high volume, or the high variety/variability of the underlying data.
• Enterprise data warehouse (EDW) – Lets an organization maintain a separate database for years of historical data and run various long-running analytics on that data.
• In-place analytics – Allows users to leave their data “in place” in a low-cost storage engine and run performant, ad-hoc queries against that data without creation of a separate, expensive “cluster.”
A streaming solution is defined by one or more of the following factors:
• Mission-critical data — losing even one transaction can be catastrophic to a user.
• Manic spikes in load — your IoT farm may go from completely silent to every one of the million devices talking to you all at once.
• Real-time response — high latency responses can be catastrophic to a user.
There’s a boatload of real-world examples here, from the Tesla cars (which are basically rolling 4G devices) constantly sending the car’s location to a back-end which tells the driver where the next charging station is, to my personal favorite: Sushiro, a heavily automated sushi-boat franchise in Japan. What Sushiro did is put RFID sensors on the bottom of every sushi plate at every one of their 400 locations. Then a sensor on the “sushi conveyor belt” tracks each plate as it comes around, sending that data point to AWS Kinesis where the back end responds with a dashboard update, telling the sushi chef important info like “throw away the next plate, it’s about to go bad,” or “make more egg sushi,” or “thaw more tuna, we’re running low.” By using streaming, the chain now has not only real-time efficiency recommendations like the above, but they also get historical info for every restaurant and can start planning for trends among their customers.
Sushiro is a great example because it hits all the three requirements for streaming. The dashboards are now critical to the operation of the business.
• Cost: $$ - $$$$$ (typically RAM intensive)
• Suitability: Mission-critical data, manic spikes in load, real-time response. You’ll want to build real-time dashboards of KPIs.
• Caveats: Standalone streaming solutions can be expensive to build and maintain. Scaling can be challenging, especially if you’re building on EC2. A failure can be catastrophic to business, but most offerings provide failsafes, like replication tuning, backup and disaster recovery, to avoid this.
• Popular offerings: Kinesis (managed service), Kafka (EC2-based), Spark Streaming (both as a managed service and EC2-based), and Storm.
• Tips and tricks: Use Kinesis for starters (easy to use, cost effective at low volume). Many organizations move to EC2-based Kafka (if they just need streaming) or Spark Streaming to obtain better control and lower costs at high volume. This is one of the few times in AWS where a managed service like Kinesis can end up costing more – a great deal more – than an EC2-based solution like Kafka.
General (or specific) purpose batch cluster
Hadoop/Spark rule the roost here. With these systems, you get highly extensible, low-cost (commodity hardware, open source software) storage and compute that can be thrown at a myriad of problems in order to do batch-heavy analysis of data at the lowest cost possible.
Hadoop is highly mature, and offers an extremely rich ecosystem of software (think “plug-ins”) that can leverage those generic compute and storage resources to provide everything from a data warehouse to streaming and even NoSQL.
On top of Hadoop, we can now run Spark, which comes with its own extensible framework to provide all of the above and more in a low-latency (high RAM) manner suitable even to streaming and NoSQL.
• Cost: $ - $$$$ (highly dependent on RAM needs)
• Suitability: Lowest cost, greatest flexibility. Good choice if you desire one cluster to do everything and are moving from Hadoop or Spark on-premise. Highly suitable for machine learning.
• Caveats: A system that can “do everything” rarely “does everything well,” but this can largely be mitigated by using Spark and building clusters tailored to each job.
• Popular offerings: EMR (managed service – runs Spark as well), Cloudera (EC2-based), Hortonworks (both as a managed service via EMR, and EC2-based).
• Tips and Tricks: Store source data long-term in S3, build clusters and load that data into your cluster on an as-needed basis, then shut it all down as soon as your analytics tasks are complete. This is actually precisely how EMR works by default, but even if you’re using Cloudera or Hortonworks (nearly identical in functionality now), you can easily script all the above. Leverage EC2 spot instances to get up to a 80-90% savings (no, that is not a typo), and checkpoint your analytics so that you can spin clusters up or down to take advantage of the lowest cost spot windows.
Velocity (concurrent transactions) is of particular importance here, with these engines being designed to handle just about any number of concurrent reads and writes. Whereas other systems typically cannot be used for both end users, (who demand low latency responses), and employee analytics teams, (who may lock up several tables with long-running queries), simultaneously, NoSQL engines can scale to accommodate both masters in one system. Several developments allow real-time joining and querying of this data in a low-latency manner.
• Cost: $$ - $$$ (typically RAM intensive)
• Suitability: “Three V’s” issues. Simple and/or fast-changing data models. You’ll want to build real-time dashboards of KPIs.
• Caveats: You must give up transactions and rich, diverse SQL. Since it doesn’t use SQL, data cannot be queried directly with visualization tools like Tableau and Microstrategy. Scaling, especially adding new nodes and rebalancing, can be difficult and affect both user latency and system availability.
• Popular offerings: DynamoDB (managed service), Neptune (managed service – still in beta), Cassandra (EC2-based), CouchDB (EC2-based), and HBase (both as a managed service via EMR, and EC2-based)
• Tips and Tricks: Strive to use the AWS-managed service DynamoDB rather than provisioning EC2 and loading a third-party system. Periodically prune your end-user DynamoDB table and create weekly or monthly tables (dialing the size – and therefore cost) down on those historical tables. Use Dynamic DynamoDB to “autoscale” provisioned capacity so it always meets (and just exceeds) consumed. Use DynamoDB Streams to enable real-time responses to critical events like customer service cancellation or to provide a backup in a 2nd region.
Enterprise data warehouse (EDW)
An EDW is dramatically different than any of the other systems mentioned here. It provides what we call an “OLAP” (OnLine Analytics Processing – supports a few long running queries from internal users) versus the “OLTP” (OnLine Transaction Processing – supports tons of reads and writes from end users) capabilities of an RDBMS like Oracle or MySQL. Granted, one could use an OLTP system as an EDW, but most of us keep the OTLP database focused on the low-latency, recent event (like “track last week’s order”) needs of end users and periodically (normally daily) window older data out to an OLAP system where our business users can run long-running queries over months or years of data.
These OLAP systems use tactics like columnar storage, data denormalization (creation of “data cubes” with nearly unlimited dimensions), and provide RDBMS-level ANSI 92 SQL adherence, meaning we have full access to SQL capabilities, and visualization tools like Tableau are tailored to work with them directly.
• Cost: $$ - $$$$$ (typically need lots of nodes to store and process the mountain of data)
• Suitability: If you want to analyze data specifically for business value or build real-time dashboards of KPIs.
• Caveats: Make sure your team understands the difference between OLAP and OLTP and that they are using each in the correct way.
• Popular offerings: Redshift – there is really no other valid option with regards to cost, performance and flexibility.
• Tips and Tricks: As with EMR/Hadoop, only spin up a cluster when needed, keeping the source data in S3 (this is actually how Redshift works by default). Tag clusters so you can, in an automated fashion, quickly identify and shut down unused capacity. Consider reservations to rein in costs. Really understand the different node types available (high storage, high throughput) in order to leverage each. Be careful turning on native encryption as it can reduce performance by up to 20-25%. Deep dive into Redshift with my five-star O’Reilly course or consider taking in-person training with our excellent “Data Warehousing” class, which covers Redshift almost exclusively.
Presto kind of changed the game a few years back by offering performant analytics on data without having to move that data out of it’s native, low-cost, long-term storage. I understand the inner workings about as well as I understand fairy dust, but the end result is that rather than having to stand up (and remember to tear down) an expensive EMR or Redshift cluster, I can simply run queries ad-hoc and be charged only for exactly what I use.
Furthermore, I recoup all that time I spent trying to pick (then later manage) the right nodes and number of nodes for my EMR or Redshift cluster. With Presto, I no longer know nor care about this “undifferentiated heavy lifting” – everything just works when I need it to.
Lastly, Presto supports RDBMS-level ANSI-92 SQL compatibility, meaning all of the visualization tools work directly against it, and my SQL background can be used full bore in ad-hoc queries.
• Cost: $ - $$
• Suitability: Very low cost. No management whatsoever. Can act as a low-cost, moderately performant EDW. It doesn’t require replicating data to a second system. Large joins and complex analyses work well.
• Caveats: Not the lowest latency. To achieve decent performance, will likely reformatting the stored data using a serialization format Parquet, compressing, re-partitioning, etc. May require several rounds of query tuning and/or reformatting to get correct. Currently no support for UDFs or transactions.
• Popular offerings: AWS Athena (managed service used to query S3 data), EMR (managed service – can install Presto automatically), self-managed Presto (EC2 based – you’d never want to do this in AWS).
• Tips and Tricks: Just use Athena. Leverage AWS Glue to build an ETL pipeline for ingesting the raw data and reformatting it into something that S3 or Athena can use more efficiently. Use S3 lifecycle policies to move older data to lower cost archival storage like Glacier.
Putting it all together
With an understanding of the top five big data architectures that you’ll run across in the public cloud, you now have actionable info concerning where best to apply each, as well as where dragons lurk.
Once you start building out big data architectures in AWS, you’ll quickly learn there’s way more than five, and in many cases your company will likely end up using all of the above in tandem – perhaps using Kinesis to stream customer data into both DynamoDB and S3. You may occasionally spin up an EMR (to do some machine learning) or Redshift (to analyze KPIs) cluster on that source data, or you may choose to format the data in such a way that you can access in-place via AWS Athena – letting it sort of function as your EDW.
Having the ability to do TMTOWTDI is a great thing, and AWS strives to provide the services from which you can pick the best fit for your needs. But it can be overwhelming – even for long-term practitioners like myself. If you’re starting from scratch, the brief three days spent in an AWS-certified Global Knowledge training class will more than pay for itself by giving you the lowdown on services that will meet your needs, and let you hit the ground running as soon as you get back into the office.
Never miss another article. Sign up for our newsletter.