Although many may argue the point, technology implementation for big data (and technology implementation in general) is usually the easy part. There’s a script to install it, there are online resources for answering questions, and if all else fails, we can outsource the setup and maintenance. However, structuring and managing teams of individuals who install, maintain and use those systems is no easy task.
A big data environment involves myriad roles, skill sets, and team structures that may be new or unfamiliar, and that often overlap. In this blog post, we define and differentiate the most important roles: data scientists, data analysts, business intelligence (BI) analysts, developers and administrators (admins).
As with any technology, nothing would be possible if the admins didn’t first install and configure the software. These folks are responsible for choosing, configuring and maintaining the right hardware, software and networking equipment, and then setting it up to work in a clustered environment (assuming we’re building our own cluster and hosting our own platform). If you decide instead to run your cluster on a mature, cost-effective and easy-to-use public cloud service, such as BigQuery, Redshift or Elastic MapReduce (EMR), you may find that you don’t need an administrator in your organization at all.
Surprising as it may be, another role that may not be needed in your organization is actually that of a developer, which may help explain why this author is now writing blog posts instead of code. As we all (should) know, developers write code in a particular language, and for big data analysis, that code performs some heavy data crunching, usually in a parallelized computing environment on massive sets of data. The ideal person for this job has deep experience with a language (Java, Python, Perl and so forth), a good understanding of the business domain, and is a whiz at distributed systems software development (like MapReduce). Oh, and that person also needs to be available. In other words, we’re looking for a big data unicorn, and the problem is that unicorns don’t exist.
The good news is that every mature big data platform exposes interfaces that unlock the power of the engine without forcing you to get a master’s degree in computer science and 10+ years of work experience. Using the Hadoop engine as an example, nearly everyone, including our mythical ideal developer, uses these interfaces (such as Hive and Pig) rather than writing code when performing actual analysis, because it saves time. Someone who uses these tools to answer business questions quickly is a person we call a data analyst.
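To make the contrast concrete, here is a sketch in plain Python (not actual Hadoop code) of the same word-count job done two ways: the developer’s route, with hand-rolled map and reduce phases mirroring the shape of classic MapReduce, and the analyst’s route, leaning on a high-level interface that hides the same machinery behind one call. The document names, variable names and data here are illustrative assumptions, not anything from a real cluster.

```python
from collections import Counter
from itertools import chain

docs = [
    "big data is easy",
    "big teams are not easy",
]

# Developer route: explicit map and reduce phases,
# mirroring the structure of a classic MapReduce word count.
def map_phase(doc):
    # Emit (word, 1) pairs, one per word.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Sum the counts for each word key.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

mapped = chain.from_iterable(map_phase(d) for d in docs)
counts_low_level = reduce_phase(mapped)

# Analyst route: a high-level interface (Counter) that
# performs the same aggregation in a single call.
counts_high_level = Counter(chain.from_iterable(d.split() for d in docs))

assert counts_low_level == dict(counts_high_level)
```

The results are identical; the difference is how much plumbing you write yourself, which is exactly the trade-off tools like Hive and Pig make at cluster scale.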
The data analyst’s title is probably one of the most widely interpreted roles out there, but these people generally use their knowledge of data models, the business domain and some high-level interface to answer business problems and create common deliverables like sales forecasts and quarterly reports. Using high-level interfaces or third-party tools like Tableau, Qlik and others, these individuals don’t need heavy-duty math, statistical or programming skills to answer common business questions.
The role of the data analyst may seem similar to a business analyst or business intelligence (BI) analyst, and there is a good deal of overlap in those titles. Where I make a distinction is in saying that a data analyst generally needs to understand the data better. To give you an example, a business analyst might wonder why their Hive and Pig queries run slowly, but the data analyst would open the hood, restructure the underlying data, and then go back and run a faster query. BI is a field that has been around for decades, and it encompasses a set of tools and methodologies to help businesses extract actionable value from their data. Although the data analyst’s role falls within that definition and uses many of the same BI tools, they typically need a deeper understanding of the business domain and the data in order to optimize queries, data models and the like.
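The “open the hood and restructure the data” idea can be sketched in plain Python. This is an analogy, not Hive itself: grouping rows by the column you filter on plays the role of partitioning a Hive table, so a query reads only the partition it needs instead of scanning every row. All names and figures below are made up for illustration.

```python
from collections import defaultdict

# A flat "table": every query must scan every row.
rows = [
    {"region": "west", "sales": 100},
    {"region": "east", "sales": 250},
    {"region": "west", "sales": 175},
]

def slow_total(rows, region):
    # Full scan: touches every row, matching or not.
    return sum(r["sales"] for r in rows if r["region"] == region)

# Restructured "table": rows grouped (partitioned) by region up front.
by_region = defaultdict(list)
for r in rows:
    by_region[r["region"]].append(r["sales"])

def fast_total(by_region, region):
    # Partition pruning: only the matching group is ever read.
    return sum(by_region.get(region, []))

assert slow_total(rows, "west") == fast_total(by_region, "west") == 275
```

Both queries return the same answer; restructuring the data just changes how much of it each query has to touch, which is the kind of fix the data analyst reaches for.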
Our final role, the data scientist, is without a doubt the easiest to define. A data scientist is simply a data analyst who lives in California and earns three times as much (as much as I wish that were only a joke, it oftentimes is not). Many data analysts have learned the advanced mathematical, statistical or programming skills that qualify them for one of the most lucrative positions in big data.
Data scientists are the Gandalfs on your big data team: wizards who can draw on their vast knowledge and experience of mathematical and statistical models and algorithms to conjure up exactly the spell you need, surfacing deep insights that lead to innovative and lucrative new products, services and other revenue streams. These are the folks driving your predictive analytics. These are the folks who jam up every whiteboard in your organization with mathematical scribble. These are the folks who will, by a very wide margin, cost your organization a lot more to recruit and retain, but who will ultimately make your organization a lot more money.