This week, we were lucky to have Andrew Jones, a data engineer, visit our office to teach us about data roles in the IT industry. Andrew is a software engineer with a background in site reliability engineering and tech leadership. Most recently, he worked at Casper as a data engineer, and has now moved on to a new role on the data team at Grubhub Seamless.
Andrew covered three main topics during the talk: 1) the differences between data science, data engineering, and data analyst roles; 2) the most common languages / frameworks / tools used in data roles; and 3) technical interview trends in the data realm.
Data Engineering vs. Data Science vs. Data Analysis
Andrew kicked us off by covering the distinctions between the three main disciplines in data roles, and what to recruit for in each. Of the three, data science occupies the middle ground between data engineering and data analysis, and there is usually little or no overlap between the tasks of a data analyst and a data engineer.
Data Engineers are software engineers specializing in data storage and consistency. Their primary differentiator is that they are responsible for writing “production grade” code – code that is well-tested, scales, and persists. They sometimes use Big Data tools / frameworks, and own the architecture of transactional and analytical systems. They also help architect and maintain server infrastructure, build platforms for data visualization, and build tools that run data migration tasks.
- From a recruitment standpoint, Andrew advised us that data engineers generally look for roles that offer them the opportunity to work with large datasets, and tend to avoid painful data warehousing projects.
Data Analysts are generally skilled in Python or R (particularly those coming from academia). They focus on generating reports that help a business make changes / decisions based on data. Andrew covered the distinction between a data-informed organization and a data-driven organization: whereas the former relies on periodic data reports to answer the “where are we?” question, the latter uses data to actually mold the direction of the company, answering both “where are we?” AND “where should we be going?”.
- For recruitment, data analysts are generally most attracted to roles where their analysis abilities will be used to drive business decisions. They want to know that their work will be impactful.
Data Scientist is a relatively new job title. The role bridges the gap between data engineer and data analyst. Data scientists typically build features that help a business grow usership, increase revenue, improve ad spend, etc. by leveraging predictive modeling, machine learning, AI, or deep learning. They identify trends and predict what might happen next in their company’s future landscape.
- For recruitment, data scientists are most drawn to roles that offer the opportunity to build data-driven features (e.g. recommendation engines or features that drive user engagement).
Tech Stack: Common Languages / Frameworks / Tools used in Data Roles
Next, Andrew took us through a quick overview of the primary programming languages and tools used in these roles. The main two languages for data are Python and Scala.
- Python is a great, general purpose language. It does everything! Two commonly used frameworks in the Python ecosystem are Flask and Django. Flask is lightweight and minimal, while Django is larger and more feature-rich.
- Scala is a language that has become increasingly popular in the data field, and has gone through a lot of growth in the past 5 years. It runs on the JVM and interoperates with Java, but has its own, much more succinct syntax.
As for frameworks, Spark and Hadoop are the most commonly used.
- Spark is the most popular big data framework, and is itself written in Scala. It’s most compatible and performant with Scala, but it can also be used from Python via PySpark (a library that communicates with Spark).
- Hadoop is also popular but has earned a “difficult to work with” reputation when compared to Spark. Within Hadoop, MapReduce is the programming model used to write Hadoop jobs, and HDFS (Hadoop Distributed File System) is how data is stored. Despite Hadoop’s relative unpopularity compared to Spark, Spark actually builds on Hadoop concepts: it commonly lives on top of HDFS and leverages it heavily. So, Hadoop isn’t going anywhere!
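To make the MapReduce idea concrete, here’s a toy, in-process sketch of the model in plain Python. This is not actual Hadoop code – real Hadoop distributes these phases across a cluster and reads/writes HDFS – but the map → shuffle → reduce shape is the same:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data big tools", "data tools"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 2, 'tools': 2}
```

The word-count example is the classic MapReduce “hello world”; Hadoop’s contribution is running exactly this pattern over data too large for one machine.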
Andrew also touched on task coordination (all about scheduling task flow) and different types of data stores.
- For task coordination, the most popular is Airflow, with Luigi and Azkaban also being commonly used.
- For data stores, within the SQL ecosystem, BigQuery, Hive, or Redshift are preferable to Postgres or MySQL for data analysis purposes. MongoDB, Redis, ElasticSearch, or Cassandra are commonly used NoSQL solutions.
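At its core, what a coordinator like Airflow does is model a pipeline as a DAG of tasks and execute them in dependency order (plus scheduling, retries, and monitoring on top). A toy sketch of that dependency ordering using the standard library – the task names here are made up, and this is not Airflow’s actual API:

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline expressed as a DAG:
# each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Resolve a valid execution order for the tasks.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Airflow, Luigi, and Azkaban all layer scheduling and failure handling over this same basic idea.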
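Whichever SQL store is used, analytical work largely boils down to aggregation queries over large tables. A minimal illustration of the kind of query you’d run on BigQuery or Redshift, using stdlib sqlite3 as a stand-in (the schema and rows are invented for illustration):

```python
import sqlite3

# In-memory database standing in for an analytical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# A typical analytical query: total order value per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 200.0), ('west', 200.0)]
```

Warehouses like BigQuery and Redshift are preferred for this workload because they’re built for scans and aggregations over huge datasets, whereas Postgres and MySQL are tuned for transactional reads and writes.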
Interview Structure for Data Roles
Finally, we had a brief discussion on general interview structure for data roles. In Andrew’s experience, most companies seem to be moving away from whiteboarding and test for different things depending on the type of role a candidate is up for.
- Data engineers are tested as general software engineers, with a little more focus on architecture, system design, or data trivia.
- Data analysts are commonly given take-home assessments: they’re given a small dataset and asked to break down business needs and create a report, then brought in to present their findings to the team.
- Data scientists have the most varied type of interview. They generally get a blend of engineering or analysis interviews.
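The analyst take-home described above might look something like the following sketch: given a small CSV, compute a summary that supports a business decision. The dataset and question here are invented, and real take-homes typically use pandas or R rather than the standard library:

```python
import csv
import io
from collections import defaultdict

# Hypothetical take-home dataset: signups by acquisition channel.
data = """channel,signups
email,120
email,90
ads,40
ads,35
"""

# Aggregate signups per channel.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(data)):
    totals[row["channel"]] += int(row["signups"])

# The "report": which channel drives the most signups?
best = max(totals, key=totals.get)
print(dict(totals), best)  # {'email': 210, 'ads': 75} email
```

The presentation step then matters as much as the code: the candidate walks the team through what the numbers imply for the business.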
Overall, it was a vastly informative and well-organized lunch-and-learn presentation. Andrew was a great speaker!