This article was also published on the site: https://dzone.com/articles/the-types-of-data-engineers.
Over the last few years, data engineers, together with data scientists, have been in high demand in the job market.
However, there is still a noticeable discrepancy in what the technical profile of a data engineer is taken to be. I'm speaking specifically about the Latin American region; elsewhere in the world the market may be more mature.
Companies, especially consulting companies, have difficulty mapping and hiring the right profiles, and often end up generalizing the vacancy and announcing it simply as "Data Engineer."
They can end up placing people with a relational database background, who are used to solving problems with SQL, on projects that require programming skills (coding), or sometimes on projects that call for more of a data analysis profile.
It happens! There are different data engineer profiles. I have identified three basic ones, which I detail below.
1. Analytical Skill
This engineer has a more analytical skill set and usually has training in computer science, mathematics, or physics.
This engineer is responsible for scaling machine learning models and making these models fit for production environments. They have relevant knowledge in the area of data science and know how to code very well.
This engineer knows scikit-learn, TensorFlow, and Keras. They adapt the models and architectures created by data scientists, define the design patterns, and make the model run in production for millions of users. I think this is currently the most difficult profile to find.
- Distributed ML Platforms: MLlib (Spark), Mahout (Hadoop), AWS SageMaker;
- Algorithms and data structures: union, linked lists, trees, graphs;
- Parallel Computing for Deep Learning (TensorFlow, GPU Programming);
- Development in Containers (Docker, rkt);
- Programming in Notebooks (Zeppelin, Jupyter).
2. Builder Skill
This engineer's profile is focused on resource monitoring, provisioning, pipelines, and data volume. This professional knows how to bring up a cluster from scratch and has a solid knowledge of the available APIs and libraries.
Generally, this professional has training in information systems and project management, and does the hard work of storing terabytes of data and keeping the collection services up and running.
- Cloud Storage and Warehouse Services (AWS S3, Google Cloud Storage, BigQuery, Redshift);
- Streaming Platforms (Kafka, Kinesis, Storm);
- Management of VMs (EC2, GCE);
- Container Orchestration (Kubernetes, AWS Fargate, Mesos);
- Provisioning and Monitoring Tools (Terraform, New Relic, ELK);
- Maintenance of Distributed Data Stores (Elasticsearch, MongoDB clusters, MySQL, Oracle, PostgreSQL).
3. Developer (Code)
This engineer specializes in developing big data applications, whether batch or real-time.
This developer has strong knowledge of software architecture, standards, and programming languages. Usually this professional has moved from software development into big data and has a background in computer science or information systems.
- Programming in Java, C++, and/or Go, and in functional languages (Scala, Clojure, Elixir);
- Paradigms of distributed programming (Channels, Actors);
- SQL and NoSQL Interfaces (KQL, Elasticsearch API);
- ETL APIs / Platforms (Spark, Airflow, Luigi, NiFi).
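To make the developer profile a bit more concrete, here is a minimal sketch of the channel paradigm listed above, applied to a tiny batch ETL job. It is only an illustration under stated assumptions: the sample records, the "user,count" row format, and the function names (`extract`, `transform_and_load`, `run_pipeline`) are all hypothetical, and Python's standard `queue.Queue` stands in for the channel that a language like Go or a platform like Kafka would provide.

```python
import queue
import threading

# Hypothetical sample records; a real pipeline would read from a file, API, or topic.
RAW_EVENTS = ["alice,3", "bob,5", "alice,4", "malformed", "bob,1"]

_DONE = object()  # sentinel that tells the consumer the channel is closed


def extract(channel, rows):
    """Producer: push raw rows into the channel, then signal completion."""
    for row in rows:
        channel.put(row)
    channel.put(_DONE)


def transform_and_load(channel, totals):
    """Consumer: parse each row, skip bad ones, and aggregate counts per user."""
    while True:
        row = channel.get()
        if row is _DONE:
            break
        parts = row.split(",")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # drop rows that do not match the assumed "user,count" format
        user, count = parts[0], int(parts[1])
        totals[user] = totals.get(user, 0) + count


def run_pipeline(rows):
    """Wire producer and consumer together through a bounded channel."""
    channel = queue.Queue(maxsize=2)  # small bound so the producer waits on a slow consumer
    totals = {}
    producer = threading.Thread(target=extract, args=(channel, rows))
    consumer = threading.Thread(target=transform_and_load, args=(channel, totals))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return totals


if __name__ == "__main__":
    print(run_pipeline(RAW_EVENTS))
```

The two stages share nothing except the channel, which is the point of the paradigm: either side could be replaced (for example, by a real streaming source or a database writer) without changing the other.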
These are the three main profiles of data engineers; of course, there are always exceptions to the rule. The data engineering field is relatively new, and this is simply what I observe in the marketplace where I work. In consulting companies, it is common to see a professional with an analytical profile performing the tasks of a builder profile, and vice versa.