Purpose of the Role:
We are seeking a skilled Lead Data Engineer to join our dynamic team and contribute to the design and implementation of data-driven solutions. You will be responsible for developing and optimizing distributed data processing pipelines, enabling large-scale data analytics, and ensuring the efficient handling of big data. If you are passionate about working with cutting-edge technologies in a fast-paced environment, this role is for you.
Duties and Responsibilities:
- Design, develop, and maintain data pipelines using AWS Glue and Apache Spark (PySpark) to process and transform large-scale datasets efficiently.
- Collaborate with data scientists, analysts, and engineers to understand data requirements and translate them into scalable solutions.
- Optimize data pipelines for performance and scalability in distributed environments.
- Build and deploy big data solutions in cloud environments (e.g., AWS, Azure, GCP).
- Implement streaming data pipelines using Spark Structured Streaming, Kinesis, or Kafka where business requirements demand real-time processing.
- Develop and maintain data models, ensuring data integrity and consistency.
- Troubleshoot and debug issues in existing pipelines, ensuring high reliability and availability of systems.
- Document technical solutions, data flows, and pipeline architecture to ensure knowledge sharing.
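To give candidates a feel for the day-to-day work, here is a minimal, purely illustrative sketch of the kind of transformation step these pipelines involve (plain Python for readability; in practice this logic would typically run as PySpark inside an AWS Glue job, and all field names are hypothetical):

```python
from datetime import datetime, timezone

def transform_records(raw_records):
    """Clean, validate, and deduplicate raw event records.

    Drops records missing a primary key, normalizes timestamps to UTC,
    and keeps only the latest record per key -- the kind of step that
    would run at scale inside a Glue/PySpark job.
    """
    latest = {}
    for rec in raw_records:
        key = rec.get("event_id")
        if key is None:  # reject records without a primary key
            continue
        # Normalize the timestamp to an aware UTC datetime.
        ts = datetime.fromisoformat(rec["ts"]).astimezone(timezone.utc)
        cleaned = {**rec, "ts": ts}
        # Deduplicate: keep the most recent record per event_id.
        if key not in latest or ts > latest[key]["ts"]:
            latest[key] = cleaned
    return list(latest.values())
```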
Required Experience & Knowledge:
- 3+ years of experience in data engineering with a proven track record of designing and implementing production data pipelines
- Senior-level proficiency with AWS Glue, including Glue ETL jobs, Crawlers, Data Catalog, job orchestration, and Glue Studio
- Strong experience with Apache Spark (PySpark) for large-scale data processing in distributed environments
- Deep knowledge of dimensional data modeling techniques including star schema, snowflake schema, and slowly changing dimensions
- Hands-on experience designing and optimizing data warehouses (Redshift, Snowflake, or similar platforms)
- Proficiency in Python with strong understanding of data structures, algorithms, and software engineering best practices
- Production experience with AWS services including S3, Glue, Redshift, Athena, EMR, Lambda, Step Functions, and CloudWatch
- Experience implementing data quality frameworks, data validation, and monitoring solutions
- Knowledge of ETL design patterns, error handling, retry logic, and pipeline orchestration
- Proficiency working with data formats such as Parquet, Avro, ORC, JSON, and CSV
- Strong understanding of data lake and lakehouse architectures, partitioning strategies, and schema evolution
- Hands-on experience with infrastructure-as-code (Terraform, CloudFormation) and CI/CD pipelines for data workflows
- Experience with dbt (data build tool) or similar SQL-based transformation frameworks is a strong plus
- Familiarity with data governance, metadata management, and data lineage tools (AWS Glue Data Catalog, DataHub, etc.)
- Experience mentoring engineers and leading technical design discussions
Skills and Attributes:
- Strong analytical and problem-solving abilities, with attention to detail.
- Excellent collaboration and communication skills to work in cross-functional teams.
- Ability to adapt quickly to new technologies and a fast-paced work environment.
- High level of ownership and accountability for deliverables.
Required Education & Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field (or equivalent practical experience).
- Advanced level of spoken and written English.
- Relevant certifications in big data technologies, cloud platforms, or Spark are a plus.