Data Infrastructure Engineer

at Heetch

Data Engineering Team @ Heetch

Our team's mission is to help the company generate confident insights, make better decisions, and build data-driven products. We believe the data platform is the digital nervous system of Heetch, and that empowering everyone in the company with data access is critical to our business success. As a new sub-team within Data Engineering, the Data Infrastructure team is dedicated to designing, building, and scaling our data platform and the underlying data infrastructure.

What will be your role?

You will enable Data Scientists, Data Analysts, and Operations teams, tailor the data platform to their needs, and empower them to solve challenging ML and analytics problems. If you're experienced, passionate, and interested in leading the transformation of our data infrastructure, we would love to talk to you!

Does it sound like you?
  • You've architected, built, scaled, tuned and maintained large-scale distributed systems in a production environment, specifically on top of AWS.
  • You've got proven experience working with the data technologies that power data platforms (e.g., Spark, Presto, Kafka, Airflow, Avro, Redshift, Elasticsearch).
  • You've led DevOps efforts such as CI/CD, containerization, and monitoring in a data ecosystem.
  • You display strong coding skills in Python and Scala with a focus on maintainability, scale, and automation.
  • You love to work autonomously and take on unconstrained problems.
  • You can drive a vision, estimate the associated tasks and plan from development to delivery.
  • You take pride in sharing and gathering knowledge through documentation, advocacy, and immersion in stakeholders' use cases.
What will you do?
  • Build frameworks, libraries, and abstractions that make data processing, ingestion, and exposure easy and reliable
  • Automate the deployment and configuration management of data pipelines and services
  • Operate, support, and manage cloud-based data technologies (e.g., clusters, serverless applications, APIs, databases)
  • Monitor the health of the data platform through automation
  • Handle periodic on-call rotations
  • Enable data engineering and data science teams to run their pipelines through workflow management (see the sketch after this list)
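As a concrete illustration of the workflow-management point above, here is a minimal sketch of an Airflow DAG (Airflow is named among the challenges below). The DAG name, task commands, and schedule are hypothetical examples, not a description of Heetch's actual pipelines.

```python
# Minimal Airflow DAG sketch: orchestrate a Spark ingestion job and a
# validation step. All names and commands here are hypothetical examples.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-infrastructure",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_events_ingestion",  # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Submit a Spark job that lands raw events in the data lake.
    ingest = BashOperator(
        task_id="ingest_events",
        bash_command="spark-submit --deploy-mode cluster jobs/ingest_events.py",
    )

    # Check the landed partition before exposing it downstream.
    validate = BashOperator(
        task_id="validate_partition",
        bash_command="python jobs/validate_partition.py --hour '{{ ts }}'",
    )

    ingest >> validate
```

In practice, the frameworks and abstractions this role builds would wrap patterns like this, so data engineers and data scientists can declare pipelines without rewriting orchestration boilerplate each time.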
What will your challenges be?
  • Build the next generation of our data platform using open source big data technologies such as Kafka, Kafka Streams, Airflow, Spark, Metacat and Kubernetes
  • Enable data scientists to test and productionize various ML models to enhance the performance of our marketplace
  • Craft robust infrastructure foundations to support API-based data access, including Finatra microservices and AWS Lambda functions
  • Operate, support, and manage our MPP systems (Redshift, Presto)
  • Design change data capture from PostgreSQL databases to feed the data lake (see the sketch after this list)
  • Simplify data integration with Apache Gobblin
  • Enable dataset discovery, metadata exploration, and change notification
  • Unlock acceptance testing with Airflow, Spark, and Cucumber
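To make the change-data-capture challenge more tangible, here is a hedged sketch of one possible flow: consuming Debezium-style change events for a PostgreSQL table from Kafka and micro-batching them into S3. The topic, bucket, and consumer-group names are hypothetical, and a production pipeline would more likely write Avro with a schema registry (as the stack above suggests) rather than raw JSON.

```python
# Hypothetical CDC sink: read PostgreSQL change events from Kafka and land
# them in S3 as newline-delimited JSON micro-batches. Names are examples only.
import boto3
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # assumed broker address
    "group.id": "cdc-datalake-sink",     # hypothetical consumer group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["pg.public.rides"])  # hypothetical CDC topic

s3 = boto3.client("s3")
batch = []

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        batch.append(msg.value().decode("utf-8"))
        if len(batch) >= 1000:
            # Flush a micro-batch to the data lake, keyed by the last offset.
            key = f"cdc/rides/offset={msg.offset()}.json"
            s3.put_object(
                Bucket="example-datalake",  # hypothetical bucket
                Key=key,
                Body="\n".join(batch).encode("utf-8"),
            )
            batch.clear()
            consumer.commit()  # commit only after a successful flush
finally:
    consumer.close()
```

Committing offsets only after the S3 write succeeds gives at-least-once delivery, which is usually the right trade-off for a data-lake sink where downstream jobs can deduplicate.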