Data Engineer – Lakehouse / Spark / NiFi
Engineering Role Details
Posted May 28, 2025

At TeamStation AI, we are on a mission to bring together the brightest minds to solve tomorrow’s toughest technology challenges. Our work is about more than AI: it is about building the future through collaboration and innovation. We believe that the key to solving the world’s most complex problems lies in aligning diverse talents and perspectives. Our AI-powered platform enables cutting-edge scientific and technical teams to work smarter, faster, and together. By joining us, you’ll help unlock new technological breakthroughs and drive innovation where it matters most.
Join the Mission at TeamStation AI!
We are seeking visionaries, innovators, and problem solvers who thrive in fast-paced, collaborative environments. If you’re passionate about AI, technology, and solving critical challenges, we want to hear from you. Come be part of a team where your ideas can drive the future.
Data Engineer – Lakehouse / Spark / NiFi: Architecting Next-Gen Data Platforms
Location: Remote (Latin America)
We are seeking a highly skilled and experienced Data Engineer to join our team. In this pivotal role, you will be instrumental in designing and implementing a scalable, open-source-driven data lakehouse platform. This is a hands-on position focused on building high-performance pipelines, optimizing schema evolution, and supporting sophisticated analytical workloads.
Key Responsibilities
- Schema Management: Lead the design and management of Apache Iceberg schemas across Bronze, Silver, and Gold data layers, with a strong emphasis on partitioning strategies, metadata optimization, and time-travel capabilities (see the first sketch after this list).
- ETL Pipeline Development: Develop and orchestrate robust Bronze-to-Silver and Silver-to-Gold data pipelines utilizing Apache NiFi, Apache Airflow, and/or Apache Spark.
- Data Integration & Tuning: Integrate and fine-tune ETL processes for diverse structured and semi-structured data formats, including CSV, JSON, and EDI.
- Data Quality & Governance: Implement comprehensive data quality, deduplication, and change detection logic, employing strategies such as hash-based methodologies (see the pipeline sketch after this list).
- Analytics Collaboration: Collaborate effectively with analytics teams to ensure performant Trino queries and support dashboarding initiatives within Superset.
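To make the schema-management bullet concrete, here is a minimal PySpark sketch of the kind of Iceberg work involved: creating a partitioned Silver-layer table, evolving its schema, and reading the table at an earlier point in time. The catalog name (lake), namespace, table, and columns are illustrative assumptions, not details of our platform.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is on the classpath and a catalog
# named "lake" is already configured for the cluster.
spark = (
    SparkSession.builder
    .appName("iceberg-schema-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Silver-layer table with hidden partitioning on the event day; readers and
# writers never reference the partition column directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.silver.claims (
        claim_id STRING,
        payer    STRING,
        amount   DECIMAL(12, 2),
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution in Iceberg is a metadata-only operation.
spark.sql("ALTER TABLE lake.silver.claims ADD COLUMN adjudication_status STRING")

# Time travel: query the table as it looked at an earlier moment.
# The timestamp is a placeholder; snapshot IDs work the same way.
historical = spark.sql(
    "SELECT * FROM lake.silver.claims TIMESTAMP AS OF '2025-05-01 00:00:00'"
)
```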
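And a hedged sketch of the Bronze-to-Silver pattern behind the pipeline and data-quality bullets: a deterministic row hash drives change detection, and a MERGE keeps the Silver table idempotent, so re-running a load over the same Bronze data is a no-op. Table names and the claim_id business key are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver-sketch").getOrCreate()

bronze = spark.table("lake.bronze.claims_raw")

# Deterministic hash over the payload columns (nulls coalesced to empty
# strings so they cannot silently collapse into neighboring values): an
# unchanged row hashes the same, a changed hash signals an update, and a
# previously unseen key signals an insert.
payload_cols = [c for c in bronze.columns if c != "ingested_at"]
staged = bronze.withColumn(
    "row_hash",
    F.sha2(
        F.concat_ws(
            "||",
            *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in payload_cols],
        ),
        256,
    ),
)
staged.createOrReplaceTempView("staged")

# MERGE makes the load idempotent: rows whose hash is unchanged are skipped.
spark.sql("""
    MERGE INTO lake.silver.claims AS t
    USING staged AS s
      ON t.claim_id = s.claim_id
    WHEN MATCHED AND t.row_hash <> s.row_hash THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```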
Tech Stack & Tools
- Table Formats: Apache Iceberg, Delta Lake, Apache Hudi, Hive external tables
- ETL & Transformation: Apache Spark (SQL/DataFrame API), Apache NiFi, Apache Airflow
- Programming Languages: Python, Scala
- Distributed Query Engines: Trino, Presto, Athena
- Cloud & Storage: Azure Data Lake Storage Gen2, S3, GCS
- Version Control: Git
Desired Qualifications
- Data Engineering Experience: 5+ years of hands-on experience in Data Engineering with a proven track record of developing robust ETL pipelines.
- Table Format Expertise: Solid experience with table formats like Apache Iceberg (preferred!), Delta Lake, Apache Hudi, or Hive external tables with ORC/Parquet. Proficiency in managing ACID tables and schema evolution in a data lake context is essential.
- ETL & Orchestration: Demonstrated proficiency with Apache Spark (SQL/DataFrame API) for building large-scale batch and streaming pipelines. Experience with Apache NiFi, Apache Airflow, or similar orchestration frameworks is required.
- Programming Prowess: Expert-level proficiency in Python and/or Scala.
- Query Engine Proficiency: Hands-on experience with distributed query engines such as Trino, Presto, or Athena, combined with strong SQL skills and an understanding of query optimization, predicate pushdown, and partition pruning (illustrated in the sketch after this list).
- Cloud & Storage Acumen: Practical knowledge of Azure Data Lake Storage Gen2, S3, or GCS, including object storage layout and file formats (Parquet, Avro). Understanding of access control mechanisms (e.g., Azure RBAC, managed identity).
- Data Architecture & Governance: Familiarity with Bronze/Silver/Gold layer design in lakehouse environments. Experience with hashing strategies for change detection and time-travel queries. Comfort working in HIPAA/PII-sensitive environments, with a focus on de-identification and auditability.
- DevOps & CI/CD: Hands-on experience with Git for managing data schema and script assets.
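As referenced in the query-engine bullet above, here is a short illustration of partition pruning and predicate pushdown, reusing the hypothetical partitioned lake.silver.claims table from the earlier sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

# Filtering on the column behind the days(event_ts) partition transform lets
# Iceberg drop whole data files from the scan using partition metadata; the
# same predicate is pushed into the Parquet reader for row-group skipping.
recent = spark.table("lake.silver.claims").where(F.col("event_ts") >= "2025-05-01")

recent.explain()  # the physical plan should show pushed filters and fewer files scanned
```

The same reasoning carries over to Trino: filters on a partition transform's source column prune Iceberg manifests before any data files are opened.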
Nice-to-Have (Bonus Points!)
- Experience with Apache Iceberg utilizing a JDBC or Nessie catalog (a configuration sketch follows this list).
- Familiarity with Open Policy Agent (OPA) or column-/row-level security.
- Prior experience in highly regulated environments such as healthcare or financial services.
- Exposure to Superset, Looker, or Power BI on top of data lake engines.
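For the first item above, a minimal sketch of pointing Spark at an Iceberg JDBC catalog; every URI, warehouse path, and catalog name below is a placeholder assumption.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-jdbc-catalog-sketch")
    # Register a catalog named "lake" backed by Iceberg's JDBC catalog.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
    .config("spark.sql.catalog.lake.uri", "jdbc:postgresql://catalog-db:5432/iceberg")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN lake").show()
```

A Nessie-backed catalog is wired the same way, swapping catalog-impl for org.apache.iceberg.nessie.NessieCatalog and adding Nessie's uri and ref settings.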
What We Offer
- Competitive Compensation: A highly competitive salary package reflecting your expertise and contributions.
- Fully Remote Work: Enjoy the flexibility of a fully remote work environment within Latin America.
- Career Growth & Development: Opportunities for continuous learning, professional development, mentorship, and upskilling in cutting-edge technologies.
- Impactful Projects: Contribute to high-impact data workflows and help shape a modern, open-source lakehouse architecture from the ground up as part of a TeamStation AI partnership.
- Collaborative Environment: Work alongside a cross-functional team of talented engineers and analysts.
Learn more about TeamStation AI Careers and our mission: https://teamstation.dev/home/careers