We are seeking a Senior Data Engineer to join our Applied AI group and build the data pipelines and tools that power our AI-driven scientific platform. In this role, you will design, implement, and maintain large-scale data architectures, ensuring secure and efficient data flows for advanced machine learning and generative AI models. You will also collaborate closely with scientists, software engineers, and product managers to turn internal algorithms and diverse datasets into production-ready tools and knowledge bases. This is a unique opportunity to shape the foundation of our AI capabilities, leveraging distributed systems, AWS, and Kubernetes to deliver the scalable, reliable solutions that drive our agentic AI systems.
Key Responsibilities:
- Data Pipeline Development: Design and implement robust, scalable data pipelines to support machine learning and generative AI workflows, including Retrieval-Augmented Generation (RAG).
- Distributed Systems: Architect and manage distributed data-processing systems that handle large volumes of structured and unstructured data in real time.
- Cloud & Infrastructure: Leverage AWS services (e.g., S3, EC2, Lambda) to build highly available, fault-tolerant data solutions; utilize Kubernetes for container orchestration and scalability.
- Integration & Collaboration: Work cross-functionally with scientists, engineers, and product managers to define platform requirements, integrate new data sources, and ensure seamless data flow into AI/ML pipelines.
- Data Governance & Quality: Establish best practices for data security, compliance, and quality assurance, ensuring the reliability and integrity of all datasets used in production.
- Performance Optimization: Monitor and optimize data workflows for throughput, fault tolerance, and cost efficiency; implement robust logging, monitoring, and alerting for production readiness.
Qualifications:
- Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Professional Experience: 5+ years building and maintaining production-grade data pipelines or distributed systems.
- Proficiency in Python: Strong Python skills with a solid grasp of object-oriented programming principles and common data engineering libraries/frameworks.
- Relational Databases: Fluency with relational databases (e.g., PostgreSQL), including schema design, query optimization, and data governance.
- AWS Expertise: Hands-on experience with AWS cloud services for data ingestion, storage, and processing; comfortable designing and deploying infrastructure-as-code solutions.
- Distributed Systems Knowledge: Demonstrated ability to implement and manage distributed data-processing systems (e.g., Spark, Kafka, or similar).
- Communication & Collaboration: Exceptional communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders.
- Experience with ML & Generative AI: Prior work on data pipelines specifically supporting ML or generative AI models; familiarity with the MLOps lifecycle.
- Retrieval-Augmented Generation (RAG): Hands-on experience with RAG techniques and knowledge bases for AI systems.
- Kubernetes Proficiency: Comfort with container orchestration and scaling using Kubernetes.
- Agentic AI Systems: Exposure to, or experience building, agent-driven platforms where AI systems autonomously execute complex tasks.
- Startup Environment: Experience adapting quickly and delivering results in a fast-paced, evolving environment.
- Domain Background: Exposure to life sciences, material sciences, or related fields.