In this role, you’ll work closely with model researchers, data infrastructure engineers, and cross-functional partners to make sure our data is high quality and can be produced at petabyte scale in a reliable, efficient way. From understanding how data choices show up in model behavior, to building processing pipelines and running the compute behind them, you’ll help ensure our models are trained on the best data we can get.
What you’ll do
Work with model researchers to define what “good data” means for our models, including quality metrics, validation checks, and acceptance thresholds
Explore open-source datasets and create internal ones best suited to building fundamental World Models
Build algorithms for automated data quality assessment, data domain mixtures, and domain adaptation from synthetic to real data
Track datasets, metadata, provenance, and versions so experiments are reproducible and it’s clear what data went into which training and evaluation runs
Own CI/CD and development tooling for the data stack (GitHub, Python, PyTorch), and automate repetitive workflows to reduce friction
Track and optimize throughput, storage, and compute utilization across pipelines and related assets
What we’re looking for
Strong ML and deep learning fundamentals with experience building and operating large-scale data and/or compute systems
Comfortable moving between research questions and production engineering: you can dig into data, run analyses, and also ship reliable systems
Demonstrated research experience with data composition, data quality, and dataset releases
Ability to design and execute experiments with convincing, unbiased outcomes
Practical experience with distributed processing and orchestration (Spark, Ray, Airflow, or equivalents)
Solid Python skills, and familiarity with the tooling around modern model training workflows (datasets, checkpoints, experiment tracking)
Strong instincts around data quality: how to measure it, how to monitor it, and how to prevent regressions as things scale
Able to work in a fast-moving environment, prioritize what matters, and communicate clearly with both researchers and engineers
Bonus: experience with large video datasets, dataset curation for training, or building internal tooling for evaluation/analysis in ML environments
Reka's Mission
Reka's mission is to build useful multimodal artificial intelligence and use it to empower organisations and businesses. We are a globally distributed foundation model startup, headquartered in the San Francisco Bay Area, California. Embracing a remote-first approach, our team brings together top talent from around the world. Our founding team, along with many of our team members, has contributed to many of the breakthroughs in AI over the past decade.
Why Reka?
An Elite Team: Collaborate with top-tier engineers, researchers, and operators from renowned organizations such as Google DeepMind and Facebook AI Research (FAIR), as well as from successful startups, driving innovation in cutting-edge AI technology.
Massive Market Opportunity: Be part of a rapidly growing industry poised to transform multiple sectors globally, offering the chance to make a significant impact.
Mission-Driven Environment: Work alongside a collaborative, mission-focused team dedicated to advancing AI for meaningful applications.
Inclusive and Open Culture: Thrive in an open and inclusive work environment that values diverse perspectives and fosters creativity.
Generous Benefits: Enjoy 5 weeks of paid leave to recharge, comprehensive healthcare benefits including vision and dental, and additional perks that support your well-being.
Visa Support: We provide visa assistance for US employees, including H-1B and OPT transfers, to ensure a smooth transition and support your career with us.
