About Posh
We are all social creatures, but the dominant “social” companies today have evolved into digital loneliness machines, driving isolation, anxiety, and mental health challenges around the world.
Human connection is lost. Posh is a beacon guiding us back.
Posh enables anyone to build an IRL community based on shared interests, while connecting consumers with communities of people just like them. Founded by event organizers who were frustrated with the growing loneliness epidemic and the tools available to build their own event brand, we’ve built the ultimate platform for launching, monetizing, and finding IRL communities of people just like you. In just 6 years, Posh has grown to a team of 70, expanded to 8M+ users, secured $70M in venture funding, and facilitated over $350M in transactions.
About The Role
We are looking for an experienced Senior Data Scientist to own the evaluation framework for our AI agent, the data that feeds it, and the success analysis, testing, and metrics that determine how well it's working. As one of the early data hires at Posh, you'll shape the technical direction of our AI quality strategy and set the standards for how agent performance is defined, measured, and improved over time.
You'll work across the full evaluation lifecycle: identifying the right signals, building clean pipelines for evaluation data, designing tests and experiments, and communicating what "good" looks like to both technical and non-technical stakeholders. Your work will directly inform how we iterate on our AI agent and how we know when it's ready to ship.
You'll partner closely with Product and Engineering to define success criteria, ensure proper instrumentation, and architect evaluation datasets that reflect real user behavior. You'll establish best practices for data quality, governance, and documentation, ensuring our evaluation framework remains trustworthy and rigorous as we grow.
This role offers a high-growth opportunity as we expand our AI and data capabilities. If you're passionate about building from 0 to 1 and making a lasting impact, this is the role for you.
This is an in-person position at our New York City office, located in the heart of SoHo.
At a high level, you’ll be in charge of:
Building and Owning the AI Agent Evaluation Framework: Design and maintain the systems and methodologies we use to measure AI agent quality. Define what good looks like, build the rubrics and benchmarks, and own the feedback loops that drive iteration. Build ETL/ELT pipelines that transform raw behavioral, transactional, and interaction data into clean evaluation inputs.
Preparing High-Quality Data for AI Models: Own the data that feeds our evaluation pipeline, from ground truth datasets and labeled examples to behavioral signals and the semantic layer. Ensure every input is reliable, well-structured, and built to last.
Instrumenting Agent Tests, Experiments, and Success Metrics: Build the testing infrastructure to evaluate agent performance across accuracy, relevance, and user satisfaction. Run structured experiments and pre/post analyses to assess the impact of model and product changes, and build dashboards that keep the team aligned on performance trends and regressions before they become problems.
Collaborating with Product and Engineering on Instrumentation: Work closely with Engineering to ensure accurate logging of agent interactions and user signals. Partner with Product to translate business goals into evaluation criteria and measurement requirements.
Ensuring Strong Data Governance and Documentation: Implement best practices for data quality, documentation, observability, and lineage across all evaluation-bound datasets. Be the person who makes sure the foundation doesn't rot.
Our ideal candidate
Possesses 5+ Years of Full-Time Data Experience: Has at least 5 years of hands-on experience in data science or analytics engineering. Demonstrates a strong ability to design, build, and optimize scalable data systems.
Expert in SQL and Python: Demonstrates strong proficiency in SQL and Python, with deep experience cleaning data, engineering features, and building efficient, production-ready modeling pipelines.
Strong Ability to Analyze and Evaluate Models or AI Systems: Skilled in designing experiments, interpreting model performance, and communicating insights clearly to both technical and non-technical stakeholders.
Experience with AI/LLM Evaluation or Agent Quality: Has built or contributed to evaluation frameworks, golden datasets, or quality measurement systems for AI models or agent-based products in production environments.
Proficient with Modern Data and ML Tooling: Experienced working with cloud data platforms and analytics tooling, and familiar with best practices in data modeling, pipeline reliability, and ML optimization.
Has a Background on an Early-Stage Data Team: Exhibits high interest in startups and has experience building the early foundation of a data team at a small tech company.
LLM/Agent Tooling (Bonus): Experience with tools like LangSmith, Langfuse, or similar LLM observability and evaluation platforms, and/or working on consumer-facing AI products at scale.
Posh provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.
This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.
Posh is committed to providing reasonable accommodations for qualified individuals with disabilities in our job application procedures. Please let us know if you need assistance or accommodation due to a disability.
