Machine Learning Operations Engineer II

About the job

Kensho is S&P Global’s hub for AI innovation and transformation. With expertise in machine learning, natural language processing, and data discovery, we develop and deploy novel solutions to innovate and drive progress at S&P Global and for its customers worldwide. Kensho’s solutions and research focus on business and financial generative AI applications, agents, data retrieval APIs, data extraction, and much more. The MLOps team is the de facto ML platform team at Kensho. Our team’s mission is critical: empower our ML engineers with state-of-the-art processes, tooling, and infrastructure to iterate quickly, build reliably, and identify potential production issues early.

Responsibilities

Iterate on Kensho’s ML processes to develop tools, services, and frameworks that make every stage of the ML workflow robust, auditable, and usable.

Work closely with ML engineers to understand their unique processes, identify pain points, and form effective solutions.

Empower engineers with the stable tooling they need to experiment rapidly and turn their research into demonstrable prototypes and mature products

Provide resources and training for ML teams on best practices, enabling them to efficiently productionize their work to be leveraged by high-value products and services

Evaluate, select, and champion open source and third-party solutions, driving their adoption across teams and integrating them into Kensho’s existing platform ecosystem

Ship scalable, efficient, and automated processes for model fine-tuning and reinforcement learning and for the evaluation of LLMs/Agents

Improve LLM and agentic observability to help monitor agentic applications in production, detecting performance, decay, and drift issues

Stay at the frontier by actively tracking emerging tools and frameworks, promoting best practices, and strengthening the team’s technical expertise with your unique skill set

Qualifications

Minimum

2+ years of experience in ML infrastructure, MLOps, ML engineering, or a similar field

Experience managing distributed systems with Kubernetes. It is important to understand Kubernetes concepts and trade-offs

Cloud Platform (AWS) understanding. We utilize tools like EKS and managed ML services like Bedrock and SageMaker

Python proficiency (we are mostly a Python shop)

Familiarity with distributed computing frameworks and workflow orchestration (e.g., Ray, Airflow)

Familiarity with software engineering best practices in an ML context

Some basic understanding of ML concepts, LLMs and agents

Ability to debug distributed systems across infrastructure, networking and application layers

Excellent communication skills to drive adoption of new tools and best practices across multiple teams

Someone who’s very curious, driven, low-ego and eager to learn across a range of engineering disciplines, while being part of a fantastic team

Preferred

Experience with Agentic AI systems, tools, frameworks and workflows

Experience with running workflows on Ray

Experience with MCP server patterns