About Us
- Establish a company-wide SLO/SLA system: define quantifiable reliability indicators (availability, latency, error rate) for each line of business, and use the error budget to drive release cadence and investment decisions
- Build an MTTD/MTTR measurement system, set tiered targets by severity, and optimize continuously: for P-1 incidents, target MTTD < 1 min and MTTR < 5 min
- Build fault self-healing capabilities: an automated detection → diagnosis → recovery pipeline that reduces reliance on manual intervention
- Promote chaos engineering: run regular fault drills to proactively uncover weak points in the system
- Establish a change risk control system: standardized canary releases, pre-change impact assessment, and automatic rollback
- Build a data-driven, closed-loop cost governance process, automated end to end: cost visibility → attribution analysis → optimization decisions → execution and verification → continuous monitoring
- Establish a rigorous capacity planning model based on the correlation between business indicators (QPS/TPS/number of users) and resource consumption, rather than arbitrary N-fold over-provisioning
- Promote the adoption of FinOps culture:
  - Cost allocation and showback by line of business and application
  - Define cost efficiency metrics ($/transaction, $/user, $/QPS) and benchmark them against the industry
  - Embed cost assessment into the resource request process so that 100% of new resources undergo capacity assessment
- Automated cost optimization engine:
  - Automatic identification of underutilized resources with downsizing recommendations (AI-based anomaly detection and prediction models)
  - Automated purchase decision system for Reserved Instances/Savings Plans
  - Optimized autoscaling strategies: pre-scaling based on predictive models to reduce over-provisioning
  - Automatic reclamation and lifecycle management of idle resources
- Goal: reduce annual cloud costs by 15-20% without affecting business SLOs
- Toil elimination: measure the team's toil ratio (target < 30%), and systematically identify and automate high-frequency repetitive operations
- Full GitOps/IaC adoption:
  - 100% of infrastructure defined as code, with all changes going through PR review and automated pipelines
  - Environment consistency: use IaC to detect and automatically repair configuration drift across dev/staging/prod
- Build intelligent operations (AIOps) capabilities:
  - AI-based alert aggregation, root cause analysis, and remediation suggestions
  - Automatic anomaly detection for logs and metrics, moving from reactive alerting to proactive discovery
  - Knowledge-base AI: query operational status and execute standard operations in natural language
- Build a self-service platform:
  - Business teams can complete more than 80% of routine operations (scaling up and down, configuration changes, permission requests) on their own
  - Target automated handling rate for operations tickets: > 60%
- On-call system optimization:
  - Alert accuracy > 95% (eliminating alert fatigue)
  - Build automated runbook execution capability
  - On-call quality measurement and continuous improvement
- Design and operation of a financial-grade network isolation architecture:
  - Design and implement network isolation strategies across multiple accounts, VPCs, and regions
  - Standardized management of security groups, endpoints, and dedicated lines across compliance sites
  - Zero Trust network architecture rollout: micro-segmentation, least privilege, dynamic access control
- Rapid build-out capability for compliance sites:
  - Goal: cut deployment of new compliance site infrastructure from weeks to hours (fully automated)
  - Standardized compliance site templates: one-click delivery of network topology, security policy, middleware, and monitoring
  - Automated inter-site isolation verification: regular automated scans to ensure no cross-site data leakage
- Multi-cloud and multi-region operations:
  - A unified operations abstraction layer across AWS/Tencent Cloud/Huawei Cloud that hides underlying differences
  - Cross-region disaster recovery architecture design: RPO/RTO definition and drill-based verification
  - Data sovereignty guarantees for independently deployed compliance sites (data residency, encryption, audit)
- Financial-grade assurance for the core wallet/transaction path:
  - Operational assurance for the cold/hot wallet isolation architecture
  - Zero-downtime change capability for the transaction path
  - Multi-active/disaster recovery failover SOPs and periodic drills
- Drive the team's transformation from "traditional operations and maintenance" to "Site Reliability Engineering": solving operations problems with engineering methods
- Establish an SRE competency model and growth path: define the capabilities expected at each level from P5 to P7 and how to measure them
- Establish knowledge capture and sharing mechanisms: runbooks, a post-mortem culture, internal tech talks
- Eliminate single-person dependencies: at least 2 people can handle each core system independently
- Talent pipeline: develop 2-3 senior SREs who can independently own reliability for a line of business
- More than 10 years of experience in infrastructure/operations/SRE, and more than 5 years of experience leading an SRE/infrastructure team of more than 10 people
- Deep understanding of SRE methodology: SLO/SLI/error budget, toil management, capacity planning, and incident management as practices you have applied, not just concepts
- Hands-on experience with large-scale cost management:
  - Managed environments with annual cloud spending above $5 million
  - Systematic FinOps experience (data-driven cost optimization, not arbitrary resource cuts)
  - Capacity modeling skills: able to forecast resource requirements from business metrics
- Deep hands-on experience with operations automation:
  - Track record of reducing toil from > 50% to < 30%
  - Proficient in IaC tools (Terraform/Pulumi/CloudFormation), with experience rolling them out at scale
  - Experience exploring and implementing AIOps or intelligent operations
- Operations experience in financial-grade/compliance environments:
  - Infrastructure operations experience in the financial industry (banks, exchanges, payments) or in environments with equivalent security requirements
  - Familiar with multi-account/multi-VPC network isolation architecture design
  - Experience independently deploying and operating multiple regions and compliance sites
  - Understanding of the infrastructure requirements imposed by data sovereignty rules and compliance frameworks such as PCI-DSS and SOC2
- Multi-cloud experience: AWS (required) + at least one other cloud (Tencent Cloud/GCP/Azure)
- Programming ability: able to build operations tools and automation systems in Go/Python (building systems, not just writing scripts)
- SRE management experience in cryptocurrency exchanges, traditional securities firms, or payment companies
- Experience operating Kubernetes at scale (100+ clusters/10,000+ nodes)
- Familiar with high-availability architectures for trading systems (primary/standby failover, multi-active deployment, zero-downtime releases)
- Experience in building internal cost platforms or FinOps tools
- Hands-on chaos engineering experience (Chaos Monkey/Litmus/in-house tooling)
- Experience with infrastructure preparation for compliance audits such as SOC2/ISO27001/PCI-DSS
Why Join Us
At Bybit, we are committed to fostering a supportive and enriching work environment.
Our benefits include:
- Study Growth Fund: We support your professional development and continuous learning.
- Internal Events: Participate in regular team-building activities, workshops, and events designed to promote collaboration and innovation.
- Global Collaboration: Be part of a diverse, international team, working alongside colleagues from around the world.
- Career Advancement: Access opportunities for growth and advancement within a rapidly expanding global company.
- Internal Mobility: Grow with us. Your long-term development is important to us, and we offer internal job opportunities to help you build your career path.
