SecurityScorecard is hiring an Ops Engineer who is motivated to help us continue automating and scaling our infrastructure, and who can bridge the gap between our global development and operations teams. The Ops Engineer will be responsible for setting up and managing project development and test environments, as well as the software configuration management processes for the entire application development lifecycle. Your role will be to ensure the optimal availability, latency, scalability, and performance of our product platforms. You will also be responsible for automating production operations, promptly notifying backend engineers of platform issues, and checking long-term quality metrics.
Our infrastructure is based on AWS with a mix of managed services like RDS, ElastiCache, and SQS, as well as hundreds of EC2 instances managed with Ansible and Terraform. We are actively using three AWS regions, and have equipment in several data centers across the world.
This role can be remote within North America or based at our HQ in New York, NY.
Responsibilities
- Training, mentoring, and lending expertise to coworkers with regard to operational and security best practices.
- Reviewing and providing feedback on GitHub Pull Requests for both team members and development teams; a significant percentage of our Software Engineers have written Terraform.
- Identifying opportunities for technical and process improvement and owning the implementation.
- Championing the concepts of immutable containers, Infrastructure as Code, stateless applications, and software observability throughout the organization.
- Systems performance tuning with a focus on high availability and scalability.
- Building tools to improve the usability and automation of processes.
- Keeping products up and operating at full capacity.
- Assisting with migration processes as well as backup and replication mechanisms.
- Working in a large-scale distributed environment with a focus on scalability, reliability, and performance.
- Ensuring proper monitoring and alerting are configured.
- Investigating incidents and performance lapses.
Come help us with projects such as…
- Extending our compute clusters to support low latency, on-demand job execution
- Turning pets into cattle
- Cross-region replication of systems and corresponding data to support low latency access
- Rolling out application performance monitoring to existing services, extending integrations where required
- Migration from self-hosted ELK to a SaaS stack
- Continuous improvement of CI/CD processes, making builds and deployments faster, safer, and more consistent
- Extending a global VPN WAN to a data center with IPsec and BGP
Requirements
- 3+ years of DevOps and/or Operations experience
- 1+ years of production environment experience with Amazon Web Services (AWS)
- 1+ years using SQL databases (MySQL, Oracle, Postgres)
- Scripting ability (Bash, Python; C++ a plus)
- Strong experience with CI/CD processes (Jenkins, Ansible) and automated configuration tools (Puppet, Chef, Ansible)
- Experience with container orchestration (AWS ECS, Kubernetes, Marathon/Mesos)
- Ability to work as part of a highly collaborative team
- Understanding of monitoring tools like Datadog
- Strong written and verbal communication skills
Nice to Have
- You knew exactly what we meant by "Turning pets into cattle".
- Experience working with Kubernetes on bare metal and/or Amazon Elastic Kubernetes Service (EKS).
- Experience with RabbitMQ, MongoDB, or Apache Kafka.
- Experience with Presto or Apache Spark.
- Familiarity with computation orchestration tools such as HTCondor, Apache Airflow, or Argo.
- Understanding of networking concepts: OSI layers, firewalls, DNS, split-horizon DNS, VPNs, routing, BGP, etc.
- A deep understanding of AWS IAM and how it interacts with S3 buckets.
- Experience with SAFe.
- Strong programming skills in 2+ languages.