Accomplished IT professional (22+ years) with a strong work ethic and experience across the full system development life cycle, specializing in Site Reliability: implementing and automating quality assurance standards, performance testing, monitoring, system and application alerting, and incident management.
Qualifications include:
-6+ years of Site Reliability experience
-4+ years of Selenium test automation experience (functional and performance testing)
-12+ years of QA software testing experience
-Solid work history with well-known organizations such as Scratch.mit.edu, Virtustream (a Dell company), LexisNexis, Bed Bath & Beyond, JPM Chase, Princeton Review, McGraw Hill, etc.
-Excellent communication and presentation skills; self-starter, quick learner, team player.
-Certified Scrum Master and AWS Certified Solutions Architect – Associate
Help build out all new functions and have more oversight of these processes. Day to day, this candidate will set up SLIs and SLOs, gather and monitor the underlying data, put that information into a dashboard, and report SLOs to key stakeholders. They will also build out CI/CD templates and relay them to the engineering team.
Application and System Monitoring:
Observability (Datadog):
-PagerDuty integration and maintenance.
-Migrating monitoring from Pingdom, New Relic and Sentry
-Created dashboards for APM services, various AWS cloud and on-prem infrastructure, and synthetic checks
-Created custom checks for RAID wear level on our MySQL hosts
-Created observability (alerts and dashboards) for AWS serverless services (Lambda)
-Created S3 bucket, Glue tables, Work Group & Database for Fastly CDN Access Log streaming
-Implemented Datadog synthetic-check SLOs as a proof of concept
SRE Standards and Processes:
-Created Incident Management Process covering registration (Jira), communication (Slack to appropriate parties), diagnosis, resolution, incident closure, and post-mortem
-Created Post Mortem process, template and guidelines
-Created workflow for internal teams to manually create PagerDuty incidents from Slack using the Slack/PagerDuty integration.
-Updated PagerDuty services to receive Datadog alerts, and created event rules for dynamic notifications based on alert severity to reduce red blindness
-SRE best practices such as Team Charter, Incident Management process, Production Readiness checklist, etc.
Operations:
-Served periodically as Release Captain (GitHub)
-Incident Management On-Call which includes triage, notifications, documentation, diagnosis and resolution.
-Implemented encryption at rest and in flight (TLS) for our Fastly CDN services to our AWS VPC and on-prem.
-Various Certificate renewals
-Created AWS CloudFront distributions for static S3 bucket websites.
Scrum Master for many teams including my own SRE team
Educated organization by encouraging Agile and Scrum best practices.
Containerized Highly Available HashiCorp Vault architecture for credential management.
Created POC for containerizing a SNOW MID Server using Docker
Created alerting and monitoring (Zabbix) and incident management for our SNOW VM provisioning workflow. Alerts triggered PagerDuty, which routed to the appropriate on-call engineer, who then used our custom Zabbix dashboard and runbook to resolve the incident.
PagerDuty integration and maintenance.
Scrum Master for small SRE Team
IaaS / PaaS: used AWX (Ansible) to configure VMs (deploy Splunk agent, deploy Trend agent, and AD join) after another team completed building them. The builds were triggered by a SNOW workflow that started automation in vCenter to provision each VM.
Developed SLO & SLI process for critical services. Using New Relic and, for the most part, Splunk, we created a scalable process that recorded RED (rate, errors, duration) metrics for previously identified key transactions of the top 4 critical services.
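As a rough illustration of the RED-based SLI/SLO process described above, the Python sketch below computes rate, error ratio, and p95 duration for one key transaction and checks an availability SLI against a target. The `Request` record, the function names, the choice of the 95th percentile, and the target values are assumptions made for this example only; the actual process relied on New Relic and Splunk queries that are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    """One observed call to a key transaction (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_metrics(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Return Rate, Errors, and Duration (p95) for one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }


def availability_sli(requests: List[Request]) -> float:
    """Fraction of successful requests; an empty window counts as fully available."""
    if not requests:
        return 1.0
    return sum(1 for r in requests if r.ok) / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


if __name__ == "__main__":
    # Illustrative data: 19 fast successes and 1 slow failure in a 10-second window.
    sample = [Request(100.0, True)] * 19 + [Request(900.0, False)]
    print(red_metrics(sample, window_seconds=10.0))
    print(meets_slo(availability_sli(sample), target=0.99))
```

Running the same calculation per key transaction and per service is what would make such a process scale to additional services.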
IT solution development
AWS Certified Solutions Architect – Associate
AWS Certified Cloud Practitioner