The DevOps Solutions Architect is a member of the IT Enterprise Applications team, focusing on DevOps, SRE, and AI Ops.
- DevOps focuses on the end-to-end application lifecycle.
- SRE focuses on delivery and the stability of the production environment.
- AI Ops is centered on the deployment, oversight, and monitoring of AI-specific elements.
You are responsible for the smooth operation of Teknion’s Enterprise Applications infrastructure. You have an essential role in integrating the various project solutions into the existing applications and infrastructure.
You will interface directly with your senior technology leaders to transform business and technology capabilities. You will be a dedicated contributor to senior leaders as they define a target state and roadmaps and identify new and emerging technologies that will transform and optimize the business.
Roles & Responsibilities
DevOps
- Design and implement end-to-end, highly scalable, and resilient solutions for infrastructure and application services within Teknion’s hybrid clouds
- Software Automated Delivery
- Design, implement, and manage Continuous Integration / Continuous Delivery (CI/CD) pipelines to automate the build, test, and deployment processes
- Automate the configuration and management of systems and applications
- Design, implement, and manage Source Code Control (GitHub), and manage software components and build artifacts in a repository manager integrated into CI/CD pipelines
- Test automation driven by Test-Driven Development strategies, in partnership with development, leading to increased code quality and confidence
- Web Application Firewall & Reverse Proxy
- Manage application security posture protecting APIs & Web Applications at the edge
- Manage the deployment and configuration of application-based definitions
- Integrate WAF into CI/CD pipelines to ensure security is built into the development process
- Align & implement WAF policies with industry & organization standards
- Implement and manage Reverse Proxy and Web Application Firewalls (Cloudflare WAF) to provide a unified application security posture that protects APIs & Web Applications at the edge and reduces client-side risks
- Cyber Security
- Identifying and deploying cybersecurity measures by continuously performing vulnerability assessment and risk management. Address Common Vulnerabilities and Exposures (CVE) as per established procedures
- Other duties as assigned
Site Reliability
- Reliability and Availability
- Design and implement Development, QA, UAT and Production application & database environments
- Ensure application environments, tools, and approved third-party components are kept up to date as per established patching & update procedures. Liaise with vendors to manage the monthly patching exercises
- Ensure unplanned downtime is kept to a minimum (target: 0.00%)
- Implement automated processes wherever possible with continuous modernization and upgrade of existing processes / scripts
- Identity & Access Management (IAM)
- Manage, configure, and monitor application-related IAM actions
- Incident Response
- Responding to and mitigating production incidents.
- Conducting post-incident reviews (postmortems) to identify root causes and prevent future occurrences.
- Monitoring and Alerting
- Designing and implementing comprehensive monitoring systems to track system health and performance.
- Setting up effective alerting mechanisms to notify teams of potential issues.
- Participate in code reviews, security audits, and performance testing to maintain the integrity of Cloud and Hybrid solutions
- Change Management
- Managing and automating the deployment of software changes.
- Implementing safe deployment practices, such as canary releases and blue-green deployments.
- Participate in and contribute to IT / Cyber Security Change Advisory Board meetings
- Strategic Planning, Performance Optimization & Capacity Planning
- Drive continuous technology transformation to minimize technical debt
- Provide architecture direction for developers, recognizing custom and standard technical frameworks and GRC (Governance, Risk & Compliance) audit policies and procedures, including PII (Personally Identifiable Information) and CUI (Controlled Unclassified Information)
- Participate in defining target state technology architecture and roadmaps, and ensure alignment of initiatives
- Work closely with cross-functional application & infrastructure teams to produce comprehensive end-to-end solution opportunities
- AI Observability & Performance Monitoring
- Design and implement monitoring dashboards within APM tools (Prometheus, Grafana, Datadog, or New Relic) to track AI-specific metrics such as API latency, token utilization, and foundation model error rates.
- Set up cost-tracking alerts to monitor the consumption of Generative AI resources and prevent budget overruns in Development, QA, UAT, and Production environments.
- Strategic Governance & Data Compliance
- Provide architectural guidelines for developers to ensure AI applications strictly adhere to GRC audit policies, specifically preventing the leakage of Personally Identifiable Information (PII) and Controlled Unclassified Information (CUI) into public AI training sets.
- Maintain accurate Standard Operating Procedures (SOPs) detailing the failover and recovery mechanisms for AI-driven system capabilities.
- Stay up-to-date with the latest technologies and security trends to ensure our solutions remain innovative, secure, and cost-efficient
- Define and maintain documentation of architectural solutions and procedures (Standard Operating Procedures)
- Other duties as assigned
AI Ops
- AI Security & Edge Protection
- Configure and manage Web Application Firewalls (Cloudflare WAF) and API gateways to safeguard Generative AI endpoints from emerging threats like prompt injection and data exfiltration.
- Integrate security guardrails into the development process to automatically scan and intercept unsafe data payloads sent to external or internal AI foundation models.
- AIOps, Observability & Incident Governance
- AI Observability & Cost Tracking: Design and implement monitoring dashboards within APM tools (e.g., Prometheus, Grafana) to track AI-specific metrics like API latency, token utilization, and foundation model error rates. Establish cost-tracking guardrails across local and cloud environments.
- AIOps & Intelligent Event Correlation: Architect and implement AIOps platforms to ingest, aggregate, and correlate telemetry data
- Other duties as assigned
Skills & Qualifications
Education, Experience & Soft Skills
- Bachelor’s degree in information technology, software engineering, computer science, or a related field
- Proven experience in engineering and software architecture design.
- Must be self-motivated and driven. Strong ability to work with internal resources and vendors
Technical Skills
- Virtual Private Clouds and Cloud Platforms (AWS, Azure, GCP):
- Experience in managing Virtual Private Clouds (OSI Transport layer and above)
- Experience with cloud services such as EC2, S3, Azure VMs, Google Kubernetes Engine, etc.
- Understanding of cloud networking, security, and infrastructure as code.
- Containerization and Orchestration:
- Expertise in Docker for containerizing applications.
- Experience with Kubernetes or other orchestration tools for managing containerized workloads.
- Configuration Management:
- Familiarity with tools like Ansible for automating system configurations.
- CI/CD Tools:
- Expertise in CI/CD tools (Jenkins, GitHub Actions)
- Scripting and Programming:
- Proficiency in scripting languages like Python, Bash, or PowerShell.
- Understanding of programming concepts for building automation tools.
- Operating Systems & Server Management:
- Strong understanding of RHEL 8 (& above) and/or Windows Server.
- Networking:
- Networking knowledge, including TCP/IP, DNS, and load balancing.
- Identity & Access Management:
- Experience with Okta, Active Directory, and Azure Active Directory
- Observability, Monitoring and Logging:
- Experience with monitoring tools like Prometheus, Grafana, New Relic or Datadog.
- Familiarity with logging tools like ELK stack (Elasticsearch, Logstash, Kibana) or Splunk.
- Version Control:
- Experience with Source Code Control (GitHub)
- Web Application Firewall & Reverse Proxy:
- Cloudflare WAF, Cloudflare Reverse Proxy, Cloudflare Tunnel
- Databases:
- Postgres, OpenEdge, NoSQL, MongoDB, SQL Server
- AI & LLM Fundamentals:
- Fundamental understanding of Generative AI principles, foundation models (LLMs), tokenization, and basic prompt engineering lifecycle concepts.
- Basic familiarity with configuring vector databases or semantic caching mechanisms alongside standard database systems like Postgres, NoSQL, or MongoDB
The expected base salary range for this position is $110,000 - $130,000. Final base salary offers will reflect an assessment of the selected candidate's skills, demonstrated competencies, and adherence to our internal pay equity framework.
Teknion is committed to supporting a culture of diversity and accessibility across the organization, starting with the hiring process. It is our priority to remove barriers to provide equal access to employment and support a diverse workforce. Teknion welcomes and encourages applications from people with disabilities. Accommodations are available on request for candidates taking part in all aspects of the selection process. All information received in relation to accommodation will be kept confidential.
By applying for a position with Teknion, you understand that, should you be made an offer, it will be contingent on your undergoing and successfully completing a background check consistent with Teknion's employment policies. Background checks may include some or all of the following, based on the nature of the position: SSN/SIN validation, education verification, employment verification, credit check, and criminal check. You will be notified during the hiring process which checks are required for the position.