SDLC SOP 1010 - Site Monitoring and Problem Management

Objective:

The objective of this Standard Operating Procedure (SOP) is to document the procedures for site monitoring and for problem management and resolution.

Scope:

Monitoring and Problem Management procedures provide for the determination of system status, the maintenance of systems to ensure appropriate execution, and the evaluation, escalation and resolution of problems should they occur.

Owner:

Operations


Definitions

Term Definition/Description
Program/Project Management The systematic execution of a System Development Life Cycle (SDLC) for a release or projects that have significant impact on an organization’s service delivery. This procedure oversees the SDLC execution; thus, it relies heavily on defined procedure activities and acceptance criteria for inputs and outputs.

Note: Every unit within SDLC interacts with Program/Project Management. Every release of new and enhanced features and functionality requires commitment and effort from all departments.

Project Manager The focal point throughout a project who ensures that each responsible party has completed its work with quality and in compliance with defined acceptance criteria. The Project Manager also acts as the conduit for communicating the progress of the project and decisions made throughout the process to the Project Sponsor, the Contracting Organization, and the Performing Organization.
Program Management Addresses oversight for a group of projects. Program Managers shoulder the responsibility for the successful completion of program objectives by supporting and developing project staff. Reporting at this level provides Executive Management with the information necessary to make informed decisions and execute actions that optimize benefits to the organization.
Program Manager The tactical manager who facilitates, monitors and communicates the progress and issues in implementing the strategic objectives of an approved program. The Program Manager works cross-functionally to develop the blueprint that integrates multiple release deliverables that enhance the program’s portfolio.
PMO The PMO is the organization that consolidates all project plans and reports the status to executive management. Impacts from individual projects can be seen from an organizational perspective and responded to rapidly. The PMO is where project and program standards, procedures, policies and reporting are established.
Business Gate A defined milestone in a project lifecycle when specific requirements must be met in order to make or validate business decisions relating to the project.
Lock-Down The milestone in a project schedule achieved when agreement exists between the Performing Organizations and the Contracting Organizations for the delivery of a defined project scope of work within a defined schedule at a defined cost.
Management Phase Review An event associated with selected business gates where specific decisions concerning the project are made by appropriate levels of management. Deviations in deliverables or timeframe are handled by convening the Gate 6 Review Board. This group will make any decisions concerning scope, cost, and schedule tradeoffs. These business gates are:
  • G-11: Project Strategy Lock-Down
  • G-10: Requirements Scope Lock-Down
  • G-6: Project Lock-Down
  • G-4: Begin Validation
  • G-2: Begin FOA
  • G-0: General Availability


SDLC Business Gates Program/Project Management is the foundation of the SDLC Business Gates. This Systems Development Life Cycle (SDLC) begins at project initiation and moves through deployment to the production environment.
Phase A collection of logically related project activities, usually culminating in the completion of a major deliverable. The conclusion of a project phase is generally marked by a review of both key deliverables and project performance in order to determine whether the project should continue into its next phase as defined, continue with modifications, or be terminated, and to detect and correct errors cost-effectively.
Program A defined set of projects containing common dependencies, and/or resources and/or objectives, overseen by a Program Manager.
Project A temporary endeavour undertaken to create a unique product or service. A project has a defined scope of work (unique product or service), a time constraint within which the project objectives must be completed (temporary) and a cost constraint. In the context of SDLC, a project may be one of:
  • an individual feature
  • a collection of features making a release
  • a collection of product releases making up a portfolio
  • a new product development


System Development Life Cycle (SDLC) A predictable series of phases through which a new information system progresses from conception to implementation. It covers all of the activities involved with creating and operating an information system, from the planning phase and/or the initial concept to the point at which the system is installed in a production environment. The major phases are Release Planning, Definition (Requirements and Specifications), Development, Test (Validation), and Deployment.


Process Flow Diagrams

Site Monitoring and Problem Management Overview

file: SOP1010-01.gif


Roles and Responsibilities

Role Responsibility
Operations Control Center The Operations Control Center (OCC) provides first level monitoring of all SDLC production and Beta services. These responsibilities include monitoring the health of the web servers, databases, mail and chat facilities, content feeds and Exodus provided services.
Systems Engineering Systems Engineering is responsible for identifying, building and improving site-monitoring tools. Tools are delivered following a formal turnover that includes source code, documentation, and training. Systems Engineering provides ongoing support to the OCC.
Operations Staff Each individual in the Operations Department provides second tier monitoring of the production and beta environments.
Engineering Staff Third tier monitoring of the production and beta environments is provided by staff in all other areas of the Engineering Department, outside Operations.
Technical Support Technical Support is the primary point of contact for customers and partners. They provide first level, and sometimes second level, support to customers and partners.
Content The Product Department area that makes changes to site content. Site content is defined as inclusive of content, ad servicing and objects. The Content area executes and monitors site content changes and communicates changes to site content to OCC for site logging/tracking purposes.
Exodus Exodus is the “ping, power and pipe” for SDLC. Their services cover applications, hardware, communications and the environment where the server farms are located. Exodus communicates directly with the OCC when issues arise.
  • Systems Engineering – Needs to be more involved in the use of the tools it developed for day-to-day operations monitoring.
  • Exodus – Exodus provides network monitoring and port mirroring sniffer monitoring of the production servers. Issues or failures above the ten server threshold are reported to the OCC.
  • Release Engineering – The OCC turns off and on access to areas of the server farm for Release Engineering when a push is being executed. The OCC coordinates with Quality Assurance to determine how a push will impact the site. Also, Qualification Function and Emergency Push Procedures provide the process that the OCC, Release Engineering and Quality Assurance follow prior to and immediately after an emergency push. The OCC may delay a release due to current traffic and/or an active incident under problem management control.
Technical Support Responds to customer-reported site issues and performs damage control, as needed.


Metrics

Metric Description
Cycle Time The number of days or hours it takes to complete requirements for an SDLC Business Gate and/or milestone.
Defects Instances of failure to pass specific tests or quality measures or to meet specification/acceptance criteria. These are recorded and assessed throughout a project and reported at the end of the project.
Change Agents Individuals who analyse a process and recommend ways to improve it, successful or not in its adoption, will be reported to Engineering Department management. These individuals will receive recognition for their effort to compress cycle times and/or improve quality.
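As an illustration of the Cycle Time metric defined above, the following minimal sketch computes the elapsed days and hours between two gate timestamps. The function name and the example dates are assumptions made for illustration only; this procedure does not prescribe how gate dates are recorded.

```python
from datetime import datetime


def cycle_time(start: datetime, end: datetime) -> tuple[int, float]:
    """Return the elapsed time between two timestamps as (whole days, total hours)."""
    if end < start:
        raise ValueError("end must not precede start")
    delta = end - start
    return delta.days, delta.total_seconds() / 3600


# Hypothetical timestamps for reaching G-6 and G-4 (not taken from any real project).
g6_reached = datetime(2024, 3, 1, 9, 0)
g4_reached = datetime(2024, 3, 15, 17, 0)
days, hours = cycle_time(g6_reached, g4_reached)
```

Reporting both values lets short milestones be tracked in hours and longer gate-to-gate spans in days, as the metric description allows.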


Procedure Activities

Gate/Activity Description

file: SOP1010-02.gif


Site Monitoring The OCC executes a number of activities to ensure that site health is maintained and that issues are identified as they emerge, in order to maintain site presence and availability. These activities are spread across the spectrum of service delivery mechanisms that support the site. The OCC’s first tier monitoring is augmented at all times by second tier monitoring from Operations Staff and third tier monitoring from Engineering Department staff in general.
Technical Support Technical Support and the Content area of Product augment the OCC monitoring activities through the normal execution of their daily responsibilities.
OCC The OCC monitors the following site services using the listed tools to measure the health of the site, and collaborates with other areas to monitor and communicate site conditions.
OCC The OCC monitors the web servers using tools and by verifying page views. This is accomplished through on-demand and scheduled activities as well as the reporting facilities of the monitoring tools.
OCC On demand, the OCC verifies that pages are being served from a server.
OCC On a defined schedule, the OCC checks every machine.
OCC Reporting tools monitor web servers and perform automated load balancing and/or monitoring. Automatic paging, NetOps mailbox and monitored alias mailbox notifications are triggered whenever site measurements exceed the predetermined threshold.
OCC Content replication to all servers is reactively monitored when problems are reported by Technical Support. The OCC will verify the image on the master server, then systematically check tiers within the server farm to determine the segment or part of the farm not being updated (broken images). Once the scope of the issue is determined, content replication to the identified segment or part of the farm will be executed.
OCC The OCC and Technical Support casually surf the site to monitor feed content for freshness (i.e., current date). The Content function in Product also monitors feeds, both in conjunction with its content changes and as part of ensuring site currency.
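The replication check described above (verify the master, then walk the tiers to find the part of the farm that was not updated) can be sketched as follows. This is an illustration under stated assumptions, not the OCC's actual tooling: content is compared by SHA-256 digest, the farm is modelled as an in-memory mapping of tier to server to content, and all names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a piece of content."""
    return hashlib.sha256(data).hexdigest()


def find_stale_servers(
    master_copy: bytes,
    expected_digest: str,
    farm: dict[str, dict[str, bytes]],
) -> dict[str, list[str]]:
    """Return, per tier, the servers whose copy differs from the expected content."""
    # Step 1: verify the image on the master server before blaming replication.
    if digest(master_copy) != expected_digest:
        raise ValueError("master copy does not match the expected content")

    # Step 2: check each tier and record the servers that were not updated.
    stale: dict[str, list[str]] = {}
    for tier, servers in farm.items():
        out_of_date = sorted(
            name for name, copy in servers.items() if digest(copy) != expected_digest
        )
        if out_of_date:
            stale[tier] = out_of_date
    return stale


# Hypothetical farm: one server in the second tier still holds the old image.
image = b"new-logo"
farm = {
    "tier-1": {"web01": b"new-logo", "web02": b"new-logo"},
    "tier-2": {"web03": b"old-logo", "web04": b"new-logo"},
}
stale = find_stale_servers(image, digest(image), farm)
```

The result identifies the scope of the issue, so replication can be re-run against only the affected segment of the farm.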


Content Feed issues are addressed either proactively or reactively. Proactively, the OCC monitors feed vendors for connectivity, except those using satellite communications. Reactively, internal and external customers monitor feeds and notify Technical Support when failures occur. In many instances, content partners notify their Account Representative, who notifies Technical Support or the OCC.
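The freshness check on feed content (content should carry the current date) is described above as a manual activity. A minimal sketch of an automated equivalent follows, assuming the date of each feed's latest item is already known; the feed names and the function name are hypothetical.

```python
from datetime import date


def stale_feeds(latest_item_dates: dict[str, date], today: date) -> list[str]:
    """Return the names of feeds whose latest content is dated before today."""
    return sorted(name for name, latest in latest_item_dates.items() if latest < today)


# Hypothetical feeds and dates, for illustration only.
feeds = {
    "news": date(2024, 3, 15),
    "weather": date(2024, 3, 14),
    "sports": date(2024, 3, 15),
}
overdue = stale_feeds(feeds, today=date(2024, 3, 15))
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the check repeatable when it is run against recorded data.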


OCC Database monitoring is performed by a tool that parses logs and uses scripts to identify errors and CPU performance conditions. The OCC reviews the identified data to determine whether intervention is necessary. Specific areas are monitored at a five-minute delay:
  • database connections
  • replication queues
OCC Database Administrators (DBAs) manually monitor database keys, extents, and table space growth (size and physical space) to ensure timely intervention and maintenance.
OCC OCC monitors the service and audits the ports using an automated detection and reporting system.
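The log-parsing approach to database monitoring described above can be illustrated with a short sketch. The log line format, the metric names (`connections`, `replication_queue`) and the limits are assumptions made for this example; the actual tool and its log format are not specified in this procedure.

```python
import re

# Assumed log line format: "<timestamp> <LEVEL> <message>".
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")
# Assumed metric message format: "connections=<n>" or "replication_queue=<n>".
METRIC = re.compile(r"^(?P<name>connections|replication_queue)=(?P<value>\d+)$")


def scan_log(lines, limits):
    """Collect ERROR lines and metric samples that exceed their limit."""
    findings = []
    for line in lines:
        parsed = LINE.match(line.strip())
        if not parsed:
            continue  # skip lines that do not follow the assumed format
        if parsed["level"] == "ERROR":
            findings.append(("error", parsed["ts"], parsed["msg"]))
            continue
        metric = METRIC.match(parsed["msg"])
        if metric:
            value = int(metric["value"])
            # Metrics without a configured limit are never flagged.
            if value > limits.get(metric["name"], float("inf")):
                findings.append((metric["name"], parsed["ts"], value))
    return findings


# Hypothetical samples taken five minutes apart.
sample = [
    "12:00:00 INFO connections=40",
    "12:05:00 INFO connections=95",
    "12:05:00 INFO replication_queue=3",
    "12:06:10 ERROR deadlock detected",
]
findings = scan_log(sample, {"connections": 80, "replication_queue": 10})
```

The function only identifies conditions; deciding whether intervention is necessary remains with the OCC, as the procedure states.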


file: SOP1010-03.gif


OCC Site resources are the collective of customer e-mail, web mail and others. The monitor is configured to “ping” the resource’s web site. Failures are reported via page and e-mail to both NetOps and specified alias mailboxes.

Site resources are also self-monitored by the service. E-mail notifications are sent to NetOps when site resource issues are identified.
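A minimal sketch of the “ping” monitor described above is shown here, using Python's standard `urllib`. The resource names, URLs and the `alias-mailbox` recipient are placeholders, and actual delivery by page or e-mail is left out; the sketch only builds the list of notifications that would be sent.

```python
import urllib.error
import urllib.request


def http_ping(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False


def check_resources(resources, probe=http_ping, recipients=("NetOps", "alias-mailbox")):
    """Probe each resource and return (recipient, message) pairs for every failure."""
    notifications = []
    for name, url in resources.items():
        if not probe(url):
            message = f"{name} did not respond at {url}"
            notifications.extend((recipient, message) for recipient in recipients)
    return notifications


# Usage with a stub probe so the example runs without network access.
resources = {
    "web mail": "http://webmail.example.test/",
    "customer e-mail": "http://mail.example.test/",
}
alerts = check_resources(resources, probe=lambda url: "webmail" in url)
```

Keeping the probe injectable means the same reporting logic can be exercised in a test environment before it is pointed at production resources.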

OCC The OCC monitors port 80 (HTTP ping) to see whether it is active. Users also e-mail Technical Support when issues or failures occur. User notifications are generally received after a delay. The OCC verifies that port 80 is active whenever an issue or failure is reported.
Exodus Exodus provides network monitoring and port mirroring sniffer monitoring of the production and beta servers. Issues or failures above the ten server threshold are reported to the OCC.
OCC and Release Engineering The OCC turns off and on access to areas of the server farm for Release Engineering when a push is being executed. The OCC coordinates with Quality Assurance to determine how a push will impact the site. Also, Quality Function and Emergency Push Procedures provide the process that the OCC, Release Engineering and Quality Assurance follow prior to and immediately after an emergency push. The OCC may delay a release due to current traffic and/or an active incident under problem management control.
OCC and System Engineering System Engineering provides support to the OCC, including the tools, to ensure optimal use in day-to-day operations monitoring.
OCC and Configuration Management Configuration Management addresses the set-up of the platform, network and software components for a release. It also addresses the controlled change and update of the development, QA, FOA/Beta, staging and production environments.
OCC and Technical Support Respond to customer-reported site issues and perform damage control, as needed.
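The Exodus reporting rule above (issues or failures above the ten server threshold are reported to the OCC) can be expressed as a simple check. The sketch below reads “above” as strictly more than ten distinct servers, which is an interpretation rather than something this procedure spells out; the server names are hypothetical.

```python
SERVER_THRESHOLD = 10  # from the procedure: report when failures exceed ten servers


def should_report_to_occ(failed_servers) -> bool:
    """Return True when the number of distinct failed servers exceeds the threshold."""
    return len(set(failed_servers)) > SERVER_THRESHOLD


# Hypothetical server names; exactly ten failures stays below the reporting line.
ten_failures = [f"web{n:02d}" for n in range(1, 11)]
eleven_failures = ten_failures + ["web11"]

report_ten = should_report_to_occ(ten_failures)
report_eleven = should_report_to_occ(eleven_failures)
```

Counting distinct servers avoids escalating when the same machine is reported several times by different monitors.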


Forms

  • None at this time


Exceptions

  • None at this time


Tools/Software/Technology Used

  • None at this time


Attachments

  • None at this time


Related Standard Operating Procedures:

  • None

