Regular Meeting of the Los Angeles Chapter of ACM
Wednesday, December 7, 2005
"Avoiding the Destiny of Failure in Large Software Systems"
John Cosgrove, PE, CDP
A recent article by Watts Humphrey in CrossTalk magazine* notes that most large systems are not successful and that software systems larger than $5M have virtually no likelihood of successful completion. The most that can be hoped for is a "challenged" result where major capabilities are absent or seriously compromised. In Humphrey's view, this is primarily a result of an unrealistic approach to planning and managing large software projects. Since there is little or no visible way of judging the progress and current status of most software projects, the means used to set goals, formulate plans, measure progress and status, etc., must be even more rigorous than those used for other (inherently visible) projects of similar magnitude.
Unfortunately, the opposite is the usual case. Performance objectives, budgets, schedules, specific plans, etc., are typically created without the active participation of those responsible for developing the system. Furthermore, plans should be treated as works in progress that need constant updating to remain realistic as the system evolves, yet they seldom are. The result is inevitable - missed schedules, exhausted budgets, performance and quality failures - all happening with little warning after a period of optimistic progress status reports.
The record of failure does not have to continue. A return to basic engineering principles as they apply to software systems can change it. All other engineered systems require a fully conceived, realistic plan to meet functional and resource objectives before those objectives are committed. This is a good place to start. Additionally, all elements of risk must be identified and plans created to reduce and manage that risk by well-established means. These principles must be honored in spirit as well as formally. For instance, another useful article, by Phil Armour in a recent Communications of the ACM issue**, points out that identified risks are seldom quantified in terms of actual exposure to costs. As Armour states, "It is as if the concepts of risk and failure are somehow disconnected." Without the step of assigning an actual cost to the possibility of failure, project management is unlikely to approve any action to minimize or avoid the perceived risk. This quantification should also become part of the project management process on an ongoing basis.
John Cosgrove, PE, CDP, has over forty-five years of experience in computer systems and has been a self-employed consulting software engineer since 1970. He was a part-time lecturer in the UCLA School of Engineering and the LMU graduate school. He regularly gives a lecture on Ethics of Software Engineering as part of UCLA's undergraduate course in Engineering Ethics. He has an invited article, "Software Engineering and Litigation," in the Encyclopedia of Software Engineering. He holds the CDP and is a member of ACM and NSPE, a Life Senior Member of the IEEE Computer Society, and a professional engineer in California. His formal education includes a BSEE from Loyola Marymount and a Master of Engineering from UCLA.
LA ACM Chapter Meeting
LA ACM Chapter December Meeting. Held Wednesday, December 7, 2005.
The presentation was "Avoiding the Destiny of Failure in Large Software Systems" by John Cosgrove, PE, CDP, of Cosgrove Consulting Group. This was a regular meeting of the Los Angeles Chapter of ACM.
John Cosgrove said his program is derived in large part from works by Phil Armour and Watts Humphrey. The problem is that most software systems fail, and the bigger ones fail more often. We deliver an important economic product, so why is so much of it unsuccessful? These failures are now ending up in litigation. Failure is not inevitable; notable exceptions exist. A State of Alaska project recovered from a failure. What we do is inherently obscure. Poor natural visibility is typical of software, so effective planning and status assessment are critical. It is different when building structures, where definite plans are required. For software, someone has only a vague idea of what they want. It is difficult to define, and we don't know what we have until we are done. It is hard to define software requirements completely. Risk assessment and planning have to be intertwined because risk management is needed when you can't plan all the details. Risk assessment must include the economics of failure.
Plans must involve all the responsible stakeholders: the developers, customers, end users, and any others directly affected by the software. You should work for "Win-Win," not "Lose-Lose," and you don't know what "Lose-Lose" really means until you end up in a courtroom. The development cycle policy must be explicit. Critical drivers must be stated as independent variables. The critical drivers to choose among are schedule, cost, performance, and quality. You can choose one or two as independent variables, and the others become dependent variables. When you fix the independent variables, the dependent variables vary. Planning is never complete. A rule is never to fail a plan; change the plan before it fails. Developers usually recognize in advance when a plan as stated is not achievable, and good practice is to report the problem and change the plan to match reality. Provide advance visibility when there is a problem; don't just keep things quiet until the plan fails. As an example, set schedule as the independent variable. When you fix it, the other variables vary. You can't have it all; you need the resources that are required to meet that schedule. What are the critical drivers? The Boeing 777 has more than a million lines of code and a lot of safety-critical software. The independent variables were that it be delivered on time and that it be functionally adequate, and this was accomplished successfully.
True risk management is an element of planning and flows from unknowns identified in planning. There are two broad categories: catastrophic risk and conventional risk. Catastrophic risk is an unacceptable risk that requires insurance in some form. Conventional risk exposure is met by classical risk mitigation steps. Both demand dollar quantification of the failure, and the cost of failure drives budgets. An example of an unacceptable risk was the DC-10 flight test system, where the cost of failing to deliver on schedule was a billion dollars. Donald Douglas provided full resources, and working software was delivered on time.
There is a new world of regulation. Sarbanes-Oxley (SOX) enforces accountability for reporting "correctness." Software projects are investment assets. Correctness, control mechanisms, and security are auditable. Noncompliance realities include civil and criminal liability. "If we managed finances in companies the way we manage software then somebody would go to prison." – Armour.
The future of Software Engineering is that functional size and complexity are increasing rapidly. Size is increasing tenfold every 5 years, and scale matters in all engineered systems. Humphrey provided an analogy with transportation system speeds from 3 mph to 3,000 mph. At the low end the driving technology is shoe leather, at 30 mph the technology is wheels, at 300 mph wings, and at the high end rocket propulsion systems, so scale matters. Increasingly, software (i.e., computer systems) is a critical part of the products and services in almost all industries. Most computer systems are interconnected and face more internal and external threats. In the past, we assumed a friendly environment.
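To put the growth figure in yearly terms, the arithmetic can be sketched in a few lines of Python. This assumes smooth compound growth, which the talk did not state, and the function names are illustrative only.

```python
def annual_growth_factor(total_factor: float = 10.0, years: float = 5.0) -> float:
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def size_multiple(elapsed_years: float, total_factor: float = 10.0,
                  years: float = 5.0) -> float:
    """Size relative to today after elapsed_years at the same compound rate."""
    return total_factor ** (elapsed_years / years)


if __name__ == "__main__":
    print(f"annual growth factor: {annual_growth_factor():.3f}")
    print(f"size multiple after 15 years: {size_multiple(15):.0f}x")
```

Under that assumption, tenfold growth in 5 years works out to roughly 58 percent per year, and three such periods give a thousandfold increase.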
Software and hardware have significant differences. Software requirements are seldom complete. "With software the challenge is to balance the unknowable nature of the requirements with the business need for a firm contractual relationship." - Watts Humphrey. Most engineered systems are defined by comprehensive plans and specifications prior to startup. Few software-intensive systems are. Most software projects are challenged or fail completely. Fewer than 10 percent of projects costing over 16 million dollars succeed, while 50 percent of projects costing 1 million dollars succeed. The primary cause is a lack of realistic planning by developers, combined with no natural visibility of progress or completion status.
Software is valuable: value is created by the abstraction of productive knowledge, and the development is a social learning process. Economic value comes from impact on useful activity, such as providing efficient auto ignition systems. Value is increased when the knowledge is readily adaptable. For example, McDonald's hamburger franchises also work well in China. Franchises show how preserved abstractions can be valuable. Poor software development techniques are indefensible and unethical. Software engineers are ethically obligated to optimize value. Source: Baetjer.
Software creation involves a social learning process where ignorance changes to useful, reproducible knowledge. It can be described in terms of five levels, the Orders of Ignorance (OI).
Some people act as if the concepts of risk and failure are somehow disconnected and define the purpose of development as doing something not done before. A 90 percent success rate is sometimes accepted as good, but it means there is 1 failure in 10. Is the failure tolerable? It must be made tolerable, possibly by insurance. The dollar cost of failure must be calculated and minimized. Failure costs are never zero, and making costs explicit improves planning. There are a number of steps to minimize failure costs. All catastrophic risks must be made tolerable. In our own lives we use insurance on life and property to cover catastrophic events. Projects may require an alternate "Plan B" solution. We must quantify risk exposure in terms of failure costs. The rationale behind testing is to avoid costly field retrofits. Failure cost exposure drives budgets for mitigation.
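One common way to put a dollar figure on risk exposure is to multiply the estimated probability of failure by the estimated cost of failure. The sketch below uses that convention as an assumption; the talk did not give a formula. The `Risk` class, its field names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float           # estimated chance of failure, 0.0 to 1.0
    failure_cost: float          # estimated dollar cost if the failure occurs
    mitigation_cost: float       # dollar cost of the proposed mitigation
    residual_probability: float  # estimated chance of failure after mitigation

    def exposure(self) -> float:
        """Expected dollar loss with no mitigation."""
        return self.probability * self.failure_cost

    def residual_exposure(self) -> float:
        """Expected dollar loss that remains after mitigation."""
        return self.residual_probability * self.failure_cost

    def mitigation_pays_off(self) -> bool:
        """True when mitigation costs less than the exposure it removes."""
        return self.mitigation_cost < self.exposure() - self.residual_exposure()


if __name__ == "__main__":
    # Made-up figures for illustration only.
    late = Risk("late delivery", 0.10, 5_000_000, 200_000, 0.02)
    print(f"{late.name}: exposure ${late.exposure():,.0f}, "
          f"mitigate: {late.mitigation_pays_off()}")
```

Stating the exposure in dollars gives project management a figure to weigh against the mitigation budget, which is the step the talk describes as usually missing.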
There have been a number of interesting events, such as a recent air traffic control failure. The LA regional system failed on 9/14/2004, leaving no radar coverage for 3.5 hours, and the backup system also failed. There were many near misses among the 800-plus aircraft involved, and disaster was avoided by the aircraft's onboard anti-collision systems. The failure was improperly blamed on "human error." The fault was a known "glitch," introduced with a system re-host a year earlier, that was normally avoided by manual operations. Only 1 of 21 centers has had the fault corrected. The episode raises questions about testing, the fault tolerance policy, etc. And why did the backup system fail immediately?
New FBI software is called unusable. New anti-terrorism software with virtual case files has had further delays in a four-year effort. A half billion dollar upgrade will not work and the errors render worthless much of a current 170 million dollar contract. It may have outlived its usefulness before it was implemented. Officials thought "get it right the first time." That never happens with anybody.
Another example is unsafe automotive ignition where the engine died when accelerating into traffic. Ignition control software failed when there was an intermittent open circuit on a sensor wire. Hazard analysis missed the hardware-software interaction. There were incomplete software system safety requirements. Deterministic values were provided for common failures such as open and short circuits. The control algorithm must provide protection, detect failures and substitute "safe" values.
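The idea of detecting a failed sensor and substituting a "safe" value can be sketched as below. This is a minimal illustration and not the ignition system discussed in the talk: the voltage limits, the default value, and the function name are all assumptions.

```python
from typing import Optional, Tuple

# Hypothetical limits for an illustrative sensor; real values would come
# from the hazard analysis of the actual hardware.
SENSOR_MIN_VOLTS = 0.5    # below this the wiring is presumed faulty (assumed)
SENSOR_MAX_VOLTS = 4.5    # above this the wiring is presumed faulty (assumed)
SAFE_DEFAULT_VOLTS = 2.5  # deterministic substitute value (assumed)


def validated_reading(raw_volts: Optional[float]) -> Tuple[float, bool]:
    """Return (value_to_use, fault_detected) for one sensor sample.

    A missing, NaN, or out-of-range sample is treated as a wiring fault
    (for example an open or short circuit) and replaced by the safe default.
    """
    in_range = (
        raw_volts is not None
        and SENSOR_MIN_VOLTS <= raw_volts <= SENSOR_MAX_VOLTS
    )
    if not in_range:
        return SAFE_DEFAULT_VOLTS, True
    return raw_volts, False


if __name__ == "__main__":
    for sample in (2.0, 0.0, 5.0, None):
        print(sample, validated_reading(sample))
```

The control algorithm then always works from a defined value, and the fault flag lets it fall back to a degraded but safe mode rather than stalling the engine.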
There must be a framework for defendable designs such that the engineering process can be defended in court. Three state bounds must be set for the system: the operating envelope, for normal operations; the non-operating state, where normal operation is not possible (fail soft); and the exception state, where the system recovers to normal after an anomaly. Normal may be a degraded, but safe, operating condition. Mishaps occur during state transitions, so planning must identify software system dependability requirements and suggest mishap mitigation, which could include both hardware and software.
Mr. Cosgrove gave a very interesting presentation with many additional detailed examples that are not covered in this article. This article provides most of the data from the charts presented and more information can be obtained by checking the sources in the bibliography, but you can't get the real flavor of his talk without having attended the meeting.
This was another of the regularly scheduled meetings of the Los Angeles Chapter of ACM. Our next regular meeting will be held on January 11, 2006.
This was the third meeting of the LA Chapter year and was attended by about 12 persons.
On January 11th, Windows versus Linux. Proponents of each Operating System will give a talk about their respective OS. An exciting exchange is bound to follow.
Directions to LMU & the Meeting Location:
This month's meeting will be held at Loyola Marymount University, University Hall, Room 1767 (Executive Dining Room), One LMU Dr., Los Angeles, CA 90045-2659 (310) 338-2700.
From the San Diego (405) Freeway:
Dinner will be in the Faculty Dining Room, UHall 1767: To get to the Roski Dining Hall, where you may purchase your food, take one of the elevators in the bay at the west end of the parking structure to the Lobby level. Exit the elevators, then walk straight ahead through the glass doors and into the atrium. Turn right. The entrance to the cafeteria is on the right before you reach the cafeteria seating area at the west end of the atrium. (The cafeteria entrance is room 1700 according to the building floor plan).
To enter the Faculty Dining Room from the cafeteria:
After paying for your food, head back to the area between the grill and the sandwich bar. Turn toward the exterior windows (north side of the room), and walk toward the windows. Before you reach the windows, there will be an opening on the east side of the room, which leads to a hall along the exterior north wall of UHall. Walk down the hall until you come to the faculty dining room. Alternatively, leave the dining area through the doors on the south side of the dining area and walk east (left) through the lobby until you reach the Executive Conference Center (ECC). Enter the double glass doors to the ECC, continue straight down the hall to the end, then turn left and you will be in the faculty dining room.
The meeting will also be in the Faculty Dining Room, UHall 1767. From parking Lot P2 or P3 under University Hall, take one of the elevators in the bay at the center of the parking structure to the Lobby level of University Hall. When you exit the doors into the atrium, the next set of doors a short distance to your right says ECC Center. Enter those doors and walk straight down the hallway. Room 1767 is on your left hand side.
The Schedule for this Meeting is
5:00 p.m. Networking/Food
6:00 p.m. Program
7:30 p.m. Council Meeting
9:30 p.m. Adjourn
No reservations are required for this meeting. You are welcome to join us for a no-host dinner in Room 1767. Food can be bought in the Cafeteria. Look for the ACM Banner.
If you have any questions about the meeting, call Mike Walsh at (818)785-5056, or send email to Mike Walsh.
For membership information, contact Mike Walsh,
(818)785-5056 or follow this
Other Affiliated groups
Please visit our website for meeting dates, and news of upcoming events.
For further details contact the SIGPHONE at (310) 288-1148 or at Los_Angeles_Chapter@siggraph.org, or www.siggraph.org/chapters/los_angeles
Last revision: 2005 1214 [Webmaster]