The operations of many large organizations rest on large applications that are characterized as "legacy." To increase flexibility or reduce costs, businesses are looking to modernize these applications, for instance through renovation, introducing a service-oriented architecture (SOA), or even re-implementing them in a new environment. No matter which approach is taken, it's important to salvage as much knowledge and logic as possible from the legacy application. Unless the application's function is obsolete, recovering functional knowledge (what does the application do?) and structural knowledge (how does it do it?) can accelerate the modernization effort.
A parallel can be drawn with renovating a building, since modernization can involve anything from gradual changes to the building's internal structure (say, larger doors) to complete demolition and reconstruction. In both cases blueprints of the building are required. These blueprints form a shared basis of knowledge between the architect and the developer that's necessary for planning and execution. Just as blueprints are necessary to determine how a building can be adapted to suit a new need, they are also necessary to determine how to adjust an application. However, since legacy applications have, by their nature, been developed over long periods of time and, in most cases, by many people, such blueprints don't exist. They have to be recovered before the modernization process can start.
Many legacy analysis tools have emerged to address this situation. They attempt to reveal the internal and implicit architecture and business function of a legacy application in a form that people can understand. Discovering the business function, and then expressing it at an abstract level that discards incidental technical details, is a difficult task. However, this is precisely what is necessary, since modernization projects are interested in re-implementing business functions, not particular technical solutions.
The tension between expressing business functions in a comprehensible fashion and expressing them with enough detail and precision has to be addressed. This requires a language in which business knowledge can be practicably expressed.
UML as a Solution
UML has emerged as the most successful non-proprietary object modeling and specification language. UML includes a standardized graphical notation that can be used to create an abstract model of a system, the UML model. UML can precisely and understandably describe an application at a high level of abstraction, hiding the implementation details to reveal the actual business functions. Furthermore, as a formal language, UML has a well-defined syntax that makes it suitable for both forward and reverse engineering. In forward engineering, a UML model can be used to generate actual code (e.g., Java code). Most commercially available UML tools are capable of forward engineering (to varying degrees) and also offer some reverse engineering features. The most common reverse engineering activity involves extracting class diagrams from Java code. So far, however, reverse engineering has been limited to modern programming languages; similar capabilities are lacking for older technologies such as COBOL.
The Benefits of UML vis-à-vis Legacy
The ability to reverse engineer legacy applications to UML would offer substantial benefits:
Thus UML can be the intermediate stage through which an application rewrite may pass. The process would be:
"People Reuse"
The standard method of acquiring business process knowledge through user interviews involves a major commitment of time and new resources. The results may be incomplete and error-prone. Alternatively, extracting knowledge directly from the current application circumvents these concerns while continuing to rely on the resources currently involved in application maintenance.
Limitations
There's no silver bullet for reverse engineering legacy to UML. In fact, one may notice that some concepts simply don't match. Furthermore, some important information about the use of the legacy application isn't captured in the code itself, making automatic extraction impossible. For instance:
Actors
In UML, an actor is a user of the system; "user" can mean a human user, a machine, or even another system that interacts with the system from outside its boundaries. Because actors are by definition external to the system, information about them usually can't be found in the code.
Requirements
While the current system may implement business requirements, they may not be stated explicitly in the code. They are present only implicitly, as behavior that fulfills them.
Navigation & Sequence
Certain sequences of operations may not be explicitly specified in the application. So, while a user knows that to open a new account, he or she must perform activities A, B, and C in this precise order, the application may allow other paths that aren't meaningful from a business perspective.
We can therefore conclude that a UML description of a legacy application can't be achieved through completely automatic reverse engineering. While a legacy analysis tool may expose the artifacts of the application, only a human can assemble them into meaningful UML diagrams.
A Balanced Approach
We have shown that a totally automated approach isn't feasible. At the other extreme, a completely manual approach has two primary disadvantages:
Economics
Over time, applications tend to be modified to such a degree that neither the initial plans nor the current documentation reflects the reality of the application. Knowledge must be acquired from the code itself, but manually reviewing a multimillion-line application would be far too costly to be a realistic option.
Completeness
As a legacy application is modified and enhanced over the years, users often lose a complete understanding of how the application functions. For example, in a pension system the rules for computing the pension can be spread through numerous government and corporate policy documents. This knowledge is already in the code, which is more complete, precise, and concise than what would come from user interviews. Moreover, the application stakeholders are likely to insist that nothing is lost from the current functionality.
The best balance between fully manual and fully automatic can be called "tool assisted." In this approach, a software tool may be able to:
What UML Diagrams Can Be Extracted?
We have now identified the approaches that yield the maximum benefit and their drawbacks. So let's look at specific information that can be extracted from the legacy application. These possibilities should be thought of as a starting point since more automation will likely arise as UML extraction tools increase in sophistication.
Use-Case Diagrams
Use cases correspond to major areas of the application in which the user can accomplish defined business goals. The cascading hierarchy of menus can represent such major functional areas. The 'Maintain Account' menu screen may lead to 'Add Account,' 'Delete Account,' or 'Freeze Account,' each of which can be a use case of its own. As legacy analysis tools are aware of these screens as well as the transitions between them, the existing information can be used to extract use-case diagrams.
If automation is pursued, the task isn't so simple. A legacy analysis tool may be able to locate the screens and the transitions between them. This would result in a graph in which the screens are nodes and the transitions are edges. A purely technical analysis would offer no indication of where navigation starts, or of how entering a major area of the application differs from returning to the initial point. For example, we may detect that there's a transition from 'Maintain Account' to 'Increase Line of Credit,' but from a purely technical viewpoint, the navigation may as well go in the reverse direction. In most cases, if the user can go from Screen A to Screen B, then he or she can also go from Screen B to Screen A. If each of the two screens indicates an important use case, then what is the relation between them? Which one is using or including the other?
As a result, we have the interesting problem of taking an undirected graph and transforming it into a directed one. One possible solution is to let the user designate a starting point, which is usually the application's main menu screen. An algorithm may be developed by which the program traverses the graph starting at the designated main menu and follows the edges without ever returning to a node (screen) that was already visited on the path from the main menu. This results in a directed subgraph of the screen navigation that indicates the normal navigation of a user who starts at the main menu screen.
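The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the screen names and the `orient_navigation` function are hypothetical, and the sketch uses a breadth-first traversal with a single global visited set, which is a simplification of the path-based rule in the text. Each screen is reached once, and each kept edge points away from the designated start screen.

```python
from collections import deque


def orient_navigation(edges, start):
    """Turn undirected screen transitions into directed ones, oriented away from `start`."""
    # Build an undirected adjacency map from the extracted screen pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    directed = []
    visited = {start}
    queue = deque([start])
    while queue:
        screen = queue.popleft()
        for nxt in neighbors.get(screen, []):
            # Never return to a screen that was already reached from the start.
            if nxt not in visited:
                visited.add(nxt)
                directed.append((screen, nxt))
                queue.append(nxt)
    return directed


# Hypothetical transitions as a legacy analysis tool might report them,
# with no meaningful direction attached.
undirected = [
    ("Increase Credit Line", "Maintain Account"),
    ("Maintain Account", "Main Menu"),
    ("Add Account", "Maintain Account"),
]

directed = orient_navigation(undirected, "Main Menu")
for source, target in directed:
    print(f"{source} -> {target}")
```

Transitions that only lead back toward the start are dropped, so a person still has to review the result and decide which of the remaining edges represent genuine "include" or "use" relations.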
As an example, a total navigation graph that's automatically extracted from the code may look like this:
In this graph it's not clear if 'Increase Credit Line' includes 'Customer Maintenance' or the other way around. However, if 'Customer Maintenance' is designated as the starting point, the navigation graph may be reduced to:
Now it's clear that a use case called "Maintain Account" would include or use the use case "Increase Credit Line," and we can derive a use-case diagram:
As stated earlier, the actors may be hard to detect from the code, so additional information may be needed to complete the picture. The manually adjusted diagram will show:
Activity Diagrams
To do a certain task, the user of the legacy application must navigate a particular path through various screens while doing some specific operations on each. Legacy analysis tools are aware of the screens and all possible paths through them, and this information may be used to generate activity diagrams. The user of the UML extraction tool would then have to select particular paths that reflect particular user tasks. Now we have a problem similar to that for use cases. The screen flows of the legacy application may allow all possible transitions between the screens, but only some particular ones would make sense in the context of an activity diagram. A technical analysis of the application may indicate the following possible transitions:
A tool-assisted extraction of an activity diagram can allow the user to point to a starting point and then select a particular navigation path, which reflects how the creation of a new order is done in real life. Furthermore, the user may also indicate some decision points. The final activity diagram would look like this:
State Diagrams
The states of a particular object can appear more or less explicitly in the COBOL code. A so-called 88-level field could in itself list all the possible states. Looking at all possible values that a variable can have may help collect other states. Furthermore, even some of the conditions that lead to a state transition can be discovered.
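As a rough illustration of how 88-level entries can yield candidate states, the Python sketch below scans a COBOL fragment with a regular expression. The COBOL fragment, the condition names, and the `extract_states` function are all hypothetical, and the pattern only handles the simplest form (one quoted literal per `VALUE` clause); a real legacy analysis tool would rely on a full COBOL parser.

```python
import re

# Hypothetical data definition; the condition names and values are invented.
COBOL_SOURCE = """
       05  LOANSTATUS           PIC X.
           88  LOAN-REQUESTED   VALUE 'R'.
           88  LOAN-APPROVED    VALUE 'A'.
           88  LOAN-CLOSED      VALUE 'C'.
"""

# Matches lines such as:  88  LOAN-APPROVED  VALUE 'A'.
LEVEL_88 = re.compile(
    r"^\s*88\s+([A-Z0-9-]+)\s+VALUE\s+'([^']*)'\s*\.",
    re.MULTILINE,
)


def extract_states(source):
    """Map each 88-level condition name to the literal value it stands for."""
    return {name: value for name, value in LEVEL_88.findall(source)}


states = extract_states(COBOL_SOURCE)
for name, value in states.items():
    print(f"{name} = {value!r}")
```

Each condition name found this way is a candidate state; values assigned to the field elsewhere in the code may reveal further states that have no 88-level entry.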
Legacy analysis tools can generally discover data flows (e.g., value moved to a field) and program connections (i.e., calls, links, exclusive control, etc.). Suppose, for example, that the legacy tool discovers the following:
Data flow analysis done for the LOANSTATUS field may discover the following statements:
A state diagram can then be inferred that would look like:
A legacy UML extraction tool would therefore need to discover data flows (focused on fields indicating state) and program flows and then process this information and assemble it into a state transition diagram.
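The assembly step might be sketched as follows in Python. Everything in the example is an assumption made for illustration: the program names, the states they assign to the state field, and the call relationships are invented, and the rule used (a transition exists from the state set by one program to the state set by a program it leads to) is only one plausible heuristic, not an established algorithm.

```python
def infer_transitions(state_set_by, program_flow):
    """Combine data-flow and program-flow facts into candidate state transitions.

    state_set_by: program name -> state value that the program moves to the field
    program_flow: (from_program, to_program) pairs found by program-flow analysis
    """
    transitions = set()
    for source_program, target_program in program_flow:
        before = state_set_by.get(source_program)
        after = state_set_by.get(target_program)
        # Skip programs that never touch the state field, and self-loops.
        if before is not None and after is not None and before != after:
            transitions.add((before, after))
    return sorted(transitions)


# Hypothetical data-flow result: which program moves which value to the field.
state_set_by = {
    "LOANREQ": "Requested",
    "LOANAPPR": "Approved",
    "LOANREJ": "Rejected",
    "LOANCLOS": "Closed",
}

# Hypothetical program-flow result: which program passes control to which.
program_flow = [
    ("LOANREQ", "LOANAPPR"),
    ("LOANREQ", "LOANREJ"),
    ("LOANAPPR", "LOANCLOS"),
    ("LOANAPPR", "PRINTRPT"),  # does not set the state field, so it is ignored
]

transitions = infer_transitions(state_set_by, program_flow)
for before, after in transitions:
    print(f"{before} -> {after}")
```

The output is a list of candidate transitions only. Whether each one is meaningful, and under which condition it fires, still has to be confirmed by someone who knows the business.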
Class Diagrams
The discovery of the major classes in the application requires human judgment. While there could be compelling reasons to create certain classes, in many cases the classes are created for maximum programming convenience. There are some simple heuristics (e.g., create a class for each table) that could be used as an initial step. The data members of the class can be inferred from the sub-fields of a structure, from the columns of a table, or from other artifacts. While discovery of data members is almost a trivial exercise, the discovery of methods offers a more difficult challenge.
One way to discover the methods corresponding to a certain class is to take its data members (or rather, the legacy fields from which the members were inferred) and find their use in the application code. One may discover how they are calculated and used, and then designate corresponding methods for the class. If the class is derived from some data store, it's natural to create Insert, Update, Delete, and Read methods.
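The "one class per table" heuristic, together with the default data-store methods, could look roughly like this in Python. The table name, the column names, the `field_usages` mapping, and the `class_from_table` function are hypothetical; the mapping stands in for whatever a legacy analysis tool reports about where each field is calculated or used.

```python
from dataclasses import dataclass, field


@dataclass
class ClassSketch:
    """A candidate UML class: a name, data members, and method names."""
    name: str
    members: list
    methods: list = field(default_factory=list)


# Methods that come naturally with a class derived from a data store.
CRUD_METHODS = ["Insert", "Update", "Delete", "Read"]


def class_from_table(table_name, columns, field_usages=None):
    """Propose a class for a table; add methods suggested by how its fields are used."""
    methods = list(CRUD_METHODS)
    for column in columns:
        for usage in (field_usages or {}).get(column, []):
            if usage not in methods:
                methods.append(usage)
    # LOAN_ACCOUNT -> LoanAccount
    class_name = table_name.title().replace("_", "")
    return ClassSketch(name=class_name, members=list(columns), methods=methods)


# Hypothetical table and field-usage information.
sketch = class_from_table(
    "LOAN_ACCOUNT",
    ["ACCOUNT_ID", "BALANCE", "CREDIT_LINE"],
    field_usages={"CREDIT_LINE": ["IncreaseCreditLine"]},
)
print(sketch.name, sketch.members, sketch.methods)
```

The result is a starting point for review rather than a finished class diagram, since deciding which classes and methods are worth keeping still requires human judgment.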
Conclusion
UML extraction from legacy is beneficial and made possible by commercially available legacy analysis tools that capture legacy application knowledge in well-organized repositories. We expect that UML extraction software will emerge in successive steps and lead to increased automation, coverage, and precision. By using such UML extraction tools, companies can document and retain the business knowledge embedded in their legacy applications. UML would make it possible to share this information and to use forward-engineering techniques to re-implement the functionality in new environments.