Reconstruction is closely related to program understanding, which allows the developer to form increasingly abstract descriptions of the system. Program-understanding techniques typically consider source code in increasingly abstract forms: raw text, preprocessed text, lexical tokens, syntax trees, control- and data-flow graphs, program plans, architectural descriptions, and conceptual models. The more abstract forms entail additional syntactic and semantic analysis; they correspond more to the meaning and behavior of the code and less to its form and structure. Different levels of analysis are necessary for different users and different reverse-engineering activities.

Reconstruction requires program-comprehension activities at each level of abstraction represented in the horseshoe model for architectural transformation. Knowledge discovered at each level of abstraction builds on knowledge gathered at the lower levels. In architectural transformation, this process ends when a sufficiently accurate model of the architecture is developed. The following sections look at program-comprehension techniques at the code-structure, functional, and architectural levels of abstraction.

Code-Structure Representation

Code-structure representation includes source code and artifacts, such as abstract syntax trees (ASTs) and flow graphs, obtained through parsing and routine analytical operations. Recovering a system architecture begins with understanding source code and other existing artifacts. A number of activities can be applied in this process.

Manual Code Reading

In this activity, the software engineer reads through source code in printed form or browses it online. This activity is almost always applied in some form, but it is not viable for very large systems. A good software engineer may be able to keep track of approximately 50,000 lines of code. If there is much more than that, the amount of information becomes unwieldy.
Artifact Extraction

Artifact extraction involves discovering and documenting elements and relationships among elements in code-structure representations of the system. For example, Table 5-1 lists typical elements that might be extracted from a COBOL system and their relationships. The specific set of extracted elements and relationships depends on the type of system. For example, if the system were written in Java, almost all the source elements, relations, and target elements would change.

Table 5-1. A Typical Set of Source Elements and Relationships

Static Analysis

Static-analysis techniques commonly involve parsing the application's source code to generate a variety of reports, including call graphs, data and control flows, structure charts, cross-reference information, and define/use analysis for data types and variable instances. Most reverse-engineering tools provide a variety of static-analysis capabilities.
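To illustrate the kind of report a static-analysis step produces, here is a minimal sketch that parses source text and builds a call graph. It uses Python and its standard `ast` module purely for illustration; the sample program in `SOURCE` and the function name `build_call_graph` are hypothetical, and a real reverse-engineering tool would handle far more of the language.

```python
import ast
from collections import defaultdict

# Hypothetical program under analysis, held as text so it is parsed, never run.
SOURCE = '''
def load(path):
    return parse(read(path))

def read(path):
    return open(path).read()

def parse(text):
    return text.split()

def main():
    data = load("input.txt")
    print(len(data))
'''


def build_call_graph(source):
    """Map each top-level function to the plain names it calls directly."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        graph[node.name]  # make sure functions with no calls still appear
        for child in ast.walk(node):
            # Only calls through a simple name are recorded. Calls through
            # attributes (methods) are skipped, because their targets are
            # not known from the syntax alone.
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                graph[node.name].add(child.func.id)
    return dict(graph)


call_graph = build_call_graph(SOURCE)
for caller in sorted(call_graph):
    print(caller, "->", sorted(call_graph[caller]))
```

The sketch deliberately ignores method calls such as `text.split()`, so the edges it reports are only those that can be read directly off the syntax tree.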
In most cases, static analysis provides the necessary information to build abstractions. Static information can be obtained from the source code, design information, and compile-time artifacts, such as build and make files. However, relevant information may not be obtainable because of late binding. The use of polymorphism, function pointers, and runtime parameterization can all inhibit the discovery of source code structure by static analysis.

Another problem with static analysis is that the precise topology of a system may not be determined until runtime. For example, systems that use middleware, such as CORBA, Jini, or COM, frequently establish their topology dynamically, depending on the availability of system resources. Because the topology of such systems cannot be determined from their source artifacts, you cannot reverse engineer them using static extraction tools alone.

Dynamic Analysis

Dynamic analysis observes a program executing in the operational environment or in a simulation of the operational environment. Dynamic analysis can help developers understand systems that use late binding and those that are configured dynamically. Examples of such systems include distributed, real-time, and client/server programs.

Dynamic-analysis techniques include profiling, snooping, and code instrumentation. Profiling gathers execution-time information, such as actual call sequences and data flow. Call sequences can show which system elements implement a particular feature. Snooping can provide insight into interactions between components by letting you observe communications between them, or at any point where data and control cross a component boundary. Code instrumentation has a wide variety of uses, including tracing code execution and changing data values.

Slicing

Program slicing is a family of program-decomposition techniques. These techniques select statements relevant to a computation, even if the statements are scattered throughout the program [Lanubile 97].
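As a rough illustration of selecting only the statements relevant to one computation, the sketch below computes a backward slice over a tiny straight-line program. It is a deliberately simplified data-dependence walk, not the flow-graph algorithm from the slicing literature: it assumes only plain assignments to simple names, with no branches, loops, or calls. The sample `PROGRAM` and the function name `backward_slice` are hypothetical.

```python
import ast

# Hypothetical program under analysis: two interleaved computations.
PROGRAM = """\
n = 10
total = 0
product = 1
i = 1
total = total + i
product = product * i
average = total / n
"""


def backward_slice(source, variable):
    """Return the source lines that can affect `variable` at program end.

    Simplifying assumption: the program is a straight-line sequence of
    assignments to simple names (no control flow, no calls).
    """
    lines = source.splitlines()
    relevant = {variable}  # variables whose defining statements we still need
    kept = []
    for stmt in reversed(ast.parse(source).body):
        if not isinstance(stmt, ast.Assign):
            continue
        targets = {t.id for t in stmt.targets if isinstance(t, ast.Name)}
        if targets & relevant:
            used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
            # The assigned names are now explained; their inputs become relevant.
            relevant = (relevant - targets) | used
            kept.append(stmt.lineno)
    return [lines[number - 1] for number in sorted(kept)]


for line in backward_slice(PROGRAM, "average"):
    print(line)
```

Slicing on `average` keeps the statements that define `n`, `total`, and `i` and drops both `product` statements, even though they are interleaved with the relevant ones.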
A slice identifies all logic that affects the value of a particular set of variables at a given point in a program. Program slicing, as originally defined by Weiser, is based on static data-flow analysis on the flow graph of a program [Weiser 84]. Program slicing has been applied in program understanding and software maintenance, using conventional slicing, dynamic slicing [Agrawal 90, Korel 88], and other variants. Conventional program slicing has also been advocated in reverse engineering [Beck 93].

Function-Level Representation

Function-level representation describes the relationships among the program functions (calls, for example), data (function and data relationships), and files (groupings of functions and data).

Semantic and Behavioral Pattern Matching

Semantic and behavioral pattern matching is similar to structural pattern matching but is used to discover dynamic behavior. Patterns are identified by discovering code components that share specific data-flow, control-flow, or dynamic (program-execution-related) relationships.

Redocumentation

Redocumentation is one of the oldest forms of reverse engineering [Sneed 84]. It is the process of retroactively providing documentation for an existing software system. The reconstructed documentation is typically used to aid program understanding. This process can be thought of as a transformation from source code to pseudocode and/or prose, which is considered to be at a higher level of abstraction. The documentation produced is typically in-line text but can also take the form of linked documentation accessible via hypertext, cross-reference listings, or graphical views of the software system's artifacts and relationships [Tilley 91, 92].

Plan Recognition

Program plans are abstractions of source code fragments. Comparison methods can recognize instances of programming plans in a subject system, using pattern matching at the programming-language semantic level.
Plan recognition can identify similar code fragments so that they may be consolidated.

Aggregation Hierarchies

Aggregation hierarchies are artifacts created from legacy code by grouping elements together. This technique is used, for example, to aggregate objects into a common class hierarchy.

Refactoring

Refactoring is the process of changing a software system in a way that does not alter the external behavior of the code but improves its internal structure. Refactoring can also be viewed as cleaning up code in a disciplined way that minimizes the chance of introducing defects [Fowler 99].

Architectural-Level Representation

The architectural level of abstraction assembles clusters of function-level and code-level artifacts into subsystems of related components or concepts.

Structural Pattern Matching

In structural pattern matching, existing libraries of design patterns are matched against code patterns that were mined using static-analysis techniques. Structural pattern matching can identify, for example, module dependencies that cannot be identified with a simple regular-expression pattern-matching tool, such as grep.

Concept Assignment and Reasoning

Concept assignment discovers human-oriented concepts within a specific program or its context and assigns them to their realizations [Biggerstaff 93]. One approach to concept assignment is for a maintenance engineer to designate relationships between textual cues and domain concepts and between domain concepts. These relationships form a simple domain model that can assign concepts to elements of the source code under analysis. These results can help the maintenance engineer understand the source code and reduce the cost of impact analysis.

Architecture and Structure Identification

Architecture and structure identification involves uncovering the as-is architecture of the system. As this technique is of particular interest in our modernization approach, we discuss it in detail in the next section.
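The cue-based approach to concept assignment described earlier can be sketched in a few lines. The example below is an assumption-laden illustration, not a description of any particular tool: the cue table, the concept relationships, and the COBOL-style element names are all hypothetical, and cues are matched only as whole hyphen-separated tokens.

```python
# Hypothetical domain model supplied by a maintenance engineer.
# Textual cues found in identifiers map to domain concepts...
CUE_TO_CONCEPT = {
    "acct": "Account",
    "cust": "Customer",
    "bal": "Balance",
    "inv": "Invoice",
}
# ...and some domain concepts are related to broader ones.
CONCEPT_PARENT = {
    "Balance": "Account",
    "Invoice": "Billing",
}

# Hypothetical source elements (for example, paragraph or data-item names).
SOURCE_ELEMENTS = ["UPDATE-ACCT-BAL", "PRINT-CUST-INV", "CALC-TAX"]


def assign_concepts(element_name):
    """Return the domain concepts suggested by the cues in an element name."""
    concepts = set()
    for token in element_name.lower().split("-"):
        concept = CUE_TO_CONCEPT.get(token)
        # Follow concept-to-concept relationships up to the broadest concept.
        while concept is not None:
            concepts.add(concept)
            concept = CONCEPT_PARENT.get(concept)
    return concepts


def elements_for(concept, elements):
    """Impact-analysis helper: which elements realize a given concept?"""
    return [name for name in elements if concept in assign_concepts(name)]


for name in SOURCE_ELEMENTS:
    print(name, "->", sorted(assign_concepts(name)))
print("Account:", elements_for("Account", SOURCE_ELEMENTS))
```

Inverting the assignment, as `elements_for` does, gives a first approximation of which source elements a change to a given domain concept might touch.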