IIP developed its storage solution from years of experience in scanning and duplicating photographic images, ranging from military applications to historical documents. The challenge had always been balancing the time required to scan an image against the quality demanded by large clients. That challenge, in turn, gave way to the problem of the space required for each image, given that IIP's client base dealt with millions of images.
The type of work IIP performed was not directed toward the market of flatbed scanners, nor even larger scanners, in terms of quality and production process requirements. Instead, IIP developed a process and proprietary software that used specialized digital cameras to scan images of various dimensions, physical states, and types: various papers, photographic techniques and types, and so on. The process, coupled with the scanning stations, provided a production-oriented environment where imaging could take place 24/7, if required. The process and software covered the complete life cycle of image capture, correction, and quality assurance before the image was placed on a CD or tape for shipment to the client.
Clients of IIP had requirements to digitize documents and photographs to provide a wider distribution and availability of these items through the Internet. Consequently, these clients had become part of the growing movement within both the academic community and public sector to save historical documents. For the most part, these markets are just emerging, given the tremendous amount of material that remains to be scanned and digitized.
IIP's client requirements could run to 500,000-plus images for a single project. That type of work drove the production-oriented environment IIP introduced in the late 1990s.
IIP maintained a distributed infrastructure with its headquarters in Texas and field locations in New York, Washington, D.C., and San Francisco. Figure A-1 illustrates the configuration at the headquarters location. Here you see the integration of capture stations, process servers, correction/quality assurance workstations, database servers, and archive servers. Each field location is set up in an identical fashion, and each is linked to the headquarters network and web server through a virtual private network (VPN). E-mail and FTP services are handled in this manner. Architecturally, this setup was designed for future development of remote scan processing and diagnostic imaging services.
A walk through the imaging process shows the storage utilization scenarios and why the process is so data-centric. Figure A-2 depicts the process from start to finish.
Physical images are entered into a database. The database drives and tracks the processing of images from capture to output on media. In IIP's case, the database is a relational database that tracks the location and status of each scanned image so that at any time the image can be located and accessed for specific purposes. The initial step in the process is to have the image scanned into the system. An important note regarding the database: Because of certain restrictions and challenges in handling unstructured data with relational technology, the scanned images are actually stored as files within the system. Therefore, the database itself is quite small, as its main job is tracking the locations and characteristics of the scanned images.
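This metadata-only design can be sketched with a small relational table. The schema below is a hypothetical illustration, assuming table and column names that are not given in the text; the key point it demonstrates is that the database holds only the location and status of each image, while the image data itself lives as a file on a process server.

```python
import sqlite3

# Hypothetical sketch of IIP-style tracking: the database records where
# each scanned image file lives and what state it is in; the 300-500MB
# raw scan itself is stored as a file, not in the database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        image_id  INTEGER PRIMARY KEY,
        file_path TEXT NOT NULL,   -- network path to the raw scan file
        status    TEXT NOT NULL,   -- captured / corrected / evaluated / delivered
        size_mb   INTEGER          -- raw scans average 300-500MB
    )
""")
conn.execute(
    "INSERT INTO images (file_path, status, size_mb) VALUES (?, ?, ?)",
    (r"\\procserver1\raw\batch042\img0001.tif", "captured", 350),
)

# At any point in the process, an image can be located and its status checked.
row = conn.execute(
    "SELECT file_path, status FROM images WHERE image_id = 1"
).fetchone()
print(row)
```

Because only paths and attributes are stored, the database stays small even when a project runs to hundreds of thousands of images.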
At the capture station, the images are scanned into the system. Capture stations are high-end workstations that communicate with the specialized digital cameras. A camera scans the image into the capture station; then the capture software, running on Windows 2000 Professional, writes the scanned image to a network drive on an available process server over the network and moves to the next image. Average images range from 300 to 500MB per raw image.
The process server automatically corrects the image for perspective and positioning and updates the database. The process creates another set of files for the corrected image. This set of files is smaller because it contains only the changes made to the image. The image is then ready to be evaluated and checked out.
The image evaluators view each image against preset quality parameters agreed upon with clients. Images that meet the quality criteria are sent to checkout for delivery to the media servers. Images that do not may need manual correction, which is performed at the evaluation workstations, or may be sent back for rescanning, as required. All of this is tracked by the database.
The delivery point is the media servers, where the completed images are batched together and written to optical media, CDs, or tape, per the requirements of the client. At this point, a final verification is performed; if it passes, the storage on the process servers becomes available for more scanned images.
What IIP had not foreseen were the systems infrastructure requirements for this type of production work, which called for calculating both processing cycles and, most of all, the amount of storage space needed on a daily operational basis. Because IIP is a small business, it had resisted a formal capacity plan and had relied on its ability to respond quickly when additional capacity was needed. That meant additional servers were purchased on an as-needed basis, with most of the hardware being do-it-yourself chassis and motherboard configurations.
With this do-it-yourself orientation to the hardware portion of the systems infrastructure, the storage challenges were first met with larger, higher-speed internal IDE disks. This gave rise to additional server installations needed to handle post-scan image processing (see Figure A-2, Steps 2 through 4), which in turn prompted the acquisition of dedicated media servers to write out client images on CD or tape media; this is the archival system on the back side of the process (see Figure A-2, Step 5, the delivery process). A faster network then became necessary to speed the transmission of raw scan files to the process servers, and ultimately the problem landed back at the storage infrastructure once again as image scans overtook the capacities of the servers.
A stopgap measure was a move to IDE RAID to provide adequate storage for the process servers, driven largely by the do-it-yourself hardware approach and severe budget constraints. Although IDE RAID provided a quick fix, its reliability and backup protection proved problematic. In many cases, the data moving through a single process server could easily surpass five terabytes in a week. Under this sustained write activity, the IDE drives generally failed twice a month, with minimal success running data protection at RAID level 1. Still, the added space offered a brief respite from the space shortages that could shut down the process entirely.
Given that additional business was coming in with more restrictive time constraints for completion, IIP concluded that a longer term solution had to be found.
The IIP hardware and software IT personnel researched the solution with assistance and input from the imaging software specialist. They found that a SAN was a valid consideration, since it appeared to be the choice of others working with unstructured data such as video and audio projects. However, they also found that the imaging system, although proprietary in its methodology, used open, commodity-level hardware and operating environments and was therefore open to additional solutions that integrated well into a small business environment. Another alternative was to move to larger process servers with external SCSI drive arrays, scaling up in both processing and storage power. Yet another alternative was a NAS solution, which would integrate easily with the existing network, would use file systems, and would have the capacity they needed.
The IT personnel, working with an outside consultant, used the guidelines mentioned in Chapter 17 to identify the company's workloads. They then worked out the type of configuration needed for each of the three possible alternatives: first the SAN configuration, followed by larger servers with external RAID, and finally the potential NAS configuration. The results are summarized in the following sections.
Workload Identification

Looking at a year's history of scanning images, the IIP team concluded that the workload was complex and data-centric, fitting somewhere between online transaction processing (OLTP) and data warehousing. The workload took on OLTP characteristics when scanning an image and transmitting the write transaction to the process server. Although developed as a synchronous process, this had recently been changed to an asynchronous process to increase throughput at the capture station. However, it still required a sequential write process at the process server as each image was scanned.
On average, the image scans were 300MB in size. At 300 images per shift, with three capture stations working two shifts, the team calculated that at least 540GB of free space was needed to accommodate the daily scanning process. This required that the overall storage infrastructure be able to accommodate a throughput rate of 566MB per second.
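The daily capacity figure follows directly from the scan counts quoted above. A quick arithmetic check, using only the figures given in the text:

```python
# Back-of-the-envelope check of IIP's daily scan volume:
# 300MB average image, 300 images per shift per station,
# three capture stations, two shifts per day.
avg_image_mb = 300
images_per_shift = 300
stations = 3
shifts = 2

daily_images = images_per_shift * stations * shifts   # images scanned per day
daily_gb = daily_images * avg_image_mb / 1000         # MB -> GB (decimal)
print(daily_images, daily_gb)   # 1800 images, 540.0 GB of free space per day
```

The 1,800 images per day at 300MB each account for the 540GB of daily free space the team arrived at.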
Workload Estimates for SAN

Using the guidelines described in Chapter 19, we can quickly calculate that the required components for a SAN could be handled by one 16-port switch, given that a single point of failure is acceptable for the installation, or by two 8-port switches for some level of redundancy. Three dual-port HBAs would be required for the process servers, for redundancy and performance. Not to be overlooked, this configuration would also require new Fibre Channel storage arrays to accommodate, and be compatible with, the new Fibre Channel storage network. Given that the total capacity of 540GB needs to be available every 24 hours, we can estimate that two storage arrays of 500GB each would provide the necessary capacity, with sufficient free space to handle peak utilization as images are processed through the system.
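A rough port budget shows why a single 16-port switch suffices. The two ports per storage array are an assumption for illustration; the text specifies only the HBA, switch, and array counts.

```python
# Rough SAN port budget for the single-switch option.
process_servers = 3
hba_ports_per_server = 2     # one dual-port HBA per process server
storage_arrays = 2           # 2 x 500GB Fibre Channel arrays
ports_per_array = 2          # assumed for this sketch

server_ports = process_servers * hba_ports_per_server   # 6 ports
array_ports = storage_arrays * ports_per_array          # 4 ports
total_ports = server_ports + array_ports

total_capacity_gb = storage_arrays * 500
print(total_ports, total_capacity_gb)   # 10 ports, 1000GB
```

Ten ports leave headroom on a 16-port switch, and the 1,000GB of array capacity comfortably covers the 540GB of daily scan data plus peak-utilization free space.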
Workload Estimates for Direct Attached

Aligning the requirements to new servers, we find that all the process servers would have to be upgraded. This would also require that storage capacities be carefully aligned with each process server, and even with this alignment, specific workload affinity would have to be observed to utilize the storage effectively. On the other hand, the process servers could more easily share storage across the network, but some level of duplication of storage would be needed to accommodate the total capacity, essentially doubling the entire storage requirement.
This would require, in addition to the new server installations, OS software upgrades with appropriate maintenance and all the other activities of a major system installation, along with the disruption of service and reliability typical of new system installations. The new servers would have to handle an aggregate I/O throughput of 566MB per second, or 188MB per second each if the workload were evenly distributed (which in most cases it will not be, but we will use this figure for estimating purposes). That translates to a minimum of six Ultra-wide SCSI-3 adapters per server to sustain 188MB per second. The total storage would have to be divided among the servers, and subsequently the adapters, placing a limit of approximately 120GB per LUN. The result is a more complex management problem in terms of the flexibility to reconfigure storage as required, given that one capture station could generate 180GB of images every 24 hours.
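The per-server numbers above can be checked with a short calculation. The roughly 32MB per second sustained rate per Ultra-wide SCSI-3 adapter is an assumed planning figure used here to reproduce the stated minimum of six adapters; the text gives only the result.

```python
import math

# Per-server throughput and adapter count for the direct-attached option.
aggregate_mb_s = 566
servers = 3
per_server_mb_s = aggregate_mb_s / servers   # ~188MB/s each, if evenly spread

# Assumed sustained transfer rate per Ultra-wide SCSI-3 adapter.
sustained_per_adapter = 32
adapters_per_server = math.ceil(per_server_mb_s / sustained_per_adapter)

print(int(per_server_mb_s), adapters_per_server)   # 188 MB/s -> 6 adapters
```

One capture station's daily output checks out the same way: 300 images per shift, two shifts, at 300MB each is 180GB per 24 hours, matching the figure above.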
Workload Estimates for NAS

Using the worksheet and guidelines in Chapter 19, we can calculate that our workload requirements fall well within the mid-range NAS device configuration and probably just under the enterprise NAS solutions. Our calculations indicate the following minimum requirements:
Two network paths
Eleven data paths
An additional path for redundancy (calculated using the 10 percent special-applications category)
13.2 total logical paths
Data paths as a percentage of total logical paths = 83 percent
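The final comparison from the worksheet can be reproduced from the figures above. The worksheet's intermediate arithmetic is not shown here; only the quoted path counts are used.

```python
# Reproducing the NAS worksheet's final comparison from the quoted figures.
network_paths = 2
data_paths = 11
total_logical_paths = 13.2   # includes the ~10% special-applications factor

ratio = data_paths / total_logical_paths
print(f"{ratio:.0%}")   # 83%
```

The 83 percent sizing factor is what places the workload at the upper end of the mid-range NAS category, as discussed next.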
Using the quick-estimate NAS Sizing Factor Table in Chapter 19, we select mid-range even though our sizing factor is within the enterprise range. This is based on the special application circumstances, and because the aggregate data is below a terabyte and would be physically segmented within the aggregate data capacity estimate. In addition, we considered that the workload was further characterized by a limited number of users working on an almost dedicated GbE network.
The NAS solutions also offer the flexibility of incremental storage selection: for example, installing two large NAS servers and one small server, or one large server and two medium-sized servers. These solutions also provide the flexibility of RAID processing, network compatibility, and no disruption to the existing server configurations. In addition, they can be easily configured to support the scanning projects and mapped as network drives with the same flexibility. They will also provide a closed, yet remotely accessible, solution for the remote network configurations.
One last word on our estimating process: We recognize the characteristics of the small, integrated IT staff and the company's lack of any formal capacity planning activities. The process of workload identification and estimation provides the company with a level of direction and planning. This exercise has identified that mid-range NAS devices can meet the company's workload now and within a limited planning period. However, it also provides insight into future challenges IIP will encounter, as its staff has become aware that the company borders on moving into enterprise solutions, either of the NAS type or probably a SAN, if the budget for infrastructure can support either.