One of the obstacles to data preparation, integration, and analysis is dealing with the unstructured free text captured in crime reports. As we saw in the case study in the previous chapter (1.15.3 Data Selection, Cleaning, and Coding), crime reports may contain free-text fields, which make automated data analysis difficult. Police officers and investigators may use widely varying styles and formats when describing crime scenes and modus operandi. Spelling errors and inconsistent abbreviations in these free-text fields make it difficult to structure the data they contain into categories for import into data mining software.
To reduce this inconsistency, police departments and agencies may want to standardize their crime reports by eliminating or reducing free-text fields in favor of checklist and categorical fields. For example, rather than allowing investigators to enter free-form variations such as "Southern acent," "Southern accent," "accent southern," "local accent," "accent: Southern," or "not local accent," the crime report would use a checklist format, such as this:
CRIME REPORT
Perpetrator Accent: [ ] Local [ ] Not local [ ] Southern [ ] etc.
Perpetrator Race: [ ] White [ ] Black [ ] Hispanic [ ] Asian [ ] etc.
Perpetrator Height: [ ] 5' [ ] 5' 2" [ ] 5' 4" [ ] 5' 6" [ ] 5' 8" [ ] 5' 10" [ ] 6' [ ] etc.
Perpetrator Age: [ ] 14 [ ] 16 [ ] 18 [ ] 20 [ ] 22 [ ] 24 [ ] 26 [ ] 28 [ ] 30 [ ] etc.
Perpetrator Build: [ ] Slim [ ] Medium [ ] Heavy [ ] etc.
Perpetrator Hair Color: [ ] Dark [ ] Light [ ] etc.
Perpetrator Hair Length: [ ] Short [ ] Long [ ] etc.
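Where existing free-text entries must be mapped onto such a checklist, the accent variants listed above could be normalized along the following lines. This is only a minimal sketch using Python's standard `difflib` module; the function name `normalize_accent`, the category list, and the similarity cutoff of 0.8 are assumptions chosen for illustration, not part of any particular records system.

```python
import difflib
import re

# Hypothetical checklist categories, taken from the sample form above.
ACCENT_CATEGORIES = ["local", "not local", "southern"]


def normalize_accent(entry: str) -> str:
    """Map a free-text accent description to one checklist category."""
    # Lowercase the entry and replace anything that is not a letter or space.
    text = re.sub(r"[^a-z ]", " ", entry.lower())
    # Drop the filler word "accent", including close misspellings such as "acent".
    words = [
        w for w in text.split()
        if not difflib.get_close_matches(w, ["accent"], n=1, cutoff=0.8)
    ]
    cleaned = " ".join(words)
    # Pick the closest category; the 0.8 cutoff is an arbitrary assumption.
    match = difflib.get_close_matches(cleaned, ACCENT_CATEGORIES, n=1, cutoff=0.8)
    return match[0] if match else "unknown"


if __name__ == "__main__":
    for raw in ["Southern acent", "accent: Southern", "not local accent"]:
        print(raw, "->", normalize_accent(raw))
```

Entries that do not resemble any category fall through to "unknown" so that they can be reviewed by hand rather than silently misfiled.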
Another way to standardize free-text descriptions from crime reports is to use a text mining tool. Text mining software can extract the free-form text summaries found in crime reports and group them into major categories. This is one possible solution in situations where there is a large volume of historical crime reports.
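To give a rough sense of how a text mining tool might surface candidate categories, the sketch below counts how many narratives each term appears in and lists the most widespread terms. It is a deliberately simplified illustration, not the method of any specific product: the sample narratives are invented, and the stopword list and the function name `candidate_categories` are assumptions.

```python
import re
from collections import Counter

# Hypothetical narratives, invented for illustration only.
REPORTS = [
    "Suspect forced rear window, took jewelry and cash.",
    "Rear window forced; jewelry missing from bedroom.",
    "Suspect entered through unlocked door and took cash.",
]

# A deliberately small stopword list; a real tool would use a fuller one.
STOPWORDS = {"and", "from", "through", "the", "a", "took"}


def candidate_categories(reports, top_n=5):
    """Return (term, report_count) pairs for the most widespread terms."""
    doc_freq = Counter()
    for report in reports:
        # Count each term once per report, ignoring case and punctuation.
        tokens = set(re.findall(r"[a-z]+", report.lower()))
        doc_freq.update(tokens - STOPWORDS)
    return doc_freq.most_common(top_n)


if __name__ == "__main__":
    for term, count in candidate_categories(REPORTS):
        print(term, count)
```

An analyst would still need to review the resulting terms and decide which ones become coded categories, such as point of entry or property taken.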