Refactor DRAFT
The Processing Phase is the foundational stage of the Canopy IRDM framework. It focuses on technical normalization: ensuring that data is physically accessible, indexed, and transformed into a searchable format. The goal is to eliminate “Technical Debt” before the legal and scoping tracks begin.
The primary objective of Processing is to transform a raw, chaotic dataset into a “Search-Reliable” environment through six core outcomes:
- 100% Index Accessibility: Ensuring every file is decrypted, OCR’d, or transcribed so it can be “seen” by the system.
- Noise Reduction: Applying Global Deduplication and De-NISTing to ensure the team only interacts with unique, user-generated content.
- Technical Integrity: Verifying the bit-for-bit integrity of the data through SHA-256 hashing and Chain of Custody documentation.
- Jurisdictional Selection: Establishing a broad view of applicable laws based on the entity’s industry and data source.
- Initial Detection Calibration: Setting the “Pulse Check” parameters to identify known client-specific data patterns (e.g., specific Employee ID formats).
- Exclusionary Logic: Defining the initial boundaries for date ranges and excluded folders to prevent unnecessary cost.
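The “Pulse Check” calibration described above can be sketched as a simple regex pass over extracted text. The pattern names and the Employee ID format below are illustrative assumptions, not actual client rules; real patterns would come from the Processing Template.

```python
import re

# Hypothetical pulse-check patterns; real client-specific formats
# would be loaded from the Processing Template.
PULSE_PATTERNS = {
    # e.g. an Employee ID like "EMP-123456" (illustrative format only)
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    # U.S. SSN in the common dashed form
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pulse_check(text: str) -> dict:
    """Count hits per known pattern to confirm the detection rules fire."""
    return {name: len(pat.findall(text)) for name, pat in PULSE_PATTERNS.items()}
```

Running this against a seeded sample document with known PII confirms the rules hit before scoping searches are trusted.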
The phase begins once the following project management criteria are met:
- Processing Template Created: The Project Manager (Perry) has configured the template to address the specific scope of data and detection requirements.
- Infrastructure Allocation: Computational resources are assigned for high-volume OCR and ASR (Automatic Speech Recognition).
The digital intake process focuses on authentication and accountability—“signing for the package.”
- Authentication & Integrity: Verifying data matches the source via SHA-256 hashing.
- Chain of Custody: Creating a defensible audit trail of who handled the data and when.
- Security Screening: Scanning for malware or ransomware to protect the processing engine.
- Volume Validation: Confirming file counts and sizes match the source collection.
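The first two intake steps can be sketched as a single operation: verify the hash against the source manifest and append a chain-of-custody entry. The field names in the log entry are assumptions for illustration.

```python
import hashlib
import datetime

def sha256_of(data: bytes) -> str:
    """Compute the SHA-256 digest of a byte stream."""
    return hashlib.sha256(data).hexdigest()

def verify_and_log(data: bytes, expected_hash: str, handler: str, custody_log: list) -> bool:
    """Verify bit-for-bit integrity and record who handled the data and when."""
    actual = sha256_of(data)
    verified = (actual == expected_hash)
    custody_log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "handler": handler,
        "sha256": actual,
        "verified": verified,
    })
    return verified
```

A failed verification still produces a log entry, so the audit trail records the discrepancy rather than silently dropping it.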
This stage is the “opening of the files”: extracting content and metadata from every item.
- Exception Resolution: Identifying and remediating skipped, password-protected, or corrupted files.
- Media Transformation: Running ASR on audio/video files and OCR on image-based documents (PDFs, JPEGs).
- Image Classification: Flagging high-sensitivity visual data (IDs, SSN cards) that keywords might miss.
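The three steps above amount to a routing decision for each file. A minimal sketch, assuming simple extension sets and pre-computed `encrypted`/`corrupted` flags (both assumptions, not part of the framework specification):

```python
# Illustrative extension sets; a production pipeline would use true
# file-type detection rather than extensions alone.
IMAGE_TYPES = {".pdf", ".jpeg", ".jpg", ".tiff"}
MEDIA_TYPES = {".mp3", ".wav", ".mp4", ".m4a"}

def route_file(name: str, encrypted: bool = False, corrupted: bool = False) -> str:
    """Decide the next processing queue for a single file."""
    if encrypted or corrupted:
        return "exception"   # goes to Exception Resolution / the exception report
    ext = name[name.rfind("."):].lower() if "." in name else ""
    if ext in IMAGE_TYPES:
        return "ocr"         # image-based: needs OCR before indexing
    if ext in MEDIA_TYPES:
        return "asr"         # audio/video: needs transcription
    return "index"           # text-native: index directly
```

Anything routed to "exception" is what later appears in the Intake Exception Report with its remediation status.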
The dataset is then refined to increase search efficiency.
- Global Deduplication: Identifying a unique “Master” set for scoping.
- De-NISTing: Removing system files and known software signatures.
- Culling: Applying exclusionary rules (Date Range, File Type) to reduce the “noise” before Assessment.
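The three refinement steps can be combined into one pass over the collection. This is a minimal sketch: the known-system-hash set stands in for the NIST NSRL reference data, and the `(content, modified_date)` tuple shape is an assumption for illustration.

```python
import hashlib
from datetime import date

# Stand-in for the NIST NSRL known-file hash set used in De-NISTing.
KNOWN_SYSTEM_HASHES = {hashlib.sha256(b"windows-dll-sample").hexdigest()}

def refine(files, date_from: date, date_to: date) -> list:
    """Cull by date range, De-NIST, and globally deduplicate in one pass.

    `files` is an iterable of (content_bytes, modified_date) tuples.
    Returns the unique "Master" set of contents that survive refinement.
    """
    seen = set()
    master = []
    for content, modified in files:
        if not (date_from <= modified <= date_to):  # Culling: outside date range
            continue
        digest = hashlib.sha256(content).hexdigest()
        if digest in KNOWN_SYSTEM_HASHES:           # De-NISTing: known system file
            continue
        if digest in seen:                          # Global dedup: already kept
            continue
        seen.add(digest)
        master.append(content)
    return master
```

Hash-based deduplication is why scoping searches later run against a unique “Master” set rather than every physical copy.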
The phase produces the following deliverables:
- Digital Chain of Custody: The formal record of data transfer and handling.
- Intake Exception Report: Documentation of files that could not be processed and their remediation status.
- Processing Overview: High-level summary of file types, volume, and processing success rates.
- Search-Reliable Index: A fully indexed dataset ready for keyword and regex testing.
The following exit criteria are mandatory before the Assessment Phase (Scoping) can begin:
- 100% Exception Clearance: All files in the Intake Exception Report are reprocessed, decrypted, or moved to a “Manual Triage” bucket.
- Transcription & OCR Complete: All audio/video and image-based files have searchable text.
- Deduplication Applied: Scoping searches are calibrated to run against a unique “Master” set.
- Pulse Check Validated: Initial regex/detection rules successfully hit known client PII patterns.
- Status Transition: The project is marked as “Ready for Assessment.”
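The exit checklist above reduces to an all-or-nothing gate before the status flip. The criterion field names in this sketch are assumptions, not defined by the framework.

```python
# Hypothetical criterion flags mirroring the exit checklist above.
EXIT_CRITERIA = (
    "exceptions_cleared",
    "ocr_asr_complete",
    "dedup_applied",
    "pulse_check_validated",
)

def ready_for_assessment(project: dict) -> bool:
    """True only when every exit criterion is explicitly satisfied."""
    return all(project.get(criterion) is True for criterion in EXIT_CRITERIA)
```

Using `is True` means a missing or unset criterion blocks the transition rather than passing silently.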
Next Phase: Assessment (Scoping)