Refactor DRAFT
The Processing Phase is the foundational stage of the Canopy IRDM framework. It focuses on technical normalization: ensuring that data is physically accessible, indexed, and transformed into a searchable format. The goal is to eliminate “Technical Debt” before the legal and scoping tracks begin.
The primary objective of Processing is to transform a raw, chaotic dataset into a “Search-Reliable” environment through six core outcomes:
- 100% Index Accessibility: Ensuring every file is decrypted, OCR’d, or transcribed so it can be “seen” by the system.
- Noise Reduction: Applying Global Deduplication and De-NISTing to ensure the team only interacts with unique, user-generated content.
- Technical Integrity: Verifying the bit-for-bit integrity of the data through SHA-256 hashing and Chain of Custody documentation.
- Jurisdictional Selection: Establishing a broad view of applicable laws based on the entity’s industry and data source.
- Initial Detection Calibration: Setting the “Pulse Check” parameters to identify known client-specific data patterns (e.g., specific Employee ID formats).
- Exclusionary Logic: Defining the initial boundaries for date ranges and excluded folders to prevent unnecessary cost.
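The “Pulse Check” calibration described above can be sketched as a simple regex pass over extracted text. The pattern names and the Employee ID format below are illustrative assumptions, not actual client rules; real patterns would come from the Processing Template.

```python
import re

# Hypothetical pulse-check patterns; real client-specific formats
# would be loaded from the Processing Template.
PULSE_PATTERNS = {
    # e.g. an Employee ID like "EMP-123456" (illustrative format only)
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    # U.S. SSN in the common dashed form
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pulse_check(text: str) -> dict:
    """Count hits per known pattern to confirm the detection rules fire."""
    return {name: len(pat.findall(text)) for name, pat in PULSE_PATTERNS.items()}
```

Running this against a seeded sample document with known PII confirms the rules hit before scoping searches are trusted.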
The phase begins once the following project management criteria are met:
- Processing Template Created: The Project Manager (Perry) has configured the template to address the specific scope of data and detection requirements.
- Infrastructure Allocation: Computational resources are assigned for high-volume OCR and ASR (Automatic Speech Recognition).
The digital intake process focuses on authentication and accountability—“signing for the package.”
- Authentication & Integrity: Verifying data matches the source via SHA-256 hashing.
- Chain of Custody: Creating a defensible audit trail of who handled the data and when.
- Security Screening: Scanning for malware or ransomware to protect the processing engine.
- Volume Validation: Confirming file counts and sizes match the source collection.
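The first two intake steps can be sketched as a single operation: verify the hash against the source manifest and append a chain-of-custody entry. The field names in the log entry are assumptions for illustration.

```python
import hashlib
import datetime

def sha256_of(data: bytes) -> str:
    """Compute the SHA-256 digest of a byte stream."""
    return hashlib.sha256(data).hexdigest()

def verify_and_log(data: bytes, expected_hash: str, handler: str, custody_log: list) -> bool:
    """Verify bit-for-bit integrity and record who handled the data and when."""
    actual = sha256_of(data)
    verified = (actual == expected_hash)
    custody_log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "handler": handler,
        "sha256": actual,
        "verified": verified,
    })
    return verified
```

A failed verification still produces a log entry, so the audit trail records the discrepancy rather than silently dropping it.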
This stage is the “opening of the files”: extracting content and metadata from every item.
- Exception Resolution: Identifying and remediating skipped, password-protected, or corrupted files.
- Media Transformation: Running ASR on audio/video files and OCR on image-based documents (PDFs, JPEGs).
- Image Classification: Flagging high-sensitivity visual data (IDs, SSN cards) that keywords might miss.
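The three steps above amount to a routing decision for each file. A minimal sketch, assuming simple extension sets and pre-computed `encrypted`/`corrupted` flags (both assumptions, not part of the framework specification):

```python
# Illustrative extension sets; a production pipeline would use true
# file-type detection rather than extensions alone.
IMAGE_TYPES = {".pdf", ".jpeg", ".jpg", ".tiff"}
MEDIA_TYPES = {".mp3", ".wav", ".mp4", ".m4a"}

def route_file(name: str, encrypted: bool = False, corrupted: bool = False) -> str:
    """Decide the next processing queue for a single file."""
    if encrypted or corrupted:
        return "exception"   # goes to Exception Resolution / the exception report
    ext = name[name.rfind("."):].lower() if "." in name else ""
    if ext in IMAGE_TYPES:
        return "ocr"         # image-based: needs OCR before indexing
    if ext in MEDIA_TYPES:
        return "asr"         # audio/video: needs transcription
    return "index"           # text-native: index directly
```

Anything routed to "exception" is what later appears in the Intake Exception Report with its remediation status.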
The dataset is then refined to increase search efficiency.
- Global Deduplication: Identifying a unique “Master” set for scoping.
- De-NISTing: Removing system files and known software signatures.
- Culling: Applying exclusionary rules (Date Range, File Type) to reduce the “noise” before Assessment.
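The three refinement steps can be combined into one pass over the collection. This is a minimal sketch: the known-system-hash set stands in for the NIST NSRL reference data, and the `(content, modified_date)` tuple shape is an assumption for illustration.

```python
import hashlib
from datetime import date

# Stand-in for the NIST NSRL known-file hash set used in De-NISTing.
KNOWN_SYSTEM_HASHES = {hashlib.sha256(b"windows-dll-sample").hexdigest()}

def refine(files, date_from: date, date_to: date) -> list:
    """Cull by date range, De-NIST, and globally deduplicate in one pass.

    `files` is an iterable of (content_bytes, modified_date) tuples.
    Returns the unique "Master" set of contents that survive refinement.
    """
    seen = set()
    master = []
    for content, modified in files:
        if not (date_from <= modified <= date_to):  # Culling: outside date range
            continue
        digest = hashlib.sha256(content).hexdigest()
        if digest in KNOWN_SYSTEM_HASHES:           # De-NISTing: known system file
            continue
        if digest in seen:                          # Global dedup: already kept
            continue
        seen.add(digest)
        master.append(content)
    return master
```

Hash-based deduplication is why scoping searches later run against a unique “Master” set rather than every physical copy.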
The phase produces the following deliverables:
- Digital Chain of Custody: The formal record of data transfer and handling.
- Intake Exception Report: Documentation of files that could not be processed and their remediation status.
- Processing Overview: High-level summary of file types, volume, and processing success rates.
- Search-Reliable Index: A fully indexed dataset ready for keyword and regex testing.
The following exit criteria are mandatory before the Assessment Phase (Scoping) can begin:
- 100% Exception Clearance: All files in the Intake Exception Report are reprocessed, decrypted, or moved to a “Manual Triage” bucket.
- Transcription & OCR Complete: All audio/video and image-based files have searchable text.
- Deduplication Applied: Scoping searches are calibrated to run against a unique “Master” set.
- Pulse Check Validated: Initial regex/detection rules successfully hit known client PII patterns.
- Status Transition: The project is marked as “Ready for Assessment.”
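The exit checklist above reduces to an all-or-nothing gate before the status flip. The criterion field names in this sketch are assumptions, not defined by the framework.

```python
# Hypothetical criterion flags mirroring the exit checklist above.
EXIT_CRITERIA = (
    "exceptions_cleared",
    "ocr_asr_complete",
    "dedup_applied",
    "pulse_check_validated",
)

def ready_for_assessment(project: dict) -> bool:
    """True only when every exit criterion is explicitly satisfied."""
    return all(project.get(criterion) is True for criterion in EXIT_CRITERIA)
```

Using `is True` means a missing or unset criterion blocks the transition rather than passing silently.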
Next Phase: Assessment (Scoping)