🔧 Classification Pipeline - Technical Data Flow

Detailed Architecture & Processing Schema | Phases 1-10

Classification Rules

Data Columns Added

Processing Phases

Rule Tiers

📊 Data Flow Architecture

graph LR A[Raw Panjiva CSV
129 columns] --> B[Stage 00
Preprocessing] B --> C[Column Removal
79 columns deleted] C --> D[Column Rename
12 renamed] D --> E[Enrichment
+12 new columns] E --> F[Master CSV
62 columns] F --> G[Classification
Pipeline] G --> H[Phase 1: Filters] H --> I[Phase 2-3: Carriers] I --> J[Phase 4-7: HS2+Keywords] J --> K[Phase 8: Combinatorial] K --> L[Phase 9: Refinements] L --> M[Phase 10: High-Value] M --> N[Classified Output
+5 classification columns] N --> O[Pivot Summaries
Analytics] style A fill:#e74c3c style F fill:#3498db style G fill:#9b59b6 style N fill:#2ecc71 style O fill:#f39c12

🗄️ Column Schema Evolution

Stage 00: Raw to Master

Step	Columns	Action
Raw Import	129	Original Panjiva export
After Removal	50	Removed 79 irrelevant columns
After Enrichment	62	+RAW_REC_ID, +Count, +HS2/4/6, +Qty/Pckg, +10 classification cols

Classification Columns Added

Group           → Dry Bulk | Liquid Bulk | Break-Bulk | Ro/Ro | Reefer
Commodity       → Metals & Minerals | Petroleum | Agricultural Products | etc.
Cargo           → Iron Products | Crude Oil | Grain | etc.
Cargo_Detail    → Iron Ore/DRI | Crude Oil | Wheat | etc.
Pass            → Pass 1, Pass 2, ... Pass 104
Rule_ID         → CARR-RORO, AGG-01, CRUDE-VAR, etc.
Filter          → SHIP_SPARES | FROB | [empty]
            

🔄 Processing Order & Dependencies

graph TD Start[Unclassified Record] --> Check1{Filter ≠ empty?} Check1 -->|Yes| Skip[Skip - Already Filtered] Check1 -->|No| Check2{Group ≠ empty?} Check2 -->|Yes| Skip2[Skip - Already Classified] Check2 -->|No| P1[Apply Phase Rules] P1 --> Tier1{Tier 1:
Carrier Match?} Tier1 -->|Yes| Lock[LOCK Classification
Never Override] Tier1 -->|No| Tier2{Tier 2:
Package Type?} Tier2 -->|Yes| ClassPkg[Classify by Package] Tier2 -->|No| Tier3{Tier 3:
HS2 + Keywords?} Tier3 -->|Yes| ClassHS2[Classify by HS2] Tier3 -->|No| Tier4{Tier 4:
Tonnage Override?} Tier4 -->|Yes| ClassTons[Override HS Code] Tier4 -->|No| Unclass[Remain Unclassified] Lock --> Output[Write to Output] ClassPkg --> Output ClassHS2 --> Output ClassTons --> Output Unclass --> Output style Lock fill:#2ecc71 style ClassPkg fill:#3498db style ClassHS2 fill:#9b59b6 style ClassTons fill:#e74c3c style Unclass fill:#95a5a6

⚙️ Rule Execution Matrix

Tier	Priority	Override	Accuracy	Example Rules
TIER 1 Carrier Locks	Highest	NEVER	100%	WALLENIUS → RoRo COOL CARRIERS → Reefer STOLT → Chemicals
TIER 2 Package Types	Very High	Can refine	98%	LBK → Liquid Bulk BLK/DBK → Dry Bulk
TIER 3 HS2 + Keywords	High	Can refine	85-95%	HS2 68 + "limestone" → Aggregates HS2 44 + "lumber" → Lumber
TIER 4 Tonnage Override	Medium	OVERRIDES HS	75-85%	>1000 tons + "steel" → Steel >500 tons + "cement" → Cement
TIER 5 User Refinements	Low-Medium	Can refine	75-90%	"manganese" → Manganese "phosphate rock" → Phosphate

📁 File System Structure

Directory Layout

G:\My Drive\LLM\project_manifest\
│
├── 00_raw_data\
│   ├── 00_01_panjiva_imports_raw\          [170 ZIPs + CSVs]
│   └── 00_05_all_raw_archive\              [Master consolidated]
│
├── 01_step_one\
│   └── 01_01_panjiva_imports_step_one\     [Yearly splits]
│       ├── panjiva_imports_2023_*.csv
│       ├── panjiva_imports_2024_*.csv
│       └── panjiva_imports_2025_*.csv
│
└── build_documentation\
    ├── classification_full_2023\
    │   ├── panjiva_2023_classified_phase10_*.csv
    │   └── pivot_summary_2023_phase10_*.csv
    ├── classification_full_2024\
    │   ├── panjiva_2024_classified_phase10_*.csv
    │   └── pivot_summary_2024_phase10_*.csv
    └── classification_full_2025\
        ├── panjiva_2025_classified_phase10_*.csv
        └── pivot_summary_2025_phase10_*.csv
            

🎯 Classification Hierarchy

graph TD Root[Record] --> Group Group --> G1[Dry Bulk] Group --> G2[Liquid Bulk] Group --> G3[Break-Bulk] Group --> G4[Ro/Ro] Group --> G5[Reefer] G1 --> C1[Construction Materials] G1 --> C2[Metals & Minerals] G1 --> C3[Agricultural Products] G1 --> C4[Fertilizers] G1 --> C5[Industrial Minerals] G2 --> C6[Petroleum] G2 --> C7[Chemicals] G2 --> C8[Vegetable Oils] G3 --> C9[Steel] G3 --> C10[Lumber & Wood Products] G3 --> C11[Paper & Forest Products] G3 --> C12[Machinery] C1 --> Car1[Aggregates] C1 --> Car2[Cement] C1 --> Car3[Stone Products] C2 --> Car4[Iron Products] C2 --> Car5[Aluminum] C2 --> Car6[Ores] C2 --> Car7[Base Metals] C6 --> Car8[Crude Oil] C6 --> Car9[Refined Products] C6 --> Car10[LPG] C9 --> Car11[Flat Products] C9 --> Car12[Finished Steel] C9 --> Car13[Semi-Finished Steel] style Root fill:#e74c3c style Group fill:#3498db style G1 fill:#9b59b6 style G2 fill:#1abc9c style G3 fill:#f39c12 style C1 fill:#2ecc71 style C6 fill:#e67e22 style C9 fill:#34495e

📊 Data Volume by Phase

🔍 Rule Specificity Levels

Specificity	Example	Accuracy	Tonnage Impact
Exact Grade	"BASRAH HEAVY CRUDE"	99%+	Very High (40M tons)
Product Spec	"PRIMARY ALUMINIUM 99.90%"	95%+	High (5M tons)
Brand Name	"EUCABOARD"	98%+	Medium (3M tons)
Carrier Name	"WALLENIUS"	100%	Very High (30M tons)
Package Type	"LBK"	98%	Extremely High (275M tons)
HS2 + Keyword	HS2 68 + "limestone"	90%+	High (30M tons)
Generic Keyword	"salt"	85%+	Medium (33M tons)

⏱️ Performance Metrics

30K

Records/minute

Minutes per year

Minutes for 2025

Total runtime (all years)

🎓 Key Learnings

1. Package Types > HS Codes

LBK package indicator alone captured 501M tons (50% of classified tonnage), proving package types are more reliable than HS codes for bulk commodities.

2. Simple Keywords > Complex Patterns

User's "salt is just salt" intuition: simple keyword matching captured 13-19x more tonnage than specific variants.

3. Carrier Names = 100% Accuracy

RoRo/Reefer/Chemical carrier rules showed 100% accuracy across all years. Always process carrier-based rules first.

4. Tonnage Overrides Work

Rules like ">1000 tons + steel keywords = steel" successfully override misclassified HS codes, capturing bulk shipments.

5. Commodity Grades Are Gold

Specific crude oil grades (BASRAH, KIRKUK, LIZA, TUPI) captured 79M tons - more reliable than generic "crude oil" keywords.

System Status: ✅ Production Ready

Architecture validated across 1.3M records, 2.1B tons, 3 years

Ready for: ML pattern discovery | Monthly updates | Granularity refinement