Generated: 2026-03-08
Database: K-CAT SQLite database (built from KFG source data)
Script: scripts/run_phase6_anomaly.py
Inputs: data/processed/phase3_clusters.csv · data/processed/phase5_color_diversity.csv
Status: ✅ Complete — anomaly detection is unsupervised; flagged khipus require expert review
Which khipus in the KFG corpus are structurally exceptional relative to the bulk of the corpus? Are there consistent structural patterns among outliers?
Three complementary anomaly detection methods are applied to a matrix of 11 structural features per khipu.
Feature set:
| Feature | Description |
|---|---|
| n_cords | Total cord count |
| n_pendants | Pendant (level-0) cord count |
| n_subsidiaries | Subsidiary cord count (levels 1+) |
| n_groups | Number of cord groups |
| numeric_coverage | Fraction of cords with a decoded numeric value |
| frac_broken | Fraction of cords marked as broken/damaged |
| n_colors | Simple color count (Phase 3) |
| n_pattern_types | Count of distinct summation pattern types present |
| n_unique_colors | Normalized color code count (Phase 5) |
| sub_ratio | Subsidiaries / pendants |
| group_size | Pendants / groups |
All features are standardized with StandardScaler before model fitting.
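As a rough illustration of the two derived ratios and the standardization step, here is a minimal standard-library sketch. The helper names `derive_ratios` and `standardize`, the toy rows, and the zero-denominator handling are assumptions for illustration; the source does not say how the actual script treats khipus with no pendants or no groups.

```python
from statistics import mean, pstdev

def derive_ratios(row):
    """Add sub_ratio and group_size to a dict of raw counts.

    Assumption: a zero denominator yields 0.0 (not confirmed by the source).
    """
    out = dict(row)
    out["sub_ratio"] = row["n_subsidiaries"] / row["n_pendants"] if row["n_pendants"] else 0.0
    out["group_size"] = row["n_pendants"] / row["n_groups"] if row["n_groups"] else 0.0
    return out

def standardize(rows, features):
    """Z-score each feature column using the population SD, as StandardScaler does."""
    scaled = [dict() for _ in rows]
    for f in features:
        col = [r[f] for r in rows]
        mu, sd = mean(col), pstdev(col)
        for s, v in zip(scaled, col):
            # A constant column is mapped to 0.0 rather than dividing by zero.
            s[f] = (v - mu) / sd if sd else 0.0
    return scaled

# Hypothetical toy rows, not real KFG values.
toy = [
    {"n_pendants": 10, "n_subsidiaries": 5, "n_groups": 2},
    {"n_pendants": 20, "n_subsidiaries": 0, "n_groups": 4},
    {"n_pendants": 30, "n_subsidiaries": 30, "n_groups": 3},
]
rows = [derive_ratios(r) for r in toy]
scaled = standardize(rows, ["n_pendants", "sub_ratio", "group_size"])
```

After scaling, each column has mean 0 and unit population variance, so a feature with a large raw range such as n_cords does not dominate distance-based methods.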
| Method | Parameters | Flagging threshold |
|---|---|---|
| Isolation Forest | 200 trees, contamination=5%, random_state=42 | Predicted −1 (outlier class) |
| Local Outlier Factor | k=20 neighbors, contamination=5% | Predicted −1 (outlier class) |
| Z-score | Per-feature, all features | Any feature with \|z\| > 3.0 (SD units) |
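The three detectors in the table could be configured with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the contents of scripts/run_phase6_anomaly.py: the feature matrix is synthetic random data, the 709-row size is inferred from the class counts reported in this document, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the 709 x 11 feature matrix (NOT KFG data).
X = rng.normal(size=(709, 11))
X[:5] += 8.0  # plant a few obvious outliers for demonstration

Z = StandardScaler().fit_transform(X)

# Parameters taken from the method table above.
if_flag = IsolationForest(
    n_estimators=200, contamination=0.05, random_state=42
).fit_predict(Z) == -1
lof_flag = LocalOutlierFactor(
    n_neighbors=20, contamination=0.05
).fit_predict(Z) == -1
# Z is already standardized, so its entries are per-feature z-scores.
z_flag = (np.abs(Z) > 3.0).any(axis=1)

# Number of methods that flag each row (0 to 3).
n_methods = if_flag.astype(int) + lof_flag.astype(int) + z_flag.astype(int)
```

Because `contamination` fixes the share of rows labelled −1, Isolation Forest and Local Outlier Factor each flag about 5% of the corpus by construction, whereas the z-score rule has no such cap.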
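Consensus classification: a khipu flagged by no method is Normal, by exactly one method is a Candidate, and by two or more methods is High-confidence.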
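The consensus rule, as implied by the class names used in this report (one method for Candidate, two or more for High-confidence), can be sketched as a simple vote count. The function names `consensus_class` and `methods_label` are hypothetical; the label format mirrors the Methods column of the anomaly catalog.

```python
def consensus_class(if_flag, lof_flag, z_flag):
    """Map three boolean method flags to a consensus class."""
    votes = int(if_flag) + int(lof_flag) + int(z_flag)
    if votes >= 2:
        return "High-confidence"
    if votes == 1:
        return "Candidate"
    return "Normal"

def methods_label(if_flag, lof_flag, z_flag):
    """Build a label such as 'IF+LOF+Z' from the methods that fired."""
    pairs = (("IF", if_flag), ("LOF", lof_flag), ("Z", z_flag))
    return "+".join(name for name, fired in pairs if fired)

print(consensus_class(False, True, True), methods_label(False, True, True))
```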
visualizations/phase6/anomaly_method_venn.png
| Method | Flagged | % corpus |
|---|---|---|
| Isolation Forest | 36 | 5.1% |
| Local Outlier Factor | 36 | 5.1% |
| Z-score (any feature with \|z\| > 3) | 75 | 10.6% |
| Overlap | Count |
|---|---|
| IF ∩ LOF | 14 |
| IF ∩ Z-score | 30 |
| LOF ∩ Z-score | 27 |
| All three | 14 |
| Class | Count | % corpus |
|---|---|---|
| Normal | 619 | 87.3% |
| Candidate (1 method) | 47 | 6.6% |
| High-confidence (≥ 2 methods) | 43 | 6.1% |
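The class counts can be cross-checked against the per-method and overlap counts by inclusion-exclusion. The snippet below only re-derives numbers already reported in this document; the corpus size of 709 is the sum of the three class counts.

```python
# Per-method counts and overlaps as reported above.
n_if, n_lof, n_z = 36, 36, 75
if_lof, if_z, lof_z, all3 = 14, 30, 27, 14
total = 619 + 47 + 43  # corpus size implied by the class table

# Flagged by at least one method (inclusion-exclusion).
flagged_any = n_if + n_lof + n_z - if_lof - if_z - lof_z + all3
# Flagged by two or more: each triple-flagged khipu appears in all three
# pairwise overlaps, so subtract it twice to count it once.
high_conf = if_lof + if_z + lof_z - 2 * all3
candidate = flagged_any - high_conf
normal = total - flagged_any

print(flagged_any, high_conf, candidate, normal)  # 90 43 47 619
```

Since the IF ∩ LOF count equals the all-three count, every khipu flagged by both Isolation Forest and Local Outlier Factor is also flagged by the z-score rule.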
| Feature | Khipus flagged by this feature as primary |
|---|---|
| group_size (pendants/groups) | 14 |
| numeric_coverage | 14 |
| frac_broken | 13 |
| n_groups | 8 |
| n_subsidiaries | 7 |
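The source does not define how the primary flag feature is chosen. One plausible reading, shown here purely as an assumption, is the feature with the largest absolute z-score for that khipu. The function name `primary_flag` and the example values are hypothetical.

```python
from collections import Counter

def primary_flag(z_scores):
    """Return the feature with the largest |z| (ASSUMED definition)."""
    return max(z_scores, key=lambda f: abs(z_scores[f]))

# Hypothetical z-score profiles, not real KFG values.
profiles = [
    {"n_cords": 1.2, "frac_broken": -4.0, "n_groups": 3.1},
    {"n_cords": 5.5, "frac_broken": 0.3, "n_groups": 2.0},
    {"n_cords": 0.1, "frac_broken": 3.4, "n_groups": -0.2},
]
tally = Counter(primary_flag(p) for p in profiles)
```

Tallying the result over all flagged khipus would produce a table of the same shape as the one above.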
visualizations/phase6/anomaly_scatter.png
visualizations/phase6/anomaly_profiles.png
Data: data/processed/phase6_anomaly_catalog.csv
| kfg_id | Provenance | Zone | Cluster | n_cords | n_patterns | n_uniq_colors | Numeric cov. | Frac broken | Primary flag | Methods |
|---|---|---|---|---|---|---|---|---|---|---|
| KH0082 | Lluta Valley | Arica & N. Chile | Complex | 1,831 | 7 | 236 | 22% | 0.2% | n_groups | IF+LOF+Z |
| KH0329 | Unknown | — | Complex | 1,227 | 8 | 129 | 28% | 7.8% | n_pendants | IF+LOF+Z |
| KH0468 | Ica / Pisco | Ica & Paracas | Complex | 955 | 6 | 54 | 24% | 17.2% | n_cords | IF+LOF+Z |
| KH0242 | Leymebamba | Chachapoyas | Complex | 874 | 9 | 8 | 63% | 0.1% | n_cords | IF+LOF+Z |
| KH0349 | Nazca | Nazca & Far South | Complex | 866 | 9 | 37 | 83% | 6.6% | n_pendants | IF+LOF+Z |
| KH0239 | Leymebamba | Chachapoyas | Complex | 758 | 6 | 36 | 61% | 11.5% | n_pendants | IF+LOF+Z |
| KH0083 | Mollepampa | Chachapoyas | Complex | 591 | 5 | 151 | 60% | 0% | n_subsidiaries | IF+LOF+Z |
| KH0617 | Incahuasi | Cañete–Pisco | Complex | 374 | 7 | 12 | 51% | 77.3% | n_subsidiaries | IF+Z |
| KH0135 | Pachacamac | Central Coast | Complex | 281 | 2 | 3 | 7.5% | 0.7% | numeric_coverage | LOF+Z |
| KH0289 | Unknown | — | Complex | 180 | 4 | 4 | 97.8% | 6.1% | n_groups | LOF+Z |
| KH0384 | Pisco Valley | Cañete–Pisco | Complex | 96 | 3 | 2 | 100% | 0% | n_groups | LOF+Z |
| KH0271 | Huari | S. Highlands | Complex | 91 | 4 | 2 | 100% | 2.2% | n_groups | LOF+Z |
All 43 high-confidence anomalies are Complex-cluster khipus. The full catalog is in data/processed/phase6_anomaly_catalog.csv.
Anomalies group into four structural profiles based on their primary flag features:
Large cord count (KH0082, KH0329, KH0468, KH0242, KH0349, KH0239, KH0083): > 500 cords. These are the largest khipus in the corpus. The Leymebamba cache accounts for several.
Near-complete numeric coverage (KH0280, KH0289, KH0271, KH0384): numeric coverage > 95%. Most khipus have coverage ~45%; these are well above corpus norms.
High breakage fraction (KH0617 at 77%, KH0519 at 58%, KH0498 at 47%): heavily damaged but retaining enough structure for analysis.
Extreme structural geometry (KH0454, KH0453, KH0383): either very few cord groups relative to pendant count or unusually high subsidiary depth.
visualizations/phase6/anomaly_features.png
High-confidence anomalies separate from the normal corpus primarily on n_cords (right tail), n_unique_colors (both extremes), sub_ratio (high end), and frac_broken (0.5–1.0 range).
frac_broken reflects missing or damaged values in the KFG. It is a data quality indicator that also counts cords whose knot data the KFG marks as uncertain.
To reproduce, run: python scripts/run_phase6_anomaly.py
Requires Phase 3 and Phase 5 outputs.
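Before re-running, it may help to confirm that the Phase 3 and Phase 5 outputs listed under Inputs are present. The check below is a small convenience sketch, not part of the pipeline; `missing_inputs` is a hypothetical helper and assumes the paths are relative to the repository root.

```python
from pathlib import Path

# Input paths as listed in the header of this report.
REQUIRED_INPUTS = (
    "data/processed/phase3_clusters.csv",
    "data/processed/phase5_color_diversity.csv",
)

def missing_inputs(root="."):
    """Return the required input paths that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED_INPUTS if not (root / p).is_file()]

if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All Phase 6 inputs found.")
```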
| Output | Description |
|---|---|
| data/processed/phase6_anomaly_scores.csv | Full corpus with all anomaly scores and flags |
| data/processed/phase6_anomaly_catalog.csv | Flagged khipus only (90 total: 43 high-conf + 47 candidates) |
| visualizations/phase6/anomaly_scatter.png | n_cords vs coverage and color diversity scatter |
| visualizations/phase6/anomaly_features.png | Feature distribution: normal vs high-confidence |
| visualizations/phase6/anomaly_method_venn.png | Consensus class distribution + method overlap counts |
| visualizations/phase6/anomaly_profiles.png | Normalized feature profiles for top-20 anomalies |
Corpus sweep run against K-CAT SQLite database. Re-run with scripts/run_phase6_anomaly.py to refresh.