YAKE Keyword Extraction Integration¶

Overview¶

Claude Knowledge Catalyst v0.10.0 introduces advanced keyword extraction capabilities powered by YAKE (Yet Another Keyword Extractor). This unsupervised machine learning algorithm automatically identifies and extracts meaningful keywords from technical documents, enhancing the automated classification and tagging system.

What is YAKE?¶

YAKE (Yet Another Keyword Extractor) is a novel feature-based system for multi-lingual unsupervised automatic keyword extraction that builds upon statistical text features without relying on dictionaries or thesauri, training documents, or linguistic knowledge.

Key Benefits¶

Unsupervised Learning: No training data or dictionaries required
Multi-Language Support: Works with 7 languages including Japanese and English
Technical Content Optimization: Specifically tuned for technical documentation
Confidence Scoring: Quality assessment for extracted keywords

Supported Languages¶

YAKE integration in CKC supports the following languages:

English (en) - Primary language
Japanese (ja) - Full support for technical documents
Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)

How It Works¶

1. Automatic Language Detection¶

When processing content, CKC automatically detects the language:

# Example: Processing a Japanese technical document
content = """
# FastAPI の認証システム

JWTトークンベースの認証を実装します。
OAuth2スキーマとセキュリティスコープを使用。
"""

# YAKE automatically detects Japanese and extracts relevant keywords
# Result: ["FastAPI", "認証システム", "JWT", "OAuth2", "セキュリティ"]

2. Keyword Extraction Process¶

The YAKE integration follows these steps:

Content Analysis: Analyze document structure and content
Language Detection: Automatically identify the primary language
Text Normalization: Clean and prepare text for processing
Keyword Extraction: Apply YAKE algorithm with optimized parameters
Confidence Scoring: Rate keyword quality and relevance
Tag Integration: Merge keywords with existing tag system

3. Configuration Options¶

YAKE extraction can be customized through configuration:

# ckc_config.yaml
ai_classification:
  yake_enabled: true
  yake_config:
    max_ngram_size: 3           # Maximum phrase length
    deduplication_threshold: 0.7 # Similarity threshold for deduplication
    max_keywords: 20            # Maximum keywords to extract
    confidence_threshold: 0.5   # Minimum confidence score
    supported_languages:
      japanese: "ja"
      english: "en"
      spanish: "es"
      french: "fr"
      german: "de"
      italian: "it"
      portuguese: "pt"

Integration with Smart Classification¶

YAKE works seamlessly with CKC’s existing classification system:

Enhanced Metadata Generation¶

# Before YAKE integration
---
title: "API Design Guide"
type: concept
tech: ["api"]
---

# After YAKE integration
---
title: "API Design Guide"
type: concept
tech: ["api", "rest", "graphql", "openapi"]
domain: ["web-dev", "backend"]
keywords: ["authentication", "rate-limiting", "versioning"]
confidence: high
---

Keyword-Driven Tag Suggestions¶

The SmartContentClassifier now uses YAKE keywords to enhance tag suggestions:

Tech Tags: Extract technology names (FastAPI, React, Kubernetes)
Domain Tags: Identify application domains (web-dev, machine-learning)
Concept Tags: Discover conceptual terms (authentication, optimization)
Quality Assessment: Evaluate content depth and technical accuracy

Usage Examples¶

Basic Keyword Extraction¶

# Extract keywords from a single file
uv run ckc classify my-document.md --show-keywords

# Batch process with keyword extraction
uv run ckc batch-classify .claude/ --enable-yake

Language-Specific Processing¶

# Force language detection for mixed-language documents
uv run ckc classify bilingual-doc.md --language japanese

# Process Japanese technical documentation
uv run ckc sync --language-priority japanese

Integration with Obsidian¶

YAKE-extracted keywords automatically become Obsidian tags:

<!-- In Obsidian vault -->
---
title: "Machine Learning Pipeline Optimization"
tags:
  - type/concept
  - tech/python
  - tech/tensorflow
  - tech/kubernetes
  - domain/machine-learning
  - yake/optimization
  - yake/pipeline
  - yake/performance-tuning
---

# Machine Learning Pipeline Optimization

Content with automatically extracted and tagged keywords...

Performance and Quality¶

Extraction Quality¶

YAKE in CKC is optimized for technical content:

Precision: High-quality keyword identification
Recall: Comprehensive coverage of important terms
Relevance: Context-aware technical terminology
Consistency: Stable results across similar documents

Performance Characteristics¶

Speed: Average 50-100ms per document
Memory: Minimal memory footprint
Scalability: Efficient batch processing
Accuracy: >85% relevant keyword extraction for technical content

Best Practices¶

1. Content Preparation¶

For optimal results:

Use clear headings and structure
Include technical terminology naturally
Maintain consistent language usage
Provide sufficient content length (>100 words)

2. Configuration Tuning¶

Adjust YAKE parameters based on your content:

# For highly technical content
yake_config:
  max_ngram_size: 4           # Capture longer technical phrases
  confidence_threshold: 0.7   # Higher quality threshold
  max_keywords: 15            # Focus on most relevant terms

# For diverse content types
yake_config:
  max_ngram_size: 2           # Shorter, more general phrases
  confidence_threshold: 0.4   # Lower threshold for broader coverage
  max_keywords: 25            # More comprehensive extraction

3. Quality Monitoring¶

Monitor extraction quality:

# Review keyword extraction quality
uv run ckc analyze keywords --confidence-report

# Identify low-confidence extractions
uv run ckc search --confidence low --type yake-keywords

Troubleshooting¶

Common Issues¶

1. Poor Language Detection

# Explicitly specify language
uv run ckc classify document.md --force-language japanese

2. Low-Quality Keywords

# Increase confidence threshold
yake_config:
  confidence_threshold: 0.8

3. Too Many/Few Keywords

# Adjust keyword count and n-gram size
yake_config:
  max_keywords: 10
  max_ngram_size: 2

Performance Optimization¶

For large document sets:

# Process in batches with progress tracking
uv run ckc batch-classify .claude/ --batch-size 50 --progress

# Use confidence filtering to reduce processing time
uv run ckc classify --min-confidence 0.6

Migration from Previous Versions¶

If upgrading from CKC v0.9.x:

Automatic Migration: YAKE integration is enabled by default
Backward Compatibility: Existing classifications remain unchanged
Enhanced Metadata: New YAKE keywords supplement existing tags
Configuration: Update ckc_config.yaml to customize YAKE behavior

# Update existing classifications with YAKE
uv run ckc migrate --enhance-with-yake .claude/

# Preview YAKE enhancements
uv run ckc migrate --enhance-with-yake .claude/ --dry-run

Integration with Development Workflow¶

Claude Code Development¶

# During development - automatic keyword extraction
cd my-claude-project
echo "# Database optimization techniques..." > .claude/db-optimization.md
# YAKE automatically extracts: ["database", "optimization", "performance", "indexing"]

# In Obsidian vault - enhanced discoverability
# Tags: #tech/database #domain/backend #concept/optimization

Team Collaboration¶

# Team configuration for consistent keyword extraction
team_config:
  yake_enabled: true
  shared_vocabulary: true
  quality_threshold: 0.7
  language_priority: ["english", "japanese"]