YAKE Keyword Extraction Integration

Overview

Claude Knowledge Catalyst v0.10.0 introduces advanced keyword extraction capabilities powered by YAKE (Yet Another Keyword Extractor). This unsupervised machine learning algorithm automatically identifies and extracts meaningful keywords from technical documents, enhancing the automated classification and tagging system.

What is YAKE?

YAKE (Yet Another Keyword Extractor) is a novel feature-based system for multi-lingual unsupervised automatic keyword extraction that builds upon statistical text features without relying on dictionaries or thesauri, training documents, or linguistic knowledge.

Key Benefits

  • Unsupervised Learning: No training data or dictionaries required

  • Multi-Language Support: Works with 7 languages including Japanese and English

  • Technical Content Optimization: Specifically tuned for technical documentation

  • Confidence Scoring: Quality assessment for extracted keywords

Supported Languages

YAKE integration in CKC supports the following languages:

  • English (en) - Primary language

  • Japanese (ja) - Full support for technical documents

  • Spanish (es)

  • French (fr)

  • German (de)

  • Italian (it)

  • Portuguese (pt)

How It Works

1. Automatic Language Detection

When processing content, CKC automatically detects the language:

# Example: Processing a Japanese technical document
content = """
# FastAPI の認証システム

JWTトークンベースの認証を実装します。
OAuth2スキーマとセキュリティスコープを使用。
"""

# YAKE automatically detects Japanese and extracts relevant keywords
# Result: ["FastAPI", "認証システム", "JWT", "OAuth2", "セキュリティ"]

2. Keyword Extraction Process

The YAKE integration follows these steps:

  1. Content Analysis: Analyze document structure and content

  2. Language Detection: Automatically identify the primary language

  3. Text Normalization: Clean and prepare text for processing

  4. Keyword Extraction: Apply YAKE algorithm with optimized parameters

  5. Confidence Scoring: Rate keyword quality and relevance

  6. Tag Integration: Merge keywords with existing tag system

3. Configuration Options

YAKE extraction can be customized through configuration:

# ckc_config.yaml
ai_classification:
  yake_enabled: true
  yake_config:
    max_ngram_size: 3           # Maximum phrase length
    deduplication_threshold: 0.7 # Similarity threshold for deduplication
    max_keywords: 20            # Maximum keywords to extract
    confidence_threshold: 0.5   # Minimum confidence score
    supported_languages:
      japanese: "ja"
      english: "en"
      spanish: "es"
      french: "fr"
      german: "de"
      italian: "it"
      portuguese: "pt"

Integration with Smart Classification

YAKE works seamlessly with CKC’s existing classification system:

Enhanced Metadata Generation

# Before YAKE integration
---
title: "API Design Guide"
type: concept
tech: ["api"]
---

# After YAKE integration
---
title: "API Design Guide"
type: concept
tech: ["api", "rest", "graphql", "openapi"]
domain: ["web-dev", "backend"]
keywords: ["authentication", "rate-limiting", "versioning"]
confidence: high
---

Keyword-Driven Tag Suggestions

The SmartContentClassifier now uses YAKE keywords to enhance tag suggestions:

  1. Tech Tags: Extract technology names (FastAPI, React, Kubernetes)

  2. Domain Tags: Identify application domains (web-dev, machine-learning)

  3. Concept Tags: Discover conceptual terms (authentication, optimization)

  4. Quality Assessment: Evaluate content depth and technical accuracy

Usage Examples

Basic Keyword Extraction

# Extract keywords from a single file
uv run ckc classify my-document.md --show-keywords

# Batch process with keyword extraction
uv run ckc batch-classify .claude/ --enable-yake

Language-Specific Processing

# Force language detection for mixed-language documents
uv run ckc classify bilingual-doc.md --language japanese

# Process Japanese technical documentation
uv run ckc sync --language-priority japanese

Integration with Obsidian

YAKE-extracted keywords automatically become Obsidian tags:

<!-- In Obsidian vault -->
---
title: "Machine Learning Pipeline Optimization"
tags:
  - type/concept
  - tech/python
  - tech/tensorflow
  - tech/kubernetes
  - domain/machine-learning
  - yake/optimization
  - yake/pipeline
  - yake/performance-tuning
---

# Machine Learning Pipeline Optimization

Content with automatically extracted and tagged keywords...

Performance and Quality

Extraction Quality

YAKE in CKC is optimized for technical content:

  • Precision: High-quality keyword identification

  • Recall: Comprehensive coverage of important terms

  • Relevance: Context-aware technical terminology

  • Consistency: Stable results across similar documents

Performance Characteristics

  • Speed: Average 50-100ms per document

  • Memory: Minimal memory footprint

  • Scalability: Efficient batch processing

  • Accuracy: >85% relevant keyword extraction for technical content

Best Practices

1. Content Preparation

For optimal results:

  • Use clear headings and structure

  • Include technical terminology naturally

  • Maintain consistent language usage

  • Provide sufficient content length (>100 words)

2. Configuration Tuning

Adjust YAKE parameters based on your content:

# For highly technical content
yake_config:
  max_ngram_size: 4           # Capture longer technical phrases
  confidence_threshold: 0.7   # Higher quality threshold
  max_keywords: 15            # Focus on most relevant terms

# For diverse content types
yake_config:
  max_ngram_size: 2           # Shorter, more general phrases
  confidence_threshold: 0.4   # Lower threshold for broader coverage
  max_keywords: 25            # More comprehensive extraction

3. Quality Monitoring

Monitor extraction quality:

# Review keyword extraction quality
uv run ckc analyze keywords --confidence-report

# Identify low-confidence extractions
uv run ckc search --confidence low --type yake-keywords

Troubleshooting

Common Issues

1. Poor Language Detection

# Explicitly specify language
uv run ckc classify document.md --force-language japanese

2. Low-Quality Keywords

# Increase confidence threshold
yake_config:
  confidence_threshold: 0.8

3. Too Many/Few Keywords

# Adjust keyword count and n-gram size
yake_config:
  max_keywords: 10
  max_ngram_size: 2

Performance Optimization

For large document sets:

# Process in batches with progress tracking
uv run ckc batch-classify .claude/ --batch-size 50 --progress

# Use confidence filtering to reduce processing time
uv run ckc classify --min-confidence 0.6

Migration from Previous Versions

If upgrading from CKC v0.9.x:

  1. Automatic Migration: YAKE integration is enabled by default

  2. Backward Compatibility: Existing classifications remain unchanged

  3. Enhanced Metadata: New YAKE keywords supplement existing tags

  4. Configuration: Update ckc_config.yaml to customize YAKE behavior

# Update existing classifications with YAKE
uv run ckc migrate --enhance-with-yake .claude/

# Preview YAKE enhancements
uv run ckc migrate --enhance-with-yake .claude/ --dry-run

Integration with Development Workflow

Claude Code Development

# During development - automatic keyword extraction
cd my-claude-project
echo "# Database optimization techniques..." > .claude/db-optimization.md
# YAKE automatically extracts: ["database", "optimization", "performance", "indexing"]

# In Obsidian vault - enhanced discoverability
# Tags: #tech/database #domain/backend #concept/optimization

Team Collaboration

# Team configuration for consistent keyword extraction
team_config:
  yake_enabled: true
  shared_vocabulary: true
  quality_threshold: 0.7
  language_priority: ["english", "japanese"]

See Also