/index-docs - Index Documents for RAG
Index documents into a vector store for retrieval-augmented generation.
Steps
- Ask the user for the document source: directory, URLs, database, or API
- Detect document types: PDF, markdown, HTML, text, code, DOCX
- Load documents using appropriate parsers for each file type
- Split documents into chunks using semantic-aware chunking:
  - Respect paragraph and section boundaries
  - Target chunk size: 500-1000 tokens with 100-token overlap
- Clean and preprocess chunks: remove boilerplate, normalize whitespace
- Generate embeddings for each chunk using the configured embedding model
- Store embeddings in the vector database: Pinecone, Weaviate, Chroma, or pgvector
- Create metadata for each chunk: source file, page number, section title, date
- Build an index mapping from each chunk to its source location for fast retrieval and source citation
- Validate the index by running sample queries and checking relevance
- Report: documents indexed, total chunks, vector dimensions, storage size
- Save the indexing configuration for incremental updates
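
The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the command's actual implementation: `chunk_text` is a hypothetical helper, tokens are approximated by whitespace-separated words (a real tokenizer would count differently), and paragraphs are assumed to be separated by blank lines.

```python
import re


def chunk_text(text, max_tokens=1000, overlap_tokens=100):
    """Split text into overlapping chunks, breaking on paragraph boundaries.

    Assumption: a "token" is a whitespace-separated word. Swap in the
    tokenizer of the configured embedding model for accurate counts.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")

    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = []  # words of the chunk being built
    for para in paragraphs:
        words = para.split()
        # Close the current chunk if this paragraph would not fit in it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap_tokens:] if overlap_tokens else []
        current.extend(words)
        # A paragraph longer than the limit is split at the limit.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap_tokens:]
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "a1 a2 a3 a4\n\nb1 b2 b3 b4\n\nc1 c2 c3 c4"
    for chunk in chunk_text(sample, max_tokens=6, overlap_tokens=2):
        print(chunk)
```

Joining words with single spaces also normalizes whitespace inside each chunk. A fuller version would attach the source file, page number, section title, and date to each chunk before embedding.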
Rules
- Prefer semantic chunking that respects document structure over fixed-size splitting
- Include sufficient overlap between chunks to preserve context at boundaries
- Store source metadata with each chunk for citation and provenance
- Handle duplicate documents by comparing content hashes before indexing
- Support incremental indexing: add new documents without re-indexing everything
- Use the same embedding model for indexing and querying
- Monitor embedding costs and set budget alerts for large document sets
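
The duplicate-handling and incremental-indexing rules can be combined in one small check, sketched below. The names `content_hash` and `select_new_documents` are hypothetical, and normalizing whitespace before hashing is an assumption: it catches exact duplicates that differ only in spacing, not near-duplicates.

```python
import hashlib


def content_hash(text):
    """Return a SHA-256 hex digest of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_new_documents(documents, indexed_hashes):
    """Pick the documents that still need indexing.

    documents: dict mapping a source name to its text.
    indexed_hashes: set of digests already stored from earlier runs.
    Returns a list of (source, digest) pairs, skipping anything already
    indexed and any duplicate seen earlier in the same batch.
    """
    seen = set(indexed_hashes)  # copy so the caller's set is not mutated
    to_index = []
    for source, text in documents.items():
        digest = content_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        to_index.append((source, digest))
    return to_index


if __name__ == "__main__":
    already_indexed = {content_hash("old report")}
    batch = {
        "a.md": "hello  world",
        "b.md": "hello world",   # duplicate of a.md after normalization
        "c.md": "old report",    # indexed in an earlier run
    }
    for source, digest in select_new_documents(batch, already_indexed):
        print(source, digest[:12])
```

Persisting the returned digests alongside the saved indexing configuration lets a later run skip unchanged documents without re-embedding them.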