Overview
Trainly’s File Management API handles document upload, processing, and lifecycle management. Files are automatically converted to embeddings and stored in a graph database for semantic search.
Supported formats: PDF, DOCX, TXT, MD, HTML, JSON, CSV, and common code files.
File Upload
Extract Text from File
Upload and extract text from any supported file format:
POST /extract-pdf-text
Request
- File to process (max 5MB)
Response
- Extracted and sanitized text content from the file
- SHA-256 hash of the file for deduplication
- Size of the file in bytes
- Sanitized filename
- Unix timestamp in milliseconds
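For illustration, a minimal upload sketch using fetch with multipart form data. The base URL, auth header, and multipart field name ("file") are assumptions, not values documented on this page:

```typescript
// Hedged sketch: upload a file for text extraction. The auth header and the
// multipart field name ("file") are assumptions, not documented values.
async function extractFileText(baseUrl: string, apiKey: string, file: File) {
  const form = new FormData();
  form.append("file", file); // field name is an assumption

  const res = await fetch(`${baseUrl}/extract-pdf-text`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Extraction failed: ${res.status}`);

  // The response carries the extracted text, SHA-256 hash, file size,
  // sanitized filename, and a millisecond timestamp.
  return res.json();
}
```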
Supported File Types
Documents
- PDF (.pdf) - Full text extraction with layout preservation
- Word (.docx) - Paragraphs and formatting
- Text (.txt, .md) - Plain text files
- HTML (.html) - Text extraction with tag stripping
- XML (.xml) - Structured data parsing
Data Formats
- CSV (.csv) - Comma-separated values
- JSON (.json) - Structured JSON data
- YAML (.yaml, .yml) - Configuration files
Code Files
- JavaScript (.js) - Source code
- TypeScript (.ts) - TypeScript files
- Python (.py) - Python scripts
- Java (.java) - Java source
- C/C++ (.c, .cpp, .h) - C family languages
- C# (.cs) - C# source code
- PHP (.php) - PHP scripts
- Ruby (.rb) - Ruby scripts
- Shell (.sh, .bat, .ps1) - Shell scripts
Document Processing
Create Embeddings
Process extracted text and create searchable embeddings:
POST /create_nodes_and_embeddings
Request
- Full text content to process (max 1,000,000 chars)
- Unique identifier for this document
- Chat/workspace ID to associate with
- Original filename for reference
- Custom metadata for filtering (see Scope Management)
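A hedged sketch of the embedding-creation call. The JSON field names (text, doc_id, chat_id, filename, metadata) are assumptions; the page describes the parameters but not their exact names:

```typescript
// Hypothetical request sketch for /create_nodes_and_embeddings.
// Field names are assumptions, not documented values.
async function createEmbeddings(baseUrl: string, apiKey: string) {
  const res = await fetch(`${baseUrl}/create_nodes_and_embeddings`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: "Full extracted text goes here...", // max 1,000,000 chars
      doc_id: "doc_123",                        // unique document identifier
      chat_id: "chat_456",                      // chat/workspace to associate with
      filename: "report.pdf",                   // original filename for reference
      metadata: { project: "alpha" },           // custom metadata for filtering
    }),
  });
  return res.json();
}
```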
1. Text Chunking - Document is split into 500-character chunks with overlap for context preservation
2. Embedding Generation - Each chunk is converted to a vector embedding using text-embedding-3-small
3. Graph Storage - Chunks are stored as nodes in Neo4j with relationship edges
4. Semantic Analysis - AI analyzes chunk relationships to create intelligent links (EXPLAINS, SUPPORTS, etc.)
5. Indexing - Vector embeddings are indexed for fast semantic search
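To make the chunking step concrete, here is a minimal sketch of fixed 500-character windows with overlap. The 50-character overlap is an assumption; the docs only state that chunks overlap:

```typescript
// Minimal chunking sketch: 500-character windows with an assumed 50-character
// overlap for context preservation.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```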
File Deletion
Remove File from Chat
Delete a specific document and all its chunks:
DELETE /remove_context/{file_id}
- file_id: Unique identifier of the document to delete
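A minimal deletion sketch; the auth header is an assumption carried over from the upload examples:

```typescript
// Hedged sketch: delete a document and all of its chunks.
async function removeFile(baseUrl: string, apiKey: string, fileId: string) {
  const res = await fetch(`${baseUrl}/remove_context/${encodeURIComponent(fileId)}`, {
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`Delete failed: ${res.status}`);
}
```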
Chat Data Management
Delete All Chat Data
Remove all documents and chunks associated with a chat:
DELETE /delete_chat_nodes/{chat_id}
Cleanup Chat Cluster
For advanced cleanup including subchats:
POST /cleanup_chat_data/{chat_id}
- Alternative chat ID to include
- Comma-separated list of child chat IDs
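A hedged sketch of the cleanup call. The query parameter names (alt_chat_id, sub_chat_ids) are assumptions; the page lists the parameters without naming them:

```typescript
// Hypothetical sketch: deep-clean a chat cluster, including subchats.
// Query parameter names are assumptions, not documented values.
async function cleanupChat(baseUrl: string, apiKey: string, chatId: string) {
  const params = new URLSearchParams({
    alt_chat_id: "chat_legacy_123",    // alternative chat ID to include
    sub_chat_ids: "sub_1,sub_2,sub_3", // comma-separated child chat IDs
  });
  const res = await fetch(`${baseUrl}/cleanup_chat_data/${chatId}?${params}`, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  return res.json();
}
```

DELETE /delete_chat_nodes/{chat_id} follows the same pattern as the single-file deletion sketch shown above.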
Debug Endpoints
Debug Chat Data
Inspect what data exists for a chat:
GET /debug_chat_data/{chat_id}
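A short sketch of the debug call; the response shape is not documented here, so it is returned as-is:

```typescript
// Hedged sketch: inspect what data exists for a chat.
async function debugChatData(baseUrl: string, apiKey: string, chatId: string) {
  const res = await fetch(`${baseUrl}/debug_chat_data/${chatId}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  return res.json(); // diagnostic payload; shape not documented on this page
}
```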
File Processing Features
Automatic Deduplication
Files are automatically deduplicated based on content hash:
- Content Hashing: SHA-256 hash of file content
- Duplicate Detection: Checks for duplicates within a 30-second window
- Cache Management: Automatic cleanup of expired entries
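A minimal sketch of content hashing using the Web Crypto API. This only illustrates the SHA-256 step; the 30-second duplicate window is enforced server-side:

```typescript
// Compute a hex-encoded SHA-256 hash of a file's contents, the same kind of
// hash used for deduplication.
async function sha256Hex(file: File): Promise<string> {
  const buffer = await file.arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", buffer);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```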
Security & Sanitization
All uploaded files undergo rigorous security checks:
- XSS Detection: Scans for malicious scripts and code injection
- Content Validation: Validates file format matches extension
- Size Limits: Enforces 5MB maximum file size
- Filename Sanitization: Removes special characters and path traversal
Magic Byte Detection
Files are validated using magic bytes (file signatures) to confirm the content matches the declared file type.
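An illustrative sketch of a magic-byte check: compare the first bytes of an upload against known file signatures. The signature table below is a small, well-known subset (PDF, ZIP-based DOCX), not Trainly's internal list:

```typescript
// Well-known file signatures used for illustration only.
const MAGIC_BYTES: Record<string, number[]> = {
  pdf: [0x25, 0x50, 0x44, 0x46],  // "%PDF"
  docx: [0x50, 0x4b, 0x03, 0x04], // DOCX is a ZIP container ("PK\x03\x04")
};

// Returns true if the file starts with the expected signature for its type.
async function matchesMagicBytes(
  file: File,
  kind: keyof typeof MAGIC_BYTES,
): Promise<boolean> {
  const signature = MAGIC_BYTES[kind];
  const head = new Uint8Array(await file.slice(0, signature.length).arrayBuffer());
  return signature.every((byte, i) => head[i] === byte);
}
```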
Chunk Relationships
Trainly creates intelligent relationships between chunks:
Semantic Relationships
AI analyzes content to create meaningful connections:
- EXPLAINS: One chunk explains concepts in another
- SUPPORTS: One chunk provides evidence for another
- ELABORATES: One chunk adds details to another
- INTRODUCES: One chunk introduces topics in another
- CONCLUDES: One chunk concludes ideas from another
Sequential Relationships
Preserves document structure:
- NEXT: Sequential order in the original document
- HAS_CHUNK: Document to chunk relationship
Example Graph Structure
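For illustration, a small sketch of how a document and its chunks might be linked, using the relationship types listed above. Node labels, property names, and IDs are assumptions rather than Trainly's actual Neo4j schema:

```typescript
// Illustrative only: labels, properties, and IDs are assumptions.
const exampleGraph = {
  nodes: [
    { id: "doc_123", label: "Document", filename: "report.pdf" },
    { id: "chunk_1", label: "Chunk", text: "Introduction to the study..." },
    { id: "chunk_2", label: "Chunk", text: "Methodology details..." },
  ],
  relationships: [
    { from: "doc_123", to: "chunk_1", type: "HAS_CHUNK" },
    { from: "doc_123", to: "chunk_2", type: "HAS_CHUNK" },
    { from: "chunk_1", to: "chunk_2", type: "NEXT" },       // sequential order
    { from: "chunk_1", to: "chunk_2", type: "INTRODUCES" }, // semantic link
  ],
};
```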
Analytics Tracking
File uploads automatically track analytics:
- File Size: Monitored for storage management
- Processing Time: Performance metrics
- User Association: Tracks uploads per user/app
- Storage Quotas: Monitors space usage
Best Practices
- Batch Processing: Use bulk upload endpoints for multiple files
- File Validation: Validate files client-side before upload
- Progress Tracking: Implement upload progress indicators
- Error Handling: Handle network errors with retries
Upload Example with Retry Logic
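A minimal sketch of an upload with retry and exponential backoff, assuming the /extract-pdf-text endpoint from above. The retry count, backoff delays, and multipart field name are illustrative:

```typescript
// Hedged sketch: retry transient failures (network errors, 5xx responses)
// with exponential backoff; do not retry client errors.
async function uploadWithRetry(
  baseUrl: string,
  apiKey: string,
  file: File,
  maxRetries = 3,
): Promise<unknown> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    let res: Response | undefined;
    try {
      const form = new FormData();
      form.append("file", file); // field name is an assumption
      res = await fetch(`${baseUrl}/extract-pdf-text`, {
        method: "POST",
        headers: { Authorization: `Bearer ${apiKey}` },
        body: form,
      });
    } catch (err) {
      // Network error: rethrow only once retries are exhausted.
      if (attempt === maxRetries) throw err;
    }
    if (res) {
      if (res.ok) return res.json();
      if (res.status >= 400 && res.status < 500) {
        // Client errors (unsupported type, oversize file, ...) won't succeed on retry.
        throw new Error(`Upload rejected: ${res.status}`);
      }
      // 5xx: fall through and retry.
    }
    // Exponential backoff before the next attempt: 1s, 2s, 4s, ...
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
  }
  throw new Error(`Upload failed after ${maxRetries} attempts`);
}
```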
File Metadata
Documents stored in Trainly include rich metadata:
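A hedged sketch of the kind of fields involved, drawn from values mentioned elsewhere on this page (sanitized filename, SHA-256 hash, size, timestamp, chat association, custom metadata). The property names and shape are assumptions, not Trainly's schema:

```typescript
// Illustrative metadata shape; property names are assumptions.
interface DocumentMetadata {
  filename: string;                 // sanitized filename
  fileHash: string;                 // SHA-256 content hash used for deduplication
  fileSizeBytes: number;            // size of the original file in bytes
  uploadedAt: number;               // Unix timestamp in milliseconds
  chatId: string;                   // chat/workspace the document belongs to
  custom?: Record<string, unknown>; // custom metadata for filtering
}
```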
Error Codes
- Invalid file format or malicious content detected
- File exceeds 5MB size limit
- File type not supported
- File processing exceeded 30-second timeout
- Error during text extraction or embedding generation