Overview

Trainly’s File Management API handles document upload, processing, and lifecycle management. Files are automatically converted to embeddings and stored in a graph database for semantic search.
Supported formats: PDF, DOCX, TXT, MD, HTML, JSON, CSV, and common code files

File Upload

Extract Text from File

Upload and extract text from any supported file format:
POST /extract-pdf-text
curl -X POST https://api.trainlyai.com/extract-pdf-text \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
Request Body (Multipart Form)
  • file (file, required) - File to process (max 5MB)
Response
{
  "text": "Extracted text content from the document...",
  "file_hash": "abc123def456...",
  "size_bytes": 524288,
  "filename": "document.pdf",
  "uploaded_at": 1609459200000,
  "processing": "async_nonblocking_fixed"
}
Response Fields
  • text (string) - Extracted and sanitized text content from the file
  • file_hash (string) - SHA-256 hash of the file for deduplication
  • size_bytes (integer) - Size of the file in bytes
  • filename (string) - Sanitized filename
  • uploaded_at (integer) - Unix timestamp in milliseconds

Supported File Types

  • PDF (.pdf) - Full text extraction with layout preservation
  • Word (.docx) - Paragraphs and formatting
  • Text (.txt, .md) - Plain text files
  • HTML (.html) - Text extraction with tag stripping
  • XML (.xml) - Structured data parsing
  • CSV (.csv) - Comma-separated values
  • JSON (.json) - Structured JSON data
  • YAML (.yaml, .yml) - Configuration files
  • JavaScript (.js) - Source code
  • TypeScript (.ts) - TypeScript files
  • Python (.py) - Python scripts
  • Java (.java) - Java source
  • C/C++ (.c, .cpp, .h) - C family languages
  • C# (.cs) - C# source code
  • PHP (.php) - PHP scripts
  • Ruby (.rb) - Ruby scripts
  • Shell (.sh, .bat, .ps1) - Shell scripts

Document Processing

Create Embeddings

Process extracted text and create searchable embeddings:
POST /create_nodes_and_embeddings
curl -X POST https://api.trainlyai.com/create_nodes_and_embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_text": "Full document text here...",
    "pdf_id": "unique_doc_id_123",
    "chat_id": "chat_abc123",
    "filename": "research_paper.pdf",
    "scope_values": {
      "project_id": "proj_456",
      "category": "research"
    }
  }'
Request Body
  • pdf_text (string, required) - Full text content to process (max 1,000,000 chars)
  • pdf_id (string, required) - Unique identifier for this document
  • chat_id (string, required) - Chat/workspace ID to associate with
  • filename (string, required) - Original filename for reference
  • scope_values (object, optional) - Custom metadata for filtering (see Scope Management)
Response
{
  "status": "success",
  "message": "Created 42 chunks with relationships",
  "duplicate_skipped": false
}
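In practice the two endpoints are chained: extract first, then feed the returned text into embedding creation. A minimal JavaScript sketch, assuming the returned file_hash is an acceptable choice of pdf_id (any scheme that yields a unique document ID works):

// Extract text from a file, then index it for semantic search.
async function uploadAndIndex(file, chatId) {
  const formData = new FormData();
  formData.append("file", file);

  const extractRes = await fetch("https://api.trainlyai.com/extract-pdf-text", {
    method: "POST",
    body: formData,
  });
  if (!extractRes.ok) throw new Error(`Extraction failed: ${extractRes.status}`);
  const { text, file_hash, filename } = await extractRes.json();

  // Reusing file_hash as pdf_id is an assumption; any unique ID works.
  const embedRes = await fetch(
    "https://api.trainlyai.com/create_nodes_and_embeddings",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        pdf_text: text,
        pdf_id: file_hash,
        chat_id: chatId,
        filename: filename,
      }),
    },
  );
  if (!embedRes.ok) throw new Error(`Indexing failed: ${embedRes.status}`);
  return embedRes.json();
}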
Processing Details
  1. Text Chunking - The document is split into 500-character chunks with overlap for context preservation (see the sketch after this list)
  2. Embedding Generation - Each chunk is converted to a vector embedding using text-embedding-3-small
  3. Graph Storage - Chunks are stored as nodes in Neo4j with relationship edges
  4. Semantic Analysis - AI analyzes chunk relationships to create intelligent links (EXPLAINS, SUPPORTS, etc.)
  5. Indexing - Vector embeddings are indexed for fast semantic search
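For intuition, the chunking step behaves roughly like the sketch below. The 500-character size comes from step 1; the overlap length is not documented, so the 50 characters here is an assumption:

// Illustrative only: split text into fixed-size chunks with overlap.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  // Each chunk starts (chunkSize - overlap) characters after the last,
  // so consecutive chunks share `overlap` characters of context.
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}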

File Deletion

Remove File from Chat

Delete a specific document and all its chunks:
DELETE /remove_context/{file_id}
curl -X DELETE https://api.trainlyai.com/remove_context/doc_abc123 \
  -H "Authorization: Bearer your_api_key"
Path Parameters
  • file_id (string, required) - Unique identifier of the document to delete
Response
{
  "status": "success",
  "message": "Document doc_abc123 and its chunks deleted",
  "nodes_deleted": 43
}
This action is permanent and cannot be undone. All chunks and embeddings will be deleted.

Chat Data Management

Delete All Chat Data

Remove all documents and chunks associated with a chat:
DELETE /delete_chat_nodes/{chat_id}
curl -X DELETE https://api.trainlyai.com/delete_chat_nodes/chat_abc123 \
  -H "Authorization: Bearer your_api_key"
Response
{
  "status": "success",
  "message": "All nodes for chat chat_abc123 deleted",
  "nodes_deleted": 256,
  "relationships_deleted": 512
}

Cleanup Chat Cluster

For advanced cleanup including subchats:
POST /cleanup_chat_data/{chat_id}
curl -X POST "https://api.trainlyai.com/cleanup_chat_data/chat_abc123?convex_id=chat_xyz&child_chat_ids=subchat1,subchat2" \
  -H "Authorization: Bearer your_api_key"
Query Parameters
  • convex_id (string, optional) - Alternative chat ID to include
  • child_chat_ids (string, optional) - Comma-separated list of child chat IDs
Response
{
  "status": "success",
  "message": "Chat cluster chat_abc123 and 2 child chats deleted from Neo4j",
  "nodes_deleted": 512,
  "relationships_deleted": 1024,
  "debug_info": {
    "parent_chat_id": "chat_abc123",
    "child_chat_ids": ["subchat1", "subchat2"],
    "total_chat_ids_processed": 3,
    "documents_found_before_deletion": 15
  }
}
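Because child_chat_ids is a single comma-separated parameter, it is worth building the URL with URLSearchParams rather than by hand. A minimal sketch using the values from the example above:

// Build the cleanup URL with safely encoded query parameters.
const params = new URLSearchParams({
  convex_id: "chat_xyz",
  child_chat_ids: ["subchat1", "subchat2"].join(","),
});

const res = await fetch(
  `https://api.trainlyai.com/cleanup_chat_data/chat_abc123?${params}`,
  {
    method: "POST",
    headers: { Authorization: "Bearer your_api_key" },
  },
);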

Debug Endpoints

Debug Chat Data

Inspect what data exists for a chat:
GET /debug_chat_data/{chat_id}
curl -X GET https://api.trainlyai.com/debug_chat_data/chat_abc123
Response
{
  "chat_id": "chat_abc123",
  "search_prefix": "subchat_chat_abc123_",
  "documents_found": 3,
  "total_chunks": 127,
  "documents": [
    {
      "chatId": "chat_abc123",
      "docId": "doc1",
      "filename": "research.pdf",
      "chunk_count": 42
    },
    {
      "chatId": "subchat_chat_abc123_user1",
      "docId": "doc2",
      "filename": "notes.txt",
      "chunk_count": 85
    }
  ]
}
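One practical use of this endpoint is verifying a deletion: inspect the chat before and after removing a document and compare chunk counts. A sketch combining the debug and deletion endpoints documented above:

// Delete a document, then confirm its chunks are gone.
async function deleteAndVerify(chatId, fileId, apiKey) {
  const debugUrl = `https://api.trainlyai.com/debug_chat_data/${chatId}`;
  const before = await (await fetch(debugUrl)).json();

  await fetch(`https://api.trainlyai.com/remove_context/${fileId}`, {
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiKey}` },
  });

  const after = await (await fetch(debugUrl)).json();
  console.log(`Chunks: ${before.total_chunks} -> ${after.total_chunks}`);
}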

File Processing Features

Automatic Deduplication

Files are automatically deduplicated based on their content hash:
  • Content Hashing: SHA-256 hash of the file content
  • Duplicate Detection: Repeat uploads are caught within a 30-second window
  • Cache Management: Expired cache entries are cleaned up automatically
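Assuming the server hashes the raw file bytes (consistent with the file_hash field in the extract response), you can compute the same digest client-side with the Web Crypto API and skip re-uploading files you have already processed:

// Compute a SHA-256 hex digest of a File or Blob in the browser.
async function sha256Hex(file) {
  const buffer = await file.arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", buffer);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}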

Security & Sanitization

All uploaded files undergo rigorous security checks:
  • XSS Detection - Scans for malicious scripts and code injection
  • Content Validation - Validates that the file format matches its extension
  • Size Limits - Enforces the 5MB maximum file size
  • Filename Sanitization - Removes special characters and path traversal sequences

Magic Byte Detection

Files are validated using their magic bytes rather than trusting the extension alone. A simplified version of the detection logic:

def detect_file_type(head: bytes, filename: str) -> str | None:
    # PDF files start with the "%PDF-" signature
    if head.startswith(b"%PDF-"):
        return "pdf"

    # DOCX files are ZIP archives (PK signature) with a .docx extension
    if head.startswith(b"PK\x03\x04") and filename.endswith(".docx"):
        return "docx"

    # HTML files begin with a doctype or <html> tag
    stripped = head.lstrip()
    if stripped.startswith(b"<!DOCTYPE") or stripped.startswith(b"<html"):
        return "html"

    return None

Chunk Relationships

Trainly creates intelligent relationships between chunks. AI analyzes content to create meaningful semantic connections:
  • EXPLAINS: One chunk explains concepts in another
  • SUPPORTS: One chunk provides evidence for another
  • ELABORATES: One chunk adds details to another
  • INTRODUCES: One chunk introduces topics in another
  • CONCLUDES: One chunk concludes ideas from another
Structural relationships preserve the original document order:
  • NEXT: Sequential order in the original document
  • HAS_CHUNK: Links a document to its chunks

Example Graph Structure

Document: "Research Paper"
├── NEXT → Chunk 0: "Introduction..."
│   └── EXPLAINS → Chunk 3: "Methodology..."
├── NEXT → Chunk 1: "Background..."
│   └── SUPPORTS → Chunk 5: "Results..."
└── NEXT → Chunk 2: "Literature Review..."
    └── ELABORATES → Chunk 4: "Discussion..."

Analytics Tracking

File uploads are automatically tracked for analytics:
  • File Size: Monitored for storage management
  • Processing Time: Performance metrics
  • User Association: Tracks uploads per user/app
  • Storage Quotas: Monitors space usage
Example analytics event:
{
  "event": "file_uploaded",
  "app_id": "app_abc123",
  "user_id": "user_xyz789",
  "filename": "research.pdf",
  "size_bytes": 524288,
  "file_type": "pdf",
  "timestamp": 1609459200000
}

Best Practices

  • Batch Processing: Use bulk upload endpoints for multiple files
  • File Validation: Validate files client-side before upload
  • Progress Tracking: Implement upload progress indicators
  • Error Handling: Handle network errors with retries, as in the example below

Upload Example with Retry Logic

async function uploadFile(file, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const formData = new FormData();
      formData.append("file", file);

      const response = await fetch(
        "https://api.trainlyai.com/extract-pdf-text",
        {
          method: "POST",
          body: formData,
        },
      );

      // 413/415 are client errors that will never succeed on retry.
      if (response.status === 413) {
        throw Object.assign(new Error("File too large (max 5MB)"), {
          fatal: true,
        });
      }

      if (response.status === 415) {
        throw Object.assign(new Error("Unsupported file type"), {
          fatal: true,
        });
      }

      if (!response.ok) {
        throw new Error(`Upload failed: ${response.status}`);
      }

      return await response.json();
    } catch (error) {
      // Rethrow immediately on non-retryable errors or the final attempt.
      if (error.fatal || attempt === maxRetries - 1) throw error;

      // Exponential backoff: 1s, 2s, 4s, ...
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

File Metadata

Documents stored in Trainly include rich metadata:
{
  id: "doc_abc123",              // Unique document ID
  chatId: "chat_xyz789",         // Associated chat
  filename: "research.pdf",      // Original filename
  uploadDate: 1609459200000,     // Upload timestamp
  sizeBytes: 524288,             // File size

  // Custom scopes (optional)
  project_id: "proj_456",
  category: "research",
  workspace_id: "ws_123"
}

Error Codes

  • 400 Bad Request - Invalid file format or malicious content detected
  • 408 Request Timeout - File processing exceeded the 30-second timeout
  • 413 Payload Too Large - File exceeds the 5MB size limit
  • 415 Unsupported Media Type - File type not supported
  • 500 Internal Server Error - Error during text extraction or embedding generation