Overview

Trainly’s File Management API handles document upload, processing, and lifecycle management. Files are automatically converted to embeddings and stored in a graph database for semantic search.
Supported formats: PDF, DOCX, TXT, MD, HTML, JSON, CSV, and common code files

File Upload

Extract Text from File

Upload and extract text from any supported file format:
POST /extract-pdf-text
curl -X POST https://api.trainlyai.com/extract-pdf-text \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"
Request Body (Multipart Form)
  • file (file, required) - File to process (max 5MB)
Response
{
  "text": "Extracted text content from the document...",
  "file_hash": "abc123def456...",
  "size_bytes": 524288,
  "filename": "document.pdf",
  "uploaded_at": 1609459200000,
  "processing": "async_nonblocking_fixed"
}
Response Fields
  • text (string) - Extracted and sanitized text content from the file
  • file_hash (string) - SHA-256 hash of the file for deduplication
  • size_bytes (integer) - Size of the file in bytes
  • filename (string) - Sanitized filename
  • uploaded_at (integer) - Unix timestamp in milliseconds

Supported File Types

  • PDF (.pdf) - Full text extraction with layout preservation
  • Word (.docx) - Paragraphs and formatting
  • Text (.txt, .md) - Plain text files
  • HTML (.html) - Text extraction with tag stripping
  • XML (.xml) - Structured data parsing
  • CSV (.csv) - Comma-separated values
  • JSON (.json) - Structured JSON data
  • YAML (.yaml, .yml) - Configuration files
  • JavaScript (.js) - Source code
  • TypeScript (.ts) - TypeScript files
  • Python (.py) - Python scripts
  • Java (.java) - Java source
  • C/C++ (.c, .cpp, .h) - C family languages
  • C# (.cs) - C# source code
  • PHP (.php) - PHP scripts
  • Ruby (.rb) - Ruby scripts
  • Shell (.sh, .bat, .ps1) - Shell scripts

Document Processing

Create Embeddings

Process extracted text and create searchable embeddings:
POST /create_nodes_and_embeddings
curl -X POST https://api.trainlyai.com/create_nodes_and_embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_text": "Full document text here...",
    "pdf_id": "unique_doc_id_123",
    "chat_id": "chat_abc123",
    "filename": "research_paper.pdf",
    "scope_values": {
      "project_id": "proj_456",
      "category": "research"
    }
  }'
Request Body
  • pdf_text (string, required) - Full text content to process (max 1,000,000 chars)
  • pdf_id (string, required) - Unique identifier for this document
  • chat_id (string, required) - Chat/workspace ID to associate with
  • filename (string, required) - Original filename for reference
  • scope_values (object, optional) - Custom metadata for filtering (see Scope Management)
Response
{
  "status": "success",
  "message": "Created 42 chunks with relationships",
  "duplicate_skipped": false
}
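In practice the two endpoints are chained: extract first, then feed the returned text into embedding creation. A minimal JavaScript sketch, assuming the returned file_hash is an acceptable choice of pdf_id (any scheme that yields a unique document ID works):

// Extract text from a file, then index it for semantic search.
async function uploadAndIndex(file, chatId) {
  const formData = new FormData();
  formData.append("file", file);

  const extractRes = await fetch("https://api.trainlyai.com/extract-pdf-text", {
    method: "POST",
    body: formData,
  });
  if (!extractRes.ok) throw new Error(`Extraction failed: ${extractRes.status}`);
  const { text, file_hash, filename } = await extractRes.json();

  // Reusing file_hash as pdf_id is an assumption; any unique ID works.
  const embedRes = await fetch(
    "https://api.trainlyai.com/create_nodes_and_embeddings",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        pdf_text: text,
        pdf_id: file_hash,
        chat_id: chatId,
        filename: filename,
      }),
    },
  );
  if (!embedRes.ok) throw new Error(`Indexing failed: ${embedRes.status}`);
  return embedRes.json();
}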
Processing Details
  1. Text Chunking - The document is split into 500-character chunks with overlap for context preservation (see the sketch after this list)
  2. Embedding Generation - Each chunk is converted to a vector embedding using text-embedding-3-small
  3. Graph Storage - Chunks are stored as nodes in Neo4j with relationship edges
  4. Semantic Analysis - AI analyzes chunk relationships to create intelligent links (EXPLAINS, SUPPORTS, etc.)
  5. Indexing - Vector embeddings are indexed for fast semantic search
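For intuition, the chunking step behaves roughly like the sketch below. The 500-character size comes from step 1; the overlap length is not documented, so the 50 characters here is an assumption:

// Illustrative only: split text into fixed-size chunks with overlap.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  // Each chunk starts (chunkSize - overlap) characters after the last,
  // so consecutive chunks share `overlap` characters of context.
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}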

File Deletion

Remove File from Chat

Delete a specific document and all its chunks:
DELETE /remove_context/{file_id}
curl -X DELETE https://api.trainlyai.com/remove_context/doc_abc123 \
  -H "Authorization: Bearer your_api_key"
Path Parameters
  • file_id (string, required) - Unique identifier of the document to delete
Response
{
  "status": "success",
  "message": "Document doc_abc123 and its chunks deleted",
  "nodes_deleted": 43
}
This action is permanent and cannot be undone. All chunks and embeddings will be deleted.

Chat Data Management

Delete All Chat Data

Remove all documents and chunks associated with a chat:
DELETE /delete_chat_nodes/{chat_id}
curl -X DELETE https://api.trainlyai.com/delete_chat_nodes/chat_abc123 \
  -H "Authorization: Bearer your_api_key"
Response
{
  "status": "success",
  "message": "All nodes for chat chat_abc123 deleted",
  "nodes_deleted": 256,
  "relationships_deleted": 512
}

Cleanup Chat Cluster

For advanced cleanup including subchats:
POST /cleanup_chat_data/{chat_id}
curl -X POST "https://api.trainlyai.com/cleanup_chat_data/chat_abc123?convex_id=chat_xyz&child_chat_ids=subchat1,subchat2" \
  -H "Authorization: Bearer your_api_key"
Query Parameters
  • convex_id (string, optional) - Alternative chat ID to include
  • child_chat_ids (string, optional) - Comma-separated list of child chat IDs
Response
{
  "status": "success",
  "message": "Chat cluster chat_abc123 and 2 child chats deleted from Neo4j",
  "nodes_deleted": 512,
  "relationships_deleted": 1024,
  "debug_info": {
    "parent_chat_id": "chat_abc123",
    "child_chat_ids": ["subchat1", "subchat2"],
    "total_chat_ids_processed": 3,
    "documents_found_before_deletion": 15
  }
}
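Because child_chat_ids is a single comma-separated parameter, it is worth building the URL with URLSearchParams rather than by hand. A minimal sketch using the values from the example above:

// Build the cleanup URL with safely encoded query parameters.
const params = new URLSearchParams({
  convex_id: "chat_xyz",
  child_chat_ids: ["subchat1", "subchat2"].join(","),
});

const res = await fetch(
  `https://api.trainlyai.com/cleanup_chat_data/chat_abc123?${params}`,
  {
    method: "POST",
    headers: { Authorization: "Bearer your_api_key" },
  },
);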

Debug Endpoints

Debug Chat Data

Inspect what data exists for a chat:
GET /debug_chat_data/{chat_id}
curl -X GET https://api.trainlyai.com/debug_chat_data/chat_abc123
Response
{
  "chat_id": "chat_abc123",
  "search_prefix": "subchat_chat_abc123_",
  "documents_found": 3,
  "total_chunks": 127,
  "documents": [
    {
      "chatId": "chat_abc123",
      "docId": "doc1",
      "filename": "research.pdf",
      "chunk_count": 42
    },
    {
      "chatId": "subchat_chat_abc123_user1",
      "docId": "doc2",
      "filename": "notes.txt",
      "chunk_count": 85
    }
  ]
}
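One practical use of this endpoint is verifying a deletion: inspect the chat before and after removing a document and compare chunk counts. A sketch combining the debug and deletion endpoints documented above:

// Delete a document, then confirm its chunks are gone.
async function deleteAndVerify(chatId, fileId, apiKey) {
  const debugUrl = `https://api.trainlyai.com/debug_chat_data/${chatId}`;
  const before = await (await fetch(debugUrl)).json();

  await fetch(`https://api.trainlyai.com/remove_context/${fileId}`, {
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiKey}` },
  });

  const after = await (await fetch(debugUrl)).json();
  console.log(`Chunks: ${before.total_chunks} -> ${after.total_chunks}`);
}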

File Processing Features

Automatic Deduplication

Files are automatically deduplicated based on their content hash:
  • Content Hashing: SHA-256 hash of the file content
  • Duplicate Detection: Repeat uploads are caught within a 30-second window
  • Cache Management: Expired cache entries are cleaned up automatically
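Assuming the server hashes the raw file bytes (consistent with the file_hash field in the extract response), you can compute the same digest client-side with the Web Crypto API and skip re-uploading files you have already processed:

// Compute a SHA-256 hex digest of a File or Blob in the browser.
async function sha256Hex(file) {
  const buffer = await file.arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", buffer);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}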

Security & Sanitization

All uploaded files undergo rigorous security checks:
  • XSS Detection - Scans for malicious scripts and code injection
  • Content Validation - Validates that the file format matches its extension
  • Size Limits - Enforces the 5MB maximum file size
  • Filename Sanitization - Removes special characters and path traversal sequences

Magic Byte Detection

Files are validated using their magic bytes rather than trusting the extension alone. A simplified version of the detection logic:

def detect_file_type(head: bytes, filename: str) -> str | None:
    # PDF files start with the "%PDF-" signature
    if head.startswith(b"%PDF-"):
        return "pdf"

    # DOCX files are ZIP archives (PK signature) with a .docx extension
    if head.startswith(b"PK\x03\x04") and filename.endswith(".docx"):
        return "docx"

    # HTML files begin with a doctype or <html> tag
    stripped = head.lstrip()
    if stripped.startswith(b"<!DOCTYPE") or stripped.startswith(b"<html"):
        return "html"

    return None

Chunk Relationships

Trainly creates intelligent relationships between chunks. AI analyzes content to create meaningful semantic connections:
  • EXPLAINS: One chunk explains concepts in another
  • SUPPORTS: One chunk provides evidence for another
  • ELABORATES: One chunk adds details to another
  • INTRODUCES: One chunk introduces topics in another
  • CONCLUDES: One chunk concludes ideas from another
Structural relationships preserve the original document order:
  • NEXT: Sequential order in the original document
  • HAS_CHUNK: Links a document to its chunks

Example Graph Structure

Document: "Research Paper"
├── NEXT → Chunk 0: "Introduction..."
│   └── EXPLAINS → Chunk 3: "Methodology..."
├── NEXT → Chunk 1: "Background..."
│   └── SUPPORTS → Chunk 5: "Results..."
└── NEXT → Chunk 2: "Literature Review..."
    └── ELABORATES → Chunk 4: "Discussion..."

Analytics Tracking

File uploads are automatically tracked for analytics:
  • File Size: Monitored for storage management
  • Processing Time: Performance metrics
  • User Association: Tracks uploads per user/app
  • Storage Quotas: Monitors space usage
Example analytics event:
{
  "event": "file_uploaded",
  "app_id": "app_abc123",
  "user_id": "user_xyz789",
  "filename": "research.pdf",
  "size_bytes": 524288,
  "file_type": "pdf",
  "timestamp": 1609459200000
}

Best Practices

  • Batch Processing: Use bulk upload endpoints for multiple files
  • File Validation: Validate files client-side before upload
  • Progress Tracking: Implement upload progress indicators
  • Error Handling: Handle network errors with retries, as in the example below

Upload Example with Retry Logic

async function uploadFile(file, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const formData = new FormData();
      formData.append("file", file);

      const response = await fetch(
        "https://api.trainlyai.com/extract-pdf-text",
        {
          method: "POST",
          body: formData,
        },
      );

      // 413/415 are client errors that will never succeed on retry.
      if (response.status === 413) {
        throw Object.assign(new Error("File too large (max 5MB)"), {
          fatal: true,
        });
      }

      if (response.status === 415) {
        throw Object.assign(new Error("Unsupported file type"), {
          fatal: true,
        });
      }

      if (!response.ok) {
        throw new Error(`Upload failed: ${response.status}`);
      }

      return await response.json();
    } catch (error) {
      // Rethrow immediately on non-retryable errors or the final attempt.
      if (error.fatal || attempt === maxRetries - 1) throw error;

      // Exponential backoff: 1s, 2s, 4s, ...
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

File Metadata

Documents stored in Trainly include rich metadata:
{
  id: "doc_abc123",              // Unique document ID
  chatId: "chat_xyz789",         // Associated chat
  filename: "research.pdf",      // Original filename
  uploadDate: 1609459200000,     // Upload timestamp
  sizeBytes: 524288,             // File size

  // Custom scopes (optional)
  project_id: "proj_456",
  category: "research",
  workspace_id: "ws_123"
}

Error Codes

  • 400 Bad Request - Invalid file format or malicious content detected
  • 408 Request Timeout - File processing exceeded the 30-second timeout
  • 413 Payload Too Large - File exceeds the 5MB size limit
  • 415 Unsupported Media Type - File type not supported
  • 500 Internal Server Error - Error during text extraction or embedding generation