feat: extended-files #1427

Toddyclipsgg · 2025-03-02T18:50:55Z

Detailed Summary of Document Processing Feature

The implemented feature introduces a robust system for processing documents and files in the application, with a special focus on PDF and DOCX files. Here is a detailed breakdown of the features:

Document Text Extraction

The core component is the implementation of text extractors for different types of documents:

DOCX file processing:

Uses the JSZip library to decompress the contents of the DOCX file
Parses the internal XML document (word/document.xml) to extract the text
Uses regular expressions to identify and extract the contents of <w:t> tags
Reconstructs the text of the document while maintaining basic formatting
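
The regex step above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names are hypothetical, and it assumes the XML of word/document.xml has already been read from the archive (for example with JSZip) into a string.

```typescript
// Hypothetical helper: decodes the five predefined XML entities.
function decodeXmlEntities(s: string): string {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Hypothetical helper: returns one line of text per <w:p> paragraph.
function extractTextFromDocumentXml(xml: string): string {
  // "<w:p>" or "<w:p ...>", but not "<w:pPr>" (paragraph properties).
  const paragraphPattern = /<w:p[\s>][\s\S]*?<\/w:p>/g;
  const paragraphs: string[] = [];
  let paragraph: RegExpExecArray | null;

  while ((paragraph = paragraphPattern.exec(xml)) !== null) {
    // "<w:t>" may carry attributes such as xml:space="preserve".
    const textPattern = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
    let text = '';
    let run: RegExpExecArray | null;
    while ((run = textPattern.exec(paragraph[0])) !== null) {
      text += decodeXmlEntities(run[1]);
    }
    paragraphs.push(text);
  }

  return paragraphs.join('\n');
}
```

A regex pass like this ignores tables, text boxes, and nested paragraphs, so it only preserves the paragraph-level line breaks.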

PDF file processing:

Implements a simplified approach that does not rely on PDF.js workers
Parses the raw bytes of the PDF to find text strings
Looks for common patterns in PDFs such as text enclosed in parentheses
Identifies uncompressed blocks of text delimited by the BT and ET operators
Extracts TJ and Tj command strings that frequently contain text
Processes and cleans the extracted text, preserving line breaks
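
As a rough sketch of the approach described above (not the PR's actual code), the following scans an already-decoded PDF source string for BT ... ET text objects and collects the literal strings inside them, which covers the operands of Tj and TJ. The function names are hypothetical. It only sees uncompressed content streams, so PDFs whose streams are compressed (for example with FlateDecode) yield nothing.

```typescript
// Hypothetical helper: decodes common escapes of a PDF literal string.
function unescapePdfString(raw: string): string {
  return raw.replace(/\\([nrtbf()\\]|[0-7]{1,3})/g, (_match: string, code: string) => {
    switch (code) {
      case 'n': return '\n';
      case 'r': return '\r';
      case 't': return '\t';
      case 'b': return '\b';
      case 'f': return '\f';
      case '(':
      case ')':
      case '\\':
        return code;
      default:
        return String.fromCharCode(parseInt(code, 8)); // octal escape, e.g. \101
    }
  });
}

// Hypothetical helper: one output line per BT ... ET text object.
function extractTextFromPdfSource(source: string): string {
  const lines: string[] = [];
  const blockPattern = /\bBT\b([\s\S]*?)\bET\b/g;
  let block: RegExpExecArray | null;

  while ((block = blockPattern.exec(source)) !== null) {
    // "(...)" with backslash escapes; unescaped nested parentheses are not handled.
    const stringPattern = /\(((?:\\[\s\S]|[^\\()])*)\)/g;
    let text = '';
    let literal: RegExpExecArray | null;
    while ((literal = stringPattern.exec(block[1])) !== null) {
      text += unescapePdfString(literal[1]);
    }
    if (text.trim() !== '') {
      lines.push(text);
    }
  }

  return lines.join('\n');
}
```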

File Handling in the Interface

The system also implements a complete interface for uploading and manipulating files:

File Upload:

Support for multiple file types (.md, .docx, .pdf, .txt)
File selection interface with appropriate filters
Asynchronous processing of selected files
Validation of file types and maximum size (5MB)
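
A minimal sketch of the type and size check, assuming validation is done on the file extension and that "5MB" means 5 * 1024 * 1024 bytes. The names and the exact rules are illustrative assumptions, not taken from the PR.

```typescript
// Extensions listed in the description; the matching rule itself is an assumption.
const ALLOWED_EXTENSIONS = ['.md', '.docx', '.pdf', '.txt'];
// Assumption: 5 MB is interpreted as 5 * 1024 * 1024 bytes.
const MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024;

// The subset of the browser File interface this check needs.
interface UploadCandidate {
  name: string;
  size: number;
}

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: rejects unsupported extensions and oversized files.
function validateUpload(file: UploadCandidate): ValidationResult {
  const dot = file.name.lastIndexOf('.');
  const extension = dot === -1 ? '' : file.name.slice(dot).toLowerCase();

  if (ALLOWED_EXTENSIONS.indexOf(extension) === -1) {
    return { ok: false, reason: `Unsupported file type: ${extension || '(none)'}` };
  }
  if (file.size > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: 'File exceeds the 5 MB limit' };
  }
  return { ok: true };
}
```

A browser `File` object satisfies `UploadCandidate`, so files from an input element can be passed in directly.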

Document Preview:

Display of specific icons per file type
Formatted preview for PDF and DOCX documents with distinctive icons
Special handling for text documents with proper formatting
Informative toast notifications about attached files

Manipulation via Drag & Drop and Clipboard:

Support for dragging and dropping files into the interface
Capture of pasted files from the clipboard
Security validation to prevent uploading of scripted files
Visual feedback during the upload process
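
Drop and paste events both expose their payload as a list of transfer items, so one helper can serve both paths. The sketch below is an assumption about how that could look, not the PR's code: the helper name and the blocked-extension list are hypothetical, and the item interface mirrors only the `kind` and `getAsFile()` members of the browser's `DataTransferItem`.

```typescript
interface NamedFile {
  name: string;
}

// Mirrors the parts of DataTransferItem that are used here.
interface TransferItemLike<F extends NamedFile> {
  kind: string;
  getAsFile(): F | null;
}

// Hypothetical block list; the PR does not state which extensions it rejects.
const BLOCKED_EXTENSIONS = ['.js', '.sh', '.bat', '.exe', '.html'];

// Hypothetical helper: keeps file items and drops script-like files.
function collectSafeFiles<F extends NamedFile>(items: ArrayLike<TransferItemLike<F>>): F[] {
  const files: F[] = [];

  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    if (item.kind !== 'file') {
      continue; // pasted plain text or HTML, not a file
    }
    const file = item.getAsFile();
    if (file === null) {
      continue;
    }
    const lowerName = file.name.toLowerCase();
    const blocked = BLOCKED_EXTENSIONS.some((ext) => lowerName.slice(-ext.length) === ext);
    if (!blocked) {
      files.push(file);
    }
  }

  return files;
}
```

In a browser, the same function could be called with `event.dataTransfer.items` in a drop handler or `event.clipboardData.items` in a paste handler.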

Integration with the Chat System:

Adding files to the conversation context
Processing of documents before sending the message
Extracting text for analysis by the language model
State management for attached files

The feature implements a complete flow from uploading to processing and displaying documents, with a focus on user experience and security, avoiding potential problems with malicious or very large files.

Resource created from request: #1412 @leex279

thecodacus · 2025-03-05T12:16:44Z

app/components/chat/Chat.client.tsx

+        .join('\n\n');
+
+      const contentWithFilesInfo = textFilesInfo ? `${messageContent}\n\n${textFilesInfo}` : messageContent;
+


I think we should change the order:

Here is some context from files for your reference\n\n---\n ${textFilesInfo}\n---\n${messageContent}

Can I reverse the order then?

Yes. Content at the end gets more emphasis, so the AI will know what to do with all the context if we put the message after it.
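
For reference, the suggested ordering could be sketched as a small helper. The function name is hypothetical. The template string is the one proposed in the review comment, and the fallback to the plain message mirrors the ternary in the diff.

```typescript
// Hypothetical helper: puts file context first and the user's message last.
function buildContentWithFilesInfo(messageContent: string, textFilesInfo: string): string {
  if (!textFilesInfo) {
    return messageContent; // no attached text files, send the message unchanged
  }
  return `Here is some context from files for your reference\n\n---\n ${textFilesInfo}\n---\n${messageContent}`;
}
```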

thecodacus · 2025-03-05T12:22:17Z

I just got a chance to review the code. Looks good to me. @leex279 if you can test the functionality and UI, then we are good, I believe.

leex279 · 2025-03-05T12:26:35Z

will test later today

Toddyclipsgg · 2025-03-05T12:34:09Z

Thanks @thecodacus for reviewing the feature!

leex279 · 2025-03-05T20:14:06Z

Findings

PDF files do not seem to work properly. Either it throws an error (maybe the file is too big) or it does not use the PDF content as instructions and builds a PDF viewer instead :D
1. Not working at all. Example: https://www.cmu.edu/swartz-center-for-entrepreneurship/assets/creating-an-effective-landing-page.pdf

2. Building wrong app:
Unbenanntes Dokument.pdf

3. Styling could be improved (margin between filebox and chat):

Toddyclipsgg · 2025-03-08T00:19:34Z

Can you please test it for me now? @leex279

Toddyclipsgg · 2025-03-09T11:02:43Z

@thecodacus I accidentally deleted the feature's commits. Could you please check that I didn't break anything when I recovered them?

leex279 · 2025-03-09T21:15:25Z

@thecodacus please take a look. I don't know :D ... let me know when I can test/review again :)

Toddyclipsgg force-pushed the extended-files branch from 983b2b5 to dbb7d1c Compare March 2, 2025 18:58

leex279 requested a review from thecodacus March 4, 2025 21:18

thecodacus reviewed Mar 5, 2025

leex279 assigned Toddyclipsgg Mar 5, 2025

Toddyclipsgg force-pushed the extended-files branch 2 times, most recently from 2925271 to 2e6b0e0 Compare March 8, 2025 14:09

Toddyclipsgg requested a review from thecodacus March 8, 2025 14:14

Toddyclipsgg closed this Mar 9, 2025

Toddyclipsgg force-pushed the extended-files branch from 898febe to 50dd74d Compare March 9, 2025 10:13

fix: error delete work

3fa0d9d

Toddyclipsgg reopened this Mar 9, 2025
