first commit

2025-11-05 01:03:10 +01:00
commit 5a802e7641
20 changed files with 6161 additions and 0 deletions
@@ -0,0 +1,185 @@
+# How It Works: Manus AI Clone - System Summary
+
+## Overview
+
+Manus AI Clone is a browser automation system that lets users control a web browser with natural language prompts. It combines a web frontend, a FastAPI backend, a LangChain agent, and Playwright: the agent interprets the user's prompt and carries it out as a sequence of browser actions.
+
+### Key Technologies
+- **Frontend**: HTML5, CSS3, Vanilla JavaScript
+- **Backend**: FastAPI (Python)
+- **AI Framework**: LangChain
+- **Browser Automation**: Playwright
+- **LLM**: OpenAI GPT models (gpt-4o-mini, gpt-4o, etc.)
+
+---
+
+## System Architecture
+
+The system follows a layered architecture:
+
+1. **Frontend Layer** - User interface for input and results display
+2. **Backend API Layer** - FastAPI server handling HTTP requests
+3. **Browser Agent Layer** - LangChain agent that plans and executes tasks
+4. **Browser Control Layer** - Playwright for browser automation
+
+---
+
+## How It Works
+
+### Frontend Layer
+
+The frontend provides a web-based user interface where users can:
+- Enter natural language prompts describing browser tasks
+- View example prompts for quick reference
+- See real-time loading indicators during task execution
+- View results including:
+  - Success/error status
+  - Agent output messages
+  - Complete action history (all browser actions taken)
+  - Screenshot of the final browser state
+- Track execution statistics (total tasks, success rate, average time)
+
+When a user submits a task:
+1. The JavaScript validates the input
+2. Sends an HTTP POST request to the `/execute` endpoint with the prompt
+3. Shows a loading indicator while waiting for the response
+4. Upon receiving the response, displays all results in the UI
+5. Updates statistics and shows notifications
+
+Statistics are persisted in the browser's localStorage, so they survive page reloads and browser restarts.
+
+### Backend API Layer
+
+The FastAPI backend serves multiple purposes:
+
+**API Endpoints**:
+- `GET /` - Serves the frontend HTML interface
+- `POST /execute` - Main endpoint that executes browser automation tasks
+- `GET /status` - Returns current browser state and action history
+- `GET /health` - Health check endpoint
+
+**Lifecycle Management**:
+- On startup, initializes a single `BrowserAgent` instance
+- Loads configuration from environment variables (OpenAI API key, model selection, headless mode)
+- Keeps that instance alive for the lifetime of the server process
+- On shutdown, closes the browser and releases its resources
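+
+The configuration step above could look roughly like the following sketch. It uses only the standard library. The variable names `OPENAI_API_KEY`, `OPENAI_MODEL`, and `HEADLESS`, the defaults, and the `Settings` class are assumptions made for illustration; the project may use different names.
+
+```python
+import os
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Settings:
+    """Hypothetical container for the three settings named in the text."""
+    openai_api_key: str
+    model: str
+    headless: bool
+
+
+def load_settings(env=None) -> Settings:
+    """Read agent configuration from environment variables."""
+    env = os.environ if env is None else env
+    api_key = env.get("OPENAI_API_KEY", "")
+    if not api_key:
+        # Fail at startup rather than on the first task.
+        raise RuntimeError("OPENAI_API_KEY is not set")
+    return Settings(
+        openai_api_key=api_key,
+        # Assumed variable names and defaults, not taken from the project.
+        model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
+        headless=env.get("HEADLESS", "true").strip().lower() in {"1", "true", "yes"},
+    )
+```
+
+Passing a plain dict as `env` makes the function easy to exercise without touching the real environment, for example `load_settings({"OPENAI_API_KEY": "sk-test", "HEADLESS": "false"})`.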
+
+**Request Processing**:
+When a task execution request is received:
+1. Validates the request payload
+2. Checks that the browser agent is initialized
+3. Calls the agent's `execute_task()` method with the user's prompt
+4. Formats and returns the response with success status, output text, screenshot, and action history
+5. Returns an HTTP error status code if validation or execution fails
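+
+The request-processing steps above can be sketched without any web framework. This is a minimal illustration, not the project's handler: `handle_execute` and `StubAgent` are hypothetical, `execute_task()` is assumed to be a coroutine returning a dict, and the specific status codes (422, 503, 500) are assumptions. In the real backend this logic would presumably live inside the FastAPI route for `POST /execute`.
+
+```python
+import asyncio
+
+
+class StubAgent:
+    """Stand-in for BrowserAgent; only execute_task() is named in the text."""
+
+    async def execute_task(self, prompt: str) -> dict:
+        return {"output": f"Done: {prompt}", "screenshot": None, "action_history": []}
+
+
+async def handle_execute(payload, agent) -> tuple[int, dict]:
+    """Return (HTTP status code, JSON body) for an /execute request."""
+    # Step 1: validate the payload.
+    prompt = payload.get("prompt") if isinstance(payload, dict) else None
+    if not isinstance(prompt, str) or not prompt.strip():
+        return 422, {"detail": "prompt must be a non-empty string"}
+    # Step 2: make sure the agent was initialized at startup.
+    if agent is None:
+        return 503, {"detail": "browser agent is not initialized"}
+    # Step 3: run the task; step 5: map failures to an error status.
+    try:
+        result = await agent.execute_task(prompt.strip())
+    except Exception as exc:
+        return 500, {"detail": f"task failed: {exc}"}
+    # Step 4: format the response fields listed in the text.
+    return 200, {
+        "success": True,
+        "output": result.get("output", ""),
+        "screenshot": result.get("screenshot"),
+        "action_history": result.get("action_history", []),
+    }
+```
+
+For example, `asyncio.run(handle_execute({"prompt": "Go to example.com"}, StubAgent()))` returns a 200 status with the stub's output, while an empty prompt or a missing agent returns an error status without touching the browser.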
+
+### Browser Agent Layer
+
+The Browser Agent consists of two main components:
+
+#### BrowserController (Low-Level Playwright Wrapper)
+
+This component provides direct access to Playwright browser operations. It handles:
+- Browser initialization (launching Chromium, creating context and page)
+- Navigation to URLs
+- Clicking elements by CSS selectors
+- Typing text into input fields
+- Extracting text from page elements
+- Getting page content (title, URL, visible text)
+- Taking screenshots
+- Executing JavaScript on the page
+- Finding and inspecting elements
+- Scrolling the page
+
+Every action is logged to an action history for transparency and debugging.
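+
+A minimal sketch of such an action log follows. The `ActionLog` class and its entry fields (`action`, `timestamp`, plus free-form details) are assumptions for illustration; the actual `BrowserController` may record different fields.
+
+```python
+from datetime import datetime, timezone
+
+
+class ActionLog:
+    """Append-only record of browser actions (hypothetical helper)."""
+
+    def __init__(self) -> None:
+        self._entries: list[dict] = []
+
+    def record(self, action: str, **details) -> dict:
+        """Store one action with a UTC timestamp and any extra details."""
+        entry = {
+            "action": action,
+            "timestamp": datetime.now(timezone.utc).isoformat(),
+            **details,
+        }
+        self._entries.append(entry)
+        return entry
+
+    def history(self) -> list[dict]:
+        # Return a copy of the list so callers cannot add or remove entries.
+        return list(self._entries)
+```
+
+Each controller method would call something like `log.record("navigate", url=url)` or `log.record("click", selector=selector)` before returning, and `history()` would supply the action history included in the API response.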
+
+#### BrowserAgent (High-Level LangChain Agent)
+
+This component uses LangChain to create an intelligent AI agent that can:
+- Understand natural language prompts
+- Break down complex tasks into steps
+- Select appropriate tools for each step
+- Execute tools in a logical sequence
+- Reason about results and adjust actions accordingly
+- Verify task completion
+
+The agent has access to 8 tools that correspond to browser operations:
+1. **navigate** - Go to URLs
+2. **click** - Click elements by CSS selector
+3. **type_text** - Fill input fields (uses format: "selector|text")
+4. **get_text** - Extract text from specific elements
+5. **get_page_content** - Read current page content
+6. **scroll** - Scroll page in different directions
+7. **get_elements_info** - Find and inspect elements
+8. **execute_javascript** - Run custom JavaScript
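+
+The "selector|text" format used by **type_text** packs two arguments into one string. A sketch of how such an input might be parsed is shown below; the function name `parse_type_text_input` is hypothetical, and splitting on the first `|` is an assumption about the project's behavior.
+
+```python
+def parse_type_text_input(tool_input: str) -> tuple[str, str]:
+    """Split a 'selector|text' tool input into (selector, text)."""
+    # partition() splits on the first "|" only, so the text may contain pipes.
+    selector, sep, text = tool_input.partition("|")
+    selector = selector.strip()
+    if not sep or not selector:
+        raise ValueError("expected input in the form 'selector|text'")
+    return selector, text
+```
+
+For example, `parse_type_text_input("input[name='q']|Python")` returns `("input[name='q']", "Python")`. One limitation of this approach is that a selector which itself contains `|`, such as the attribute selector `[lang|=en]`, would be split in the wrong place.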
+
+Each tool has a detailed description that helps the AI agent understand when and how to use it. The agent uses these descriptions to select the right tool for each task.
+
+**System Prompt**: The agent is given comprehensive instructions on how to approach tasks, when to use each tool, how to verify actions, and how to write CSS selectors.
+
+**Async/Sync Bridge**: The tools are exposed to LangChain as synchronous functions, while the Playwright operations are async, so wrapper functions use `asyncio.run()` to bridge the gap.
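+
+The bridge can be illustrated with a small sketch. Here `navigate_async` is a placeholder that only simulates an awaited Playwright call; both function names are assumptions, not the project's code.
+
+```python
+import asyncio
+
+
+async def navigate_async(url: str) -> str:
+    # Placeholder for an awaited Playwright call such as page.goto(url).
+    await asyncio.sleep(0)
+    return f"Navigated to {url}"
+
+
+def navigate(url: str) -> str:
+    """Synchronous tool entry point that wraps the coroutine."""
+    # asyncio.run() creates a new event loop, runs the coroutine to
+    # completion, and closes the loop before returning the result.
+    return asyncio.run(navigate_async(url))
+```
+
+Note that `asyncio.run()` raises `RuntimeError` when called from a thread that already has a running event loop, so this pattern only works if the synchronous tool is invoked outside the server's event loop, for example from a worker thread.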
+
+### Task Execution Flow
+
+When a user submits a task like "Go to google.com and search for Python":
+
+1. **Frontend** sends the prompt to the backend API
+2. **Backend** receives the request and calls the agent
+3. **Agent** analyzes the prompt and breaks it down:
+   - Navigate to google.com
+   - Understand the page structure
+   - Find the search input field
+   - Type "Python" into the search field
+   - Click the search button
+   - Verify the results
+4. **Agent** selects and executes tools in sequence:
+   - Uses `navigate()` to go to Google
+   - Uses `get_page_content()` to understand the page
+   - Uses `get_elements_info()` to find the search input
+   - Uses `type_text()` to enter the search query
+   - Uses `click()` to submit the search
+   - Uses `get_page_content()` again to verify success
+5. **Playwright** performs each browser action through the BrowserController
+6. **Results** flow back to the agent after each tool execution
+7. **Agent** reasons about the results and determines when the task is complete
+8. **Screenshot** is captured of the final browser state
+9. **Response** is assembled with success status, output message, base64-encoded screenshot, and action history
+10. **Frontend** displays all results to the user
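+
+Steps 4 to 6 above amount to a dispatch loop: look up a tool by name, call it, and feed the result back. The sketch below replaces the LLM with a fixed plan, so it only illustrates the dispatch and error-reporting part, not the agent's reasoning. `run_plan`, `STUB_TOOLS`, and the selectors are hypothetical.
+
+```python
+# Stub tools standing in for the Playwright-backed ones.
+STUB_TOOLS = {
+    "navigate": lambda url: f"Navigated to {url}",
+    "type_text": lambda arg: f"Typed into {arg.partition('|')[0]}",
+    "click": lambda selector: f"Clicked {selector}",
+}
+
+
+def run_plan(plan, tools):
+    """Execute (tool_name, tool_input) steps in order and collect observations."""
+    observations = []
+    for name, tool_input in plan:
+        tool = tools.get(name)
+        if tool is None:
+            observations.append((name, f"Error: unknown tool '{name}'"))
+            continue
+        try:
+            observations.append((name, tool(tool_input)))
+        except Exception as exc:
+            # Report the failure as an observation instead of aborting,
+            # so the caller can decide how to continue.
+            observations.append((name, f"Error: {exc}"))
+    return observations
+
+
+observations = run_plan(
+    [
+        ("navigate", "https://example.com"),
+        ("type_text", "#search|Python"),
+        ("click", "#submit"),
+    ],
+    STUB_TOOLS,
+)
+```
+
+In the real system the agent would choose each next step only after seeing the previous observation, rather than following a list prepared in advance.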
+
+### Data Flow
+
+The complete flow follows this pattern:
+
+**User Input** → **Frontend JavaScript** → **HTTP POST Request** → **FastAPI Backend** → **LangChain Agent** → **Tool Selection** → **Playwright Browser Actions** → **Results Flow Back** → **Agent Reasoning** → **Screenshot Capture** → **Response Assembly** → **JSON Response** → **Frontend Display** → **User Views Results**
+
+### Key Features
+
+**Action History**: Every browser action is logged with details (action type, selectors, URLs, text entered, etc.). This provides full transparency of what the AI did.
+
+**Screenshot Capture**: After task completion, a screenshot is taken and included in the response as a base64-encoded image, giving users visual confirmation of the results.
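+
+Base64 encoding lets binary image bytes travel inside a JSON response. The sketch below shows the encoding step and one way a client could turn the string into a displayable data URI. The helper names are hypothetical, and the PNG format is an assumption (it is Playwright's default screenshot type, but the project may configure another).
+
+```python
+import base64
+
+
+def encode_screenshot(image_bytes: bytes) -> str:
+    """Encode raw image bytes as an ASCII base64 string for JSON transport."""
+    return base64.b64encode(image_bytes).decode("ascii")
+
+
+def to_data_uri(encoded: str) -> str:
+    """Build a data URI usable as an image source (assumes PNG data)."""
+    return f"data:image/png;base64,{encoded}"
+
+
+# The 8-byte PNG signature stands in for a real screenshot here.
+sample = b"\x89PNG\r\n\x1a\n"
+encoded = encode_screenshot(sample)
+```
+
+Decoding with `base64.b64decode(encoded)` returns the original bytes, and `to_data_uri(encoded)` gives a string that a frontend could assign directly to an image element's source.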
+
+**Error Handling**: Errors are handled at every layer:
+- Frontend catches network errors and displays user-friendly messages
+- Backend validates requests and returns appropriate HTTP status codes
+- Browser agent handles Playwright timeouts and element-not-found errors gracefully
+
+**State Management**:
+- Browser state persists between tasks (single browser instance)
+- Frontend statistics persist in localStorage
+- Action history accumulates throughout the session
+
+**Modular Architecture**: Each layer is independent, making the system maintainable and extensible. New browser tools can be added by extending the BrowserController and creating corresponding tool wrappers.
+
+---
+
+## Summary
+
+Manus AI Clone turns natural language instructions into browser automation through the following pipeline:
+
+1. Users provide natural language prompts through a web interface
+2. The FastAPI backend receives and validates requests
+3. A LangChain AI agent interprets the task and plans a sequence of actions
+4. The agent executes browser tools through Playwright
+5. Results are collected, including screenshots and action history
+6. Everything is displayed back to the user in the frontend
+
+The system demonstrates how LLM reasoning can be combined with browser automation to interact with web pages much as a human would, at the speed of automation.