first commit
# How It Works: Manus AI Clone - System Summary

## Overview

Manus AI Clone is an AI-powered browser automation system that lets users control a web browser with natural language prompts. It combines a web frontend, a FastAPI backend, a LangChain agent, and Playwright browser automation to interpret user intent and carry out multi-step browser tasks.

### Key Technologies

- **Frontend**: HTML5, CSS3, Vanilla JavaScript
- **Backend**: FastAPI (Python)
- **AI Framework**: LangChain
- **Browser Automation**: Playwright
- **LLM**: OpenAI GPT models (gpt-4o-mini, gpt-4o, etc.)

---

## System Architecture

The system follows a layered architecture:

1. **Frontend Layer** - User interface for input and results display
2. **Backend API Layer** - FastAPI server handling HTTP requests
3. **Browser Agent Layer** - LangChain agent that plans and executes tasks
4. **Browser Control Layer** - Playwright for browser automation

---

## How It Works

### Frontend Layer

The frontend provides a web-based user interface where users can:

- Enter natural language prompts describing browser tasks
- View example prompts for quick reference
- See real-time loading indicators during task execution
- View results including:
  - Success/error status
  - Agent output messages
  - Complete action history (all browser actions taken)
  - Screenshot of the final browser state
- Track execution statistics (total tasks, success rate, average time)

When a user submits a task, the frontend JavaScript:

1. Validates the input
2. Sends an HTTP POST request to the `/execute` endpoint with the prompt
3. Shows a loading indicator while waiting for the response
4. Displays all results in the UI once the response arrives
5. Updates statistics and shows notifications

Statistics are persisted in the browser's `localStorage`, so they survive page reloads.

### Backend API Layer

The FastAPI backend serves multiple purposes:

**API Endpoints**:

- `GET /` - Serves the frontend HTML interface
- `POST /execute` - Main endpoint that executes browser automation tasks
- `GET /status` - Returns current browser state and action history
- `GET /health` - Health check endpoint

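The `POST /execute` endpoint can also be called from a script instead of the web frontend. The following is a minimal sketch that only builds the request; the JSON field name `prompt`, the base URL `http://localhost:8000`, and the helper name `build_execute_request` are assumptions for illustration and are not confirmed by the project. Sending the request needs a running server, so that part is left as a comment.

```python
import json
import urllib.request


def build_execute_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /execute endpoint.

    The field name "prompt" and the default base URL are assumptions.
    """
    prompt = prompt.strip()
    if not prompt:
        # Mirrors the frontend's input validation before sending anything.
        raise ValueError("prompt must not be empty")
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request("Go to google.com and search for Python")

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```
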
**Lifecycle Management**:
- On startup, initializes a single `BrowserAgent` instance
- Loads configuration from environment variables (OpenAI API key, model selection, headless mode)
- Manages browser agent lifecycle (startup and shutdown)
- On shutdown, properly cleans up browser resources

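One way to express this startup and shutdown pairing is an async context manager, which is the shape FastAPI accepts through `FastAPI(lifespan=...)`. The sketch below uses only the standard library so it runs without a server; `StubBrowserAgent`, its `start()`/`stop()` methods, the `state` dictionary, and the environment variable names `OPENAI_MODEL` and `HEADLESS` are hypothetical stand-ins, not the project's actual names.

```python
import asyncio
import os
from contextlib import asynccontextmanager


class StubBrowserAgent:
    """Hypothetical stand-in for the real BrowserAgent."""

    def __init__(self, model: str, headless: bool) -> None:
        self.model = model
        self.headless = headless
        self.running = False

    async def start(self) -> None:
        self.running = True  # the real agent would launch the browser here

    async def stop(self) -> None:
        self.running = False  # the real agent would close the browser here


# Holds the single shared agent instance for the lifetime of the app.
state = {"agent": None}


@asynccontextmanager
async def lifespan(app):
    # Startup: read configuration from the environment (names are assumptions).
    agent = StubBrowserAgent(
        model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        headless=os.getenv("HEADLESS", "true").lower() == "true",
    )
    await agent.start()
    state["agent"] = agent
    try:
        yield  # the application serves requests while suspended here
    finally:
        # Shutdown: always release browser resources, even after an error.
        await agent.stop()
        state["agent"] = None


# With FastAPI this would be wired up as: app = FastAPI(lifespan=lifespan)


async def demo():
    async with lifespan(app=None):
        running_during = state["agent"].running
    return running_during, state["agent"]


running_during, agent_after = asyncio.run(demo())
```
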
**Request Processing**:

When a task execution request is received, the backend:

1. Validates the request payload
2. Checks that the browser agent is initialized
3. Calls the agent's `execute_task()` method with the user's prompt
4. Formats and returns the response with success status, output text, screenshot, and action history
5. Returns an appropriate HTTP status code when an error occurs

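The five steps above can be sketched as a plain function that returns a status code and a response body. This is an illustration under assumptions, not the project's handler: the function name `handle_execute`, the `EchoAgent` stub, the response field names, and the specific status codes (422, 503, 500) are all chosen for the example. FastAPI does use 422 by default for request validation failures, but the other codes are guesses.

```python
import asyncio
from typing import Any, Optional, Tuple


class EchoAgent:
    """Hypothetical stand-in for BrowserAgent; no browser is involved."""

    async def execute_task(self, prompt: str) -> dict:
        return {
            "success": True,
            "output": f"Done: {prompt}",
            "screenshot": None,
            "action_history": [],
        }


async def handle_execute(payload: dict, agent: Optional[Any]) -> Tuple[int, dict]:
    # Step 1: validate the request payload.
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return 422, {"detail": "prompt must be a non-empty string"}

    # Step 2: check that the browser agent is initialized.
    if agent is None:
        return 503, {"detail": "browser agent is not initialized"}

    # Step 3: run the task; step 5: map unexpected failures to an error code.
    try:
        result = await agent.execute_task(prompt.strip())
    except Exception as exc:
        return 500, {"detail": f"task execution failed: {exc}"}

    # Step 4: format the response.
    return 200, {
        "success": result["success"],
        "output": result["output"],
        "screenshot": result["screenshot"],
        "action_history": result["action_history"],
    }


status, body = asyncio.run(handle_execute({"prompt": "open example.com"}, EchoAgent()))
```

Returning a `(status, body)` pair keeps the logic testable without a web framework; in a FastAPI route the error branches would more likely raise `HTTPException`.
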
### Browser Agent Layer

The Browser Agent consists of two main components:

#### BrowserController (Low-Level Playwright Wrapper)

This component provides direct access to Playwright browser operations. It handles:
- Browser initialization (launching Chromium, creating context and page)
- Navigation to URLs
- Clicking elements by CSS selectors
- Typing text into input fields
- Extracting text from page elements
- Getting page content (title, URL, visible text)
- Taking screenshots
- Executing JavaScript on the page
- Finding and inspecting elements
- Scrolling the page

Every action is logged to an action history for transparency and debugging.

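The action history can be pictured as a list that every controller method appends to. The sketch below shows only that logging pattern; the class name `LoggingController`, the `_log` helper, and the entry fields (`action`, `timestamp`, and the per-action details) are assumptions, and the Playwright calls are left as comments so the example runs without a browser.

```python
import asyncio
from datetime import datetime, timezone


class LoggingController:
    """Hypothetical sketch of action logging; Playwright calls are omitted."""

    def __init__(self) -> None:
        self.action_history = []

    def _log(self, action: str, **details) -> None:
        # Each entry records what was done, when, and with which arguments.
        self.action_history.append(
            {
                "action": action,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                **details,
            }
        )

    async def navigate(self, url: str) -> str:
        # Real controller: await self.page.goto(url)
        self._log("navigate", url=url)
        return f"Navigated to {url}"

    async def click(self, selector: str) -> str:
        # Real controller: await self.page.click(selector)
        self._log("click", selector=selector)
        return f"Clicked {selector}"

    async def type_text(self, selector: str, text: str) -> str:
        # Real controller: await self.page.fill(selector, text)
        self._log("type_text", selector=selector, text=text)
        return f"Typed into {selector}"


async def demo():
    controller = LoggingController()
    await controller.navigate("https://example.com")
    await controller.type_text("input[name='q']", "Python")
    return controller.action_history


history = asyncio.run(demo())
```

After the demo, `history` holds two entries in call order, which is the kind of record the frontend could display as the action history.
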
#### BrowserAgent (High-Level LangChain Agent)

This component uses LangChain to create an intelligent AI agent that can:
- Understand natural language prompts
- Break down complex tasks into steps
- Select appropriate tools for each step
- Execute tools in a logical sequence
- Reason about results and adjust actions accordingly
- Verify task completion

The agent has access to 8 tools that correspond to browser operations:
1. **navigate** - Go to URLs
2. **click** - Click elements by CSS selector
3. **type_text** - Fill input fields (uses format: "selector|text")
4. **get_text** - Extract text from specific elements
5. **get_page_content** - Read current page content
6. **scroll** - Scroll page in different directions
7. **get_elements_info** - Find and inspect elements
8. **execute_javascript** - Run custom JavaScript

Each tool has a detailed description that helps the AI agent understand when and how to use it. The agent uses these descriptions to select the right tool for each task.

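Because a tool receives a single string, `type_text` has to split its `"selector|text"` input back into two parts. A minimal sketch of that parsing follows; the function name `parse_type_text_input` is hypothetical, and splitting on the first `|` is an assumption about how the project handles it. Splitting on the first separator lets the typed text contain `|`, but it would mis-split a selector that itself contains one, such as the attribute selector `[lang|="en"]`.

```python
from typing import Tuple


def parse_type_text_input(tool_input: str) -> Tuple[str, str]:
    """Split a "selector|text" tool input on the first "|" (an assumption)."""
    selector, separator, text = tool_input.partition("|")
    if not separator or not selector.strip():
        raise ValueError('expected input in the form "selector|text"')
    # The selector is trimmed; the text is kept as-is so spaces are preserved.
    return selector.strip(), text


selector, text = parse_type_text_input("input[name='q']|Python | pipes stay in the text")
```
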
**System Prompt**: The agent is given comprehensive instructions on how to approach tasks, when to use each tool, how to verify actions, and how to write CSS selectors.

**Async/Sync Bridge**: The tools are exposed to LangChain as synchronous functions, but Playwright operations are async, so wrapper functions use `asyncio.run()` to bridge the gap.

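The bridge can be sketched as a small decorator that turns a coroutine function into a blocking one. The names `to_sync` and `get_title_async` are hypothetical, and `asyncio.sleep(0)` stands in for an awaited Playwright call.

```python
import asyncio
import functools


def to_sync(async_fn):
    """Wrap a coroutine function so it can be called like a normal function."""

    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        # asyncio.run() creates a new event loop, runs the coroutine to
        # completion, and closes the loop again.
        return asyncio.run(async_fn(*args, **kwargs))

    return wrapper


async def get_title_async(url: str) -> str:
    await asyncio.sleep(0)  # stands in for an awaited Playwright operation
    return f"title of {url}"


get_title = to_sync(get_title_async)
title = get_title("https://example.com")
```

One caveat with this pattern: `asyncio.run()` raises `RuntimeError` if an event loop is already running in the calling thread, so a wrapper like this only works when the tool is invoked from synchronous code, for example from a worker thread.
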
### Task Execution Flow

When a user submits a task like "Go to google.com and search for Python":

1. **Frontend** sends the prompt to the backend API
2. **Backend** receives the request and calls the agent
3. **Agent** analyzes the prompt and breaks it down:
   - Navigate to google.com
   - Understand the page structure
   - Find the search input field
   - Type "Python" into the search field
   - Click the search button
   - Verify the results
4. **Agent** selects and executes tools in sequence:
   - Uses `navigate()` to go to Google
   - Uses `get_page_content()` to understand the page
   - Uses `get_elements_info()` to find the search input
   - Uses `type_text()` to enter the search query
   - Uses `click()` to submit the search
   - Uses `get_page_content()` again to verify success
5. **Playwright** performs each browser action through the BrowserController
6. **Results** flow back to the agent after each tool execution
7. **Agent** reasons about the results and determines when the task is complete
8. **Screenshot** is captured of the final browser state
9. **Response** is assembled with success status, output message, base64-encoded screenshot, and action history
10. **Frontend** displays all results to the user

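Steps 8 and 9 can be sketched as follows. Playwright's `page.screenshot()` returns raw image bytes, which cannot go into JSON directly, so they are base64-encoded into a string first. The function name `assemble_response`, the response field names, and the placeholder image bytes are assumptions for illustration.

```python
import base64
import json


def assemble_response(
    success: bool, output: str, screenshot_png: bytes, action_history: list
) -> dict:
    """Build a JSON-serializable response; field names are assumptions."""
    return {
        "success": success,
        "output": output,
        # bytes -> base64 -> ASCII string, safe to embed in JSON
        "screenshot": base64.b64encode(screenshot_png).decode("ascii"),
        "action_history": action_history,
    }


# Placeholder bytes standing in for a real screenshot (PNG signature + filler).
fake_png = b"\x89PNG\r\n\x1a\n" + b"placeholder image bytes"

response = assemble_response(
    success=True,
    output="Searched Google for Python",
    screenshot_png=fake_png,
    action_history=[{"action": "navigate", "url": "https://google.com"}],
)
payload = json.dumps(response)
```

On the receiving side, the frontend can show the image without decoding it by using the string in a `data:image/png;base64,...` URL.
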
### Data Flow

The complete flow follows this pattern:

**User Input** → **Frontend JavaScript** → **HTTP POST Request** → **FastAPI Backend** → **LangChain Agent** → **Tool Selection** → **Playwright Browser Actions** → **Results Flow Back** → **Agent Reasoning** → **Screenshot Capture** → **Response Assembly** → **JSON Response** → **Frontend Display** → **User Views Results**

### Key Features

**Action History**: Every browser action is logged with details (action type, selectors, URLs, text entered, etc.). This provides full transparency of what the AI did.

**Screenshot Capture**: After task completion, a screenshot is taken and included in the response as a base64-encoded image, giving users visual confirmation of the results.

**Error Handling**: Errors are handled at every layer:
- Frontend catches network errors and displays user-friendly messages
- Backend validates requests and returns appropriate HTTP status codes
- Browser agent handles Playwright timeouts and element-not-found errors gracefully

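At the agent layer, one plausible reading of "gracefully" is that a failing tool returns an error message instead of raising, so the agent can read the message and try something else. The sketch below illustrates that idea only; the decorator name `report_errors` and the message format are hypothetical, and the built-in `TimeoutError` stands in for Playwright's own `TimeoutError` class, which real code would catch instead.

```python
import functools


def report_errors(tool_fn):
    """Turn exceptions from a tool into error strings the agent can read."""

    @functools.wraps(tool_fn)
    def wrapper(tool_input: str) -> str:
        try:
            return tool_fn(tool_input)
        except TimeoutError as exc:
            # Stand-in for a Playwright timeout, e.g. a selector never appeared.
            return f"Error: timed out: {exc}"
        except Exception as exc:
            return f"Error: {type(exc).__name__}: {exc}"

    return wrapper


@report_errors
def click(selector: str) -> str:
    # Simulated behavior: one selector is treated as missing from the page.
    if selector == "#missing":
        raise TimeoutError("waiting for selector #missing")
    return f"Clicked {selector}"


ok_result = click("#submit")
error_result = click("#missing")
```
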
**State Management**:
- Browser state persists between tasks (single browser instance)
- Frontend statistics persist in localStorage
- Action history accumulates throughout the session

**Modular Architecture**: Each layer is independent, making the system maintainable and extensible. New browser tools can be added by extending the BrowserController and creating corresponding tool wrappers.

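Adding a tool then comes down to pairing a name and description with a wrapper function. The registry below is a hypothetical sketch of that pairing, not the project's code: `ToolSpec`, `TOOLS`, `register_tool`, and the `go_back` tool are invented for illustration. In LangChain, each entry would typically become a `Tool(name=..., func=..., description=...)` object.

```python
from typing import Callable, Dict, NamedTuple


class ToolSpec(NamedTuple):
    description: str  # what the agent reads when choosing a tool
    run: Callable[[str], str]  # the synchronous wrapper the agent calls


TOOLS: Dict[str, ToolSpec] = {}


def register_tool(name: str, description: str):
    """Decorator that adds a wrapper function to the tool registry."""

    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = ToolSpec(description, fn)
        return fn

    return decorator


@register_tool("navigate", "Go to a URL. Input: the full URL.")
def navigate(url: str) -> str:
    return f"Navigated to {url}"  # real wrapper would call the controller


# A new, hypothetical tool: only this function needs to be added.
@register_tool("go_back", "Go back one page in the browser history. Input is ignored.")
def go_back(_: str) -> str:
    return "Went back one page"  # real wrapper would call the controller


result = TOOLS["go_back"].run("")
```
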
---

## Summary

Manus AI Clone transforms natural language instructions into browser automation through a carefully orchestrated pipeline:

1. Users provide natural language prompts through a web interface
2. The FastAPI backend receives and validates requests
3. A LangChain AI agent interprets the task and plans a sequence of actions
4. The agent executes browser tools through Playwright
5. Results are collected, including screenshots and action history
6. Everything is displayed back to the user in the frontend

The system demonstrates how AI reasoning can be combined with browser automation to interact with web pages much as a human would, but with the speed and consistency of automation.