# How It Works: Manus AI Clone - System Summary

## Overview

Manus AI Clone is an AI-powered browser automation system that lets users control a web browser with natural language prompts. It combines a web frontend, a FastAPI backend, a LangChain agent, and Playwright browser automation to interpret user intent and carry out multi-step browser tasks.

### Key Technologies
- **Frontend**: HTML5, CSS3, Vanilla JavaScript
- **Backend**: FastAPI (Python)
- **AI Framework**: LangChain
- **Browser Automation**: Playwright
- **LLM**: OpenAI GPT models (gpt-4o-mini, gpt-4o, etc.)

---

## System Architecture

The system follows a layered architecture:

1. **Frontend Layer** - User interface for input and results display
2. **Backend API Layer** - FastAPI server handling HTTP requests
3. **Browser Agent Layer** - LangChain agent that plans and executes tasks
4. **Browser Control Layer** - Playwright for browser automation

---

## How It Works

### Frontend Layer

The frontend provides a web-based user interface where users can:
- Enter natural language prompts describing browser tasks
- View example prompts for quick reference
- See real-time loading indicators during task execution
- View results including:
  - Success/error status
  - Agent output messages
  - Complete action history (all browser actions taken)
  - Screenshot of the final browser state
- Track execution statistics (total tasks, success rate, average time)

When a user submits a task, the frontend JavaScript:
1. Validates the input
2. Sends an HTTP POST request to the `/execute` endpoint with the prompt
3. Shows a loading indicator while waiting for the response
4. Displays all results in the UI once the response arrives
5. Updates statistics and shows notifications
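
The validation and request steps can be sketched from a client's point of view. This is a minimal sketch in Python rather than the frontend's own JavaScript, and it makes two assumptions that the text does not confirm: the server address (`http://localhost:8000`) and a JSON body with a `prompt` field. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumptions for illustration: the server address and the "prompt" field name.
BASE_URL = "http://localhost:8000"


def build_execute_request(prompt: str) -> urllib.request.Request:
    """Validate the prompt and build (but do not send) the POST /execute request."""
    cleaned = prompt.strip()
    if not cleaned:
        raise ValueError("prompt must not be empty")
    body = json.dumps({"prompt": cleaned}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request("Go to google.com and search for Python")
print(request.get_method(), request.full_url)
```

With the backend running, passing the request to `urllib.request.urlopen()` would send it and return the JSON response.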

Statistics are persisted in the browser's localStorage, so they survive page reloads.

### Backend API Layer

The FastAPI backend serves multiple purposes:

**API Endpoints**:
- `GET /` - Serves the frontend HTML interface
- `POST /execute` - Main endpoint that executes browser automation tasks
- `GET /status` - Returns current browser state and action history
- `GET /health` - Health check endpoint

**Lifecycle Management**:
- On startup, initializes a single `BrowserAgent` instance
- Loads configuration from environment variables (OpenAI API key, model selection, headless mode)
- Manages the browser agent's lifecycle (startup and shutdown)
- On shutdown, cleans up browser resources
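
Loading configuration from the environment might look roughly like the sketch below. The variable names `OPENAI_MODEL` and `HEADLESS`, the default model, and the `AgentConfig` class are assumptions made for illustration; the text only says that an API key, a model selection, and a headless flag are read from environment variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgentConfig:
    api_key: str
    model: str
    headless: bool


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read settings from environment variables (variable names are assumptions)."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # OPENAI_MODEL and HEADLESS are hypothetical names chosen for this sketch.
    model = env.get("OPENAI_MODEL", "gpt-4o-mini")
    headless = env.get("HEADLESS", "true").strip().lower() in {"1", "true", "yes"}
    return AgentConfig(api_key=api_key, model=model, headless=headless)


config = load_config({"OPENAI_API_KEY": "sk-example", "HEADLESS": "false"})
print(config.model, config.headless)
```

Failing fast on a missing key at startup keeps the error close to its cause instead of surfacing on the first task.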

**Request Processing**:
When a task execution request is received, the backend:
1. Validates the request payload
2. Checks that the browser agent is initialized
3. Calls the agent's `execute_task()` method with the user's prompt
4. Formats and returns the response with success status, output text, screenshot, and action history
5. Reports failures with an appropriate HTTP status code
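
The request-processing steps can be illustrated with a framework-independent sketch. It is not the project's actual route: `FakeAgent` is a stand-in, `execute_task()` is assumed to be a coroutine, the response field names are guesses based on the description above, and the status codes (422, 503, 500) are illustrative choices. In a FastAPI route the same checks would typically raise `HTTPException` instead of returning a tuple.

```python
import asyncio
from typing import Any, Optional


class FakeAgent:
    """Stand-in for the real BrowserAgent so the sketch runs without a browser."""

    async def execute_task(self, prompt: str) -> dict[str, Any]:
        return {"success": True, "output": f"Done: {prompt}",
                "screenshot": None, "action_history": []}


async def handle_execute(
    agent: Optional[FakeAgent], payload: dict[str, Any]
) -> tuple[int, dict[str, Any]]:
    """Return (HTTP status code, JSON body) for a task execution request."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return 422, {"detail": "prompt must be a non-empty string"}
    if agent is None:
        return 503, {"detail": "browser agent is not initialized"}
    try:
        result = await agent.execute_task(prompt)
    except Exception as exc:  # report unexpected failures as a server error
        return 500, {"detail": str(exc)}
    return 200, {
        "success": result["success"],
        "output": result["output"],
        "screenshot": result["screenshot"],
        "action_history": result["action_history"],
    }


status, body = asyncio.run(handle_execute(FakeAgent(), {"prompt": "Open example.com"}))
print(status, body["output"])
```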

### Browser Agent Layer

The Browser Agent consists of two main components:

#### BrowserController (Low-Level Playwright Wrapper)

This component provides direct access to Playwright browser operations. It handles:
- Browser initialization (launching Chromium, creating context and page)
- Navigation to URLs
- Clicking elements by CSS selectors
- Typing text into input fields
- Extracting text from page elements
- Getting page content (title, URL, visible text)
- Taking screenshots
- Executing JavaScript on the page
- Finding and inspecting elements
- Scrolling the page

Every action is logged to an action history for transparency and debugging.
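
One simple way to keep such a history is a list of dictionaries, one per action. The sketch below is an assumption about the shape of the log, not the controller's real code; the `ActionLog` class, the `record` method, and the timestamp field are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any


class ActionLog:
    """Minimal action history: one dict per browser action, in call order."""

    def __init__(self) -> None:
        self.history: list[dict[str, Any]] = []

    def record(self, action: str, **details: Any) -> dict[str, Any]:
        entry = {
            "action": action,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **details,
        }
        self.history.append(entry)
        return entry


log = ActionLog()
log.record("navigate", url="https://example.com")
log.record("click", selector="button#submit")
print([entry["action"] for entry in log.history])
```

Because every entry is a plain dictionary, the history can be returned in a JSON response without extra conversion.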

#### BrowserAgent (High-Level LangChain Agent)

This component uses LangChain to create an AI agent that can:
- Understand natural language prompts
- Break down complex tasks into steps
- Select appropriate tools for each step
- Execute tools in a logical sequence
- Reason about results and adjust actions accordingly
- Verify task completion

The agent has access to 8 tools that correspond to browser operations:
1. **navigate** - Go to URLs
2. **click** - Click elements by CSS selector
3. **type_text** - Fill input fields (uses format: "selector|text")
4. **get_text** - Extract text from specific elements
5. **get_page_content** - Read current page content
6. **scroll** - Scroll page in different directions
7. **get_elements_info** - Find and inspect elements
8. **execute_javascript** - Run custom JavaScript

Each tool has a detailed description that helps the AI agent understand when and how to use it. The agent uses these descriptions to select the right tool for each task.
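
The `"selector|text"` input format of `type_text` implies a small parsing step inside the tool. A minimal sketch follows, assuming the split happens at the first `|` so that the typed text may itself contain `|`; the function name is hypothetical, and a selector that contains `|` would not survive this scheme.

```python
def parse_type_text_input(tool_input: str) -> tuple[str, str]:
    """Split "selector|text" at the first "|" into (selector, text)."""
    selector, separator, text = tool_input.partition("|")
    if not separator or not selector.strip():
        raise ValueError('expected input in the form "selector|text"')
    return selector.strip(), text


selector, text = parse_type_text_input("input[name='q']|Python")
print(selector, "<-", text)
```

Packing both arguments into one string keeps the tool compatible with agents that pass a single string input per tool call.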

**System Prompt**: The agent is given comprehensive instructions on how to approach tasks, when to use each tool, how to verify actions, and how to write CSS selectors.

**Async/Sync Bridge**: Because the LangChain tools here are synchronous functions but the Playwright operations are async, wrapper functions use `asyncio.run()` to bridge the gap.
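
The bridge can be shown with a stub coroutine in place of a real Playwright call. The names `navigate_async` and `navigate_tool` are hypothetical; only the use of `asyncio.run()` comes from the description above.

```python
import asyncio


async def navigate_async(url: str) -> str:
    """Stand-in for an async Playwright operation (no browser is launched here)."""
    await asyncio.sleep(0)  # yields to the event loop like real I/O would
    return f"Navigated to {url}"


def navigate_tool(url: str) -> str:
    """Synchronous wrapper a tool could expose: runs the coroutine to completion."""
    return asyncio.run(navigate_async(url))


print(navigate_tool("https://example.com"))
```

Note that `asyncio.run()` starts a fresh event loop on each call and raises `RuntimeError` if the calling thread already has a running loop, so wrappers like this have to be invoked from synchronous code.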

### Task Execution Flow

When a user submits a task like "Go to google.com and search for Python":

1. **Frontend** sends the prompt to the backend API
2. **Backend** receives the request and calls the agent
3. **Agent** analyzes the prompt and breaks it down:
   - Navigate to google.com
   - Understand the page structure
   - Find the search input field
   - Type "Python" into the search field
   - Click the search button
   - Verify the results
4. **Agent** selects and executes tools in sequence:
   - Uses `navigate()` to go to Google
   - Uses `get_page_content()` to understand the page
   - Uses `get_elements_info()` to find the search input
   - Uses `type_text()` to enter the search query
   - Uses `click()` to submit the search
   - Uses `get_page_content()` again to verify success
5. **Playwright** performs each browser action through the BrowserController
6. **Results** flow back to the agent after each tool execution
7. **Agent** reasons about the results and determines when the task is complete
8. **Screenshot** is captured of the final browser state
9. **Response** is assembled with success status, output message, base64-encoded screenshot, and action history
10. **Frontend** displays all results to the user
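
The tool sequence in step 4 can be mimicked without an LLM or a browser. In the sketch below a hard-coded plan stands in for the agent's decisions and stub tools only record their calls; the selectors and the `make_tool` helper are made up for illustration, and a real agent would choose each next step from the previous observation rather than follow a fixed list.

```python
from typing import Callable

history: list[str] = []


def make_tool(name: str) -> Callable[[str], str]:
    """Build a stub tool that records its call instead of driving a browser."""
    def tool(tool_input: str) -> str:
        history.append(f"{name}({tool_input})")
        return f"{name} ok"
    return tool


TOOLS = {
    name: make_tool(name)
    for name in ["navigate", "get_page_content", "get_elements_info", "type_text", "click"]
}

# Hard-coded plan standing in for the LLM's step-by-step tool choices.
PLAN = [
    ("navigate", "https://google.com"),
    ("get_page_content", ""),
    ("get_elements_info", "input"),
    ("type_text", "textarea[name='q']|Python"),
    ("click", "input[type='submit']"),
    ("get_page_content", ""),
]

for tool_name, tool_input in PLAN:
    observation = TOOLS[tool_name](tool_input)  # the real agent would reason over this

print(len(history), history[0])
```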

### Data Flow

The complete flow follows this pattern:

**User Input** → **Frontend JavaScript** → **HTTP POST Request** → **FastAPI Backend** → **LangChain Agent** → **Tool Selection** → **Playwright Browser Actions** → **Results Flow Back** → **Agent Reasoning** → **Screenshot Capture** → **Response Assembly** → **JSON Response** → **Frontend Display** → **User Views Results**

### Key Features

**Action History**: Every browser action is logged with details (action type, selectors, URLs, text entered, etc.). This provides full transparency of what the AI did.

**Screenshot Capture**: After task completion, a screenshot is taken and included in the response as a base64-encoded image, giving users visual confirmation of the results.
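
Base64 encoding is what lets binary image data travel inside a JSON response. The sketch below uses placeholder bytes instead of a real screenshot and assumes PNG output; the helper names are hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def encode_screenshot(image_bytes: bytes) -> str:
    """Return the image as base64 text that can be embedded in a JSON response."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_data_uri(encoded: str) -> str:
    """Build a data URI a frontend could assign to an <img> element's src."""
    return f"data:image/png;base64,{encoded}"


fake_screenshot = PNG_SIGNATURE + b"placeholder image data"
encoded = encode_screenshot(fake_screenshot)
assert base64.b64decode(encoded) == fake_screenshot  # round-trips losslessly
print(to_data_uri(encoded)[:30])
```

Base64 text is about one third larger than the raw bytes, which is the price of returning the image inline rather than from a separate URL.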

**Error Handling**: Errors are handled at every layer:
- Frontend catches network errors and displays user-friendly messages
- Backend validates requests and returns appropriate HTTP status codes
- Browser agent handles Playwright timeouts and element not found errors gracefully
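
At the agent layer, one common pattern is to turn exceptions into text so the agent can read the failure and try another approach. The sketch below is a guess at that pattern, not the project's code: the built-in `TimeoutError` stands in for Playwright's timeout error, and `ElementNotFoundError` and `safe_tool` are hypothetical names.

```python
from typing import Callable


class ElementNotFoundError(Exception):
    """Hypothetical error for a selector that matches nothing."""


def safe_tool(action: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a tool so failures come back as text the agent can reason about."""
    def wrapper(tool_input: str) -> str:
        try:
            return action(tool_input)
        except TimeoutError:
            return f"Error: timed out while handling '{tool_input}'"
        except ElementNotFoundError as exc:
            return f"Error: element not found: {exc}"
    return wrapper


@safe_tool
def click(selector: str) -> str:
    if selector == "#missing":
        raise ElementNotFoundError(selector)
    return f"Clicked {selector}"


print(click("#ok"))
print(click("#missing"))
```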

**State Management**:
- Browser state persists between tasks (single browser instance)
- Frontend statistics persist in localStorage
- Action history accumulates throughout the session

**Modular Architecture**: Each layer is independent, making the system maintainable and extensible. New browser tools can be added by extending the BrowserController and creating corresponding tool wrappers.
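
Adding a tool therefore involves two pieces: a new controller method and a wrapper that exposes it to the agent. The sketch below illustrates this with a hypothetical `go_back` operation on a stand-in controller class; the dictionary at the end only shows the kind of metadata (name, callable, description) an agent framework needs, not LangChain's actual registration call.

```python
import asyncio


class BrowserControllerSketch:
    """Hypothetical stand-in for BrowserController; no real browser is used."""

    def __init__(self) -> None:
        self.action_history: list[dict[str, str]] = []

    async def go_back(self) -> str:
        """New low-level operation (illustrative): step back in history."""
        self.action_history.append({"action": "go_back"})
        return "Went back one page"


controller = BrowserControllerSketch()


def go_back_tool(_: str = "") -> str:
    """Matching synchronous tool wrapper for the agent."""
    return asyncio.run(controller.go_back())


# Name, callable, and description are what an agent needs to pick the tool.
NEW_TOOL = {
    "name": "go_back",
    "func": go_back_tool,
    "description": "Go back to the previous page. Takes no input.",
}

print(NEW_TOOL["func"](""))
```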

---

## Summary

Manus AI Clone transforms natural language instructions into browser automation through a pipeline of cooperating layers:

1. Users provide natural language prompts through a web interface
2. The FastAPI backend receives and validates requests
3. A LangChain AI agent interprets the task and plans a sequence of actions
4. The agent executes browser tools through Playwright
5. Results are collected, including screenshots and action history
6. Everything is displayed back to the user in the frontend

The system demonstrates how LLM-driven reasoning can be combined with browser automation to interact with web pages much as a person would, with the speed and repeatability of automation.