# Stock Intelligence System - Setup & Usage Guide

## 📋 Overview

This system automatically collects comprehensive data on publicly traded stocks from the TSX, TSXV, CSE, and CBOE exchanges **without requiring any API keys**. All data is collected via web scraping.

## 🎯 What This System Does

For each stock on the target exchanges, it:

1. **Extracts company listings** with ticker, name, sector, industry
2. **Scrapes financial data** from Yahoo Finance (3 years + TTM)
3. **Collects news articles** from Google News (last 12 months)
4. **Gathers press releases** from GlobeNewswire, Newswire.ca, etc.
5. **Stores everything** in a SQLite database
6. **Generates text reports** for each stock

## 🚀 Getting Started

### Step 1: Install Dependencies

```bash
# Change to the project directory (adjust this path for your machine)
cd /Users/macbook/Desktop/Victor

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser
python3 -m playwright install chromium
```

### Step 2: Test the Setup

```bash
# Run a quick test on a few stocks
python test_extraction.py
```

This will:
- Extract CSE listings
- Show you what data was captured
- Save HTML for debugging if needed

### Step 3: Run the Full Pipeline

```bash
# Test mode (5 stocks for testing)
python main.py

# Full pipeline (all stocks - takes hours!)
python main.py --full
```

## 📊 Output Structure

After running, you'll have:

```
data/
├── listings/
│   ├── tsx_tsxv_listings.json       # TSX/TSXV stocks
│   ├── cse_listings.json            # CSE stocks
│   ├── cboe_listings.json           # CBOE stocks
│   └── all_listings_combined.json   # All stocks combined
│
├── financials/
│   ├── TICKER1_yahoo.json           # Financial data per stock
│   ├── TICKER2_yahoo.json
│   └── ...
│
├── news/
│   ├── TICKER1_news_pr.json         # News & press releases
│   ├── TICKER2_news_pr.json
│   └── ...
│
├── reports/
│   ├── TICKER1_report.txt           # Human-readable report
│   ├── TICKER2_report.txt
│   └── ...
│
└── stocks.db                        # SQLite database with all data
```

## 🔧 Individual Components

You can run each step separately:

### 1. Extract Listings Only

```bash
python extract_listings.py
```

This will:
- Open a browser window (non-headless for debugging)
- Navigate to each exchange
- Extract all stock listings
- Save to `data/listings/`

### 2. Import to Database

```bash
python database.py
```

This will:
- Create the SQLite database schema
- Import listings from JSON files
- Show statistics

### 3. Scrape Financials

```bash
python scrape_yahoo_finance.py
```

This will:
- Load stocks from listings
- Scrape Yahoo Finance for each
- Save financial data to JSON
- Take about 2-3 seconds per stock

### 4. Scrape News & Press Releases

```bash
python scrape_news_pr.py
```

This will:
- Search Google News for each stock
- Search press release sites
- Save all articles/releases
- Take about 10-15 seconds per stock

## 📝 Understanding the Data

### Stock Listings Format

```json
{
  "symbol": "ABC",
  "name": "ABC Company Inc.",
  "exchange": "TSX",
  "sector": "Technology",
  "industry": "Software",
  "extracted_at": "2025-11-06T10:30:00"
}
```

### Financial Data Format

```json
{
  "ticker": "ABC",
  "yahoo_ticker": "ABC.TO",
  "profile": {
    "current_price": 25.50
  },
  "statistics": {
    "market_cap": "500M",
    "pe_ratio": "15.2",
    "forward_eps": "1.68"
  },
  "financials": {
    "total_revenue": ["1.2B", "1.1B", "1.0B"],
    "net_income": ["120M", "110M", "100M"]
  }
}
```

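Most values in the financial data example are stored as abbreviated strings such as `"500M"` or `"1.2B"`, so they need converting before any arithmetic. The following is a minimal sketch of one way to do that. The helper name `parse_abbreviated` and the suffix set (K, M, B, T) are assumptions for illustration and are not part of the project's scripts.

```python
from typing import Optional

# Multipliers for the suffixes assumed to appear in the scraped strings.
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated(value: str) -> Optional[float]:
    """Convert strings like "1.2B" or "15.2" to a float; return None if unparseable."""
    text = value.strip().replace(",", "").upper()
    if not text:
        return None
    multiplier = 1.0
    if text[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1]]
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        # Placeholders such as "N/A" end up here.
        return None


# Convert the revenue list from the example record above.
revenue = [parse_abbreviated(v) for v in ["1.2B", "1.1B", "1.0B"]]
```

Returning `None` for unparseable input lets callers skip missing figures instead of failing partway through a batch.
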
### News Data Format

```json
{
  "ticker": "ABC",
  "news_articles": [
    {
      "title": "ABC Company Reports Strong Earnings",
      "source": "Financial Post",
      "date": "2 days ago",
      "url": "https://...",
      "snippet": "..."
    }
  ],
  "press_releases": [
    {
      "title": "ABC Announces New Product Line",
      "source": "GlobeNewswire",
      "date": "Nov 1, 2025",
      "url": "https://..."
    }
  ]
}
```

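The `date` values in the news example mix relative text (`"2 days ago"`) with absolute dates (`"Nov 1, 2025"`), which makes sorting or filtering by date awkward. Below is a hedged sketch of a normalizer that handles only those two shapes. The function name `normalize_date` is hypothetical, the set of relative units is an assumption, and the month parsing relies on an English locale.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Matches relative strings such as "2 days ago" or "1 week ago" (assumed units).
_RELATIVE = re.compile(r"^(\d+)\s+(minute|hour|day|week)s?\s+ago$", re.IGNORECASE)


def normalize_date(raw: str, today: date) -> Optional[date]:
    """Return a calendar date for the two formats shown above, else None."""
    text = raw.strip()
    match = _RELATIVE.match(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2).lower()
        if unit in ("minute", "hour"):
            return today  # sub-day offsets are treated as "today" for simplicity
        days = amount * (7 if unit == "week" else 1)
        return today - timedelta(days=days)
    try:
        # Absolute format, e.g. "Nov 1, 2025".
        return datetime.strptime(text, "%b %d, %Y").date()
    except ValueError:
        return None


scraped_on = date(2025, 11, 6)
article_date = normalize_date("2 days ago", scraped_on)
release_date = normalize_date("Nov 1, 2025", scraped_on)
```

Relative dates only make sense against the day the page was scraped, so the reference date is passed in explicitly rather than read from the clock.
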
## 🗄️ Database Schema

The SQLite database (`data/stocks.db`) contains:

### Main Tables
- **stocks_master** - All stock information
- **financial_statements** - Income statement, balance sheet, cash flow
- **financial_metrics** - Calculated ratios (P/E, ROE, etc.)
- **news_articles** - News from various sources
- **press_releases** - Official company releases
- **filings** - Regulatory filings (SEDAR+, SEC)
- **agm_info** - Annual general meeting details
- **tax_disclosures** - Tax-related information
- **coverage_report** - Tracks data completeness per stock

### Query Examples

```python
import sqlite3

# Connect to database
conn = sqlite3.connect('data/stocks.db')
cursor = conn.cursor()

# Get all TSX stocks
cursor.execute("SELECT symbol, company_name FROM stocks_master WHERE exchange='TSX'")
stocks = cursor.fetchall()

# Get stocks with complete data
cursor.execute("""
    SELECT ticker, exchange
    FROM coverage_report
    WHERE has_financials=1 AND has_news=1
""")
complete_stocks = cursor.fetchall()

# Get all news for a specific stock (the symbol is passed as a bound parameter)
cursor.execute("""
    SELECT title, source, published_date
    FROM news_articles
    WHERE stock_id = (SELECT id FROM stocks_master WHERE symbol = ?)
""", ("ABC",))
news = cursor.fetchall()

# Close the connection when done
conn.close()
```

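If you prefer rows you can access by column name, `sqlite3.Row` is one option. The sketch below runs against an in-memory database with a simplified `stocks_master` table, so the schema here is an assumption based only on the columns used in the queries above. The helper `stocks_on_exchange` is hypothetical.

```python
import sqlite3


def stocks_on_exchange(conn: sqlite3.Connection, exchange: str) -> list:
    """Return stocks on one exchange as a list of dicts keyed by column name."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT symbol, company_name FROM stocks_master "
        "WHERE exchange = ? ORDER BY symbol",
        (exchange,),
    ).fetchall()
    return [dict(row) for row in rows]


# Demo on an in-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stocks_master ("
    "id INTEGER PRIMARY KEY, symbol TEXT, company_name TEXT, exchange TEXT)"
)
conn.executemany(
    "INSERT INTO stocks_master (symbol, company_name, exchange) VALUES (?, ?, ?)",
    [("ABC", "ABC Company Inc.", "TSX"), ("XYZ", "XYZ Mining Corp.", "CSE")],
)
tsx_stocks = stocks_on_exchange(conn, "TSX")
conn.close()
```

To use it with the real data, pass a connection opened on `data/stocks.db` instead of the in-memory one.
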
## ⚠️ Important Considerations

### Rate Limiting
- Scripts pause between requests (2-5 seconds)
- The pauses keep the load on the source sites low
- A full run over all stocks may take several hours

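The 2-5 second pause between requests can be pictured with a small generator that sleeps for a random interval before each fetch after the first. This is only a sketch of the idea and not the code used in the project's scripts. The name `throttled` and its parameters are assumptions.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def throttled(
    items: Iterable[str],
    fetch: Callable[[str], object],
    min_delay: float = 2.0,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[object]:
    """Call fetch(item) for each item, pausing a random interval between calls."""
    for index, item in enumerate(items):
        if index > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        yield fetch(item)


# Demo with a stand-in fetch function and a recorded (not real) sleep.
waits = []
pages = list(throttled(["ABC", "XYZ"], lambda t: f"page for {t}", sleep=waits.append))
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting. In real use, leave it at the default `time.sleep`.
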
### Data Quality
- Not all stocks have Yahoo Finance data (especially microcaps)
- News availability varies by stock
- Some press releases may be behind paywalls

### Troubleshooting

**Problem: No listings extracted**
- Check `data/listings/*_page.html` files
- Websites may have changed their structure
- May need to update selectors in `extract_listings.py`

**Problem: Yahoo Finance returns errors**
- Stock may not be listed on Yahoo
- Try different ticker suffix (.TO vs .V for Canadian stocks)
- Check `data/financials/*_yahoo.json` for error messages

**Problem: News scraping fails**
- Google may rate-limit searches
- Increase delays in `scrape_news_pr.py`
- Consider running in smaller batches

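Running in smaller batches, as suggested for news scraping failures, can be as simple as slicing the ticker list into fixed-size chunks and processing one chunk per run. A minimal sketch follows. The helper name `batches` and the batch size are hypothetical, and how a batch is handed to `scrape_news_pr.py` is left open because that script's interface is not described here.

```python
from typing import Iterator, List, Sequence


def batches(tickers: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` tickers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(tickers), size):
        yield list(tickers[start:start + size])


# Split five placeholder tickers into batches of two.
ticker_batches = list(batches(["AAA", "BBB", "CCC", "DDD", "EEE"], 2))
```
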
## 🔄 Automation & Scheduling

To run this automatically on a schedule:

### Using cron (macOS/Linux)

```bash
# Edit crontab
crontab -e

# Run every week on Sunday at 2 AM
0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1
```

### Manual Scheduling

```bash
# Create a shell script
cat > run_scraper.sh << 'EOF'
#!/bin/bash
cd /Users/macbook/Desktop/Victor
python3 main.py --full
EOF

chmod +x run_scraper.sh

# Run it whenever needed
./run_scraper.sh
```

## 📈 Next Steps

After collecting the data, you can:

1. **Analyze** - Use pandas/numpy to analyze financial trends
2. **Visualize** - Create charts with matplotlib/plotly
3. **Screen** - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
4. **Monitor** - Track specific stocks over time
5. **Export** - Generate Excel reports, CSV exports, etc.

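As an illustration of the screening step, the sketch below filters records shaped like the financial data format by P/E ratio. It assumes each record is a dict with a `ticker` and a `statistics.pe_ratio` string, as in that example. The function name `screen_by_pe` and the sample records are hypothetical.

```python
from typing import Iterable, List


def screen_by_pe(records: Iterable[dict], max_pe: float) -> List[str]:
    """Return tickers whose P/E ratio is positive and at most max_pe."""
    selected = []
    for record in records:
        raw = (record.get("statistics") or {}).get("pe_ratio")
        try:
            pe = float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric P/E, e.g. "N/A"
        if 0 < pe <= max_pe:
            selected.append(record["ticker"])
    return selected


# Made-up sample records for illustration only.
sample = [
    {"ticker": "ABC", "statistics": {"pe_ratio": "15.2"}},
    {"ticker": "XYZ", "statistics": {"pe_ratio": "N/A"}},
    {"ticker": "LMN", "statistics": {"pe_ratio": "42.0"}},
]
cheap = screen_by_pe(sample, max_pe=20)
```
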
## 🐛 Known Issues

1. **Dynamic content** - Some exchange sites may change their layouts
2. **Ticker formats** - Canadian stocks have different suffixes (.TO, .V, .CN)
3. **Rate limits** - Google/Yahoo may temporarily block requests if scraping is too aggressive
4. **JavaScript rendering** - Sites may block headless browsers

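The ticker-format issue comes down to mapping an exchange to the suffix Yahoo Finance expects. The sketch below assumes the usual pairing of TSX with `.TO`, TSXV with `.V`, and CSE with `.CN`. CBOE listings are deliberately left out because this guide does not state their suffix. The function name `to_yahoo_ticker` is hypothetical, and replacing dots with hyphens inside the symbol (for share classes such as `RCI.B`) is an additional assumption about Yahoo's convention.

```python
# Assumed exchange-to-suffix mapping; extend it once other suffixes are confirmed.
YAHOO_SUFFIX = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN"}


def to_yahoo_ticker(symbol: str, exchange: str) -> str:
    """Build a Yahoo Finance ticker from an exchange symbol, e.g. ABC on TSX -> ABC.TO."""
    suffix = YAHOO_SUFFIX.get(exchange.strip().upper())
    if suffix is None:
        raise ValueError(f"No Yahoo suffix known for exchange {exchange!r}")
    # Assumption: Yahoo writes share-class dots as hyphens (RCI.B -> RCI-B).
    base = symbol.strip().upper().replace(".", "-")
    return f"{base}{suffix}"


yahoo_ticker = to_yahoo_ticker("ABC", "TSX")
```

Raising an error for unknown exchanges makes unsupported listings visible early instead of producing a ticker that silently fails on Yahoo.
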
## 💡 Tips

- Start with test mode to verify everything works
- Run during off-peak hours to avoid rate limits
- Check the `coverage_report` table to see data completeness
- Save HTML files for debugging when extraction fails
- Use database queries for efficient data analysis

## 📞 Support

Check these files for more info:
- `PROGRESS.md` - Current implementation status
- `README.md` - Original technical plan
- Individual script files have detailed comments

---

**Last Updated:** November 6, 2025