Stock Intelligence System - Setup & Usage Guide
📋 Overview
This system automatically collects data on publicly traded stocks from the TSX, TSXV, CSE, and CBOE exchanges. It requires no API keys: all data is collected via web scraping.
🎯 What This System Does
For each stock on the target exchanges, it:
- Extracts company listings with ticker, name, sector, industry
- Scrapes financial data from Yahoo Finance (3 years + TTM)
- Collects news articles from Google News (last 12 months)
- Gathers press releases from GlobeNewswire, Newswire.ca, etc.
- Stores everything in a SQLite database
- Generates text reports for each stock
🚀 Getting Started
Step 1: Install Dependencies
cd /Users/macbook/Desktop/Victor
# Install Python packages
pip install -r requirements.txt
# Install Playwright browser
python3 -m playwright install chromium
Step 2: Test the Setup
# Run a quick test on a few stocks
python test_extraction.py
This will:
- Extract CSE listings
- Show you what data was captured
- Save HTML for debugging if needed
Step 3: Run the Full Pipeline
# Test mode (5 stocks for testing)
python main.py
# Full pipeline (all stocks - takes hours!)
python main.py --full
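The switch between test mode and the full run could be handled along these lines. This is a minimal sketch, not the actual contents of main.py: only the `--full` flag and the 5-stock test sample come from this guide, while `parse_args`, `select_stocks`, and `TEST_MODE_LIMIT` are hypothetical names.

```python
import argparse

# The guide states that test mode covers 5 stocks; the constant name is an assumption.
TEST_MODE_LIMIT = 5


def parse_args(argv=None):
    """Parse the command line; --full switches from the test sample to all stocks."""
    parser = argparse.ArgumentParser(description="Stock intelligence pipeline (sketch)")
    parser.add_argument(
        "--full",
        action="store_true",
        help="process every stock instead of a small test sample",
    )
    return parser.parse_args(argv)


def select_stocks(listings, full):
    """Return every listing in full mode, otherwise only the first few."""
    listings = list(listings)
    return listings if full else listings[:TEST_MODE_LIMIT]
```

Calling `select_stocks(all_listings, parse_args().full)` would then give the set of stocks to process in either mode.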
📊 Output Structure
After running, you'll have:
data/
├── listings/
│ ├── tsx_tsxv_listings.json # TSX/TSXV stocks
│ ├── cse_listings.json # CSE stocks
│ ├── cboe_listings.json # CBOE stocks
│ └── all_listings_combined.json # All stocks combined
│
├── financials/
│ ├── TICKER1_yahoo.json # Financial data per stock
│ ├── TICKER2_yahoo.json
│ └── ...
│
├── news/
│ ├── TICKER1_news_pr.json # News & press releases
│ ├── TICKER2_news_pr.json
│ └── ...
│
├── reports/
│ ├── TICKER1_report.txt # Human-readable report
│ ├── TICKER2_report.txt
│ └── ...
│
└── stocks.db # SQLite database with all data
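A quick way to confirm that a run produced this layout is to count the files in each folder. The sketch below assumes the directory names shown above; `summarize_output` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def summarize_output(data_dir="data"):
    """Count files in each output folder and report whether stocks.db exists."""
    root = Path(data_dir)
    summary = {}
    for sub in ("listings", "financials", "news", "reports"):
        folder = root / sub
        if folder.is_dir():
            summary[sub] = sum(1 for entry in folder.iterdir() if entry.is_file())
        else:
            summary[sub] = 0  # folder missing: that pipeline step has not run yet
    summary["stocks.db"] = (root / "stocks.db").is_file()
    return summary
```

A folder count of 0 usually means the corresponding step has not been run or failed early.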
🔧 Individual Components
You can run each step separately:
1. Extract Listings Only
python extract_listings.py
This will:
- Open a browser window (non-headless for debugging)
- Navigate to each exchange
- Extract all stock listings
- Save to data/listings/
2. Import to Database
python database.py
This will:
- Create the SQLite database schema
- Import listings from JSON files
- Show statistics
3. Scrape Financials
python scrape_yahoo_finance.py
This will:
- Load stocks from listings
- Scrape Yahoo Finance for each
- Save financial data to JSON
- Takes ~2-3 seconds per stock
4. Scrape News & Press Releases
python scrape_news_pr.py
This will:
- Search Google News for each stock
- Search press release sites
- Save all articles/releases
- Takes ~10-15 seconds per stock
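The per-stock timings above add up quickly, which is why the full pipeline takes hours. A rough estimate can be computed as follows; `estimate_hours` is a hypothetical helper, and the stock count in the usage note is an illustrative figure, not the real number of listings.

```python
def estimate_hours(stock_count, seconds_per_stock):
    """Return (low, high) runtime in hours for a (low, high) per-stock time in seconds."""
    low, high = seconds_per_stock
    return (stock_count * low / 3600, stock_count * high / 3600)


# Financials take ~2-3 s per stock and news ~10-15 s, so both steps together
# cost roughly 12-18 s per stock.
COMBINED_SECONDS_PER_STOCK = (2 + 10, 3 + 15)
```

For a hypothetical 3,000 stocks, `estimate_hours(3000, COMBINED_SECONDS_PER_STOCK)` gives a range of 10 to 15 hours for the two scraping steps combined.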
📝 Understanding the Data
Stock Listings Format
{
"symbol": "ABC",
"name": "ABC Company Inc.",
"exchange": "TSX",
"sector": "Technology",
"industry": "Software",
"extracted_at": "2025-11-06T10:30:00"
}
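When listings fail to import, it can help to check each record against this format first. The sketch below assumes that all six fields shown above are required and that extracted_at is an ISO 8601 timestamp; `validate_listing` is a hypothetical helper.

```python
from datetime import datetime

# Assumption: every field in the example record is mandatory.
REQUIRED_FIELDS = ("symbol", "name", "exchange", "sector", "industry", "extracted_at")


def validate_listing(record):
    """Return a list of problems found in one listing record (empty if it looks fine)."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in record]
    if "extracted_at" in record:
        try:
            datetime.fromisoformat(record["extracted_at"])
        except (TypeError, ValueError):
            problems.append("extracted_at is not an ISO 8601 timestamp")
    return problems
```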
Financial Data Format
{
"ticker": "ABC",
"yahoo_ticker": "ABC.TO",
"profile": {
"current_price": 25.50
},
"statistics": {
"market_cap": "500M",
"pe_ratio": "15.2",
"forward_eps": "1.68"
},
"financials": {
"total_revenue": ["1.2B", "1.1B", "1.0B"],
"net_income": ["120M", "110M", "100M"]
}
}
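Values such as "500M" and "1.2B" are stored as strings, so they need converting before any numeric analysis. One possible conversion is sketched below; `parse_abbreviated` and the suffix table are assumptions made for illustration, not part of the scraper.

```python
# Assumed suffix meanings: thousand, million, billion, trillion.
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated(value):
    """Convert strings such as '1.2B' or '500M' to floats; return None if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip().replace(",", "")
    if not text:
        return None
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1].upper())
    number = text[:-1] if multiplier else text
    try:
        return float(number) * (multiplier or 1.0)
    except ValueError:
        return None
```

Applied to the example above, `[parse_abbreviated(v) for v in data["financials"]["total_revenue"]]` would turn the revenue list into plain numbers that can be compared year over year.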
News Data Format
{
"ticker": "ABC",
"news_articles": [
{
"title": "ABC Company Reports Strong Earnings",
"source": "Financial Post",
"date": "2 days ago",
"url": "https://...",
"snippet": "..."
}
],
"press_releases": [
{
"title": "ABC Announces New Product Line",
"source": "GlobeNewswire",
"date": "Nov 1, 2025",
"url": "https://..."
}
]
}
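Note that the date field mixes relative values ("2 days ago") with absolute ones ("Nov 1, 2025"). To sort or filter articles by date, both forms can be normalized to ISO dates. The sketch below covers only these two formats and assumes an English locale for month names; `normalize_date` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

_UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}


def normalize_date(text, now=None):
    """Return an ISO date string (YYYY-MM-DD), or None if the format is not recognized."""
    now = now or datetime.now()
    cleaned = text.strip()
    match = re.fullmatch(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", cleaned.lower())
    if match:
        seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        return (now - timedelta(seconds=seconds)).date().isoformat()
    try:
        # Absolute form such as "Nov 1, 2025".
        return datetime.strptime(cleaned, "%b %d, %Y").date().isoformat()
    except ValueError:
        return None
```

Relative dates are resolved against the current time, so they are best normalized at scrape time rather than later.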
🗄️ Database Schema
The SQLite database (data/stocks.db) contains:
Main Tables
- stocks_master - All stock information
- financial_statements - Income statement, balance sheet, cash flow
- financial_metrics - Calculated ratios (P/E, ROE, etc.)
- news_articles - News from various sources
- press_releases - Official company releases
- filings - Regulatory filings (SEDAR+, SEC)
- agm_info - Annual general meeting details
- tax_disclosures - Tax-related information
- coverage_report - Tracks data completeness per stock
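To see which of these tables actually exist in a given database and how many rows each holds, SQLite's built-in sqlite_master catalog can be queried. This is a small sketch; `table_row_counts` is a hypothetical helper, and it makes no assumptions about column names.

```python
import sqlite3


def table_row_counts(db_path="data/stocks.db"):
    """Return {table_name: row_count} for every user table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from the catalog itself, so quoting them is sufficient here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```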
Query Examples
import sqlite3
# Connect to database
conn = sqlite3.connect('data/stocks.db')
cursor = conn.cursor()
# Get all TSX stocks
cursor.execute("SELECT symbol, company_name FROM stocks_master WHERE exchange='TSX'")
stocks = cursor.fetchall()
# Get stocks with complete data
cursor.execute("""
SELECT ticker, exchange
FROM coverage_report
WHERE has_financials=1 AND has_news=1
""")
complete_stocks = cursor.fetchall()
# Get all news for a specific stock (the symbol is passed as a query parameter)
cursor.execute("""
SELECT title, source, published_date
FROM news_articles
WHERE stock_id = (SELECT id FROM stocks_master WHERE symbol = ?)
""", ("ABC",))
news = cursor.fetchall()
# Close the connection when finished
conn.close()
⚠️ Important Considerations
Rate Limiting
- Scripts include delays between requests (2-5 seconds)
- Respectful of server resources
- May take several hours for full dataset
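A delay in the 2-5 second range can be implemented with a small helper like the one below. Randomizing the delay within that range is an assumption made for this sketch; the scripts may use fixed pauses instead. `polite_sleep` is a hypothetical name.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # the sleep function is injectable so tests need not wait
    return delay
```

Calling `polite_sleep()` after each request keeps the request rate within the range described above.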
Data Quality
- Not all stocks have Yahoo Finance data (especially microcaps)
- News availability varies by stock
- Some press releases may be behind paywalls
Troubleshooting
Problem: No listings extracted
- Check the data/listings/*_page.html files
- Websites may have changed their structure
- May need to update selectors in extract_listings.py
Problem: Yahoo Finance returns errors
- Stock may not be listed on Yahoo
- Try a different ticker suffix (.TO vs .V for Canadian stocks)
- Check data/financials/*_yahoo.json for error messages
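Trying alternative suffixes can be automated by generating candidate Yahoo tickers in order of likelihood. The sketch below uses the three suffixes mentioned in this guide (.TO, .V, .CN), mapped to TSX, TSXV, and CSE respectively; no suffix is assumed for CBOE listings, and `candidate_yahoo_tickers` is a hypothetical helper.

```python
# Suffix per exchange; CBOE is deliberately left out because this guide names no suffix for it.
YAHOO_SUFFIXES = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN"}


def candidate_yahoo_tickers(symbol, exchange):
    """Return Yahoo ticker candidates, with the exchange's usual suffix first."""
    preferred = YAHOO_SUFFIXES.get(exchange.upper())
    others = [suffix for suffix in YAHOO_SUFFIXES.values() if suffix != preferred]
    ordered = ([preferred] if preferred else []) + others
    return [symbol + suffix for suffix in ordered]
```

A scraper could request each candidate in turn and stop at the first one that returns data.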
Problem: News scraping fails
- Google may rate-limit searches
- Increase delays in scrape_news_pr.py
- Consider running in smaller batches
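Running in smaller batches can be as simple as splitting the ticker list into fixed-size chunks and pausing between them. A minimal sketch follows; `batched` is a hypothetical helper, and a suitable batch size would need to be found by experiment.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, `for batch in batched(tickers, 50): ...` would process 50 stocks at a time, leaving room for a longer pause after each batch.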
🔄 Automation & Scheduling
To run this automatically on a schedule:
Using cron (macOS/Linux)
# Edit crontab
crontab -e
# Run every week on Sunday at 2 AM
0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1
Manual Scheduling
# Create a shell script
cat > run_scraper.sh << 'EOF'
#!/bin/bash
cd /Users/macbook/Desktop/Victor
python3 main.py --full
EOF
chmod +x run_scraper.sh
# Run it whenever needed
./run_scraper.sh
📈 Next Steps
After collecting the data, you can:
- Analyze - Use pandas/numpy to analyze financial trends
- Visualize - Create charts with matplotlib/plotly
- Screen - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
- Monitor - Track specific stocks over time
- Export - Generate Excel reports, CSV exports, etc.
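As one example of the export step, any query result can be written to a CSV file with the standard library alone. This is a sketch under the assumption that the database path and table names match those described earlier; `export_query_to_csv` is a hypothetical helper.

```python
import csv
import sqlite3


def export_query_to_csv(db_path, query, csv_path, params=()):
    """Run a query and write its rows, with a header line, to csv_path."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query, params)
        headers = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)  # number of data rows written
```

For instance, `export_query_to_csv("data/stocks.db", "SELECT symbol, company_name FROM stocks_master WHERE exchange = ?", "tsx.csv", ("TSX",))` would export the TSX listings, and the resulting file opens directly in Excel or pandas.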
🐛 Known Issues
- Dynamic content - Some exchange sites may change their layouts
- Ticker formats - Canadian stocks have different suffixes (.TO, .V, .CN)
- Rate limits - Google/Yahoo may temporarily block requests if scraping is too aggressive
- JavaScript rendering - Sites may block headless browsers
💡 Tips
- Start with test mode to verify everything works
- Run during off-peak hours to avoid rate limits
- Check the coverage_report table to see data completeness
- Save HTML files for debugging when extraction fails
- Use database queries for efficient data analysis
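The completeness check mentioned in the tips can be condensed into a single summary query. The sketch below assumes that coverage_report has the has_financials and has_news columns used in the query examples, stored as 0/1 flags; `coverage_summary` is a hypothetical helper.

```python
import sqlite3


def coverage_summary(db_path="data/stocks.db"):
    """Return counts of stocks with financials, with news, and with both."""
    conn = sqlite3.connect(db_path)
    try:
        total, with_financials, with_news, complete = conn.execute(
            """
            SELECT COUNT(*),
                   COALESCE(SUM(has_financials = 1), 0),
                   COALESCE(SUM(has_news = 1), 0),
                   COALESCE(SUM(has_financials = 1 AND has_news = 1), 0)
            FROM coverage_report
            """
        ).fetchone()
    finally:
        conn.close()
    return {
        "total": total,
        "with_financials": with_financials,
        "with_news": with_news,
        "complete": complete,
    }
```

Comparing "complete" with "total" after a run gives a quick sense of how much data is still missing.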
📞 Support
Check these files for more info:
- PROGRESS.md - Current implementation status
- README.md - Original technical plan
- Individual script files have detailed comments
Last Updated: November 6, 2025