# Stock Intelligence System - Setup & Usage Guide

## 📋 Overview

This system automatically collects comprehensive data on publicly traded stocks from TSX, TSXV, CSE, and CBOE exchanges **without requiring any API keys**. All data is collected via web scraping.

## 🎯 What This System Does

For each stock on the target exchanges, it:

1. **Extracts company listings** with ticker, name, sector, industry
2. **Scrapes financial data** from Yahoo Finance (3 years + TTM)
3. **Collects news articles** from Google News (last 12 months)
4. **Gathers press releases** from GlobeNewswire, Newswire.ca, etc.
5. **Stores everything** in a SQLite database
6. **Generates text reports** for each stock

## 🚀 Getting Started

### Step 1: Install Dependencies

```bash
cd /Users/macbook/Desktop/Victor

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser
python3 -m playwright install chromium
```

### Step 2: Test the Setup

```bash
# Run a quick test on a few stocks
python test_extraction.py
```

This will:
- Extract CSE listings
- Show you what data was captured
- Save HTML for debugging if needed

### Step 3: Run the Full Pipeline

```bash
# Test mode (5 stocks for testing)
python main.py

# Full pipeline (all stocks - takes hours!)
python main.py --full
```

## 📊 Output Structure

After running, you'll have:

```
data/
├── listings/
│   ├── tsx_tsxv_listings.json        # TSX/TSXV stocks
│   ├── cse_listings.json             # CSE stocks
│   ├── cboe_listings.json            # CBOE stocks
│   └── all_listings_combined.json    # All stocks combined
│
├── financials/
│   ├── TICKER1_yahoo.json            # Financial data per stock
│   ├── TICKER2_yahoo.json
│   └── ...
│
├── news/
│   ├── TICKER1_news_pr.json          # News & press releases
│   ├── TICKER2_news_pr.json
│   └── ...
│
├── reports/
│   ├── TICKER1_report.txt            # Human-readable report
│   ├── TICKER2_report.txt
│   └── ...
│
└── stocks.db                         # SQLite database with all data
```

## 🔧 Individual Components

You can run each step separately:

### 1. Extract Listings Only

```bash
python extract_listings.py
```

This will:
- Open a browser window (non-headless for debugging)
- Navigate to each exchange
- Extract all stock listings
- Save to `data/listings/`

### 2. Import to Database

```bash
python database.py
```

This will:
- Create the SQLite database schema
- Import listings from JSON files
- Show statistics

### 3. Scrape Financials

```bash
python scrape_yahoo_finance.py
```

This will:
- Load stocks from listings
- Scrape Yahoo Finance for each
- Save financial data to JSON
- Takes ~2-3 seconds per stock

### 4. Scrape News & Press Releases

```bash
python scrape_news_pr.py
```

This will:
- Search Google News for each stock
- Search press release sites
- Save all articles/releases
- Takes ~10-15 seconds per stock

## 📝 Understanding the Data

### Stock Listings Format

```json
{
  "symbol": "ABC",
  "name": "ABC Company Inc.",
  "exchange": "TSX",
  "sector": "Technology",
  "industry": "Software",
  "extracted_at": "2025-11-06T10:30:00"
}
```

### Financial Data Format

```json
{
  "ticker": "ABC",
  "yahoo_ticker": "ABC.TO",
  "profile": {
    "current_price": 25.50
  },
  "statistics": {
    "market_cap": "500M",
    "pe_ratio": "15.2",
    "forward_eps": "1.68"
  },
  "financials": {
    "total_revenue": ["1.2B", "1.1B", "1.0B"],
    "net_income": ["120M", "110M", "100M"]
  }
}
```

### News Data Format

```json
{
  "ticker": "ABC",
  "news_articles": [
    {
      "title": "ABC Company Reports Strong Earnings",
      "source": "Financial Post",
      "date": "2 days ago",
      "url": "https://...",
      "snippet": "..."
    }
  ],
  "press_releases": [
    {
      "title": "ABC Announces New Product Line",
      "source": "GlobeNewswire",
      "date": "Nov 1, 2025",
      "url": "https://..."
    }
  ]
}
```

## 🗄️ Database Schema

The SQLite database (`data/stocks.db`) contains:

### Main Tables

- **stocks_master** - All stock information
- **financial_statements** - Income statement, balance sheet, cash flow
- **financial_metrics** - Calculated ratios (P/E, ROE, etc.)
- **news_articles** - News from various sources
- **press_releases** - Official company releases
- **filings** - Regulatory filings (SEDAR+, SEC)
- **agm_info** - Annual general meeting details
- **tax_disclosures** - Tax-related information
- **coverage_report** - Tracks data completeness per stock

### Query Examples

```python
import sqlite3

# Connect to database
conn = sqlite3.connect('data/stocks.db')
cursor = conn.cursor()

# Get all TSX stocks
cursor.execute("SELECT symbol, company_name FROM stocks_master WHERE exchange='TSX'")
stocks = cursor.fetchall()

# Get stocks with complete data
cursor.execute("""
    SELECT ticker, exchange
    FROM coverage_report
    WHERE has_financials=1 AND has_news=1
""")
complete_stocks = cursor.fetchall()

# Get all news for a specific stock
cursor.execute("""
    SELECT title, source, published_date
    FROM news_articles
    WHERE stock_id = (SELECT id FROM stocks_master WHERE symbol='ABC')
""")
news = cursor.fetchall()
```

## ⚠️ Important Considerations

### Rate Limiting

- Scripts include delays between requests (2-5 seconds)
- Respectful of server resources
- May take several hours for full dataset

### Data Quality

- Not all stocks have Yahoo Finance data (especially microcaps)
- News availability varies by stock
- Some press releases may be behind paywalls

### Troubleshooting

**Problem: No listings extracted**
- Check `data/listings/*_page.html` files
- Websites may have changed their structure
- May need to update selectors in `extract_listings.py`

**Problem: Yahoo Finance returns errors**
- Stock may not be listed on Yahoo
- Try different ticker suffix (.TO vs .V for Canadian stocks)
- Check `data/financials/*_yahoo.json` for error messages

**Problem: News scraping fails**
- Google may rate-limit searches
- Increase delays in `scrape_news_pr.py`
- Consider running in smaller batches

## 🔄 Automation & Scheduling

To run this automatically on a schedule:

### Using cron (macOS/Linux)

```bash
# Edit crontab
crontab -e

# Run every week on Sunday at 2 AM
0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1
```

### Manual Scheduling

```bash
# Create a shell script
cat > run_scraper.sh << 'EOF'
#!/bin/bash
cd /Users/macbook/Desktop/Victor
python3 main.py --full
EOF

chmod +x run_scraper.sh

# Run it whenever needed
./run_scraper.sh
```

## 📈 Next Steps

After collecting the data, you can:

1. **Analyze** - Use pandas/numpy to analyze financial trends
2. **Visualize** - Create charts with matplotlib/plotly
3. **Screen** - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
4. **Monitor** - Track specific stocks over time
5. **Export** - Generate Excel reports, CSV exports, etc.

## 🐛 Known Issues

1. **Dynamic content** - Some exchange sites may change their layouts
2. **Ticker formats** - Canadian stocks have different suffixes (.TO, .V, .CN)
3. **Rate limits** - Google/Yahoo may temporarily block if too aggressive
4. **JavaScript rendering** - Sites may block headless browsers

## 💡 Tips

- Start with test mode to verify everything works
- Run during off-peak hours to avoid rate limits
- Check coverage_report table to see data completeness
- Save HTML files for debugging when extraction fails
- Use database queries for efficient data analysis

## 📞 Support

Check these files for more info:
- `PROGRESS.md` - Current implementation status
- `README.md` - Original technical plan
- Individual script files have detailed comments

---

**Last Updated:** November 6, 2025
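
## 🧪 Example: Screening the Collected Data

The "Screen" idea under Next Steps (filtering by P/E ratio and revenue growth) could be sketched roughly as below. This is an illustration only, not part of the shipped scripts: the helper names `parse_amount`, `revenue_growth`, `passes_screen`, and `screen_directory` are hypothetical, the thresholds are arbitrary, and the code assumes each `data/financials/*_yahoo.json` file follows the Financial Data Format shown earlier, with `total_revenue` ordered newest first. The `K` and `T` suffixes are also an assumption, since the sample data only shows `M` and `B`.

```python
import json
from pathlib import Path

# Multipliers for abbreviated magnitudes such as "500M" or "1.2B".
# "K" and "T" are assumed for completeness; the guide only shows "M" and "B".
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_amount(text):
    """Convert a string such as "1.2B" or "15.2" to a float; None if unparseable."""
    if text is None:
        return None
    cleaned = str(text).strip().replace(",", "")
    if not cleaned:
        return None
    multiplier = SUFFIXES.get(cleaned[-1].upper())
    number = cleaned[:-1] if multiplier else cleaned
    try:
        return float(number) * (multiplier or 1.0)
    except ValueError:
        return None


def revenue_growth(record):
    """Latest year-over-year revenue growth as a fraction, or None if data is missing.

    Assumes "total_revenue" lists the most recent year first.
    """
    raw = record.get("financials", {}).get("total_revenue", [])
    revenue = [parse_amount(value) for value in raw]
    if len(revenue) < 2 or revenue[0] is None or not revenue[1]:
        return None
    return revenue[0] / revenue[1] - 1.0


def passes_screen(record, max_pe=20.0, min_growth=0.05):
    """True if the record has a positive P/E at or below max_pe and enough growth."""
    pe = parse_amount(record.get("statistics", {}).get("pe_ratio"))
    growth = revenue_growth(record)
    if pe is None or growth is None:
        return False
    return 0 < pe <= max_pe and growth >= min_growth


def screen_directory(directory="data/financials"):
    """Return tickers of all *_yahoo.json records that pass the screen."""
    matches = []
    for path in sorted(Path(directory).glob("*_yahoo.json")):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        if passes_screen(record):
            matches.append(record.get("ticker", path.stem))
    return matches


# The sample record from the Financial Data Format section.
sample = {
    "ticker": "ABC",
    "statistics": {"market_cap": "500M", "pe_ratio": "15.2"},
    "financials": {"total_revenue": ["1.2B", "1.1B", "1.0B"]},
}
print(passes_screen(sample))  # True: P/E 15.2, revenue growth of about 9%
```

Records with missing or unparseable values (for example, microcaps without Yahoo Finance data) simply fail the screen rather than raising an error, which seems the safer default given the data-quality caveats above. Calling `screen_directory()` after a pipeline run would apply the same check to every saved file.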