Capstone: Stock Price Prediction

Pascual Vila
Data Science Instructor // Code Syllabus
Predicting financial markets is notoriously difficult: prices are noisy, and the Efficient Market Hypothesis holds that publicly available information is already reflected in them. Even so, robust time series forecasting models can help us identify trends, model volatility, and manage financial risk.
Feature Engineering is King
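Raw stock prices are highly non-stationary: their mean and variance drift over time. A tree-based model like XGBoost cannot extrapolate beyond the range of values it saw during training, so it struggles when prices move to new levels. To fix this, you must engineer features that describe the state of the market rather than the absolute price.
We use concepts like Simple Moving Averages (SMA) to smooth out daily noise, Relative Strength Index (RSI) for momentum, and Log Returns to turn price levels into a more stable series of relative changes.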
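As a minimal sketch of these three features, assuming a pandas Series of daily closing prices: the function name `add_features`, the window lengths, and the sample prices are illustrative choices, not part of the lesson. The RSI here uses plain rolling averages; charting packages often use Wilder's smoothing instead, so values may differ slightly.

```python
import numpy as np
import pandas as pd


def add_features(close: pd.Series, sma_window: int = 20, rsi_window: int = 14) -> pd.DataFrame:
    """Build SMA, log-return, and RSI features from closing prices (hypothetical helper)."""
    out = pd.DataFrame({"close": close})

    # Simple Moving Average: mean of the last `sma_window` closes.
    out["sma"] = close.rolling(sma_window).mean()

    # Log return: ln(P_t / P_{t-1}).
    out["log_ret"] = np.log(close / close.shift(1))

    # RSI, simple-average variant: 100 - 100 / (1 + avg_gain / avg_loss).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(rsi_window).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)

    return out


# Made-up prices, purely for illustration.
prices = pd.Series([10.0, 11.0, 10.0, 12.0, 11.0, 13.0])
features = add_features(prices, sma_window=3, rsi_window=3)
```

Every feature at row t uses only prices up to and including t. The first rows are NaN until each rolling window fills, so drop them before training.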
The Danger of Look-Ahead Bias
The most common mistake in algorithmic trading models is data leakage, and look-ahead bias is the form it takes in time series. Scikit-Learn's `train_test_split` shuffles rows by default (`shuffle=True`), so if you leave it on, your model learns from Wednesday to predict Tuesday. This is impossible in reality.
Always split chronologically. For example, train your model on data from 2018-2022 and test it strictly on 2023-2024 to simulate real-world, forward-looking predictions.
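A chronological split can be sketched with plain pandas date slicing. The DataFrame below is a hypothetical stand-in filled with random numbers; only the slicing pattern matters.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table indexed by business day; values are random placeholders.
dates = pd.bdate_range("2018-01-01", "2024-12-31")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"feature": rng.normal(size=len(dates)), "target": rng.normal(size=len(dates))},
    index=dates,
)

# Split by date, never by random sampling.
train = df.loc["2018":"2022"]
test = df.loc["2023":"2024"]

# Sanity check: every training row precedes every test row.
assert train.index.max() < test.index.min()
```

If you prefer to stay with `train_test_split`, passing `shuffle=False` keeps the rows in order and takes the test set from the end of the data.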
Evaluation Metrics
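Classification accuracy does not apply when the target is continuous, such as a price or a return. Instead we use metrics like RMSE (Root Mean Squared Error), which squares each error before averaging and therefore penalizes large deviations heavily. This mimics real financial risk, where a single massive miss costs capital.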
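To see why RMSE punishes large misses, it helps to compare it with MAE (Mean Absolute Error) on two sets of made-up forecasts with the same total absolute error. The helper names `rmse` and `mae` and the numbers are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: sqrt(mean((y_true - y_pred)^2))."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean Absolute Error: mean(|y_true - y_pred|)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


actual = [0.0, 0.0, 0.0, 0.0]
steady_misses = [1.0, 1.0, 1.0, 1.0]  # four small errors
one_big_miss = [0.0, 0.0, 0.0, 4.0]   # one large error, same total

print(mae(actual, steady_misses), rmse(actual, steady_misses))  # 1.0 1.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))    # 1.0 2.0
```

Both forecasts have the same MAE, but the forecast with a single large miss has twice the RMSE.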
❓ Forecasting FAQ
Can AI accurately predict the stock market?
No model can predict exact future prices, in part because of unforeseen macroeconomic events (Black Swans). However, time series models like LSTMs or Prophet can sometimes identify statistical edges, volatility ranges, and trends that may give a trading strategy positive expected value (EV).
ARIMA vs. Prophet vs. XGBoost?
ARIMA: Best for univariate series with linear structure. Requires the series to be stationary after differencing (the "I" in ARIMA).
Prophet: Excellent at handling multiple seasonalities and holidays. Very robust to missing data.
XGBoost: Highly performant when you have many engineered features (like SMAs or macroeconomic indicators). Requires converting the time series into a supervised learning format, where each row pairs past values with the value to predict.
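The supervised learning format mentioned for XGBoost can be sketched with lagged columns: each row holds the previous `n_lags` values as inputs and the current value as the target. The helper `make_supervised` and the sample series are hypothetical; a real pipeline would add engineered features alongside the lags.

```python
import pandas as pd


def make_supervised(series: pd.Series, n_lags: int = 3) -> pd.DataFrame:
    """Turn a univariate series into rows of (lag_1..lag_n, y) (hypothetical helper)."""
    frame = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        # lag_k at time t holds the value observed at time t - k.
        frame[f"lag_{lag}"] = series.shift(lag)
    # The first n_lags rows have no complete history, so drop them.
    return frame.dropna()


# Made-up values, purely for illustration.
returns = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
table = make_supervised(returns, n_lags=2)
X, y = table.drop(columns="y"), table["y"]
```

`X` and `y` can then be passed to any regressor with a `fit(X, y)` interface. Because every lag column looks strictly backward, this layout does not leak future information into the inputs.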