Capstone: Stock Price Prediction

Pascual Vila

Data Science Instructor // Code Syllabus

Predicting financial markets is notoriously difficult: price series are noisy, and the Efficient Market Hypothesis holds that prices already reflect publicly available information, leaving little predictable signal. Even so, robust time series forecasting models can help us identify trends, model volatility, and manage financial risk.

Feature Engineering is King

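Raw stock prices are highly non-stationary: their mean and variance drift over time. A tree-based model like XGBoost cannot extrapolate beyond the range of target values it saw during training, so it struggles once prices move to new levels. To fix this, engineer features that describe the state of the market rather than the raw price level.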

We use Simple Moving Averages (SMA) to smooth out daily noise, the Relative Strength Index (RSI) to measure momentum, and log returns to put price changes on a comparable scale and obtain a series that is much closer to stationary than raw prices.
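As a rough illustration of these three features, here is a minimal pandas sketch. The function name `add_features`, the column names, and the window lengths (20 and 14) are assumptions made for this example, and the RSI shown is the simple-average variant rather than Wilder's exponentially smoothed version.

```python
import numpy as np
import pandas as pd


def add_features(close: pd.Series, sma_window: int = 20, rsi_window: int = 14) -> pd.DataFrame:
    """Build SMA, log-return, and RSI columns from a series of closing prices.

    Hypothetical helper for illustration; window sizes are common defaults,
    not values taken from the lesson.
    """
    out = pd.DataFrame({"close": close})

    # Simple moving average: unweighted mean of the last `sma_window` closes.
    out["sma"] = close.rolling(sma_window).mean()

    # Log return: ln(P_t / P_{t-1}).
    out["log_return"] = np.log(close / close.shift(1))

    # RSI (simple-average variant): 100 - 100 / (1 + avg_gain / avg_loss).
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(rsi_window).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out


# Synthetic oscillating prices, used only to exercise the function.
close = pd.Series(100 + 5 * np.sin(np.arange(60) / 3.0))
features = add_features(close, sma_window=5, rsi_window=14)
print(features.tail())
```

The first rows of each column are `NaN` because a rolling window needs a full history before it can produce a value; those rows are usually dropped before training.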

The Danger of Look-Ahead Bias

The most common mistake in algorithmic trading models is data leakage. Scikit-Learn's `train_test_split` shuffles rows by default (`shuffle=True`), so if you leave that default in place, your model learns from Wednesday to predict Tuesday. This is impossible in reality.

Always split chronologically. Train your model on data from 2018-2022, and test it strictly on 2023-2024 to simulate real-world forward-facing predictions.
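A chronological split can be sketched with plain pandas date filtering. The helper name `chronological_split` and the synthetic price frame below are assumptions for illustration; the 2023 cutoff mirrors the example years above.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, test_start: str):
    """Split a date-indexed frame into train (before test_start) and test (from test_start on).

    Hypothetical helper for illustration.
    """
    df = df.sort_index()
    cutoff = pd.Timestamp(test_start)
    train = df[df.index < cutoff]
    test = df[df.index >= cutoff]
    return train, test


# Synthetic business-day prices covering 2018-2024 (made-up data, not real quotes).
dates = pd.bdate_range("2018-01-01", "2024-12-31")
prices = pd.DataFrame({"close": np.linspace(100.0, 200.0, len(dates))}, index=dates)

train, test = chronological_split(prices, test_start="2023-01-01")
print(train.index.max(), test.index.min())
```

If you prefer to stay with `train_test_split`, passing `shuffle=False` keeps the rows in order, so the test set is the final portion of the data.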

Evaluation Metrics

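Classification accuracy does not apply to continuous targets such as prices or returns, because a prediction is almost never exactly right. Instead we use metrics like RMSE (Root Mean Squared Error), which squares each error before averaging and therefore penalizes large deviations heavily, mimicking real financial risk where a massive miss costs capital.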
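To see the penalty on large misses concretely, the sketch below compares RMSE with MAE (Mean Absolute Error, introduced here only as a point of comparison) on two made-up forecasts whose total absolute error is identical.

```python
import math


def rmse(y_true, y_pred):
    """Root of the mean of squared errors."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean of absolute errors, shown for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative numbers, not real market data.
actual = [100.0, 100.0, 100.0, 100.0]
steady_misses = [102.0, 98.0, 102.0, 98.0]   # four errors of size 2
one_big_miss = [100.0, 100.0, 100.0, 108.0]  # a single error of size 8

print(mae(actual, steady_misses), rmse(actual, steady_misses))  # 2.0 2.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))    # 2.0 4.0
```

Both forecasts have the same MAE, but the one with a single large miss has twice the RMSE.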

❓ Forecasting FAQ

Can AI accurately predict the stock market?

No model can predict exact future prices, in part because of unforeseen macroeconomic events (Black Swans). However, time series models like LSTMs or Prophet can help estimate trends and volatility ranges, and in some cases may surface small statistical edges that give a trading strategy positive expected value (EV).

ARIMA vs. Prophet vs. XGBoost?

ARIMA: Best for univariate series with linear structure. The series must be stationary after differencing (the "I" term, order d).
Prophet: Excellent at handling multiple seasonalities and holidays. Very robust to missing data.
XGBoost: Highly performant when you have many engineered features (like SMAs, macroeconomic indicators). Requires converting time series into a supervised learning format.
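The "supervised learning format" mentioned for XGBoost usually means one row per time step, with past values as input columns and the current value as the target. A minimal sketch, assuming a hypothetical helper `make_supervised` and lag-column names chosen for this example:

```python
import pandas as pd


def make_supervised(series: pd.Series, n_lags: int = 3) -> pd.DataFrame:
    """Turn a series into rows of (target, lag_1, ..., lag_n).

    Hypothetical helper for illustration. lag_k holds the value from
    k steps earlier, so every feature is strictly older than the target.
    """
    frame = pd.DataFrame({"target": series})
    for k in range(1, n_lags + 1):
        frame[f"lag_{k}"] = series.shift(k)
    # The first n_lags rows have incomplete history and are dropped.
    return frame.dropna()


values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
table = make_supervised(values, n_lags=2)
print(table)
```

The lag columns become the feature matrix and `target` becomes the label; engineered features such as SMAs can be added as extra columns as long as they are also built only from past values.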

Forecasting Glossary

Stationarity

A time series whose statistical properties such as mean, variance, and autocorrelation are all constant over time.
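One informal way to see the definition at work is to compare the mean of the first and second half of a series. The sketch below uses a synthetic trending series (an assumption for illustration, not market data) and shows that the levels drift while their first differences do not. It is an eyeball check, not a formal test; a unit-root test such as `adfuller` from statsmodels is the more rigorous option.

```python
import numpy as np


def half_means(x: np.ndarray):
    """Return the mean of the first half and of the second half of x."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()


rng = np.random.default_rng(0)
t = np.arange(1000)

# Linear trend plus noise: the mean keeps rising, so the levels are non-stationary.
levels = 100 + 0.5 * t + rng.normal(0.0, 1.0, size=t.size)

# First differences remove the linear trend.
changes = np.diff(levels)

first, second = half_means(levels)
level_gap = abs(first - second)

first, second = half_means(changes)
change_gap = abs(first - second)

print(f"gap in level means:  {level_gap:.2f}")
print(f"gap in change means: {change_gap:.4f}")
```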


Look-Ahead Bias

An error in modeling where information or data is used that would not have been known or available during the period being analyzed.
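Look-ahead bias can also creep in through features, not only through the train/test split. In the sketch below (toy prices, assumed for illustration), a rolling mean used to predict the close at row t includes that very close unless the series is shifted first.

```python
import pandas as pd

# Toy closing prices, made up for this example.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Leaky: the window ending at row t includes close[t], the value being predicted.
leaky_sma = close.rolling(3).mean()

# Safe: shift first, so the window at row t covers only close[t-3] .. close[t-1].
safe_sma = close.shift(1).rolling(3).mean()

print(pd.DataFrame({"close": close, "leaky_sma": leaky_sma, "safe_sma": safe_sma}))
```

A quick sanity check for any feature is to change the value at row t and confirm that the feature at row t does not move.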


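RMSE (Root Mean Squared Error)

A standard way to measure the error of a model that predicts quantitative values: $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.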


Rolling Mean (SMA)

An unweighted mean of the previous n data points, heavily used to identify trends while smoothing out short-term fluctuations.

