
 Predicting profitable stocks in the short term based on market sentiment

In this project, I built an ML system that downloads data from several sources, trains a model, and deploys it on new data. The objective of this model is to predict which stocks' closing prices will increase by 5% or more in the next week.

Problem Description :

Predicting stock prices has always been a challenge. Many factors beyond a company's own performance can determine its stock price. What then becomes essential is narrowing down exactly what we want to achieve. For at-home 'traders' such as myself, the goal is often short-term gains.

So, we quantify that : "Build a classification model that can tell us which stocks' prices will have increased by 5% or more by the end of next week."

Getting the data :

Since our objective is restricted to the short term, we focus on gathering information from market sentiment rather than the performance of the company. After researching market sentiment, and after various trials and errors, we decided on categorizing the required information in the following ways :

 

  • Stock Tickers : These are the unique symbols that identify publicly traded companies.

    Source : We scrape 'https://en.wikipedia.org/wiki/Russell_1000_Index' for the constituents of the Russell 1000 index, roughly the 1,000 largest publicly traded US companies.

     

  • Moving averages : These are calculated over the last 200, 100, 50, 20 and 10 days. The general idea is that if the recent moving averages are higher than those over the last 200 or 100 days, then there is positive sentiment associated with the stock.


    Source : The yfinance library gives us historical closing prices, from which we calculate the averages for all the stock tickers.

      Future idea : Use an exponential moving average instead of a simple moving average.
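As a sketch of this step (with synthetic prices standing in for a real yfinance download, and illustrative feature names), the moving averages can be computed with pandas rolling windows:

```python
import numpy as np
import pandas as pd

# Synthetic closing prices standing in for yf.download(ticker)["Close"].
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.standard_normal(250).cumsum())

# Simple moving averages over the windows used in the project.
windows = [200, 100, 50, 20, 10]
sma = {f"sma_{w}": close.rolling(window=w).mean() for w in windows}

# A short-term average sitting above the long-term average reads as
# positive sentiment for the stock.
latest = {name: series.iloc[-1] for name, series in sma.items()}
```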


  • Sentiment Analysis : We use the SentimentIntensityAnalyzer from Python's nltk.sentiment.vader to score the sentiment associated with headlines about publicly traded companies.

    Source : We use the polygon.io API that has historical article headlines available for most companies.

 

  • Trading Volume metrics : Changes in volumes traded are also a good indicator of interest in a particular stock. To quantify this we use two metrics.

 

RSI (Relative Strength Index):
It’s a popular momentum oscillator that measures the speed and change of price movements. RSI is typically calculated over a period of 14 days, although this can be adjusted based on trading preferences. 


Reference : https://medium.com/@farrago_course0f/using-python-and-rsi-to-generate-trading-signals-a56a684fb1

Source :
The data is collected from yfinance and the RSI is calculated from it.
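A sketch of the RSI calculation on synthetic prices; note this uses a plain rolling mean for the average gain and loss, whereas Wilder's original formulation uses exponential smoothing, so values can differ slightly:

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index from a Series of closing prices."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Synthetic closing prices standing in for a yfinance download.
rng = np.random.default_rng(1)
close = pd.Series(100 + rng.standard_normal(60).cumsum())
latest_rsi = rsi(close).iloc[-1]
```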


Changes in volume traded : We compare the volume traded this week against the volume traded over the last month, which tells us whether trading volume for that stock has increased or decreased.


Source : The polygon.io API provides historical traded-volume data.
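A sketch of this metric, assuming roughly 5 trading days per week and 21 per month (the volume figures here are made up: one month of flat volume, then a doubled final week):

```python
import pandas as pd

# Hypothetical daily traded volumes: 16 days at 1M shares, then 5 days at 2M.
volume = pd.Series([1_000_000.0] * 16 + [2_000_000.0] * 5)

week_mean = volume.iloc[-5:].mean()    # last trading week
month_mean = volume.iloc[-21:].mean()  # last trading month
volume_change = (week_mean - month_mean) / month_mean
# Positive values indicate increased interest in the stock this week.
```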


Future Ideas : Use open-interest options data to better gauge sentiment.


Pre-processing data for model training :


It is important that we extract the right information from the data collected above to create appropriate features. So we apply the following steps :


Feature Scaling :


  • When we calculate moving averages for the last 200, 100, 50, 20 and 10 days, their values depend on the underlying stock price of each company, which can vary widely across the board. Example :

[Screenshot : distribution of the raw moving-average features]

As you can see, the range is extensive, from 0 to almost 4,000. A feature with such a wide range and such outliers can impact our model.

To resolve this we create new features that are moving averages relative to the current stock price.


[Screenshots : relative moving-average feature code and the resulting distribution]

Now our distribution falls within the region of -1 to 1.
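The scaling step can be sketched like this (synthetic prices; the feature names are illustrative, not the project's actual column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
close = pd.Series(100 + rng.standard_normal(250).cumsum())

# Express each moving average relative to the latest closing price so the
# feature is comparable across tickers regardless of their price level.
features = {}
for w in [200, 100, 50, 20, 10]:
    sma = close.rolling(w).mean().iloc[-1]
    features[f"sma_{w}_rel"] = (sma - close.iloc[-1]) / close.iloc[-1]
```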

  • We use the same principle for the volume-change feature, taking the relative percent difference between last week's mean volume traded and last month's mean volume traded.

[Screenshots : volume-change feature code and the resulting distribution]

This too gives a good relative range.

  • We also encode our RSI signal so that it gives us categorical values of -1, 0 and 1.

[Screenshot : RSI signal encoding]
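One common encoding, assuming the usual 30/70 oversold/overbought cut-offs (the exact thresholds and sign convention used in the project may differ):

```python
def rsi_signal(rsi_value: float) -> int:
    """Map an RSI reading to a categorical trading signal."""
    if rsi_value < 30:
        return 1    # oversold: price may rebound
    if rsi_value > 70:
        return -1   # overbought: price may fall
    return 0        # neutral
```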
  • Finally, we use the SentimentIntensityAnalyzer to score the sentiment of each company's news headlines, taking the median compound score, which ranges between -1 and 1.

[Screenshot : sentiment scoring]

Overall Data ingestion

[Screenshot : overall data-ingestion pipeline]

Training the model

  • Given all the features created, we finally define the core of our problem : the label.

[Screenshot : label definition]
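A sketch of how such a label could be defined, assuming 5 trading days in a week (the function and parameter names are illustrative):

```python
import pandas as pd

def make_label(close: pd.Series, horizon: int = 5,
               threshold: float = 0.05) -> pd.Series:
    """1 if the close `horizon` trading days ahead is at least
    `threshold` above today's close, else 0."""
    future_return = close.shift(-horizon) / close - 1
    return (future_return >= threshold).astype(int)

# A 6% gain five days out earns the first row a positive label.
labels = make_label(pd.Series([100.0, 100, 100, 100, 100, 106]))
```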
  • After data pre-processing, our training set looks like this :

[Screenshot : training set sample]
  • Given the data we have, it made the most sense to use a Random Forest classification model, for the following reasons.

    • Multiple decision trees are used and the final prediction is a majority vote across them, reducing the chances of over-fitting.

    • They are effective in capturing complex, non-linear interactions between features.

    • Unlike algorithms such as logistic regression, random forests do not assume a linear relationship between features and the outcome.
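A minimal sketch of the training step with scikit-learn, using synthetic data in place of the engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic feature matrix standing in for the pre-processed training set.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```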

Hyperparameter tuning

  • Since we are experimenting, we train the model in two different ways : one without any hyperparameter tuning, and one with parameters optimized for the best precision score.

[Screenshot : hyperparameter search]
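One way to run such a search is scikit-learn's GridSearchCV with scoring="precision"; the grid below is hypothetical and smaller than what a real run would use:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search grid; the project's actual grid may differ.
param_grid = {"n_estimators": [100, 200], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="precision",  # optimize for precision, as in the project
    cv=3,
)
search.fit(X, y)
best = search.best_params_
```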

The following are the results of testing the two models.

Random Forest Generic

[Screenshots : evaluation metrics for the generic Random Forest]

Random Forest Hyper Parameterized for best precision score

[Screenshots : evaluation metrics for the precision-tuned Random Forest]

The two results are almost the same, with the hyperparameter-tuned Random Forest giving us slightly better precision.

In the final part of the project, we apply this model to new data a week later, with the same features, to see how our model would perform in practice.

[Screenshot : results on new data]

The model performed relatively well with a 93% precision. 

[Screenshot : predicted-probability distribution and threshold]

The above graph shows how we scored the probabilities and where we decided to set the threshold. If we move our threshold to 0.6, we stand to have fewer false positives.
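The effect of the threshold move can be sketched like this (the probabilities are made up; in practice they would come from the classifier's predict_proba):

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class.
proba = np.array([0.30, 0.55, 0.62, 0.71, 0.45])

preds_default = (proba >= 0.5).astype(int)  # default 0.5 threshold
preds_strict = (proba >= 0.6).astype(int)   # stricter 0.6 threshold

# Raising the threshold can only drop positives,
# trading recall for precision.
```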

Next steps :


  • Add a time-series CNN predictor

  • Productionize the entire project by moving it to the cloud
