Stock Market Data/Price Prediction Script Python

Posted: Wed Feb 15, 2023 10:34 pm

SiDev

Summer 2023

Status: Offline

Joined: Dec 13, 2020 (3 Year Member)

Posts: 288

Reputation Power: 567

https://gyazo.com/21fd330e78c30c521278bdaa54e82e93.png

Project Background Information

This script was a project for my Python class in college. I have not gotten it to work 100% correctly yet, and there are some errors with the price prediction: it accurately displays the current day's opening price, but it does not accurately predict the next day's opening price as intended.

The purpose of this script is to pull data from Yahoo Finance and other finance sources and output financial data for a ticker that you give it. The output includes charts for the open/close/volume/high/low points and graphs of the rolling mean / standard deviation. It also displays results from the Dickey-Fuller test and ARIMA models.

*This tutorial will NOT show you how to install libraries or set up your Python environment, but it will walk you through the code and thought process of the script. If you need help installing libraries or a Python work environment, there are tons of tutorials on YouTube.

*Note: All data/graphs shown use data from the 'TSLA' ticker. This data is not up to date because I retrieved it months ago. The script should still work with new data for most of the features.

*Full script at bottom of post*

What is the Dickey-Fuller test?

In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive time series model. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. The test is named after the statisticians David Dickey and Wayne Fuller, who developed it in 1979.[1]
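
To make the unit-root idea concrete, here is a minimal sketch of the basic (non-augmented) Dickey-Fuller regression using only NumPy. It is not what the script uses: the script calls the statsmodels adfuller function, which adds lagged difference terms and computes a p-value. The function name df_statistic and the simulated series are made up for illustration, and the statistic is compared against the commonly tabulated 5% critical value of about -2.86 for the constant-only case.

```python
import numpy as np

def df_statistic(y):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # day-to-day changes
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    coef, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)
white_noise_stat = df_statistic(noise)              # stationary series
random_walk_stat = df_statistic(np.cumsum(noise))   # series with a unit root

CRITICAL_5PCT = -2.86  # approximate large-sample value, constant-only case
for name, stat in [("white noise", white_noise_stat), ("random walk", random_walk_stat)]:
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: statistic = {stat:.2f} -> {verdict}")
```

A strongly negative statistic is evidence against a unit root, which is the same reading the script applies to the p-value from adfuller.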

What are ARIMA models?

ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

ARIMA models are classical statistical forecasting models, and they often come up alongside machine learning methods for time series. There are tons of different ways they can be utilized. In our case, they help with price prediction.
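
As a rough illustration of what "describing the autocorrelations" means, the sketch below hand-rolls the simplest case, an ARIMA(1,1,0): difference the series once (the "I" part, d=1), then regress each change on the previous change (the "AR" part, p=1). It uses only NumPy on simulated data. The names fit_ar1_on_differences and forecast_next are hypothetical, and the script itself uses the statsmodels ARIMA class with order (1,1,1) instead.

```python
import numpy as np

def fit_ar1_on_differences(prices):
    """Estimate phi in: change[t] = phi * change[t-1] + error (no constant)."""
    changes = np.diff(np.asarray(prices, dtype=float))  # d=1 differencing
    prev, curr = changes[:-1], changes[1:]
    return float(prev @ curr / (prev @ prev))           # least-squares slope

def forecast_next(prices, phi):
    """One-step forecast: last price plus phi times the last change."""
    return prices[-1] + phi * (prices[-1] - prices[-2])

# Simulate a price series whose daily changes follow an AR(1) with phi = 0.6
rng = np.random.default_rng(42)
true_phi = 0.6
changes = np.zeros(2000)
for t in range(1, len(changes)):
    changes[t] = true_phi * changes[t - 1] + rng.normal()
prices = 100 + np.cumsum(changes)

phi_hat = fit_ar1_on_differences(prices)
print(f"estimated phi: {phi_hat:.3f}")
print(f"last price: {prices[-1]:.2f}, next-step forecast: {forecast_next(prices, phi_hat):.2f}")
```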

Now For the Code

Importing Libraries

# Import libraries
import random
import tensorflow as tf
from yahoo_finance import Share  # unused below; safe to remove if the package is not installed
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import yfinance as yf
from pandas_datareader import data, wb
import datetime
from statsmodels.tsa.stattools import adfuller
# The old statsmodels.tsa.arima_model.ARIMA class was deprecated and later removed;
# statsmodels.tsa.arima.model.ARIMA is its replacement
from statsmodels.tsa.arima.model import ARIMA
from keras import metrics
from sklearn.metrics import mean_squared_error
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import datetime as dt
from pandas_datareader import data as pdr

This section imports the libraries used for fetching financial data, running the statistical tests and models, and visualizing the results.

Allowing Ticker Input / Data Fetching / Price Display

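# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')

# Download the full price history and round-trip it through a CSV file
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
# Recent yfinance versions return a timezone-aware index; drop the timezone
# so the plain date strings used for the splits later compare cleanly
hist.index = hist.index.tz_localize(None)
hist.to_csv(ticker + '.csv')
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])

# Recent history only. The pandas_datareader 'yahoo' reader
# (data.DataReader(ticker, 'yahoo', start, end)) tends to fail against the
# current Yahoo site, so yfinance is used for the same date range.
stock = stock0.history(start=start, end=end)
stock

This section asks for a ticker, downloads its price history, saves it to a CSV file, and loads it back into a DataFrame.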

Cleaning / Sorting the Data

# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']

# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']

# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']

# Create valid data set
valid_split_index = (train_split_index.max(), test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']

# Printed index values are:
# 0-5521 (train), 5522-6527 (valid), 6528-6947 (test)

This section splits the closing prices by date into a training set (the data the model is fit on), a validation set, and a test set (the data that is held back to check the predictions).
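
Here is a minimal sketch of the same date-based split on a made-up DataFrame (the dates and prices below are invented, and the frame name df is an assumption). The point is that the three masks partition the rows in time order without overlap, so the model never sees test-period prices while it is being fit.

```python
import pandas as pd

# Ten business days of made-up closing prices
df = pd.DataFrame({
    "Date": pd.bdate_range("2021-01-04", periods=10),
    "Close": [100, 101, 103, 102, 105, 107, 106, 108, 110, 111],
})

train_split_date = "2021-01-07"   # last day of the training period
test_split_date = "2021-01-13"    # first day of the test period

x_train = df.loc[df["Date"] <= train_split_date, "Close"]
x_valid = df.loc[(df["Date"] > train_split_date) & (df["Date"] < test_split_date), "Close"]
x_test = df.loc[df["Date"] >= test_split_date, "Close"]

# Every row lands in exactly one of the three sets
print(len(x_train), len(x_valid), len(x_test))
```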

Stationarity Test

# Stationarity Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    # Plot rolling statistics
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Perform Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag=20)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value

    # Report the verdict once, after all critical values have been collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)

Call Stationarity Test

test_stationarity(series)

This section defines our stationarity test, which plots the rolling mean and standard deviation and prints the Dickey-Fuller results.

Stationarity Test Output

https://gyazo.com/a98c2e99adf8f8de88f52f1217b58926.png

Stationarity Test w/ Differenced Close Price

# Get the day-to-day difference of the Close series
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value in this column
ticker0_close_diff_1.dropna(inplace=True)

Call Stationarity Test w/ Differenced Close Price

test_stationarity(ticker0_close_diff_1)

Stationarity Test w/ Differenced Close Price Output Graph

https://gyazo.com/9211ba88dfd127f619bb07fdfe7a0b06.png
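
To see why differencing helps, here is a small sketch on a synthetic trending series (made-up data, not TSLA prices). The raw series has a rolling mean that drifts upward, while the differenced series has a rolling mean that stays close to the constant drift. The first difference is NaN because there is no previous row to subtract, which is why it gets dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Upward-drifting random walk: each step adds 0.5 on average
series = pd.Series(100 + np.cumsum(0.5 + rng.normal(size=500)))

diffed = series.diff()
print("first difference is NaN:", pd.isna(diffed.iloc[0]))
diffed = diffed.dropna()

window = 50
level_means = series.rolling(window).mean().dropna()
diff_means = diffed.rolling(window).mean().dropna()

# Spread of the rolling mean: large for the raw series, small after differencing
level_range = level_means.max() - level_means.min()
diff_range = diff_means.max() - diff_means.min()
print("rolling-mean range, raw series: ", round(level_range, 2))
print("rolling-mean range, differenced:", round(diff_range, 2))
```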

Graphing (Partial) Autocorrelation

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

(Partial) Autocorrelation Graph Output

https://gyazo.com/42668ffbc6310cbb847a4a0d70ca48ca.png
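
For reference, the autocorrelation that plot_acf draws at each lag can be computed by hand. The sketch below is a simplified NumPy version (the helper name autocorr is made up, and it omits the confidence bands that the statsmodels plot adds). On simulated data where each value is 0.6 times the previous one plus noise, the lag-1 value should come out near 0.6 and shrink at higher lags.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(centered[lag:] @ centered[:-lag] / (centered @ centered))

rng = np.random.default_rng(7)
x = np.zeros(3000)
for t in range(1, len(x)):
    x[t] = 0.6 * x[t - 1] + rng.normal()

for lag in (1, 2, 5):
    print(f"lag {lag}: {autocorr(x, lag):.3f}")
```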

ARIMA Models

# Use this block to fit the model
ticker0_arima = ARIMA(x_train, order=(1,1,1))
# fit() in the current statsmodels ARIMA API takes no disp argument
ticker0_arima_fit = ticker0_arima.fit()
print(ticker0_arima_fit.summary())

ARIMA Output (with warnings)

https://gyazo.com/eed0d9085dcc8121daf3e4f51bd89ee9.png

Create List for Predictions / data points

# Create list of x_train values
history = [x for x in x_train]

# Establish list for predictions
model_predictions = []

# Count number of test data points
N_test_observations = len(x_test)

# Walk forward through the test set: refit on everything seen so far,
# forecast one step ahead, then reveal the true value
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)

# Keras returns a tensor here, so convert it to a plain number for printing
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))

# Store the last fitted model
model_fit.save(ticker + '.pkl')

This block walks forward through the test set one day at a time, reports the mean absolute error of the forecasts, and saves the last fitted model.

Output Image

https://gyazo.com/be562425f0116d026be57548b396d9e9.png
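
The script reports both a mean absolute error and a mean squared error, so here is a quick sketch of how the two differ, on made-up numbers in plain Python. MAE is in the same units as the price, while MSE is in squared units and punishes large misses more heavily.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 100.0]   # misses of 1, 1, 2 and 5

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print("MAE:", mae)            # (1 + 1 + 2 + 5) / 4
print("MSE:", mse)            # (1 + 1 + 4 + 25) / 4
print("RMSE:", mse ** 0.5)    # square root puts the error back in price units
```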

Check Model

# Check the first few predictions
model_predictions[:5]

# Load model
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')

# Mean squared error of the predictions
arima_mse = mean_squared_error(x_test, model_predictions)
arima_mse

plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue', label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()

Next Day Price Predictions

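# model_predictions[-1] is the forecast for the last day in x_test; refit on the full history to look one day further
print("next day predicted value: ", ARIMA(history, order=(1,1,1)).fit().forecast()[0])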

Output Image

https://gyazo.com/694c130c7abe8ad6ab1f47522b5caccf.png
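
One likely reason the "next day" number matches a day that is already in the data: in the walk-forward loop, each forecast is made before the true value for that day is appended, so the last entry of model_predictions is the forecast for the last test day, not for tomorrow. The sketch below shows this with a deliberately trivial stand-in forecaster (naive_forecast, which just repeats the last value and is not ARIMA) on made-up prices. A genuine next-day forecast needs one more call after the loop, on the history that includes the final test value.

```python
def naive_forecast(history):
    """Stand-in for a fitted model: predict that tomorrow equals today."""
    return history[-1]

train = [10.0, 11.0, 12.0]
test = [13.0, 14.0, 15.0]

history = list(train)
model_predictions = []
for true_value in test:
    model_predictions.append(naive_forecast(history))  # forecast made BEFORE seeing true_value
    history.append(true_value)                         # then reveal the true value

# Each prediction lines up with a day that is already in the test set
print(list(zip(test, model_predictions)))

# The forecast for the day after the last test day needs the full history
next_day = naive_forecast(history)
print("last in-sample prediction:", model_predictions[-1])
print("next day prediction:", next_day)
```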

Green Line Setup / Get Monthly Data / Green Line Indicator Definition

Green Line

# Green Line
yf.pdr_override() # <== that's all it takes :-)
start = dt.datetime(1980,12,1)
now = dt.datetime.now()
stockline = ticker

Get Monthly Data

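# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string that was passed in (the global name stock holds a DataFrame)
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline + '.csv', index=False)
    # Ignore days with almost no trading volume
    df.drop(df[df["Volume"]<1000].index, inplace=True)

    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth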

Green Line Indicator Definition

# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate=0
    lastGLV=0 #last green line value
    currentDate=""
    curentGLV=0 # current greenline value
    counter=0 # months below the current high so far
    for index, value in dfmonth.items():
        if value > curentGLV: #current greenline value
            curentGLV=value #update
            currentDate=index #update
            counter=0 #reset the counter
        if value < curentGLV:
            counter=counter+1 # update the counter for the three month
            if counter==3 and ((index.month != now.month) or (index.year != now.year)):
                #if curentGLV != lastGLV:
                #    print(curentGLV)
                glDate=currentDate
                lastGLV=curentGLV
                counter=0

    if lastGLV==0:
        message=stockline+" has not formed a green line yet"
    else:
        message=("Last Green Line: "+str(lastGLV)+" on "+str(glDate))
    print(message)

Call Green Line Calculation

dfmonth = get_monthly_data(stockline, start, now)
calculate_GreenLine(dfmonth)
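
To make the rule easier to follow, here is a small self-contained sketch of the same idea on made-up monthly highs (plain Python; the function name last_green_line and the data are hypothetical, and the check that skips the current calendar month is left out). A green line forms when a new high is followed by three months that stay below it.

```python
def last_green_line(monthly_highs):
    """Return (month, value) of the most recent confirmed green line, or None."""
    green_line = None
    current_high, current_month = 0.0, None
    months_below = 0
    for month, high in monthly_highs:
        if high > current_high:
            # New candidate high: restart the three-month count
            current_high, current_month = high, month
            months_below = 0
        elif high < current_high:
            months_below += 1
            if months_below == 3:
                # Three lower months confirm the candidate
                green_line = (current_month, current_high)
                months_below = 0
    return green_line

monthly_highs = [
    ("2022-01", 50.0), ("2022-02", 60.0),                      # 60 becomes the candidate high
    ("2022-03", 55.0), ("2022-04", 58.0), ("2022-05", 57.0),   # three lower months
    ("2022-06", 70.0),                                         # new candidate, not yet confirmed
    ("2022-07", 65.0),
]
print(last_green_line(monthly_highs))
```

The June high of 70 is not reported because only one lower month has followed it so far.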

Errors I am getting / Help Wanted / Additional Information

Additional Information
I worked on this project for my Python class. It was made in Jupyter Notebook, which is why the code format looks kind of odd. I can share the .ipynb file with anyone who may be interested in helping get this project to where it is intended to be.

Help Wanted
I need help resolving the warnings/errors you have seen in these images. I also want to get the next day price prediction to work properly. It currently gets the same day's opening price, not the next day's as intended. I would also like to eventually add some sort of weighted calculation to give meaning to the data it receives/outputs. That way, it can provide a recommendation of a good/bad investment or something of the sort.

Any suggestions / Feedback for improvement is always welcome!

Error Images

https://gyazo.com/f3684d1ab8d6b3217813e36979e08487.png
https://gyazo.com/8caa5e8c48379c7a19579a89213500a3.png

Full Script

# Import libraries
import random
import tensorflow as tf
from yahoo_finance import Share  # unused below; safe to remove if the package is not installed
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import yfinance as yf
from pandas_datareader import data, wb
import datetime
from statsmodels.tsa.stattools import adfuller
# The old statsmodels.tsa.arima_model.ARIMA class was deprecated and later removed;
# statsmodels.tsa.arima.model.ARIMA is its replacement
from statsmodels.tsa.arima.model import ARIMA
from keras import metrics
from sklearn.metrics import mean_squared_error
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import datetime as dt
from pandas_datareader import data as pdr





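# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')

# Download the full price history and round-trip it through a CSV file
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
# Recent yfinance versions return a timezone-aware index; drop the timezone
# so the plain date strings used for the splits later compare cleanly
hist.index = hist.index.tz_localize(None)
hist.to_csv(ticker + '.csv')
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])

# Recent history only. The pandas_datareader 'yahoo' reader
# (data.DataReader(ticker, 'yahoo', start, end)) tends to fail against the
# current Yahoo site, so yfinance is used for the same date range.
stock = stock0.history(start=start, end=end)
stock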





# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']

# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']

# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']

# Create valid data set
valid_split_index = (train_split_index.max(), test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']

# Printed index values are:
# 0-5521 (train), 5522-6527 (valid), 6528-6947 (test)





# Stationarity Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    # Plot rolling statistics
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Perform Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag=20)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value

    # Report the verdict once, after all critical values have been collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)

test_stationarity(series)

# Get the day-to-day difference of the Close series
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value in this column
ticker0_close_diff_1.dropna(inplace=True)

test_stationarity(ticker0_close_diff_1)





from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()





# Use this block to fit the model
ticker0_arima = ARIMA(x_train, order=(1,1,1))
# fit() in the current statsmodels ARIMA API takes no disp argument
ticker0_arima_fit = ticker0_arima.fit()
print(ticker0_arima_fit.summary())





# Create list of x_train values
history = [x for x in x_train]

# Establish list for predictions
model_predictions = []

# Count number of test data points
N_test_observations = len(x_test)

# Walk forward through the test set: refit on everything seen so far,
# forecast one step ahead, then reveal the true value
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)

# Keras returns a tensor here, so convert it to a plain number for printing
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))

# Store the last fitted model
model_fit.save(ticker + '.pkl')





# Check the first few predictions
model_predictions[:5]

# Load model
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')

# Mean squared error of the predictions
arima_mse = mean_squared_error(x_test, model_predictions)
arima_mse

plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue', label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()

# model_predictions[-1] is the forecast for the last day in x_test; refit on the full history to look one day further
print("next day predicted value: ", ARIMA(history, order=(1,1,1)).fit().forecast()[0])





# Green Line
yf.pdr_override() # <== that's all it takes :-)
start = dt.datetime(1980,12,1)
now = dt.datetime.now()
stockline = ticker

# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string that was passed in (the global name stock holds a DataFrame)
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline + '.csv', index=False)
    # Ignore days with almost no trading volume
    df.drop(df[df["Volume"]<1000].index, inplace=True)

    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth

# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate=0
    lastGLV=0 #last green line value
    currentDate=""
    curentGLV=0 # current greenline value
    counter=0 # months below the current high so far
    for index, value in dfmonth.items():
        if value > curentGLV: #current greenline value
            curentGLV=value #update
            currentDate=index #update
            counter=0 #reset the counter
        if value < curentGLV:
            counter=counter+1 # update the counter for the three month
            if counter==3 and ((index.month != now.month) or (index.year != now.year)):
                #if curentGLV != lastGLV:
                #    print(curentGLV)
                glDate=currentDate
                lastGLV=curentGLV
                counter=0

    if lastGLV==0:
        message=stockline+" has not formed a green line yet"
    else:
        message=("Last Green Line: "+str(lastGLV)+" on "+str(glDate))
    print(message)

dfmonth = get_monthly_data(stockline, start, now)
calculate_GreenLine(dfmonth)

Closing Remarks

If you have made it this far, I'd like to thank you for your time and interest in this script. The stock market and programming have always been of interest to me. More tutorials / programming posts to come. Stay tuned.

Last edited by SiDev Thu Feb 16, 2023 2:23 am; edited 3 times in total

The following 2 users thanked SiDev for this useful post:

CriticaI (08-19-2023), Scizor (02-15-2023)

#2. Posted: Wed Feb 15, 2023 10:36 pm

SiDev

Summer 2023

Status: Offline

Joined: Dec 13, 2020 (3 Year Member)

Posts: 288

Reputation Power: 567

This project also slightly utilizes TensorFlow. I did not mention this in the post. TensorFlow deserves its own post dedicated to just that. It has tons of capabilities and I would recommend anyone to look into it if interested in AI/Machine Learning.


#3. Posted: Wed Feb 15, 2023 10:43 pm

SiDev

Resident Elite

Status: Offline

Joined: Dec 13, 2020 (3 Year Member)

Posts: 288

Reputation Power: 567

Jupyter Notebook .ipynb file

.ipynb Download Link

Virus Scan

Virus Scan Image

https://gyazo.com/8ff022b22bf32074a16da643935fbac8.png


#4. Posted: Thu Jul 27, 2023 7:11 am

Blind Luck

Status: Offline

Joined: May 24, 2014 (10 Year Member)

Posts: 1,111

Reputation Power: 2706

I know nothing about this, but if you could get it to accurately predict next-day numbers, or come close enough to make a very educated guess, this would be very useful and a game changer. I would expect this to be hard because there are so many variables, so I am extremely impressed, to say the least!


#5. Posted: Sat Aug 19, 2023 2:40 pm

CriticaI

Blind Luck

Status: Offline

Joined: Nov 05, 2013 (11 Year Member)

Posts: 2,749

Reputation Power: 452

Cool project!

For getting around errors, my advice is to read the error messages multiple times and really think about what you want your code to do.

It says a variable is not defined, meaning you did not create the variable yet, or it is not available in the scope you think it is.

Split is a method available to strings. It turns a string into an array (list) of smaller strings. If your variable is not a string but instead something like a DataFrame, number, or list, that method is unavailable because Python does not know how to split something like a DataFrame.
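
A tiny sketch of the two error types described above, using plain Python (the variable names are made up, and a list stands in for the DataFrame so the example needs no libraries):

```python
# 1) Using a name before it exists raises NameError
try:
    print(undefined_ticker)
except NameError as err:
    name_error_message = str(err)
    print("NameError:", name_error_message)

# 2) split() exists on strings...
parts = "TSLA,AAPL,MSFT".split(",")
print(parts)

# ...but not on other objects, which raises AttributeError
try:
    ["TSLA", "AAPL"].split(",")
except AttributeError as err:
    attribute_error_message = str(err)
    print("AttributeError:", attribute_error_message)
```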

Also, if you haven't already, definitely check out ChatGPT. It is great for getting things explained in layman's terms.


#6. Posted: Sat Aug 19, 2023 8:49 pm

TCAR

Ultra Gifter

Status: Offline

Joined: Jun 15, 2014 (10 Year Member)

Posts: 1,019

Reputation Power: 20477

Motto: There's magic on the other side of fear.

not even gonna read all of it but looks like a lot
