Stock Market Data/Price Prediction Script - Python
Posted:

SiDev
  • Resident Elite
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 288
Reputation Power: 567
https://gyazo.com/21fd330e78c30c521278bdaa54e82e93.png

Project Background Information

This script was a project for my Python class in college. It does not work 100% correctly yet, and the price prediction has some errors: it accurately displays the current day's opening price, but it does not accurately predict the next day's opening price as intended.

The purpose of this script is to scrape Yahoo Finance and other finance sources and output financial data for a ticker that you give it. This data includes charts of the open, close, volume, high, and low points, plus a graph of the rolling mean and standard deviation. It also displays results from the Dickey-Fuller test and ARIMA models.

*This tutorial will NOT show you how to install libraries or set up your Python environment, but it will walk you through the code and the thought process behind the script. If you need help installing libraries or setting up a Python work environment, there are tons of tutorials on YouTube.

*Note: All data/graphs shown use data from the 'TSLA' ticker. This data is not up to date because I retrieved it months ago. The script should still work to retrieve new data for most of the features.

*Full script at bottom of post*




What is the Dickey-Fuller test?
In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive time series model. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. The test is named after the statisticians David Dickey and Wayne Fuller, who developed it in 1979.[1]
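To make the idea concrete, here is a minimal sketch of the simplest (non-augmented) Dickey-Fuller regression in plain Python. It is an illustration only, not what the script runs: the script calls statsmodels' adfuller, which adds lagged differences and reports a p-value. The function name dickey_fuller_t and the simulated series are made up for this example, and -2.86 is the approximate large-sample 5% critical value for the constant-only case.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic for gamma in: diff[t] = alpha + gamma * series[t-1] + error."""
    x = series[:-1]                                        # lagged levels
    y = [b - a for a, b in zip(series[:-1], series[1:])]   # first differences
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    gamma = sxy / sxx
    alpha = mean_y - gamma * mean_x
    rss = sum((yi - alpha - gamma * xi) ** 2 for xi, yi in zip(x, y))
    se_gamma = math.sqrt(rss / (n - 2) / sxx)
    return gamma / se_gamma

CRITICAL_5PCT = -2.86  # assumption: approximate asymptotic value, constant-only case

rng = random.Random(42)
noise = [rng.gauss(0, 1) for _ in range(500)]  # stationary by construction

walk = []          # random walk: has a unit root by construction
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

for name, data in (("white noise", noise), ("random walk", walk)):
    t_stat = dickey_fuller_t(data)
    verdict = "reject unit root" if t_stat < CRITICAL_5PCT else "cannot reject unit root"
    print(name, round(t_stat, 2), verdict)
```

A strongly negative t-statistic (below the critical value) is evidence against a unit root, which is the same reading the script applies to the p-value.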


What are ARIMA models?
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.


ARIMA models are used for time series forecasting, and there are tons of different ways they can be utilized. In our case, they help with price prediction.
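As a rough illustration of the autoregressive (AR) and differencing (I) parts, here is a sketch of an ARIMA(1,1,0)-style forecast in plain Python, with no constant term and no moving-average term. It is deliberately simpler than the ARIMA(1,1,1) model the script fits with statsmodels, and the function name and toy prices are made up for the example.

```python
def ar1_on_differences_forecast(prices):
    """Difference once, fit AR(1) by least squares, then undo the difference."""
    diffs = [b - a for a, b in zip(prices[:-1], prices[1:])]
    prev, curr = diffs[:-1], diffs[1:]
    # Least-squares slope through the origin: curr ~= phi * prev
    phi = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    next_diff = phi * diffs[-1]      # forecast of tomorrow's change
    return prices[-1] + next_diff    # add it back to the last known price

# Toy data: the daily changes are 8, 4, 2, 1 (each half of the one before)
toy_prices = [100.0, 108.0, 112.0, 114.0, 115.0]
forecast = ar1_on_differences_forecast(toy_prices)
print(forecast)
```

The model learns how today's change relates to yesterday's change, forecasts the next change, and adds it to the last price. That final addition is what the middle "1" in order=(1,1,1) refers to.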





Now For the Code

Importing Libraries
#import libraries
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import yfinance as yf
from keras import metrics
from pandas_datareader import data
from pandas_datareader import data as pdr
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
# Legacy ARIMA API: deprecated in statsmodels 0.12 and removed in 0.13
from statsmodels.tsa.arima_model import ARIMA


This section imports the libraries used for data scraping, data visualization, and modeling, and gives us access to the financial data we pull from.





Allowing Ticker Input / Data Fetching / Price Display
# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
hist.to_csv(ticker + '.csv')
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])
stock = data.DataReader(ticker, 'yahoo', start , end)
stock


This section asks for a ticker, downloads its full price history with yfinance, saves that history to a CSV file, and reloads it as the DataFrame ticker0 with the Date column parsed as dates.





Cleaning / Sorting the Data
# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']
# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']
# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']
# Create valid data set
valid_split_index = (train_split_index.max(),test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']
#printed index values are:
#0-5521(train), 5522-6527(valid), 6528-6947(test)


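This section splits the closing prices by date into training data (through 2020-12-31), validation data (between the two split dates), and test data (from 2021-06-15 onward). The model is fit on the training data, while the test data is held back to check the predictions.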
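The same three-way split can be sketched without exact date lookups. The np.where(ticker0.Date == split_date)[0][0] lines raise an IndexError if a split date is missing from the data (for example, a weekend or market holiday), while plain comparisons do not care. The sketch below is plain Python; split_by_date and the sample rows are hypothetical names and made-up prices for illustration.

```python
from datetime import date

def split_by_date(rows, train_end, test_start):
    """Split (date, close) rows into train / valid / test by comparing dates."""
    train = [close for day, close in rows if day <= train_end]
    valid = [close for day, close in rows if train_end < day < test_start]
    test = [close for day, close in rows if day >= test_start]
    return train, valid, test

# Made-up closes; note that 2021-01-02 (a Saturday) is not in the data
rows = [
    (date(2020, 12, 30), 10.0),
    (date(2020, 12, 31), 11.0),
    (date(2021, 1, 4), 12.0),
    (date(2021, 6, 14), 13.0),
    (date(2021, 6, 15), 14.0),
    (date(2021, 6, 16), 15.0),
]

train, valid, test = split_by_date(rows, date(2021, 1, 2), date(2021, 6, 15))
print(train, valid, test)
```

Every row lands in exactly one of the three lists, even though the training cutoff itself is not a trading day.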





Stationary Test
# Stationary Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    #Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()
    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20 )
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    # Report the result once, after all critical values have been collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)


Call Stationary Test
test_stationarity(series)


This section defines our stationarity test: it plots the series with its rolling mean and rolling standard deviation, then prints the results of the Dickey-Fuller test.
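For readers unfamiliar with rolling statistics, this is roughly what timeseries.rolling(window).mean() and .std() compute, sketched in plain Python. The helper name rolling_stats is made up for the example. Like pandas, it leaves the first window - 1 positions empty (None here, NaN in pandas) and uses the sample standard deviation, so it assumes a window of at least 2.

```python
import statistics

def rolling_stats(values, window):
    """Rolling mean and sample standard deviation over the trailing window."""
    means, stds = [], []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough points yet to fill a full window
            means.append(None)
            stds.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        means.append(statistics.mean(chunk))
        stds.append(statistics.stdev(chunk))
    return means, stds

means, stds = rolling_stats([1.0, 2.0, 3.0, 5.0], window=2)
print(means)
print(stds)
```

If the rolling mean or rolling standard deviation drifts over time in the plot, that is a visual hint that the series is not stationary.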

Stationary Test Output


Stationary Test w/ Differenced Close Price
# Get the first difference of the Close series (day-to-day change)
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value
ticker0_close_diff_1.dropna(inplace=True)


Call Stationary Test w/ Differenced Close Price
test_stationarity(ticker0_close_diff_1)


Stationary Test w/ Differenced Close Price Output Graph






Graphing (Partial) Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()
# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()


(Partial) Autocorrelation Graph Output
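What plot_acf draws can be sketched by hand: the autocorrelation at lag k is the covariance between the series and a copy of itself shifted by k steps, divided by the variance. The function below is a plain-Python illustration with a made-up name. statsmodels adds confidence bands, and the partial autocorrelation needs an extra regression step that is not shown here.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# A series that flips sign every step: strongly negative at lag 1, positive at lag 2
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
for lag in range(3):
    print(lag, autocorrelation(alternating, lag))
```

Lags with large bars in the ACF and PACF plots are the usual starting point for choosing the AR and MA orders in ARIMA's order=(p,d,q).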






ARIMA Models
# Use this block to fit the model
ticker0_arima = ARIMA(x_train, order=(1,1,1))
ticker0_arima_fit = ticker0_arima.fit(disp=0)
print(ticker0_arima_fit.summary())


ARIMA Output (with warnings)






Create List for Predictions / data points
# Create list of x train values
history = [x for x in x_train]
# establish list for predictions
model_predictions = []
# Count number of test data points
N_test_observations = len(x_test)
# loop through every data point
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))
# save the last fitted model
model_fit.save(ticker + '.pkl')


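This block makes a one-step-ahead prediction for every test data point, refitting the model each time on all the true values seen so far. It then prints the mean absolute error of those predictions and saves the last fitted model to a .pkl file.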
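Since the code computes both a mean absolute error (keras metrics.mean_absolute_error) and a mean squared error (sklearn's mean_squared_error), it helps to keep the two apart. Here is a plain-Python sketch with made-up numbers; the short names mae and mse are chosen so they do not shadow the imported functions.

```python
def mae(actual, predicted):
    """Mean absolute error: average size of the misses, in the same units as the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average squared miss, so large misses count much more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]   # misses of 1, 0 and 2

print("MAE:", mae(actual, predicted))
print("MSE:", mse(actual, predicted))
```

For stock prices, the MAE reads directly as "off by this many dollars on average", while the MSE is in squared dollars.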

Output Image





Check Model
# Peek at the first five predictions
model_predictions[:5]
# Load the saved model to check that it reloads
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')
# Mean squared error of the walk-forward predictions
arima_mse = mean_squared_error(x_test,model_predictions)
arima_mse
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue',label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()





Next Day Price Predictions
print("next day predicted value: ",model_predictions[-1])
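One likely reason for the same-day result: model_predictions[-1] is the forecast for the last day already in x_test, because each loop pass forecasts a day and only then appends its true value. A forecast for the next, unseen day needs one more forecast after the loop, using the full history. The sketch below shows the pattern with a stand-in forecaster (naive_forecast, a made-up helper that just repeats the last value) so it runs without statsmodels. In the script, that call would be an ARIMA fit and forecast on history. Note also that the script models the 'Close' column, so its predictions are closing prices.

```python
def naive_forecast(history):
    """Stand-in for 'fit a model on history and forecast one step ahead'."""
    return history[-1]

def walk_forward(train, test, forecast_next):
    """Return the in-sample walk-forward predictions and one true next-day forecast."""
    history = list(train)
    predictions = []
    for true_value in test:
        predictions.append(forecast_next(history))  # forecast for a day we already have
        history.append(true_value)                  # then reveal that day's true value
    next_day = forecast_next(history)               # extra step: the day after the data ends
    return predictions, next_day

predictions, next_day = walk_forward([1.0, 2.0], [3.0, 4.0], naive_forecast)
print("last test-day prediction:", predictions[-1])
print("next day prediction:", next_day)
```

The two printed values differ: the first only ever saw data up to the second-to-last day, while the second uses everything available.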


Output Image





Green Line Setup / Get Monthly Data / Green Line Indicator Definition
Green Line
# Green Line
yf.pdr_override() # <== that's all it takes :-)
start =dt.datetime(1980,12,1)
now = dt.datetime.now()
stockline = ticker

Get Monthly Data
# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string passed in (stockline), not the stock DataFrame
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline +'.csv', index=False)
    df.drop(df[df["Volume"]<1000].index, inplace=True)

    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth

Green Line Indicator Definition
# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate=0
    lastGLV=0 #last green line value
    currentDate=""
    curentGLV=0 # current greenline value
    counter=0 # months since the current high was set
    for index, value in dfmonth.items():
        if value > curentGLV: #current greenline value
            curentGLV=value #update
            currentDate=index #update
            counter=0 #reset the counter
        if value < curentGLV:
            counter=counter+1 # update the counter for the three month

            if counter==3 and ((index.month != now.month) or (index.year != now.year)):
                #if curentGLV != lastGLV:
                #    print(curentGLV)
                glDate=currentDate
                lastGLV=curentGLV
                counter=0

    if lastGLV==0:
        message=stockline+" has not formed a green line yet"
    else:
        message=("Last Green Line: "+str(lastGLV)+" on "+str(glDate))
    print(message)

Call Green Line Calculation
# dfmonth must be created before it is passed in
dfmonth = get_monthly_data(stockline, start, now)
calculate_GreenLine(dfmonth)
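The green line logic can also be written as a pure function that returns its result instead of printing, which makes it easier to check on a handful of numbers. This is a sketch under a few assumptions: the name last_green_line is made up, the input is a plain list of (year, month, monthly_high) tuples rather than a pandas Series, and the counter starts at 0 before the loop.

```python
def last_green_line(monthly_highs, current_year, current_month):
    """Return (value, (year, month)) of the last high that held for three lower months."""
    last_value, last_date = 0, None
    current_value, current_date = 0, None
    counter = 0
    for year, month, high in monthly_highs:
        if high > current_value:
            # New running high: remember it and restart the count
            current_value, current_date = high, (year, month)
            counter = 0
        elif high < current_value:
            counter += 1
            # Confirm after three lower months, ignoring the still-unfinished current month
            if counter == 3 and (year, month) != (current_year, current_month):
                last_value, last_date = current_value, current_date
                counter = 0
    return last_value, last_date

# Made-up monthly highs: 12.0 in February is followed by three lower months
highs = [
    (2021, 1, 10.0), (2021, 2, 12.0), (2021, 3, 11.0),
    (2021, 4, 10.0), (2021, 5, 9.0), (2021, 6, 13.0),
]
result = last_green_line(highs, current_year=2021, current_month=6)
print(result)
```

A return value of (0, None) plays the role of the "has not formed a green line yet" message.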





Errors I am getting / Help Wanted / Additional Information

Additional Information
I worked on this project for my Python class. It was made in Jupyter Notebook, which is why the code format looks kind of odd. I can share the .ipynb file with anyone who may be interested in helping get this project where it is intended to be.

Help Wanted
I need help sorting out the warnings/errors you have seen in these images. I also want to get the next-day price prediction working properly. It currently gets the same day's opening price, not the next day's as intended. Eventually I would also like to add some sort of weighted calculation to give meaning to the data it receives and outputs, so that it can provide a recommendation of a good or bad investment, or something of the sort.

Any suggestions or feedback for improvement are always welcome!

Error Images





Full Script
#import libraries
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import yfinance as yf
from keras import metrics
from pandas_datareader import data
from pandas_datareader import data as pdr
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
# Legacy ARIMA API: deprecated in statsmodels 0.12 and removed in 0.13
from statsmodels.tsa.arima_model import ARIMA

# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
hist.to_csv(ticker + '.csv')
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])
stock = data.DataReader(ticker, 'yahoo', start , end)
stock

# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']
# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']
# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']
# Create valid data set
valid_split_index = (train_split_index.max(),test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']
#printed index values are:
#0-5521(train), 5522-6527(valid), 6528-6947(test)

# Stationary Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    #Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()
    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20 )
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    # Report the result once, after all critical values have been collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)
test_stationarity(series)

# Get the first difference of the Close series (day-to-day change)
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value in this column
ticker0_close_diff_1.dropna(inplace=True)
test_stationarity(ticker0_close_diff_1)

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()
# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Use this block to fit the model
ticker0_arima = ARIMA(x_train, order=(1,1,1))
ticker0_arima_fit = ticker0_arima.fit(disp=0)
print(ticker0_arima_fit.summary())

# Create list of x train values
history = [x for x in x_train]
# establish list for predictions
model_predictions = []
# Count number of test data points
N_test_observations = len(x_test)
# loop through every data point
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))
# save the last fitted model
model_fit.save(ticker + '.pkl')

# Peek at the first five predictions
model_predictions[:5]
# Load the saved model to check that it reloads
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')
# Mean squared error of the walk-forward predictions
arima_mse = mean_squared_error(x_test,model_predictions)
arima_mse

plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue',label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()

print("next day predicted value: ",model_predictions[-1])

# Green Line
yf.pdr_override() # <== that's all it takes :-)
start =dt.datetime(1980,12,1)
now = dt.datetime.now()
stockline = ticker

# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string passed in (stockline), not the stock DataFrame
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline +'.csv', index=False)
    df.drop(df[df["Volume"]<1000].index, inplace=True)

    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth

# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate=0
    lastGLV=0 #last green line value
    currentDate=""
    curentGLV=0 # current greenline value
    counter=0 # months since the current high was set
    for index, value in dfmonth.items():
        if value > curentGLV: #current greenline value
            curentGLV=value #update
            currentDate=index #update
            counter=0 #reset the counter
        if value < curentGLV:
            counter=counter+1 # update the counter for the three month

            if counter==3 and ((index.month != now.month) or (index.year != now.year)):
                #if curentGLV != lastGLV:
                #    print(curentGLV)
                glDate=currentDate
                lastGLV=curentGLV
                counter=0

    if lastGLV==0:
        message=stockline+" has not formed a green line yet"
    else:
        message=("Last Green Line: "+str(lastGLV)+" on "+str(glDate))
    print(message)
# dfmonth must be created before it is passed in
dfmonth = get_monthly_data(stockline, start, now)
calculate_GreenLine(dfmonth)


Closing Remarks

If you have made it this far, I'd like to thank you for your time and interest in this script. The stock market and programming have always been of interest to me. More tutorials / programming posts to come. Stay tuned.


Last edited by SiDev ; edited 3 times in total

The following 2 users thanked SiDev for this useful post:

CriticaI (08-19-2023), Scizor (02-15-2023)
#2. Posted:
SiDev
  • Resident Elite
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 288
Reputation Power: 567
This project also slightly utilizes TensorFlow. I did not mention this in the post. TensorFlow deserves its own post dedicated to just that. It has tons of capabilities and I would recommend anyone to look into it if interested in AI/Machine Learning.
#3. Posted:
SiDev
  • Resident Elite
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 288
Reputation Power: 567
Jupyter Notebook .ipynb file

.ipynb Download Link

Virus Scan

Virus Scan Image
#4. Posted:
TK
  • Summer 2023
Status: Offline
Joined: May 24, 2014 (10 Year Member)
Posts: 1,111
Reputation Power: 2706
I know nothing about this, but if you could get it to accurately predict, or come close enough to make a very educated guess of, next-day numbers, this would be very useful and a game changer. I would expect this to be hard due to so many variables, so I am, to say the least, extremely impressed!
#5. Posted:
CriticaI
  • Summer 2018
Status: Offline
Joined: Nov 05, 2013 (11 Year Member)
Posts: 2,749
Reputation Power: 452
Cool project!

For getting around errors, my advice is to read the error messages multiple times and really think about what you want your code to do.

It says a variable is not defined, meaning you did not create the variable yet, or it is not available in the scope you think it is.

Split is a method available to strings. It turns a string into an array (list) of smaller strings. If your variable is not a string but instead something like a DataFrame, number, or list, that method is unavailable because Python does not know how to split something like a DataFrame.
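Both kinds of error are easy to reproduce in a few lines. This is a generic illustration, not the exact traceback from the script, and the variable names are made up.

```python
filename = "TSLA.csv"
parts = filename.split(".")   # works: str has a split method

not_a_string = [1, 2, 3]      # a list, like a DataFrame, has no split method
try:
    not_a_string.split(".")
    split_error = None
except AttributeError as err:
    split_error = type(err).__name__

try:
    undefined_variable_example  # this name was never assigned anywhere
    name_error = None
except NameError as err:
    name_error = type(err).__name__

print(parts, split_error, name_error)
```

When an error like this shows up, printing type(variable) right before the failing line is a quick way to see whether it holds a string or something else.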

Also, if you haven't already, definitely check out ChatGPT. It is great for getting things explained in layman's terms.
#6. Posted:
TCAR
  • KY Flood Relief
Status: Offline
Joined: Jun 15, 2014 (10 Year Member)
Posts: 1,019
Reputation Power: 20477
Motto: There's magic on the other side of fear.
not even gonna read all of it but looks like a lot