EDA on Jane Street Market Predictions

Richard Mei
4 min read · Dec 21, 2020

Nothing is certain when it comes to trading, but it is certain that there is money to be made. Jane Street is well known for building its own trading models and using machine learning to capitalize on the inefficiencies of the market, and it has challenged the Kaggle community to create a model that can identify trading opportunities. Now I want to go through the experience of trying to do the same!

The data we are given describes trades, each with 100+ features. For each trade, a model must output an action: 1 to make the trade and 0 to pass on it. Given this, submissions are evaluated based on a utility score.

First Steps

After downloading the train csv, I loaded up a Jupyter Notebook and brought out the classics: NumPy, pandas, matplotlib, and seaborn. I used pandas to load the csv, and since the file was so large, it took quite some time given my computer specs. This made me wary and prompted me to look for ways to cut down the resources used, or to find ways to get more resources. The options I had for the future were to use a Google Colab notebook or to break the train file into multiple parts. For now, I just loaded the csv once and changed the data type of every column from float64 to float32, hoping that would make the handling faster.
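A minimal sketch of that loading-and-downcasting step (assuming the file is saved as train.csv in the working directory):

```python
import numpy as np
import pandas as pd

# Load the training data; the file is large, so this can take a while.
df = pd.read_csv("train.csv")

# Downcast every float64 column to float32 to roughly halve memory usage.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].astype(np.float32)

# Report the memory footprint after downcasting.
df.info(memory_usage="deep")
```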

EDA

The data set had a total of about 2.4 million rows, with columns for “date”, “resp”, and “ts_id” alongside 129 anonymized features. We had 500 days’ worth of data, each with a variable number of rows; for example, day 0 had 5,587 rows while day 1 had 9,401. Every row had a unique id in “ts_id”, representing the continuous time ordering from time 0. Of the features, 128 were continuous numbers while the very first was either -1 or 1.

Example Day 0 Dataframe
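The per-day row counts mentioned above can be pulled out with a simple groupby on the same dataframe (a quick sketch):

```python
# Number of rows recorded for each trading day
rows_per_day = df.groupby("date").size()

print(rows_per_day.loc[[0, 1]])  # e.g. day 0 vs. day 1 row counts
print(rows_per_day.describe())   # summary of how much daily volume varies
```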

There were weights with a value of 0 included, and overall the weight distribution was heavily skewed. Most of the weights lie between 0 and 10, and there are some outliers past 25.
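A sketch of how I look at the weight distribution and the share of zero weights:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of trade weights; the long right tail makes the skew obvious.
sns.histplot(df["weight"], bins=100)
plt.title("Distribution of weight")
plt.show()

# Fraction of rows that carry zero weight
print((df["weight"] == 0).mean())
```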

I wanted to look at the distributions of all the “resp” columns provided and saw that they were all roughly normally distributed around 0 and very similar. Given that the data was historical and we were given a time series id, I decided to look at the cumulative sum of each of the responses.

Cumulative Sum of resp for 500 days
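A sketch of how a plot like this can be generated, assuming the response columns follow the competition’s naming (resp, resp_1 through resp_4):

```python
import matplotlib.pyplot as plt

# Cumulative sum of each return column over the full 500 days,
# indexed by ts_id so the series stay in chronological order.
resp_cols = ["resp", "resp_1", "resp_2", "resp_3", "resp_4"]
df.set_index("ts_id")[resp_cols].cumsum().plot(figsize=(12, 6))
plt.title("Cumulative sum of resp columns")
plt.show()
```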

The similarity between some of the responses was important to point out. The first two, “resp_1” and “resp_2” (orange and green), follow each other very closely. Next, our actual target, the blue line, follows “resp_4” (purple) very closely.

The last thing I worked on, and am still working on, was looking at all the features. It’ll take a while to go through each of the features one by one, but I wanted to graph all of the distributions. For example:
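Here is a sketch of how the histograms can be batched into a grid; the slice of features around feature_52 is just an illustrative choice:

```python
import matplotlib.pyplot as plt

# Plot a small batch of feature distributions at a time;
# all 129 at once would be unreadable.
features = [f"feature_{i}" for i in range(49, 55)]  # example slice
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), features):
    df[col].hist(bins=100, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```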

Lots of the distributions look very similar, and it is easy to see what shapes these features take. From just this snippet, I would first look into feature_52 rather than the others, since its shape is quite distinct from the rest. A majority of the data has been normalized, so a distribution like that is definitely worth looking into. Moreover, I can also use these plots to look for features with potential outliers, since some of the graphs are very skewed.

Conclusion

Despite starting late, there are two more months left for this project. There is still time to keep doing EDA, and hopefully I can find more meaning in each feature or get rid of some noise. I didn’t talk about it above, but there are a number of missing values, so besides more EDA, another great next step would be figuring out what to do with the missing feature values.
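As a first pass on those missing values, something like this (a sketch on the same dataframe) surfaces the worst columns:

```python
# Count missing values per column and show the worst offenders,
# as a starting point for deciding how to impute or drop them.
na_counts = df.isna().sum()
print(na_counts[na_counts > 0].sort_values(ascending=False).head(10))
```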
