SIF Crypto Data Collection
This semester, our goal was to establish a pipeline for collecting, storing, and reading crypto trade and orderbook data. Crypto prices are highly volatile, fluctuating on the order of seconds rather than the hours typical of equity data. This fine-grained nature requires us to build the collection system separately from our existing infrastructure.
Data collection is accomplished using the Cryptofeed API, which lets us stream crypto data over websockets. Cryptofeed supports a wide array of exchanges and currencies, and gives us access to trades, multiple levels of order book data, and more. The project's GitHub documentation is also well-developed, which allowed us to quickly build a system of asynchronous callbacks that filter and transform the collected data.
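As a rough illustration, the skeleton below shows the shape of this setup: a FeedHandler subscribed to trade and L2 book channels, with async callbacks receiving each update. This is a minimal sketch assuming a recent Cryptofeed release (exact callback signatures have varied across versions), and the exchange and symbol here are placeholders rather than our actual configuration.

```python
from cryptofeed import FeedHandler
from cryptofeed.defines import L2_BOOK, TRADES
from cryptofeed.exchanges import Coinbase

async def on_trade(trade, receipt_timestamp):
    # each trade update carries the symbol, side, price, and quantity
    print(f"trade: {trade}")

async def on_book(book, receipt_timestamp):
    # L2 book updates expose the ladder of bid/ask price levels
    print(f"book update: {book.symbol}")

fh = FeedHandler()
fh.add_feed(Coinbase(symbols=['BTC-USD'],
                     channels=[TRADES, L2_BOOK],
                     callbacks={TRADES: on_trade, L2_BOOK: on_book}))
fh.run()
```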
Trade data is relatively straightforward and requires little overhead. Order books are richer: Cryptofeed's Level 2 (L2) channel delivers hundreds of aggregated price/quantity levels on both the bid and ask sides. We chose L2 because it balances usefulness against scale, sitting between the bare-bones L1 feed (best bid and ask only) and the superfluous L3 feed (every individual order). We filter each update down to the highest bids and lowest asks, with the exact number of levels kept left to researchers' discretion. Optionally, we process an update only if the order book has changed, to minimize wasted storage space.
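A sketch of what that filtering might look like follows, assuming the book sides behave as price-to-quantity mappings; TOP_N and the enqueue_orderbook placeholder (standing in for the write queue described later) are hypothetical names, not our exact code.

```python
TOP_N = 5        # levels kept per side; a hypothetical, researcher-configurable value
_last_seen = {}  # symbol -> last snapshot, used for change detection

async def enqueue_orderbook(symbol, ts, snapshot):
    # placeholder for handing the snapshot to the async write queue (see below)
    pass

def best_levels(side, n, reverse):
    # pick the n best prices: highest first for bids, lowest first for asks
    prices = sorted(side, reverse=reverse)[:n]
    return tuple((price, side[price]) for price in prices)

async def on_book(book, receipt_timestamp):
    bids = best_levels(book.book.bids, TOP_N, reverse=True)
    asks = best_levels(book.book.asks, TOP_N, reverse=False)
    snapshot = (bids, asks)
    if _last_seen.get(book.symbol) == snapshot:
        return  # optionally skip writes when the visible top of the book is unchanged
    _last_seen[book.symbol] = snapshot
    await enqueue_orderbook(book.symbol, receipt_timestamp, snapshot)
```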
We store the data in a Postgres database behind a Django backend; Django adds the benefit of easy database management as well as built-in CRUD (create, read, update, delete) wrappers in Python for database queries. Trades and orderbooks live in separate tables, and each entry carries a timestamp along with foreign keys into tables describing the currencies and exchanges we have collected data for.
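The models might look roughly like the following; the field names and numeric precisions here are illustrative rather than our exact schema.

```python
from django.db import models

class Exchange(models.Model):
    name = models.CharField(max_length=64, unique=True)

class Currency(models.Model):
    symbol = models.CharField(max_length=32, unique=True)  # e.g. 'BTC-USD'

class Trade(models.Model):
    exchange = models.ForeignKey(Exchange, on_delete=models.CASCADE)
    currency = models.ForeignKey(Currency, on_delete=models.CASCADE)
    timestamp = models.DateTimeField(db_index=True)
    side = models.CharField(max_length=4)  # 'buy' or 'sell'
    price = models.DecimalField(max_digits=24, decimal_places=8)
    quantity = models.DecimalField(max_digits=24, decimal_places=8)

class OrderBookLevel(models.Model):
    exchange = models.ForeignKey(Exchange, on_delete=models.CASCADE)
    currency = models.ForeignKey(Currency, on_delete=models.CASCADE)
    timestamp = models.DateTimeField(db_index=True)
    level = models.PositiveSmallIntegerField()  # 0 = top of book
    bid_price = models.DecimalField(max_digits=24, decimal_places=8)
    bid_quantity = models.DecimalField(max_digits=24, decimal_places=8)
    ask_price = models.DecimalField(max_digits=24, decimal_places=8)
    ask_quantity = models.DecimalField(max_digits=24, decimal_places=8)
```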
A significant issue during development was mixing Python's asyncio library with Django database calls. The preprocessing callbacks described earlier must be asynchronous, per Cryptofeed's documentation, but Django's ORM cannot be called from an asynchronous context. Our solution was a queueing system: incoming trades and orderbooks are appended to a write queue asynchronously, and periodically the queue is drained and written to the database synchronously in a separate context. This queue juggling also made the callbacks more efficient, and asyncio locks keep the queue safe against concurrent access from the callbacks.
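In outline, the mechanism looks something like the sketch below. It assumes the Trade model sketched earlier and uses asgiref's sync_to_async (which ships with Django) to push the synchronous flush onto a worker thread; the flush interval is arbitrary.

```python
import asyncio
from asgiref.sync import sync_to_async

write_queue = []             # rows accumulated by the async callbacks
queue_lock = asyncio.Lock()  # prevents callbacks and the flusher from interleaving

async def enqueue(row):
    # callbacks stay fully asynchronous: they only ever append to the queue
    async with queue_lock:
        write_queue.append(row)

def flush_sync(rows):
    # runs off the event loop, where Django's ORM may be used freely
    Trade.objects.bulk_create(rows)

async def flush_periodically(interval=5.0):
    while True:
        await asyncio.sleep(interval)
        async with queue_lock:
            rows = write_queue.copy()
            write_queue.clear()
        if rows:
            await sync_to_async(flush_sync)(rows)  # hand the batch to a worker thread
```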
Database reading will largely mirror database writing, but will offer significant flexibility in filtering, as well as processing and manipulation with pandas. The full design will incorporate researcher feedback to provide maximum utility and ease of access; we plan to finalize and implement it next semester. At that point, we hope to have run the data collection pipeline long enough to experiment on production-scale datasets. The coming semesters hold great promise, and we look forward to the excellent crypto strategies that rise from our infrastructure work.
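Although the reader's design is not finalized, one possible shape is a thin wrapper that filters through the Django ORM and returns a pandas DataFrame; the function name and filter arguments below are hypothetical, and the Trade model is the one sketched earlier.

```python
import pandas as pd

def load_trades(exchange_name, symbol, start, end):
    # filter in the database, then hand the rows to pandas for analysis
    rows = (Trade.objects
            .filter(exchange__name=exchange_name,
                    currency__symbol=symbol,
                    timestamp__range=(start, end))
            .values('timestamp', 'side', 'price', 'quantity'))
    return pd.DataFrame(list(rows))
```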
Sumit Nawathe
Koby Adu-Bonnah
Arin Zeng