Big Data for Traders: Storing and Querying Tick Data with Time-Series Databases

Most trading education stops at the strategy. But anyone who has tried to backtest on real tick data hits a wall the textbooks ignore: sheer volume. A single liquid futures contract can generate millions of ticks a day; a multi-year, multi-symbol research dataset runs into billions of rows. At that scale, "just load it into a spreadsheet" falls apart, and so does a naive pandas DataFrame that tries to hold everything in memory. This is where the big-data side of quantitative trading begins.

Why market data is a hard data problem

Tick and order-book data have a few properties that break ordinary tools:

Volume. Billions of rows is normal. Memory and disk become real constraints, not afterthoughts.
Append-only. New data arrives constantly in time order; old data is rarely updated. That is a very different workload from a typical business database, where existing rows are routinely updated in place.
Time is the primary axis. Almost every query is "give me this symbol between these two timestamps." A database that is not organized around time will scan far too much.
High cardinality and irregular spacing. A research universe can span thousands of symbols, each with its own stream, and ticks do not arrive on a neat clock; gaps, bursts and microsecond timestamps are the norm.
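To make the volume point concrete, here is a rough back-of-the-envelope estimate in Python. Every input (ticks per day, symbol count, years of history, bytes per row) is an illustrative assumption rather than a measurement; swap in your own numbers.

```python
# Back-of-the-envelope sizing for an uncompressed tick store.
# All inputs below are illustrative assumptions, not measured values.

TICKS_PER_SYMBOL_PER_DAY = 2_000_000  # assumed: one liquid contract
SYMBOLS = 50                          # assumed research universe
TRADING_DAYS_PER_YEAR = 252           # assumed trading calendar
YEARS = 5                             # assumed history length

# Assumed fixed-width row: 8-byte timestamp, 8-byte price, 4-byte size.
BYTES_PER_ROW = 8 + 8 + 4


def estimate_rows() -> int:
    """Total rows across all symbols and years."""
    return TICKS_PER_SYMBOL_PER_DAY * SYMBOLS * TRADING_DAYS_PER_YEAR * YEARS


def estimate_bytes() -> int:
    """Raw, uncompressed size in bytes."""
    return estimate_rows() * BYTES_PER_ROW


if __name__ == "__main__":
    rows = estimate_rows()
    gib = estimate_bytes() / 2**30
    print(f"{rows:,} rows, about {gib:,.0f} GiB uncompressed")
```

Under these assumptions the dataset is far larger than the memory of a typical workstation, which is why the storage layout matters before any strategy code runs.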

Why a time-series database helps

A time-series database (TSDB) is purpose-built for exactly this shape of data. Compared to a general-purpose relational database, a good TSDB gives you:

Time-partitioned storage so a date-range query touches only the relevant chunks instead of the whole table.
Columnar layout — values for one field are stored together, so reading just "price" and "size" skips everything else and compresses far better.
Aggressive compression tuned for timestamps and slowly-changing numeric series, often shrinking data many-fold.
Built-in time operations — resampling, last-value-as-of, gap filling and windowed aggregates as first-class queries.
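As an illustration of what "last-value-as-of" means, here is a minimal Python sketch using only the standard library. The function name and the sample quotes are hypothetical; a real TSDB would typically offer this as a built-in query or join rather than something you write by hand.

```python
from bisect import bisect_right


def last_value_as_of(timestamps, values, t):
    """Return the most recent value at or before time t, or None if none exists.

    Assumes `timestamps` is sorted ascending and aligned with `values`.
    """
    i = bisect_right(timestamps, t)  # index of the first timestamp strictly after t
    return values[i - 1] if i > 0 else None


# Hypothetical quote updates: irregularly spaced timestamps (microseconds)
# and the mid price in force from each timestamp onward.
quote_ts = [1_000, 1_250, 1_900, 4_000]
quote_mid = [100.00, 100.25, 100.50, 100.25]

if __name__ == "__main__":
    # Which quote was in force when a trade printed at t = 2_000?
    print(last_value_as_of(quote_ts, quote_mid, 2_000))
```

Because the timestamps are sorted, the lookup is a binary search rather than a scan, which is the same property a time-organized store exploits.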

The role of columnar file formats

You do not always need a running database server. For research, a columnar file format like Parquet is often enough: it stores data by column, compresses well, preserves schema and types, and is read efficiently by analysis tools. A common pattern is to keep raw history as partitioned Parquet files (for example, one file per symbol per day) and layer query tools on top. It is cheap, portable, and plays nicely with distributed engines when you outgrow one machine.
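The one-file-per-symbol-per-day layout can be sketched as a path convention. The sketch below uses only the standard library and does not read or write Parquet itself; the root directory, the file name and the helper names are assumptions for illustration. The key=value directory naming follows the Hive-style partitioning convention that many Parquet readers recognize.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

ROOT = PurePosixPath("ticks")  # hypothetical dataset root


def partition_path(symbol: str, day: date) -> PurePosixPath:
    """Path of the file holding one symbol's ticks for one calendar day."""
    return ROOT / f"symbol={symbol}" / f"date={day.isoformat()}" / "part-0.parquet"


def paths_for_range(symbol: str, start: date, end: date) -> list:
    """Candidate files for a query over [start, end], inclusive.

    Generates one path per calendar day; in practice you would keep only
    the paths that exist, since non-trading days have no file.
    """
    n_days = (end - start).days
    return [partition_path(symbol, start + timedelta(days=n)) for n in range(n_days + 1)]


if __name__ == "__main__":
    for p in paths_for_range("ES", date(2024, 1, 2), date(2024, 1, 4)):
        print(p)
```

A three-day query for one symbol touches three files no matter how many years and symbols sit beside them, which is the whole point of partitioning by symbol and date.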

Bars are a sampling decision, not a given

Raw ticks are rarely what a model consumes directly — you aggregate them into bars. But how you sample is itself a statistical choice:

Time bars (1-minute, 1-hour) are simple but sample the market unevenly — quiet and frantic periods get the same number of bars.
Tick, volume and dollar bars sample by activity instead of the clock. Their returns tend to have better statistical properties (closer to normally distributed, with more stable variance), which matters a lot for downstream models.

Storing the raw ticks means you can always re-derive any bar type later; storing only one fixed bar type throws that flexibility away.
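To show what re-deriving a bar type from raw ticks looks like, here is a minimal sketch of dollar bars in plain Python. The Bar fields, the function name and the threshold are illustrative assumptions, and ticks are simplified to (price, size) pairs; a production version would also carry timestamps and decide what to do with the final partial bar.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: int
    dollars: float  # traded value accumulated in this bar


def dollar_bars(ticks, threshold):
    """Group (price, size) ticks into bars of at least `threshold` traded value.

    A bar closes on the tick that pushes its traded value to or past the
    threshold. A trailing partial bar is dropped in this sketch.
    """
    bars = []
    current = None
    for price, size in ticks:
        if current is None:
            current = Bar(price, price, price, price, 0, 0.0)
        current.high = max(current.high, price)
        current.low = min(current.low, price)
        current.close = price
        current.volume += size
        current.dollars += price * size
        if current.dollars >= threshold:
            bars.append(current)
            current = None
    return bars


if __name__ == "__main__":
    sample = [(100.0, 5), (101.0, 5), (99.0, 10), (100.0, 1)]
    for bar in dollar_bars(sample, threshold=1_000):
        print(bar)
```

Switching to volume bars or tick bars only changes the quantity being accumulated and compared to the threshold, which is possible precisely because the raw ticks were kept.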

A practical pipeline

Ingest raw ticks from your feed, timestamped and validated.
Store them append-only in a TSDB or partitioned Parquet, organized by symbol and date.
Clean — handle bad prints, duplicates, timezone alignment and corporate actions. Garbage in, garbage backtest.
Resample into the bar type your research needs, on demand.
Serve fast date-range slices to your backtester and models.
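The clean and resample steps above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: ticks are (timestamp_seconds, price, size) tuples, a "bad print" is any tick with a non-positive price or size, and a duplicate is an exact repeat of a tuple. Real feeds need more careful rules, since genuinely distinct trades can share a timestamp, price and size.

```python
from itertools import groupby


def clean(ticks):
    """Drop bad prints and exact duplicates, then sort by timestamp."""
    seen = set()
    out = []
    for tick in ticks:
        _, price, size = tick
        if price <= 0 or size <= 0:
            continue  # assumed definition of a bad print
        if tick in seen:
            continue  # assumed definition of a duplicate: identical tuple
        seen.add(tick)
        out.append(tick)
    out.sort(key=lambda t: t[0])  # stable sort keeps arrival order within a timestamp
    return out


def time_bars(ticks, seconds=60):
    """Aggregate cleaned, time-sorted ticks into OHLCV bars keyed by bucket start."""
    bars = []
    for bucket, group in groupby(ticks, key=lambda t: t[0] // seconds * seconds):
        rows = list(group)
        prices = [price for _, price, _ in rows]
        bars.append({
            "start": bucket,
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": sum(size for _, _, size in rows),
        })
    return bars


if __name__ == "__main__":
    raw = [(0, 100.0, 1), (30, 101.0, 2), (30, 101.0, 2), (45, -1.0, 1), (61, 99.5, 3)]
    for bar in time_bars(clean(raw)):
        print(bar)
```

Keeping cleaning and resampling as separate functions mirrors the pipeline: the cleaned ticks are stored once, and any bar type is computed from them on demand.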

Bottom line

Serious backtesting and research are as much a data-engineering problem as a strategy problem. Treat tick data as the big-data workload it is: store it append-only and time-partitioned, lean on columnar formats and compression, keep the raw ticks so you can resample freely, and clean obsessively. Get the data layer right and every model you build on top of it gets faster, cheaper and more trustworthy.