A couple of redress points here.
Regarding On-Disk File Size(s)
I too was looking for a solution for this, which is why I looked toward Parquet / Avro as a solution. Both are binary file formats that have the schemas embedded with them. From them, you could derive your Fact Tables (things needed for plot rendering). Typically, you have a Master DataSet set containing all rows / columns, then create a sub-set fact tables, or in some cases, a separate Parquet DataSet that you serve your plots with. This could be any combination of the Master columns and rows. Reducing that set down to only whats needed for a particular plot can yield huge disk saving and read speed increases.
Using Parquet / Avro file formats dramatically saves on long-term disk space usage. This is why I created the Pandas Parquet Compression Test. As you can see, the base file size was about 3.7 GB and Snappy Compression (the default Parquet compression) comes in at 667 MB, or roughly a 5 to 1 reduction. Gzip and Brotli come in a couple hundred MB smaller (440MB to 470 MB ish) if one is really crunched for disk space.
With those high compression levels, I was concerned about read speeds, but that turned out to be a non-issue. During my PyArrow Read Tests, I was able to read 47+ million rows, do a grouby and count in =< 2.01 second with Snappy and Brotli. That's fast considering I was reading all rows and all columns. Read times would be much faster on a limited DataFrame either by tale of select columns.
In any case, there's lots of ways to clean this fish, but having a good idea of what your output needs will be, at least initially, can help define your back end source file strategy. While Databases certainly make it easy (initially) they aren't always the best long term solution with large datasets. I've been breaking my groups up into yearly blocks. The cool thing about parquet is, you can append to the storage rather easily. If I need multiple years, I just do two year group queries. You could add them all together, but, that can get really large and DataFrames need to fit into Memory, as that's where Spark does it's processing.