New to WSPR Daemon


Greg Beam <ki7mt01@...>
 

Hello All,

I just recently found WSPR Daemon (very cool!), and I'm very interested in how it will progress over the next sunspot cycle. I had started a similar project (on a much, much smaller scale) for WSPRnet data. I am (was) mainly focused on using Spark (Scala), PySpark, Apache Arrow, and various utilities to convert those monster CSV files to Parquet / Avro formats; not so much on the visualization (but that may change now :-) ).
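
For illustration, a minimal PySpark sketch of that kind of CSV-to-Parquet conversion might look like this; the file name is a placeholder and the schema is simply inferred, not the actual WSPRnet column layout:

    # Hypothetical example: convert a monthly WSPRnet CSV dump to Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wsprnet-csv-to-parquet").getOrCreate()

    # Let Spark infer column types from the raw CSV (placeholder file name).
    spots = spark.read.csv("wsprspots-2020-11.csv", header=False, inferSchema=True)

    # Write Snappy-compressed Parquet (the default codec) for later analysis.
    spots.write.mode("overwrite").parquet("wsprspots-2020-11.parquet")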

I have a truck-load of content to add, but I did start working on a db-init script for a local PostgreSQL DB (wsprnet, tutorial) and put up a few Scala apps / Python sample scripts for CSV conversion testing. I've not added the tutorial tables to the init script yet, but will do so soon.

The Docs site is far from complete, but here are a few links so folks can see the direction I'm heading (in time there will be visualization plots). I've added a WSPR-Daemon section where things related to this project will go. The concepts are the same for both WSPRnet and WSPR-Daemon in terms of processing data in volume.

Some examples:

As I said earlier, there's a ton to be added, including a full tools-installation section for those interested in either project using Spark-related tooling. I was planning on doing some Vagrant testing of the Daemon script before the holidays ended, but I did not get to it. There are just too many fun things to do, and not enough time to do them.

Anyway, I look forward to seeing how things progress here with WSPR Daemon !

73's
Greg, KI7MT


Gwyn Griffiths
 

Hello Greg and welcome to WsprDaemon

I'm pleased that you have found this resource and look forward to future updates on what you have been able to do with the analysis tools you've outlined. I have to say, I was completely unaware of the packages and utilities you list. In part, that's my limited horizon, but I guess it's also an indication of just how many different approaches and tools there are available. 

Are you happy for me to add the links above, with a short introduction, to the Annex in my Timescale documentation that covers data access options?

73
Gwyn G3ZIL


Greg Beam <ki7mt01@...>
 

Hi Gwyn,

Sure, you can add whatever you think would be helpful. Documenting everything is a slow process due to work, but I'll slowly get there.

73's
Greg, KI7MT


Greg Beam <ki7mt01@...>
 

Hi Gwyn,

I should probably clarify the use of these types of tools a bit more so as to not confuse folks. What I've added so far targets WSPRnet CSV files. I'll be adding the same or similar for the WSPR Daemon schemas.

The primary purpose of Spark is to map-reduce a given DataFrame / DataSet. Say, for example, you have a year's worth of spot data that you want to plot, compare, or otherwise process. The steps would go something like this:

  • Select just the columns you want from the Parquet partitions (timestamp, field1, field2, field3, etc.)
  • Perform the aggregations (SUM, AVG, Count, STATS, Compare, or whatever you need)
  • Plot or save the results to csv/json or post to a DB Fact Table.
The plot or save stage is where the performance gain comes in, as it's all done in parallel on a Spark cluster (standalone or multi-node). While this may not sound overly impressive, it is. Consider the November 2020 WSPRnet CSV file: it has 70+ million rows of data across 15 columns. When one adds the remainder of the year, you could easily be over 500 million rows of data. Doing aggregate functions on datasets of that scale can be very expensive time-wise. If one has 20 or so results to process every day of every month, down to the hour level in a rolling fashion, it becomes impractical to do in a single thread.
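
As a rough illustration of those three steps, a PySpark sketch might look like the following; the column names (timestamp, band, snr) and paths are hypothetical placeholders rather than the real schema:

    # Hypothetical select -> aggregate -> save pipeline over a Parquet dataset.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("wspr-aggregate-demo").getOrCreate()

    # 1. Select just the columns needed from the Parquet partitions.
    spots = (spark.read.parquet("wsprnet_2020.parquet")
                  .select("timestamp", "band", "snr"))

    # 2. Perform the aggregations (hourly counts and average SNR per band).
    hourly = (spots.groupBy(F.date_trunc("hour", "timestamp").alias("hour"), "band")
                   .agg(F.count("*").alias("spot_count"),
                        F.avg("snr").alias("avg_snr")))

    # 3. Save the much smaller result for plotting or a DB fact table.
    hourly.coalesce(1).write.mode("overwrite").csv("hourly_counts", header=True)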

I've not added any streaming functions, but Spark also allows for continuous ingestion of data from file/folder monitoring, UDP ports, channels, and other sources. I can see many use-cases with WSPR Daemon and Spark stream processing of spot data from multiple Kiwis with multi-channel monitoring on each device. You could use it to process the data, or simply post it to a staging table for downstream analytics. Staging data for downstream activity is commonly used for things like server logs or web-page clicks from millions of users. However, it doesn't matter what the source data is, only that it's coming in on intervals or continuously.
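
For example, a minimal Structured Streaming sketch (assuming new spot CSV files land in an "incoming/" folder; the schema and paths are hypothetical) could stage data continuously like this:

    # Hypothetical continuous ingestion of spot files into a Parquet staging area.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, TimestampType, StringType, DoubleType

    spark = SparkSession.builder.appName("wspr-stream-demo").getOrCreate()

    # Streaming file sources need an explicit schema (placeholder columns here).
    schema = StructType([
        StructField("timestamp", TimestampType()),
        StructField("reporter", StringType()),
        StructField("snr", DoubleType()),
    ])

    stream = spark.readStream.schema(schema).csv("incoming/")

    # Append each micro-batch to Parquet for downstream analytics.
    query = (stream.writeStream
                   .format("parquet")
                   .option("path", "staging/spots")
                   .option("checkpointLocation", "staging/_checkpoints")
                   .outputMode("append")
                   .start())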

If you're into machine learning and predictive analytics, Spark's MLlib also provides a powerful set of tools.

Essentially, Spark provides (in cluster or standalone modes):

- DataSet / DataFrame map-reduce capabilities
- Stream processing of data and files
- Machine learning and predictive analytics (MLlib)

73's
Greg, KI7MT


Gwyn Griffiths
 

Hello Greg
Thanks for the additional explanations and details. They give a useful picture of where you are with CSV file data and the approaches that you take. It is clear that, once past the first stage of getting the data columns you want, the subsequent steps from wsprnet CSV or WsprDaemon TimescaleDB files will be the same. As fields such as numerical lat and lon for tx and rx are already in WsprDaemon, this may reduce the load at the analysis steps - that was our hope.

I am in no doubt that multithread (and cluster) approaches are needed with this data. WsprDaemon already has 390 million spots online (from July 2020 onward), and this can, and does, result in slow responses to a number of queries and hence to the Grafana graphics. For now, in the scheme of things, being able to see a plot of spot count per hour of day for each day over six months in a few tens of seconds is still useful, and a marvel.

But these 390 million spots are only taking up 138 GB of the 7 TB disk space Rob has made available - so different approaches, such as those you describe, are going to be needed to look at data over the whole sunspot cycle that WsprDaemon should be able to hold.

Thanks for permission to abstract from your posts on this topic for our TimescaleDB guide.

73
Gwyn G3ZIL


Greg Beam <ki7mt01@...>
 

Hi Gwyn,

A couple of points to address here.

Regarding On-Disk File Size(s)
I too was looking for a solution to this, which is why I looked toward Parquet / Avro. Both are binary file formats with the schema embedded in them. From them, you can derive your Fact Tables (the things needed for plot rendering). Typically, you have a Master DataSet containing all rows / columns, then create sub-set fact tables or, in some cases, a separate Parquet DataSet that you serve your plots from. This could be any combination of the Master columns and rows. Reducing that set down to only what's needed for a particular plot can yield huge disk savings and read-speed increases.
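
A hedged pandas sketch of deriving one such fact table from a master Parquet dataset (file and column names are placeholders):

    # Hypothetical fact-table extraction: read only the columns one plot needs.
    import pandas as pd

    fact = pd.read_parquet("master_spots.parquet",
                           columns=["timestamp", "band", "snr"])

    # Persist the reduced set as its own, much smaller Parquet file.
    fact.to_parquet("fact_band_snr.parquet", compression="snappy")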

File Compression
Using Parquet / Avro file formats dramatically saves on long-term disk space usage. This is why I created the Pandas Parquet Compression Test. As you can see, the base file size was about 3.7 GB, and Snappy compression (the default Parquet compression) comes in at 667 MB, or roughly a 5:1 reduction. Gzip and Brotli come in a couple hundred MB smaller (440 MB to 470 MB or so) if one is really crunched for disk space.
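
The comparison itself only takes a few lines of pandas; this is just a sketch, and the input file name is a placeholder:

    # Hypothetical compression comparison for one monthly CSV dump.
    import os
    import pandas as pd

    df = pd.read_csv("wsprspots-2020-11.csv", header=None)
    df.columns = [f"col{i}" for i in range(df.shape[1])]   # Parquet needs string column names

    for codec in ("snappy", "gzip", "brotli"):
        out = f"spots_{codec}.parquet"
        df.to_parquet(out, compression=codec)
        print(codec, round(os.path.getsize(out) / 1e6, 1), "MB")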

Read Speeds
With those high compression levels, I was concerned about read speeds, but that turned out to be a non-issue. During my PyArrow read tests, I was able to read 47+ million rows and do a groupby and count in <= 2.01 seconds with Snappy and Brotli. That's fast considering I was reading all rows and all columns. Read times would be much faster on a limited DataFrame, e.g. reading only selected columns.
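
A sketch of that sort of read test with PyArrow (file and column names hypothetical):

    # Hypothetical read-speed check: load the full table, then groupby + count.
    import time
    import pyarrow.parquet as pq

    start = time.perf_counter()
    table = pq.read_table("spots_snappy.parquet")   # all rows, all columns
    df = table.to_pandas()
    counts = df.groupby("band").size()
    print(f"{len(df):,} rows grouped in {time.perf_counter() - start:.2f} s")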

In any case, there are lots of ways to clean this fish, but having a good idea of what your outputs need to be, at least initially, can help define your back-end source file strategy. While databases certainly make it easy (initially), they aren't always the best long-term solution with large datasets. I've been breaking my groups up into yearly blocks. The cool thing about Parquet is that you can append to the storage rather easily. If I need multiple years, I just do two year-group queries. You could add them all together, but that can get really large, and DataFrames need to fit into memory, as that's where Spark does its processing.
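
Appending works by adding new files under a partition directory rather than rewriting the dataset; a hedged PyArrow sketch (paths and the "year" partition column are assumptions):

    # Hypothetical append of a new year's data to a year-partitioned Parquet store.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    new_year = pd.read_parquet("wsprspots-2021.parquet")
    new_year["year"] = 2021

    # write_to_dataset adds a year=2021 directory alongside existing partitions.
    pq.write_to_dataset(pa.Table.from_pandas(new_year),
                        root_path="spots_by_year",
                        partition_cols=["year"])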

73's
Greg, KI7MT


Rob Robinett
 

Hi,

I am very pleased to see this discussion.  A recent email from Timescale suggests that long time span queries can be accelerated by defining continuous aggregate tables.  I wonder if those would help us as our database grows?


Rob




--
Rob Robinett
AI6VN
mobile: +1 650 218 8896


Gwyn Griffiths
 

Rob, Greg
Rob - Continuous aggregates are on my list of topics to add to the examples I have in my TimescaleDB Guide. I'll need to check whether they can be used with aggregates such as percentiles as well as the example aggregates provided by Timescale. Hourly counts and averages spring immediately to mind. One approach would be to figure out what would be a useful Grafana dashboard that only used aggregates.
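
For reference, a continuous aggregate for hourly counts would look something like this sketch, run here from Python with psycopg2; the table and column names (wsprdaemon_spots, time, band) are placeholders, and the syntax follows the TimescaleDB 2.x documentation:

    # Hypothetical hourly spot-count continuous aggregate, created from Python.
    import psycopg2

    ddl = """
    CREATE MATERIALIZED VIEW spots_hourly
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 hour', time) AS bucket,
           band,
           count(*) AS spot_count
    FROM wsprdaemon_spots
    GROUP BY bucket, band
    WITH NO DATA;
    """

    conn = psycopg2.connect("dbname=wsprdaemon user=reader")
    conn.autocommit = True   # continuous aggregates cannot be created inside a transaction
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.close()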

Greg - Thank you for more thought-provoking points. I wonder if discussion during the weekly Wednesday WsprDaemon Zoom meetings that Rob holds might be useful? For me personally, I have learnt as I've implemented the WsprDaemon database, first with Influx and then Timescale; not the best way to learn, perhaps, but the result is a good (if not the 'best') approach to data storage and serving data to users with a whole host of applications via a growing number of interfaces.

73
Gwyn G3ZIL


Greg Beam <ki7mt01@...>
 

Hi Rob, Gwyn,

Apologies for the spam - I somehow sent the message before I was done writing it (too many thumbs, I guess).

In any case, I suspect your storage needs will differ substantially based on use cases. As I was saying:

  • How much data do you want to provide on your real-time endpoints (hot storage) in the PostgreSQL DBs?
  • How much, where, and in what format to keep long-term archives (cold storage): gz, zip, parquet, avro, etc.?

I've used Timescale some, but only for personal learning / testing, never in a production environment. The aggregate functions (continuous or triggered) look to be a really cool feature. The materialized tables would be what I was referring to above as Fact Tables. I would be interested in seeing how that works with a constant inflow of data, as materialized views in PostgreSQL can put a heavy load on servers with large datasets.

I would think, at some point, you'll need/want an API on the front end rather than having public users go to the database directly (could be wrong). That could help determine which materialized views (aggregates) you want to provide via public APIs, and for which you instead provide instructions so users can build their own datasets from cold-storage files. Either way, it would take a hefty PostgreSQL server to handle years of data at the scale you're forecasting here.

I saw the VK7JJ and WSPR Watch 3rd-party interfaces. I don't know their implementation details (I suspect direct DB access), so it's hard to say, but Parquet files would not be a good solution for that type of dynamic need.

73's
Greg, KI7MT


Gwyn Griffiths
 

Hello Greg
On your recent points:
1. We have not discussed archival; our current offering is access to online, uncompressed data for an 11-year sunspot cycle, as we've described in a 2020 TAPR/ARRL Digital Communications Conference paper at https://files.tapr.org/meetings/DCC_2020/2020DCC_G3ZIL.pdf

2. To support that we have an Enterprise licence from TimescaleDB allowing automatic data tiering between main memory (192GB), SSD disk (550GB) and the 7TB RAID. Both are already pretty hefty ... See https://docs.timescale.com/latest/using-timescaledb/data-tiering

3. We're using 30 day 'chunks' in TimescaleDB jargon. Hence they can be variable in size. The current chunk is entirely in main memory.

4. There's an outline diagram at the bottom of the page at http://wsprdaemon.org/technical.html
    You'll see two Rob-owned servers at independent sites. They take in data independently and provide resilience. There's also a third machine, a rented Digital Ocean Droplet with just the latest 7 days' data, to serve immediate, 'now' data needs should there be problems with both main servers.

5. Thanks for your comments on Aggregates - I'll post a comment when I have some results to share.

6. As for public APIs: VK7JJ, WSPR Watch, and Jim Lill's site at http://jimlill.com:8088/today_int.html already access WsprDaemon using three different methods (node.js, Swift, and bash/psql), and we've had a recent post in this forum on using R. This is how we would like to work - leaving the public-facing interfaces to others.

7. My documentation at
http://wsprdaemon.org/ewExternalFiles/Timescale_wsprdaemon_database_queries_and_APIs_V2.pdf
currently provides detailed instructions for access for node.js, Python, bash/psql, KNIME and Octave and provides links for seven other methods. I'd envisage adding a detailed section on the method you intend to use when available, and I'll be adding a detailed section on R this coming week based on material from Andi on this forum.

best wishes
Gwyn G3ZIL


Greg Beam <ki7mt01@...>
 

Hi Gwyn,

I've been slammed with web / domain transfer work the last few weeks and haven't had much time for radio-related activity, though I did manage to splash out on a new rig: Elecraft => K4HD.

Thanks for the additional links to your project objectives as that clarified a number of things.

Regarding PySpark - I used that because it was / is easy to get up and running. However, the real power / speed with Spark (IMHO) comes with Scala. It's much faster as it's compiled rather than interpreted at runtime. It's also, like Java, type-safe, so there's little-to-no concern about mangling data types during one's processing steps.

Paring down (map-reducing) larger data sets into fact tables is where I'll probably focus most of my efforts, as that allows any number of rendering / graphing engines to easily consume the output. Getting the data into a usable structure is my first goal. After thinking on this more, I'll probably create an installable Linux package that wraps each Scala JAR assembly with a bash script so it's easier for end users to call. This is what I'm pondering, but I've not set anything in motion yet.

Re ClickHouse - it looks very interesting (and fast). That may be the fastest solution I've seen to date for consuming large chunks of data in a columnar format. I glanced through the docs; it looks to be very powerful indeed and definitely warrants a more thorough read.

73's
Greg, KI7MT