How to Do Data Science Using SQL on Raw JSON

This post outlines how to use SQL to query and join raw data sets, such as nested JSON and CSV, for fast, interactive data science.

Data scientists and analysts deal with complex data. Much of what they analyze may be third-party data, over which they have little control. To make use of this data, significant effort goes into data engineering, which transforms and normalizes high-cardinality, nested data into relational databases or into output formats that can be loaded into data science notebooks to derive insights. At many organizations, data scientists or, more commonly, data engineers implement data pipelines to transform their raw data into something usable.

Data pipelines, however, regularly get in the way of data scientists and analysts reaching insights with their data. They are time-consuming to write and maintain, especially as the number of pipelines grows with each new data source. They are often brittle, handle schema changes poorly, and add complexity to the data science process. Data scientists also typically depend on others, usually data engineering teams, to build these pipelines, which slows how quickly they can get value from their data.

Once we have understood the overall structure of the nested JSON data set, we can start unpacking the parts we're interested in, using the UNNEST command in SQL. In our case, we care about the app name, the percentage increase in MAU month over month, and the company that makes the app.
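As a rough sketch of what such an unnesting query can look like, the snippet below builds a SQL string that flattens a nested array with UNNEST and selects the app name, month-over-month MAU growth, and company. The collection name (app_metrics), field names, and month filter are all hypothetical stand-ins, since the original data set is not reproduced here, and exact UNNEST syntax varies slightly between SQL engines.

```python
# A minimal sketch of an unnesting query, assuming a hypothetical collection
# named "app_metrics" whose documents hold a nested "monthly_stats" array.
# Exact UNNEST syntax varies slightly between SQL engines.
unnest_query = """
SELECT
    apps.name           AS app_name,
    monthly.mau_growth_pct,
    apps.company        AS company
FROM
    app_metrics AS apps
    CROSS JOIN UNNEST(apps.monthly_stats) AS monthly
WHERE
    monthly.month = '2019-07'
"""

# Execution is left to whichever SQL client you use against your engine
# (the post runs its queries on Rockset); here we only show the query text.
print(unnest_query)
```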

Once we have this table, we can export the data to dataframes and do some basic statistical calculations. Dataframes can also be used to visualize the percentage growth in MAU across the data set for a particular month.
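As an illustration, assuming the query results arrive as a list of dictionaries, the sketch below loads them into a pandas dataframe, prints summary statistics for MAU growth, and draws a simple bar chart. The sample rows and column names are made up for illustration and are not the post's actual results.

```python
import pandas as pd

# Made-up sample rows standing in for the query output above; the post's
# actual result set is not reproduced here.
rows = [
    {"app_name": "app_a", "company": "Acme Corp", "mau_growth_pct": 12.4},
    {"app_name": "app_b", "company": "Globex",    "mau_growth_pct": 31.9},
    {"app_name": "app_c", "company": "Initech",   "mau_growth_pct": 8.7},
]

df = pd.DataFrame(rows)

# Basic statistics on month-over-month MAU growth.
print(df["mau_growth_pct"].describe())

# A simple bar chart of growth per app (rendering requires matplotlib).
df.sort_values("mau_growth_pct", ascending=False).plot(
    kind="bar", x="app_name", y="mau_growth_pct", legend=False
)
```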

This final, larger query performs several operations one after another. In order, the operations it performs and the names of the intermediate SQL queries are:

We can then run one final query on the named query we have, ventureFundedAllRegions, to generate the best prospective investments for the investment management firm.
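The sketch below collapses that pattern into a single statement: intermediate results are expressed as named WITH subqueries, the last of which is ventureFundedAllRegions, and a final SELECT ranks prospects by growth. Only the name ventureFundedAllRegions comes from the post; the other query names, fields, joins, and thresholds are hypothetical placeholders for whatever the full query actually does.

```python
# A sketch of chaining named intermediate queries with WITH clauses, ending in
# ventureFundedAllRegions, and running a final query over it. Only the name
# ventureFundedAllRegions comes from the post; every other name, field, join,
# and threshold is a hypothetical placeholder.
final_query = """
WITH
    highGrowthApps AS (
        SELECT app_name, company, mau_growth_pct
        FROM app_metrics_flat
        WHERE mau_growth_pct > 20
    ),
    ventureFundedAllRegions AS (
        SELECT h.app_name, h.company, h.mau_growth_pct, f.total_funding
        FROM highGrowthApps h
        JOIN company_funding f ON f.company = h.company
    )
SELECT
    app_name,
    company,
    mau_growth_pct,
    total_funding
FROM
    ventureFundedAllRegions
ORDER BY
    mau_growth_pct DESC
LIMIT 25
"""

print(final_query)
```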

As we see above, we end up with data that can support decision making from an investment perspective. We started with applications that have posted significant month-over-month growth in active users. We then applied filters to impose constraints that improve the relevance of the list, and extracted additional details about the companies that created those applications to arrive at the final list of prospects above. In the entire process, we did not use any ETL processes to transform or wrangle the data from one format to another. The last query, which was the longest, took less than 4 seconds to run because Rockset indexes all fields and uses those indexes to speed up the individual queries.
