Exploring query types – Connecting to Data Sources
Exploring query types
When Power BI connects to any data source, it makes one of two types of connections or queries.
DirectQuery
The first type is called DirectQuery, or a live connection to the data. This means that Power BI is storing credentials and a connection string to the data source. For example, this could be the relational database storing inventory data used in a warehouse. This DirectQuery connection allows Power BI to query the inventory database and retrieve information such as the tables and views available in the source database, the schema of those tables and views, as well as records contained in the tables and views. Once a connection has been made, the data retrieved is used and displayed by Power BI. DirectQuery is useful when the underlying data has the potential to be changed quickly and the most up-to-date information is needed for use in Power BI reports and dashboards. For example, if a real-time inventory system has been implemented in the warehouse and the inventory system itself keeps an up-to-the-second record of goods for sale, then it’s possible the purchasing team may also need up-to-the-second reports that allow them to make the most informed decisions when it comes to refilling stock in the warehouse.
Import Query
The second type of connection is called Import Query. Import will connect to a data source and store the same credentials and connection string, but it will also import the data and store that in memory in Power BI (either the Desktop authoring tool or the Power BI online service). Import is useful when all the necessary data can fit into memory, but can also dramatically speed up the performance of reports inside Power BI since the underlying data source does not need to be queried for each report/dashboard view. Import Query makes Power BI do all the work (storing the data as well as calculating fields, rendering the dynamic report visuals, and so on), while DirectQuery allows Power BI to share some of this work with the underlying data source using a pushdown query. Import Query is generally the best place to start (from a performance perspective) as Power BI can optimally store data in its own internal VertiPaq format when using Import Query. VertiPaq is the proprietary storage engine inside Power BI and is highly optimized for data compression as well as calculations needed to render reports. For example, if a business analyst wants to create a new report for management and the data comes from four or five different Microsoft Excel files, then it’s likely that Import Query should be used. The performance of the report will be optimized since Power BI is handling the queries/calculations, storage of the data imported from Excel, and rendering of the visuals. This means no network latency and no dependency upon external data storage.
Import Query also has an additional feature that is important to note: it allows data to be refreshed. When using Power BI Desktop, this can be completed by clicking the Refresh button but once a report is published to the Power BI service, then it’s possible to schedule a refresh at a regular or timed interval. This is useful when reports need to import data into Power BI to provide the best report performance experience, but also need to show updated data (that doesn’t need to be up to the second). There are limitations on the number of times data can be refreshed by the Power BI service depending on the capacity where the report is deployed. For many organizations, there is a need for fresh data in reports and dashboards, but it doesn’t need to be up to the second, so for those use cases, a daily or few-times-a-day refresh will work and make for an excellent report viewing experience.
It should be noted that DirectQuery and Import Query support different data sources in different ways. For example, when the data source is a CSV file stored on a local filesystem, it must use Import Query because there is no underlying query engine to which Power BI can send a pushdown query. CSV data will get imported into the Power BI storage engine and used for reporting. Upon clicking Refresh, this data will again be pulled by Power BI from the source CSV file into the Power BI storage engine. Another example would be an Azure Synapse Analytics SQL dedicated pool (a common cloud-based data warehouse); in this case, the creator of the Power BI report has the option to choose either DirectQuery or Import Query. Knowing the trade-offs between DirectQuery and Import Query is important in this case, as they will have an impact on both the Power BI report and the underlying data source.
Figure 2.2 – Query performance versus freshness comparison
For example, if Import Query uses Synapse Analytics SQL dedicated as a source, but those who manage the service regularly pause the SQL dedicated pool over the weekends, then it would be important to note that Power BI would not be able to do a scheduled refresh at this time. Additionally, there would be more load on the SQL dedicated pool when the Power BI refresh is running.
When a query is created in Power BI, then a definition of the data is created in the data model. This component of the Power BI data model is called a Power BI dataset. Next, we’ll look at Power BI datasets in depth.