SQL WHILE Loop is Slow and Takes a While (WRITELOG Wait Type)

SQL WHILE Loops can be a convenient way to add varying amounts of mock data to a table for testing.  But the WHILE Loop can also be a good way to exercise different functionality within SQL Server itself.  When using a WHILE Loop to insert hundreds of thousands of rows, you might find it taking a lot longer than you expect – even minutes just to insert half a million narrow rows.  Of course this could be caused by any number of issues, but in this post I am going to focus on one particular culprit – the WRITELOG wait type – that I recently ran into.

Setup

Let's start with a simple example.  First, create a basic test table.


CREATE TABLE [dbo].[MyTable](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [SomeText] [varchar](10) NULL,
    [MoreText] [varchar](50) NULL
) ON [PRIMARY]

Then execute a WHILE Loop that will insert 500k records.

TRUNCATE TABLE dbo.MyTable
DECLARE @Iterations int = 500000
DECLARE @Counter int = @Iterations
WHILE @Counter > 0
BEGIN
    INSERT INTO [dbo].[MyTable]
        ([SomeText]
        ,[MoreText])
    VALUES
        ('Num:' + CONVERT(varchar,@Counter)
        ,CONVERT(varchar,@Counter) + ' bottles of beer on the wall')
    SET @Counter = @Counter - 1
END
SELECT *
FROM dbo.MyTable

Your results will vary, but for me that took ~ 1 minute on an otherwise idle Core i7 laptop with 16GB of RAM and an SSD – a lot longer than you might expect.

Why So Slow?

Let's first take a look at Activity Monitor while we run the above script again (it will TRUNCATE and reload the table).  You'll see that after a few seconds the INSERT query gets suspended with a Wait Type of WRITELOG.
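If you prefer a query to Activity Monitor, a quick look at the sys.dm_exec_requests DMV from a second query window while the loop is running shows the same thing (the session_id filter below is just a rough way to skip most system sessions):

-- Run from a second session while the WHILE Loop is executing
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,       -- should show WRITELOG while the request is suspended
       r.wait_time,       -- milliseconds spent on the current wait
       r.last_wait_type
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50;   -- rough filter to skip most system sessions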

What is WRITELOG?

The WRITELOG wait type occurs while, wait for it…, “waiting for a log flush to complete. Common operations that cause log flushes are checkpoints and transaction commits.”  Ok, so what is a log flush?  Without going too far off topic here, we need to very quickly discuss a bit about the SQL Server Transaction Log (much more detail here).

When data in SQL Server is modified, the first thing that happens is SQL Server checks whether the needed data page is in memory and, if not, reads it from disk into memory.  Once the required data page is in memory, the changes are applied and written to the Log Buffer – also in memory – which stores log records until they are written to disk.  Before the “dirty” pages can be written to the data files on disk, the Log Buffer records must first be “flushed” and written to the log file itself.  It is at this point that the WRITELOG wait type is encountered, as more data changes trigger more log flushes.

Per Paul Randal, log flushes basically occur when either a transaction is committed or when a maximum of 60KB of log records has been generated since the last log flush.  To maximize performance, we generally want log records flushed to disk as efficiently as possible – which often means fewer, larger log flushes rather than tons of tiny ones, each making another trip back to disk.

Which brings us back to our WHILE Loop.

How Can We Tell What The WHILE Loop is Doing?

So with that background out of the way, let's take a closer look at exactly what is happening when our above script is running.  Microsoft was kind enough to provide a number of Performance Monitor counters (in Windows 10, just start typing “Perf” at the Start Menu to find it) that we can use to see the details of the log flush activity within SQL Server.

Click on the + to add new performance counters, scroll down to the SQLServer:Databases category, and then select the below three log flush related items.

Select Log Flush Wait Time, Log Flush Write Time, and Log Flushes/sec (Log Flush Waits/sec tends to overlap on the graph with Log Flushes/sec, so we will leave it out).
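As a side note, if you would rather stay inside SSMS, roughly the same counters are exposed through the sys.dm_os_performance_counters DMV.  A sketch is below – note that the /sec counters in the DMV are cumulative raw values, so to get a per-second rate you need to sample twice and take the difference, and the database name here is just a placeholder:

SELECT [object_name],
       counter_name,
       instance_name,   -- the database name
       cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Log Flushes/sec', 'Log Flush Wait Time', 'Log Flush Write Time (ms)')
  AND instance_name = 'MyTestDB';   -- placeholder database name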

Now click OK and let's re-run the WHILE Loop script from above to INSERT another 500k records.

Your mileage will vary, but you should get graphs similar to the above.  You can see that right around 10:57:45 the query starts, log flushes/sec jumps, and so do the wait and write times.  It's a little confusing because the scales are different for each counter, but if you highlight one of the counters in the pane at the bottom, you can see the Last/Average/Min/Max values.  You can see above (if it's not too small to read) that the max value for Log Flushes/sec (blue line, just above 10 on the vertical axis) was ~12,500.  You can't tell exactly from the graph, but the query ran for 44 seconds.  Of course we need to take into account the roughness of the calculation, but 44 sec x 12,500 log flushes/sec = 550,000 log flushes.  Which is right around our 500k row mark!  So from this we can infer that there was essentially one log flush, and one write to disk, for every. Single. Row.  Which makes sense because, with the WHILE Loop, each iteration of the loop – unless otherwise specified – is treated as a single transaction to be committed.

*Honestly I was kind of surprised when I did the math and it worked out…

Wrapping it Up

Now let's tweak the query by wrapping the loop in a BEGIN TRAN and COMMIT TRAN.

DECLARE @Iterations int = 500000
DECLARE @Counter int = @Iterations
BEGIN TRAN
WHILE @Counter > 0
BEGIN
    INSERT INTO [dbo].[MyTable]
        ([SomeText]
        ,[MoreText])
    VALUES
        ('Num:' + CONVERT(varchar,@Counter)
        ,CONVERT(varchar,@Counter) + ' bottles of beer on the wall')
    SET @Counter = @Counter - 1
END
COMMIT TRAN

Now let's clear the performance counters, select the same ones, and run the new query.

This time the query took 17 seconds, and as you can see (*the scale is different by a factor of 100 this time), there were substantially fewer log flushes.  In fact, the max this time was only 500 log flushes/sec.  What this tells us is that by wrapping the WHILE Loop of INSERTs inside one transaction, SQL Server waits until the log buffer has accumulated ~60KB of log records before flushing it to disk.  This leads to far fewer I/O calls, and thus a big improvement in performance, by doing fewer, but larger, writes to the log file.

I’m not exactly sure why there is the small ramp up, then the jump to 500 flushes/sec, but I was able to reproduce a similar pattern during multiple attempts.  If anyone has an explanation, please leave it in the comments!
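If holding half a million INSERTs in one open transaction isn't appealing (for log space or blocking reasons), a common middle ground is to commit in batches.  The sketch below commits every 10,000 rows – the batch size is an arbitrary placeholder to tune for your own system – so you still get relatively few, larger log flushes without one giant transaction:

DECLARE @Iterations int = 500000
DECLARE @Counter int = @Iterations
DECLARE @BatchSize int = 10000   -- placeholder; tune for your workload

BEGIN TRAN
WHILE @Counter > 0
BEGIN
    INSERT INTO [dbo].[MyTable]
        ([SomeText]
        ,[MoreText])
    VALUES
        ('Num:' + CONVERT(varchar,@Counter)
        ,CONVERT(varchar,@Counter) + ' bottles of beer on the wall')
    SET @Counter = @Counter - 1

    -- Commit the current batch and start a new transaction
    IF @Counter % @BatchSize = 0
    BEGIN
        COMMIT TRAN
        BEGIN TRAN
    END
END
COMMIT TRAN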

Summary

In this example we saw that, depending on how you work with transactions – and by extension log flushes – a WHILE Loop can behave very differently.  Depending on how transactions are structured, you may run into different bottlenecks, such as the WRITELOG wait type, which can cause your query to run slower than expected.  Using Performance Monitor, we were able to tie the WRITELOG wait type to its cause, log flushes, and measure how small changes to the query affect this underlying performance issue.

 

What happens to an Identity column value when an INSERT is rolled back in SQL?

What happens to an Identity column in a table when an INSERT statement fails and is rolled back before it is committed in SQL Server?  I came across this question while playing around with some unit tests in SQL and looking at how the unit testing framework rolled back transactions during teardown.  In a unit test, I was using an explicit Id value in a WHERE clause – say WHERE Id = 2 – and was not getting a result.

To see exactly what happens with the identity column, we can walk through a few simple transactions.

--Create table with an Identity column seeded to 1
CREATE TABLE IdentityRollbackTest (
Id int IDENTITY(1,1)
,SomeText varchar(10)
,MoreText varchar(50)
)
--Insert first record
INSERT INTO IdentityRollbackTest (SomeText,MoreText)
VALUES ('First','Identity rollback test row')
--Check that row was inserted
SELECT *
FROM IdentityRollbackTest

First we create the table, IdentityRollbackTest, INSERT one row, and check the results.  As expected there is one row, Id = 1, in the table.

And DBCC CHECKIDENT (IdentityRollbackTest) returns the below.  Remember that DBCC CHECKIDENT returns the current (i.e., last used) identity value – in this case, 1.

Now let's try to INSERT a second row, but roll back the transaction.


--Try to insert a second row, but rollback the transaction
BEGIN TRAN
INSERT INTO IdentityRollbackTest (SomeText,MoreText)
VALUES ('Second','Identity rollback test row')
ROLLBACK TRAN
--Check that second row was correctly rolled back
SELECT *
FROM IdentityRollbackTest

As expected, we still only get 1 row returned from the SELECT *.

BUT, this time when we run DBCC CHECKIDENT(IdentityRollbackTest)

We get a current identity value = 2!

Now to confirm what happened, we run a third INSERT statement, which we commit.


--Try to insert a third row, allow this one to commit
INSERT INTO IdentityRollbackTest (SomeText,MoreText)
VALUES ('Third','Identity rollback test row')
--Check that the third insert attempt did succeed, and that the Id column value is 3. This confirms that the identity value was
--incremented even though the second row insert was rolled back
SELECT *
FROM IdentityRollbackTest

When we SELECT * FROM IdentityRollbackTest this time, we see the Id has in fact incremented to 3, but we only have the two rows from the two statements we allowed to commit.

Why does this happen?

At first this behavior had me confused – why did the ROLLBACK not also roll back the incrementing of the identity column?  Then I did a bit of googling and found the explanation – https://stackoverflow.com/questions/282451/sql-identity-autonumber-is-incremented-even-with-a-transaction-rollback .  Since the identity column's only requirement is to be unique, not sequential, if transaction A starts and then transaction B starts, B can't wait to see whether A commits before “claiming” its own identity value.  Once an INSERT command has been issued, an identity value is “claimed” by the transaction, even if that transaction is never committed.  If the transaction never commits, that identity value is “lost” for good.

It’s a feature, not a bug.
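As an aside, if you ever do need to reset the counter after a rolled back INSERT (really only worth doing in a test scenario – in production the gaps are harmless), DBCC CHECKIDENT can reseed it.  A sketch for this example table:

-- Reseed so the next insert gets Id = 2 again.
-- Only safe if you are sure no higher Id values already exist in the table.
DBCC CHECKIDENT ('dbo.IdentityRollbackTest', RESEED, 1);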

Writing SQL Server Unit Tests Using Visual Studio, NUnit, and SqlTest

Unit testing SQL development is generally more difficult than other, more “traditional”, application development, such as with .Net.  The nature of databases and transactions doesn't always lend itself to easily running a large number of small, individual tests of things such as complicated queries or stored procedures.  Some of this is due to our ingrained design patterns as SQL developers, and some is due to a lack of tools.  Thankfully, on the subject of tools, there has been a lot of progress over the last several years.

A colleague of mine, Mark Wojciechowicz, recently took it upon himself to create a library that can be used in Visual Studio with the NUnit framework and the NUnit3TestAdapter (both can be downloaded via NuGet) to conduct unit tests of SQL Server objects – aptly named SqlTest.  You can read his introductory post about SqlTest, and some of the deficiencies of existing tools, at his website – https://www.thisintelligentlife.net/introducing-sqltest – and check out the code on GitHub.

In this post I am going to walk through my own experience setting up some simple unit tests with NUnit and SqlTest.  I only recently became familiar with unit testing for SQL Server myself, so hopefully this post can help beginners get started.  I hope to expand on this with a series of posts on CICD in the near future, including using SqlTest to test SSIS packages and integrating SqlTest-based unit tests with TFS.

Setup

First we are going to need to set up a Database Project in Visual Studio and add a few NuGet packages to our solution.  In this example I used SQL Server 2016 and VS 2015, and I have a solution called SandboxDB which contains a SandboxDB SQL Server DB project and a SandboxDBTests C# class library project.  The SandboxDB database project is a simple local database with just a few basic tables and a stored procedure, InsertCustomer, which takes 3 parameters and INSERTs a record into dbo.Customer.  The CREATE scripts for this sample db can be found here.  Any db project will work as long as it builds and the tests are modified accordingly.
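In case that link isn't handy, here is a rough sketch of what the dbo.Customer table and the dbo.InsertCustomer stored procedure might look like – the column names and data types are my assumptions, so adjust them to match your own project:

CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) NOT NULL,
    FirstName  varchar(50) NULL,
    LastName   varchar(50) NULL,
    BirthDate  date NULL
);
GO

CREATE PROCEDURE dbo.InsertCustomer
    @FirstName varchar(50),
    @LastName  varchar(50),
    @BirthDate date
AS
BEGIN
    INSERT INTO dbo.Customer (FirstName, LastName, BirthDate)
    VALUES (@FirstName, @LastName, @BirthDate);
END
GO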

Before you are able to select the NuGet package manager option in VS, you first need to add a new C# class library project to your solution.  Once you have added a class library project, you can go to the below menu.

(Screenshot: the NuGet Package Manager menu option)

The three packages we need to install are:

–     NUnit
–     NUnit3TestAdapter
–     SqlTest
(Screenshot: the three NuGet packages to install)

After installing these 3 packages, your references should look like the below.  The SqlTest package should add a Samples folder within your class library project containing sample unit tests for SQL and SSIS.  The SandboxDBTests project contains the references to NUnit and SqlTest, as well as the Samples folder installed by SqlTest.  For now, we will exclude the SsisSample.cs file from the project and focus on the SQL db unit tests.

 

(Screenshot: project references after installing the packages)

Writing a Test

Once installed, the next step is to edit the samples to fit our own db project, including setting the App.config file.  Below we set the connection string name to Sandbox, the Data Source to my named instance of SQL Server (localhost\MSSQLSERVER2016), and the Initial Catalog to Sandbox for our demo.  Later, this is also where we would configure the settings for our SSIS Server.

<configuration>
  <connectionStrings>
    <add name="Sandbox" connectionString="Data Source=localhost\MSSQLSERVER2016;Initial Catalog=Sandbox;Provider=sqloledb;Integrated Security=SSPI;" />
  </connectionStrings>
  <appSettings>
    <add key="ssisServer" value="localhost" />
  </appSettings>
</configuration>

Next, in SqlSample.cs – the file generated as an example when we installed SqlTest – we modify the SqlTestTarget and connection strings to match our project, such as:


public class SQLUnitUnderTest
{
    TransactionScope scope;
    SqlTest.SqlTestTarget SandboxDb;

    [SetUp]
    public void Setup()
    {
        scope = new TransactionScope();
        SandboxDb = new SqlTest.SqlTestTarget("Sandbox");
    }


Thanks to NUnit, we have a robust framework to Setup, TearDown, Arrange, Act, and Assert our tests.  Setup prepares a new transaction and performs any common Db preparations; TearDown returns the database to its original state by rolling back any transactions used in the test.  Arrange, Act, and Assert correspond to a test's specific preparation, execution of the unit of code being tested, and validation that the results are as expected, respectively.

Each [Test] block designates a separate unit test.  It is helpful to name each method/test with a naming standard such as UnitUnderTest_Action_ExpectedOutcome – for example, InsertCustomer_Insert_OneRow: stored procedure “InsertCustomer”, action “INSERT”, expected outcome “one row”.  After the Assert, the TearDown method is called to roll back the transaction.

In the below example test, first we TRUNCATE the Customer table to set it up for the test, next we call the dbo.InsertCustomer stored procedure and pass in the required parameters, and lastly we do a SELECT COUNT(1) to make sure we return the expected value, one row.


[Test]
public void InsertCustomer_Insert_OneRow()
{
    //Arrange
    SandboxDb.ExecuteAdhoc($"TRUNCATE TABLE Customer;");
    //Act
    SandboxDb.ExecuteAdhoc($"Exec dbo.InsertCustomer 'Steve', 'Rogers','7/4/1920';");
    //Assert
    var actual = SandboxDb.GetActual("SELECT COUNT(1) AS TotalRows FROM dbo.Customer");
    Assert.That(actual, Is.EqualTo(1));
    //The teardown method runs after every test and will rollback all actions in one transaction
}

The NUnit framework is what provides the Assert.That method and the various constraints, such as Is.EqualTo, which we can use to check our results.  In the above example, we are checking that the variable “actual” is equal to the integer 1.  There are a ton of possibilities built into NUnit assertions that you can use for your tests.

Running a Test

Once the code is set up and built, we are ready to run our test.  Having installed the NUnit3TestAdapter NuGet package above, our NUnit SqlTests can be run directly from within Visual Studio.  If the Test Explorer window is not open, go to Test > Windows > Test Explorer to open the pane.  The Test Explorer should display something like the below.  If it says there are no tests, make sure to build the project containing the tests.  The blue “!” indicates that the test has yet to be run for this build.

(Screenshot: the Test Explorer pane)

Right click on the InsertCustomer_Insert_OneRow test and click Run Test.  If all goes well, the test will pass.

(Screenshot: the test passing in Test Explorer)

Congrats! That’s all there is to it to run a simple Unit Test of a stored procedure using NUnit and SQLTest.  From here you can copy/paste the [Test] code block (this is how NUnit recognizes that a method is a test) and modify it accordingly to run various tests on various objects.  For each [Test] code block you add, you will get another test listed in the test explorer.  You can even sort the tests based on what class they are in to organize them for each object you want to test.

(Screenshot: Test Explorer with tests grouped by class)

Had we messed up any of the SQL statements, or if a result is different from what is expected in the Assert, the test would fail and throw an error with the SQL error message.  In this screenshot you can see I left out the FROM in the Assert SQL statement.

(Screenshot: the failed test with the SQL error message)

Summary

In this post we got started writing unit tests for SQL Server DB projects using VS, NUnit, and SqlTest.  After writing one test, you'll want to write another.  Once you get started with these frameworks, you can copy/paste much of the code in order to write more unit tests.  Adding another test with SqlTest is as simple as copying the [Test] code block and changing the tables and queries.  The small amount of upfront cost will return huge dividends later with improved code reliability and confidence.

In future posts I will review how to use SqlTest to write unit tests for SSIS packages, and hopefully how to integrate these SQL Server unit tests into a full TFS CICD solution.

Lastly, another shout out to Mark for taking the time to create the SqlTest library to help make unit testing SQL Server easier!

Problem Connecting to Remote SQL Server – “doesn’t support requested protocol”

Problem:

After installing a new named instance of SQL Server 2016 onto a development machine, I was unable to connect via a remote machine, but could connect locally.  The named instance, MSSQLSERVER2016, was the 2nd instance to be installed on the server – the original was the default instance of SQL 2012 Standard.

I got the error message:

“A network-related or instance-specific error occurred while establishing a connection to SQL Server.  The server was not found or was not accessible.  Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 28 – Server doesn’t support requested protocol) (Microsoft SQL Server)”

I was still able to connect successfully to the original 2012 default instance.

 

Solution:

The first thing I checked, as suggested in the error message, was that remote connections were enabled in the server properties of the named instance.

The named instance was correctly set to allow remote connections.  So I started searching for information about SQL Server network protocols and came across this MS document on SQL Server Network Protocol Configuration – https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/default-sql-server-network-protocol-configuration .

Apparently, different editions of SQL Server set network protocols differently for security purposes, as you can see from the table below taken from the documentation.

In this case, I was in fact using Developer Edition for the MSSQLSERVER2016 named instance, so the TCP/IP protocol – which is used for remote connections – was disabled, while Shared Memory – which is used locally – was enabled.  This differs from the installation defaults of the Enterprise and Standard editions.

So, to fix the connection problem, I opened up SQL Server Configuration Manager on the server, went to SQL Server Network Configuration, and sure enough, for the MSSQLSERVER2016 named instance TCP/IP was set to disabled.

After changing the TCP/IP Protocol Status to Enabled, all I had to do was restart the SQL Server Service for the named instance, and then I could connect successfully from a remote machine.

Setting Up a Raspberry Pi 3 as a Print Server

Since the dawn of the digital age, man has been trapped in a never ending struggle with printers.  Total disaster.  I hate printers.  But this post isn't about ranting about how terrible printers are in general – it's about getting one to work wirelessly.  Since my wife and I switched exclusively to laptops, and I decommissioned (read – the drive failed and it's sat in various states of disassembly for 3 years) my last desktop that acted as a file and print server in our home office, we searched for a “wireless printer” that actually, consistently works.  We've gone through several of various brands, and while they “work” part of the time, we've never found one that was reliable and consistently stayed connected to our home network.  The latest, a Brother MFC-L2740DW multi-function laser, had the nasty habit of going “offline” to save power approximately every 24 hours.  Searches online yielded tons of posts with the same issue, and apparently there is no solution other than literally turning it off and turning it back on again.  Sure, you have to physically go to a printer to pick up what you printed, but wtf?  My thermostat can stay connected 24/7 but a several hundred dollar printer can't?  Sorry, I said I wasn't going to rant about this…

So the solution I decided to try was to set up another dedicated print server to physically attach the printer to, hopefully keeping it “online” all the time and letting the server manage the wireless connectivity instead of the printer itself.  But I did not want to have to power and manage an entire desktop just so the printer is always at the ready.  Enter the Raspberry Pi.  For those unaware, Raspberry Pi is the name of a popular company that makes a series of small single board computers of the same name.  For this project I purchased a Raspberry Pi 3 Model B along with a kit containing a power supply, a microSD card pre-loaded with Raspbian (a Debian based Linux distribution for Raspberry Pi), a tiny little case, and even tiny little heatsinks.  The total cost for the kit was ~$50.  The Raspberry Pi 3 itself is about $35.

(Raspberry Pi 3 Model B)
This was my first project with a Raspberry Pi, and the first time I had dabbled with Linux at all in a number of years.  So I had to fumble my way through getting it set up, connecting to it remotely from Windows, getting the printer working on the Pi, and finally getting our Windows machines to print through it.  Though in the end it was pretty simple, hopefully this post will help you get past some of the bumps I hit along the way.
Pi Set Up
Getting the RPi up and running was simple enough.  The kit I got was from https://www.loverpi.com/ and it came with an 8GB Micro SD card preloaded with Raspbian as well as a bunch of software.  All I had to do was put in the SD card, dig up an old USB keyboard and mouse, hook it up via HDMI to a TV, and plug it in.  It booted right up to the GUI and was straightforward to connect the built in Wifi.

Enabling VNC
First order of business was getting VNC set up so I could stop using a terrible old USB keyboard while sitting on the floor in front of the TV.  From what I had read, RealVNC should have been included with the version of Raspbian that had come preloaded, and all I had to do was go to Menu > Preferences > Raspberry Pi Configuration, click on Interfaces, and Enable VNC.  But when I went to this screen, the VNC option wasn't there.  So I figured I would try to update Raspbian to get the latest packages.  Again, it's been a while since I've used Linux, and it was never very much, so I'll show the terminal commands I used, but I'm not going to get into exactly what everything does here – because I probably don't even really know.
sudo apt-get update
sudo apt-get upgrade

After updating Raspbian – which took a while because it downloaded a lot of updates – and rebooting the RPi, the VNC option WAS there for me to enable.  Then I just had to download the RealVNC viewer for Windows – https://www.realvnc.com/download/viewer/ – on to my laptop, and I was able to connect remotely to the RPi just by using its IP address in the VNC Viewer address bar.

 Now I was able to safely put the RPi into its tiny little clear case, connect it to the printer via USB cable, and hide it away in the office.  I also attached an mSATA USB adapter for some extra storage with a spare mSATA drive, but that’s not required for this project.

Setting up CUPS

So next was to get the printer working locally with the RPi.  For that I turned to CUPS (Common UNIX Printing System), because that’s what the internet told me to use to set up a Linux system as a print server.  CUPS was already installed as part of the pre-loaded software on Raspbian, and I knew it was updated from earlier, so I should have just been able to go to its web interface and configure the printer.  Wrong.

To administer CUPS, you point your web browser (on the RPi) to http://localhost:631/ to get the admin screen shown below.

But clicking on “Adding Printers and Classes” popped up a username/password prompt, which the instructions said to log into with your root user account – but even that yielded: “Unable to add printer: Forbidden”.

This was a spot where Linux knowledge would probably have helped.  Basically the CUPS software creates a user group called lpadmin on the machine, and a user must be a member of it to do anything via the web interface – but it doesn't automatically add anyone to this group, not even the root user.  Though it's a different Linux distribution, I found the Ubuntu CUPS help page helpful – https://help.ubuntu.com/lts/serverguide/cups.html

So as instructed in that link, I had to add my user (on the RPi, the default user, pi) to the lpadmin group.

sudo usermod -aG lpadmin username

NOW I was able to click on Adding Printers and Classes > Add Printer and select my already recognized printer.

You then have to select your make and model of printer to pick the correct driver.  The Brother MFC-L2740 was not listed.  I was able to find a Debian printer driver for this model on Brother’s support page, and although it installed on Raspbian, it is built for a different architecture and would not print.  Many people online seemed to have luck selecting similar model printers, so I gave the Brother MFC-9600 – CUPS+Gutenprint v5.2.10 driver a shot, and amazingly it worked!

I was able to click the Maintenance drop-down and successfully print a test page.

Adding the Printer to Windows Machines

After getting the printer set up on the RPi, the last thing to do was be able to add the new network printer to our Windows laptops.  Viewing the Network tab on Windows did not list the RPi on the network.  What I needed to do was allow file and print services to be shared from the RPi to Windows.  To do this, I needed to install Samba which is explained simply enough here – http://www.deviceplus.com/how-tos/raspberrypi-guide/raspberrypi_entry_006/

Install Samba:

sudo apt-get install samba

Edit Samba config file with nano text editor to set your Work Group and enable WINS support:

sudo nano /etc/samba/smb.conf


Scroll down through the config file and set the below attributes, then press CTRL+O to “WriteOut” aka Save the changes.

workgroup = your_workgroup_name
wins support = yes
After making these changes we need to restart Samba.
sudo service smbd restart
Now when we go to the Network screen in Windows, we can see the RPi on the network!

And within the RPi on the network we can find, and add, our printer!

In order to properly add the printer to Windows, the Windows machines should have the necessary printer drivers installed so you can select them when you Right Click > Connect on the printer you want.

Summary

After following the above steps, you should be up and running with a Raspberry Pi print server that you can connect to wirelessly with a Windows machine.  With this set up, the printer and the print server should be able to remain on and ready for use while only consuming a very small amount of power.  I have not yet tried to set up and use the scanning functions of the multi-function printer, but by now having Samba set up, I should be able to easily share files and folders as well and be able to scan directly to the RPi and access the files from any other machine on the network.  Along the same lines, the mSATA adapter pictured above can be used as shared storage via the RPi as a file share.

Extract Multiple Access DBs (with different schemas) via SSIS Script Task into SQL Server

*Another post moved over from my old blog.  Since I wrote this, a coworker modified the code to extract data from multiple Excel workbooks and tabs in a similar fashion.  At some point I’ll try to add that addendum.

Problem:

We've all heard it before – “Well I get the crucial financial numbers from this Access Database…”.  Or even worse, “we have hundreds of Access databases all over the place with varying schemas”.  On a recent project I was faced with the latter – several hundred Access DBs (mdb files) that had accumulated over a number of years, in various places across the network, all with different schemas.  In the client's defense, some of this was an outside 3rd party's doing, and they had since implemented an enterprise class solution for the process going forward.  Nevertheless, I was tasked with extracting all of this legacy Access data into SQL Server.

   🙁

 

I figured I would just be able to use a For Each Loop within SSIS to extract each mdb file and drop it into SQL Server.  I quickly realized, however, that because every Access DB could potentially have a different schema, I would need to dynamically handle the schemas in each source file.  I briefly thought about using BIML (see my recent posts about Business Intelligence Markup Language) to dynamically create the SSIS packages, but I really didn't want hundreds of packages or components since this was a one and done job.  So, for better or for worse, I turned to the trusty Script Task.

Solution:

So what I needed to do was:

  1. Connect to each Access DB file in different locations across the network
  2. Create the table schema for that specific mdb file in SQL Server
  3. Copy the data from the Access file to SQL Server
  4. Run some other processing stored procedures not related to this post

Turns out there was also a 1b) – map each file location as a network drive using a separate Active Directory account. There was no need to keep the data after the last processing step was complete, so I was able to drop the created SQL Server tables when I was done.

Here is an overview of the control flow of my completed SSIS package:

First I use an Execute SQL Task to select all of the file names and locations of the mdb files from a pre-populated table (thankfully I had a spreadsheet of all the file names and locations). To set up the For Each Loop to loop through all the rows returned in the Execute SQL Task, we need to first create a User Variable of type Object – here I called it “mdbsToProcess”. Then in the Execute SQL Task properties, we set the ResultSet property to Full Result Set.
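For reference, the pre-populated control table and the query behind that Execute SQL Task looked something like the below – the names here are illustrative, not the actual ones from the project:

CREATE TABLE dbo.MdbFilesToProcess (
    FileNumber    int IDENTITY(1,1) NOT NULL,
    FullFilePath  varchar(500) NOT NULL,   -- full UNC path and file name of the .mdb file
    ProcessedFlag bit NOT NULL DEFAULT 0
);

-- Query used in the Execute SQL Task (ResultSet = Full Result Set -> User::mdbsToProcess)
SELECT FileNumber, FullFilePath
FROM dbo.MdbFilesToProcess
WHERE ProcessedFlag = 0
ORDER BY FileNumber;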

Then we click on the Result Set option and set the result set to the object variable we created.

Next I use a For Each Loop to loop through each of the mdb files I need to process. The For Each Loop is set up as follows to use the mdbsToProcess variable, and loop through each record to get information we need for each mdb file.

For the final code sample – click here. I’ll walk through it below. This is just a copy/paste from the SSIS Script Task, so please ignore all the SSIS default stuff.

After the Script Component generates and loads all of the tables from the mdb file into SQL Server, I simply get the current time, execute some other stored procedures to process the data unrelated to this post, then log the results.

Code Explanation:

By no means am I an App Dev guy, so feel free to improve upon this and let me know.  But below is an overview of the more interesting parts of the Script Task code.  First we set up all our variables (it's quick and dirty, so some are hard coded and others are passed in from SSIS).


public void Main()
{
    string FullFilePath = Dts.Variables["User::FullFilePathFromDB"].Value.ToString(); //Full network file path and name of mdb file
    string TargetServer = Dts.Variables["$Package::TargetServer"].Value.ToString(); //Target SQL Server
    string TargetDB = Dts.Variables["$Package::TargetDB"].Value.ToString(); //Target SQL DB
    string tableName = null; //name of table in source
    string TargetTableName = null; //name of table being created/populated
    string DropTableSQL = null;
    string NetworkDrivePassword = "MyNetworkDrivePassword";
    string NetworkDriveUser = "MyNetworkDriveUserName";
    string NetworkSharePath = @"\\MyNetworkPath\Share$";

    int CountCreatedTables = 0; //Used in log file

    string SourceConnString = ("Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + FullFilePath);
    string TargetConnString = ("Data Source=" + TargetServer + "; Password=MyPassword;User ID=MyUserName;Initial Catalog=" + TargetDB + ";");

    string command;
    command = "/C NET USE " + NetworkSharePath + " /delete"; //If share mapped, disconnect to prevent multi "user" error

    //WriteToLog("Executing Cmd Line: " + command);
    ExecuteCommand(command, 5000);

    command = "/C NET USE " + NetworkSharePath + " /user:" + NetworkDriveUser + " " + NetworkDrivePassword; //grant access

    //WriteToLog("Executing Cmd Line: " + command);
    ExecuteCommand(command, 5000);

 

Then we execute 2 NET USE commands to first delete any existing mapped network drive by the given name, and then to map it based on the given credentials. This is only necessary if the credentials you run the SSIS package under do not have access to the network shares. I found I had to first delete it if it existed or I could run into a “multi user” error. These commands are executed by the ExecuteCommand() method which just runs the NET USE commands via cmd.exe.

Next we use the GetSchema method on the Source Connection (AKA our Access file/OleDbConnection) to populate a data table with every table within the mdb file. Note that we specify the “Tables” schema object type.

OleDbConnection SourceConn = new OleDbConnection(SourceConnString);
SqlConnection TargetConn = new SqlConnection(TargetConnString);

SourceConn.Open();

WriteToLog("Connect to File # " + Dts.Variables["User::FileNumber"].Value + " in " + FullFilePath);

DataTable DBTabsTables = SourceConn.GetSchema("Tables");

SourceConn.Close();

Now that we have a table of all of the Access file tables, we iterate through each one in the DBTabsTables DataTable and generate our DROP and CREATE scripts which we pass to the CreateTable() method. In this case I specified the tables I wanted, but you can remove the IF statement if you want them all.

foreach (DataRow row in DBTabsTables.Rows) //For each table in mdb file
{
    tableName = row["table_name"].ToString();

    if (tableName == "AccessTable1" || tableName == "AccessTable2" || tableName == "AccessTable3") //only get specified tables from Access file. Or remove IF to get all tables
    {
        DropTableSQL = "IF OBJECT_ID('dbo." + tableName + "') IS NOT NULL BEGIN DROP TABLE " + tableName + " END; "; //build drop table if exists SQL

        CreateTable(SourceConn, tableName, TargetConn, DropTableSQL, tableName); //For the initial create, we want to use the source tableName

        ++CountCreatedTables;

        BulkLoadTable(SourceConn, tableName, TargetConn, tableName); //For the initial bulk load, we want to use the source tableName
    }
}

 

Within CreateTable() we again see that we use GetSchema on the connection object, but this time we use the “Columns” object type and another variable called “restrictions”.  If you look a few lines from the top of the code snippet below, you can see that “restrictions” is a 4 value array of strings.  You can read about the options here, but I am passing the name of the current table within the loop into the 3rd position in the array (remember, it's 0 based), which is then used in the GetSchema call to restrict it to only 1 table.  So now we have a DataTable called ColumnsDataTable which contains 1 record for each column in the designated table.

public void CreateTable(OleDbConnection SourceConn, string tableName, SqlConnection TargetConn, string DropTablesSQL, string TargetTableName)
{
    string[] restrictions = new string[4];
    restrictions[2] = tableName; //restrict which table information is returned by GetSchema

    string accessTableFieldName;
    DataRow myRow;

    SourceConn.Open();
    DataTable ColumnsDataTable = SourceConn.GetSchema("Columns", restrictions); //Fill DataTable with columns information
    SourceConn.Close();

 

When using the schema information to build the CREATE scripts for the new SQL tables, you need to be careful about data types between OleDB and SQL Server. There is probably a better way to do this, but I used the below switch statement to brute force change the datatype names to their SQL counterparts.

//For every row in the table
//(SortedColumnsDataTable is presumably ColumnsDataTable ordered by column position - that step is not shown in this snippet)
for (int i = 0; i < SortedColumnsDataTable.Rows.Count; i++)
{
    //Get column name and type
    myRow = SortedColumnsDataTable.Rows[i];
    accessTableFieldName = "[" + myRow["column_name"] + "] ";

    switch (((OleDbType)myRow["data_type"]).ToString()) //Change OleDbType to SQL data types
    {
        case "Boolean": accessTableFieldName += "bit";
            break;
        case "Currency": accessTableFieldName += "money";
            break;
        case "Date":
        case "DBDate":
        case "DBTimeStamp": accessTableFieldName += "datetime";
            break;
        case "VarWChar":
        case "WChar":
        case "VarChar":
        case "Char": accessTableFieldName += "nvarchar(" + myRow["character_maximum_length"] + ")";
            break;
        case "UnsignedTinyInt": accessTableFieldName += "int";
            break;
        case "Double": accessTableFieldName += "float";
            break;
        default: accessTableFieldName += ((OleDbType)myRow["data_type"]).ToString();
            break;
    }

Lastly, in BulkLoadTable() we use the SqlBulkCopy class to copy the data from the mdb file to the newly created SQL Server table.  Be sure to set the BatchSize and BulkCopyTimeout to something that can handle the amount of data you are transferring.

public void BulkLoadTable(OleDbConnection SourceConn, string tableName, SqlConnection TargetConn, string TargetTableName) //use SqlBulkCopy to load data from source to target
{
    OleDbCommand SelectFromCmd = new OleDbCommand("SELECT * FROM " + tableName, SourceConn);
    OleDbDataReader rdr;

    SourceConn.Open();
    TargetConn.Open();

    rdr = SelectFromCmd.ExecuteReader();

    SqlBulkCopy sbc = new SqlBulkCopy(TargetConn);
    sbc.DestinationTableName = TargetTableName;
    sbc.BatchSize = 50000;
    sbc.BulkCopyTimeout = 120;
    sbc.WriteToServer(rdr);
    sbc.Close();

    rdr.Close();
    SourceConn.Close();
    TargetConn.Close();

    WriteToLog("Load  " + TargetTableName);
}

 

Summary

In this post we discussed how to use SSIS and a Script Task to iterate through any number of Access mdb files in order to import the data into SQL Server.  The code used can handle Access dbs with varying schemas.  This type of process can be very useful when you are dealing with a large number of legacy Access dbs and the Import Wizard isn't practical.  Hopefully this post will help someone else upgrade/migrate a legacy business process into a more enterprise-ready solution – and not let an “Access for the Enterprise” application continue to exist!

Erik

@ErikH_BI

 

Having an SSIS Component “Wait” Until a Condition is met or a Time Period Passes

*This is an old post of mine from my employer's old blog site that I had not moved here.  I remembered about it when I was trying to do something similar and figured I'd post it.  Today I would probably do it in C# using a timer or Thread.Sleep.

Background
For a recent project, my colleague (@craigblevin) and I found ourselves reloading a QA environment of a Data Warehouse from scratch. In this case, our source data consisted of all flat files – upwards of 70 gigs worth. In normal operation, our ETL easily handled the incremental data, but we found that reloading all of the data at once choked up our minimally resourced SQL Server. So we broke the files up into more manageable chunks and ran them one at a time (in this case going back to modify our DW ETLs themselves to process one file at a time was not an expedient option). After spending the better part of a weekend taking turns checking on the progress, manually moving a file, and starting the SQL job again, we realized we could build another SSIS package to take care of this for us. But how to make sure the previous run was complete before copying over the next file and restarting the job?

Set up a For Each File Loop

The first step was to set up a simple For Each File Loop in SSIS. In this case, as depicted below, we wanted to move 1 file at a time from our staging directory into our processing folder.

Once the file was in the processing folder, our package fired the SQL Job to start the deployed SSIS packages to load the DW.

Check for the job to finish

Once the deployed job was running, we had to know when it was completed so the package could start the next file sequentially.  For this we created a For Loop which contained an Execute SQL Task that checked our custom DW Audit tables to see if the job was completed.  The SQL Task simply set a variable, @IsLoadComplete, to 1 when the last run was complete, and the For Loop continued until @IsLoadComplete = 1.

This led to a problem though – the Loop Until Complete container would execute the SQL Task continuously until the loop condition was met.  We did not want to constantly query the server tens of thousands of times until the job was complete.

Have SSIS wait a specified period of time

Since SSIS does not have a native “Wait” task, our quick solution was to create a nested For Loop that does nothing with an evaluation expression that checks the current time until its 10 minutes after the start time of the container. This causes the “Check” SQL statement to run only once per 10 minutes.

EvalExpression: DATEADD("mi", 10, @[System::ContainerStartTime]) > GETDATE()

The finished product is below (ignore the warning on the Send Mail task).

How does it perform?

In our case, since we were just loading a QA environment and wanted a quick and dirty solution, we were not concerned with performance.  We successfully allowed this design to run in VS on our local medium-powered machine overnight while processing multiple files without issue.  However, a quick check did show a 30% spike in CPU usage – so in the end this solution won't win any performance awards.

In hindsight, was there a better way? (hint: yes)

While writing this post, and checking the CPU usage, I started thinking there had to be a more efficient way.  Since I am not a programmer by nature, I went straight for available SSIS components instead of opening up a Script Task.  But a quick search for “SSIS Script Task Wait” found several examples (such as Mike Davis) of doing this with a Script Task – and sure enough, using a Script Task with code such as the below performs the same function with much less CPU usage.

Credit to http://www.bidn.com/blogs/MikeDavis/ssis/158/make-an-ssis-package-delay-or-wait-for-data for the below Script Task code.

Public Sub Main()
    '
    Dim sec As Double = Convert.ToDouble(Dts.Variables("intDelayTime").Value)
    Dim ms As Int32 = Convert.ToInt32(sec * 1000)
    System.Threading.Thread.Sleep(ms)
    '
    Dts.TaskResult = ScriptResults.Success
End Sub

So if we had to do it again, replacing the “Do Nothing” For Loop with a Script Task such as the above would probably be less resource intensive and accomplish the same thing!
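Another lightweight option that stays native to SQL Server is to keep the outer loop but replace the do-nothing container with an Execute SQL Task that simply runs a WAITFOR DELAY statement, which sleeps server-side without spinning the CPU.  For a 10 minute pause, the task's SQL would just be:

-- Pause for 10 minutes before the next status check
WAITFOR DELAY '00:10:00';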

Duplicate objects in SSIS (ErrorCode Column)

In SSIS (multiple versions) you can get a seemingly random error when trying to create an OLEDB Source Component.

Full Error:
The package contains two objects with the duplicate name of “OLE DB Source.Outputs[OLE DB Source Error Output].Columns[ErrorCode]” and “OLE DB Source.Outputs[OLE DB Source Error Output].Columns[ErrorCode]”
The error is caused by a dumb, but in hindsight sort of obvious, problem.  The source query I was using is below.
SELECT [SurveyID]
      ,[VendorSurveyID]
      ,[SurveyStatusID]
      ,[SurveyStatusName]
      ,[ErrorCode]
      ,[ErrorDescription]
      ,[ErrorPriorityID]
      ,[ErrorPriorityDescription]
      ,[InsertDate]
FROM [CurrentSurveyStatus] s
The problem is that my source query had a column explicitly named ErrorCode – which happens to be the name of one of the default built-in columns that are piped to the component error output.  The same would be true if I had a column in my source query called “ErrorColumn”.  It's not readily apparent that these output columns exist because they would only show up if you use the component error path, or if you look at the Advanced Editor as shown below.

SSIS won’t let you remove the default SSIS ErrorCode column from above, but it will let you rename it in the Advanced Editor.
So you have two options –
  • Rename your column
  • Rename the default SSIS column
Below I renamed the default column as you can see in the Meta Data Viewer, and the component no longer has errors.
Note: The warning is just that no rows would currently be sent to the error output in this example.
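And if you go the other way and rename your own column instead, a simple alias in the source query avoids the collision, e.g.:

SELECT [SurveyID]
      ,[VendorSurveyID]
      ,[SurveyStatusID]
      ,[SurveyStatusName]
      ,[ErrorCode] AS [SurveyErrorCode]   --no longer collides with the SSIS error output column
      ,[ErrorDescription]
      ,[ErrorPriorityID]
      ,[ErrorPriorityDescription]
      ,[InsertDate]
FROM [CurrentSurveyStatus] s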

Letter to Elected Officials re: ACA Repeal and CF

*UPDATE 9/21/17: With the added traffic to this post lately from Twitter, I wanted to add:

Next month, October 14th, I am doing a Stair Climb fundraiser for the Cystic Fibrosis Foundation to raise money for those affected with this fatal genetic lung disease – like my wife Mary.

In addition to fighting for the ACA, can you help me fight for a cure for Cystic Fibrosis? Thank you! Please visit http://fightcf.cff.org/goto/ehudzikCFClimb2017

—————

Below is a copy of a letter that I am sending to various federal government officials to raise my concerns about the pending fight over changes to the Affordable Care Act (ACA) aka ObamaCare.  This isn’t political, although it references party affiliation, but it is advocacy for those with chronic illnesses and other healthcare issues.  Feel free to share.

——————————————

To whom it may concern,

My name is Erik Hudzik, and although I do not live in your district or state, as you are _____________ I am contacting you regarding a matter that is deeply important to my wife Mary and me – the Affordable Care Act.  In this letter I am going to set aside partisan bickering and try to explain to you some of the very real consequences that your and your party's cavalier attitudes towards repealing the ACA will have on millions of Americans like us.  I'm going to do this by telling our story.

My wife Mary has Cystic Fibrosis (CF).  CF is a fatal genetic lung disease that, among other things, slowly, but surely, degrades lung function until the inevitable need for a lung transplant which brings its own complications.  She is one of 30,000 Americans with this fatal disease.  When my wife was born in 1982, her parents were told she would not make it to adulthood.  Thanks to science (a quick side note – please continue funding NIH research), the life expectancy for patients with CF now hovers around 40 years old and continues to improve.  She has spent her life chasing her life expectancy.  Despite these wonderful advances, my wife has to fight to breathe every day.  Her entire life she has had to spend hours in the morning and hours at night, every day, doing treatments to stay alive.  Every year, she is subject to several week or longer hospitalizations and several courses of home IV antibiotics.  Suffice to say that we are extremely familiar with health insurance and the costs of medical care.

Mary is the youngest of three girls and is the only sibling affected by CF.  When she was born in 1982, her parents also had to very quickly become familiar with our country’s odd health insurance system now that they had a child requiring hundreds of thousands of dollars in medical care a year.  Her father was lucky enough to get a new job with better insurance, and went on to take over the business so he was able to ensure he – and the rest of his employees – could have quality health insurance.  The ACA made it easier for a small business owner like Mary’s father to offer quality insurance.  Many Americans are not this lucky.

Fast forward to 2006 when Mary and I got married at the age of 23, and health insurance now became our responsibility.  Since this was prior to the ACA, Mary could not continue on her parent’s insurance.  So while we continued on to graduate school, I had to make sure to also find a full time job that offered very good health insurance.  Many Americans are not this lucky.

As you might have guessed, CF is a pre-existing condition.  Pre-existing since her birth.  There is no way we would have been able to get insurance on the old individual market.  After getting our graduate degrees, I was able to start my career with a good job for a mid-size company that offered good health benefits.  I stayed there for 5 years until they changed to a much lower quality health insurance plan that would not work for my wife.  I was forced to quit within 3 months, and thankfully found another job with better health benefits – again, many Americans are not this lucky.

My next job unfortunately lasted only 10 months before the employer folded back into its parent company and eliminated several jobs.  Thankfully, I was again able to quickly find another job with good health insurance that would cover my wife.  All of these job changes would have been much less stressful had we had the ACA exchanges to fall back on.

Now I find myself in a situation where I have several other promising opportunities to further my career – one of which would involve being an independent contractor for a period of time.  Now, thanks to your repeal and your “plan” to eventually do… something, I cannot take the risk of doing independent work because we cannot be sure that we will continue to have the option to purchase reasonably priced individual market insurance.  Your ACA repeal efforts have already directly harmed the individualism and small business potential of hundreds of thousands of Americans.

Adding to all of this, because of the ACA, my father, who has been battling kidney cancer for years, was finally able to retire at age 66 as now there was an option, via the exchanges, to get insurance for my 63 year old mother, an endometrial cancer survivor.  Again, with a pre-existing condition, she would not have been able to obtain affordable, quality coverage on the old individual market.  With the pending repeal, my father may have to return to work, while still battling cancer, just to make sure my mother has insurance until she reaches 65.

We followed the health reform battle very intently.  I mean no disrespect when I say that I probably have a better understanding of health insurance and related policy and economics than many members of Congress in both parties.  It was incredibly disappointing, and frankly infuriating, that your party took up the tactic of complete and utter resistance to any efforts to work together to solve the agreed upon problems in the American Health Insurance system.  Despite starting with a market based framework which was originally developed by conservatives, and the Obama administration making every effort to include Republicans and their good faith ideas, your party refused to do anything to improve the pending law.  Instead your party uniformly and blatantly simply said “no” while trotting out lies about “death panels”.  These partisan and childish tactics should, frankly, be enough to completely disqualify you from even discussing changes 8 years later – you had your chance to shape policy, and flatly turned it down.

When the ACA was signed into law, and implemented, we were ecstatic.  We finally knew we would ALWAYS be able to purchase quality health insurance, no matter my employment situation.  It opened up countless possibilities for us and our economic future.  Despite many claims to the contrary, every year when we have looked at gold and platinum plans on the exchange in our area, the costs are extremely competitive with my employer provided insurance once the portion my employer pays is taken into account.  I can’t tell you the sense of relief the ACA gave us.

But that brings us to today.  In a matter of weeks, you have begun the process of destroying something that tens of millions of Americans have come to rely on – whether they know it or not.  You are upending people's lives.  Every vague “policy paper”, press release, or speech presented by Republicans has come nowhere near covering as many people with as high quality insurance.  Industry leaders from insurance companies to the AMA, and health care policy experts from across the political spectrum, have warned you not to proceed so quickly without having a real plan to move forward.  I fear that your party either does not fully understand what you are doing, or worse, simply does not care.  I know you have run on nothing but “Obamacare Bad!” for 8 years and you have boxed yourselves in – but please, don't ignore the real life consequences of what you are doing.  Take what works – and is popular – within the ACA and improve upon it with reasonable, good faith discussions.  Take this chance to have Republicans become part of helping to craft changes to the ACA that improve it for generations – not tear it down for short sighted political gain.  Please, please don't destroy what has brought health insurance coverage, care, and peace of mind to so many millions of Americans just for blatantly partisan political reasons.

I hope you carry the thoughts of my family, the positive impacts the ACA has had on us, and the fear we have for its pending repeal, with you for every vote that you make.  I plan to continue to update you on how your actions directly affect our family – just one of millions.

Power BI Forecasting Feature (and when your “Data is too irregular to forecast”)

In September, one of the new items Microsoft rolled out for Power BI was a preview version of their Forecasting statistical analysis feature.  Since then they have moved the feature out of “preview” and it is readily usable on the Analytics tab in both Power BI Desktop and the web version.  In this post we will look at the steps to utilize this new feature in Power BI Desktop to add some quick statistical analysis to your visualizations.

What is Power BI Forecasting?

Forecasting in Power BI gives you a quick and simple way to add a predictive forecast to a time series line chart.  Power BI Forecasting uses a method called exponential smoothing to come up with its predictions based on a time series dataset.  You cannot view or change the actual algorithm, but you can get more of an explanation of how it works on the Power BI Blog – https://powerbi.microsoft.com/en-us/blog/describing-the-forecasting-models-in-power-view/

Dataset

The data set I used for this demo is a collection of “check-ins” from the beer rating app Untappd.  The full dataset had about 800 beer check-ins, but it is not posted here to protect the innocent.  I have a few rows of sample data below, but any time series data should do.  The “ExcelNumericDate” column is the MS Excel integer representation of a date, which I had to use as a workaround for a problem I'll explain below.  I'm sure there is a better way to do what I needed, but instead of fighting with formats, I just did it this way.

Let’s try to forecast how many beer check-ins I will have in the future based on my check-in data.

BeerName ExcelNumericDate CheckInDate
Tripel Karmeliet 42272 9/25/2015
Big Gruesome Chocolate Peanut Butter Stout 42272 9/25/2015
60 Minute IPA 42273 9/26/2015
Not Your Father’s Root Beer (5.9%) 42274 9/27/2015
Rumpkin (2015) 42275 9/28/2015
Oktoberfest 42279 10/2/2015
420 Extra Pale Ale 42284 10/7/2015
Samuel Adams OctoberFest 42287 10/10/2015
Corona Light 42287 10/10/2015
60 Minute IPA 42292 10/15/2015

 

Using Power BI Forecasting

First we need to load our data set into Power BI.  As shown below, from the Home tab in Power BI Desktop, select Get Data, and the type of format your file is in.  For this demo I used a .csv file that I had saved as a copy of an .xlsx file.  You can see I have added two columns, beer_name and CheckInDate to the canvas.

Next, in order to get discrete dates, instead of the default Power BI Date Hierarchy, we need to click on CheckInDate and change the selection from “Date Hierarchy” to the column name, “CheckInDate”.

Now change the Visualization to a Line Chart, and make sure the CheckInDate (or whatever your date field is called) is set as the Axis, and the Count of beer_name is set as the value.

(the huge spikes are tastings at Beer Fests or great brewery tours with Liberty Brew Tours… I swear)

Now we just click on the Analytics magnifying glass, and select the new Forecast option, right?

 

My Data is too Irregular??

Unfortunately, no.  When you click Add, you won’t get anything on your Line Chart except a tiny red “X” in the upper left hand corner.  Click on it to get a bit more information.

“Data is too irregular to forecast.” “Your data isn’t evenly spaced … dates or numbers in a valid timeline must have a consistent gap between consecutive points”

After more trial and error than I’d like to admit, I realized what Power BI is telling us is that it can’t create a forecast because it doesn’t have a data point for each date.  If we go back and look at the sample data I have above, you can see that on 10/2/2015 the beer Oktoberfest was checked in, and then there was not another check-in until 420 Extra Pale Ale on 10/7/2015.

Oktoberfest 42279 10/2/2015
420 Extra Pale Ale 42284 10/7/2015

 

This is where I have to assume I'm missing a way to easily have Power BI treat any missing dates as a 0 value, but I wasn't able to find it.  This becomes clear if you look at the Line Chart above – there are no data points at 0, only at 1 or greater.  An interesting side note: if you “drill up” to the month level with this data, you CAN add a forecast, because in that case each month, when all check-ins are summed by month, has a value.

My workaround for this seemed simple enough: merge in another data set with all possible dates for the date range I wanted for the report, then use the dates from that 2nd data set for the X-axis so every date in the range would have a data point.  Fast forward through a wasted hour fighting with 2 .xlsx files and differing date formats (eg: January 2nd 2017 10:00 AM vs. 1/2/2017 etc…), and I just converted the datetimes in both data sets, beer check-ins and All Dates, to integers in Excel and used that for the join.  Ugly, I know.

So we import the 2nd dataset of all possible dates we want and their Excel integer equivalent as shown below.

ExcelNumericDate Date
42005 1/1/2015
42006 1/2/2015

 

Click on Relationships on the left hand side, Get Data, and select your 2nd data set formatted as above.

Once we have the 2nd data set with all possible dates, we need to merge (AKA: Join) the 2 data sets.  So click on Edit Queries on the ribbon –

On the Query Editor screen, we go to Merge Queries > Merge Queries as New –

On the Merge dialog box, shown below, we select our Join/Merge key (ExcelNumericDate), and select our “Join Kind”, eg: INNER, LEFT, RIGHT etc… Be careful here to pick the correct type of join, Right or Left, depending on which order the data sets are in.

Now we close out of the Merge dialog box, and just have a couple other small things to do.

Adding Our Forecast

First, in the Query Editor, we can rename our new Query, here I’ve called it BeerCheckins.  Then we need to go to the far right of the list of columns to find our new “column”, or rather collection of columns, we “merged” and click the double arrows shown below to expand the new columns.  We only care about the Date column now because we will use this for our X-axis.  Remember, this is now the date from the AllDates data set so there is one row for each day.

We can then click Close and Apply in the upper left to return to our data, where we are going to add one last column to simplify summing our check-ins.  Right click on our new BeerCheckins table, and select New Column.

 

For this new column, named Counter, we want to make sure that every date that does NOT have a check-in (eg: no beer) gets a data point of 0.  So we can check the beer_name column to see if there was a check-in on each date using the below DAX function:

Counter = IF(ISBLANK(BeerCheckins[beer_name]),0,1)

This new column is added to the far right, and has a value of 1 if there was a check-in on that date, or 0 if there was not (*note: though we used the [beer_name] column for the function, it is not shown)

Now we can finally go back to our Report tab, add a new Line Chart, set the NewColumn.Date as the Axis (don’t forget to again toggle off the Date Hierarchy default), and the Counter column as the Value.

Answering the Question

And this time when we go to the Analytics tab and try to add a Forecast – we can create it successfully and we find that within a 95% confidence interval I will, in fact, have 4 beer check-ins on Saturday, January 14th 2017!

Always drink responsibly.