“Big Data” Entity Framework 6

Today I’m going to write about handling large amounts of data in Entity Framework. On 3/12/2015 I attended my first hackathon (Hack The Future @ Antwerp). We had to process a SQL database with 1 000 000 records.

I had never worked with that much data before. We are going to use WPF, MVVM and Entity Framework 6 Database First!

First of all, Hack The Future was an amazing event; I learnt a lot and had some good talks with professionals. For more thank-yous, look at the bottom of my post 🙂

First problem.

Normally I would load all my records into memory.

Context.People.Load();

Of course this takes a long time. The data in our DB was bigger than 4 GB, so our program crashed against the memory limit of a 32-bit process.

Loading all the data is bad practice performance-wise, and it’s not memory friendly either…

Solution: get the data only when we need it!
We used a DataGrid to show our data, but it’s not possible to show a million records on one screen, so we had to break our DataGrid into pages.

Our View:

[Image: the View of our EF “Big Data” application]

We used MVVM Light as our MVVM framework. We only show 25 (WidthShow) records at once. The user can go to the next 25 records with a button, and go 25 records back with another button if needed.

We did this by constantly changing the “skip” and “take” parameters in our LINQ query. So we’re now able to view our data without having to wait forever for all of it to load. This concept is called paging.

var data = Context.People.OrderBy(p => p.Id).Skip(_i * WidthShow).Take(WidthShow);

Our ViewModel:

private int _i = 0;
private double count = 0.0;
private const int WidthShow = 25;

public PeopleHackTheFutureEntities Context { get; set; }

#region Lijst Full property
private ObservableCollection<People> _lijst;
public ObservableCollection<People> Lijst
{
  get
  {
    return _lijst;
  }
  set
  {
    _lijst = value;
    RaisePropertyChanged("Lijst");
  }
}
#endregion

public ICommand NextButton { get; set; }
public ICommand PreviousButton { get; set; }

public OldPersonsViewModel()
{
  #region Buttons
  NextButton = new RelayCommand(
      () => Next(),
      () => true
  );

  PreviousButton = new RelayCommand(
      () => Previous(),
      () => true
  );
  #endregion

  Context = new PeopleHackTheFutureEntities();
  count = Context.People.Count();
  var data = Context.People.OrderBy(p => p.Id).Skip(_i * WidthShow).Take(WidthShow);
  Lijst = new ObservableCollection<People>(data);
}

public void Next()
{
  if (_i < Math.Ceiling(count / WidthShow) - 1) //- 1 so we can't page past the last page
  {
    _i++;
    var data = Context.People.OrderBy(p => p.Id).Skip(_i * WidthShow).Take(WidthShow);
    Lijst = new ObservableCollection<People>(data); 
  }
}
public void Previous()
{
  if (_i > 0)
  {
    _i--;
    var data = Context.People.OrderBy(p => p.Id).Skip(_i * WidthShow).Take(WidthShow);
    Lijst = new ObservableCollection<People>(data);
  }
}

A small remark: loading 25 records is quite fast. If you need to load more than 25 – 50 you should consider doing it asynchronously. You’ll see how in the next part!
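
As a small teaser, a page load could look like this with EF6’s built-in async support (a sketch; LoadPageAsync is a hypothetical helper and not code from our project, ToListAsync lives in the System.Data.Entity namespace):

public async Task LoadPageAsync()
{
    //The query runs in the background, so the UI stays responsive
    var data = await Context.People
        .OrderBy(p => p.Id)
        .Skip(_i * WidthShow)
        .Take(WidthShow)
        .ToListAsync();

    //await resumes on the UI thread (captured synchronization context),
    //so it's safe to update the bound collection here
    Lijst = new ObservableCollection<People>(data);
}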

Next problem, aka THE problem

Our assignment was to create a new database from an existing database. The setting of the hackathon was the end of the world (fallout): in the EU there were only 1 000 000 civilians left. We had to find the perfect job for those people. Every person had a set of skills, and those skills determined which job was best suited for them. After that we automatically had to find a partner for each civilian, also based on a set of rules (a maximum age difference of 6 years, etc.).

Because we had to deal with so much data, we needed to save where our algorithm stopped, so we could continue the next time we started our application. Actually, if this wasn’t a hackathon, we should have written a server application that runs the algorithm (back end) and a separate application that shows our data (front end).

The back end would run until all people are “calculated”, separate from our front end. But because we had limited time, we decided to put everything in our WPF application.

To avoid having to save where we are, we used two tables: one for the original people (the ones without a job / partner) and a new one where we save the “calculated” people. Every time we calculate x people, we write them to the new table and delete them from the original table. Now we don’t have to save where our algorithm stopped!

The new table needed two extra fields: one to save the person’s job and one for their partner. To store the partner we used a self-referencing foreign key.
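
To give an idea of the shape of that new table, here is a sketch of what the NewPeople entity could look like (we worked Database First, so the real class was generated; these property names are my assumptions):

public class NewPeople
{
    public string Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Job { get; set; }

    //Self-referencing foreign key: holds the Id of another NewPeople row.
    //NULL means no partner has been found (yet).
    public string Partner { get; set; }
}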

First try

public void SearchJob()
{
    //Only start the task if it isn't running already
    if (_asyncTask.Status != TaskStatus.Running)
    {
        _asyncTask.Start();
    }
}

async void SearchJobAsync()
{
    //Calculate how many times we have to execute our for loop
    int max = Convert.ToInt32(Math.Ceiling(count / WidthCalc));
    for (int l = 0; l < max; l++)
    {
        //Grab the next batch of unprocessed people
        _people = Context.People.Take(WidthCalc).ToList();
        _newPeoples = new List<NewPeople>();

        foreach (var p in _people)
        {
            var add = new NewPeople();
            add.Create(p);
            _newPeoples.Add(add);
        }

        for (int j = 0; j < _newPeoples.Count; j++)
        {
            var newP = _newPeoples.ElementAt(j);

            //Find the best suited job (_arbeiden holds our job rules)
            foreach (var a in _arbeiden)
            {
                if (a.Check(newP))
                    break;
            }

            //Check if person has a partner
            //If not find one
            if (newP.Partner == null)
            {
                for (int k = j + 1; k < _newPeoples.Count; k++)
                {
                    var m = _newPeoples.ElementAt(k);
                    if (CalculatePartner(m, newP))
                        break;
                }
            }
        }

        Context.People.RemoveRange(_people);
        Context.NewPeople.AddRange(_newPeoples);

        //Save to DB
        Context.SaveChanges();

        //Update DataGrid
        _i = 0;
        var data = Context.People.Take(WidthShow);
        Lijst = new ObservableCollection<People>(data);
    }
}

First of all we have to run this asynchronously (on another thread); otherwise our front end (DataGrid + buttons) wouldn’t react at all, which is of course not good and should be avoided! We also check whether the task is already running, so you can’t start one while another is still going.

Our WidthCalc is two thousand, so we process two thousand people at once, remove the old people / add the new people in the DB, and then start again.
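
For completeness: the _asyncTask field used above isn’t shown earlier, so here is a sketch of how it could be declared and created (an assumption on my part; check the GitHub code for the real wiring):

private const int WidthCalc = 2000;
private readonly Task _asyncTask;

public OldPersonsViewModel()
{
    //... buttons and initial page load as shown earlier ...
    _asyncTask = new Task(SearchJobAsync);
}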

The first step is converting our normal People to “NewPeople”; this class has our extra fields (job and partner).

And then we start our algorithm (the code is visible on GitHub). When it’s done we remove the people from the original table and add the new ones to our new table.

We save this to our DB and we’re done. But boooommmmm, error 😮

Additional information: Unable to determine a valid ordering for dependent operations. Dependencies may exist due to foreign key constraints, model requirements, or store-generated values.

What is this?? The funny part was that our time at the hackathon was running out…

With the help of the .NET developers of Involved, we found the problem…

EF generates our insert queries, and the problem is that it does so object by object. We added 2000 new objects, so this results in 2000 INSERT queries that look like this:

INSERT INTO NewPeoples(FirstName, LastName, Id, DathBirth, Job, Partner)
VALUES ('Brecht','Carlier','01025425','08-07-1995','Programmer','07896541');

The problem is the “foreign key”, Partner. Assume Brecht Carlier is the first person in our list, and his girlfriend (07896541) is the last one in the list…

EF generates our queries but crashes on the foreign key, because his girlfriend isn’t in the DB yet… It cannot reference the partner, because she doesn’t exist yet; this explains our error!

Second try

So how are we going to solve this issue?

We’re going to convert our people to new people and calculate their job, then save this to the DB. After that we will calculate the partners (UPDATE). So we will loop twice over our _newPeoples list…

async void SearchJobAsync()
{
    int max = Convert.ToInt32(Math.Ceiling(count / WidthCalc));
    for (int l = 0; l < max; l++)
    {
        _people = Context.People.Take(WidthCalc).ToList();
        _newPeoples = new List<NewPeople>();

        foreach (var p in _people)
        {
            var add = new NewPeople();
            add.Create(p);
            _newPeoples.Add(add);
        }

        //First loop: only assign the jobs
        foreach (var newP in _newPeoples)
        {
            foreach (var a in _arbeiden)
            {
                if (a.Check(newP))
                    break;
            }
        }

        //Insert the new people without partners first,
        //so every row exists before we reference it
        Context.People.RemoveRange(_people);
        Context.NewPeople.AddRange(_newPeoples);
        Context.SaveChanges();

        //Second loop: every row is in the DB now,
        //so the partner FK can safely be set (UPDATE)
        for (int j = 0; j < _newPeoples.Count; j++)
        {
            var newP = _newPeoples.ElementAt(j);

            //Check if person has a partner
            //If not find one
            if (newP.Partner == null)
            {
                for (int k = j + 1; k < _newPeoples.Count; k++)
                {
                    var m = _newPeoples.ElementAt(k);
                    if (CalculatePartner(m, newP))
                        break;
                }
            }
        }

        //Save to DB
        Context.SaveChanges();

        //Update DataGrid
        _i = 0;
        var data = Context.People.Take(WidthShow);
        Lijst = new ObservableCollection<People>(data);
    }
}

So this fixed our error! This is what we did at the hackathon. If you check the Diagnostic Tools in Visual Studio, you’ll see that it first deletes all people from the original table, then adds all the calculated people to the new table, and after that updates the calculated people with their partner. You can also check this in a profiler; I used the lightweight Express Profiler! In total we fired 3 * 2000 = 6000 queries!

[Image: Express Profiler showing the separate INSERT and UPDATE queries]

Processing 2000 records took approximately 60 s, so the whole DB should take about 500 minutes (a bit over 8 hours). If we check our DB with SQL Server Management Studio, we see that the results look good 🙂

[Image: the results in SQL Server Management Studio]

There is still a big bug…
If you close the application between the two Context.SaveChanges() calls, you’ll have the people in the NewPeople table but their partner won’t be calculated; as a result nobody will have love (NULL)… We will fix this bug in a sec, but first I want to make my queries better / faster…

Third try

At home I was hunting for the bug I mentioned above. But while searching, I couldn’t resist looking for a solution where I only had to loop once through my _newPeoples list!
I was looking for a way to insert all people at once (aka in one query). After a lot of searching I found something called bulk insert, a SQL mechanism for inserting multiple records at once.

But EF6 doesn’t support bulk inserts out of the box… Luckily someone wrote an extension for EF, so now we are able to execute bulk inserts.
I used the following extension: EntityFramework.BulkInsert

async void SearchJobAsync()
{
    //Calculate how many times we have to execute our for loop
    int max = Convert.ToInt32(Math.Ceiling(count / WidthCalc));
    for (int l = 0; l < max; l++)
    {
        _people = Context.People.Take(WidthCalc).ToList();
        _newPeoples = new List<NewPeople>();

        foreach (var p in _people)
        {
            var add = new NewPeople();
            add.Create(p);
            _newPeoples.Add(add);
        }

        //Back to a single loop: BulkInsert writes all rows in one bulk
        //operation, so EF's insert-ordering problem doesn't occur
        for (int j = 0; j < _newPeoples.Count; j++)
        {
            var newP = _newPeoples.ElementAt(j);

            foreach (var a in _arbeiden)
            {
                if (a.Check(newP))
                    break;
            }

            //Check if person has a partner
            //If not find one
            if (newP.Partner == null)
            {
                for (int k = j + 1; k < _newPeoples.Count; k++)
                {
                    var m = _newPeoples.ElementAt(k);
                    if (CalculatePartner(m, newP))
                        break;
                }
            }
        }

        Context.People.RemoveRange(_people);
        //Context.NewPeople.AddRange(_newPeoples);
        Context.BulkInsert(_newPeoples);

        //Save to DB
        Context.SaveChanges();

        //Update DataGrid
        _i = 0;
        var data = Context.People.Take(WidthShow);
        Lijst = new ObservableCollection<People>(data);
    }
}

So for inserting the data we now use:

Context.BulkInsert(_newPeoples);

It now takes 16 seconds for 2000 records. That’s 73% faster (60 s vs 16 s), and about 2 h 30 m for completing all 1 000 000 records!
Still, this can be better: the deleting now takes very long, because it doesn’t use “bulk” SQL. It just deletes every record separately…

I used the following extension: Entity Framework Extended

I replaced our original delete (Context.People.RemoveRange(_people)) with the following:

Context.People.Take(WidthCalc).Delete();

So this will only generate one query:

[Image: the single bulk DELETE query in Express Profiler]

The time to process 2000 records is now: 400 ms!!!!
In a stress test, we filled 250 000 records into our NewPeople table in a single minute. Two minutes later we had moved 433 000 records!

I’m really happy with this result!
Because it goes so fast, I decided to make our WidthCalc bigger; as a result fewer people will end up without a partner…

THE bug

So if you close the application after the delete query has fired but before the insert query is done, you have data loss!
If you close the application, it should wait until Context.SaveChanges() is done!
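
One way to guard against this (a sketch of the idea, not necessarily the exact fix we pushed to GitHub; _viewModel is a hypothetical reference from the code-behind) is to keep a flag while a batch is being written and refuse to close until it is cleared:

//In the ViewModel: true while a batch is being written to the DB
public volatile bool IsSaving;

//Around the critical section in SearchJobAsync():
IsSaving = true;
try
{
    Context.People.Take(WidthCalc).Delete();
    Context.BulkInsert(_newPeoples);
    Context.SaveChanges();
}
finally
{
    IsSaving = false;
}

//In the Window's code-behind (Closing event):
private void Window_Closing(object sender, System.ComponentModel.CancelEventArgs e)
{
    //Refuse to close while a batch is still in flight
    if (_viewModel.IsSaving)
        e.Cancel = true;
}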

We fixed this in our code; have a look here: GitHub HackTheFuture 2015 .NET
The bug can be found in the bug tracker!

We had this bug in our demo at HackTheFuture. It was stupid, because all the partner ids (FOREIGN KEY) were NULL 😦

Now, because the processing of the data goes really fast, our application is actually a bit silly: by the time you click the next button, 5000 new people have already been processed… We should add a new View for the NewPeople table and show that one first in our application! We also should have implemented a 3-layer architecture (Presentation Layer, Business Layer and Data Access Layer)…

Lastly, when every person has been calculated, we should search our NewPeople table for persons that don’t have a partner (NULL) and recheck them against each other. Right now we only look for a match within the same batch of 2000 people; if there is none, that person simply won’t have a partner.
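
Something like this could do that final pass (a sketch, reusing the CalculatePartner helper from the batches above):

//Collect everyone who is still single after all batches are done
var singles = Context.NewPeople
    .Where(np => np.Partner == null)
    .ToList();

for (int j = 0; j < singles.Count; j++)
{
    if (singles[j].Partner != null)
        continue; //already matched earlier in this loop

    for (int k = j + 1; k < singles.Count; k++)
    {
        if (singles[k].Partner == null && CalculatePartner(singles[k], singles[j]))
            break;
    }
}

//Persist the new couples (UPDATE queries)
Context.SaveChanges();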

Maybe I’ll implement it one day, but right now I’m busy with school…

THANKS

First I really want to thank the .NET guys @ HackTheFuture. It was a nice experience; without you I never would have learnt all of this! Also thanks for supporting us when we had questions!
Second, I would like to thank Involved, the company behind the .NET discipline/challenge!
Last but not least I want to thank Jan Rauters for inviting us to HackTheFuture and organizing it! Like I said, it was a very nice experience as a student!

We didn’t win the .NET discipline, but we finished second for best Business and Technology perception.
We (Brecht Carlier & Arne Schoonvliet) are very proud of that second place!

Because of this challenge I gained more experience with EF. Next time I have to do something like this, it will go a lot faster!
My next blog post will probably be about Entity Framework 7! It’s currently in RC, but it has some really nice features, like batch updates (combining multiple queries into one big query… doesn’t that sound familiar?).

Thanks for reading! Leave a comment below for questions / remarks.
I’m still a student without much experience; I just want to share what I learnt.
But if something is wrong or there is a better solution for the problem, please let me know!
