>LINQ Query Operators – Part VI

>

Set operators

In set operators we will discuss Distinct(), Union(), Intersect(), Except() and Concat().

Distinct: Removes the duplications and selects unique items. T-SQL counterpart is "Distinct".

Union: Returns the union of items from multiple sets. In other words adds one set’s elements to other removing duplicates. T-SQL counterpart is "Union".

Concat: Returns the union of two sets without dropping the duplicates. T-SQL counterpart is "Union All".

Intersect: Returns elements from two sets which exists in both sets. T-SQL counterpart "exists", "In (….)" (SQL2008 "intersect").

Except: Returns elements from a set which doesn’t exist in other set. T-SQL counterpart "not exists", "not In (….)" (SQL2008 "except").

Customers.Select(c=>c.Country).Distinct()

is the same as:  

Select DISTINCT Country FROM Customers 

If we have used:

Customers.Distinct()

The SQL would be:

Select * from Customers

Just as we expected. (The actual generated SQL lists the fields instead of using *)

Distinct accepts a parameter to define an "equality" interface (IEqualityComparer<TSource>). So you can for example accept the items that have say "Netherlands" and "Holland" as same while applying a distinct operation. This IEqualityComparer interface is not only used by Distinct but by a series of methods like Group, GroupJoin, Intersect, Except, Union. Before we go into sampling IEqualityComparer lets make a note that objects are compared using equality comparison and hashcode. Hashcode is a "signature" of object. Two different objects have different hashcode even if their properties are same. It is not hard to understand this. In VFP consider:

use myTable
scatter name o1
skip
scatter name o2

First two records might exactly have the same data in them and hence o1, o2 would match in their properties but still they are 2 different objects as shown in the sample code below:

Create Cursor sampledata ( firstName c(10), lastname c(10))
Insert Into sampledata Values ("Joe","Doe")
Insert Into sampledata Values ("Joe","Doe")
Insert Into sampledata Values ("Frank","Doe")

Locate
Scatter Name o1
Skip
Scatter Name o2

Clear
? o1.firstName,o1.lastname
? o2.firstName,o2.lastname

? "FirstNames are equal",o1.firstName == o2.firstName
? "LastNames are equal", o1.lastname == o2.lastname
? "Objects have identical properties", Compobj(o1,o2)
? "Objects are same", o1 = o2

Lets create some SQL server data to see all these visually:

create database LinqTest
go

use LinqTest
go

create table badData (
  personId int identity primary key,
  firstName varchar(15),  lastName varchar(20),
  email1 varchar(50),  email2 varchar(50),
  phone1 varchar(15),  phone2 varchar(15),  phone3 varchar(15))

insert into badData 
  (firstName,lastName, email1,email2, phone1,phone2,phone3)
  values
  (‘John’, ‘Doe’, ‘jdoe@myhost1.com’,’jdoe@myhost2.com’,’111-111 11 11′, ‘111-111 11 12’, ‘111-111 11 13’)

insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘John’, ‘Doe’, ‘jdoe@myhost1.com’,’jdoe@myhost2.com’,’111-111 11 11′, ‘111-111 11 12’, ‘111-111 11 13’)

insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘John’, ‘Doe’, ‘jdoe@myhost2.com’,’jdoe@myhost1.com’,’111-111 11 13′, ‘111-111 11 11’, ‘111-111 11 12’)

insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘Joe’, ‘Doe’,’jdoe@myhost2.com’,”,  ”, ”, ”)
 
insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘Joseph’, ‘Doe’, ”,”, ‘111-111 11 13’, ”, ”)

insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘Frank’, ‘Doe’, ‘fdoe@myhost1.com’,’fdoe@myhost2.com’,’111-111 11 11′, ”, ”)

insert into badData 
  (firstName,lastName, email1,email2,phone1,phone2,phone3)
  values
  (‘John’, ‘Doe’,’jdoe@myhost3.com’,’jdoe@myhost4.com’, ‘222-111 11 11’, ‘222-111 11 12’, ‘222-111 11 13’)
go

Select * from badData

Create the same data as badData2 – to use with Union and see the effect of hashcode. Now we will call Distinct() and Union() using this data. Before doing that lets check our data. It is a bad data design, but here did that on purpose. First two are plain duplicate entry. 3rd one is a duplication where data is in different columns (email1, email2 swapped, phone1,phone2,phone3 are scrambled). 4th, 5th and 6th have different names and only part of the data is entered but they match to others, which is likely to be another kind of duplication (or not a duplication but suppose we want to call offices and call each office only once, then first 6 are the same office). Last one is a match on name but neither his emails nor phones match others which we would think of as another person. Here is our code:

void Main()
{
   var b1 =
      BadData.Select( b =>
       new Person {
         FirstName = b.FirstName, LastName = b.LastName,
         Email1 = b.Email1, Email2 = b.Email2,
         Phone1 = b.Phone1, Phone2 = b.Phone2, Phone3 = b.Phone3
       });

   var b2 =
      BadData2.Select( b =>
       new Person {
         FirstName = b.FirstName, LastName = b.LastName,
         Email1 = b.Email1, Email2 = b.Email2,
         Phone1 = b.Phone1, Phone2 = b.Phone2, Phone3 = b.Phone3
       });

   b1.Union( b2 ).Dump("b1 union b2 – Union in SQL");
   b1.Distinct().Dump("b1 distinct – Distinct in SQL");

   PersonEqualityComparer cmp = new PersonEqualityComparer();
   b1.AsEnumerable().Union( b2.AsEnumerable(), cmp ).Dump("b1 union b2 with comparer");
   b1.AsEnumerable().Distinct( cmp ).Dump("b1 distinct with comparer");
}

// Define other methods and classes here
public class Person
{
  public string FirstName {get;set;}
  public string LastName {get;set;}
  public string Email1 {get;set;}
  public string Email2 {get;set;}
  public string Phone1 {get;set;}
  public string Phone2 {get;set;}
  public string Phone3 {get;set;}
}

public class PersonEqualityComparer : IEqualityComparer<Person>
{
   public bool Equals(Person x, Person y) {
      return IsMatch(
         new string[]{x.Email1,x.Email2},
         new string[]{y.Email1,y.Email2}
         ) || IsMatch(
         new string[]{x.Phone1,x.Phone2,x.Phone3},
         new string[]{y.Phone1,y.Phone2,y.Phone3});
   }

   public int GetHashCode(Person obj) {
      return obj.GetHashCode();
   }

   private bool IsMatch(string[] words1,string[] words2) {
      return (
         words1.Where(w => !String.IsNullOrEmpty( w ))
         .Intersect(
         words2.Where(w => !String.IsNullOrEmpty( w ))
         ).Count() > 0);
   }
}

Here are the results:

UnionAndDistinct

It is what you expected for the first two results (distinct and union is done directly in T-SQL but next two looks surprising. If I didn’t use a new Person class then those would be same as the first two results. But we created a custom comparer to mean that:

"If any email or phone matches then it is a duplicate"

so why would we get too many results? Much more than the one without a custom comparison? Well, we defined a custom Equals() method but we still return the Hash Code of the original object. Remember to be accepted same not only should Equals() return true but hash codes must match as well. For this reason none of them is equal and union returns a result like as if we wanted to do a "union all". To correct it we would just change the GetHashCode() implementation and see what we would get: 

   public int GetHashCode(Person obj) {
      return "".GetHashCode();
   }

and here are our new result sets (last two displayed):

UnionAndDistinct2

Now that we have covered defining our own equality operator, Union and Distinct lets check other methods Concat,Intersect and Except.

Customers
  .Where( c => c.Country == "USA" )
  .Select( c => new {c.CustomerID,c.CompanyName,c.Country} )
  .Concat( 
Customers
  .Where( c => c.Country == "UK" )
  .Select( c => new {c.CustomerID,c.CompanyName,c.Country} )
)

This sample is a typical "UNION ALL" of USA customers set and UK customers set ( Select( c => new … ) projection is for sampling only, if we didn’t have Select() parts it would be all fields).

Products.Where( p => p.OrderDetails.Any( od => od.Order.Customer.Country == "Argentina" ))

Here we use Any() which we will see later in "Quantifier" operators. What it does is to return true if any item in the set has a match to its filter expression ( T-SQL exists or any). This query is selecting products that are bought by Argentina customers. Here is the result set (partial):

ArgentinaPartial

Products.Where( p => p.OrderDetails.Any( od => od.Order.Customer.Country == "Argentina" ))
.Intersect(
Products.Where( p => p.OrderDetails.Any( od => od.Order.Customer.Country == "Belgium" ))
)

In this query Second part is doing the same for Belgium customers creating a seconds set. Then using INTERSECT we are getting the Products which are common to both sets. Here is the result set (picture cut for space):

intersect

As you should notice there are 14 common items (Argentina customers bought 32 different products). Looking at the results we could say that products with Id 5, 7, 11, 13, 14, 16, 32,34 … are not bought by Belgium customers. Lets find out which products bought by Argentina customers were not on demand by Belgium customers (in other words: what are the products that are bought by Argentina customers EXCEPT those are also bought by Belgium customers).

Products.Where( p => p.OrderDetails
  .Any( od => od.Order.Customer.Country == "Argentina" ))
.Except(
Products.Where( p => p.OrderDetails
  .Any( od => od.Order.Customer.Country == "Belgium" ))
)

Here is the result set:

except 

Like in Distinct and Union, Intersect, Except methods have support for additional custom comparer.

PS: Because of years of working with SQL I happened to write insert statements for LinqTest database in T-SQL. Instead I could do the data insertion part using Linq. For example if you wanted add few more it would look like:

void Main()
{
   List<BadData> persons = 
     new List<BadData>
     {
       new BadData {FirstName="Frank",LastName="Smith",
           Email1="m1",Email2="m2",
           Phone1="p1",Phone2="p2",Phone3="p3"},
       new BadData {FirstName="Cetin",LastName="Basoz"},
       new BadData {FirstName="Fred",LastName="Flintstone",
           Email1="fred@yellowstone.com",Phone1="555-111 11 11"}
     };

   BadData.InsertAllOnSubmit(persons);
   BadData.Context.SubmitChanges();

   // Check results
   BadData.Dump();
}

Or you could insert a single element like this:

BadData b = new BadData();
b.FirstName = "Jack";
b.LastName = "Smith";
b.Phone2 = "666-123 45 67";

BadData.InsertOnSubmit( b );
// more code if any
BadData.Context.SubmitChanges();

 

See you on next post.

About cetinbasoz

A developer working primarily with C#,F#,VFP,SQL server, Silverlight,Linq ...
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a comment