In set operators we will discuss Distinct(), Union(), Intersect(), Except() and Concat().
Distinct: Removes the duplications and selects unique items. T-SQL counterpart is "Distinct".
Union: Returns the union of items from multiple sets. In other words adds one set’s elements to other removing duplicates. T-SQL counterpart is "Union".
Concat: Returns the union of two sets without dropping the duplicates. T-SQL counterpart is "Union All".
Intersect: Returns elements from two sets which exists in both sets. T-SQL counterpart "exists", "In (….)" (SQL2008 "intersect").
Except: Returns elements from a set which doesn’t exist in other set. T-SQL counterpart "not exists", "not In (….)" (SQL2008 "except").
is the same as:
If we have used:
The SQL would be:
Just as we expected. (The actual generated SQL lists the fields instead of using *)
Distinct accepts a parameter to define an "equality" interface (IEqualityComparer<TSource>). So you can for example accept the items that have say "Netherlands" and "Holland" as same while applying a distinct operation. This IEqualityComparer interface is not only used by Distinct but by a series of methods like Group, GroupJoin, Intersect, Except, Union. Before we go into sampling IEqualityComparer lets make a note that objects are compared using equality comparison and hashcode. Hashcode is a "signature" of object. Two different objects have different hashcode even if their properties are same. It is not hard to understand this. In VFP consider:
scatter name o1
scatter name o2
First two records might exactly have the same data in them and hence o1, o2 would match in their properties but still they are 2 different objects as shown in the sample code below:
Lets create some SQL server data to see all these visually:
Create the same data as badData2 – to use with Union and see the effect of hashcode. Now we will call Distinct() and Union() using this data. Before doing that lets check our data. It is a bad data design, but here did that on purpose. First two are plain duplicate entry. 3rd one is a duplication where data is in different columns (email1, email2 swapped, phone1,phone2,phone3 are scrambled). 4th, 5th and 6th have different names and only part of the data is entered but they match to others, which is likely to be another kind of duplication (or not a duplication but suppose we want to call offices and call each office only once, then first 6 are the same office). Last one is a match on name but neither his emails nor phones match others which we would think of as another person. Here is our code:
Here are the results:
It is what you expected for the first two results (distinct and union is done directly in T-SQL but next two looks surprising. If I didn’t use a new Person class then those would be same as the first two results. But we created a custom comparer to mean that:
"If any email or phone matches then it is a duplicate"
so why would we get too many results? Much more than the one without a custom comparison? Well, we defined a custom Equals() method but we still return the Hash Code of the original object. Remember to be accepted same not only should Equals() return true but hash codes must match as well. For this reason none of them is equal and union returns a result like as if we wanted to do a "union all". To correct it we would just change the GetHashCode() implementation and see what we would get:
and here are our new result sets (last two displayed):
Now that we have covered defining our own equality operator, Union and Distinct lets check other methods Concat,Intersect and Except.
This sample is a typical "UNION ALL" of USA customers set and UK customers set ( Select( c => new … ) projection is for sampling only, if we didn’t have Select() parts it would be all fields).
Here we use Any() which we will see later in "Quantifier" operators. What it does is to return true if any item in the set has a match to its filter expression ( T-SQL exists or any). This query is selecting products that are bought by Argentina customers. Here is the result set (partial):
In this query Second part is doing the same for Belgium customers creating a seconds set. Then using INTERSECT we are getting the Products which are common to both sets. Here is the result set (picture cut for space):
As you should notice there are 14 common items (Argentina customers bought 32 different products). Looking at the results we could say that products with Id 5, 7, 11, 13, 14, 16, 32,34 … are not bought by Belgium customers. Lets find out which products bought by Argentina customers were not on demand by Belgium customers (in other words: what are the products that are bought by Argentina customers EXCEPT those are also bought by Belgium customers).
Here is the result set:
Like in Distinct and Union, Intersect, Except methods have support for additional custom comparer.
PS: Because of years of working with SQL I happened to write insert statements for LinqTest database in T-SQL. Instead I could do the data insertion part using Linq. For example if you wanted add few more it would look like:
Or you could insert a single element like this:
See you on next post.