-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query: Improved performance for String.StartsWith() filters #7429
Comments
@jemiller0 what type is your updated_by column, VARCHAR or NVARCHAR? In case it's varchar then the piece CHARINDEX(N'jemiller', [b].[updated_by], may force an conversion of the column to nvarchar row by row bypassing the index, while LIKE alone may not force a conversion and allow the index to be used. |
The column is a NVARCHAR(40) NULL. I generated the database using EF Core and it's a string property. Hopefully, EF Core wouldn't include the N'' strings if it was a VARCHAR. I have the same database in PostgreSQL and MySQL. MySQL (Pomelo provider) does something similar and generates the following SQL.
The query takes around 1 to 2 seconds with MySQL. Commenting out the call to POSITION() appears to not have a noticeable effect. PostgreSQL generates the following SQL.
I'm assuming || is a concatenation operator in PostgreSQL. PostgreSQL doesn't address the issue. One thing I've been wondering, why isn't EF Core using db parameters? I'm pretty sure other ORMs like NHibernate do. I'm surprised that it doesn't. Wouldn't not using parameters slow things down because the database server would have to keep recompiling the SQL on the backend? In the case of SQL Server, I noticed that the query takes about 15 seconds the first time through. Then, it takes 1 second if the CHARINDEX() call is removed. Otherwise, it consistently takes 11 seconds with it. |
One other thing I noticed is that if I comment out the JOIN to the other table, then, the query takes around 0.7 second with the call to CHARINDEX() and ironically, it takes 1.7 seconds without it. I ran it multiple times and the results were the same. So, in the simple case, you may not see the issue. It probably only happens when you are joining in other tables. |
@jemiller0 if removing the join gives a difference of 0.7 and 1.7 with CHARINDEX respectively, then this means the carnality estimator gets confused, it doesn't think the index on [updated_by] is more selective than presumably (if it exists) the index on [bib_id]. I would suggest to help the query planner to work better by Updating and/or Creating a new set o statics on the columns that need to be considered for index usage. I would try the following, and if one step does not solve the problem moving to the next:
If neither of the above work, then the query is simply NOT friendly to the optimizer and the only way to fix this would be to rewrite the query using some alternatives, besides forcing an execution plan. |
Thanks. I will give that a try. I still think that EF Core shouldn't add that check unless the input contains a special character though. I think it would result in less problems out of the box. |
@jemiller0 @popcatalin81 I was also going to suggest trying UPDATE STATISTICs hoping that would help. If it doesn't help we can consider other options. Sniffing into the argument for Also, this is a scenario in which having actual support for LIKE in LINQ (#6159) would help because LIKE by definition is free to treat any wildcards as such. |
Another thing worth considering is caching. Currently we only depend on parameter nullability determine the resulting SQL (with a given query plan we will always generate the same SQL, regardless of the parameter value, unless the value is null). If we were to add the extra check only when needed, we would have 3 possible states: null, non-null-with-escaping, non-null-without-escaping. For all cases apart from scenarios around 'Like', states 2 and 3 would produce the same sql, so they are effectively unnecessary cache entries. |
@jemiller0 did |
I tried UPDATE STATISTICS and it didn't make a difference. I haven't had a chance to test items 2, 3, and 4 listed above. I've been meaning to do that, but, got tied up with unrelated projects. |
@maumar can you do a time boxed effort to try and reproduce this |
I'm going to try the rest of the UPDATE STATISTICS related commands shortly. I tried testing the query using the ESCAPE clause just for the heck of it. I found that performance when doing it that way is by far the worst. I ran it twice and both times it took 1 minute 17 seconds. Seems like it must be causing it to do a table scan or something.
|
One other thing I was thinking of. Couldn't you just apply the StartsWith() filter in C# as the rows are being streamed into RAM (in addition to just doing a LIKE by itself)? I know it wouldn't be ideal. Just a thought. |
@jemiller0, reading your comment above, can you please elaborate on "PostgreSQL doesn't address the issue" (I'm the Npgsql maintainer)? I'm asking because I'm currently working on fixing #474 for PostgreSQL as well. |
Sorry I'm guessing you mean there's no double-check (i.e. LIKE and a CHARINDEX-equivalent) |
Yeah, I think it doesn't use CHARINDEX() if I remember correctly. |
I'll be introducing that soon. |
Here's what I'm seeing. I attached screenshots for what the execution plan looks like for LIKE by itself, LIKE with ESCAPE, and LIKE with CHARINDEX() for SQL Server. LIKE by itself usually takes around 2 seconds, although it might be something like 10 seconds the first time through. LIKE with CHARINDEX() is usually around 11 seconds. LIKE with ESCAPE is the worst and usually takes around 1 minute 15 seconds or greater. |
Remove the ORDER BY... |
An ORDER BY of some kind is needed since I'm using paging and it's using OFFSET FETCH. I did notice that if I change the column that is being ordered by to the same column that is being filtered on, that the query returns immediately for all 3 types of filtering. It makes sense that it would have trouble filtering on one column and sorting on another, but, somehow it works fine when you only use LIKE. When I ran the queries in the images above, I forgot and ran the queries that the current release version of EF Core generate. They have an extra column in the ORDER BY. That bug was fixed separately. Removing the extra column ([b].[bib_id]) doesn't appear to have much of an effect. |
@roji mentioned using LEFT(LEN()) (similar to what is done for .EndsWith()) instead of CHARINDEX() in #474. I tested this as in
and found that it was faster than even using LIKE by itself. For my problem query, it was the fasted performance yet. Consistently less than 1 second with the LEFT() check while it was around 2 seconds with only LIKE. Also, I tested it without LIKE at all and that took around 2 seconds. I'm not sure if it would need an extra OR where it checks for = '' (empty string). I noticed that when you do a .EndsWith() it does that at least sometimes. Would it be possible to switch to using LEFT(LEN()) instead of CHARINDEX() for the SQL Server provider? This seems to cure the problem that I've been seeing. |
@jemiller0 thanks for this new datapoint. Would you be able to provide the corresponding query plan screenshot? |
In all 5 execution plans the primary index used is on the [date_updated] column, In none of the execution plans is the index on [updated_by] column used (if any). If the index [updated_by] doesn't exist or is not used, then any assumptions the performance characteristics about different ways to translate StartsWith() could be false positives. StartsWith() type of search in SQL Server, LIKE 'term%' is by definition a type of search that can take Advantage of indexes , and it's normally the fastest type (can even use index Seeks). There needs to be a baseline for comparisons where the index is actually used and an optimal execution plan is used, and after that test for various other ways to write the query. @jemiller0 does an index on the [updated_by] exist? |
@popcatalin81 Yes it has an index. Here is a screen shot of the table definitions. Here's some info on the index fragmentation in case it helps. They are all less than a percent. The tables have close to 8 million rows in them and the main table has an XML field in it. |
FWIW, we have seen in our own testing that query plan selection is very sensitive to a variety of factors. You can even say it can be a bit hard to setup a scenario in which the index is consistently used. In general, the performance characteristics of all these variations could be different for different kinds of queries, different SQL Server versions and different data sets, but this is still valid empirical data. Also it is interesting that of the previous screenshots, these two showed the index being used:
The query plan for the query that has LIKE alone does not appear to use the index. BTW, screenshot for LIKE AND LEFT() shows a slightly different ORDER BY key. @jemiller0 was that intentional? |
@divega That was by mistake. I copied the queries from the log file I have from EF Core. I'm currently using the current release version. The nightly builds fix #6861. The current release version includes extra sort columns that it shouldn't. I think it needs them for one-to-manys, but, doesn't for many-to-ones. I re-ran the queries without the extra sort column after the fact and the results appeared to be the same. I can post new screen captures if you want them. |
@jemiller0 thanks. No need for new screenshots. Just wanted to be sure the results were the same. |
So, is there any disadvantage to using LEFT(LEN()) instead of CHARINDEX()? This is somewhat of a blocker for me. I want to switch from EF 6 to EF Core, but, I can't do it until I'm able to resolve performance issues such as this. Also, is there any word on when interception will be implemented? If interception were implemented, I could do a search and replace within the SQL text to work around the problem. Also, it would be great if fixes could be published a little quicker. It's been awhile since the last version was published and there are other blockers such as the problem where it was adding extra sort columns that I need. I'm assuming that one would be a problem for anyone working with large tables. |
@jemiller0 we are in the process of adding support for explicit Like function which should hopefully address your problem, PR is being reviewed at the moment: #7611 |
Thanks, but, from my perspective, I would rather not use Like(). I would rather have a more efficient StartsWith(). I'm using Telerik's RadGrid component which generates Dynamic LINQ which uses StartsWith(). I could modify the string that they are creating, but, would prefer not to. Is there something more efficient about using CHARINDEX() than using LEFT(LEN())? It seems like CHARINDEX() would be less efficient as it would have to scan the whole string. LEFT(LEN()) is the same approach that EndsWith() is using, except it's using LEFT() instead of RIGHT(). |
Agree with @jemiller0 that optimizing |
We are also planning to switch our translation to LEFT(LEN()), it doesn't seem to have any disadvantages, as far as we know. We will consult some SQL experts just to be sure. The change will come in 2.0.0 |
That sounds great. Thanks for everyone's work on this. Is there going to be a 1.2 release before 2.0? |
We don't have the plans for it currently, but if we end up having 1.2 before 2.0, then this fix will very likely be a part of that release. |
That sounds great. Thanks again. |
Fix is to use more permanent translation: property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
Fix is to use more permanent translation: property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
Fix is to use more permanent translation: property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
Fixed in d8567c0 |
I still find it very odd that the It isn't obvious at all that LEFT doesn't work with indexing in SQL Server so until you have to end up researching why it's hard to guage the reason why things are so slow. Fortunately |
Describe what is not working as expected.
I have a LINQ query that is using a .StartsWith() search on a string column. I found that it is being translated into the following SQL WHERE clause.
WHERE [b].[updated_by] LIKE N'jemiller%' + N'%' AND (CHARINDEX(N'jemiller%', [b].[updated_by]) = 1)
The full SQL statement is.
The problem that I'm having is that the query is taking 11 seconds to complete. I'm using SQL Server 2016. The primary table that is being queried has 7.7 million rows and the column that is being filtered on has an index. If I comment out the "AND (CHARINDEX(N'jemiller', [b].[updated_by]) = 1)" part of the WHERE clause so that it's only doing a LIKE, then, the query only takes 1 second. Although, it looked like it took longer the first time through. I think the extra call to CHARINDEX() may be throwing the query optimizer in SQL Server off. If I view the execution plan in SQL Management Studio that it has 9 different steps when CHARINDEX() is included an only 7 without it. I can send the execution plan XML if that would help.
Issue #474 talks about why the CHARINDEX() check is included. Would it be possible to only include that check if the input string includes a LIKE special character such as %, _, [, or ]?
Or maybe have it use the LIKE ESCAPE clause? It would be nice if it only did it if the special characters were present, but, someone mentioned something about it causing a problem caching. CHARINDEX() looks like a good way to handle it, but, it looks like it could cause performance problems, at least for some queries. In general, it seems like it would be good to keep the expression as simple as possible and only make it more complicated when necessary.
Failing that, would it be possible to make this behavior configurable? This is pretty much a blocker for my app because performance is just too slow.
Further technical details
EF Core version: 1.2.0-preview1-23129
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Anniversary Update
IDE: Visual Studio 2015
The text was updated successfully, but these errors were encountered: