Query: Improved performance for String.StartsWith() filters #7429

jemiller0 · 2017-01-17T21:08:55Z

Describe what is not working as expected.

I have a LINQ query that is using a .StartsWith() search on a string column. I found that it is being translated into the following SQL WHERE clause.

WHERE [b].[updated_by] LIKE N'jemiller%' + N'%' AND (CHARINDEX(N'jemiller%', [b].[updated_by]) = 1)

The full SQL statement is.

SELECT [b].[bib_id], [b].[content], [b].[date_created], [b].[created_by], [b].[fast_add], [b].[former_id], [b].[unique_id_prefix], [b].[date_updated], [b].[updated_by], [b].[staff_only], [b].[status], [b].[status_updated_date], [b].[status_updated_by], [b.BibExt].[id], [b.BibExt].[author], [b.BibExt].[creation_time], [b.BibExt].[creation_user_name], [b.BibExt].[last_write_time], [b.BibExt].[last_write_user_name], [b.BibExt].[level], [b.BibExt].[oclc_number], [b.BibExt].[publication_date], [b.BibExt].[subscription_status], [b.BibExt].[title], [b.BibExt].[type]
FROM [ole_ds_bib_t] AS [b]
LEFT JOIN [uc_bib_ext] AS [b.BibExt] ON [b].[bib_id] = [b.BibExt].[id]
WHERE [b].[updated_by] LIKE N'jemiller' + N'%' AND (CHARINDEX(N'jemiller', [b].[updated_by]) = 1)
ORDER BY [b].[date_updated] DESC
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY

The problem that I'm having is that the query is taking 11 seconds to complete. I'm using SQL Server 2016. The primary table that is being queried has 7.7 million rows and the column that is being filtered on has an index. If I comment out the "AND (CHARINDEX(N'jemiller', [b].[updated_by]) = 1)" part of the WHERE clause so that it's only doing a LIKE, then, the query only takes 1 second. Although, it looked like it took longer the first time through. I think the extra call to CHARINDEX() may be throwing the query optimizer in SQL Server off. If I view the execution plan in SQL Management Studio that it has 9 different steps when CHARINDEX() is included an only 7 without it. I can send the execution plan XML if that would help.

Issue #474 talks about why the CHARINDEX() check is included. Would it be possible to only include that check if the input string includes a LIKE special character such as %, _, [, or ]?

Or maybe have it use the LIKE ESCAPE clause? It would be nice if it only did it if the special characters were present, but, someone mentioned something about it causing a problem caching. CHARINDEX() looks like a good way to handle it, but, it looks like it could cause performance problems, at least for some queries. In general, it seems like it would be good to keep the expression as simple as possible and only make it more complicated when necessary.

Failing that, would it be possible to make this behavior configurable? This is pretty much a blocker for my app because performance is just too slow.

Further technical details

EF Core version: 1.2.0-preview1-23129
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Anniversary Update
IDE: Visual Studio 2015

The text was updated successfully, but these errors were encountered:

popcatalin81 · 2017-01-18T15:21:40Z

@jemiller0 what type is your updated_by column, VARCHAR or NVARCHAR?

In case it's varchar then the piece CHARINDEX(N'jemiller', [b].[updated_by], may force an conversion of the column to nvarchar row by row bypassing the index, while LIKE alone may not force a conversion and allow the index to be used.

jemiller0 · 2017-01-18T19:24:48Z

The column is a NVARCHAR(40) NULL. I generated the database using EF Core and it's a string property. Hopefully, EF Core wouldn't include the N'' strings if it was a VARCHAR.

I have the same database in PostgreSQL and MySQL. MySQL (Pomelo provider) does something similar and generates the following SQL.

WHERE b.updated_byLIKE CONCAT('jemiller','%') AND (POSITION('jemiller' INb.updated_by) = 1)

The query takes around 1 to 2 seconds with MySQL. Commenting out the call to POSITION() appears to not have a noticeable effect.

PostgreSQL generates the following SQL.

WHERE "b"."updated_by" LIKE ('jemiller' || '%')

I'm assuming || is a concatenation operator in PostgreSQL. PostgreSQL doesn't address the issue.

One thing I've been wondering, why isn't EF Core using db parameters? I'm pretty sure other ORMs like NHibernate do. I'm surprised that it doesn't. Wouldn't not using parameters slow things down because the database server would have to keep recompiling the SQL on the backend?

In the case of SQL Server, I noticed that the query takes about 15 seconds the first time through. Then, it takes 1 second if the CHARINDEX() call is removed. Otherwise, it consistently takes 11 seconds with it.

jemiller0 · 2017-01-18T19:37:41Z

One other thing I noticed is that if I comment out the JOIN to the other table, then, the query takes around 0.7 second with the call to CHARINDEX() and ironically, it takes 1.7 seconds without it. I ran it multiple times and the results were the same.

So, in the simple case, you may not see the issue. It probably only happens when you are joining in other tables.

popcatalin81 · 2017-01-18T21:56:43Z

@jemiller0 if removing the join gives a difference of 0.7 and 1.7 with CHARINDEX respectively, then this means the carnality estimator gets confused, it doesn't think the index on [updated_by] is more selective than presumably (if it exists) the index on [bib_id]. I would suggest to help the query planner to work better by Updating and/or Creating a new set o statics on the columns that need to be considered for index usage.

I would try the following, and if one step does not solve the problem moving to the next:

Update statistics UPDATE STATISTICS
Ensure Indexes are not fragmented (Rebuild or Defragment) Link
Create new statistics on [updated_by] with FULLSCAN if possible, or as as big sample as possible
Create new statistics on [bib_id] with FULLSCAN if possible, or as as big sample as possible

If neither of the above work, then the query is simply NOT friendly to the optimizer and the only way to fix this would be to rewrite the query using some alternatives, besides forcing an execution plan.

jemiller0 · 2017-01-18T22:09:32Z

Thanks. I will give that a try.

I still think that EF Core shouldn't add that check unless the input contains a special character though. I think it would result in less problems out of the box.

divega · 2017-01-18T22:53:26Z

@jemiller0 @popcatalin81 I was also going to suggest trying UPDATE STATISTICs hoping that would help. If it doesn't help we can consider other options.

Sniffing into the argument for StartsWith() to check if it contains wildcards is a possibility, but it is potentially provider specific and would only work if the argument is coming from the client.

Also, this is a scenario in which having actual support for LIKE in LINQ (#6159) would help because LIKE by definition is free to treat any wildcards as such.

maumar · 2017-01-18T23:23:09Z

Another thing worth considering is caching. Currently we only depend on parameter nullability determine the resulting SQL (with a given query plan we will always generate the same SQL, regardless of the parameter value, unless the value is null). If we were to add the extra check only when needed, we would have 3 possible states: null, non-null-with-escaping, non-null-without-escaping. For all cases apart from scenarios around 'Like', states 2 and 3 would produce the same sql, so they are effectively unnecessary cache entries.

rowanmiller · 2017-01-31T19:05:22Z

@jemiller0 did UPDATE STATISTICS make a difference?

jemiller0 · 2017-01-31T19:20:21Z

I tried UPDATE STATISTICS and it didn't make a difference. I haven't had a chance to test items 2, 3, and 4 listed above. I've been meaning to do that, but, got tied up with unrelated projects.

rowanmiller · 2017-02-01T19:37:58Z

@maumar can you do a time boxed effort to try and reproduce this

jemiller0 · 2017-02-06T22:00:37Z

I'm going to try the rest of the UPDATE STATISTICS related commands shortly.

I tried testing the query using the ESCAPE clause just for the heck of it. I found that performance when doing it that way is by far the worst. I ran it twice and both times it took 1 minute 17 seconds. Seems like it must be causing it to do a table scan or something.

WHERE [b].[updated_by] LIKE N'jemiller' + N'%' ESCAPE '!'

jemiller0 · 2017-02-06T22:03:31Z

One other thing I was thinking of. Couldn't you just apply the StartsWith() filter in C# as the rows are being streamed into RAM (in addition to just doing a LIKE by itself)? I know it wouldn't be ideal. Just a thought.

roji · 2017-02-06T22:31:10Z

@jemiller0, reading your comment above, can you please elaborate on "PostgreSQL doesn't address the issue" (I'm the Npgsql maintainer)? I'm asking because I'm currently working on fixing #474 for PostgreSQL as well.

roji · 2017-02-06T22:33:08Z

Sorry I'm guessing you mean there's no double-check (i.e. LIKE and a CHARINDEX-equivalent)

jemiller0 · 2017-02-06T22:35:00Z

Yeah, I think it doesn't use CHARINDEX() if I remember correctly.

roji · 2017-02-06T23:26:10Z

I'll be introducing that soon.

jemiller0 · 2017-02-07T20:29:31Z

Here's what I'm seeing. I attached screenshots for what the execution plan looks like for LIKE by itself, LIKE with ESCAPE, and LIKE with CHARINDEX() for SQL Server. LIKE by itself usually takes around 2 seconds, although it might be something like 10 seconds the first time through. LIKE with CHARINDEX() is usually around 11 seconds. LIKE with ESCAPE is the worst and usually takes around 1 minute 15 seconds or greater.

ErikEJ · 2017-02-07T20:37:57Z

Remove the ORDER BY...

jemiller0 · 2017-02-07T23:05:22Z

An ORDER BY of some kind is needed since I'm using paging and it's using OFFSET FETCH. I did notice that if I change the column that is being ordered by to the same column that is being filtered on, that the query returns immediately for all 3 types of filtering. It makes sense that it would have trouble filtering on one column and sorting on another, but, somehow it works fine when you only use LIKE. When I ran the queries in the images above, I forgot and ran the queries that the current release version of EF Core generate. They have an extra column in the ORDER BY. That bug was fixed separately. Removing the extra column ([b].[bib_id]) doesn't appear to have much of an effect.

jemiller0 · 2017-02-08T17:20:39Z

@roji mentioned using LEFT(LEN()) (similar to what is done for .EndsWith()) instead of CHARINDEX() in #474. I tested this as in

WHERE [b].[updated_by] LIKE N'jemiller' + N'%' AND N'jemiller' = LEFT([b].[updated_by], LEN(N'jemiller'))

and found that it was faster than even using LIKE by itself. For my problem query, it was the fasted performance yet. Consistently less than 1 second with the LEFT() check while it was around 2 seconds with only LIKE. Also, I tested it without LIKE at all and that took around 2 seconds.

I'm not sure if it would need an extra OR where it checks for = '' (empty string). I noticed that when you do a .EndsWith() it does that at least sometimes.

Would it be possible to switch to using LEFT(LEN()) instead of CHARINDEX() for the SQL Server provider? This seems to cure the problem that I've been seeing.

divega · 2017-02-08T17:49:11Z

@jemiller0 thanks for this new datapoint. Would you be able to provide the corresponding query plan screenshot?

jemiller0 · 2017-02-08T18:47:06Z

Sure.

popcatalin81 · 2017-02-08T19:22:59Z

In all 5 execution plans the primary index used is on the [date_updated] column, In none of the execution plans is the index on [updated_by] column used (if any). If the index [updated_by] doesn't exist or is not used, then any assumptions the performance characteristics about different ways to translate StartsWith() could be false positives.

StartsWith() type of search in SQL Server, LIKE 'term%' is by definition a type of search that can take Advantage of indexes , and it's normally the fastest type (can even use index Seeks). There needs to be a baseline for comparisons where the index is actually used and an optimal execution plan is used, and after that test for various other ways to write the query.

@jemiller0 does an index on the [updated_by] exist?

jemiller0 · 2017-02-08T19:39:56Z

@popcatalin81 Yes it has an index. Here is a screen shot of the table definitions.

Here's some info on the index fragmentation in case it helps. They are all less than a percent.

The tables have close to 8 million rows in them and the main table has an XML field in it.

divega · 2017-02-08T21:31:46Z

FWIW, we have seen in our own testing that query plan selection is very sensitive to a variety of factors. You can even say it can be a bit hard to setup a scenario in which the index is consistently used.

In general, the performance characteristics of all these variations could be different for different kinds of queries, different SQL Server versions and different data sets, but this is still valid empirical data.

Also it is interesting that of the previous screenshots, these two showed the index being used:

LIKE AND CHARINDEX()
LIKE + ESCAPE

The query plan for the query that has LIKE alone does not appear to use the index.

BTW, screenshot for LIKE AND LEFT() shows a slightly different ORDER BY key. @jemiller0 was that intentional?

jemiller0 · 2017-02-08T21:46:30Z

@divega That was by mistake. I copied the queries from the log file I have from EF Core. I'm currently using the current release version. The nightly builds fix #6861. The current release version includes extra sort columns that it shouldn't. I think it needs them for one-to-manys, but, doesn't for many-to-ones. I re-ran the queries without the extra sort column after the fact and the results appeared to be the same. I can post new screen captures if you want them.

divega · 2017-02-08T21:53:00Z

@jemiller0 thanks. No need for new screenshots. Just wanted to be sure the results were the same.

jemiller0 · 2017-02-08T22:03:28Z

Here they are. I didn't check to see if the execution plans were exactly the same. I just saw what the timings were.

jemiller0 · 2017-02-24T19:06:19Z

So, is there any disadvantage to using LEFT(LEN()) instead of CHARINDEX()? This is somewhat of a blocker for me. I want to switch from EF 6 to EF Core, but, I can't do it until I'm able to resolve performance issues such as this. Also, is there any word on when interception will be implemented? If interception were implemented, I could do a search and replace within the SQL text to work around the problem. Also, it would be great if fixes could be published a little quicker. It's been awhile since the last version was published and there are other blockers such as the problem where it was adding extra sort columns that I need. I'm assuming that one would be a problem for anyone working with large tables.

maumar · 2017-02-27T20:26:14Z

@jemiller0 we are in the process of adding support for explicit Like function which should hopefully address your problem, PR is being reviewed at the moment: #7611

jemiller0 · 2017-02-27T20:35:29Z

Thanks, but, from my perspective, I would rather not use Like(). I would rather have a more efficient StartsWith(). I'm using Telerik's RadGrid component which generates Dynamic LINQ which uses StartsWith(). I could modify the string that they are creating, but, would prefer not to. Is there something more efficient about using CHARINDEX() than using LEFT(LEN())? It seems like CHARINDEX() would be less efficient as it would have to scan the whole string. LEFT(LEN()) is the same approach that EndsWith() is using, except it's using LEFT() instead of RIGHT().

roji · 2017-02-27T20:40:07Z

Agree with @jemiller0 that optimizing Startswith() is orthogonal to providing direct LIKE support via #7611...

maumar · 2017-02-27T20:40:16Z

We are also planning to switch our translation to LEFT(LEN()), it doesn't seem to have any disadvantages, as far as we know. We will consult some SQL experts just to be sure. The change will come in 2.0.0

jemiller0 · 2017-02-27T20:43:26Z

That sounds great. Thanks for everyone's work on this. Is there going to be a 1.2 release before 2.0?

maumar · 2017-02-27T20:49:35Z

We don't have the plans for it currently, but if we end up having 1.2 before 2.0, then this fix will very likely be a part of that release.

jemiller0 · 2017-02-27T20:50:56Z

That sounds great. Thanks again.

Fix is to use more permanent translation: property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters

maumar · 2017-04-10T19:26:33Z

Fixed in d8567c0

simeyla · 2019-06-23T04:19:17Z

I still find it very odd that the LEFT thing is added when there's no special characters. It seems to completely prevent indexes from being used which utterly killed my performance.

It isn't obvious at all that LEFT doesn't work with indexing in SQL Server so until you have to end up researching why it's hard to guage the reason why things are so slow.

Fortunately EF.Functions.Like(...) will solve this and doesn't introduce LEFT into the SQL. Absolute MASSIVE performance increase and I was just doing email.StartsWith(searchQuery).

smitpatel · 2019-06-23T16:36:55Z

@simeyla - Your issue has already been fixed. Tracked via #14657

rowanmiller added the type-investigation label Feb 1, 2017

rowanmiller added this to the 2.0.0 milestone Feb 1, 2017

rowanmiller assigned maumar Feb 1, 2017

divega mentioned this issue Feb 7, 2017

Query: Improve translation of String's StartsWith, EndsWith and Contains #474

Closed

roji mentioned this issue Feb 8, 2017

Query: Provide EF.Functions to define methods that correspond to standard SQL operations #2850

Closed

maumar closed this as completed Apr 10, 2017

maumar added type-enhancement closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. area-perf and removed type-investigation labels Apr 10, 2017

ajcvickers changed the title ~~StartsWith() filter slow with use of SQL CHARINDEX()~~ Query: StartsWith() filter slow with use of SQL CHARINDEX() May 9, 2017

divega changed the title ~~Query: StartsWith() filter slow with use of SQL CHARINDEX()~~ Query: Improved performance for String.StartsWith() filters May 10, 2017

roji mentioned this issue Mar 26, 2022

.Contains method generate different commands if it take its value from variable or directly #27700

Closed

ajcvickers modified the milestones: 2.0.0-preview1, 2.0.0 Oct 15, 2022

Query: Improved performance for String.StartsWith() filters #7429

Query: Improved performance for String.StartsWith() filters #7429

Comments

jemiller0 commented Jan 17, 2017

Further technical details

popcatalin81 commented Jan 18, 2017 • edited Loading

jemiller0 commented Jan 18, 2017

jemiller0 commented Jan 18, 2017

popcatalin81 commented Jan 18, 2017

jemiller0 commented Jan 18, 2017

divega commented Jan 18, 2017 • edited Loading

maumar commented Jan 18, 2017 • edited Loading

rowanmiller commented Jan 31, 2017

jemiller0 commented Jan 31, 2017

rowanmiller commented Feb 1, 2017

jemiller0 commented Feb 6, 2017

jemiller0 commented Feb 6, 2017

roji commented Feb 6, 2017

roji commented Feb 6, 2017

jemiller0 commented Feb 6, 2017

roji commented Feb 6, 2017

jemiller0 commented Feb 7, 2017

ErikEJ commented Feb 7, 2017

jemiller0 commented Feb 7, 2017

jemiller0 commented Feb 8, 2017

divega commented Feb 8, 2017

jemiller0 commented Feb 8, 2017

popcatalin81 commented Feb 8, 2017

jemiller0 commented Feb 8, 2017

divega commented Feb 8, 2017

jemiller0 commented Feb 8, 2017

divega commented Feb 8, 2017

jemiller0 commented Feb 8, 2017

jemiller0 commented Feb 24, 2017

maumar commented Feb 27, 2017

jemiller0 commented Feb 27, 2017

roji commented Feb 27, 2017

maumar commented Feb 27, 2017

jemiller0 commented Feb 27, 2017

maumar commented Feb 27, 2017

jemiller0 commented Feb 27, 2017

maumar commented Apr 10, 2017

simeyla commented Jun 23, 2019

smitpatel commented Jun 23, 2019

popcatalin81 commented Jan 18, 2017 •

edited

Loading

divega commented Jan 18, 2017 •

edited

Loading

maumar commented Jan 18, 2017 •

edited

Loading