2021-09-21

Delphi 10.4 / Delphi 11 Alexandria Breaking Changes

The latest revision of Delphi, named Delphi 11 Alexandria, is out.
A lot of new features, some enhanced platforms. Nice!
But it is also an opportunity for us to come back to some breaking changes, which appeared in Delphi 10.4, and are now "officially" part of Delphi 11.

The main breaking change of Delphi 10.4 and later, as reported by mORMot users, is the new lifetime of local variables.
TL&DR: a hidden local variable - never explicitly declared, but created by the compiler to hold e.g. an interface returned by a function - may now be released as soon as it is no longer used, whereas in the original implementation it was allocated as a regular local variable, and we could expect its lifetime to last until the end of the function. With Delphi 10.4, this is no longer the case: the compiler may release/clear the hidden variable sooner, to reduce the allocation pressure.

The idea behind this change is that it may allow better register allocation within the function, so it "may" theoretically result in faster code. I am not convinced - we will discuss that below.
The main problem is that it could break existing code, because it changes a behavior the Delphi compiler has exhibited for decades.
Some perfectly working code would stop behaving as expected. We have identified several use cases in mORMot which are affected by this change. Since it seems there will be no going back on Delphi's side, it is worth a blog article. ;)

Early Local Variable Release: What and Why

The main entry point of this new "feature" is RSP-30050.
This entry describes the new behavior of the compiler. If you don't have access to the Quality portal, here are the highlights.

For years, if a function returned an interface instance, this instance would remain alive until the end of the current function, the compiler generating the releasing := nil statement at its final end;.
It is a very common way of automatic memory management in Delphi code. A lot of APIs return an interface instance, whose lifetime is managed automatically using proper reference counting. No need for a try ... finally Free block, because the compiler generates it for you.

The fact that the hidden local variable was only released at the end of the function was sometimes relied upon, e.g. to create "auto-free" class features, or to change the mouse pointer on screen during a process.
This RSP is about a change of lifetime: now the instance is released sooner, before the final end; statement of the function.
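
For instance, the classic "wait cursor" pattern relies on this old behavior. Here is a minimal sketch of the idea - TAutoCursor and AutoCursor are hypothetical names for illustration, not part of the VCL or mORMot:

```pascal
uses
  System.UITypes, Vcl.Controls, Vcl.Forms;

type
  // hypothetical helper: restores the previous cursor when released
  TAutoCursor = class(TInterfacedObject)
  private
    fPrevious: TCursor;
  public
    constructor Create(aCursor: TCursor);
    destructor Destroy; override;
  end;

constructor TAutoCursor.Create(aCursor: TCursor);
begin
  fPrevious := Screen.Cursor;
  Screen.Cursor := aCursor;
end;

destructor TAutoCursor.Destroy;
begin
  Screen.Cursor := fPrevious;
  inherited Destroy;
end;

function AutoCursor(aCursor: TCursor): IInterface;
begin
  result := TAutoCursor.Create(aCursor);
end;

procedure DoLengthyWork;
begin
  AutoCursor(crHourGlass); // result kept in a hidden temporary variable
  // ... lengthy process ...
end; // Delphi 10.3 restores the cursor here - 10.4+ may restore it much sooner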

In practice, the compiler changed its behavior when compiling the following code:

procedure Test(anObject: TObject = nil);
begin
  if not Assigned(anObject) then
  begin
    AutoFree(anObject, TMyClass.Create);
    anObject.Init;
  end; // Delphi 10.4 destroys IAutoFree here
  // ... some other code
end; // Delphi 10.3 destroys IAutoFree here

The final reasoning from Embarcadero, in the RSP, is the following:

  • Q/ "should we change our code to use Delphi 10.4 or newer?"
    A/ You should change your code. We have been considering options, but we need a better definition of the lifetime of temporaries (which was undefined in the past) and that's going to be at the most local scope level – like that of an inline variable. This is what most other programming languages do and helps the compiler optimize the generated code.
  • Sync status from internal system, internal issue closed as "Works As Expected" on Jul 26, 2021 with comment: The lifetime of temporaries (which was undefined in the past) is at the most local scope level – like that of an inline variable. This is what most other programming languages do and helps the compiler optimize the generated code.
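
In other words, the lifetime of a temporary now matches the most local scope, just like an inline variable does. A small sketch of the resulting behavior (the condition parameter is only there to make the example self-contained):

```pascal
procedure InlineScopeDemo(condition: boolean);
begin
  if condition then
  begin
    var intf: IInterface := TInterfacedObject.Create;
    // ... use intf ...
  end; // intf is released here, at the end of its own scope
  // a hidden temporary created inside the "if" block now behaves the same way:
  // it is no longer kept alive until the procedure's final "end;"
end;
```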

So this is the "Delphi 11 Alexandria" way of thinking.

Pretty clear, and making sense - at least from the Embarcadero team's point of view.
From the user's point of view, the benefit is not so obvious. Changing perfectly working code, on a huge project, with the risk of changed behavior, random GPFs, exceptions or memory leaks, just to follow "what most other programming languages do" (tm), does not convince me.
They already did it with ARC or RawByteString... and they came back to common sense, after a few years.

Of course, here the impact is much less than with ARC. But it is the very same logic. Why waste our time and money?
My point is that Embarcadero should be more customer focused.

Better Performance?

Theoretically speaking, better local variable management leads to better code.
This is perfectly true for highly optimized compilers like GCC or LLVM, which do wonders when generating code. I urge you to watch Matt Godbolt's great talk "What has my compiler done for me lately?". Fun and instructive for sure.

But the Delphi compiler, even with its LLVM backend, is not at this level of integration. For a regular VCL/FMX application using a database, the generated code is fast enough. They would do better to fix the inlining issues, which sometimes induce real performance problems - just check how functions returning floats are implemented. What is possible with a full LLVM stack is not possible with a Delphi front-end alone, because LLVM is complex and ever-changing, and leveraging its full optimizing power requires a complete compiler stack, from front-end to back-end.

What we did for years with Delphi, to leverage its performance, is to follow some simple rules, like:

  • Make it right, then make it fast;
  • Identify the real bottlenecks using a profiler: don't guess;
  • Avoid unneeded calls;
  • Use tuned libraries;
  • Avoid memory allocation;
  • Avoid copies or reference counting;
  • Avoid hidden try...finally;
  • Better register allocation by using a sub-function for loops.

Check our blog article and the slides and code proposed at Ekon 22.

The last point is what interests us here.
Local variable allocation doesn't make a performance difference in normal code. It only matters within a loop of thousands of iterations. With a very small function, containing only a processing loop and a few input parameters, we ease register allocation, and performance is enhanced. For one-shot simple code, stack variable allocations do not matter much in terms of performance.

In practice, writing a SubCall() dedicated function is the way to go for performance:

procedure SubCall(p: PIntegerArray; n: integer);
var
  i: PtrInt;
begin
  for i := 0 to n - 1 do // here every variable will be in registers
    p[i] := i;
end;

procedure TTestEkon22Performance.BetterRegisterAllocation;
var
  ints: TIntegerDynArray;
  i, j, n: integer;
  timer: TPrecisionTimer;
begin
  SetLength(ints, 50000);
  n := 1000;
  timer.Start;
  for j := 1 to n do
    for i := 0 to high(ints) do // here some variables will be allocated on stack - even when inlining "for var i ..."
       ints[i] := i;
  NotifyTestSpeed('regular loop', length(ints) * n, length(ints) * n * SizeOf(Integer), @timer);
  timer.Start;
  for j := 1 to n do
    SubCall(pointer(ints), length(ints));
  NotifyTestSpeed('dedicated call', length(ints) * n, length(ints) * n * SizeOf(Integer), @timer);
end;

Of course, we may argue that local scoping can increase performance, because initialization/finalization are delayed or bypassed.
This was the point of this good blog article.
But in practice, the benefit is not so obvious: on some platforms, creating nested try..finally blocks for each local variable scope actually slows down the execution, or increases linking time and executable size, because more exception traps have to be generated.
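
As a sketch of why: each scope holding a managed value needs its own finalization, so the compiler may emit one hidden try..finally per scope, instead of a single one per function (BuildReport and Send are hypothetical helpers, used here only for illustration):

```pascal
procedure ScopedFinalization(needsWork: boolean);
begin
  if needsWork then
  begin
    var s: string := BuildReport; // hypothetical helper returning a managed string
    // the compiler emits a hidden try..finally around this inner scope,
    // so that s is finalized at the inner "end;" below
    Send(s); // hypothetical helper
  end; // s released here - one more exception frame than a plain local variable
end;
```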

Inlined variables have undeniable benefits, e.g. within a loop, or to reduce code verbosity when the type is known and complex - as is often the case with generics.
So we will be fair, and read the Grijjy blog article until its conclusion: "Use With Care" and "there are some drawbacks too". No magic bullet.
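
For example, inline variables with type inference do shine with verbose generic types - a small sketch:

```pascal
uses
  System.Generics.Collections;

procedure Iterate(const map: TDictionary<string, TList<Integer>>);
begin
  // without inference, this would need an explicit
  // "var pair: TPair<string, TList<Integer>>;" declaration up front
  for var pair in map do
    writeln(pair.Key, '=', pair.Value.Count);
end;
```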

Show Me The Code

We could certainly find micro-benchmarks where this change makes a difference.
But I don't like micro-benchmarks. Real code does not lie, and from what I have seen, in real production code, there is almost no performance change since Delphi 2009 - only a few percent more or less, depending on the use case. We did observe a noticeable boost in generics code, for sure. But it comes more from RTL optimization and (iterative) rewrites than from compiler improvements. The biggest performance boost was in Delphi 2006/2007, back when inlining was introduced in the compiler. Since then, some kinds of values (like floats) have trouble being inlined. Also, the generated code is sometimes incorrect, or just triggers internal compiler errors - just ask any library maintainer using Delphi generics and inlining...

About code, the main argument is that proper coding requires small functions. This has been true since the early days of programming.
If you have dozens of local variables, and hundreds of lines of code within a function, it is really time to refactor it into a class or a record. Don't expect the compiler to make your code any faster or more maintainable.

So I doubt changing the local variable lifetime will make any difference for Delphi end-users.
I would be happy to be proven wrong - but don't show me micro-benchmarks, they are pointless. The mORMot regression tests, for instance, are more convincing. They tend to be slower year after year due to Windows itself (background tasks like antivirus, slow NTFS, new CPU security mitigations...). For raw data processing, when the OS is not involved, they tend to give almost the same timing since Delphi 2007. The fastest execution is on FPC + Linux, mostly thanks to the OS itself - and slightly to FPC's better inlining abilities.

In Practice, For mORMot Users

As a consequence, the behavior of some widely used mORMot features changed with the 2021 Delphi releases:

  • TSynLog.Enter and automatic Leave generation in the logs;
  • TAutoFree and automatic memory management of classes;
  • _Safe() returning a PDocVariantData on a temporary variant.

We discussed this in our forum here and here.

About TSynLog.Enter and TAutoFree, this was already the case with FPC. So for cross-compiler code, you should already use an explicit local variable, or an explicit with statement.
So I am fine with that.
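
In practice, the cross-compiler pattern is to capture the ISynLog instance in an explicit local variable, so its lifetime is unambiguous on all compilers (TMyService is a placeholder class name):

```pascal
procedure TMyService.DoWork;
var
  log: ISynLog; // explicit local: released at the final "end;" on all compilers
begin
  log := TSynLog.Enter(self, 'DoWork');
  // ... process: the matching "Leave" event is logged when log goes out of scope
end;
```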

My concern is about _Safe() used on a temporary variant. It works fine on FPC, but is broken on the latest Delphi. So we have introduced _DV(), which returns a TDocVariantData and not a PDocVariantData: slower, but safer. For me, it is a regression, and it should be fixed. We will see what happens on Embarcadero's side.
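
A minimal sketch of the safer pattern - _DV() makes a by-value copy of the document, so there is no pointer into a temporary which may already have been released:

```pascal
uses
  mormot.core.variants;

procedure UseDocVariant(const v: variant);
var
  d: TDocVariantData;
begin
  d := _DV(v); // by-value copy: slower than _Safe()^, but no dangling pointer
  writeln(d.ToJson);
end;
```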

To circumvent these issues, Eugene suggested that we may use Custom Managed Records as a replacement on newer versions of Delphi. But I am not sure we have any guarantee that they are not affected either.

Feedback is welcome on our forum, as usual!

2021-08-17

mORMot 2 on Ampere AArch64 CPU

In recent weeks, we have extended mORMot support to one of the most powerful AArch64 CPUs available: the Ampere Altra CPU, as made available on the Oracle Cloud Infrastructure.

Long story short, this is amazing hardware to run on the server side, with performance close to what Intel/AMD offer, but with almost linear multi-core scalability. The FPC compiler generates good code for it, and our mORMot 2 library is able to use the hardware-accelerated opcodes for AES, SHA-2, and crc32/crc32c.

Continue reading

2021-07-08

Job Offer: FPC mORMot 2 and WAPT

Good news!
The French company I work for, Tranquil IT, is hiring FPC / Lazarus / mORMot developers. Remote work possible.

I share below the job offer from my boss Vincent.
We look forward to working with you on this great mORMot-powered project!

https://www.tranquil.it/en/who-are-we/join-us

Continue reading

2021-06-26

Embed Small and Optimized Debug Information for FPC

Debug information can be generated by compilers, to contain symbols and source code lines. This is very handy to have a meaningful stack trace on any problems like exceptions, at runtime.

The problem is that debug information can be huge. The new code style with generics tends to bloat this size even further...
On Delphi, mormot2tests generates a 4MB .map file;
on FPC, mormot2tests outputs a 20MB .dbg file in DWARF format.

For Delphi, we propose our own binary .mab format which reduces this 4MB .map file into a 290KB .mab file since mORMot 1.
Now mORMot 2 can reduce a FPC debug file of 20MB into a 322KB .mab file!
And this .mab information can just be appended to the executable for single-file distribution, if needed, during the build. No need to distribute two files, potentially with synchronization issues.

Continue reading

2021-05-14

Enhanced HTTP/HTTPS Support in mORMot 2

HTTP(S) is the main protocol of the Internet.
We enhanced the mORMot 2 socket client to push its implementation into more use cases. The main new feature is perhaps WGET-like processing, with hashing, resuming, console feedback, and direct file download.

Continue reading

2021-05-08

Enhanced Faster ZIP Support in mORMot 2

The .zip format is from the last century, back to the early DOS days, but can still be found everywhere. It is even hidden when you open a .docx document, a .jar application, or any Android app!
It is therefore (ab)used not only as an archive format, but as an application file format / container - even if, in this respect, using SQLite3 may make much more sense.

We recently enhanced our mormot.core.zip.pas unit:

  • to support Zip64,
  • with enhanced .zip read/write,
  • to have a huge performance boost during its process,
  • and to integrate better with signed executables.

Continue reading

2021-02-22

OpenSSL 1.1.1 Support for mORMot 2

Why OpenSSL? OpenSSL is the reference library for cryptography and secure TLS/HTTPS communication. It is part of most Linux/BSD systems, and covers a lot of use cases and algorithms. Even if it had some vulnerabilities in the past, it has been audited and validated for business use. Some algorithms […]

Continue reading

2021-02-13

Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementation

Last week, I committed new ASM implementations of our AES-PRNG, AES-CTR and AES-GCM for mORMot 2.
They handle eight 128-bit blocks at once in an interleaved fashion, as permitted by the CTR chaining mode. The aes-ni opcodes (aesenc, aesenclast) are used for the AES rounds, and the GMAC of the AES-GCM mode is computed using the pclmulqdq opcode.

Resulting performance is amazing: on my simple Core i3, I reach 2.6 GB/s for aes-128-ctr, and 1.5 GB/s for aes-128-gcm for instance - the first being actually faster than OpenSSL!

Continue reading
