Open-XML-SDK consider providing w14:paraId and w14:textId generators

Description

w:p and w:tr elements can have the w14:paraId and w14:textId attributes, which are defined in MS-DOCX as ST_LongHexNumber values that are unique within the document part as well as greater than 0 and less than 0x80000000.

Microsoft Word uses a random number generator to generate the values (although that is not a requirement).

At the moment, the Paragraph (w:p) and TableRow (w:tr) classes do not generate values for the ParagraphId (w14:paraId) and TextId (w14:textId) attributes. There are also no utility classes or methods for producing compliant values.

Therefore, the question is whether we want to offer any functionality for generating or validating those attribute values.
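To make the validation side concrete, here is a minimal sketch of a check for the constraints described above (eight hexadecimal digits, greater than 0, less than 0x80000000). The helper name IsValidLongHexId is hypothetical and not part of the Open XML SDK; the sketch also ignores the uniqueness requirement, which needs knowledge of the whole document part.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of the Open XML SDK.
static bool IsValidLongHexId(string value)
{
    // ST_LongHexNumber is four bytes, i.e., exactly eight hexadecimal digits.
    if (value == null || value.Length != 8)
    {
        return false;
    }

    // AllowHexSpecifier accepts hexadecimal digits only (no sign, no whitespace, no "0x" prefix).
    if (!uint.TryParse(value, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out uint number))
    {
        return false;
    }

    // MS-DOCX range restriction for w14:paraId and w14:textId.
    return number > 0 && number < 0x80000000;
}

Console.WriteLine(IsValidLongHexId("7FFFFFFF")); // True
Console.WriteLine(IsValidLongHexId("80000000")); // False
```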

Providing utility methods would be very straightforward. For example, here is the code (taken from two classes) that I am using in my codebase for creating random ST_LongHexNumber values (optionally making sure they are less than 0x80000000 while always guaranteeing they are greater than 0):

private static readonly RNGCryptoServiceProvider Generator = new RNGCryptoServiceProvider();

/// <summary>
/// Creates an ST_LongHexNumber value, masking the most significant byte with
/// the given <paramref name="msbMask" />.
/// </summary>
/// <param name="msbMask">The most significant byte mask.</param>
public static string CreateRandomLongHexNumber(byte msbMask = 0xff)
{
    // Create a four-byte random number, noting that the first byte (data[0])
    // will become the most significant byte in the string value created by
    // the ToHexString() method.
    var data = new byte[4];
    Generator.GetNonZeroBytes(data);
    data[0] &= msbMask;

    return data.ToHexString();
}

/// <summary>
/// Converts the given value into a hexadecimal string, with the first
/// byte in the list being the most significant byte in the resulting
/// string.
/// </summary>
/// <param name="source">The list of bytes to be converted.</param>
/// <returns>A hexadecimal string.</returns>
public static string ToHexString(this IReadOnlyList<byte> source)
{
    var dest = new char[source.Count * 2];

    var i = 0;
    var j = 0;

    while (i < source.Count)
    {
        byte b = source[i++];
        dest[j++] = ToCharUpper(b >> 4);
        dest[j++] = ToCharUpper(b);
    }

    return new string(dest);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static char ToCharUpper(int value)
{
    value &= 0xF;
    value += '0';

    if (value > '9')
    {
        value += ('A' - ('9' + 1));
    }

    return (char) value;
}

A first step could be to provide utility or extension methods without changing the Paragraph and TableRow classes.

A second, optional step could be to enhance the Paragraph and TableRow classes by adding instance methods like the following (noting that I have not put much thought into this yet and using the Paragraph class as an example):

// Normal setter methods.
public void SetRandomParagraphId();
public void SetRandomTextId();

// Methods that would be handy for pure functional transformation scenarios.
// Like the With() method we added earlier.
public Paragraph WithNewRandomParagraphId();
public Paragraph WithNewRandomTextId();

Information

  • .NET Target: all
  • DocumentFormat.OpenXml Version: latest
Asked Apr 18 '22 12:04
ThomasBarnekow

14 Answer:

I'm not sure what else this request is trying to accomplish, but I found this to be an easier way of generating paraId/TextId values:

Random rnd = new();
int idNum = rnd.Next(1, int.MaxValue);
string newId = idNum.ToString("X8");

or if you want to just increment the values instead of pulling them at random to make sure there are no collisions:

int paraId = 0x10000000;
string newId = (paraId++).ToString("X8");

The X8 format specifier just prints the integer as an eight-character hexadecimal string. It's been working fine for me with these paraId/TextId values.

Hope this helps

Answered Jul 21 '21 at 20:19
rmboggs

@rmboggs, thanks for your comment.

There are two things that this tries to accomplish. One, this is about providing some way of generating those values so that users don't have to "roll their own" (unless they have specific requirements not fulfilled by the chosen way, whatever that might be). Two, this is about generating numbers that meet certain quality criteria.

While the first objective might be fulfilled by either of your two methods, the second objective is definitely not fulfilled by the second one (the consecutive numbers) and, with a non-zero probability, also not by the first one (using the Random class). At least in my use cases, which include document comparisons, unrelated paragraphs should have different w14:paraId values, for example. Therefore, I have chosen the RNGCryptoServiceProvider for my use case. It is also recommended in the documentation of the Random class:

Pseudo-random numbers are chosen with equal probability from a finite set of numbers. The chosen numbers are not completely random because a mathematical algorithm is used to select them, but they are sufficiently random for practical purposes. The current implementation of the Random class is based on a modified version of Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, third edition, 1997. To generate a cryptographically secure random number, such as one that's suitable for creating a random password, use the RNGCryptoServiceProvider class or derive a class from System.Security.Cryptography.RandomNumberGenerator.

The following unit test demonstrates how, when targeting .NET Framework, your first method might produce identical sequences or at least a significant number of collisions:

[Fact]
public void TestRandomNumberGenerator()
{
    const int n = 100;
    const int m = 100;

    var hexStringSet = new HashSet<string>();
    var hexStringSequences = new List<List<string>>();

    for (var i = 0; i < n; i++)
    {
        var rnd = new Random();

        var hexStringSequence = new List<string>();
        hexStringSequences.Add(hexStringSequence);

        for (var j = 0; j < m; j++)
        {
            int randomNumber = rnd.Next(1, int.MaxValue);
            var hexString = randomNumber.ToString("X8");

            hexStringSet.Add(hexString);
            hexStringSequence.Add(hexString);
        }
    }

    Assert.Equal(m, hexStringSet.Count);

    for (var j = 1; j < n; j++)
    {
        Assert.Equal(hexStringSequences[0], hexStringSequences[j]);
    }
}

The test generates 100 sequences of 100 values each and shows that (at least almost always) a total of only 100 different values are generated and all sequences are equal. If you introduce a delay (e.g., Thread.Sleep(1)) before creating the Random instances, the sequences will no longer be equal. However, you will likely see that fewer than 100 x 100 = 10,000 different values are generated, meaning there are collisions (I observed up to 200 in my tests). Therefore, the quality criteria for my use case, at least, would not be met.
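The underlying mechanism can be shown in isolation. On .NET Framework, the parameterless Random constructor derives its seed from the system clock, so instances created within the same clock tick effectively share a seed. The following minimal sketch makes that explicit by passing the same seed to two instances; the seed value 12345 is arbitrary and chosen only for illustration.

```csharp
using System;
using System.Linq;

// Two instances with the same explicit seed. The seed 12345 is arbitrary;
// on .NET Framework, "new Random()" picks a time-dependent seed instead,
// which is identical for instances created in quick succession.
var first = new Random(12345);
var second = new Random(12345);

string[] firstIds = Enumerable.Range(0, 5)
    .Select(_ => first.Next(1, int.MaxValue).ToString("X8"))
    .ToArray();

string[] secondIds = Enumerable.Range(0, 5)
    .Select(_ => second.Next(1, int.MaxValue).ToString("X8"))
    .ToArray();

// Both instances produce exactly the same "random" identifiers.
Console.WriteLine(firstIds.SequenceEqual(secondIds)); // True
```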

Answered Jul 21 '21 at 22:38
ThomasBarnekow

The one thing that makes me hesitate right now is conflicts. Consumers of the SDK can emit paraId/TextId values that are not necessarily random, as Word's are, as long as they conform to the documented boundaries and rules, including that they should not conflict with other paraIds (document-wide uniqueness). Generating IDs is quick, as @ThomasBarnekow and @rmboggs showed; checking for conflicts, however, may not be. The main problem would be the time and processing needed during construction to check a whole document for conflicts. For large documents, that could be undesirable. And although Word doesn't fail when opening documents with conflicting IDs, 1) the IDs will likely be replaced on save and 2) we don't know whether there will be negative side effects before they are replaced. So if we can show that adding unique, non-conflicting paraIds can be done in a performant way, I would be more likely to agree. Having said that, Word does add these while checking for conflicts, but it's a large application working in memory with binary representations of lots of collections, which is likely more efficient.

Answered Jul 21 '21 at 22:41
tomjebo

@tomjebo, collisions are also my biggest concern. If we used a method that produces collisions with a very low and acceptable probability, I would not see an issue. One question is whether we could use the same mechanism that is also used by Word itself. Otherwise, based on my testing, it looks like the RNGCryptoServiceProvider class does a decent job.

Answered Jul 21 '21 at 22:50
ThomasBarnekow

@ThomasBarnekow I'm actually not referring to the random number generator collision production but colliding with existing paraId's. These may have been produced by other apps or Word. But if you have a proposed solution, then we can do some testing on various documents to prove that it doesn't impact the API when building or adding to documents.

Answered Jul 21 '21 at 22:57
tomjebo

I have provided the solution that I use in the issue description. Note, however, that it does not guarantee the absence of collisions with pre-existing paraIds. That could be added as an optional feature, e.g., by retrieving all existing paraIds, storing them in a HashSet<string>, and ensuring that newly generated paraIds are not contained in the set. Lookups in a HashSet<string> should be O(1), so the main cost would be the retrieval of the paraIds. It might be possible to do that elegantly while reading the DOM tree, without creating an unacceptable overhead.
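As a rough sketch of the HashSet<string> idea, assuming the existing paraIds have already been collected from the document (the two values below are made up, and CreateUniqueId is a hypothetical name rather than an SDK method):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Identifiers assumed to exist in the document already (made-up values).
// Hex strings are compared case-insensitively so "0a1b2c3d" and "0A1B2C3D" collide.
var usedIds = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "12345678", "0A1B2C3D" };

// Hypothetical helper for illustration; not part of the Open XML SDK.
string CreateUniqueId()
{
    string id;

    do
    {
        // GetInt32 returns a value in [1, int.MaxValue), which is greater
        // than 0 and less than 0x80000000 as required.
        id = RandomNumberGenerator.GetInt32(1, int.MaxValue).ToString("X8");
    }
    while (!usedIds.Add(id)); // Add returns false if the value is already registered.

    return id;
}

string newId = CreateUniqueId();
Console.WriteLine(newId);
```

Because HashSet<string>.Add both checks for and registers the value, each generated identifier is automatically excluded from later calls. Values assigned manually elsewhere would still need to be added to the set by the caller.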

Answered Jul 21 '21 at 23:21
ThomasBarnekow

Just a follow up to this: I had a discussion with the Word team and we definitely MUST check for conflicts with existing paraIds in the document. There are at least two areas in which this will have bad side effects if we don't. So either we include conflict checking when we generate or we don't generate.

Answered Jul 22 '21 at 18:11
tomjebo

This looks like a great feature. Thanks for the discussion!

A couple of thoughts:

  1. What would the proposed API changes be to the SDK?
  2. How will it be configurable (i.e. maybe opting out of uniqueness check if performance is getting a hit)?

This sounds like a document-specific service. I had been thinking at one point about exposing an IServiceProvider off of the document class to get relevant document services. Just a thought.

Answered Jul 23 '21 at 00:18
twsouthwick

Regarding "(i.e. maybe opting out of uniqueness check if performance is getting a hit)?": I really think we can't allow bypassing the conflict check. If we emit a document with a conflict (the best-case probability is about 1 in 2 billion, but for real documents it's likely much worse), the SDK has created a corrupt (and non-conformant) document.

Answered Jul 23 '21 at 19:58
tomjebo

@tomjebo and @twsouthwick, I will open a PR to provide an example of the potential API and a unit test that demonstrates how this works. I have only amended the Paragraph API for demonstration purposes (meaning we would enhance the TableRow API in the same fashion once we agree on the API).

I have played with documents that ultimately contained 100,000 paragraphs (which is a lot, as I will argue below). When creating 100,000 w14:paraId values, I observed on the order of 2 to 3 potential collisions (with a maximum of 7 in one case), with the first one typically occurring after at least 30,000 values had been generated.

To determine potential numbers of paragraphs in the largest Word documents, I did some research on:

  1. word counts of the most popular books in the world and
  2. word counts of average paragraphs (see wordcounter.net and quora.com).

Based on the source in (1), average novels are between 80,000 and 100,000 words. Epic novels are above 110,000 words, with examples including War and Peace (561,304 words, 1,225 pages), Gone with the Wind (418,053 words), and Moby Dick (206,052 words).

Regarding the word counts of average paragraphs, the article on wordcounter.net says that "in academic writing, paragraphs will usually consist of the 'standard' 100 – 200 words" but also mentions that 200 is long (but often and easily beaten by the US lawyers I've been working with). The article on quora.com mentions that "there have been many readability studies done over the decades and most point to 70–120 words being the most usual paragraph length."

If you wrote War and Peace in a single Word document, using approximately 70 to 120 words per paragraph, we'd be looking at approximately 5,000 to 8,000 paragraphs (which means the 100,000 paragraphs used in my testing correspond to roughly 12 to 20 times War and Peace). Typical contract documents are (well) below 1,000 paragraphs.

With that, you could generate w14:paraId values for War and Peace at least three to six times before seeing the first duplicate w14:paraId value. Therefore, using a secure random number generator, the probability of having conflicts for any given practical document would be pretty low (if we did not check for conflicts). Adding uniqueness checks, we could certainly guarantee uniqueness.
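These observations are in line with a simple birthday-problem estimate. Assuming values are drawn uniformly from the 0x7FFFFFFF admissible values, the expected number of colliding pairs among n values is approximately n(n - 1) / (2 x 0x7FFFFFFF), which is about 2.3 for n = 100,000. The following sketch computes that estimate; the function names are made up for illustration, and the uniformity assumption is an idealization.

```csharp
using System;

// Number of admissible values: 0x00000001 through 0x7FFFFFFF.
const double valueCount = 0x7FFFFFFF;

// Expected number of colliding pairs among n uniformly drawn values.
double ExpectedCollidingPairs(double n) => n * (n - 1) / (2 * valueCount);

// Approximate probability of seeing at least one collision (Poisson approximation).
double ProbabilityOfAnyCollision(double n) => 1 - Math.Exp(-ExpectedCollidingPairs(n));

// Paragraph counts discussed above: a typical contract, War and Peace, and the test document.
foreach (int n in new[] { 1_000, 8_000, 100_000 })
{
    Console.WriteLine(
        $"{n,7} values: expected colliding pairs = {ExpectedCollidingPairs(n):F4}, " +
        $"P(at least one collision) = {ProbabilityOfAnyCollision(n):P2}");
}
```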

Answered Jul 24 '21 at 11:26
ThomasBarnekow

@twsouthwick, as stated in my comment on PR https://github.com/issues/OfficeDev/Open-XML-SDK/997 after merging, I don't think there is anything missing in terms of public API or features. What was merged is one consistent way to do it, which does not alter the public API of the strongly-typed classes (e.g., Paragraph, TableRow) and other framework classes (e.g., WordprocessingDocument). In a previous state, this PR would have offered a public API that did not require any explicit creation of a service, or generator. While this could have been easier to use, it did require enhancements of the public API.

Let's quickly compare the two ways, looking at what was merged first:

using WordprocessingDocument wordDocument = GetExistingWordDocument();

// Create the generator explicitly. When passing the wordDocument instance as a parameter,
// the constructor registers all w14:paraId values already existing in the wordDocument.
var generator = new RandomParagraphIdGenerator(wordDocument);

// Assign w14:paraId to existing w:p.
var paragraph = GetExistingParagraph(wordDocument);
paragraph.ParagraphId = generator.CreateUniqueParagraphId();

// Assign manually created w14:paraId to another existing w:p.
// The generator DOES NOT know about such values and CANNOT guarantee uniqueness of generated values.
var otherParagraph = GetOtherExistingParagraph(wordDocument);
otherParagraph.ParagraphId = "12345678";

// Assign w14:paraId to newly created w:p.
// The assigned w14:paraId value:
// - IS guaranteed to be different from the ones automatically generated by the same generator instance before and
// - IS NOT guaranteed to be different from the ones manually generated and assigned before.
var newParagraph = new Paragraph(new Run(new Text("Hello World!")));
newParagraph.ParagraphId = generator.CreateUniqueParagraphId();

Here's what it would have looked like in a previous state, where the WordprocessingDocument automatically instantiated a RandomParagraphIdGenerator instance, implemented the IParagraphIdGenerator (or IParagraphIdService) interface, and delegated calls to interface methods to its RandomParagraphIdGenerator instance:

using WordprocessingDocument wordDocument = GetExistingWordDocument();

// Assign w14:paraId to existing w:p.
var paragraph = GetExistingParagraph(wordDocument);
paragraph.SetUniqueParagraphId();

// Assign manually created w14:paraId to another existing w:p.
// The generator DOES know about such values and CAN guarantee uniqueness of generated values.
var otherParagraph = GetOtherExistingParagraph(wordDocument);
otherParagraph.ParagraphId = "12345678";

// Assign w14:paraId to newly created w:p.
// The assigned w14:paraId value:
// - IS guaranteed to be different from the ones automatically generated before and
// - IS guaranteed to be different from the ones manually generated and assigned before.
var newParagraph = new Paragraph(new Run(new Text("Hello World!")));
newParagraph.SetUniqueParagraphId(wordDocument);

In the second form, the feature contained more "magic", e.g., to register all w14:paraId values assigned to the ParagraphId property. To do that, the Paragraph and TableRow instances must have access to the WordprocessingDocument instance's generator (or the WordprocessingDocument itself, if it implements the generator/service).

Answered Sep 07 '21 at 11:47
ThomasBarnekow

Thanks for the synopsis! I'd like to get the "ideal" behavior in if possible. This requires a concept we don't currently have in the SDK: a way for services to be provided in a document. This is something I've wanted to add for a while, to remove the singletons/cached instances we have and make things a bit more extensible. I opened a design proposal in https://github.com/issues/OfficeDev/Open-XML-SDK/1018 and used this feature as an example of what the API changes would look like.

Answered Sep 07 '21 at 17:09
twsouthwick

@twsouthwick, can this be closed? You merged the original PR https://github.com/issues/OfficeDev/Open-XML-SDK/997 and then turned it into a Feature.

Answered Dec 05 '21 at 16:51
ThomasBarnekow

https://github.com/issues/OfficeDev/Open-XML-SDK/1018

Answered Apr 18 '22 at 12:04
Gabriel Davis